Troubleshooting

Check liveness, then readiness, then stats, and only then move into DNS, audit, wiretap, or cluster-specific detail.

Troubleshooting is fastest when you follow the same order the runtime depends on.

First Checks

Start here:

  • GET /health
  • GET /ready
  • GET /metrics
  • GET /api/v1/stats

Interpretation:

  • /health tells you whether the management process is up
  • /ready tells you whether the node is actually able to operate
  • /metrics gives detailed counters
  • /api/v1/stats gives a compact runtime summary
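The first-checks sequence above can be sketched as a tiny helper. The endpoint paths come from this guide, but the idea that each returns a plain HTTP status is an assumption, so the fetch function is injectable and the example runs against a simulated node rather than a real one:

```python
# Order matters: liveness first, then readiness, then the detail surfaces.
FIRST_CHECKS = ["/health", "/ready", "/metrics", "/api/v1/stats"]

def first_failing_check(fetch):
    """Return the first endpoint that does not answer 200, or None if all pass."""
    for path in FIRST_CHECKS:
        if fetch(path) != 200:
            return path
    return None

# Simulated node: management process is up, but readiness is failing.
statuses = {"/health": 200, "/ready": 503, "/metrics": 200, "/api/v1/stats": 200}
print(first_failing_check(statuses.get))  # → /ready
```

Stopping at the first failure keeps you from reading detailed stats on a node that is not even ready.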

Suggested Triage Order

For most incidents:

  1. check /health
  2. check /ready
  3. inspect the exact failing readiness checks
  4. inspect /api/v1/stats
  5. use DNS, audit, or wiretap surfaces depending on the symptom
  6. collect a sysdump if the issue is broad or cluster-wide
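The triage order can be written down as a small decision helper. How /health and /ready report their state (booleans plus a list of failing check names) is an assumption made for illustration; only the ordering itself comes from this guide:

```python
def next_triage_step(health_ok, ready_ok, failing_checks=()):
    """Pick the next triage action from the liveness/readiness state."""
    if not health_ok:
        return "management process is down: inspect the process itself"
    if not ready_ok:
        return "inspect failing readiness checks: " + ", ".join(failing_checks)
    return "node is ready: inspect /api/v1/stats, then symptom-specific surfaces"

print(next_triage_step(True, False, ["cluster", "policy_replication"]))
# → inspect failing readiness checks: cluster, policy_replication
```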

If /ready Fails

Focus on the failing readiness key:

  • dataplane_running: dataplane thread or shutdown problem
  • dataplane_config: dataplane bootstrap or addressing problem
  • policy_ready: active policy not available or not yet replayed
  • dns_allowlist: DNS runtime not ready
  • service_plane: TLS intercept or other service-runtime issue
  • cluster: leader or cluster-health issue
  • policy_replication: cluster state not replayed locally yet
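The readiness keys above lend themselves to a lookup table. The key names and their meanings come from this guide; the assumption here is that /ready returns a JSON object mapping each check name to a boolean:

```python
READINESS_HINTS = {
    "dataplane_running":  "dataplane thread or shutdown problem",
    "dataplane_config":   "dataplane bootstrap or addressing problem",
    "policy_ready":       "active policy not available or not yet replayed",
    "dns_allowlist":      "DNS runtime not ready",
    "service_plane":      "TLS intercept or other service-runtime issue",
    "cluster":            "leader or cluster-health issue",
    "policy_replication": "cluster state not replayed locally yet",
}

def explain_ready_failure(ready_body):
    """Map each failing readiness check to its likely subsystem."""
    return {key: READINESS_HINTS.get(key, "unknown check")
            for key, ok in ready_body.items() if not ok}

print(explain_ready_failure({"dataplane_running": True, "cluster": False}))
# → {'cluster': 'leader or cluster-health issue'}
```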

If Traffic Is Being Blocked

Use the symptom to choose the next surface:

  • hostname policy problem: GET /api/v1/dns-cache
  • repeated denies: GET /api/v1/audit/findings
  • live confirmation needed: GET /api/v1/wiretap/stream

That split matters because a DNS-layer denial and a dataplane denial surface in different places for the operator, even though both end the same way for the user: traffic does not flow.

If audit or wiretap returns 503, check performance mode before digging deeper into the traffic path.
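The symptom-to-surface split, including the 503 caveat, can be sketched as follows. The symptom labels are shorthand invented for this example; the endpoints and the performance-mode caveat come from this guide:

```python
NEXT_SURFACE = {
    "hostname_policy":   "GET /api/v1/dns-cache",
    "repeated_denies":   "GET /api/v1/audit/findings",
    "live_confirmation": "GET /api/v1/wiretap/stream",
}

def next_surface(symptom, last_status=None):
    """Choose the next surface to inspect for a traffic-blocked symptom."""
    # A 503 from audit or wiretap means: check performance mode first.
    if last_status == 503:
        return "check performance mode before digging into the traffic path"
    return NEXT_SURFACE.get(symptom, "collect a sysdump")

print(next_surface("hostname_policy"))  # → GET /api/v1/dns-cache
```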

If The Problem Is Cluster-Wide

In clustered environments, verify:

  • whether a leader is known
  • whether followers are caught up
  • whether policy replication is ready everywhere

If the problem is still unclear, collect:

POST /api/v1/support/sysdump/cluster
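The cluster checklist above can be expressed as one decision helper. The boolean inputs are assumptions about what cluster status reports; the check order and the sysdump endpoint come from this guide:

```python
def cluster_next_step(leader_known, followers_caught_up, replication_ready):
    """Walk the cluster-wide checklist; fall through to a sysdump."""
    if not leader_known:
        return "no leader known: investigate leader election"
    if not followers_caught_up:
        return "followers lagging: investigate follower catch-up"
    if not replication_ready:
        return "policy replication not ready everywhere"
    # All checks pass but the problem persists: collect a cluster sysdump.
    return "POST /api/v1/support/sysdump/cluster"

print(cluster_next_step(True, True, True))
# → POST /api/v1/support/sysdump/cluster
```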

Practical Rule

Do not start incident response with raw logs unless the process is failing to start. Readiness and stats usually tell you which subsystem is broken before logs tell you why.