Troubleshooting
Check liveness, then readiness, then stats, and only then move into DNS, audit, wiretap, or cluster-specific detail.
Troubleshooting is fastest when you follow the same order the runtime depends on.
First Checks
Start here:
GET /health
GET /ready
GET /metrics
GET /api/v1/stats
Interpretation:
- /health tells you whether the management process is up
- /ready tells you whether the node is actually able to operate
- /metrics gives detailed counters
- /api/v1/stats gives a compact runtime summary
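The four first checks can be scripted so they always run in the same order. The following is a minimal sketch, not an official client: the base URL and port are hypothetical, no authentication is assumed, and only the HTTP status code is recorded because the response bodies are not described here.

```python
import urllib.error
import urllib.request

# Endpoint paths are taken from the list above, in the order given.
FIRST_CHECKS = ["/health", "/ready", "/metrics", "/api/v1/stats"]


def fetch_status(url, timeout=5.0):
    """Return the HTTP status code for a GET, or None if nothing answered."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # Non-2xx answers (for example 503) still carry a useful status code.
        return err.code
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, or timeout.
        return None


def run_first_checks(base_url, fetch=fetch_status):
    """Probe each endpoint in order and return a list of (path, status) pairs."""
    base = base_url.rstrip("/")
    return [(path, fetch(base + path)) for path in FIRST_CHECKS]
```

Calling `run_first_checks("http://node.example:8080")` (a placeholder address) returns one pair per endpoint. The `fetch` parameter exists only so the probe can be swapped out in tests.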
Suggested Triage Order
For most incidents:
- check /health
- check /ready
- inspect the exact failing readiness checks
- inspect /api/v1/stats
- use DNS, audit, or wiretap surfaces depending on the symptom
- collect a sysdump if the issue is broad or cluster-wide
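The order above can be read as a small decision procedure. The sketch below is one interpretation, under the assumption that a failed /health check ends API-based triage (the remaining steps all depend on the management process answering). The function name and its boolean inputs are hypothetical.

```python
def triage_plan(health_ok, ready_ok, broad_issue=False):
    """Return the remaining triage steps, in order, given the first two results."""
    if not health_ok:
        # Assumption: with the management process down, the later API steps
        # cannot be run, so the plan stops here.
        return ["investigate why the management process is down"]
    steps = []
    if not ready_ok:
        steps.append("inspect the exact failing readiness checks")
    steps.append("inspect /api/v1/stats")
    steps.append("use DNS, audit, or wiretap surfaces depending on the symptom")
    if broad_issue:
        steps.append("collect a sysdump")
    return steps
```

For a node that is up but not ready, `triage_plan(True, False)` starts with the failing readiness checks and then continues to stats and the symptom-specific surfaces.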
If /ready Fails
Focus on the failing readiness key:
- dataplane_running: dataplane thread or shutdown problem
- dataplane_config: dataplane bootstrap or addressing problem
- policy_ready: active policy not available or not yet replayed
- dns_allowlist: DNS runtime not ready
- service_plane: TLS intercept or other service-runtime issue
- cluster: leader or cluster-health issue
- policy_replication: cluster state not replayed locally yet
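The key-to-cause list above works well as a lookup table during an incident. This sketch assumes the readiness results have already been parsed into a dictionary of key to boolean (True meaning the check passes). The actual /ready response format is not specified here, so that shape is an assumption and the helper name is hypothetical.

```python
# Hints copied from the readiness key list above.
READINESS_HINTS = {
    "dataplane_running": "dataplane thread or shutdown problem",
    "dataplane_config": "dataplane bootstrap or addressing problem",
    "policy_ready": "active policy not available or not yet replayed",
    "dns_allowlist": "DNS runtime not ready",
    "service_plane": "TLS intercept or other service-runtime issue",
    "cluster": "leader or cluster-health issue",
    "policy_replication": "cluster state not replayed locally yet",
}


def explain_failures(checks):
    """Return (key, hint) pairs for every readiness check that is not passing.

    checks: dict mapping readiness key -> bool (assumed shape, see lead-in).
    """
    return [
        (key, READINESS_HINTS.get(key, "unknown readiness key"))
        for key, ok in checks.items()
        if not ok
    ]
```

Keys that are not in the table are reported as unknown rather than dropped, so a new or misspelled check name stays visible.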
If Traffic Is Being Blocked
Use the symptom to choose the next surface:
- hostname policy problem: GET /api/v1/dns-cache
- repeated denies: GET /api/v1/audit/findings
- live confirmation needed: GET /api/v1/wiretap/stream
That split matters because a DNS denial and a dataplane denial look different to the operator even though both end with traffic not flowing.
If audit or wiretap returns 503, check performance mode before digging deeper into the traffic path.
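The symptom-to-surface mapping and the 503 rule can be captured in a few lines. This is an illustrative sketch: the symptom labels and function names are hypothetical, and treating 200 as "available" is an assumption about the API rather than documented behavior.

```python
# Symptom labels are hypothetical; the paths come from the list above.
SURFACE_BY_SYMPTOM = {
    "hostname_policy": "/api/v1/dns-cache",
    "repeated_denies": "/api/v1/audit/findings",
    "live_confirmation": "/api/v1/wiretap/stream",
}

# A 503 from audit or wiretap points at performance mode first.
PERFORMANCE_MODE_SURFACES = {"/api/v1/audit/findings", "/api/v1/wiretap/stream"}


def surface_for(symptom):
    """Return the endpoint path to query for a given symptom label."""
    return SURFACE_BY_SYMPTOM[symptom]


def interpret_status(path, status):
    """Turn an HTTP status from a surface into a next action."""
    if status == 503 and path in PERFORMANCE_MODE_SURFACES:
        return "check performance mode first"
    if status == 200:
        return "surface available"
    return "unexpected status {}".format(status)
```

For example, `interpret_status(surface_for("repeated_denies"), 503)` recommends checking performance mode, while the same status from the DNS cache surface is only flagged as unexpected.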
If The Problem Is Cluster-Wide
In clustered environments, verify:
- whether a leader is known
- whether followers are caught up
- whether policy replication is ready everywhere
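One way to run these checks across nodes is to compare the cluster-related readiness keys from every member. The sketch below assumes readiness results have been gathered per node into nested dictionaries (node name to readiness key to boolean). That shape, the node names, and the helper name are hypothetical. It covers the `cluster` and `policy_replication` keys only and does not measure follower lag.

```python
# Readiness keys from this page that relate to cluster state.
CLUSTER_KEYS = ("cluster", "policy_replication")


def nodes_failing_cluster_checks(readiness_by_node):
    """Return {node: [failing cluster-related keys]} for nodes with problems.

    readiness_by_node: dict node name -> dict readiness key -> bool.
    A key that is missing for a node is treated as failing.
    """
    failing = {}
    for node, checks in readiness_by_node.items():
        bad = [key for key in CLUSTER_KEYS if not checks.get(key, False)]
        if bad:
            failing[node] = bad
    return failing
```

An empty result means every node reports both keys as passing. Any entry names the node to look at first.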
If the problem is still unclear, collect:
POST /api/v1/support/sysdump/cluster
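A sysdump request can be issued from a script so the archive is saved alongside other incident notes. This sketch makes several assumptions that are not confirmed here: the endpoint accepts an empty request body, needs no authentication, and returns the dump as the response body. The base URL and output path are placeholders.

```python
import shutil
import urllib.request

SYSDUMP_PATH = "/api/v1/support/sysdump/cluster"


def build_sysdump_request(base_url):
    """Build the POST request; an empty body is an assumption."""
    url = base_url.rstrip("/") + SYSDUMP_PATH
    return urllib.request.Request(url, data=b"", method="POST")


def collect_sysdump(base_url, out_path, timeout=120.0):
    """Send the request and stream the response body to out_path."""
    request = build_sysdump_request(base_url)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        with open(out_path, "wb") as out:
            # Assumption: the response body is the dump itself.
            shutil.copyfileobj(resp, out)
    return out_path
```

The generous timeout reflects that a cluster-wide collection may take longer than a simple status call. Adjust it to match the environment.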
Practical Rule
Do not start incident response with raw logs unless the process is failing to start. Readiness and stats usually tell you which subsystem is broken before logs tell you why.