Troubleshooting

Check liveness, then readiness, then stats, and only then move into DNS, audit, wiretap, or cluster-specific detail.

Troubleshooting is fastest when you follow the same order the runtime depends on.

First Checks

Start here:

  • GET /health
  • GET /ready
  • GET /metrics
  • GET /api/v1/stats

Interpretation:

  • /health tells you whether the management process is up
  • /ready tells you whether the node is actually able to operate
  • /metrics gives detailed counters
  • /api/v1/stats gives a compact runtime summary
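The first-checks sequence above can be sketched as a tiny helper. The endpoint paths come from this guide, but the idea that each returns a plain HTTP status is an assumption, so the fetch function is injectable and the example runs against a simulated node rather than a real one:

```python
# Order matters: liveness first, then readiness, then the detail surfaces.
FIRST_CHECKS = ["/health", "/ready", "/metrics", "/api/v1/stats"]

def first_failing_check(fetch):
    """Return the first endpoint that does not answer 200, or None if all pass."""
    for path in FIRST_CHECKS:
        if fetch(path) != 200:
            return path
    return None

# Simulated node: management process is up, but readiness is failing.
statuses = {"/health": 200, "/ready": 503, "/metrics": 200, "/api/v1/stats": 200}
print(first_failing_check(statuses.get))  # → /ready
```

Stopping at the first failure keeps you from reading detailed stats on a node that is not even ready.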

Suggested Triage Order

For most incidents:

  1. check /health
  2. check /ready
  3. inspect the exact failing readiness checks
  4. inspect /api/v1/stats
  5. use DNS, audit, or wiretap surfaces depending on the symptom
  6. collect a sysdump if the issue is broad or cluster-wide
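The triage order can be written down as a small decision helper. How /health and /ready report their state (booleans plus a list of failing check names) is an assumption made for illustration; only the ordering itself comes from this guide:

```python
def next_triage_step(health_ok, ready_ok, failing_checks=()):
    """Pick the next triage action from the liveness/readiness state."""
    if not health_ok:
        return "management process is down: inspect the process itself"
    if not ready_ok:
        return "inspect failing readiness checks: " + ", ".join(failing_checks)
    return "node is ready: inspect /api/v1/stats, then symptom-specific surfaces"

print(next_triage_step(True, False, ["cluster", "policy_replication"]))
# → inspect failing readiness checks: cluster, policy_replication
```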

If /ready Fails

Focus on the failing readiness key:

  • dataplane_running: dataplane thread or shutdown problem
  • dataplane_config: dataplane bootstrap or addressing problem
  • policy_ready: active policy not available or not yet replayed
  • dns_allowlist: DNS runtime not ready
  • service_plane: TLS intercept or other service-runtime issue
  • cluster: leader or cluster-health issue
  • policy_replication: cluster state not replayed locally yet
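The readiness keys above lend themselves to a lookup table. The key names and their meanings come from this guide; the assumption here is that /ready returns a JSON object mapping each check name to a boolean:

```python
READINESS_HINTS = {
    "dataplane_running":  "dataplane thread or shutdown problem",
    "dataplane_config":   "dataplane bootstrap or addressing problem",
    "policy_ready":       "active policy not available or not yet replayed",
    "dns_allowlist":      "DNS runtime not ready",
    "service_plane":      "TLS intercept or other service-runtime issue",
    "cluster":            "leader or cluster-health issue",
    "policy_replication": "cluster state not replayed locally yet",
}

def explain_ready_failure(ready_body):
    """Map each failing readiness check to its likely subsystem."""
    return {key: READINESS_HINTS.get(key, "unknown check")
            for key, ok in ready_body.items() if not ok}

print(explain_ready_failure({"dataplane_running": True, "cluster": False}))
# → {'cluster': 'leader or cluster-health issue'}
```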

If Traffic Is Being Blocked

Use the symptom to choose the next surface:

  • hostname policy problem: GET /api/v1/dns-cache
  • repeated denies: GET /api/v1/audit/findings
  • live confirmation needed: GET /api/v1/wiretap/stream

That split matters because a DNS-layer denial and a dataplane denial surface in different places for the operator, even though both end the same way for the user: traffic does not flow.

If audit or wiretap returns 503, check performance mode before digging deeper into the traffic path.
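The symptom-to-surface split, including the 503 caveat, can be sketched as follows. The symptom labels are shorthand invented for this example; the endpoints and the performance-mode caveat come from this guide:

```python
NEXT_SURFACE = {
    "hostname_policy":   "GET /api/v1/dns-cache",
    "repeated_denies":   "GET /api/v1/audit/findings",
    "live_confirmation": "GET /api/v1/wiretap/stream",
}

def next_surface(symptom, last_status=None):
    """Choose the next surface to inspect for a traffic-blocked symptom."""
    # A 503 from audit or wiretap means: check performance mode first.
    if last_status == 503:
        return "check performance mode before digging into the traffic path"
    return NEXT_SURFACE.get(symptom, "collect a sysdump")

print(next_surface("hostname_policy"))  # → GET /api/v1/dns-cache
```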

If The Problem Is Cluster-Wide

In clustered environments, verify:

  • whether a leader is known
  • whether followers are caught up
  • whether policy replication is ready everywhere

If the problem is still unclear, collect:

POST /api/v1/support/sysdump/cluster
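The cluster checklist above can be expressed as one decision helper. The boolean inputs are assumptions about what cluster status reports; the check order and the sysdump endpoint come from this guide:

```python
def cluster_next_step(leader_known, followers_caught_up, replication_ready):
    """Walk the cluster-wide checklist; fall through to a sysdump."""
    if not leader_known:
        return "no leader known: investigate leader election"
    if not followers_caught_up:
        return "followers lagging: investigate follower catch-up"
    if not replication_ready:
        return "policy replication not ready everywhere"
    # All checks pass but the problem persists: collect a cluster sysdump.
    return "POST /api/v1/support/sysdump/cluster"

print(cluster_next_step(True, True, True))
# → POST /api/v1/support/sysdump/cluster
```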

Practical Rule

Do not start incident response with raw logs unless the process is failing to start. Readiness and stats usually tell you which subsystem is broken before logs tell you why.