Observability

Use health, readiness, metrics, stats, audit, and wiretap together to understand node health and enforcement behavior.

Neuwerk exposes several different observability surfaces because no single endpoint answers every operational question.

The main ones are:

  • /health for process liveness
  • /ready for operational readiness
  • /metrics for Prometheus metrics
  • /api/v1/stats for a compact runtime snapshot
  • /api/v1/audit/findings for structured deny history
  • /api/v1/wiretap/stream for live traffic observation
  • process logs for runtime detail

Audit and wiretap depend on performance mode. If performance mode is disabled, those API surfaces return 503 until it is enabled again.

Start Here

When you need a quick health check, use this order:

  1. GET /health
  2. GET /ready
  3. GET /metrics
  4. GET /api/v1/stats

That sequence separates “is the process up” from “is the node actually ready to enforce traffic”.
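
The sequence above can be scripted. The following is a minimal sketch, not Neuwerk tooling. The `http_status` and `triage` helpers and the example address are hypothetical names chosen for illustration, and authentication and TLS settings are omitted.

```python
import urllib.error
import urllib.request

# Endpoint order taken from the quick-check list above.
ENDPOINTS = ["/health", "/ready", "/metrics", "/api/v1/stats"]


def http_status(url, timeout=5.0):
    """Return the HTTP status code for a GET, or None if the node is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # non-2xx responses still carry a status code
    except (urllib.error.URLError, OSError):
        return None


def triage(base_url, fetch=http_status):
    """Probe each endpoint in order and return a list of (path, status) pairs.

    `fetch` is injectable so the logic can be exercised without a live node.
    """
    results = []
    for path in ENDPOINTS:
        status = fetch(base_url.rstrip("/") + path)
        results.append((path, status))
        if status is None:
            break  # node unreachable; later probes would only time out
    return results
```

Calling `triage("https://neuwerk.example:8443")` (a placeholder address) returns one pair per endpoint probed. A 200 on /health followed by a non-200 on /ready suggests the process is up but the node is not yet ready to enforce traffic.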

Liveness And Readiness

/health answers whether the management process is alive.

/ready answers whether the node is ready to do useful work. It returns 200 only when the dataplane, policy state, DNS proxy, service plane, and any cluster-specific checks are in a usable state.

Important readiness checks include:

  • dataplane_running
  • dataplane_config
  • policy_ready
  • dns_allowlist
  • service_plane
  • draining
  • cluster
  • policy_replication
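
This page does not specify the /ready response body, so the sketch below works on a plain mapping from check name to pass or fail, however you obtain it. The check names come from the list above. The mapping shape and the `failing_checks` helper are assumptions made for illustration.

```python
# Check names from the list above, in the order they are listed.
READINESS_CHECKS = [
    "dataplane_running",
    "dataplane_config",
    "policy_ready",
    "dns_allowlist",
    "service_plane",
    "draining",
    "cluster",
    "policy_replication",
]


def failing_checks(checks):
    """Return the names of known checks that are explicitly reported as failing.

    `checks` is assumed to map a check name to True (passing) or False.
    Checks that are absent are not reported, since cluster-specific checks
    may not apply to every node.
    """
    return [name for name in READINESS_CHECKS if checks.get(name) is False]
```

For example, `failing_checks({"dataplane_running": True, "policy_ready": False})` returns `["policy_ready"]`, which points at policy state rather than the dataplane.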

Metrics

/metrics is the raw Prometheus surface. Use it for:

  • dashboards
  • alerting
  • capacity trend analysis
  • error-rate monitoring

Metrics are best for trends and saturation. They are not always the fastest path to an incident answer, which is why /ready and /api/v1/stats are better first checks.
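
For ad hoc inspection outside a Prometheus server, the text exposition format can be read with a few lines of code. This is a simplified sketch of a parser for that format, not a complete implementation. The metric names used in the usage note are made up, because this page does not list Neuwerk's metric names.

```python
def parse_metrics(text):
    """Parse Prometheus text exposition into a {series: value} dict.

    Simplified: comment lines are skipped, the label set stays part of the
    series key, and optional timestamps are ignored.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        end = line.rfind("}")
        if end != -1:
            # Split after the closing brace, because label values may
            # themselves contain spaces.
            series, rest = line[: end + 1], line[end + 1:]
        else:
            parts = line.split(None, 1)
            if len(parts) < 2:
                continue  # malformed line without a value
            series, rest = parts
        fields = rest.split()
        if fields:
            samples[series] = float(fields[0])
    return samples
```

Given a line such as `example_drops_total{reason="policy deny"} 12`, the result maps the full series string, labels included, to `12.0`. For dashboards and alerting, a Prometheus server scraping /metrics remains the intended path.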

Runtime Snapshot

/api/v1/stats is the fastest compact view of the running node.

Use it when you want:

  • dataplane counters
  • DNS state summary
  • TLS and service-plane state
  • cluster catch-up context

During incident response, it is easier to interpret than the full Prometheus surface.

Audit And Wiretap

Audit and wiretap answer different questions:

  • audit shows persisted structured findings about denies and selected auth events
  • wiretap shows live traffic observation

Use audit when you need evidence of repeated policy outcomes. Use wiretap when you need to confirm what is happening right now on a live path.

Both surfaces are intentionally gated by performance mode. Read Performance Mode if those workflows are unavailable.
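
A client that calls the audit or wiretap endpoints can treat 503 as a hint about performance mode rather than as a generic failure. The sketch below illustrates one way to do that. The `explain_gated_status` helper and its messages are hypothetical, and a 503 can have other causes.

```python
def explain_gated_status(status):
    """Map an HTTP status from an audit or wiretap request to a next step."""
    if status == 503:
        # Per the text above, these surfaces return 503 while performance
        # mode is disabled. Other causes of a 503 are possible.
        return "unavailable: check whether performance mode is disabled"
    if 200 <= status < 300:
        return "available"
    return f"unexpected status {status}: check process logs"
```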

Logs

Logs remain the best source for startup failures, component crashes, and explicit runtime errors.

Use JSON logs when you need to ship events into an external logging system. Use compact logs when you need local readability.

Practical Rule

Start with readiness, confirm with stats, then use metrics, audit, wiretap, and logs to narrow the problem. That order usually gets you to the right answer faster than jumping straight into raw counters.