Monitoring & Observability
What the platform sees about itself — metrics, dashboards, alerts. Where to look when something's wrong, and what gets paged in the middle of the night.
Why It Matters
A sovereign platform that can't see itself can't be trusted. The monitoring stack answers three questions every good operator asks at least weekly:
- Does the supply invariant still hold, with an audit delta of exactly zero? (If not, nothing else matters.)
- Is the network hitting its SLOs — latency, error rate, transfer throughput?
- What's the shape of the next failure? Trends, saturation, and early warnings before anything breaks.
This chapter covers how the stack produces that visibility, where to look for what, and what alerts exist so you know an incident has been recognized.
The Stack
Prometheus for metrics. Grafana for dashboards. Alertmanager for routing. cAdvisor for container stats. That's it — deliberately boring, industry-standard, auditable.
The stack as described here is illustrative of the reference multi-frame deployment; a single-registry install scrapes one registry, one TEG, and one event-store: the same metric families, fewer targets.
Continuous development
Dashboards and alert rules ship as a continuously evolving set; what follows is the current state, not a frozen surface. New panels, new rules, and protocol_* business metrics land regularly as the platform evolves. If something looks light, it's because the next refresh is queued.
Every service exposes /metrics in Prometheus format. Prometheus scrapes every 15 seconds, stores time series, and evaluates alert rules. Alertmanager routes firing alerts to email (and any other notifier you wire up). Grafana reads from Prometheus for dashboards and from the registry's own APIs for app-level views.
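To make "Prometheus format" concrete, here is a minimal sketch that parses a small sample of the text exposition format with the standard library only. The payload values and the `parse_samples` helper are assumptions for illustration, not the platform's code; the parser is deliberately simplified (no timestamps, no unescaping of label values), and a real client would use an established parser.

```python
import re

# Hypothetical /metrics payload: metric names appear in this chapter, values are made up.
SAMPLE = """\
# HELP eventstore_events_total Total events recorded
# TYPE eventstore_events_total counter
eventstore_events_total 1200
http_requests_total{method="GET",status="200"} 42
http_requests_total{method="GET",status="500"} 3
"""

# name, optional {labels}, then a single value (sample lines with timestamps are skipped).
LINE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Return a list of (name, labels, value) tuples, one per sample line."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE_RE.match(line)
        if match is None:
            continue  # outside what this simplified parser understands
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


samples = parse_samples(SAMPLE)
```

Each scrape yields samples like these; Prometheus attaches the scrape timestamp and target labels such as `job` before storing them.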
What's Measured
Three layers, with honest scope of what's wired today.
Infrastructure
- cadvisor: container CPU, memory, network I/O per service.
- node-exporter: host-level CPU, memory, disk, network.
- postgres-exporter: every Postgres instance scraped (the registry DB, TEG DB, and event-store DB; more in a multi-frame deployment). Connection counts, query latency, replication lag, cache hit ratio (the pg_* family).
Not currently wired: Redis exporter and Redpanda/Kafka exporter; both are in development. Redis health is observed today via container-level cAdvisor stats and via the alerts that fire when Redis-backed features fail (e.g. leader-election fallout in ServiceDown correlations); Kafka/Redpanda health is observed the same way.
Application (per service /metrics)
- HTTP middleware (every service): http_requests_total, http_request_duration_seconds, http_request_size_bytes, http_response_size_bytes. Standard Prometheus FastAPI middleware. Used by the HighErrorRate and HighLatencyP95 alerts.
- Registry IRONHAND: ironhand_enrolled_agents, ironhand_revoked_agents, ironhand_spire_agent_entries, ironhand_orphan_entries, ironhand_missing_entries. mTLS enrollment health.
- EventStore: eventstore_events_total, eventstore_events_by_type, eventstore_unique_agents, eventstore_ledger_age_seconds, eventstore_ledger_freshness_seconds, eventstore_events_ingested_total, eventstore_kafka_consumer_batches_total, eventstore_kafka_batch_size, eventstore_info. Drives the SupplyLedgerStale and EventStoreStaleLedger alerts.
- OPA shadow-mode (gated by OPA_ENABLED, off by default): opa_decisions_total, opa_decision_latency_seconds. See chapter 08.
The TEG layer emits only HTTP middleware metrics today — no teg_transfers_total, no teg_fees_collected, no staking gauges. Business-level rates are derived from eventstore_events_by_type{event_type="TokensTransferred"} and similar event-store queries.
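Because the TEG has no transfer counter of its own, a transfer rate has to be derived from the event-store counter. The sketch below shows the arithmetic on two raw counter samples, including the counter-reset case. It approximates what PromQL's `rate()` does (without its extrapolation) and is not a replacement for it; the function name and sample values are hypothetical.

```python
def per_second_rate(prev_value, prev_ts, curr_value, curr_ts):
    """Per-second increase between two counter samples.

    A counter that went down is treated as having reset to zero,
    so the whole current value counts as the increase.
    """
    elapsed = curr_ts - prev_ts
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    increase = curr_value - prev_value
    if increase < 0:
        increase = curr_value  # counter reset, e.g. after a service restart
    return increase / elapsed


# Two scrapes 15 seconds apart: 30 new TokensTransferred events.
steady = per_second_rate(1000, 0, 1030, 15)

# The service restarted between scrapes and the counter started over.
after_reset = per_second_rate(1000, 0, 45, 15)
```

In PromQL the equivalent would be along the lines of `sum(rate(eventstore_events_by_type{event_type="TokensTransferred"}[5m]))`, assuming the series is a counter.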
Business Invariants — In Development
The stack of business-level Prometheus metrics that would let you watch supply delta / federation drift / A2A token states / dispute phases / veToken turnout directly is not wired today. The data exists in the EventStore and the registry's own DB; it's just not yet exported as Prometheus gauges/counters. Today's surface for invariants:
- Supply audit delta: GET /api/v1/projections/supply-audit (EventStore), or eventstore_ledger_freshness_seconds as a proxy (the SupplyLedgerStale alert uses this).
- Federation peer health: up{job="peer-registry"} == 0 and the FederationPeerDown alert.
- Dispute / governance state: query the registry's own DB or the /api/v1/governance/proposals?status=* endpoints.
Promoting these to first-class protocol_* Prometheus metrics is on the roadmap.
The Canary Reactor
Most synthetic monitors hit a health endpoint and call it a day. The canary reactor moves real production AVT between funded canary agents on real production infrastructure, watches the resulting events propagate through five reactors and the EventStore, and tells you exactly which phase failed when something breaks silently. It runs continuously, uses the same code paths as user traffic, and has no test harness.
Live at /ui#/admin/canaries. Backed by canary_runner (scheduler registry only, leader-elected, fires every ~60s) and canary_judge (scheduler registry only, leader-elected, sweeps every 5s). Five validator reactors subscribe to EventStore events and stamp per-event ISO timestamps into a canary:<test_id> Redis hash; the judge reads the hash, decides outcome, persists to canary_test_results.
Three backends, three event signatures
Each path runs one of three transfer backends, and each backend emits a different set of events. The canary's per-backend "done" definition follows what the backend actually emits, not what would be tidy.
| Backend | Endpoint | Events emitted | Canary required-stamp |
|---|---|---|---|
| intra | /teg/transfer | TokensTransferred + TransactionFeeCollected | tokens_xfer_seen_iso |
| async | /teg/cross-registry-transfer?backend=async | Sender frame: TokensTransferred + CrossFrameTransferInitiated + TransactionFeeCollected. Receiver frame (after credit): CrossFrameTransferSettled. SF-4 broadcasts both Initiated and Settled across frames. | cf_initiated_seen_iso AND (cf_settled_local_seen_iso OR cf_settled_remote_seen_iso) |
| 2pc | /teg/cross-registry-transfer?backend=2pc | Always: TokensTransferred + CrossRegistryTransferCompleted + TransactionFeeCollected. Cross-frame 2pc additionally: one extra CrossFrameTransferSettled (Phase 7.5 visibility helper: TokensTransferred is in SF-4's foreign_events but not in CROSS_FRAME_BROADCAST_TYPES, so without that extra emit the canary on the other frame would never see the transfer land). | tokens_xfer_seen_iso OR (cf_settled_local_seen_iso OR cf_settled_remote_seen_iso) |
The canonical 2pc completion event is CrossRegistryTransferCompleted. No canary validator subscribes to it; the canary detects 2pc via the always-emitted TokensTransferred (memo wrapped as cross_registry:canary-<test_id>; the reactor strips the prefix before correlating). This keeps the validator subscription set to five event types and the per-backend logic symmetric.
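The per-backend required-stamp rules from the table can be written as a single predicate. This is a minimal sketch, assuming the Redis hash has already been read into a plain dict of field name to ISO timestamp; `required_stamps_present` is a hypothetical name and is not taken from canary_judge.

```python
def required_stamps_present(backend, stamps):
    """True when the recorded stamps satisfy the backend's 'done' rule."""

    def seen(field):
        return bool(stamps.get(field))

    # Either side of the settle event counts, local or remote.
    settled = seen("cf_settled_local_seen_iso") or seen("cf_settled_remote_seen_iso")

    if backend == "intra":
        return seen("tokens_xfer_seen_iso")
    if backend == "async":
        return seen("cf_initiated_seen_iso") and settled
    if backend == "2pc":
        return seen("tokens_xfer_seen_iso") or settled
    raise ValueError(f"unknown backend: {backend!r}")


# An async transfer that was initiated but has not settled yet is not done.
pending = required_stamps_present(
    "async", {"cf_initiated_seen_iso": "2024-01-01T00:00:00.250+00:00"}
)
```

Note that fee_collected_seen_iso appears nowhere in the predicate, which matches its role as an opportunistic latency signal rather than a requirement.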
The five validator reactors
| Event | Reactor stamps | Notes |
|---|---|---|
| TokensTransferred | tokens_xfer_seen_iso | All 3 backends emit. Required for intra; one of two acceptable for 2pc; informational for cross-frame async. |
| TransactionFeeCollected | fee_collected_seen_iso | All 3 backends emit. The fee event has no memo field, so memo-fallback alone won't catch it; only the canary_xfer:<transfer_id> reverse-index does. Treated as opportunistic latency signal, never required. |
| CrossFrameTransferInitiated | cf_initiated_seen_iso | async cross-frame only. 2pc does not emit this. Required for async. |
| CrossFrameTransferSettled | cf_settled_local_seen_iso (if source_frame == local_frame) or cf_settled_remote_seen_iso | async receiver frame after credit; cross-frame 2pc Phase 7.5 helper. Required for async; one of two acceptable for cross-frame 2pc. |
| CrossFrameTransferRefunded | cf_refunded_seen_iso + failure_signal=refunded | async saga sweeper after 5-min timeout. Presence is an explicit failure: the judge marks failed/refunded immediately. |
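The stamping rules in the table, plus the memo prefix handling described earlier, can be sketched as two small functions. This is an illustration only: a plain dict stands in for the canary:<test_id> Redis hash, the helper names are hypothetical, and the assumption that an unwrapped memo has the form canary-<test_id> is inferred from the wrapped form rather than stated by the source.

```python
CROSS_REGISTRY_PREFIX = "cross_registry:"
CANARY_PREFIX = "canary-"  # assumed unwrapped memo form: canary-<test_id>

SIMPLE_STAMPS = {
    "TokensTransferred": "tokens_xfer_seen_iso",
    "TransactionFeeCollected": "fee_collected_seen_iso",
    "CrossFrameTransferInitiated": "cf_initiated_seen_iso",
}


def canary_id_from_memo(memo):
    """Return <test_id> for a canary memo, with or without the 2pc wrapper."""
    if memo.startswith(CROSS_REGISTRY_PREFIX):
        memo = memo[len(CROSS_REGISTRY_PREFIX):]
    if memo.startswith(CANARY_PREFIX):
        return memo[len(CANARY_PREFIX):]
    return None  # ordinary user transfer, not a canary


def apply_event(stamps, event_type, seen_iso, source_frame=None, local_frame=None):
    """Record the stamp(s) one validator would write for one event."""
    if event_type in SIMPLE_STAMPS:
        stamps[SIMPLE_STAMPS[event_type]] = seen_iso
    elif event_type == "CrossFrameTransferSettled":
        side = "local" if source_frame == local_frame else "remote"
        stamps[f"cf_settled_{side}_seen_iso"] = seen_iso
    elif event_type == "CrossFrameTransferRefunded":
        stamps["cf_refunded_seen_iso"] = seen_iso
        stamps["failure_signal"] = "refunded"  # explicit failure for the judge
    return stamps  # unknown event types leave the hash untouched


stamps = {}
apply_event(stamps, "CrossFrameTransferInitiated", "2024-01-01T00:00:00.250+00:00")
apply_event(stamps, "CrossFrameTransferSettled", "2024-01-01T00:00:00.900+00:00",
            source_frame="frame-a", local_frame="frame-b")
```

After these two calls the hash holds cf_initiated_seen_iso and cf_settled_remote_seen_iso, which is enough for an async test to count as done.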
The (i) and the live log
The dashboard hero has a small italic i next to the title that toggles a six-block inline panel documenting path matrix, fire cadence, validation pipeline, failure semantics, observability surface, and source files. Click-to-expand, no page navigation.
The bottom of the view is a live event log feed: every distinct outcome the dashboard has seen across refresh cycles, capped at 200 entries, color-coded, filterable by outcome, collapsible. Auto-tail toggle prepends new arrivals; collapsed-and-filling sprouts a +N badge so you don't miss movement.
Latency: don't let the observer dominate the measurement
total_latency_ms went through three definitions before settling. The first version was wallclock(judge_completed - fired) — every test averaged ~17s because the judge ran every 30s, so the metric was dominated by sweep cadence rather than transfer time. The fix was to compute latency from the latest *_seen_iso stamp the validator wrote, not from when the judge happened to look. Average dropped from 17s → ~935ms. General lesson: never let an observer interval dominate a measurement. If a synthetic monitor lies about latency, alerts are gated on a number that doesn't mean what it says.
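The settled definition can be sketched in a few lines: latency runs from the fire time to the latest *_seen_iso stamp, and the moment the judge happens to sweep plays no part. This is a minimal illustration, assuming ISO 8601 timestamps with explicit UTC offsets and a plain dict of stamps; the function signature is hypothetical and is not the judge's actual code.

```python
from datetime import datetime


def total_latency_ms(fired_iso, stamps):
    """Milliseconds from fire time to the latest validator stamp, or None."""
    seen = [
        datetime.fromisoformat(value)
        for field, value in stamps.items()
        if field.endswith("_seen_iso") and value  # skips failure_signal etc.
    ]
    if not seen:
        return None  # nothing observed yet, so there is no latency to report
    delta = max(seen) - datetime.fromisoformat(fired_iso)
    return round(delta.total_seconds() * 1000)


latency = total_latency_ms(
    "2024-01-01T00:00:00.000+00:00",
    {
        "tokens_xfer_seen_iso": "2024-01-01T00:00:00.400+00:00",
        "fee_collected_seen_iso": "2024-01-01T00:00:00.935+00:00",
    },
)
```

The result depends only on when the validators saw the events, so changing the judge's sweep interval leaves the number unchanged.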
What the canary is not
The canary is not the supply auditor. The auditor verifies tokens_issued − tokens_destroyed + transit_net == tokens_circulating across the EventStore on its own cycle (60s) with zero tolerance and its own dashboard at auditor.example.com. The canary watches whether transfers complete; the auditor watches whether they conserve mass. Different invariants, different failure modes, different runbooks. Don't conflate.
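The conservation check the auditor performs can be stated as one line of arithmetic. The sketch below is an illustration of the invariant only, assuming all four quantities are integers in the token's smallest unit so that "zero tolerance" is exact; the function name and the figures are hypothetical.

```python
def supply_delta(tokens_issued, tokens_destroyed, transit_net, tokens_circulating):
    """Difference between expected and observed circulating supply."""
    expected = tokens_issued - tokens_destroyed + transit_net
    return expected - tokens_circulating


# Made-up figures: 1000 issued, 50 destroyed, 10 net outbound in transit.
delta = supply_delta(1000, 50, -10, 940)
healthy = delta == 0  # zero tolerance: any non-zero delta is an incident
```

Integer arithmetic matters here: with floats, rounding noise could hide or fake a non-zero delta.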
TIP
Post-restart of the scheduler registry, expect ~5-10 minutes of canary noise while the five validator reactors re-acquire leader locks, re-establish EventStore WebSocket subscriptions, and let the polling-backstop close the watermark gap. Don't page on canary failures within the first 10 minutes after a scheduler-registry recreate. After the warmup, the next cycle judges 30/30 passed.
🔗 Grafana dashboard: 28 — Canary Reactor — 8 sections covering pass-rate aggregates, per-path scoreboard, validator pipeline, upstream EventStore + TEG-DB pressure, and a 30d trend via VictoriaMetrics. Annotations mark scheduler-registry restarts so you don't misread post-restart warmup noise as a real regression. · Long-form blog post: The Canary Reactor: How TheProtocol Watches Itself, One Real Transfer at a Time
Where Things Live
Grafana https://grafana.example.com
Prometheus https://prometheus.example.com
Alertmanager https://alerts.example.com

Curated dashboards (shipped at monitoring/grafana/provisioning/dashboards/json/, editable):
| # | Real title | Purpose |
|---|---|---|
| 01 | 01 — System Overview | top-level health — start here |
| 02 | 02 — Registry & Service Health | per-service uptime, request rate, p95 latency |
| 03 | 03 — TEG / Token Economics | transfers, staking ops, fee collection (mostly via EventStore-derived series today) |
| 04 | 04 — EventStore: Dual-Frame Unified | write rate, ledger freshness, supply audit cross-frame |
| 05 | 05 — PostgreSQL Performance | connections, locks, slow queries, replication lag |
| 07 | Container Resources (cAdvisor) | CPU, memory, network per container |
| 08 | Frame B Sovereign | per-frame health for an additional frame's registry / TEG / event-store — multi-frame deployments only |
TIP
For everyday operation, start at System Overview. If everything is green, close the tab. If anything isn't, the panel title tells you which dashboard to drill into.
Alerts
Prometheus evaluates the rules, and Alertmanager routes firing alerts to security@example.com. The rule set is deliberately conservative: every firing alert represents something an operator should look at within an hour.
An INFO severity tier is reserved in Prometheus convention but no INFO-severity alerts are wired today — capacity-trend / worker-lag / TVL-swing alerts are in development; the platform pages on critical+warning only for now.
Representative alerts (selected from the live prometheus_alerts.yml):
| Alert | Severity | Fires when |
|---|---|---|
| SupplyLedgerStale | CRITICAL | EventStore ledger freshness > 30 minutes (no new events) for 5m; supply audit may be unreliable |
| ServiceDown | CRITICAL | up{tier="application"} == 0 for 2m; registry / TEG / event-store unreachable |
| PostgreSQLDown | CRITICAL | postgres-exporter up == 0 for 2m |
| EventStoreStaleLedger | CRITICAL | EventStore-side staleness signal (separate from ledger freshness) |
| FrameBRegistryDown / FrameBEventStoreDown / FrameBLedgerStale / FrameBHighErrorRate | CRITICAL/WARNING | per-additional-frame equivalents: registry/eventstore down, ledger stale, or 5xx rate elevated (multi-frame deployments) |
| HighErrorRate | WARNING | 5xx rate > 5% over 5 minutes |
| HighLatencyP95 | WARNING | p95 route latency > 2s for 5 minutes |
| FederationPeerDown | WARNING | up{job="peer-registry"} == 0 for 5m; cross-frame sync compromised |
| DBConnectionsHigh / DBConnectionsCritical | WARNING | Postgres connection pool utilization elevated |
| HostHighCPU / HostHighMemory / HostDiskSpaceLow / DiskSpaceCritical | WARNING | Infrastructure pressure |
| ContainerRestarting / ContainerMemoryHigh | WARNING | cAdvisor container instability |
| EventStoreDeadlockBurst | WARNING | DB deadlock spike |
| PostgreSQLHighConnections / PostgreSQLLowCacheHitRatio | WARNING | Postgres performance degradation |
The supply-side critical alerts are the only ones that mean "stop everything and investigate now." Everything else is "look in the next hour."
PromQL Cheatsheet
A few queries worth bookmarking (verified against live /metrics endpoints):
# EventStore freshness — proxy for supply audit health
eventstore_ledger_freshness_seconds{job="event-store"}
# EventStore total events recorded
eventstore_events_total
# p95 latency per service (matches the HighLatencyP95 alert)
histogram_quantile(0.95, sum by (job, le) (rate(http_request_duration_seconds_bucket[5m])))
# 5xx rate per service
sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
/ sum by (job) (rate(http_requests_total[5m]))
# Service liveness
up{tier="application"}
# IRONHAND mTLS enrollment count
ironhand_enrolled_agents
# OPA shadow-mode decisions (when sandbox/prod has OPA_ENABLED=true)
sum by (policy, decision, source) (rate(opa_decisions_total[5m]))

Real metric prefixes: eventstore_* (EventStore service), opa_* (OPA shadow-mode, see chapter 08), ironhand_* (agent mTLS), http_* (HTTP middleware on every service), container_* (cAdvisor), pg_* (postgres-exporter), up (Prometheus liveness). A unified protocol_* app-metric namespace (per-transfer counters, staking-TVL gauge, federation-peer status) is in development; the scaffolding is there in prometheus_client Counter/Gauge/Histogram registrations, just not yet rolled out across the registry's transfer/stake/governance paths.
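The p95 query in the cheatsheet relies on `histogram_quantile`, which estimates a quantile from cumulative bucket counts rather than from raw samples. The sketch below is a simplified version of that interpolation, meant to show why the result is only as precise as the bucket boundaries; the bucket values are made up and the edge-case handling is narrower than in Prometheus itself.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q from (upper_bound, cumulative_count) pairs.

    The list must include the +Inf bucket, whose count is the total.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        raise ValueError("need at least one finite bucket plus the +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations in the window
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                return buckets[-2][0]  # cap at the highest finite bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count


# Hypothetical request-duration buckets (seconds) for one service.
BUCKETS = [(0.5, 80), (1.0, 90), (2.0, 98), (math.inf, 100)]
p95 = histogram_quantile(0.95, BUCKETS)
```

With these buckets the estimate lands between 1s and 2s, since the 95th of 100 observations falls inside the (1.0, 2.0] bucket. A quantile that falls into the +Inf bucket is reported as the highest finite bound, so a 2s alert threshold only works if the histogram has bucket boundaries above 2s.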
Admin API for Queries
If you need to query metrics or dashboards from code (a CI health check, a Claude tool call, an external monitor), use the admin MCP tools:
theprotocol_adminPromQuery(query="up", start=None, end=None, step=None)
theprotocol_adminGrafanaQuery(path="/api/dashboards/uid/system-overview", method="GET")

Both forward to the underlying services via httpx (tools_admin.py:313, 389). For HTTP-direct access without the MCP wrapper, the admin MCP theprotocol_adminRequest proxy can hit any registry or upstream-Prometheus/Grafana endpoint. There's no dedicated /api/v1/admin/prom/query registry endpoint today; instead, query Prometheus directly at https://prometheus.example.com/api/v1/query (admin auth on the nginx layer) or use the MCP tool. See chapter 12.
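For the direct route, the sketch below builds an instant-query URL for the Prometheus HTTP API and interprets a response in its standard JSON shape. It performs no network call and omits the admin authentication on the nginx layer; the helper names and the canned response are assumptions for illustration.

```python
import json
from urllib.parse import urlencode

PROMETHEUS_BASE = "https://prometheus.example.com"


def instant_query_url(query, base=PROMETHEUS_BASE):
    """URL for GET /api/v1/query with the PromQL expression URL-encoded."""
    return f"{base}/api/v1/query?{urlencode({'query': query})}"


def down_jobs(response_body):
    """Job labels of every series whose instant value is 0."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error', 'unknown error')}")
    down = []
    for series in payload["data"]["result"]:
        _timestamp, value = series["value"]  # the value arrives as a string
        if float(value) == 0:
            down.append(series["metric"].get("job", "<unknown>"))
    return down


url = instant_query_url('up{tier="application"}')

# Canned response in the instant-vector shape; the job names are made up.
CANNED = json.dumps({
    "status": "success",
    "data": {"resultType": "vector", "result": [
        {"metric": {"job": "registry"}, "value": [1700000000.0, "1"]},
        {"metric": {"job": "event-store"}, "value": [1700000000.0, "0"]},
    ]},
})
unreachable = down_jobs(CANNED)
```

A CI health check could fail the build whenever `down_jobs` returns a non-empty list.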
Security Monitoring (Lightweight)
Separate from operational telemetry, the platform runs:
- CrowdSec: community threat intel + firewall bouncer. cscli alerts list shows recent alerts, and cscli decisions list shows the decisions currently in force.
- fail2ban: SSH brute-force protection on the host.
- Lynis: periodic security audit of the host; check your own score and harden from its findings.
These feed the same Alertmanager email channel when a security event fires.
What's Next
- 🔗 07 — The Event Store & Supply Audit — where the supply invariant comes from
- 🔗 08 — Security Architecture — how SPIRE identity feeds service-to-service observability
- 🔗 19 — Compliance & Governance — monitoring posture for auditors (ISO 42001)