Monitoring & Observability

What the platform sees about itself — metrics, dashboards, alerts. Where to look when something's wrong, and what gets paged in the middle of the night.

Why It Matters

A sovereign platform that can't see itself can't be trusted. The monitoring stack answers three questions every good operator asks at least weekly:

Is the supply invariant still zero? (If not, nothing else matters.)
Is the network hitting its SLOs — latency, error rate, transfer throughput?
What's the shape of the next failure? — trends, saturation, early warnings before anything breaks.

This chapter covers how the stack produces that visibility, where to look for what, and what alerts exist so you know an incident has been recognized.

The Stack

Prometheus for metrics. Grafana for dashboards. Alertmanager for routing. cAdvisor for container stats. That's it — deliberately boring, industry-standard, auditable.

The stack as described is illustrative of the reference multi-frame deployment. A single-registry install scrapes one registry, TEG, and event-store: the same metric families, fewer targets.

Continuous development

Dashboards and alert rules ship as a continuously evolving set: what is listed below is the current state, not a frozen surface. New panels, new rules, and protocol_* business metrics land regularly as the platform evolves. If something looks light, it's because the next refresh is queued.

Every service exposes /metrics in Prometheus format. Prometheus scrapes every 15 seconds, stores time series, and evaluates alert rules. Alertmanager routes firing alerts to email (and any other notifier you wire up). Grafana reads from Prometheus for dashboards and from the registry's own APIs for app-level views.
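To make the scrape format concrete, here is a minimal sketch of reading the Prometheus text exposition format roughly the way a scraper would. The `parse_samples` helper, the sample payload, and its label names are illustrative assumptions, not platform code. The parser deliberately ignores optional trailing timestamps and escaped label values.

```python
def parse_samples(text: str) -> dict[str, float]:
    """Map each series (metric name plus labels) to its sample value.

    Simplified on purpose: assumes one `series value` pair per line and
    no trailing timestamp.
    """
    samples: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


# Hypothetical payload shaped like a service's /metrics response.
EXAMPLE = """\
# HELP http_requests_total Total HTTP requests.
# TYPE http_requests_total counter
http_requests_total{method="GET",status="200"} 1027
http_requests_total{method="GET",status="500"} 3
# TYPE ironhand_enrolled_agents gauge
ironhand_enrolled_agents 42
"""

if __name__ == "__main__":
    for series, value in parse_samples(EXAMPLE).items():
        print(series, value)
```

In practice Prometheus does this parsing for you; the sketch is only useful for eyeballing what a single scrape of a service returns.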

What's Measured

Three layers, with honest scope of what's wired today.

Infrastructure

cadvisor — container CPU, memory, network I/O per service.
node-exporter — host-level CPU, memory, disk, network.
postgres-exporter — every Postgres instance scraped (the registry DB, TEG DB, and event-store DB; more in a multi-frame deployment). Connection counts, query latency, replication lag, cache hit ratio (the pg_* family).

Not currently wired: Redis exporter and Redpanda/Kafka exporter; both are in development. Redis health is observed today via container-level cAdvisor stats and via the alerts that fire when Redis-backed features fail (e.g. leader-election fallout in ServiceDown correlations); Kafka/Redpanda health is observed the same way.

Application (per service `/metrics`)

HTTP middleware (every service) — http_requests_total, http_request_duration_seconds, http_request_size_bytes, http_response_size_bytes. Standard Prometheus FastAPI middleware. Used by the HighErrorRate and HighLatencyP95 alerts.
Registry IRONHAND — ironhand_enrolled_agents, ironhand_revoked_agents, ironhand_spire_agent_entries, ironhand_orphan_entries, ironhand_missing_entries. mTLS enrollment health.
EventStore — eventstore_events_total, eventstore_events_by_type, eventstore_unique_agents, eventstore_ledger_age_seconds, eventstore_ledger_freshness_seconds, eventstore_events_ingested_total, eventstore_kafka_consumer_batches_total, eventstore_kafka_batch_size, eventstore_info. Drives the SupplyLedgerStale and EventStoreStaleLedger alerts.
OPA shadow-mode (gated by OPA_ENABLED, off by default) — opa_decisions_total, opa_decision_latency_seconds. See chapter 08.

The TEG layer emits only HTTP middleware metrics today — no teg_transfers_total, no teg_fees_collected, no staking gauges. Business-level rates are derived from eventstore_events_by_type{event_type="TokensTransferred"} and similar event-store queries.

Business Invariants — In Development

Business-level Prometheus metrics that would let you watch supply delta, federation drift, A2A token states, dispute phases, and veToken turnout directly are not wired today. The data exists in the EventStore and the registry's own DB; it is not yet exported as Prometheus gauges or counters. Today's surface for invariants:

Supply audit delta — GET /api/v1/projections/supply-audit (EventStore), or eventstore_ledger_freshness_seconds as a proxy (the SupplyLedgerStale alert uses this).
Federation peer health — up{job="peer-registry"} == 0 and the FederationPeerDown alert.
Dispute / governance state — query the registry's own DB or /api/v1/governance/proposals?status=* endpoints.

Promoting these to first-class protocol_* Prometheus metrics is on the roadmap.

The Canary Reactor

Most synthetic monitors hit a health endpoint and call it a day. The canary reactor moves real production AVT between funded canary agents on real production infrastructure, watches the resulting events propagate through five reactors and the EventStore, and tells you exactly which phase failed when something breaks silently. It runs continuously, uses the same code paths as user traffic, and has no test harness.

Live at /ui#/admin/canaries. Backed by canary_runner (scheduler registry only, leader-elected, fires every ~60s) and canary_judge (scheduler registry only, leader-elected, sweeps every 5s). Five validator reactors subscribe to EventStore events and stamp per-event ISO timestamps into a canary:<test_id> Redis hash; the judge reads the hash, decides outcome, persists to canary_test_results.

Three backends, three event signatures

Each path runs one of three transfer backends, and each backend emits a different set of events. The canary's per-backend "done" definition follows what the backend actually emits, not what would be tidy.

Backend	Endpoint	Events emitted	Canary required-stamp
`intra`	`/teg/transfer`	`TokensTransferred` + `TransactionFeeCollected`	`tokens_xfer_seen_iso`
`async`	`/teg/cross-registry-transfer?backend=async`	Sender frame: `TokensTransferred` + `CrossFrameTransferInitiated` + `TransactionFeeCollected`. Receiver frame (after credit): `CrossFrameTransferSettled`. SF-4 broadcasts both Initiated and Settled across frames.	`cf_initiated_seen_iso` AND (`cf_settled_local_seen_iso` OR `cf_settled_remote_seen_iso`)
`2pc`	`/teg/cross-registry-transfer?backend=2pc`	Always: `TokensTransferred` + `CrossRegistryTransferCompleted` + `TransactionFeeCollected`. Cross-frame 2pc additionally: one extra `CrossFrameTransferSettled` (Phase 7.5 visibility helper — `TokensTransferred` is in SF-4's `foreign_events` but not in `CROSS_FRAME_BROADCAST_TYPES`, so without that extra emit the canary on the other frame would never see the transfer land).	`tokens_xfer_seen_iso` OR (`cf_settled_local_seen_iso` OR `cf_settled_remote_seen_iso`)

The canonical 2pc completion event is CrossRegistryTransferCompleted. No canary validator subscribes to it; the canary detects 2pc via the always-emitted TokensTransferred (memo wrapped as cross_registry:canary-<test_id>; the reactor strips the prefix before correlating). This keeps the validator subscription set to five event types and the per-backend logic symmetric.
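The required-stamp column in the table above can be read as a small predicate per backend. The following is a sketch of that rule only, assuming the stamps arrive as a plain dict of field name to ISO timestamp; the function name `required_stamps_met` is hypothetical and this is not the actual canary_judge code.

```python
def required_stamps_met(backend: str, stamps: dict[str, str]) -> bool:
    """Return True when the stamps seen so far satisfy the backend's 'done' rule."""

    def has(field: str) -> bool:
        # A stamp counts only if the field exists and is non-empty.
        return bool(stamps.get(field))

    settled = has("cf_settled_local_seen_iso") or has("cf_settled_remote_seen_iso")

    if backend == "intra":
        return has("tokens_xfer_seen_iso")
    if backend == "async":
        return has("cf_initiated_seen_iso") and settled
    if backend == "2pc":
        return has("tokens_xfer_seen_iso") or settled
    raise ValueError(f"unknown backend: {backend!r}")


if __name__ == "__main__":
    # An async transfer that was initiated but has not settled yet is not done.
    pending = {"cf_initiated_seen_iso": "2024-01-01T00:00:00+00:00"}
    print(required_stamps_met("async", pending))
```

Note that `fee_collected_seen_iso` never appears in the predicate, which matches its role as an opportunistic latency signal rather than a required stamp.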

The five validator reactors

Event	Reactor stamps	Notes
`TokensTransferred`	`tokens_xfer_seen_iso`	All 3 backends emit. Required for `intra`; one of two acceptable for `2pc`; informational for cross-frame `async`.
`TransactionFeeCollected`	`fee_collected_seen_iso`	All 3 backends emit. The fee event has no `memo` field, so memo-fallback alone won't catch it; only the `canary_xfer:<transfer_id>` reverse-index does. Treated as opportunistic latency signal, never required.
`CrossFrameTransferInitiated`	`cf_initiated_seen_iso`	`async` cross-frame only. 2pc does not emit this. Required for `async`.
`CrossFrameTransferSettled`	`cf_settled_local_seen_iso` (if `source_frame == local_frame`) or `cf_settled_remote_seen_iso`	`async` receiver frame after credit; cross-frame `2pc` Phase 7.5 helper. Required for `async`; one of two acceptable for cross-frame `2pc`.
`CrossFrameTransferRefunded`	`cf_refunded_seen_iso` + `failure_signal=refunded`	`async` saga sweeper after 5-min timeout. Presence is an explicit failure — judge marks `failed/refunded` immediately.
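As a rough illustration of the table above, the mapping from event type to stamped fields could look like the sketch below. The function name `stamps_for_event`, the `seen_at` parameter, and the frame names in the usage line are assumptions made for the example. Only the hash field names and the `source_frame == local_frame` rule come from the table. Correlation (memo fallback and the `canary_xfer:<transfer_id>` reverse index) is left out.

```python
from datetime import datetime, timezone
from typing import Optional

# Events that map to exactly one timestamp field.
SIMPLE_STAMPS = {
    "TokensTransferred": "tokens_xfer_seen_iso",
    "TransactionFeeCollected": "fee_collected_seen_iso",
    "CrossFrameTransferInitiated": "cf_initiated_seen_iso",
}


def stamps_for_event(
    event_type: str,
    payload: dict,
    local_frame: str,
    seen_at: Optional[datetime] = None,
) -> dict[str, str]:
    """Return the hash fields a validator would write for one event."""
    seen_iso = (seen_at or datetime.now(timezone.utc)).isoformat()

    if event_type in SIMPLE_STAMPS:
        return {SIMPLE_STAMPS[event_type]: seen_iso}
    if event_type == "CrossFrameTransferSettled":
        # Local vs remote is decided by where the transfer originated.
        side = "local" if payload.get("source_frame") == local_frame else "remote"
        return {f"cf_settled_{side}_seen_iso": seen_iso}
    if event_type == "CrossFrameTransferRefunded":
        # A refund is an explicit failure signal, not just a timestamp.
        return {"cf_refunded_seen_iso": seen_iso, "failure_signal": "refunded"}
    return {}  # not one of the five subscribed event types


if __name__ == "__main__":
    print(stamps_for_event("CrossFrameTransferSettled",
                           {"source_frame": "frame-b"}, local_frame="frame-a"))
```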

The (i) and the live log

The dashboard hero has a small italic i next to the title. Clicking it toggles a six-block inline panel documenting path matrix, fire cadence, validation pipeline, failure semantics, observability surface, and source files. Click-to-expand, no page navigation.

The bottom of the view is a live event log feed: every distinct outcome the dashboard has seen across refresh cycles, capped at 200 entries, color-coded, filterable by outcome, collapsible. The auto-tail toggle prepends new arrivals; when the log is collapsed and still filling, a +N badge appears so you don't miss movement.

Latency: don't let the observer dominate the measurement

total_latency_ms went through three definitions before settling. The first version was wallclock(judge_completed - fired) — every test averaged ~17s because the judge ran every 30s, so the metric was dominated by sweep cadence rather than transfer time. The fix was to compute latency from the latest *_seen_iso stamp the validator wrote, not from when the judge happened to look. Average dropped from 17s → ~935ms. General lesson: never let an observer interval dominate a measurement. If a synthetic monitor lies about latency, alerts are gated on a number that doesn't mean what it says.
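A minimal sketch of the stamp-based definition, assuming the fire time is available as an ISO string (called `fired_iso` here, a hypothetical name) and that all timestamps are in a form `datetime.fromisoformat` accepts. The point is that the judge's own clock never enters the calculation.

```python
from datetime import datetime, timedelta
from typing import Optional


def total_latency_ms(fired_iso: str, stamps: dict[str, str]) -> Optional[float]:
    """Latency from fire time to the latest validator stamp, in milliseconds.

    Only *_seen_iso fields are considered; when the judge happens to sweep
    plays no part in the result.
    """
    seen = [
        datetime.fromisoformat(value)
        for field, value in stamps.items()
        if field.endswith("_seen_iso") and value
    ]
    if not seen:
        return None  # nothing observed yet, so there is no latency to report
    fired = datetime.fromisoformat(fired_iso)
    return (max(seen) - fired) / timedelta(milliseconds=1)


if __name__ == "__main__":
    stamps = {
        "tokens_xfer_seen_iso": "2024-01-01T00:00:00.700000+00:00",
        "fee_collected_seen_iso": "2024-01-01T00:00:00.935000+00:00",
    }
    print(total_latency_ms("2024-01-01T00:00:00+00:00", stamps))
```

The earlier wallclock definition would have subtracted the fire time from the judge's completion time instead, which is how sweep cadence leaked into the number.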

What the canary is not

The canary is not the supply auditor. The auditor verifies tokens_issued − tokens_destroyed + transit_net == tokens_circulating across the EventStore on its own cycle (60s) with zero tolerance and its own dashboard at auditor.example.com. The canary watches whether transfers complete; the auditor watches whether they conserve mass. Different invariants, different failure modes, different runbooks. Don't conflate.
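For reference, the auditor's invariant reduces to a single subtraction. The sketch below assumes the four totals are available as integers in the token's smallest unit; the function name `supply_delta` and the figures in the usage lines are made up for illustration.

```python
def supply_delta(
    tokens_issued: int,
    tokens_destroyed: int,
    transit_net: int,
    tokens_circulating: int,
) -> int:
    """Left side minus right side of the supply invariant; zero means it holds."""
    return tokens_issued - tokens_destroyed + transit_net - tokens_circulating


if __name__ == "__main__":
    # Illustrative numbers only.
    delta = supply_delta(
        tokens_issued=1_000_000,
        tokens_destroyed=25_000,
        transit_net=-500,
        tokens_circulating=974_500,
    )
    # Zero tolerance: any non-zero delta is a violation.
    print("invariant holds" if delta == 0 else f"supply delta: {delta}")
```

Integer arithmetic is used so that "zero tolerance" means exactly zero, with no floating-point rounding to argue about.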

TIP

Post-restart of the scheduler registry, expect ~5-10 minutes of canary noise while the five validator reactors re-acquire leader locks, re-establish EventStore WebSocket subscriptions, and let the polling-backstop close the watermark gap. Don't page on canary failures within the first 10 minutes after a scheduler-registry recreate. After the warmup, the next cycle judges 30/30 passed.
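If canary failures feed a pager, the warmup rule above can be encoded as a simple gate. This is a sketch under the assumption that the last scheduler-registry restart time is known to the caller; how it is obtained is out of scope, and `should_page` is a hypothetical helper rather than an existing platform function.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

WARMUP = timedelta(minutes=10)  # upper bound of the warmup window described above


def should_page(failure_at: datetime, last_restart_at: Optional[datetime]) -> bool:
    """Return False for canary failures that fall inside the post-restart warmup."""
    if last_restart_at is None:
        return True  # no known restart, so nothing to suppress
    in_warmup = last_restart_at <= failure_at < last_restart_at + WARMUP
    return not in_warmup


if __name__ == "__main__":
    restart = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    print(should_page(restart + timedelta(minutes=4), restart))   # inside warmup
    print(should_page(restart + timedelta(minutes=15), restart))  # after warmup
```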

🔗 Grafana dashboard: 28 — Canary Reactor — 8 sections covering pass-rate aggregates, per-path scoreboard, validator pipeline, upstream EventStore + TEG-DB pressure, and a 30d trend via VictoriaMetrics. Annotations mark scheduler-registry restarts so you don't misread post-restart warmup noise as a real regression. · Long-form blog post: The Canary Reactor: How TheProtocol Watches Itself, One Real Transfer at a Time

Where Things Live

Grafana        https://grafana.example.com
Prometheus     https://prometheus.example.com
Alertmanager   https://alerts.example.com

Curated dashboards (shipped at monitoring/grafana/provisioning/dashboards/json/, editable):

#	Dashboard title	Purpose
01	01 — System Overview	top-level health — start here
02	02 — Registry & Service Health	per-service uptime, request rate, p95 latency
03	03 — TEG / Token Economics	transfers, staking ops, fee collection (mostly via EventStore-derived series today)
04	04 — EventStore: Dual-Frame Unified	write rate, ledger freshness, supply audit cross-frame
05	05 — PostgreSQL Performance	connections, locks, slow queries, replication lag
07	Container Resources (cAdvisor)	CPU, memory, network per container
08	Frame B Sovereign	per-frame health for an additional frame's registry / TEG / event-store — multi-frame deployments only

TIP

For everyday operation, start at System Overview. If everything is green, close the tab. If anything isn't, the panel title tells you which dashboard to drill into.

Alerts

Prometheus evaluates the alert rules, and Alertmanager routes firing alerts to security@example.com. The rule set is deliberately conservative: every firing alert represents something an operator should look at within an hour.

An INFO severity tier is reserved in Prometheus convention but no INFO-severity alerts are wired today — capacity-trend / worker-lag / TVL-swing alerts are in development; the platform pages on critical+warning only for now.

Representative alerts (selected from the live prometheus_alerts.yml):

Alert	Severity	Fires when
`SupplyLedgerStale`	CRITICAL	EventStore ledger freshness > 30 minutes (no new events) for 5m — supply audit may be unreliable
`ServiceDown`	CRITICAL	`up{tier="application"} == 0` for 2m — registry / TEG / event-store unreachable
`PostgreSQLDown`	CRITICAL	postgres-exporter `up == 0` for 2m
`EventStoreStaleLedger`	CRITICAL	EventStore-side staleness signal (separate from ledger freshness)
`FrameBRegistryDown` / `FrameBEventStoreDown` / `FrameBLedgerStale` / `FrameBHighErrorRate`	CRITICAL/WARNING	per-additional-frame equivalents — registry/eventstore down, ledger stale, or 5xx rate elevated (multi-frame deployments)
`HighErrorRate`	WARNING	5xx rate > 5% over 5 minutes
`HighLatencyP95`	WARNING	p95 route latency > 2s for 5 minutes
`FederationPeerDown`	WARNING	`up{job="peer-registry"} == 0` for 5m — cross-frame sync compromised
`DBConnectionsHigh` / `DBConnectionsCritical`	WARNING	Postgres connection pool utilization elevated
`HostHighCPU` / `HostHighMemory` / `HostDiskSpaceLow` / `DiskSpaceCritical`	WARNING	Infrastructure pressure
`ContainerRestarting` / `ContainerMemoryHigh`	WARNING	cAdvisor container instability
`EventStoreDeadlockBurst`	WARNING	DB deadlock spike
`PostgreSQLHighConnections` / `PostgreSQLLowCacheHitRatio`	WARNING	Postgres performance degradation

The supply-side critical alerts are the only ones that mean "stop everything and investigate now." Everything else is "look in the next hour."

PromQL Cheatsheet

A few queries worth bookmarking (verified against live /metrics endpoints):

# EventStore freshness — proxy for supply audit health
eventstore_ledger_freshness_seconds{job="event-store"}

# EventStore total events recorded
eventstore_events_total

# p95 latency per service (matches the HighLatencyP95 alert)
histogram_quantile(0.95, sum by (job, le) (rate(http_request_duration_seconds_bucket[5m])))

# 5xx rate per service
sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
  / sum by (job) (rate(http_requests_total[5m]))

# Service liveness
up{tier="application"}

# IRONHAND mTLS enrollment count
ironhand_enrolled_agents

# OPA shadow-mode decisions (when sandbox/prod has OPA_ENABLED=true)
sum by (policy, decision, source) (rate(opa_decisions_total[5m]))

Real metric prefixes: eventstore_* (EventStore service), opa_* (OPA shadow-mode, see chapter 08), ironhand_* (agent mTLS), http_* (HTTP middleware on every service), container_* (cAdvisor), pg_* (postgres-exporter), up (Prometheus liveness). A unified protocol_* app-metric namespace (per-transfer counters, staking-TVL gauge, federation-peer status) is in development — the scaffolding is there in prometheus_client Counter/Gauge/Histogram registrations, just not yet rolled out across the registry's transfer/stake/governance paths.

Admin API for Queries

If you need to query metrics or dashboards from code (a CI health check, a Claude tool call, an external monitor), use the admin MCP tools:

theprotocol_adminPromQuery(query="up", start=None, end=None, step=None)
theprotocol_adminGrafanaQuery(path="/api/dashboards/uid/system-overview", method="GET")

Both forward to the underlying services via httpx (tools_admin.py:313, 389). For HTTP-direct access without the MCP wrapper, the admin MCP theprotocol_adminRequest proxy can hit any registry or upstream-Prometheus/Grafana endpoint. There's no dedicated /api/v1/admin/prom/query registry endpoint today — query Prometheus directly at https://prometheus.example.com/api/v1/query (admin auth on the nginx layer) or use the MCP tool. See chapter 12.
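For the direct route, a sketch of building an instant-query URL against the standard Prometheus HTTP API and reading a vector result might look like this. The helper names, the sample response body, and the job names in it are illustrative assumptions; the admin auth that the nginx layer requires is not shown, and no request is actually sent.

```python
import json
from urllib.parse import urlencode


def instant_query_url(base_url: str, promql: str) -> str:
    """Build a Prometheus instant-query URL with the PromQL safely encoded."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def vector_values(response_body: str) -> dict[str, float]:
    """Extract {job: value} from an instant-vector response body."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise RuntimeError(payload.get("error", "query failed"))
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected a vector, got {data['resultType']}")
    # Each sample carries its value as [unix_timestamp, "value-as-string"].
    return {
        sample["metric"].get("job", ""): float(sample["value"][1])
        for sample in data["result"]
    }


# Hypothetical response for the query up{tier="application"}.
SAMPLE_BODY = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "up", "job": "registry"}, "value": [1700000000, "1"]},
            {"metric": {"__name__": "up", "job": "teg"}, "value": [1700000000, "0"]},
        ],
    },
})

if __name__ == "__main__":
    print(instant_query_url("https://prometheus.example.com", 'up{tier="application"}'))
    print(vector_values(SAMPLE_BODY))
```

A CI health check could fetch the URL with any HTTP client that supplies the admin credentials, then fail the build if any value in the result is 0.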

Security Monitoring (Lightweight)

Separate from operational telemetry, the platform runs:

CrowdSec — community threat intel + firewall bouncer. cscli alerts list shows current decisions.
fail2ban — SSH brute-force protection on the host.
Lynis — periodic security audit of the host; check your own score and harden from its findings.

These feed the same Alertmanager email channel when a security event fires.

What's Next

🔗 07 — The Event Store & Supply Audit — where the supply invariant comes from
🔗 08 — Security Architecture — how SPIRE identity feeds service-to-service observability
🔗 19 — Compliance & Governance — monitoring posture for auditors (ISO 42001)
