Security & Identity Fabric
Zero-trust by construction. No shared secrets between services. Every container proves its identity cryptographically on every connection. Authorization is centralized in a small set of well-defined boundaries — not scattered across endpoints.
Why It Matters
A platform that moves real money cannot rely on "we trust our own microservices." The Protocol is built on the opposite assumption: every service has to prove who it is on every request, and every authorization decision is explicit, auditable, and confined to a small set of well-known boundaries. That's zero-trust, and it's what makes the platform safe to self-host.
Three components form the core, with a fourth available as an option:
- SPIRE — issues short-lived cryptographic workload identities (SVIDs)
- SPIFFE — the open standard for those identities (spiffe://<trust_domain>/...)
- mTLS via nginx sidecars — encrypts service-to-service traffic and terminates identity at the nginx layer
- OPA (optional) — a policy extension operators can opt into for declarative, hot-reloadable Rego rules layered on top of the Python authorization checks
This chapter explains how they fit together.
The Trust Domain
Every sovereign frame is its own trust domain — for example, one frame's trust domain might be example.com (with its public API at registry.example.com) and a peer frame's peer.example.com. Within a trust domain, every service has a SPIFFE ID and a matching X.509 SVID with a 4-hour TTL; the SPIRE agent refreshes ahead of expiry so workloads see fresh certs every ~2 hours.
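As a minimal sketch of what "belongs to this trust domain" means in practice, the snippet below parses a SPIFFE ID and compares its trust domain. The function names and the workload path `/registry-a` are illustrative assumptions; the source only shows the `spiffe://<trust_domain>/...` shape.

```python
from urllib.parse import urlsplit


def parse_spiffe_id(spiffe_id: str) -> tuple[str, str]:
    """Split a SPIFFE ID into (trust_domain, workload_path)."""
    parts = urlsplit(spiffe_id)
    if parts.scheme != "spiffe" or not parts.netloc:
        raise ValueError(f"not a SPIFFE ID: {spiffe_id!r}")
    if parts.query or parts.fragment:
        raise ValueError("SPIFFE IDs carry no query or fragment")
    return parts.netloc, parts.path


def same_trust_domain(spiffe_id: str, trust_domain: str) -> bool:
    """True when the ID was minted inside the given trust domain."""
    domain, _ = parse_spiffe_id(spiffe_id)
    return domain == trust_domain


if __name__ == "__main__":
    # Hypothetical ID: the path segment is an assumption for illustration.
    print(parse_spiffe_id("spiffe://example.com/registry-a"))
    print(same_trust_domain("spiffe://peer.example.com/teg-layer", "example.com"))
```

An ID from a peer frame fails the comparison, which is why cross-frame calls need the peer's bundle rather than the local one.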
The diagram below shows the mTLS fabric: the sovereign frames, every service tier, every nginx sidecar that terminates mTLS, and the cross-frame edges (SPIFFE bundle federation + stream-SNI hairpin proxy for cloud-op-to-cloud-op traffic).
Key properties:
- The SPIRE server is the anchor of trust inside a domain. Its signing key is the root.
- A SPIRE agent runs on every host. It attests workloads via the Docker socket (checks the container ID, image hash, labels) before issuing an SVID.
- Every internal service hop terminates at an nginx sidecar. Apps speak plain HTTP on localhost; the sidecar presents the service's SVID outbound and validates client SVIDs inbound. Two dedicated sidecars per frame (nginx-event-store, nginx-teg) terminate the Registry → EventStore and Registry → TEG hops end-to-end.
- No service has a pre-shared secret. Authentication is structural — you're running a container SPIRE trusts, or you aren't.
- Cross-frame federation goes through SPIFFE bundle exchange, not shared keys. Each frame fetches the other's bundle on a 5-minute auto-refresh cycle.
- Cloud operators hairpin through the host nginx stream-SNI proxy at :8443. Each operator's nginx-federation sidecar carries SVID DNS SANs matching its public hostname; the stream block routes by $ssl_preread_server_name to the right upstream port.
TIP
For reviewers: the fact that nothing on the inter-service plane holds a pre-shared service-to-service password is the strongest single security claim the platform makes. Inspecting env vars and compose files confirms it. Federation auth has a layered model — mTLS first, with shared-secret fallback paths for legacy operators that haven't migrated yet — and the TEG admin API uses an admin token. The internal mesh is the part that's keyless.
SVID Issuance
When a container starts and hits the registry or TEG code, it needs an identity. Here's how it gets one:
- Selectors (docker:label:service=registry-a) are pre-registered on the SPIRE server. They're the rule that says "a container with THIS label gets THIS SPIFFE ID."
- Auto-rotation before expiry — the X.509 SVID has a 4-hour TTL (default_x509_svid_ttl = "4h"); the SPIRE agent re-requests well before expiry, and the cert-writer sidecar in the container watches for new pairs and reloads nginx without dropping connections. JWT SVIDs (used for short-lived authentication tokens) carry a 5-minute TTL by contrast. Agent-side IRONHAND SVIDs use a tighter 30-minute TTL for faster revocation (see below).
- Revocation is passive. To revoke, remove the workload entry on the SPIRE server; the next rotation fails, and once the current SVID expires the container's connections fail closed.
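The selector rule and the rotation timing above can be sketched in a few lines. This is an illustration, not SPIRE's code: `RegistrationEntry`, `match_entry`, and `refresh_after` are hypothetical names, and the half-TTL refresh fraction is an assumption inferred from the "fresh certs every ~2 hours" figure for a 4-hour TTL.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RegistrationEntry:
    spiffe_id: str
    selectors: frozenset  # e.g. {"docker:label:service=registry-a"}


def match_entry(entries, workload_selectors):
    """Return the SPIFFE ID of the first entry whose selectors are all
    present on the attested workload, or None when nothing matches."""
    observed = set(workload_selectors)
    for entry in entries:
        if entry.selectors <= observed:
            return entry.spiffe_id
    return None


def refresh_after(ttl: timedelta, fraction: float = 0.5) -> timedelta:
    """How long after issuance a refresh would start (assumed half-TTL)."""
    return ttl * fraction


if __name__ == "__main__":
    entries = [
        RegistrationEntry(
            "spiffe://example.com/registry-a",  # hypothetical ID
            frozenset({"docker:label:service=registry-a"}),
        )
    ]
    print(match_entry(entries, ["docker:label:service=registry-a"]))
    # Deleting the entry is the whole revocation step: nothing matches.
    print(match_entry([], ["docker:label:service=registry-a"]))
    print(refresh_after(timedelta(hours=4)))
```

Passive revocation falls out of the lookup: with the entry gone, the next request for an SVID has nothing to match against.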
Service-to-Service mTLS
Every internal call is mutually-authenticated TLS. nginx sidecars in front of each service terminate mTLS and forward the request to the app container over localhost. The Python app stays plain HTTP; the sidecar pattern means a one-line env-var change (EVENT_STORE_URL: http://... → https://nginx-...:8443) flips a hop to mTLS without touching application code.
The sidecar anatomy — Registry → TEG hop
This is the highest-volume internal hop on either mainframe. The diagram shows every cert + container in play.
This shape generalizes to every internal mTLS hop on both frames:
| Caller | Sidecar | Callee | Notes |
|---|---|---|---|
| registry-a / teg-layer | nginx-event-store | event-store | mTLS-terminated |
| registry-a | nginx-teg-a | teg-layer | mTLS-terminated |
| frame-b-registry / frame-b-teg | frame-b-nginx-event-store | frame-b-event-store | mTLS-terminated |
| frame-b-registry | frame-b-nginx-teg | frame-b-teg | mTLS-terminated |
| Cloud-op registry | (sibling container DNS) | its own teg-layer | plain HTTP — deferred sidecar (see note below) |
| Cross-frame peer | nginx-federation (each frame) | peer's /api/v1/federation/* | SPIFFE bundle federation |
TIP
On the one plain-HTTP hop. A cloud operator's Registry call to its own TEG (http://teg-layer:8080) is currently the only platform service hop that doesn't terminate at an mTLS sidecar. This is not an architectural decision against mTLS on that path — it's a priority call. As long as the operator's full stack runs on a single host, the Registry and the TEG are sibling containers on the same Docker bridge: no network boundary to cross, no eavesdropper to defeat by adding TLS termination 5cm away.
That changes the moment an operator's services start leaving a single host. The roadmap target is: at the scale where one high-tier box no longer covers the operator's traffic (rough order of magnitude: tens of thousands of agents interacting through a single registry) the Registry and TEG split onto separate hosts and the sidecar becomes load-bearing. The implementation is mechanically the same as the mainframe Registry → TEG sidecar work — a fresh nginx-teg sidecar in the operator compose template, a cert-writer-teg next to it, one SPIRE entry per operator, and a one-line TEG_API_BASE_URL flip in .env.operator. An afternoon's work whenever we decide the scale justifies it. Until then it's deferred, not declined.
Sequence — what happens on every internal call
What an attacker sees on the wire: TLS 1.3 AES-256-GCM. What an attacker needs to impersonate a service: a valid SVID signed by the SPIRE CA. What the SPIRE CA requires before issuing one: a matching selector on a workload that passes Docker attestation.
The practical effect: an attacker who lands on a host but doesn't own the SPIRE server cannot talk to any other service. Zero-trust in the sense that mattered when the term was invented.
The Python side — one helper, one kwarg
The ssl-context helper (es_ssl_helper.py on the Registry side, ssl_helper.py on the TEG side) implements a single heuristic — dot in target hostname = public CA; no dot = SPIRE bundle — and is reused across every httpx.AsyncClient that dials another internal service:
    import httpx

    from .es_ssl_helper import get_httpx_verify

    # Shared client for Registry → TEG calls; the verify kwarg is the only mTLS-specific part.
    _teg_http_client = httpx.AsyncClient(
        limits=httpx.Limits(max_connections=200, max_keepalive_connections=50),
        timeout=httpx.Timeout(10.0, connect=5.0),
        verify=get_httpx_verify(),  # ← SSL context: SPIRE bundle + own SVID
    )

A codebase-wide refactor added that one kwarg to every httpx.AsyncClient(...) in the registry codebase that dials TEG or EventStore. The helper auto-rotates the SSL context when cert-writer rewrites the cert files (mtime-based cache invalidation). Flipping a transport from plain HTTP to mTLS is now a single environment variable change.
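The two behaviours described for the helper, the dot heuristic and mtime-based cache invalidation, could look roughly like the sketch below. It is not the contents of es_ssl_helper.py: `uses_public_ca`, `MtimeCachedContext`, and `build_mtls_context` are hypothetical names, and the three file paths are whatever cert-writer produces in a given deployment.

```python
import os
import ssl


def uses_public_ca(hostname: str) -> bool:
    """The heuristic from the text: a dot in the hostname means public CA."""
    return "." in hostname


class MtimeCachedContext:
    """Rebuild a cached value whenever any watched file's mtime changes."""

    def __init__(self, paths, build):
        self._paths = tuple(paths)
        self._build = build  # called as build(*paths)
        self._stamp = None
        self._value = None

    def get(self):
        stamp = tuple(os.stat(p).st_mtime_ns for p in self._paths)
        if stamp != self._stamp:
            # Files were rewritten (or this is the first call): rebuild.
            self._value = self._build(*self._paths)
            self._stamp = stamp
        return self._value


def build_mtls_context(bundle_path, cert_path, key_path) -> ssl.SSLContext:
    """Trust only the given bundle and present our own cert/key pair."""
    ctx = ssl.create_default_context(cafile=bundle_path)
    ctx.load_cert_chain(certfile=cert_path, keyfile=key_path)
    return ctx
```

An `ssl.SSLContext` built this way can be passed as the `verify` argument of an httpx client, so a cache like this would let long-lived clients pick up rotated certificates on their next lookup.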
Agent mTLS (IRONHAND, f052)
The same SPIRE infrastructure can issue SVIDs to agents, not just platform services. This is IRONHAND: any agent that opts in via POST /api/v1/agent/enable-mtls and ships its container with the right Docker label + SPIRE socket mount + ENABLE_MTLS=true env gets a SPIFFE SVID that authenticates its A2A calls to peers.
IRONHAND is opt-in by design. It ships with six demo containers (Alpha–Foxtrot) as reference implementations; beyond that demo set, each agent opts in through the enable-mtls endpoint above. Most service and sovereign agents currently authenticate via JWT + payment-token alone and would migrate to mTLS only when the value of stronger peer authentication outweighs the operational cost. The infrastructure is ready; adoption is a per-agent choice.
Revocation has three layers (all visible in the diagram above):
- EventStore WebSocket (< 500ms) — immediate notification, all subscribed peers drop the suspended agent's connections within the round-trip
- Polling (60s fallback) — for agents not subscribed to the WS broadcast
- SVID expiry (30 min) — ultimate failsafe; the suspended agent's SPIRE entry is gone, so next rotation fails closed
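A peer-side view of the first two layers might look like the sketch below. `SuspensionCache` and its method names are hypothetical, not part of the SDK. The third layer, SVID expiry, needs no code on the peer because the TLS handshake itself starts failing.

```python
import time


class SuspensionCache:
    """Tracks suspended agents from a push channel plus a polling fallback."""

    def __init__(self, poll_interval: float = 60.0, clock=time.monotonic):
        self._suspended = set()
        self._poll_interval = poll_interval
        self._clock = clock
        self._last_poll = None

    def on_push(self, agent_did: str) -> None:
        """Layer 1: a WebSocket broadcast names a suspended agent."""
        self._suspended.add(agent_did)

    def poll_due(self) -> bool:
        """Layer 2: True when the fallback poll should run again."""
        if self._last_poll is None:
            return True
        return self._clock() - self._last_poll >= self._poll_interval

    def on_poll(self, suspended_dids) -> None:
        """Replace local state with the authoritative polled list."""
        self._suspended = set(suspended_dids)
        self._last_poll = self._clock()

    def is_suspended(self, agent_did: str) -> bool:
        return agent_did in self._suspended
```

A caller would check `is_suspended()` before accepting an A2A request and run the poll whenever `poll_due()` returns True.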
The SDK helper A2AAuthenticator auto-injects the SVID into outbound httpx clients when ENABLE_MTLS=true and SPIFFE_ENDPOINT_SOCKET are set in the container env. On the receiving side, create_a2a_router() auto-injects the PaymentVerifier middleware when REGISTRY_URL is set. Both pieces ship in theprotocol-sdk — see Chapter 14 — SDK for the runnable code path.
Authorization — Python-First, Auditable
Authentication gets you in the door. Authorization — what you can do once in — happens at a small set of FastAPI dependency factories in security.py: get_current_developer, get_current_agent, and require_admin_flag(<flag>). Every endpoint declares which one it requires; the dependency layer resolves it; the request either flows or returns a structured 403. Terse, fast, and the unambiguous source of truth for what the platform allows.
The five admin sub-flags scope which class of admin operation a developer can invoke:
- admin_treasury — fund grants, treasury transfers, fee config
- admin_support — support tickets, agent assistance
- admin_enforcement — suspend/reinstate, dispute slashing, blocklists
- admin_federation — peer onboarding, license issuance, drift response
- admin_platform — operational health, network params, emission policies
Legacy is_admin=True is the super-flag (passes every check) for backward compat. Sub-admins get only the flags explicitly granted, and the must_change_password gate runs first so freshly-bootstrapped operator admins must rotate before doing anything sensitive.
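The ordering of those checks can be sketched framework-free. This is an assumption-laden illustration rather than the code in security.py: the `Developer` fields, the `Forbidden` exception, and the reason strings are hypothetical, and a real FastAPI dependency would raise an HTTP 403 instead.

```python
from dataclasses import dataclass

ADMIN_FLAGS = frozenset({
    "admin_treasury", "admin_support", "admin_enforcement",
    "admin_federation", "admin_platform",
})


@dataclass
class Developer:
    developer_id: int
    is_admin: bool = False
    must_change_password: bool = False
    flags: frozenset = frozenset()


class Forbidden(Exception):
    """Stand-in for a structured 403 response."""

    def __init__(self, reason: str):
        super().__init__(reason)
        self.reason = reason


def require_admin_flag(flag: str):
    """Build a checker for one admin sub-flag."""
    if flag not in ADMIN_FLAGS:
        raise ValueError(f"unknown admin flag: {flag}")

    def check(developer: Developer) -> Developer:
        # The password-rotation gate runs before any flag check.
        if developer.must_change_password:
            raise Forbidden("must_change_password")
        # Legacy is_admin passes every check; sub-admins need the flag.
        if developer.is_admin or flag in developer.flags:
            return developer
        raise Forbidden(f"missing_{flag}")

    return check
```

Note that the rotation gate applies even to a legacy super-admin in this sketch, which is one reading of "runs first".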
OPA — Optional Declarative Policy
For operators who want a more declarative, hot-reloadable policy layer on top of the Python authorization checks, the platform ships an opt-in integration with OPA (Open Policy Agent). OPA is not the default; it is a tool you can reach for when your operational story benefits from one.
What the OPA extension gives you when you turn it on:
- One place to read every cross-cutting rule. "What governs cross-registry transfers?" becomes a single .rego file rather than a grep across forty handlers.
- Hot reload without redeploys. Tighten a reputation floor, add a new admin flag, change a per-tier staking cap — push the new policy bytes to OPA's REST API and the next request honors it.
- Structured deny reasons. Instead of a generic 403, callers can receive { "allowed": false, "reason": "insufficient_reputation", "required": 0.1, "actual": 0.05 } — which turns an opaque rejection into an actionable developer error.
- Multi-frame policy reuse. A single Rego bundle can be served to every operator and every sovereign frame, with operator-specific overrides where needed.
Two flags, defaults off
The integration is gated behind two stacked environment variables on the registry:
    OPA_ENABLED=false   # default — OPA is bypassed entirely; Python is the only path
    OPA_ENFORCE=false   # default — only honored when OPA_ENABLED=true

In a reference deployment, OPA can run in shadow mode — OPA_ENABLED=true, OPA_ENFORCE=false — where every authz check runs the Python decision and queries OPA in parallel. The Python decision is what the request actually sees; the OPA decision is logged and surfaced as Prometheus counters (opa_decisions_total, opa_decision_latency_seconds). Shadow mode is the validation phase before any enforce flip — it lets an operator watch the Rego policies match real traffic without changing a single user-visible authz outcome. Rollout across additional frames and cloud operators follows once shadow mode produces clean parity data (a compose file must ship an OPA sidecar before its registry can enable the integration).
When both flags are true, OPA can tighten the Python decision: the result becomes python_allow AND opa_allow. OPA can deny what Python would have allowed; OPA cannot grant what Python denied. Python remains defense-in-depth.
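The three modes reduce to a small decision function. The sketch below is illustrative: `final_decision`, the `query_opa` callable, and the log line format are hypothetical, and it queries OPA sequentially rather than in parallel to keep the example short.

```python
import os


def _flag(name: str, env) -> bool:
    """Read a boolean flag; anything other than 'true' counts as off."""
    return env.get(name, "false").strip().lower() == "true"


def final_decision(python_allow: bool, query_opa, env=os.environ, log=print) -> bool:
    """Combine the Python decision with OPA according to the two flags."""
    if not _flag("OPA_ENABLED", env):
        # Default: OPA is bypassed entirely and never queried.
        return python_allow
    opa_allow = bool(query_opa())
    if not _flag("OPA_ENFORCE", env):
        # Shadow mode: record the OPA answer, return the Python one.
        log(f"opa_shadow python={python_allow} opa={opa_allow}")
        return python_allow
    # Enforce mode: OPA can only tighten, never grant.
    return python_allow and opa_allow
```

Because the flags are read from `env` on each call, the sketch mirrors the request-time lookup described in the text.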
The path to enabling enforce-mode in production: align the MCP auth_bridge.py code with the agent + developer Rego policies (the .rego files are deliberately stricter today, so flipping enforce without alignment would unfairly deny suspended-agent MCP calls). That alignment work is an explicit prerequisite. Both flags are read at request time, so flipping OPA_ENABLED=false and restarting the registry is a sub-30-second rollback to pure Python.
Typical policy shapes
- Staking rules — minimum stake, maximum lock period, per-tier cooldowns
- Dispute rules — reputation floors for filing, bond sizing, anti-spam rate limits
- Cross-registry rules — license valid + drift acceptable → cross-registry ops allowed
- Admin sub-roles — Rego decides which combination of admin_* flags + 2FA + must_change_password gates each operation, with structured deny reasons
INFO
OPA is one option among several for declarative policy. Operators are welcome to keep the platform pure-Python (the default) if their policy story doesn't justify the extra component. Both shapes are first-class.
Defense in Depth
Beyond SPIRE + mTLS + (optional OPA), the platform layers:
- Input validation — every endpoint passes through a Pydantic schema. No ad-hoc parsing. OWASP-class checks on anything user-controlled.
- Rate limits — per-developer, per-IP, per-endpoint. Enforced at the nginx edge before the app.
- CrowdSec — community threat-intel bouncer at the firewall layer.
- fail2ban — SSH brute-force protection.
- Secret hygiene — every secret in .env.production, never committed, not in container env dumps.
- Signed container images — future work; pinning by digest today.
INFO
No one of these is the "real" defense. They're layered so that bypassing any single one still leaves the others between an attacker and anything that matters. That's the meaning of defense in depth — not "we used lots of tools."
Admin Actions Leave a Trail
Every admin operation emits an auditable event. Revoking an agent's credentials, changing an emission policy, issuing a treasury grant, approving a federation peer — all logged with the admin's DID, the timestamp, the before/after state where applicable. The Event Store is, by design, the same ledger that records admin actions and financial flows.
You can query admin history through several real paths:
- Per-developer activity — GET /api/v1/admin/developers/{developer_id}/activity returns recent EventStore events concerning that developer (suspensions, agent grants, federation-license issuance, etc.). Requires admin_support.
- Agent detail — GET /api/v1/admin/agents/{agent_did} returns the agent's current lifecycle state — enforcement_status (enum), offense_count, last_offense_at, open-dispute counts — plus the underlying agent card and developer link. Requires admin_support. For full chronological lifecycle (suspended_at, reinstated_at, slash events with timestamps), query the EventStore aggregate via the raw stream below — that's the canonical audit trail.
- Raw event stream — GET /api/v1/events/aggregate/{aggregate_id} on the EventStore returns every event ever emitted for that aggregate, in order. The Mission Control activity feed is built on top of this.
All admin mutations emit named events (e.g. EmissionPolicyChanged, DeveloperSuspended, AgentBalanceCorrected) tagged with the admin's developer_id and a timestamp; the EventStore is the canonical audit ledger.
MCP Tool-Call Audit
Separately from the EventStore (which records state changes), every MCP tool invocation against /mcp or /mcp/admin lands as a row in security_audit_logs (Postgres) when MCP_AUDIT_LOGGING_ENABLED=true on a registry. Schema: actor (developer / admin / agent / anonymous) + actor_ip + target tool name + outcome + latency + sanitized arguments + sanitized result summary + OPA shadow decision (when present) + correlation_id linking to the originating HTTP request.
Sensitive fields (jwt / secret / password / token / api_key / private_key / bearer / authorization) are redacted by a deny-list sanitizer before write. Strings are capped at 500 chars; the details JSON is capped at 4 KB. The writer is fire-and-forget — MCP latency is unaffected.
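A deny-list sanitizer with those caps could be sketched as follows. The details are assumptions: the source does not say whether keys match by substring, what the redaction placeholder is, whether "4 KB" means 4096 bytes, or what is written when the JSON exceeds the cap.

```python
import json

SENSITIVE_MARKERS = (
    "jwt", "secret", "password", "token",
    "api_key", "private_key", "bearer", "authorization",
)
MAX_STRING_CHARS = 500
MAX_DETAILS_BYTES = 4096  # assumed reading of "4 KB"


def sanitize(value, key: str = ""):
    """Redact values under sensitive keys and cap string length."""
    if any(marker in key.lower() for marker in SENSITIVE_MARKERS):
        return "[REDACTED]"  # hypothetical placeholder
    if isinstance(value, dict):
        return {str(k): sanitize(v, str(k)) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [sanitize(item) for item in value]
    if isinstance(value, str):
        return value[:MAX_STRING_CHARS]
    return value


def details_json(details: dict) -> str:
    """Serialize sanitized details, replacing oversized payloads."""
    encoded = json.dumps(sanitize(details), default=str)
    if len(encoded.encode("utf-8")) > MAX_DETAILS_BYTES:
        # Assumed behaviour: store a marker rather than a partial document.
        return json.dumps({"truncated": True})
    return encoded
```

Matching on key names keeps the sanitizer cheap, at the cost of missing secrets that arrive under an innocuous key, which is the usual trade-off of a deny list.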
Reads are developer-scoped:
- GET /api/v1/me/mcp-audit-log — returns only the calling developer's own rows
- GET /api/v1/admin/mcp-audit-log?actor_id=N — admin (admin_support) variant. Calls for someone other than self insert an admin_read_mcp_audit row into the target developer's log, so admin reads are themselves audited and visible to the developer being read.
The MCP per-tool audit lives alongside the existing OAuth/token lifecycle audit (token_exchange_*, token_revocation, rate_limit_exceeded, jti_reuse_attempt, unauthorized_access, suspicious_activity) — both backed by the same security_audit_logs table. State-changing MCP calls also hit the EventStore as usual; the MCP-tool audit row is the additional layer that adds the tool-name dimension to the audit chain.
Cross-Developer Admin Access — Audit Now, Ticket-Gate Next
Two controls cover the case of an admin reading another developer's data:
- Audit-first, today. Any admin call to /admin/mcp-audit-log?actor_id=N (or any other admin endpoint that touches another developer's records) writes an admin_read_mcp_audit row into the target developer's own audit log when actor_id ≠ the calling admin's id. The target sees the read at /me/mcp-audit-log — admin investigations are observable to the subject by construction. This is shipped fleet-wide.
- Support-ticket hard-gate, next. The next pass introduces an additional gate that requires an approved support ticket from the developer before an admin can open their drawer at all — the audit row becomes a confirmation, not the only control. The gate has one exception: an active security incident involving that specific account, which the admin can declare against the developer's record. Outside that exception, the path is: developer files a ticket → developer (or their delegate) approves admin access → admin opens drawer within the ticket's time-bound window → window closes, gate re-engages.
The forward-looking rationale: operators across jurisdictions face very different rules about ticketed-and-audited cross-account access. Several enterprise frameworks (and several national data-protection regimes) require an explicit, time-bounded, developer-approved authorization before any cross-account read. Building the gate now means an operator running TheProtocol in a jurisdiction that mandates this control can flip it on per-operator via env flag — no custom compliance code per region, no fork of the admin surface, no operator-specific build pipeline. The audit row that already fires becomes the trail; the ticket gate becomes the up-front control. Together they form the cross-account-access standard that most serious enterprise procurement asks about on the second call.
The mechanism is a single support_ticket_grants table keyed by (target_developer_id, admin_developer_id, expires_at), a guard at the admin-drawer mount point, and a banner-with-deny when no row matches. Roughly an afternoon to ship and another afternoon to wire into the operator-provisioner's .env.operator template as a tristate flag (disabled / audit_only / gate_required).
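Since the gate is planned rather than shipped, the following is only a sketch of how the guard might evaluate a request. `Grant` and `admin_access_allowed` are hypothetical names, and treating both disabled and audit_only as non-blocking is an assumption about the tristate.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Grant:
    """One row of the planned support_ticket_grants table."""
    target_developer_id: int
    admin_developer_id: int
    expires_at: datetime


def admin_access_allowed(mode, grants, target_id, admin_id, now,
                         incident_declared=False) -> bool:
    """Decide whether an admin may open another developer's drawer."""
    if mode in ("disabled", "audit_only"):
        return True  # no up-front gate in these modes (assumed)
    if mode != "gate_required":
        raise ValueError(f"unknown mode: {mode}")
    if target_id == admin_id or incident_declared:
        return True  # own records, or the security-incident exception
    return any(
        g.target_developer_id == target_id
        and g.admin_developer_id == admin_id
        and g.expires_at > now
        for g in grants
    )


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    grants = [Grant(7, 1, now + timedelta(hours=1))]
    print(admin_access_allowed("gate_required", grants, 7, 1, now))
    print(admin_access_allowed("gate_required", grants, 8, 1, now))
```

Once `expires_at` passes, the same call returns False again, which is the "window closes, gate re-engages" step.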
What's Next
- 🔗 01 — Agents & Identity — DIDs and SPIFFE IDs in context
- 🔗 05 — Federation & Cross-Registry — how SPIFFE bundle exchange federates trust
- 🔗 07 — The Event Store & Supply Audit — where admin actions and supply events live
- 🔗 17 — Operators & Self-Hosting — SPIRE and mTLS for operator-run registries
- 🔗 19 — Compliance & Governance — auditor workflow and control mapping
Canonical Sources
- identity-fabric/ (SPIRE + OPA configuration)
- services/agent_spire.py · services/coordinated_suspension.py (IRONHAND enrollment + revocation)
- es_ssl_helper.py · security.py (mTLS ssl-context helper + authorization dependencies)