Operator Guide: Troubleshooting

Audience: Federated registry operators (self-hosted)
Last Updated: 2026-06-04

Common Issues

#	Symptom	Cause	Fix
1	SPIRE agent won't start	x509pop attestation certs missing or expired	Verify your `spire/agent-attestation.crt` and `spire/agent-attestation.key` files are in place. Check `docker logs ${REGISTRY_NAME}-spire-agent` for attestation errors. Unlike join tokens, x509pop certs survive restarts -- if they were working before, check for file permission or volume mount issues.
2	cert-writer stuck in restart loop	SPIRE agent unhealthy or not ready	Check `docker logs ${REGISTRY_NAME}-spire-agent` first. Verify `agent.conf` has the correct `trust_domain` and `server_address`, and that the x509pop attestation plugin is configured (not `join_token`). Restart SPIRE agent, then cert-writer.
3	nginx returns 502 Bad Gateway	Registry container not ready yet	Wait 1-2 minutes for startup (database migrations run on first boot). Check `docker logs ${REGISTRY_NAME}-registry` for startup progress.
4	nginx returns 403 on federation endpoints	Client certificate validation failed	Check SVID expiry with `docker logs ${REGISTRY_NAME}-cert-writer`. Verify cert-writer is healthy and writing fresh certs to the shared volume.
5	EventStore writes failing (401/403)	mTLS certificate expired or SPIRE x509pop attestation revoked	Check `docker logs ${REGISTRY_NAME}-cert-writer` for renewal errors. Verify `EVENTSTORE_MTLS_REQUIRED=true` in your env. If certs are valid, contact the operator of your parent frame -- your SPIRE attestation may have been revoked.
6	Federation sync not discovering agents	Peer URL misconfigured or mTLS handshake failing	Verify `FEDERATION_BASE_URL` in your `.env.operator`. Check `docker logs ${REGISTRY_NAME}-nginx-federation` for TLS errors. Confirm the mainframe peer URL is reachable.
7	"Connection pool exhausted"	PgBouncer or PostgreSQL at max connections	Increase `max_client_conn` in `pgbouncer.ini` or `max_connections` in PostgreSQL config. Check for connection leaks with `docker exec ${REGISTRY_NAME}-db psql -U <db-user> -d <db-name> -c "SELECT count(*) FROM pg_stat_activity;"`.
8	Registry returns 500 on startup	Database migration failed	Check `docker logs ${REGISTRY_NAME}-registry` and look for `alembic` migration errors. If the database is fresh, migrations run automatically. If upgrading, ensure the database volume was preserved.
9	Agent creation returns 409 Conflict	DID collision (extremely rare)	Retry the request. The system generates a new random DID on each attempt. If persistent, check for duplicate agent names in your registry.
10	Staking unstake returns 403	7-day cooldown period is still active	This is expected behavior. The unstake cooldown is enforced for economic stability. Check the `cooldown_expires_at` field in the stake record.
11	Cross-registry transfer returns 403	Target registry is not an active federation peer	Verify the target peer is listed and ACTIVE in your federation config. The target registry may have been suspended from the federation.
12	Transfer returns "insufficient balance"	Liquid balance too low (staked tokens not available)	Only liquid (unstaked) balance can be transferred. Use `GET /api/v1/teg/balance/{did}` to check liquid vs. staked balances.
13	Supply audit shows BREACH status	Event emission policy misconfiguration or double-counted events	Do not attempt to fix manually. Contact the operator of your parent frame immediately. The supply auditor runs every 60 seconds and will detect any discrepancy.
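
For issue 1, the presence and expiry of the attestation certificate can be checked directly on the host. The following is a minimal sketch, not part of the operator tooling: the function name `check_attestation_cert` and the 7-day default threshold are assumptions made for illustration, and it presumes `openssl` is installed on the host.

```shell
# Hypothetical helper: report whether a PEM certificate exists and stays
# valid for at least N more days (default 7, an assumed threshold).
# Exit codes: 0 = valid, 1 = expiring/expired/unparseable, 2 = missing.
check_attestation_cert() {
  local cert="$1" days="${2:-7}"
  if [ ! -r "$cert" ]; then
    echo "MISSING or unreadable: $cert"
    return 2
  fi
  # -checkend exits 0 only if the cert is still valid after the given seconds.
  if openssl x509 -checkend "$(( days * 86400 ))" -noout -in "$cert" >/dev/null 2>&1; then
    echo "OK: $cert is valid for at least $days more day(s)"
  else
    echo "PROBLEM: $cert expires within $days day(s), is expired, or could not be parsed"
    return 1
  fi
}
```

For example, `check_attestation_cert spire/agent-attestation.crt 7` run from the directory that holds the `spire/` folder. A "MISSING or unreadable" result points at the file permission or volume mount issues mentioned in the table.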

Docker Commands Reference

View Containers

bash

# List all containers and their status
docker compose -f docker-compose.operator.yml ps

# Detailed container info (names, status, ports)
docker ps --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}"

View Logs

bash

# Last 50 lines of registry logs
docker compose -f docker-compose.operator.yml logs --tail 50 registry

# Follow logs in real time
docker compose -f docker-compose.operator.yml logs -f registry

# View logs for a specific container
docker logs ${REGISTRY_NAME}-registry --tail 50

# View logs with timestamps
docker logs ${REGISTRY_NAME}-registry --tail 50 -t

# Filter for errors (JSON logs)
docker logs ${REGISTRY_NAME}-registry 2>&1 | grep '"level":"ERROR"' | tail -20
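
The exact-match `grep` above misses entries if the JSON encoder writes a space after the colon. A slightly more tolerant filter might look like the sketch below; the function name `filter_errors` is hypothetical, and the only assumption about the log format is the `"level":"ERROR"` field already shown above.

```shell
# Hypothetical helper: keep only ERROR-level JSON log lines from stdin,
# allowing optional whitespace around the colon, and print the last N
# matches (default 20).
filter_errors() {
  grep -E '"level"[[:space:]]*:[[:space:]]*"ERROR"' | tail -n "${1:-20}"
}

# Usage (assumes REGISTRY_NAME is set in your shell):
#   docker logs "${REGISTRY_NAME}-registry" 2>&1 | filter_errors 20
```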

Restart Services

bash

# Restart a single service
docker compose -f docker-compose.operator.yml restart registry

# Restart the full stack
docker compose -f docker-compose.operator.yml restart

# Stop and start (full reset)
docker compose -f docker-compose.operator.yml down
docker compose -f docker-compose.operator.yml up -d

Update and Rebuild

bash

# Pull latest images and restart
docker compose -f docker-compose.operator.yml pull
docker compose -f docker-compose.operator.yml up -d

# Force recreate containers (preserves volumes)
docker compose -f docker-compose.operator.yml up -d --force-recreate

# Pull and recreate in one step
docker compose -f docker-compose.operator.yml up -d --pull always

Enter a Container

bash

# Interactive shell in registry
docker exec -it ${REGISTRY_NAME}-registry bash

# Interactive shell in TEG layer
docker exec -it ${REGISTRY_NAME}-teg bash

# Run a one-off command
docker exec ${REGISTRY_NAME}-registry python -c "print('healthy')"

Database Operations

bash

# Connect to registry PostgreSQL
docker exec -it ${REGISTRY_NAME}-db psql -U <db-user> -d <db-name>

# Check active connections
docker exec ${REGISTRY_NAME}-db psql -U <db-user> -d <db-name> -c \
  "SELECT count(*) as active FROM pg_stat_activity WHERE state = 'active';"

# Check database size
docker exec ${REGISTRY_NAME}-db psql -U <db-user> -d <db-name> -c \
  "SELECT pg_size_pretty(pg_database_size(current_database()));"
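
When diagnosing "Connection pool exhausted", it can help to express the connection count as a share of the configured limit. The sketch below is illustrative only: `conn_usage_pct` is a hypothetical helper, and the commented usage assumes the same `<db-user>` and `<db-name>` placeholders as the commands above. Note that `pg_stat_activity` also lists some background processes, so the figure is approximate.

```shell
# Hypothetical helper: integer percentage of connection slots in use.
conn_usage_pct() {
  local used="$1" max="$2"
  if [ "$max" -le 0 ]; then
    echo "max must be a positive integer" >&2
    return 1
  fi
  echo $(( used * 100 / max ))
}

# Usage (replace the <...> placeholders first; -At prints the bare value):
#   used=$(docker exec ${REGISTRY_NAME}-db psql -U <db-user> -d <db-name> -At \
#     -c "SELECT count(*) FROM pg_stat_activity;")
#   max=$(docker exec ${REGISTRY_NAME}-db psql -U <db-user> -d <db-name> -At \
#     -c "SHOW max_connections;")
#   echo "$(conn_usage_pct "$used" "$max")% of max_connections in use"
```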

Network Diagnostics

bash

# Check if registry can reach EventStore
docker exec ${REGISTRY_NAME}-registry curl -s -o /dev/null -w "%{http_code}" \
  https://events.example.com/health

# Check if SPIRE agent is healthy
docker exec ${REGISTRY_NAME}-spire-agent /opt/spire/bin/spire-agent healthcheck

# Check cert-writer SVID status
docker logs ${REGISTRY_NAME}-cert-writer --tail 10
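
Because startup can take 1-2 minutes (issue 3), a one-shot check may fail simply because it ran too early. One possible approach is to wrap a check in a polling loop. The helper below is a sketch; the name `wait_until` and its argument order are assumptions, not part of the operator stack.

```shell
# Hypothetical helper: retry a command until it succeeds or the timeout
# (in seconds) elapses. Returns 0 on success, 1 on timeout.
wait_until() {
  local timeout="$1" interval="$2" waited=0
  shift 2
  while ! "$@" >/dev/null 2>&1; do
    if [ "$waited" -ge "$timeout" ]; then
      return 1
    fi
    sleep "$interval"
    waited=$(( waited + interval ))
  done
}

# Usage: poll the SPIRE agent health check every 5 s for up to 2 minutes.
#   wait_until 120 5 docker exec "${REGISTRY_NAME}-spire-agent" \
#     /opt/spire/bin/spire-agent healthcheck || echo "spire-agent still unhealthy"
```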

Startup Order

The operator stack starts in this order. If a service fails, check the service it depends on:

1. db, teg-db, redis                      (databases)
2. spire-agent                            (identity)
3. pgbouncer                              (connection pooling -- waits for db)
4. registry, teg                          (applications -- wait for DB migrations)
5. cert-writer                            (SVID fetching -- requires spire-agent)
6. nginx-federation                       (mTLS sidecar -- requires registry + certs)

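The dependency is not always the line directly above: cert-writer (step 5), for example, depends on spire-agent (step 2). Follow the dependency named in parentheses.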
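
The startup order above can be turned into a quick scan that reports the earliest service that is not running. This is a sketch under a stated assumption: every container is named `${REGISTRY_NAME}-<service>`. The guide confirms that pattern for `db`, `spire-agent`, `registry`, `teg`, `cert-writer`, and `nginx-federation`, but not for `teg-db`, `redis`, or `pgbouncer`. The function name `first_not_running` is hypothetical. A container in the `running` state is not necessarily healthy, so treat the result as a starting point.

```shell
# Service order taken from the startup list; adjust if your names differ.
STARTUP_ORDER="db teg-db redis spire-agent pgbouncer registry teg cert-writer nginx-federation"

# Hypothetical helper: reads "<container-name> <state>" lines on stdin and
# prints the first service, in startup order, whose container is missing
# or not in the "running" state. Returns 1 if such a service is found.
first_not_running() {
  local prefix="$1" states svc state
  states="$(cat)"
  for svc in $STARTUP_ORDER; do
    state="$(printf '%s\n' "$states" | awk -v n="${prefix}-${svc}" '$1 == n { print $2; exit }')"
    if [ "$state" != "running" ]; then
      printf '%s %s\n' "$svc" "${state:-missing}"
      return 1
    fi
  done
  echo "all running"
}

# Usage:
#   docker ps -a --format '{{.Names}} {{.State}}' | first_not_running "$REGISTRY_NAME"
```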

When to Escalate

If you federate with an upstream frame, escalate to that frame's operator when you encounter:

Supply audit BREACH status
SPIRE attestation revocation
Persistent mTLS certificate failures after SPIRE agent restart
Database corruption or unrecoverable migration failures
Any behavior that suggests unauthorized token minting or balance manipulation

Include in your report:

Your operator name and registry DID
Container logs for the affected service (last 100 lines)
Output of `docker compose -f docker-compose.operator.yml ps`
Timestamp of when the issue first appeared
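
The log and container-status items in this list can be gathered in one step. The sketch below is a convenience only: `collect_report` and the output file names are hypothetical. It records the collection time, which is not the same as when the issue first appeared, so you still need to note that yourself along with your operator name and registry DID.

```shell
# Hypothetical helper: collect the last 100 log lines of one service and
# the compose status into a directory for an escalation report.
# Arguments: <registry-name> <service> <output-dir>
collect_report() {
  local registry_name="$1" service="$2" out_dir="$3"
  mkdir -p "$out_dir"
  # Collection time in UTC (not the time the issue first appeared).
  date -u +"%Y-%m-%dT%H:%M:%SZ" > "$out_dir/collected-at.txt"
  docker logs "${registry_name}-${service}" --tail 100 > "$out_dir/${service}.log" 2>&1
  docker compose -f docker-compose.operator.yml ps > "$out_dir/compose-ps.txt" 2>&1
}

# Usage (run from the directory containing docker-compose.operator.yml):
#   collect_report "$REGISTRY_NAME" registry ./escalation-report
```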
