Operator Guide: Troubleshooting
Audience: Federated registry operators (self-hosted)
Last Updated: 2026-06-04
Common Issues
| # | Symptom | Cause | Fix |
|---|---|---|---|
| 1 | SPIRE agent won't start | x509pop attestation certs missing or expired | Verify your spire/agent-attestation.crt and spire/agent-attestation.key files are in place. Check docker logs ${REGISTRY_NAME}-spire-agent for attestation errors. Unlike join tokens, x509pop certs survive restarts -- if they were working before, check for file permission or volume mount issues. |
| 2 | cert-writer stuck in restart loop | SPIRE agent unhealthy or not ready | Check docker logs ${REGISTRY_NAME}-spire-agent first. Verify agent.conf has the correct trust_domain and server_address, and that the x509pop attestation plugin is configured (not join_token). Restart SPIRE agent, then cert-writer. |
| 3 | nginx returns 502 Bad Gateway | Registry container not ready yet | Wait 1-2 minutes for startup (database migrations run on first boot). Check docker logs ${REGISTRY_NAME}-registry for startup progress. |
| 4 | nginx returns 403 on federation endpoints | Client certificate validation failed | Check SVID expiry with docker logs ${REGISTRY_NAME}-cert-writer. Verify cert-writer is healthy and writing fresh certs to the shared volume. |
| 5 | EventStore writes failing (401/403) | mTLS certificate expired or SPIRE x509pop attestation revoked | Check docker logs ${REGISTRY_NAME}-cert-writer for renewal errors. Verify EVENTSTORE_MTLS_REQUIRED=true in your env. If certs are valid, contact the operator of your parent frame -- your SPIRE attestation may have been revoked. |
| 6 | Federation sync not discovering agents | Peer URL misconfigured or mTLS handshake failing | Verify FEDERATION_BASE_URL in your .env.operator. Check docker logs ${REGISTRY_NAME}-nginx-federation for TLS errors. Confirm the mainframe peer URL is reachable. |
| 7 | "Connection pool exhausted" | PgBouncer or PostgreSQL at max connections | Increase max_client_conn in pgbouncer.ini or max_connections in PostgreSQL config. Check the current connection count (a count that keeps climbing under steady load suggests a leak) with docker exec ${REGISTRY_NAME}-db psql -U <db-user> -d <db-name> -c "SELECT count(*) FROM pg_stat_activity;" |
| 8 | Registry returns 500 on startup | Database migration failed | Check docker logs ${REGISTRY_NAME}-registry and look for alembic migration errors. If the database is fresh, migrations run automatically. If upgrading, ensure the database volume was preserved. |
| 9 | Agent creation returns 409 Conflict | DID collision (extremely rare) | Retry the request. The system generates a new random DID on each attempt. If persistent, check for duplicate agent names in your registry. |
| 10 | Staking unstake returns 403 | 7-day cooldown period is still active | This is expected behavior. The unstake cooldown is enforced for economic stability. Check the cooldown_expires_at field in the stake record. |
| 11 | Cross-registry transfer returns 403 | Target registry is not an active federation peer | Verify the target peer is listed and ACTIVE in your federation config. The target registry may have been suspended from the federation. |
| 12 | Transfer returns "insufficient balance" | Liquid balance too low (staked tokens not available) | Only liquid (unstaked) balance can be transferred. Use GET /api/v1/teg/balance/{did} to check liquid vs. staked balances. |
| 13 | Supply audit shows BREACH status | Event emission policy misconfiguration or double-counted events | Do not attempt to fix manually. Contact the operator of your parent frame immediately. The supply auditor runs every 60 seconds and will detect any discrepancy. |
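For issue 1, the attestation files can be checked directly on the host before reading the agent logs. The following is a minimal sketch, assuming the spire/ directory layout named in the table and that openssl is installed; the function name check_attestation_cert is illustrative and not part of the operator tooling.

```bash
#!/usr/bin/env bash
# Hypothetical helper: confirm the x509pop attestation cert and key exist
# and that the certificate has not expired.
# Usage: check_attestation_cert [dir]   (dir defaults to "spire")
check_attestation_cert() {
  local dir="${1:-spire}"
  local crt="$dir/agent-attestation.crt"
  local key="$dir/agent-attestation.key"
  local f

  # Both files must exist and be readable by the current user.
  for f in "$crt" "$key"; do
    if [ ! -r "$f" ]; then
      echo "missing or unreadable: $f"
      return 1
    fi
  done

  # "-checkend 0" exits non-zero if the certificate has already expired.
  if ! openssl x509 -in "$crt" -noout -checkend 0 >/dev/null 2>&1; then
    echo "expired or unparsable: $crt"
    return 2
  fi

  echo "ok: $crt"
  # Print the notAfter date so upcoming expiry is visible.
  openssl x509 -in "$crt" -noout -enddate
}
```

Run check_attestation_cert spire from the directory that holds your compose file. A return code of 1 points at a missing file or a permission problem, and 2 points at an expired or corrupt certificate.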
Docker Commands Reference
View Containers
bash
# List all containers and their status
docker compose -f docker-compose.operator.yml ps
# Detailed container info (names, status, ports)
docker ps --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}"
View Logs
bash
# Last 50 lines of registry logs
docker compose -f docker-compose.operator.yml logs --tail 50 registry
# Follow logs in real time
docker compose -f docker-compose.operator.yml logs -f registry
# View logs for a specific container
docker logs ${REGISTRY_NAME}-registry --tail 50
# View logs with timestamps
docker logs ${REGISTRY_NAME}-registry --tail 50 -t
# Filter for errors (JSON logs)
docker logs ${REGISTRY_NAME}-registry 2>&1 | grep '"level":"ERROR"' | tail -20
Restart Services
bash
# Restart a single service
docker compose -f docker-compose.operator.yml restart registry
# Restart the full stack
docker compose -f docker-compose.operator.yml restart
# Stop and start (full reset)
docker compose -f docker-compose.operator.yml down
docker compose -f docker-compose.operator.yml up -d
Update and Rebuild
bash
# Pull latest images and restart
docker compose -f docker-compose.operator.yml pull
docker compose -f docker-compose.operator.yml up -d
# Force recreate containers (preserves volumes)
docker compose -f docker-compose.operator.yml up -d --force-recreate
# Pull and recreate in one step
docker compose -f docker-compose.operator.yml up -d --pull always
Enter a Container
bash
# Interactive shell in registry
docker exec -it ${REGISTRY_NAME}-registry bash
# Interactive shell in TEG layer
docker exec -it ${REGISTRY_NAME}-teg bash
# Run a one-off command
docker exec ${REGISTRY_NAME}-registry python -c "print('healthy')"
Database Operations
bash
# Connect to registry PostgreSQL
docker exec -it ${REGISTRY_NAME}-db psql -U <db-user> -d <db-name>
# Check active connections
docker exec ${REGISTRY_NAME}-db psql -U <db-user> -d <db-name> -c \
  "SELECT count(*) as active FROM pg_stat_activity WHERE state = 'active';"
# Check database size
docker exec ${REGISTRY_NAME}-db psql -U <db-user> -d <db-name> -c \
  "SELECT pg_size_pretty(pg_database_size(current_database()));"
Network Diagnostics
bash
# Check if registry can reach EventStore
docker exec ${REGISTRY_NAME}-registry curl -s -o /dev/null -w "%{http_code}" \
  https://events.example.com/health
# Check if SPIRE agent is healthy
docker exec ${REGISTRY_NAME}-spire-agent /opt/spire/bin/spire-agent healthcheck
# Check cert-writer SVID status
docker logs ${REGISTRY_NAME}-cert-writer --tail 10
Startup Order
The operator stack starts in this order. If a service fails, check the service it depends on:
1. db, teg-db, redis (databases)
2. spire-agent (identity)
3. pgbouncer (connection pooling -- waits for db)
4. registry, teg (applications -- wait for DB migrations)
5. cert-writer (SVID fetching -- requires spire-agent)
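6. nginx-federation (mTLS sidecar -- requires registry + certs)
If a service is stuck, check the services it requires, which appear earlier in this chain. The dependency is not always the step directly above: cert-writer (step 5) requires spire-agent (step 2), not the applications in step 4.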
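The chain above can be walked in order to find the first service that is not running. This is a sketch under assumptions: it assumes every container is named ${REGISTRY_NAME}-<service>, which this guide only confirms for some services (teg-db, redis, and pgbouncer are guesses that follow the same pattern), and the function names are illustrative.

```bash
#!/usr/bin/env bash
# Service names in the startup order listed above.
STARTUP_ORDER="db teg-db redis spire-agent pgbouncer registry teg cert-writer nginx-federation"

# Print the Docker state of one service container, or "missing" if it does
# not exist. Assumes the ${REGISTRY_NAME}-<service> naming pattern.
container_status() {
  local status
  status="$(docker inspect -f '{{.State.Status}}' "${REGISTRY_NAME}-$1" 2>/dev/null)" || status="missing"
  echo "${status:-missing}"
}

# Walk the chain and report the first service whose state is not "running".
# Returns 1 if such a service is found, 0 otherwise.
first_unhealthy() {
  local svc status
  for svc in $STARTUP_ORDER; do
    status="$(container_status "$svc")"
    if [ "$status" != "running" ]; then
      echo "$svc ($status)"
      return 1
    fi
  done
  echo "all services running"
}
```

Calling first_unhealthy with REGISTRY_NAME exported prints, for example, the service name followed by its state in parentheses. A state of "running" only means the container process is up, so a service can still be unready; the logs for the reported service remain the next place to look.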
When to Escalate
If you federate with an upstream frame, escalate to that frame's operator when you encounter:
- Supply audit BREACH status
- SPIRE attestation revocation
- Persistent mTLS certificate failures after SPIRE agent restart
- Database corruption or unrecoverable migration failures
- Any behavior that suggests unauthorized token minting or balance manipulation
Include in your report:
- Your operator name and registry DID
- Container logs for the affected service (last 100 lines)
- Output of docker compose -f docker-compose.operator.yml ps
- Timestamp of when the issue first appeared
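The report items above can be gathered into a single file. The following is a minimal sketch, assuming the ${REGISTRY_NAME}-<service> container naming used elsewhere in this guide; the function names and the header layout are illustrative, not a required report format.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the report header.
# Arguments: operator name, registry DID, time the issue first appeared.
report_header() {
  printf 'Operator: %s\nRegistry DID: %s\nFirst seen: %s\n' "$1" "$2" "$3"
}

# Collect the report for one affected service.
# Arguments: service name, operator name, registry DID, first-seen timestamp.
collect_report() {
  local service="$1" operator="$2" did="$3" first_seen="$4"
  report_header "$operator" "$did" "$first_seen"
  echo "--- docker compose ps ---"
  docker compose -f docker-compose.operator.yml ps
  echo "--- last 100 lines: ${REGISTRY_NAME}-${service} ---"
  # docker logs writes to both stdout and stderr, so merge them.
  docker logs --tail 100 "${REGISTRY_NAME}-${service}" 2>&1
}
```

For example, collect_report registry "<operator-name>" "<registry-did>" "<timestamp>" > report.txt writes everything to report.txt for the affected registry service.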