
Operator Guide: Troubleshooting

Audience: Federated registry operators (self-hosted)
Last Updated: 2026-06-04


Common Issues

| # | Symptom | Cause | Fix |
|---|---------|-------|-----|
| 1 | SPIRE agent won't start | x509pop attestation certs missing or expired | Verify your spire/agent-attestation.crt and spire/agent-attestation.key files are in place. Check docker logs ${REGISTRY_NAME}-spire-agent for attestation errors. Unlike join tokens, x509pop certs survive restarts -- if they were working before, check for file permission or volume mount issues. |
| 2 | cert-writer stuck in restart loop | SPIRE agent unhealthy or not ready | Check docker logs ${REGISTRY_NAME}-spire-agent first. Verify agent.conf has the correct trust_domain and server_address, and that the x509pop attestation plugin is configured (not join_token). Restart SPIRE agent, then cert-writer. |
| 3 | nginx returns 502 Bad Gateway | Registry container not ready yet | Wait 1-2 minutes for startup (database migrations run on first boot). Check docker logs ${REGISTRY_NAME}-registry for startup progress. |
| 4 | nginx returns 403 on federation endpoints | Client certificate validation failed | Check SVID expiry with docker logs ${REGISTRY_NAME}-cert-writer. Verify cert-writer is healthy and writing fresh certs to the shared volume. |
| 5 | EventStore writes failing (401/403) | mTLS certificate expired or SPIRE x509pop attestation revoked | Check docker logs ${REGISTRY_NAME}-cert-writer for renewal errors. Verify EVENTSTORE_MTLS_REQUIRED=true in your env. If certs are valid, contact the operator of your parent frame -- your SPIRE attestation may have been revoked. |
| 6 | Federation sync not discovering agents | Peer URL misconfigured or mTLS handshake failing | Verify FEDERATION_BASE_URL in your .env.operator. Check docker logs ${REGISTRY_NAME}-nginx-federation for TLS errors. Confirm the mainframe peer URL is reachable. |
| 7 | "Connection pool exhausted" | PgBouncer or PostgreSQL at max connections | Increase max_client_conn in pgbouncer.ini or max_connections in PostgreSQL config. Check for connection leaks with docker exec ${REGISTRY_NAME}-db psql -U <user> -c "SELECT count(*) FROM pg_stat_activity;" |
| 8 | Registry returns 500 on startup | Database migration failed | Check docker logs ${REGISTRY_NAME}-registry and look for alembic migration errors. If the database is fresh, migrations run automatically. If upgrading, ensure the database volume was preserved. |
| 9 | Agent creation returns 409 Conflict | DID collision (extremely rare) | Retry the request. The system generates a new random DID on each attempt. If persistent, check for duplicate agent names in your registry. |
| 10 | Staking unstake returns 403 | 7-day cooldown period is still active | This is expected behavior. The unstake cooldown is enforced for economic stability. Check the cooldown_expires_at field in the stake record. |
| 11 | Cross-registry transfer returns 403 | Target registry is not an active federation peer | Verify the target peer is listed and ACTIVE in your federation config. The target registry may have been suspended from the federation. |
| 12 | Transfer returns "insufficient balance" | Liquid balance too low (staked tokens not available) | Only liquid (unstaked) balance can be transferred. Use GET /api/v1/teg/balance/{did} to check liquid vs. staked balances. |
| 13 | Supply audit shows BREACH status | Event emission policy misconfiguration or double-counted events | Do not attempt to fix manually. Contact the operator of your parent frame immediately. The supply auditor runs every 60 seconds and will detect any discrepancy. |
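
For issue 1, the "in place and not expired" check can be scripted before restarting the SPIRE agent. The following is a minimal sketch, not part of the operator stack: the function name check_attestation_cert is hypothetical, the default paths are the file names from the table taken relative to the current directory, and it assumes openssl is installed on the host.

```shell
# Sketch: pre-flight check for the x509pop attestation files (issue 1).
# check_attestation_cert is a hypothetical helper, not a shipped tool.
check_attestation_cert() {
  local crt="${1:-spire/agent-attestation.crt}"
  local key="${2:-spire/agent-attestation.key}"
  local f

  # Both files must exist and be readable by the current user.
  for f in "$crt" "$key"; do
    if [ ! -r "$f" ]; then
      echo "MISSING_OR_UNREADABLE $f"
      return 1
    fi
  done

  # -checkend 0 exits non-zero if the cert has already expired;
  # openssl also fails if the file is not a parseable certificate.
  if ! openssl x509 -in "$crt" -noout -checkend 0 >/dev/null 2>&1; then
    echo "EXPIRED_OR_INVALID $crt"
    return 2
  fi

  echo "OK $crt"
}
```

Run check_attestation_cert from the directory that contains spire/, or pass explicit paths. It checks the files on the host only; a volume mount problem inside the container would still need to be checked through docker logs.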

Docker Commands Reference

View Containers

bash
# List all containers and their status
docker compose -f docker-compose.operator.yml ps

# Detailed container info (names, status, ports)
docker ps --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}"

View Logs

bash
# Last 50 lines of registry logs
docker compose -f docker-compose.operator.yml logs --tail 50 registry

# Follow logs in real time
docker compose -f docker-compose.operator.yml logs -f registry

# View logs for a specific container
docker logs ${REGISTRY_NAME}-registry --tail 50

# View logs with timestamps
docker logs ${REGISTRY_NAME}-registry --tail 50 -t

# Filter for errors (JSON logs)
docker logs ${REGISTRY_NAME}-registry 2>&1 | grep '"level":"ERROR"' | tail -20
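
To see how errors compare with other log levels, the same JSON logs can be summarized per level. This is a sketch under one assumption: each log line carries a compact "level":"NAME" field with an uppercase name and no space after the colon, as the grep above implies. The function name count_levels is hypothetical.

```shell
# Sketch: count JSON log lines per level, reading logs from stdin.
# Lines without a "level":"..." field are ignored.
count_levels() {
  sed -n 's/.*"level":"\([A-Z]*\)".*/\1/p' \
    | sort \
    | uniq -c \
    | awk '{print $2 "=" $1}'
}

# Usage (requires a running stack):
#   docker logs ${REGISTRY_NAME}-registry 2>&1 | count_levels
```

A sudden rise in the ERROR count between two runs is usually a better signal than the absolute number.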

Restart Services

bash
# Restart a single service
docker compose -f docker-compose.operator.yml restart registry

# Restart the full stack
docker compose -f docker-compose.operator.yml restart

# Stop and start (full reset)
docker compose -f docker-compose.operator.yml down
docker compose -f docker-compose.operator.yml up -d

Update and Rebuild

bash
# Pull latest images and restart
docker compose -f docker-compose.operator.yml pull
docker compose -f docker-compose.operator.yml up -d

# Force recreate containers (preserves volumes)
docker compose -f docker-compose.operator.yml up -d --force-recreate

# Pull and recreate in one step
docker compose -f docker-compose.operator.yml up -d --pull always

Enter a Container

bash
# Interactive shell in registry
docker exec -it ${REGISTRY_NAME}-registry bash

# Interactive shell in TEG layer
docker exec -it ${REGISTRY_NAME}-teg bash

# Run a one-off command
docker exec ${REGISTRY_NAME}-registry python -c "print('healthy')"

Database Operations

bash
# Connect to registry PostgreSQL
docker exec -it ${REGISTRY_NAME}-db psql -U <db-user> -d <db-name>

# Check active connections
docker exec ${REGISTRY_NAME}-db psql -U <db-user> -d <db-name> -c \
  "SELECT count(*) as active FROM pg_stat_activity WHERE state = 'active';"

# Check database size
docker exec ${REGISTRY_NAME}-db psql -U <db-user> -d <db-name> -c \
  "SELECT pg_size_pretty(pg_database_size(current_database()));"
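
When diagnosing "Connection pool exhausted", it can help to see the connection count next to the configured limit. The sketch below combines the pg_stat_activity count with PostgreSQL's SHOW max_connections. DB_USER and DB_NAME are hypothetical environment variables standing in for <db-user> and <db-name>, and db_conn_usage is an illustrative name.

```shell
# Sketch: report PostgreSQL connections as a share of max_connections.
# DB_USER / DB_NAME are placeholders for <db-user> / <db-name>.
db_conn_usage() {
  local container="${REGISTRY_NAME}-db"
  local used max

  # -A (unaligned) and -t (tuples only) make psql print the bare value.
  used="$(docker exec "$container" psql -U "$DB_USER" -d "$DB_NAME" -At \
    -c "SELECT count(*) FROM pg_stat_activity;")" || return 1
  max="$(docker exec "$container" psql -U "$DB_USER" -d "$DB_NAME" -At \
    -c "SHOW max_connections;")" || return 1

  echo "connections: ${used}/${max} ($(( used * 100 / max ))%)"
}
```

This reports the PostgreSQL side only. If the count sits well below the limit, the PgBouncer setting max_client_conn is the more likely bottleneck.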

Network Diagnostics

bash
# Check if registry can reach EventStore
docker exec ${REGISTRY_NAME}-registry curl -s -o /dev/null -w "%{http_code}" \
  https://events.example.com/health

# Check if SPIRE agent is healthy
docker exec ${REGISTRY_NAME}-spire-agent /opt/spire/bin/spire-agent healthcheck

# Check cert-writer SVID status
docker logs ${REGISTRY_NAME}-cert-writer --tail 10

Startup Order

The operator stack starts in this order. If a service fails, check the service it depends on:

1. db, teg-db, redis                      (databases)
2. spire-agent                            (identity)
3. pgbouncer                              (connection pooling -- waits for db)
4. registry, teg                          (applications -- wait for DB migrations)
5. cert-writer                            (SVID fetching -- requires spire-agent)
6. nginx-federation                       (mTLS sidecar -- requires registry + certs)

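If a service is stuck, work backward through this chain to the service it depends on. The dependency is not always the step directly above: cert-writer (step 5), for example, requires spire-agent (step 2).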
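
The startup chain above can be walked with a small script that reports the first container that is not running. This is a sketch under one assumption: every service's container is named ${REGISTRY_NAME}-<service>. This guide only shows that pattern for db, spire-agent, registry, teg, cert-writer, and nginx-federation; the names for teg-db, redis, and pgbouncer are guesses. The function name first_stopped_service is hypothetical.

```shell
# Sketch: walk the startup order and print the first container not running.
# Container names for teg-db, redis, and pgbouncer are assumed, not confirmed.
first_stopped_service() {
  local svc status
  for svc in db teg-db redis spire-agent pgbouncer registry teg cert-writer nginx-federation; do
    # docker inspect fails if the container does not exist.
    status="$(docker inspect --format '{{.State.Status}}' "${REGISTRY_NAME}-${svc}" 2>/dev/null)" \
      || status="missing"
    if [ "$status" != "running" ]; then
      echo "${svc}: ${status}"
      return 1
    fi
  done
  echo "all services running"
}
```

A "running" state only means the container process is up. A service can be running and still unhealthy, so follow up with its logs.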


When to Escalate

If you federate with an upstream frame, escalate to that frame's operator when you encounter:

  • Supply audit BREACH status
  • SPIRE attestation revocation
  • Persistent mTLS certificate failures after SPIRE agent restart
  • Database corruption or unrecoverable migration failures
  • Any behavior that suggests unauthorized token minting or balance manipulation

Include in your report:

  1. Your operator name and registry DID
  2. Container logs for the affected service (last 100 lines)
  3. Output of docker compose -f docker-compose.operator.yml ps
  4. Timestamp of when the issue first appeared

Server components AGPL-v3 · client SDK Apache-2.0. If a doc and the running stack disagree, trust the stack.