Runbook: Incident Response
When to use this
Use this runbook as a first-response playbook for the three most common Alexandria EE incidents: the API not serving traffic, quota exceeded errors blocking users, and audit chain verification failures.
Scenario 1: API not serving (5xx or no response)
Pre-checks
- Check pod status:
    kubectl get pods -n <namespace> -l app.kubernetes.io/instance=alexandria-ee
- Check recent events:
    kubectl describe pods -n <namespace> -l app.kubernetes.io/instance=alexandria-ee
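To make the pod status check easier to scan, the output can be filtered down to pods that are not fully ready. This is a minimal sketch: the helper name unhealthy_pods is hypothetical, and it assumes the default kubectl table layout (NAME, READY, STATUS, RESTARTS, AGE).

```shell
# Print pods that are not Running or not fully ready.
# Input: the default `kubectl get pods` table on stdin (assumed layout).
unhealthy_pods() {
  awk 'NR > 1 {
    split($2, r, "/")   # READY column looks like "1/2"
    if ($3 != "Running" || r[1] != r[2]) print $1, $2, $3
  }'
}

# Usage (requires cluster access):
#   kubectl get pods -n <namespace> -l app.kubernetes.io/instance=alexandria-ee | unhealthy_pods
```

An empty result suggests all matching pods are Running with every container ready; any line printed is a candidate for the describe and logs steps.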
Procedure
- Check the /ready endpoint:
    kubectl port-forward svc/alexandria-ee -n <namespace> 8080:80
    curl -sf http://localhost:8080/ready
  The port-forward command blocks until interrupted, so run curl from a second terminal. With -f, curl reports an HTTP error such as 503 only as exit code 22. A 503 usually means the orchestrator gRPC socket is not yet up or has crashed.
- Check orchestrator gRPC socket connectivity:
  The api container connects to the orchestrator over a Unix socket (/run/alexandria/orchestrator.sock) shared via an emptyDir volume. If the orchestrator container is crash-looping, the socket will not exist and the api container will log connection errors.
    kubectl logs -n <namespace> <pod> -c orchestrator --tail=50
    kubectl logs -n <namespace> <pod> -c api --tail=50
  Look for "orchestrator.sock: no such file or directory" or "context deadline exceeded" in the api logs.
- Check license blob validity:
  If the api logs contain "license blob invalid" or "license signature verification failed", the license secret is missing or corrupted. Verify:
    kubectl get secret -n <namespace> <license-secret-name>
  Restore the license blob if needed (see license-renewal.md).
- Check resource exhaustion:
  On GKE Autopilot, pods may be evicted if memory limits are hit. Confirm no OOMKilled containers:
    kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -20
  If OOMKilled, increase resources.orchestrator.limits.memory in your site values and run a Helm upgrade (see upgrade.md).
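The log checks in this procedure can be sketched as a small triage helper. The function name classify_api_logs and its output labels are hypothetical; the matched strings are the api log messages named in the steps above.

```shell
# Map api log messages named in this runbook to a likely cause.
# Reads log text on stdin; prints the matching cause label(s) or "unknown".
classify_api_logs() {
  logs=$(cat)
  cause=""
  case "$logs" in
    *"orchestrator.sock: no such file or directory"*|*"context deadline exceeded"*)
      cause="orchestrator-socket" ;;
  esac
  case "$logs" in
    *"license blob invalid"*|*"license signature verification failed"*)
      cause="${cause:+$cause }license" ;;
  esac
  echo "${cause:-unknown}"
}

# Usage (requires cluster access):
#   kubectl logs -n <namespace> <pod> -c api --tail=50 | classify_api_logs
```

A result of "unknown" only means none of the listed messages appeared in the sampled lines; it does not rule out eviction or OOMKilled containers, which show up in events rather than api logs.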
Verification
curl -sf https://<host>/ready returns 200 and /license returns valid JSON.
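After a fix, pods may take a short while to become ready, so polling the endpoint can be more convenient than a single curl. The following is a sketch only: wait_ready is a hypothetical helper, and the default attempt count and delay are arbitrary.

```shell
# Poll a /ready URL until curl succeeds or the attempts run out.
# Arguments: URL, max attempts (default 30), seconds between attempts (default 2).
wait_ready() {
  url=$1
  attempts=${2:-30}
  delay=${3:-2}
  i=1
  while [ "$i" -le "$attempts" ]; do
    if curl -sf -o /dev/null "$url"; then
      echo "ready after $i attempt(s)"
      return 0
    fi
    # Sleep only between attempts, not after the last one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  echo "not ready after $attempts attempt(s)" >&2
  return 1
}

# Usage:
#   wait_ready https://<host>/ready 30 2
```

The function returns non-zero when the endpoint never succeeds, so it can gate a follow-up check of /license.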
Rollback
If a recent upgrade caused the outage, roll back the Helm release — see upgrade.md.
Scenario 2: Quota exceeded (HTTP 402)
Pre-checks
- Identify which quota is exhausted (seats or tenants):
    curl -sf https://<host>/license | jq '{current_seats, max_seats, current_tenants, max_tenants}'
Procedure
- If seat quota is exceeded (current_seats >= max_seats):
  - Identify inactive users via the admin API and deactivate or remove them to free seats.
  - If demand is legitimate, initiate a license upgrade: see license-renewal.md to obtain a blob with a higher seat_count.
- If tenant quota is exceeded (current_tenants >= max_tenants):
  - Enterprise tier only: list existing tenants via the admin API and determine whether any are unused.
  - If more tenants are needed, contact the issuer for a new blob with a higher tenant_count.
- Apply the upgraded license blob and restart (see license-renewal.md).
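The two comparisons above (current_seats >= max_seats and current_tenants >= max_tenants) can be sketched as a helper that names the exhausted quota. The function quota_status is hypothetical, and it assumes all four /license fields are present as integers; on tiers where the tenant fields are absent or null, the jq pipeline in the usage comment would need adjusting.

```shell
# Report which quota, if any, is exhausted.
# Arguments: current_seats max_seats current_tenants max_tenants (integers).
quota_status() {
  status=""
  if [ "$1" -ge "$2" ]; then
    status="seats"
  fi
  if [ "$3" -ge "$4" ]; then
    status="${status:+$status,}tenants"
  fi
  echo "${status:-ok}"
}

# Usage (field names taken from the pre-check above):
#   quota_status $(curl -sf https://<host>/license \
#     | jq -r '[.current_seats, .max_seats, .current_tenants, .max_tenants] | @tsv')
```

An output of "seats", "tenants", or "seats,tenants" points to the matching branch of the procedure; "ok" suggests the 402 has another cause or the quota has already been freed.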
Verification
curl -sf https://<host>/license | jq '{current_seats, max_seats}' shows headroom.
Rollback
Not applicable — quota errors are resolved by reducing usage or upgrading the license. No infrastructure change needed.
Scenario 3: Audit chain verification failure
Pre-checks
- Identify the failure: check API logs for "audit chain broken" or "HMAC mismatch" messages.
- Determine scope: which tenant(s) and approximate row range are affected.
- Capture the affected rows before taking any action. Export them for compliance review:
    # Example: adjust table name and DSN to your deployment
    DSN=$(kubectl get secret alexandria-ee -n <namespace> \
      -o jsonpath='{.data.database-dsn}' | base64 -d)
    psql "$DSN" -c "COPY (SELECT * FROM audit_logs
      WHERE tenant_id = '<affected-tenant>'
      ORDER BY seq) TO STDOUT CSV HEADER" > audit_chain_affected_$(date +%Y%m%d).csv
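Before handing the export to compliance, it may help to record how many rows were captured and a digest of the file, so later copies can be checked against the original. This is a sketch under assumptions: export_manifest is a hypothetical helper, and the row count is only accurate if no CSV field contains an embedded newline.

```shell
# Print a row count and SHA-256 digest for an exported audit CSV.
# The row count subtracts the header line written by CSV HEADER.
export_manifest() {
  file=$1
  if [ ! -s "$file" ]; then
    echo "missing or empty: $file" >&2
    return 1
  fi
  rows=$(( $(wc -l < "$file") - 1 ))
  if command -v sha256sum >/dev/null 2>&1; then
    digest=$(sha256sum "$file" | awk '{print $1}')
  else
    digest=$(shasum -a 256 "$file" | awk '{print $1}')   # macOS fallback
  fi
  echo "file=$file rows=$rows sha256=$digest"
}

# Usage:
#   export_manifest audit_chain_affected_$(date +%Y%m%d).csv
```

A row count of zero, or an error about an empty file, is a signal to recheck the tenant ID and table name before proceeding.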
Procedure
- Attempt chain re-verification after a restore. If the broken chain follows a recent Postgres restore, re-verify after ensuring the restore was complete (see backup-restore.md).
- If the chain is irreparably broken and compliance review is complete, use RetireUnverifiableAuditRows (implemented in api-go/internal/store/audit.go, line 183). This is a destructive operation: it removes the unverifiable rows and resets the chain anchor.
  RetireUnverifiableAuditRows is an internal store method. It is not exposed as a CLI command today. An operator with direct DB access and access to the Go binary can invoke it via a maintenance mode flag, or you can trigger it through a temporary admin API endpoint if your deployment includes one.
  Do not call RetireUnverifiableAuditRows without:
  - Written compliance sign-off.
  - The affected rows captured (step above).
  - A Postgres backup taken immediately before the operation.
- File an incident report documenting: the row range affected, the suspected cause (restore, replication lag, direct DB edit), and the disposition of the captured rows.
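The three preconditions for calling RetireUnverifiableAuditRows can be enforced with a simple gate before any destructive step. This is a sketch only: retire_preflight is a hypothetical helper, and it assumes each precondition is evidenced by a non-empty file on disk (sign-off document, captured CSV, backup dump), which may not match how your deployment stores backups.

```shell
# Refuse to proceed unless every precondition has a non-empty file as evidence.
# Arguments (hypothetical paths): sign-off document, captured CSV, backup file.
retire_preflight() {
  missing=0
  for f in "$1" "$2" "$3"; do
    if [ ! -s "$f" ]; then
      echo "precondition missing or empty: $f" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage:
#   retire_preflight signoff.pdf audit_chain_affected_<date>.csv pre_retire_backup.dump \
#     && echo "preconditions met"
```

The check does not validate the contents of any file; it only catches the case where a step was skipped entirely.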
Verification
After RetireUnverifiableAuditRows completes, the API should resume accepting new audit writes without chain errors. Monitor the api logs for audit chain anchor reset confirmation.
Rollback
Restore from the pre-operation Postgres snapshot — see backup-restore.md. Note that rows written after the snapshot are lost; re-verify chain integrity on the restored data.
Cross-links
- upgrade.md — safe Helm upgrade and rollback
- backup-restore.md — Postgres backup and restore
- license-renewal.md — renewing or replacing the license blob
- key-rotation.md — rotating signing keys after compromise