Runbook: Incident Response

When to use this

Use this runbook as a first-response playbook for the three most common Alexandria EE incidents: the API not serving traffic, quota-exceeded errors blocking users, and audit chain verification failures.

Scenario 1: API not serving (5xx or no response)

Pre-checks

Check pod status: kubectl get pods -n <namespace> -l app.kubernetes.io/instance=alexandria-ee
Check recent events: kubectl describe pods -n <namespace> -l app.kubernetes.io/instance=alexandria-ee

Procedure

Check the /ready endpoint. The port-forward runs in the foreground, so run curl from a second terminal:

kubectl port-forward svc/alexandria-ee -n <namespace> 8080:80
curl -sf http://localhost:8080/ready

A 503 usually means the orchestrator is not yet listening on its gRPC socket or has crashed.
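
Because curl -sf suppresses the response body and status on failure, it can help to print the status code explicitly. The following is a minimal sketch, assuming the port-forward above is running on localhost:8080; the function names ready_status and interpret_ready are hypothetical helpers, and the 503 hint simply restates the note above.

```shell
# Print only the HTTP status code of the readiness endpoint.
# curl reports 000 when it cannot connect at all.
ready_status() {
  curl -s -o /dev/null -w '%{http_code}' "${1:-http://localhost:8080/ready}"
}

# Map a status code to a first-response hint (hypothetical helper).
interpret_ready() {
  case "$1" in
    200) echo "ready" ;;
    503) echo "not ready: check the orchestrator socket" ;;
    000) echo "no response: check the port-forward and pod status" ;;
    *)   echo "unexpected status $1" ;;
  esac
}
```

Usage: interpret_ready "$(ready_status)"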

Check orchestrator gRPC socket connectivity:

The api container connects to the orchestrator over a Unix socket (/run/alexandria/orchestrator.sock) shared via an emptyDir volume. If the orchestrator container is crash-looping, the socket will not exist and the api container will log connection errors.
```
kubectl logs -n <namespace> <pod> -c orchestrator --tail=50
kubectl logs -n <namespace> <pod> -c api --tail=50
```
Look for "orchestrator.sock: no such file or directory" or "context deadline exceeded" in the api logs.

Check license blob validity:

If the api logs contain "license blob invalid" or "license signature verification failed", the license secret is likely missing or corrupted. Verify that it exists:
```
kubectl get secret -n <namespace> <license-secret-name>
```
Restore the license blob if needed — see license-renewal.md.
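
The log signatures from the socket and license checks above can be triaged in one pass. This is a rough sketch, not part of the product: classify_api_log is a hypothetical helper, and it assumes the messages appear in the api logs with the wording quoted in this runbook. The socket pattern is deliberately loose because the exact error text may include extra detail between the socket path and the error.

```shell
# Read api container logs on stdin and name the first matching failure signature.
classify_api_log() {
  log=$(cat)
  case "$log" in
    *"orchestrator.sock"*"no such file or directory"*) echo "socket-missing" ;;
    *"context deadline exceeded"*)                     echo "orchestrator-timeout" ;;
    *"license blob invalid"*)                          echo "license" ;;
    *"license signature verification failed"*)         echo "license" ;;
    *)                                                 echo "unknown" ;;
  esac
}
```

Usage: pipe the output of the kubectl logs command for the api container into classify_api_log.
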
Check resource exhaustion:

On GKE Autopilot, a container that exceeds its memory limit is OOM-killed and restarted, and pods can also be evicted under node memory pressure. Check recent events for either:
```
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -20
```
If OOMKilled, increase resources.orchestrator.limits.memory in your site values and run a Helm upgrade (see upgrade.md).
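
Events expire, so the last termination reason recorded in pod status can be a more durable signal. The sketch below assumes the same instance label used in the pre-checks; list_last_terminations and oom_pods are hypothetical helper names, and the namespace is passed as the first argument.

```shell
# Print "pod<TAB>reasons" where reasons are the last termination reasons
# of the pod's containers (empty if no container has been terminated).
list_last_terminations() {
  kubectl get pods -n "$1" -l app.kubernetes.io/instance=alexandria-ee \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}'
}

# Filter stdin down to the names of pods with an OOMKilled container.
oom_pods() {
  awk -F '\t' '$2 ~ /OOMKilled/ { print $1 }'
}
```

Usage: list_last_terminations my-namespace | oom_pods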

Verification

curl -sf https://<host>/ready succeeds (HTTP 200), and curl -sf https://<host>/license returns valid JSON.

Rollback

If a recent upgrade caused the outage, roll back the Helm release — see upgrade.md.

Scenario 2: Quota exceeded (HTTP 402)

Pre-checks

Identify which quota is exhausted (seats or tenants):

curl -sf https://<host>/license | jq '{current_seats, max_seats, current_tenants, max_tenants}'

Procedure

If seat quota is exceeded (current_seats >= max_seats):
- Identify inactive users via the admin API and deactivate or remove them to free seats.
- If demand is legitimate, initiate a license upgrade — see license-renewal.md to obtain a blob with higher seat_count.
If tenant quota is exceeded (current_tenants >= max_tenants):
- Enterprise tier only — list existing tenants via the admin API and determine whether any are unused.
- If more tenants are needed, contact the issuer for a new blob with a higher tenant_count.
Apply the upgraded license blob and restart (see license-renewal.md).
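
The two comparisons above (current >= max) can be sketched as a small helper. This assumes the /license fields shown in the pre-check are plain integers; headroom is a hypothetical function name, and it prints "unknown" for any non-numeric value (for example, if a field is null on a tier without tenant limits, which is an assumption).

```shell
# headroom CURRENT MAX
# Prints the remaining units, "exceeded" when CURRENT >= MAX,
# or "unknown" when either value is not a non-negative integer.
headroom() {
  for n in "$1" "$2"; do
    case "$n" in
      ''|*[!0-9]*) echo "unknown"; return 0 ;;
    esac
  done
  if [ "$1" -ge "$2" ]; then
    echo "exceeded"
  else
    echo $(( $2 - $1 ))
  fi
}
```

Usage: extract each pair of values with jq from the /license response, then call, for example, headroom 48 50 to see how many seats remain.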

Verification

curl -sf https://<host>/license | jq '{current_seats, max_seats}' shows headroom.

Rollback

Not applicable — quota errors are resolved by reducing usage or upgrading the license. No infrastructure change needed.

Scenario 3: Audit chain verification failure

Pre-checks

Identify the failure: check the api logs for "audit chain broken" or "HMAC mismatch" messages.
Determine scope: which tenant(s) and approximate row range are affected.

Capture the affected rows before taking any action. Export them for compliance review:

# Example — adjust table name and DSN to your deployment
DSN=$(kubectl get secret alexandria-ee -n <namespace> \
      -o jsonpath='{.data.database-dsn}' | base64 -d)
psql "$DSN" -c "COPY (
  SELECT * FROM audit_logs
  WHERE tenant_id = '<affected-tenant>'
  ORDER BY seq
) TO STDOUT CSV HEADER" > audit_chain_affected_$(date +%Y%m%d).csv
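
Before relying on the export, it may be worth confirming that it actually contains rows and recording a checksum alongside it. A minimal sketch, assuming GNU coreutils (sha256sum) is available; verify_export is a hypothetical helper, and the line count is only approximate if any CSV field contains embedded newlines.

```shell
# verify_export FILE
# Fails if the export is empty or contains only the header line;
# otherwise prints the number of data lines and the file's SHA-256.
verify_export() {
  [ -s "$1" ] || { echo "empty export: $1" >&2; return 1; }
  lines=$(( $(wc -l < "$1") ))
  [ "$lines" -gt 1 ] || { echo "header only: $1" >&2; return 1; }
  echo "data lines: $(( lines - 1 ))"
  sha256sum "$1"
}
```

Usage: verify_export "audit_chain_affected_$(date +%Y%m%d).csv"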

Procedure

Attempt chain re-verification first. If the broken chain follows a recent Postgres restore, confirm that the restore completed, then re-run verification (see backup-restore.md).
If the chain is irreparably broken and compliance review is complete, use RetireUnverifiableAuditRows (implemented in api-go/internal/store/audit.go, line 183). This is a destructive operation — it removes the unverifiable rows and resets the chain anchor.

RetireUnverifiableAuditRows is an internal store method. It is not exposed as a CLI command today. An operator with direct DB access and access to the Go binary can invoke it via a maintenance mode flag, or you can trigger it through a temporary admin API endpoint if your deployment includes one.

Do not call RetireUnverifiableAuditRows without:
- Written compliance sign-off.
- The affected rows captured (see Pre-checks).
- A Postgres backup taken immediately before the operation.
File an incident report documenting: the row range affected, the suspected cause (restore, replication lag, direct DB edit), and the disposition of the captured rows.
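
The three prerequisites above can be enforced with a simple gate before anyone invokes the destructive operation. This is a sketch under stated assumptions: it treats the sign-off, the captured rows, and the backup as local files (for example a signed PDF, the CSV export, and a dump file), which may not match how your deployment stores them. preflight_retire is a hypothetical helper and does not call RetireUnverifiableAuditRows itself.

```shell
# preflight_retire SIGNOFF_FILE EXPORT_FILE BACKUP_FILE
# Returns 0 only when all three evidence files exist and are non-empty;
# otherwise lists what is missing on stderr and returns 1.
preflight_retire() {
  missing=0
  for f in "$1" "$2" "$3"; do
    if [ ! -s "$f" ]; then
      echo "missing or empty: ${f:-<unset>}" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

Usage: preflight_retire signoff.pdf audit_chain_affected.csv pre_retire.dump && echo "prerequisites met"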

Verification

After RetireUnverifiableAuditRows completes, the API should resume accepting new audit writes without chain errors. Monitor the api logs for audit chain anchor reset confirmation.

Rollback

Restore from the pre-operation Postgres snapshot — see backup-restore.md. Note that rows written after the snapshot are lost; re-verify chain integrity on the restored data.

Cross-links

upgrade.md — safe Helm upgrade and rollback
backup-restore.md — Postgres backup and restore
license-renewal.md — renewing or replacing the license blob
key-rotation.md — rotating signing keys after compromise
