Runbook: Postgres Backup and Restore

When to use this

Use this runbook to take a point-in-time snapshot of the Alexandria EE Postgres database (before upgrades, or as part of a scheduled backup policy) or to restore from a snapshot after data loss or a failed migration.

Postgres is always external to the cluster — there is no in-cluster database pod. All data is in the external Postgres instance referenced by the DSN secret.

Pre-checks

Locate the database DSN. It is stored in one of two places:
- Kubernetes Secret: kubectl get secret alexandria-ee -n <namespace> -o jsonpath='{.data.database-dsn}' | base64 -d
- Vault KV v2 (when vault.enabled=true): check the path under vault.prefix (default: alexandria/llm/) in your site values.
Confirm the backup destination has enough disk space (pg_dump output is typically 10–50% of raw table size).
If you run alex backup (Quadlet/CLI installs), note that it backs up config files, tools, and the knowledge store only. It does NOT back up the Postgres database. Run pg_dump separately for the DB.
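
The disk-space pre-check can be scripted. The following is a minimal sketch, not part of any Alexandria tooling: free_kb and has_free_kb are hypothetical helper names, and the required size is something you estimate yourself from the database size.

```shell
# Sketch: check that the backup destination has enough free space.
# free_kb and has_free_kb are hypothetical helpers, not Alexandria commands.

# Print the free space (in KiB) on the filesystem that holds directory $1.
free_kb() {
  # -P forces the portable one-line-per-filesystem format; -k reports KiB.
  df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Succeed if directory $1 has at least $2 KiB available.
has_free_kb() {
  local dir="$1"
  local required_kb="$2"
  local available_kb
  available_kb=$(free_kb "$dir") || return 1
  [ -n "$available_kb" ] && [ "$available_kb" -ge "$required_kb" ]
}
```

For example, `has_free_kb /backups $((20 * 1024 * 1024)) || echo "less than 20 GiB free" >&2` would warn before a dump is started (the /backups path is illustrative).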

Procedure

Taking a backup

# Export the DSN from the k8s Secret
DSN=$(kubectl get secret alexandria-ee -n <namespace> \
      -o jsonpath='{.data.database-dsn}' | base64 -d)

# Snapshot to a timestamped file
TIMESTAMP=$(date +%Y%m%dT%H%M%S)
pg_dump "$DSN" \
  --format=custom \
  --compress=9 \
  --file="alexandria-ee-${TIMESTAMP}.pgdump"

Store the .pgdump file in durable storage (GCS bucket, S3, encrypted volume) before proceeding with any upgrade or migration.
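
Before uploading, it can be worth confirming that the dump is readable and recording a checksum alongside it. This is a sketch under assumptions: the helper names are hypothetical, and sha256sum is assumed to be available (GNU coreutils; other platforms may ship a different tool).

```shell
# Sketch: sanity-check a dump and record a checksum before uploading it.
# dump_is_readable, write_checksum and verify_checksum are hypothetical helpers.

# Succeed if pg_restore can read the archive's table of contents.
dump_is_readable() {
  pg_restore --list "$1" > /dev/null
}

# Write a SHA-256 sidecar file next to the dump ($1 -> $1.sha256).
# Runs in the dump's directory so the sidecar stores a relative file name.
write_checksum() {
  ( cd "$(dirname "$1")" && sha256sum "$(basename "$1")" > "$(basename "$1").sha256" )
}

# Re-check a dump against its sidecar, e.g. after downloading it again.
verify_checksum() {
  ( cd "$(dirname "$1")" && sha256sum -c "$(basename "$1").sha256" > /dev/null )
}
```

Keeping the .sha256 file with the dump in durable storage lets you run verify_checksum on the downloaded copy before a restore.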

CLI backup (config + knowledge store, not DB)

# On a Quadlet/CLI node — backs up config, tools, knowledge store to a tar.gz
alexandria backup [--output ./alexandria-backup-$(date +%Y-%m-%d).tar.gz]

This produces a MANIFEST.json-anchored archive. It does not include Postgres data.

Restoring from a Postgres snapshot

Stop or scale down Alexandria pods to prevent writes during restore:

kubectl scale deployment/alexandria-ee -n <namespace> --replicas=0

Restore into the target database (drop and recreate if needed):

DSN=$(kubectl get secret alexandria-ee -n <namespace> \
      -o jsonpath='{.data.database-dsn}' | base64 -d)

# Drop and recreate the schema (destructive — confirm first)
psql "$DSN" -c "DROP SCHEMA public CASCADE; CREATE SCHEMA public;"

pg_restore --dbname="$DSN" \
  --format=custom \
  --no-owner \
  --no-privileges \
  alexandria-ee-<timestamp>.pgdump

Scale pods back up:

kubectl scale deployment/alexandria-ee -n <namespace> --replicas=<count>
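
The <count> value is whatever the deployment ran with before it was scaled to zero. One way to avoid guessing is to save it before the scale-down. The sketch below assumes the deployment name used in this runbook; the function names and the state file are hypothetical.

```shell
# Sketch: remember the replica count before scaling to zero, then reuse it.
# save_replicas and restore_replicas are hypothetical helpers.

# Save the deployment's current .spec.replicas. Args: namespace, state file.
save_replicas() {
  kubectl get deployment/alexandria-ee -n "$1" \
    -o jsonpath='{.spec.replicas}' > "$2"
}

# Scale the deployment back to the saved count. Args: namespace, state file.
restore_replicas() {
  local count
  count=$(cat "$2") || return 1
  # Refuse to proceed unless the saved value is a non-negative integer.
  case "$count" in
    ''|*[!0-9]*) echo "unexpected replica count: '$count'" >&2; return 1 ;;
  esac
  kubectl scale deployment/alexandria-ee -n "$1" --replicas="$count"
}
```

save_replicas has to run before the scale-down step; if it runs afterwards it will record 0.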

Verify audit chain integrity after restore. The audit HMAC chain must be re-checked because a partial restore or replay could produce a broken chain:
- There is no standalone alex-cli audit verify command today. Audit chain verification is performed internally by the Go API via RetireUnverifiableAuditRows in api-go/internal/store/audit.go.
- Manual step: after restore, check the API logs at startup for any audit integrity warnings. If your compliance policy requires a full chain re-verification, query the audit_logs table and re-run the HMAC chain computation against the restored rows before returning the system to production.
- If a future alex-cli audit verify command is added, run it here.
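
To illustrate what a manual chain re-computation might look like, here is a sketch built on a HYPOTHETICAL chain layout: each row's MAC is HMAC-SHA256 over the previous row's MAC concatenated with the row payload. The real layout, column names, and key handling are defined by api-go/internal/store/audit.go and may differ entirely, so treat this only as a template to adapt. It assumes openssl is installed and that rows have been exported in chain order as tab-separated "payload, stored MAC" lines.

```shell
# HYPOTHETICAL chain layout, for illustration only:
#   mac[i] = HMAC-SHA256(key, mac[i-1] + payload[i]); the first row uses an empty prefix.
# The real layout is whatever api-go/internal/store/audit.go implements.

# Print the hex HMAC-SHA256 of stdin under key $1.
# Note: passing the key as an argument exposes it to the process list.
hmac_hex() {
  # openssl prints "<label>(stdin)= <hex>"; the hex digest is the last field.
  openssl dgst -sha256 -hmac "$1" | awk '{ print $NF }'
}

# Read "payload<TAB>stored_mac" rows on stdin, in chain order, and stop at
# the first row whose stored MAC does not match the recomputed one. Arg: key.
verify_chain() {
  local key="$1"
  local prev=''
  local row=0
  local payload stored expected
  while IFS="$(printf '\t')" read -r payload stored; do
    row=$((row + 1))
    expected=$(printf '%s%s' "$prev" "$payload" | hmac_hex "$key")
    if [ "$expected" != "$stored" ]; then
      echo "chain breaks at row $row" >&2
      return 1
    fi
    prev="$stored"
  done
  echo "verified $row rows"
}
```

A break reported by a script like this is a reason to capture the affected rows for review, not to edit them.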

Verification

Pods reach Ready state: kubectl get pods -n <namespace>
/ready returns 200: curl -sf https://<host>/ready
Spot-check a known record: curl -sf https://<host>/license should return the correct license metadata.
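
Pods can take a while to become ready after a restore, so a single curl may fail spuriously. A small polling loop is one option. wait_ready is a hypothetical helper, and the attempt count and delay are arbitrary defaults to tune.

```shell
# Sketch: poll the readiness endpoint until it answers or attempts run out.
# wait_ready is a hypothetical helper, not an Alexandria command.

# Args: base URL, max attempts (default 30), delay in seconds (default 5).
wait_ready() {
  local url="$1"
  local attempts="${2:-30}"
  local delay="${3:-5}"
  local i=1
  while [ "$i" -le "$attempts" ]; do
    # -f makes curl fail on HTTP error statuses (>= 400); -s hides progress.
    # This accepts any non-error response, not strictly 200.
    if curl -sf -o /dev/null "$url/ready"; then
      echo "ready after $i attempt(s)"
      return 0
    fi
    # Skip the pause after the final attempt.
    [ "$i" -lt "$attempts" ] && sleep "$delay"
    i=$((i + 1))
  done
  echo "not ready after $attempts attempts" >&2
  return 1
}
```

Usage would look like `wait_ready https://<host> 30 5`, followed by the /license spot-check once it succeeds.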

Rollback

If the restore is incomplete or the chain is broken, restore from a different snapshot. Do not attempt to patch audit rows manually — capture the affected rows for compliance review first, then consider RetireUnverifiableAuditRows (see incident-response.md — it is destructive).
