
Runbook: Postgres Backup and Restore

When to use this

Use this runbook to take a point-in-time snapshot of the Alexandria EE Postgres database (before upgrades or as part of a scheduled backup policy), or to restore from a snapshot after data loss or a failed migration.

Postgres is always external to the cluster — there is no in-cluster database pod. All data is in the external Postgres instance referenced by the DSN secret.

Pre-checks

  • Locate the database DSN. It is stored in one of two places:
    • Kubernetes Secret: kubectl get secret alexandria-ee -n <namespace> -o jsonpath='{.data.database-dsn}' | base64 -d
    • Vault KV v2 (when vault.enabled=true): check the path under vault.prefix (default: alexandria/llm/) in your site values.
  • Confirm the backup destination has enough disk space (pg_dump output is typically 10–50% of raw table size).
  • If you run alex backup (Quadlet/CLI installs), note that it backs up config files, tools, and the knowledge store only. It does NOT back up the Postgres database. Run pg_dump separately for the DB.
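
The disk-space pre-check can be scripted. The following is a minimal sketch, not part of the Alexandria tooling: the function names are hypothetical, and the threshold of half the database size is an assumption taken from the upper end of the 10–50% estimate above. Note that pg_database_size includes indexes, so the check errs on the conservative side.

```shell
# Sketch only: function names and the 50% margin are illustrative assumptions.

# Print the on-disk size of the connected database in bytes.
db_size_bytes() {
  psql "$1" -At -c "SELECT pg_database_size(current_database());"
}

# Print the free space (in bytes) on the filesystem holding directory $1.
free_bytes() {
  free_kb=$(df -Pk "$1" | awk 'NR==2 {print $4}')
  echo $((free_kb * 1024))
}

# Succeed if free space ($2) covers at least half the database size ($1).
enough_space_for_dump() {
  db_bytes=$1
  free=$2
  [ "$free" -ge $((db_bytes / 2)) ]
}

# Usage (with DSN set as in the pre-checks):
#   enough_space_for_dump "$(db_size_bytes "$DSN")" "$(free_bytes .)" \
#     || echo "Not enough free space for the dump" >&2
```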

Procedure

Taking a backup

# Export the DSN from the k8s Secret
DSN=$(kubectl get secret alexandria-ee -n <namespace> \
-o jsonpath='{.data.database-dsn}' | base64 -d)

# Snapshot to a timestamped file
TIMESTAMP=$(date +%Y%m%dT%H%M%S)
pg_dump "$DSN" \
--format=custom \
--compress=9 \
--file="alexandria-ee-${TIMESTAMP}.pgdump"

Store the .pgdump file in durable storage (GCS bucket, S3, encrypted volume) before proceeding with any upgrade or migration.
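
Before uploading, it can be worth confirming that the dump is readable and recording a checksum so the file can be re-verified after it is downloaded again. This is a sketch under assumptions: the function names are hypothetical, and sha256sum is assumed to be available (GNU coreutils; on other systems a different checksum tool may be needed).

```shell
# Sketch only: function names are illustrative, not part of any Alexandria tooling.

# Succeed if pg_restore can read the archive's table of contents.
dump_is_readable() {
  pg_restore --list "$1" > /dev/null
}

# Write a SHA-256 checksum file next to the dump.
write_checksum() {
  sha256sum "$1" > "$1.sha256"
}

# Re-check the dump against its recorded checksum.
# Run from the same directory (or with the same path) used by write_checksum,
# because the checksum file stores the path as it was given.
verify_checksum() {
  sha256sum -c "$1.sha256" > /dev/null
}

# Usage:
#   dump_is_readable "alexandria-ee-${TIMESTAMP}.pgdump" \
#     && write_checksum "alexandria-ee-${TIMESTAMP}.pgdump"
```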

CLI backup (config + knowledge store, not DB)

# On a Quadlet/CLI node: backs up config, tools, and knowledge store to a tar.gz.
# The --output flag is optional.
alexandria backup --output "./alexandria-backup-$(date +%Y-%m-%d).tar.gz"

This produces a MANIFEST.json-anchored archive. It does not include Postgres data.
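
A quick sanity check on the archive is to confirm that it lists a MANIFEST.json entry. The sketch below assumes only that the archive is a gzip-compressed tar, as stated above; the function name is hypothetical, and the location of MANIFEST.json inside the archive is not documented here, so any path ending in that file name is accepted.

```shell
# Sketch only: the function name is illustrative.

# Succeed if the gzip-compressed tar archive $1 contains a MANIFEST.json entry
# at any depth.
archive_has_manifest() {
  tar -tzf "$1" | grep -Eq '(^|/)MANIFEST\.json$'
}

# Usage:
#   archive_has_manifest ./alexandria-backup-2024-01-31.tar.gz \
#     || echo "MANIFEST.json not found in archive" >&2
```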

Restoring from a Postgres snapshot

  1. Stop or scale down Alexandria pods to prevent writes during restore:

    kubectl scale deployment/alexandria-ee -n <namespace> --replicas=0
  2. Restore into the target database (drop and recreate if needed):

    DSN=$(kubectl get secret alexandria-ee -n <namespace> \
    -o jsonpath='{.data.database-dsn}' | base64 -d)

    # Drop and recreate the schema (destructive — confirm first)
    psql "$DSN" -c "DROP SCHEMA public CASCADE; CREATE SCHEMA public;"

    pg_restore --dbname="$DSN" \
    --format=custom \
    --no-owner \
    --no-privileges \
    alexandria-ee-<timestamp>.pgdump
  3. Scale pods back up:

    kubectl scale deployment/alexandria-ee -n <namespace> --replicas=<count>
  4. Verify audit chain integrity after restore. The audit HMAC chain must be re-checked because a partial restore or replay could produce a broken chain:

    • There is no standalone alex-cli audit verify command today. Audit chain verification is performed internally by the Go API via RetireUnverifiableAuditRows in api-go/internal/store/audit.go.
    • Manual step: after restore, check the API logs at startup for any audit integrity warnings. If your compliance policy requires a full chain re-verification, query the audit_logs table and re-run the HMAC chain computation against the restored rows before returning the system to production.
    • If a future alex-cli audit verify command is added, run it here.
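
Step 3 needs the replica count that was in place before step 1 scaled the deployment to zero. One way to avoid guessing it is to record the count before scaling down. The sketch below assumes the deployment name alexandria-ee used above; the state-file path and function names are hypothetical.

```shell
# Sketch only: the state-file path and function names are illustrative assumptions.
REPLICA_FILE=${REPLICA_FILE:-./alexandria-ee.replicas}

# Record the current replica count for namespace $1, then scale to zero.
scale_down_for_restore() {
  ns=$1
  kubectl get deployment alexandria-ee -n "$ns" \
    -o jsonpath='{.spec.replicas}' > "$REPLICA_FILE" || return 1
  kubectl scale deployment/alexandria-ee -n "$ns" --replicas=0
}

# Scale back up to the count recorded by scale_down_for_restore.
scale_up_after_restore() {
  ns=$1
  count=$(cat "$REPLICA_FILE") || return 1
  kubectl scale deployment/alexandria-ee -n "$ns" --replicas="$count"
}

# Usage:
#   scale_down_for_restore <namespace>
#   ... run the restore ...
#   scale_up_after_restore <namespace>
```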

Verification

  • Pods reach Ready state: kubectl get pods -n <namespace>
  • /ready returns 200: curl -sf https://<host>/ready
  • Spot-check known data: curl -sf https://<host>/license should return the expected license metadata.
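
Pods can take a while to become ready after scale-up, so a single curl may fail spuriously. A small polling loop, sketched here with a hypothetical function name and arbitrary default retry values, retries the /ready check until it succeeds or gives up.

```shell
# Sketch only: function name and default retry values are illustrative.

# Poll URL $1 until curl -f succeeds (HTTP status below 400).
# Gives up after $2 attempts (default 30), sleeping $3 seconds between tries (default 5).
wait_for_ready() {
  url=$1
  attempts=${2:-30}
  delay=${3:-5}
  i=1
  while [ "$i" -le "$attempts" ]; do
    if curl -sf -o /dev/null "$url"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Usage:
#   wait_for_ready "https://<host>/ready" || echo "API did not become ready" >&2
```

kubectl rollout status deployment/alexandria-ee -n <namespace> offers a similar wait on the Kubernetes side, blocking until the rollout completes.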

Rollback

If the restore is incomplete or the audit chain is broken, restore from a different snapshot. Do not patch audit rows manually. Capture the affected rows for compliance review first, and only then consider RetireUnverifiableAuditRows, which is destructive (see incident-response.md).
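
Capturing audit rows for review can be done with a client-side export. The sketch below copies the entire audit_logs table to a CSV file, since which rows count as affected depends on your investigation. The function name is hypothetical, and no ordering is applied because the table's columns are not documented here.

```shell
# Sketch only: the function name is illustrative.

# Export the whole audit_logs table to CSV file $2 using connection string $1.
# The output path must not contain single quotes.
export_audit_rows() {
  psql "$1" -v ON_ERROR_STOP=1 \
    -c "\\copy (SELECT * FROM audit_logs) TO '$2' WITH (FORMAT csv, HEADER true)"
}

# Usage:
#   export_audit_rows "$DSN" "audit_logs-$(date +%Y%m%dT%H%M%S).csv"
```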