Alexandria EE — Day-2 Operations Runbook

Backup and Restore (Postgres)

Backup

Alexandria EE stores all persistent state in Postgres. Back up using pg_dump:

# Logical dump (recommended — portable, schema-aware)
pg_dump \
  -h pg.internal \
  -U alex \
  -d alexandria \
  --format=custom \
  --compress=9 \
  --file=alexandria-$(date +%Y%m%d-%H%M%S).dump

# Verify the most recent dump is readable (pg_restore accepts a single archive)
pg_restore --list "$(ls -t alexandria-*.dump | head -1)" | head -20

Schedule this as a cron job or use your cloud provider's managed Postgres backup feature (Cloud SQL automated backups, RDS snapshots).
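
For the cron route, a wrapper along the following lines could run the dump and verify it in one step. This is a minimal sketch, not something shipped with the product: the function names dump_name and backup_alexandria, the BACKUP_DIR variable, and its default path are hypothetical, and it assumes credentials are supplied through ~/.pgpass or PGPASSWORD.

```shell
#!/bin/sh
# Hypothetical cron wrapper around the pg_dump command shown above.
# Assumes pg_dump and pg_restore are on PATH and credentials come from ~/.pgpass.

# Print a timestamped dump filename, e.g. alexandria-20260506-120000.dump
dump_name() {
  printf 'alexandria-%s.dump\n' "$(date +%Y%m%d-%H%M%S)"
}

# Dump the database into BACKUP_DIR, then confirm the archive is readable.
backup_alexandria() {
  dir=${BACKUP_DIR:-/var/backups/alexandria}   # assumed location
  file="$dir/$(dump_name)"
  mkdir -p "$dir" || return 1
  pg_dump -h pg.internal -U alex -d alexandria \
    --format=custom --compress=9 --file="$file" || return 1
  # pg_restore --list reads the archive's table of contents; a corrupt dump fails here.
  pg_restore --list "$file" >/dev/null || { echo "unreadable dump: $file" >&2; return 1; }
  echo "$file"
}
```

Saved as a script that ends with a call to backup_alexandria, this could be run from a crontab entry such as `0 2 * * * /usr/local/bin/alexandria-backup.sh` (path hypothetical). A non-zero exit status signals a failed or unreadable dump.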

Restore

# 1. Terminate API connections to the database
kubectl scale deployment alexandria-ee -n alexandria --replicas=0
# (or stop the alexandria-api systemd unit on Quadlet hosts)

# 2. Drop and recreate the database (for a clean restore)
psql -h pg.internal -U postgres -c "DROP DATABASE IF EXISTS alexandria;"
psql -h pg.internal -U postgres -c "CREATE DATABASE alexandria OWNER alex;"

# 3. Restore the dump
pg_restore \
  -h pg.internal \
  -U alex \
  -d alexandria \
  --no-owner \
  --role=alex \
  alexandria-20260506-120000.dump

# 4. Restart the API — it will re-apply any schema migrations newer than the dump
kubectl scale deployment alexandria-ee -n alexandria --replicas=1
kubectl rollout status deployment/alexandria-ee -n alexandria
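
After step 4, an explicit readiness wait can confirm that the API has reconnected to the restored database before the restore is declared complete. The helper below is a sketch under stated assumptions, not a shipped tool: retry_until is a hypothetical name, and the probe in the usage note reuses the /ready endpoint from the incident-response checks in this runbook.

```shell
# retry_until ATTEMPTS DELAY CMD...
# Run CMD up to ATTEMPTS times, sleeping DELAY seconds between failures.
# Returns 0 as soon as CMD succeeds, 1 if every attempt fails.
retry_until() {
  attempts=$1; delay=$2; shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    i=$((i + 1))
    # Skip the sleep after the final failed attempt.
    [ "$i" -le "$attempts" ] && sleep "$delay"
  done
  return 1
}
```

For example, `retry_until 30 2 kubectl exec -n alexandria deploy/alexandria-ee -c api -- curl -fsS http://localhost:8080/ready` would poll for up to about a minute. It complements kubectl rollout status rather than replacing it.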

Monitoring

Prometheus Metrics

Enable the metrics endpoint in values.yaml:

metrics:
  enabled: true
  serviceMonitor:
    enabled: true     # requires Prometheus Operator
    interval: "30s"

Key metrics exposed at GET /metrics:

Metric	Type	Description
`alexandria_http_requests_total`	Counter	HTTP requests by method, path, status
`alexandria_http_request_duration_seconds`	Histogram	Request latency by method, path
`cache_hit_total`	Counter	Cache hits by backend (inmem/redis/memcached)
`cache_miss_total`	Counter	Cache misses by backend
`cache_duration_ms`	Histogram	Round-trip cache latency by backend and operation

Sample PrometheusRule Alerts

The chart ships a PrometheusRule template with the alerts listed below. Enable it with:

metrics:
  enabled: true
  prometheusRule:
    enabled: true

Included alerts:

AlexandriaHighErrorRate — fires when 5xx rate > 5% over 5 minutes.
AlexandriaAPIDown — fires when no successful /ready scrape for 2 minutes.
AlexandriaHighLatency — fires when P99 request latency > 5 seconds over 5 minutes.
AlexandriaCacheHitRateLow — fires when Memcached hit rate < 50% over 10 minutes (only meaningful with memcached_cache entitlement).
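
To check by hand whether AlexandriaHighErrorRate should be firing, the same kind of ratio can be queried through the Prometheus HTTP API. The expression below is an assumption about how the alert is defined, not a copy of the chart's rule; it relies on the status label listed in the metrics table. error_rate_query and prom_query are hypothetical helper names.

```shell
# Build a PromQL expression for the share of 5xx responses over a window.
# Assumed shape; the chart's actual rule expression may differ.
error_rate_query() {
  window=${1:-5m}
  printf 'sum(rate(alexandria_http_requests_total{status=~"5.."}[%s])) / sum(rate(alexandria_http_requests_total[%s]))\n' "$window" "$window"
}

# Run an instant query against a Prometheus-compatible API.
# $1 = base URL, $2 = PromQL expression
prom_query() {
  curl -fsS --get "$1/api/v1/query" --data-urlencode "query=$2"
}
```

A call such as `prom_query http://otel-lgtm.observability.svc.cluster.local:9090 "$(error_rate_query 5m)" | jq '.data.result'` returns the current ratio; a value above 0.05 corresponds to the 5% threshold.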

Grafana

If using the OTel LGTM stack (otel.enabled=true), Grafana is available at port 3000 on the otel-lgtm service. The Prometheus data source is auto-configured. Import the Alexandria dashboard from k8s/o11y/ if present.

HA Constraints and Upgrade Downtime

Current HA constraint

The main deployment uses a ReadWriteOnce PVC and updateStrategy.type: Recreate. Only one pod can run at a time. This means:

Every upgrade incurs a brief downtime (15–60 seconds; see UPGRADE.md).
No horizontal scaling of the main pod while RWO PVC is in use.

Path to zero-downtime upgrades

Migrate from SQLite to Postgres for all state (already the case in EE — SQLite is only used in CE).
Move to a ReadWriteMany PVC (e.g., Google Filestore NFS, Amazon EFS) or eliminate the PVC dependency entirely.
Switch updateStrategy to RollingUpdate with maxUnavailable: 0.

This is on the product roadmap. Until then, schedule upgrades during low-traffic windows.

Stateless components

The following components are not bound by the main pod's RWO PVC constraint and can be scaled or made highly available independently:

vector-index — stateless gRPC service; scale via vectorIndex.replicaCount.
memcached — scales via cache.memcached.replicaCount; Memcached nodes do not replicate or share data, so clients should spread keys across them with consistent hashing.
redis — single-instance; for HA use Redis Sentinel or Cluster (external).

Incident Response

API is returning 5xx

# 1. Check pod status
kubectl get pods -n alexandria

# 2. Check API logs
kubectl logs -n alexandria -l app.kubernetes.io/name=alexandria-ee -c api --since=10m

# 3. Check orchestrator socket
kubectl exec -n alexandria deploy/alexandria-ee -c api -- \
  ls -la /var/run/alexandria/

# 4. Check DB connectivity
kubectl exec -n alexandria deploy/alexandria-ee -c api -- \
  curl -s http://localhost:8080/ready

# 5. Check audit log for recent errors
curl -s 'https://alexandria.example.com/admin/audit?limit=50' \
  -H "Authorization: Bearer <token>" | jq '.entries[] | select(.action | contains("error"))'

Admin locked out

If all admin users are disabled or the admin password is lost:

# Reset via helm upgrade with a new adminPassword
helm upgrade alexandria-ee k8s/helm/alexandria-ee/ \
  --reuse-values \
  --set auth.adminPassword='<new-password>'

# The API seeds/re-syncs the admin user from ALEX_ADMIN_PASSWORD on startup.

License expired

# Check current license state
curl -s https://alexandria.example.com/admin/license \
  -H "Authorization: Bearer <token>" | jq .

# Apply a new key (see docs/licensing.md)

Audit chain tampered

curl -s -X POST https://alexandria.example.com/admin/audit/verify \
  -H "Authorization: Bearer <token>" | jq .

# If verified: false, contact Alexandria support with the verify response.
# Do NOT modify the audit_log table — preserve evidence.

Certificate Rotation

JWT secret

Rotating the JWT secret invalidates all live tokens. Users will be logged out.

helm upgrade alexandria-ee k8s/helm/alexandria-ee/ \
  --reuse-values \
  --set auth.jwtSecret='<new-random-32-byte-hex>'

The helm.sh/resource-policy: keep annotation on the Secret prevents accidental rotation during helm upgrade --reuse-values. Always set the new secret explicitly.
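
The `<new-random-32-byte-hex>` placeholder can be produced from the kernel's random source. A minimal sketch, assuming a POSIX od and /dev/urandom; gen_hex_secret is a hypothetical helper name, and `openssl rand -hex 32` is an equivalent alternative where OpenSSL is installed.

```shell
# Print N random bytes (default 32) as lowercase hex, with no separators.
gen_hex_secret() {
  # od emits space-separated hex bytes; tr strips the spaces and line breaks.
  od -An -N"${1:-32}" -tx1 /dev/urandom | tr -d ' \n'
  echo
}
```

The result can be passed straight to the upgrade, for example `--set auth.jwtSecret="$(gen_hex_secret)"`.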

SAML signing certificate

The chart creates a SAML signing certificate via a Job on first install. To rotate:

kubectl delete job -n alexandria -l app.kubernetes.io/component=saml-cert
helm upgrade alexandria-ee k8s/helm/alexandria-ee/ --reuse-values

The new certificate will need to be registered with each configured SAML IdP.

KEDA Backend Autoscaling

KEDA-driven autoscaling scales the LLM orchestrator deployment based on the number of in-flight completion requests observed by the OTel collector. This requires the backend_autoscaling EE license entitlement.

Prerequisites

KEDA controller installed in the cluster. Install with Helm:

helm repo add kedacore https://kedacore.github.io/charts
helm upgrade --install keda kedacore/keda \
  --namespace keda --create-namespace

A Prometheus-compatible API scraping the orchestrator's OTel metrics. The otel-lgtm bundle in k8s/o11y/ exposes Mimir's PromQL on port 9090 of the otel-lgtm Service in the observability namespace.
EE license with the backend_autoscaling entitlement.

Enabling KEDA scaling

In your values.yaml:

autoscale:
  keda:
    enabled: true
    minReplicas: 1
    maxReplicas: 8
    pollingIntervalSeconds: 30
    cooldownSeconds: 300
    triggerThreshold: "5"    # scale up when avg in-flight requests per pod >= 5
    prometheus:
      serverAddress: "http://otel-lgtm.observability.svc.cluster.local:9090"
      query: "sum(alexandria_backend_in_flight_requests)"

license:
  declaredEntitlements:
    - backend_autoscaling

The chart renders a ScaledObject that targets the main Alexandria deployment. Override autoscale.keda.scaleTargetName if the orchestrator runs in a separate deployment.

Expected scaling behavior

Condition	Action
`sum(in_flight_requests) / replicas >= triggerThreshold`	KEDA adds a replica (up to `maxReplicas`)
Metric falls below threshold for `cooldownSeconds`	KEDA removes a replica (down to `minReplicas`)
Prometheus unreachable	KEDA pauses scaling (does not scale down aggressively)
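
The first row can be made concrete with a little arithmetic. Assuming the trigger uses KEDA's default AverageValue metric type, the underlying HPA targets roughly ceil(query result / triggerThreshold) replicas, clamped to the configured bounds. The function below is an illustrative calculation only: desired_replicas is a hypothetical name, and it ignores HPA tolerance and stabilisation windows.

```shell
# desired_replicas TOTAL THRESHOLD MIN MAX
# TOTAL is the value of sum(alexandria_backend_in_flight_requests).
desired_replicas() {
  awk -v total="$1" -v thr="$2" -v min="$3" -v max="$4" 'BEGIN {
    want = total / thr
    n = int(want)
    if (n < want) n++          # ceiling for positive values
    if (n < min) n = min       # never below minReplicas
    if (n > max) n = max       # never above maxReplicas
    print n
  }'
}
```

With the sample values above (triggerThreshold 5, minReplicas 1, maxReplicas 8), `desired_replicas 23 5 1 8` prints 5, and `desired_replicas 100 5 1 8` is capped at 8.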

Reconciler interaction

When KEDA is enabled (autoscale.keda.enabled=true), the model controller reconciler defers replica count management to KEDA. Specifically: the reconciler will not snap the replica count back to DesiredReplicas on each tick — it only enforces min/max bounds at Deployment creation time and allows KEDA to scale within those bounds. This prevents a fight between the reconciler and the KEDA controller.

Tuning guidance

triggerThreshold: Set based on your backend's latency budget. Lower values (3–5) provide faster scale-up for latency-sensitive workloads. Higher values (10–20) are appropriate for throughput-optimised deployments where batching amortises per-request cost.
cooldownSeconds: 300s (5 minutes) is a safe default. GPU nodes take time to provision; premature scale-down wastes the startup cost. Increase to 600–900s for GPU-backed deployments.
pollingIntervalSeconds: 30s is the KEDA default. Reduce to 15s if you need faster reaction to burst traffic; below 10s is rarely useful given metric aggregation lag.

Disabling KEDA

Set autoscale.keda.enabled=false (or remove the backend_autoscaling entitlement). The ScaledObject will not be rendered and the reconciler resumes managing replicas directly via DesiredReplicas. The existing HPA (autoscaling.enabled) still scales the API pod on CPU/memory and is independent of KEDA.

KEDA Backend Autoscaling (autoscaling.keda) — Dual-Trigger Tuning

The autoscaling.keda block (distinct from autoscale.keda) governs autoscaling of the runtime-created LLM backend Deployments managed by alex-model-controller. It uses two Prometheus triggers simultaneously — KEDA scales up when either trigger fires.

Trigger summary

Trigger	Metric	Default threshold	Semantic
`queryCount`	`rate(query_count_total[1m])`	10 queries/second	Throughput pressure
`latencyP95`	`histogram_quantile(0.95, rate(llm_duration_ms_bucket[1m]))`	5000ms	Latency quality-of-service guard
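
Because either trigger is enough to scale up, it helps to see which one is responsible at a given moment. A small sketch using the default thresholds from the table; firing_triggers is a hypothetical helper, and its inputs are values you would read from Prometheus yourself.

```shell
# firing_triggers QPS_PER_REPLICA P95_MS [QPS_THRESHOLD] [P95_THRESHOLD_MS]
# Prints the names of the triggers whose threshold is met, or "none".
firing_triggers() {
  awk -v qps="$1" -v p95="$2" -v qthr="${3:-10}" -v lthr="${4:-5000}" 'BEGIN {
    out = ""
    if (qps + 0 >= qthr + 0) out = "queryCount"
    if (p95 + 0 >= lthr + 0) out = (out == "" ? "" : out " ") "latencyP95"
    print (out == "" ? "none" : out)
  }'
}
```

For instance, `firing_triggers 2 6200` prints latencyP95: throughput is well under 10 queries/second, but tail latency has crossed the 5000ms guard, so a scale-up would be latency-driven.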

Tuning the queryCount trigger

queryCountThreshold controls how many queries per second each replica should absorb before KEDA adds another. Tune based on your backend's concurrency capacity:

Low concurrency backends (single-GPU, sequential inference): set to 1–3. Each query blocks the GPU; more than a few concurrent requests will stack up.
High concurrency backends (multi-GPU, batched inference, vLLM): set to 20–50. Batching amortises per-request GPU cost; a single replica can absorb many concurrent requests.
Use rate(query_count_total[5m]) instead of [1m] if your traffic is bursty — the longer window smooths out transient spikes and reduces unnecessary scale events.

Tuning the latencyP95 trigger

latencyP95ThresholdMs is a QoS guard: scale up when the P95 latency crosses an SLO boundary, regardless of throughput. This catches saturation that the throughput metric misses (e.g. when a single very slow request is inflating tail latency without driving up query rate).

Interactive workloads (chat, low-latency API): 2000–3000ms. Users notice latency above 2 seconds; scale early.
Batch workloads (document processing, background agents): 10000–30000ms. Tail latency is less user-facing; scale conservatively to avoid over-provisioning.
If llm_duration_ms_bucket is missing from your Prometheus instance, check that otel.enabled is true and the orchestrator is emitting OTLP metrics to the collector. Run kubectl exec ... -c api -- curl -s localhost:4318/metrics | grep llm_duration to verify.

Replica bounds

autoscaling.keda.minReplicas and maxReplicas apply to every backend Deployment the reconciler manages. To set per-backend bounds, use the backend's min_replicas and max_replicas fields in the admin API. The reconciler applies those values when initialising a new Deployment, and KEDA then scales within the ScaledObject's min/max.

Prometheus endpoint

The default serverAddress (http://otel-lgtm.o11y.svc.cluster.local:9090) targets the otel-lgtm bundle in k8s/o11y/. For air-gapped clusters or external Prometheus/Mimir/Thanos:

autoscaling:
  keda:
    prometheus:
      serverAddress: "http://prometheus.monitoring.svc.cluster.local:9090"

Verifying ScaledObject creation

After upgrading with autoscaling.keda.enabled=true, confirm the reconciler created ScaledObjects:

# List ScaledObjects in the workload namespace
kubectl get scaledobjects -n <release-namespace>

# Check KEDA's view of trigger health
kubectl describe scaledobject <backend-name> -n <release-namespace>

# Check the KEDA operator logs for scaler errors (e.g. Prometheus unreachable)
kubectl logs -n keda -l app=keda-operator | grep <backend-name>

If ScaledObjects are not appearing after a reconcile cycle, check model-controller logs:

kubectl logs -n <release-namespace> \
  -l app.kubernetes.io/name=alexandria-ee-model-controller \
  --since=5m | grep keda
