Alexandria EE — Day-2 Operations Runbook


Backup and Restore (Postgres)

Backup

Alexandria EE stores all persistent state in Postgres. Back up using pg_dump:

# Logical dump (recommended — portable, schema-aware)
pg_dump \
-h pg.internal \
-U alex \
-d alexandria \
--format=custom \
--compress=9 \
--file=alexandria-$(date +%Y%m%d-%H%M%S).dump

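# Verify the most recent dump is readable (pg_restore accepts a single archive,
# so a glob that matches several dumps would fail)
pg_restore --list "$(ls -t alexandria-*.dump | head -1)" | head -20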

Schedule this as a cron job or use your cloud provider's managed Postgres backup feature (Cloud SQL automated backups, RDS snapshots).
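
For the cron route, a wrapper along these lines may be a useful starting point. It is a minimal sketch, not something shipped with the chart: BACKUP_DIR, RETENTION_DAYS, and both function names are hypothetical, and it assumes Postgres credentials are supplied through ~/.pgpass or PGPASSWORD.

```shell
# Hypothetical location and retention; adjust for your environment.
BACKUP_DIR="${BACKUP_DIR:-/var/backups/alexandria}"
RETENTION_DAYS="${RETENTION_DAYS:-14}"

# Take one dump and fail if its table of contents cannot be read back.
run_backup() {
  out="$BACKUP_DIR/alexandria-$(date +%Y%m%d-%H%M%S).dump"
  mkdir -p "$BACKUP_DIR" || return 1
  pg_dump -h pg.internal -U alex -d alexandria \
    --format=custom --compress=9 --file="$out" || return 1
  pg_restore --list "$out" > /dev/null || return 1
  echo "$out"
}

# Delete dumps whose modification time is more than RETENTION_DAYS days old.
prune_old_backups() {
  find "$BACKUP_DIR" -maxdepth 1 -type f -name 'alexandria-*.dump' \
    -mtime +"$RETENTION_DAYS" -exec rm -f {} +
}
```

A cron entry could source this file and call run_backup && prune_old_backups, so that old dumps are pruned only after a new one has been written and verified.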

Restore

# 1. Terminate API connections to the database
kubectl scale deployment alexandria-ee -n alexandria --replicas=0
# (or stop the alexandria-api systemd unit on Quadlet hosts)

# 2. Drop and recreate the database (for a clean restore)
psql -h pg.internal -U postgres -c "DROP DATABASE IF EXISTS alexandria;"
psql -h pg.internal -U postgres -c "CREATE DATABASE alexandria OWNER alex;"

# 3. Restore the dump
pg_restore \
-h pg.internal \
-U alex \
-d alexandria \
--no-owner \
--role=alex \
alexandria-20260506-120000.dump

# 4. Restart the API — it will re-apply any schema migrations newer than the dump
kubectl scale deployment alexandria-ee -n alexandria --replicas=1
kubectl rollout status deployment/alexandria-ee -n alexandria
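
For an explicit check that the API can reach the restored database, independent of how the Deployment's readiness probe is configured, the /ready probe used in the incident-response steps can be polled. A minimal sketch; the retry and api_ready helpers and the attempt and delay values are hypothetical.

```shell
# retry ATTEMPTS DELAY_SECONDS COMMAND...
# Runs COMMAND until it succeeds or ATTEMPTS is exhausted.
retry() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    i=$((i + 1))
    # Sleep only if another attempt follows.
    if [ "$i" -le "$attempts" ]; then
      sleep "$delay"
    fi
  done
  return 1
}

# Probe /ready inside the api container; curl -f turns HTTP errors into a
# non-zero exit status.
api_ready() {
  kubectl exec -n alexandria deploy/alexandria-ee -c api -- \
    curl -fs http://localhost:8080/ready > /dev/null
}
```

Calling retry 30 2 api_ready would wait up to roughly one minute before giving up.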

Monitoring

Prometheus Metrics

Enable the metrics endpoint in values.yaml:

metrics:
  enabled: true
  serviceMonitor:
    enabled: true # requires Prometheus Operator
    interval: "30s"

Key metrics exposed at GET /metrics:

| Metric | Type | Description |
| --- | --- | --- |
| alexandria_http_requests_total | Counter | HTTP requests by method, path, status |
| alexandria_http_request_duration_seconds | Histogram | Request latency by method, path |
| cache_hit_total | Counter | Cache hits by backend (inmem/redis/memcached) |
| cache_miss_total | Counter | Cache misses by backend |
| cache_duration_ms | Histogram | Round-trip cache latency by backend and operation |

Sample PrometheusRule Alerts

The chart ships a PrometheusRule template containing the alerts listed below. Enable it with:

metrics:
  enabled: true
  prometheusRule:
    enabled: true

Included alerts:

  • AlexandriaHighErrorRate — fires when 5xx rate > 5% over 5 minutes.
  • AlexandriaAPIDown — fires when no successful /ready scrape for 2 minutes.
  • AlexandriaHighLatency — fires when P99 request latency > 5 seconds over 5 minutes.
  • AlexandriaCacheHitRateLow — fires when Memcached hit rate < 50% over 10 minutes (only meaningful with memcached_cache entitlement).
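
To inspect the condition behind AlexandriaHighErrorRate by hand, a query along these lines can be run against any Prometheus-compatible API. This is a sketch, not the rule shipped in the chart: the label name status is inferred from the metric description above, and both function names are hypothetical.

```shell
# Build a PromQL expression for the share of 5xx responses over a window.
# Assumes alexandria_http_requests_total carries a "status" label.
error_ratio_query() {
  window=${1:-5m}
  printf 'sum(rate(alexandria_http_requests_total{status=~"5.."}[%s])) / sum(rate(alexandria_http_requests_total[%s]))' \
    "$window" "$window"
}

# Run an instant query through the Prometheus HTTP API (/api/v1/query).
# $1 is the base URL of the Prometheus-compatible server, $2 the expression.
prom_query() {
  curl -fsS -G "$1/api/v1/query" --data-urlencode "query=$2"
}
```

Passing the output of error_ratio_query 5m as the second argument to prom_query returns the current ratio; a value above 0.05 corresponds to the 5% threshold.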

Grafana

If using the OTel LGTM stack (otel.enabled=true), Grafana is available at port 3000 on the otel-lgtm service. The Prometheus data source is auto-configured. Import the Alexandria dashboard from k8s/o11y/ if present.


HA Constraints and Upgrade Downtime

Current HA constraint

The main deployment uses a ReadWriteOnce PVC and updateStrategy.type: Recreate. Only one pod can run at a time. This means:

  • Every upgrade incurs a brief downtime (15–60 seconds; see UPGRADE.md).
  • No horizontal scaling of the main pod while RWO PVC is in use.

Path to zero-downtime upgrades

  1. Migrate from SQLite to Postgres for all state (already the case in EE — SQLite is only used in CE).
  2. Move to a ReadWriteMany PVC (e.g., Google Filestore NFS, Amazon EFS) or eliminate the PVC dependency entirely.
  3. Switch updateStrategy to RollingUpdate with maxUnavailable: 0.

This is on the product roadmap. Until then, schedule upgrades during low-traffic windows.

Stateless components

Of the supporting components, vector-index and memcached can be scaled horizontally without constraint; redis cannot:

  • vector-index — stateless gRPC service; scale via vectorIndex.replicaCount.
  • memcached — scale via cache.memcached.replicaCount. Memcached nodes do not replicate data between one another, so clients should use consistent hashing to keep key placement stable as replicas are added or removed.
  • redis — single-instance; for HA, use an external Redis Sentinel or Cluster deployment.

Incident Response

API is returning 5xx

# 1. Check pod status
kubectl get pods -n alexandria

# 2. Check API logs
kubectl logs -n alexandria -l app.kubernetes.io/name=alexandria-ee -c api --since=10m

# 3. Check orchestrator socket
kubectl exec -n alexandria deploy/alexandria-ee -c api -- \
ls -la /var/run/alexandria/

# 4. Check DB connectivity
kubectl exec -n alexandria deploy/alexandria-ee -c api -- \
curl -s http://localhost:8080/ready

# 5. Check audit log for recent errors
curl -s 'https://alexandria.example.com/admin/audit?limit=50' \
-H "Authorization: Bearer <token>" | jq '.entries[] | select(.action | contains("error"))'

Admin locked out

If all admin users are disabled or the admin password is lost:

# Reset via helm upgrade with a new adminPassword
helm upgrade alexandria-ee k8s/helm/alexandria-ee/ \
--reuse-values \
--set auth.adminPassword='<new-password>'

# The API seeds/re-syncs the admin user from ALEX_ADMIN_PASSWORD on startup.

License expired

# Check current license state
curl -s https://alexandria.example.com/admin/license \
-H "Authorization: Bearer <token>" | jq .

# Apply a new key (see docs/licensing.md)

Audit chain tampered

curl -s -X POST https://alexandria.example.com/admin/audit/verify \
-H "Authorization: Bearer <token>" | jq .

# If verified: false, contact Alexandria support with the verify response.
# Do NOT modify the audit_log table — preserve evidence.

Certificate Rotation

JWT secret

Rotating the JWT secret invalidates all live tokens. Users will be logged out.

helm upgrade alexandria-ee k8s/helm/alexandria-ee/ \
--reuse-values \
--set auth.jwtSecret='<new-random-32-byte-hex>'

The helm.sh/resource-policy: keep annotation on the Secret prevents accidental rotation during helm upgrade --reuse-values. Always set the new secret explicitly.
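
One way to produce the <new-random-32-byte-hex> value is to read 32 bytes from the kernel's random source and hex-encode them. A minimal sketch using only standard utilities; the function name is hypothetical, and openssl rand -hex 32 is an equivalent alternative where OpenSSL is installed.

```shell
# Print 32 random bytes as 64 lowercase hex characters.
gen_jwt_secret() {
  head -c 32 /dev/urandom | od -An -tx1 | tr -d ' \n'
  echo
}
```

Keep in mind that a value passed with --set may end up in shell history; supplying it through a values file with restricted permissions avoids that.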

SAML signing certificate

The chart creates a SAML signing certificate via a Job on first install. To rotate:

kubectl delete job -n alexandria -l app.kubernetes.io/component=saml-cert
helm upgrade alexandria-ee k8s/helm/alexandria-ee/ --reuse-values

The new certificate will need to be registered with each configured SAML IdP.


KEDA Backend Autoscaling

KEDA-driven autoscaling scales the LLM orchestrator deployment based on the number of in-flight completion requests observed by the OTel collector. This requires the backend_autoscaling EE license entitlement.

Prerequisites

  • KEDA controller installed in the cluster. Install with Helm:
    helm repo add kedacore https://kedacore.github.io/charts
    helm upgrade --install keda kedacore/keda \
    --namespace keda --create-namespace
  • A Prometheus-compatible API scraping the orchestrator's OTel metrics. The otel-lgtm bundle in k8s/o11y/ exposes Mimir's PromQL on port 9090 of the otel-lgtm Service in the observability namespace.
  • EE license with the backend_autoscaling entitlement.

Enabling KEDA scaling

In your values.yaml:

autoscale:
  keda:
    enabled: true
    minReplicas: 1
    maxReplicas: 8
    pollingIntervalSeconds: 30
    cooldownSeconds: 300
    triggerThreshold: "5" # scale up when avg in-flight requests per pod >= 5
    prometheus:
      serverAddress: "http://otel-lgtm.observability.svc.cluster.local:9090"
      query: "sum(alexandria_backend_in_flight_requests)"

license:
  declaredEntitlements:
    - backend_autoscaling

The chart renders a ScaledObject that targets the main Alexandria deployment. Override autoscale.keda.scaleTargetName if the orchestrator runs in a separate deployment.

Expected scaling behavior

| Condition | Action |
| --- | --- |
| sum(in_flight_requests) / replicas >= triggerThreshold | KEDA adds a replica (up to maxReplicas) |
| Metric falls below threshold for cooldownSeconds | KEDA removes a replica (down to minReplicas) |
| Prometheus unreachable | KEDA pauses scaling (does not scale down aggressively) |
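
As a worked example of the scale-up and scale-down conditions: when a trigger targets an average value per replica, the Kubernetes HPA that KEDA drives computes roughly ceil(total metric / threshold) and clamps the result to the configured bounds. The sketch below reproduces only that arithmetic. It assumes the chart renders the trigger with an average-value target (consistent with the per-pod comment in the values example) and ignores HPA tolerance and stabilisation windows; the function name is hypothetical.

```shell
# desired_replicas TOTAL_IN_FLIGHT THRESHOLD MIN MAX
desired_replicas() {
  total=$1 threshold=$2 min=$3 max=$4
  # Integer ceiling of total / threshold.
  want=$(( (total + threshold - 1) / threshold ))
  [ "$want" -lt "$min" ] && want=$min
  [ "$want" -gt "$max" ] && want=$max
  echo "$want"
}

desired_replicas 23 5 1 8    # 5
desired_replicas 100 5 1 8   # 8 (capped at maxReplicas)
desired_replicas 0 5 1 8     # 1 (held at minReplicas)
```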

Reconciler interaction

When KEDA is enabled (autoscale.keda.enabled=true), the model controller reconciler defers replica count management to KEDA. Specifically: the reconciler will not snap the replica count back to DesiredReplicas on each tick — it only enforces min/max bounds at Deployment creation time and allows KEDA to scale within those bounds. This prevents a fight between the reconciler and the KEDA controller.

Tuning guidance

  • triggerThreshold: Set based on your backend's latency budget. Lower values (3–5) provide faster scale-up for latency-sensitive workloads. Higher values (10–20) are appropriate for throughput-optimised deployments where batching amortises per-request cost.
  • cooldownSeconds: 300s (5 minutes) is a safe default. GPU nodes take time to provision; premature scale-down wastes the startup cost. Increase to 600–900s for GPU-backed deployments.
  • pollingIntervalSeconds: 30s is the KEDA default. Reduce to 15s if you need faster reaction to burst traffic; below 10s is rarely useful given metric aggregation lag.

Disabling KEDA

Set autoscale.keda.enabled=false (or remove the backend_autoscaling entitlement). The ScaledObject will not be rendered and the reconciler resumes managing replicas directly via DesiredReplicas. The existing HPA (autoscaling.enabled) still scales the API pod on CPU/memory and is independent of KEDA.


KEDA Backend Autoscaling (autoscaling.keda) — Dual-Trigger Tuning

The autoscaling.keda block (distinct from autoscale.keda) governs autoscaling of the runtime-created LLM backend Deployments managed by alex-model-controller. It uses two Prometheus triggers simultaneously — KEDA scales up when either trigger fires.

Trigger summary

| Trigger | Metric | Default threshold | Semantics |
| --- | --- | --- | --- |
| queryCount | rate(query_count_total[1m]) | 10 queries/second | Throughput pressure |
| latencyP95 | histogram_quantile(0.95, rate(llm_duration_ms_bucket[1m])) | 5000ms | Latency quality-of-service guard |

Tuning the queryCount trigger

queryCountThreshold controls how many queries per second each replica should absorb before KEDA adds another. Tune based on your backend's concurrency capacity:

  • Low concurrency backends (single-GPU, sequential inference): set to 1–3. Each query blocks the GPU; more than a few concurrent requests will stack up.
  • High concurrency backends (multi-GPU, batched inference, vLLM): set to 20–50. Batching amortises per-request GPU cost; a single replica can absorb many concurrent requests.
  • Use rate(query_count_total[5m]) instead of [1m] if your traffic is bursty — the longer window smooths out transient spikes and reduces unnecessary scale events.
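
As a starting point for queryCountThreshold, Little's law gives a rough ceiling on what one replica can sustain: throughput is approximately the number of requests in service at once divided by the average time per request. This heuristic is not taken from the chart; the function name and the example figures are hypothetical, and measured capacity should always override the estimate.

```shell
# sustainable_qps CONCURRENCY AVG_LATENCY_MS
# Little's law: throughput = concurrency / average latency (in seconds).
sustainable_qps() {
  awk -v c="$1" -v ms="$2" 'BEGIN { printf "%.1f\n", c / (ms / 1000) }'
}

sustainable_qps 1 500     # 2.0  (sequential single-GPU backend)
sustainable_qps 32 1000   # 32.0 (batched backend)
```

Setting the threshold somewhat below the estimate leaves headroom for KEDA to add a replica before the existing ones saturate.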

Tuning the latencyP95 trigger

latencyP95ThresholdMs is a QoS guard: scale up when the P95 latency crosses an SLO boundary, regardless of throughput. This catches saturation that the throughput metric misses (e.g. when a single very slow request is inflating tail latency without driving up query rate).

  • Interactive workloads (chat, low-latency API): 2000–3000ms. Users notice latency above 2 seconds; scale early.
  • Batch workloads (document processing, background agents): 10000–30000ms. Tail latency is less user-facing; scale conservatively to avoid over-provisioning.
  • If llm_duration_ms_bucket is missing from your Prometheus instance, check that otel.enabled is true and the orchestrator is emitting OTLP metrics to the collector. Run kubectl exec ... -c api -- curl -s localhost:4318/metrics | grep llm_duration to verify.

Replica bounds

autoscaling.keda.minReplicas and maxReplicas apply to every backend Deployment the reconciler manages. To set per-backend bounds, use the backend's min_replicas and max_replicas fields in the admin API. The reconciler uses those fields as the floor when initialising a new Deployment, and KEDA then operates within the ScaledObject's min/max.

Prometheus endpoint

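The default serverAddress (http://otel-lgtm.o11y.svc.cluster.local:9090) targets the otel-lgtm bundle in k8s/o11y/. Note that this default assumes the o11y namespace, whereas the autoscale.keda example earlier in this runbook uses observability; set the address to match the namespace where the otel-lgtm Service actually runs. For air-gapped clusters or external Prometheus/Mimir/Thanos, override it: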

autoscaling:
  keda:
    prometheus:
      serverAddress: "http://prometheus.monitoring.svc.cluster.local:9090"

Verifying ScaledObject creation

After upgrading with autoscaling.keda.enabled=true, confirm the reconciler created ScaledObjects:

# List ScaledObjects in the workload namespace
kubectl get scaledobjects -n <release-namespace>

# Check KEDA's view of trigger health
kubectl describe scaledobject <backend-name> -n <release-namespace>

# Check the KEDA operator logs for scaler errors (e.g. Prometheus unreachable)
kubectl logs -n keda -l app=keda-operator | grep <backend-name>

If ScaledObjects are not appearing after a reconcile cycle, check model-controller logs:

kubectl logs -n <release-namespace> \
-l app.kubernetes.io/name=alexandria-ee-model-controller \
--since=5m | grep keda
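
For scripted checks, such as a post-upgrade smoke test, the Ready condition that KEDA records on each ScaledObject can be read directly. A minimal sketch; the function name is hypothetical, and its two arguments stand for the same backend name and release namespace used in the commands above.

```shell
# scaledobject_ready NAME NAMESPACE
# Succeeds only if the ScaledObject's Ready condition is "True".
scaledobject_ready() {
  status=$(kubectl get scaledobject "$1" -n "$2" \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}') || return 1
  [ "$status" = "True" ]
}
```

A non-zero exit status means either that the ScaledObject does not exist yet or that KEDA could not validate its triggers; kubectl describe shows which.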