Alexandria EE — Day-2 Operations Runbook
Backup and Restore (Postgres)
Backup
Alexandria EE stores all persistent state in Postgres. Back up using pg_dump:
# Logical dump (recommended: portable, schema-aware)
DUMP_FILE="alexandria-$(date +%Y%m%d-%H%M%S).dump"
pg_dump \
  -h pg.internal \
  -U alex \
  -d alexandria \
  --format=custom \
  --compress=9 \
  --file="$DUMP_FILE"
# Verify the dump is readable by listing its table of contents
pg_restore --list "$DUMP_FILE" | head -20
Schedule this as a cron job or use your cloud provider's managed Postgres backup feature (Cloud SQL automated backups, RDS snapshots).
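For the cron route, a wrapper script along the following lines could take the dump, verify it, and prune old archives. This is a minimal sketch, not something shipped with the product: the backup directory, the retention count, and both function names are assumptions chosen for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical backup wrapper. BACKUP_DIR, KEEP and both function names are
# illustrative assumptions, not part of Alexandria EE.
BACKUP_DIR="${BACKUP_DIR:-/var/backups/alexandria}"
KEEP="${KEEP:-14}"

# Delete all but the newest $2 dumps in directory $1.
# The timestamped file names sort chronologically, so a reverse lexical
# sort puts the newest dump first.
prune_old_dumps() {
  local dir="$1" keep="$2" f
  find "$dir" -maxdepth 1 -type f -name 'alexandria-*.dump' | sort -r |
    tail -n +"$((keep + 1))" |
    while IFS= read -r f; do rm -f -- "$f"; done
}

# Take one dump, check that pg_restore can read it, then prune.
run_backup() {
  local out
  out="$BACKUP_DIR/alexandria-$(date +%Y%m%d-%H%M%S).dump"
  pg_dump -h pg.internal -U alex -d alexandria \
    --format=custom --compress=9 --file="$out" || return 1
  pg_restore --list "$out" >/dev/null || return 1
  prune_old_dumps "$BACKUP_DIR" "$KEEP"
}

# To use from cron, end the script with a call to run_backup and add an
# entry such as (script path is hypothetical):
#   0 2 * * * /usr/local/bin/alexandria-backup.sh
```

Authentication (for example a ~/.pgpass file for the alex role) is deliberately left out of the sketch. Pruning runs only after the new dump has been verified, so a failed backup never removes older archives.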
Restore
# 1. Terminate API connections to the database
kubectl scale deployment alexandria-ee -n alexandria --replicas=0
# (or stop the alexandria-api systemd unit on Quadlet hosts)
# 2. Drop and recreate the database (for a clean restore)
psql -h pg.internal -U postgres -c "DROP DATABASE IF EXISTS alexandria;"
psql -h pg.internal -U postgres -c "CREATE DATABASE alexandria OWNER alex;"
# 3. Restore the dump
pg_restore \
-h pg.internal \
-U alex \
-d alexandria \
--no-owner \
--role=alex \
alexandria-20260506-120000.dump
# 4. Restart the API — it will re-apply any schema migrations newer than the dump
kubectl scale deployment alexandria-ee -n alexandria --replicas=1
kubectl rollout status deployment/alexandria-ee -n alexandria
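After step 4, it can help to wait for the readiness endpoint before declaring the restore complete. A small retry helper is sketched below. wait_until is a hypothetical name, and the example in the comment assumes the API port has been forwarded to localhost:8080; the /ready path is the one used in the incident-response checks in this runbook.

```shell
# Hypothetical helper: retry a command until it succeeds.
# usage: wait_until <attempts> <delay-seconds> <command> [args...]
wait_until() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    "$@" && return 0                           # command succeeded, stop polling
    [ "$i" -lt "$attempts" ] && sleep "$delay" # pause between attempts only
    i=$((i + 1))
  done
  return 1                                     # never succeeded
}

# Example (assumes a port-forward is running in another terminal:
#   kubectl port-forward -n alexandria deploy/alexandria-ee 8080:8080):
#   wait_until 30 2 curl -fsS -o /dev/null http://localhost:8080/ready
```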
Monitoring
Prometheus Metrics
Enable the metrics endpoint in values.yaml:
metrics:
  enabled: true
  serviceMonitor:
    enabled: true # requires Prometheus Operator
    interval: "30s"
Key metrics exposed at GET /metrics:
| Metric | Type | Description |
|---|---|---|
| alexandria_http_requests_total | Counter | HTTP requests by method, path, status |
| alexandria_http_request_duration_seconds | Histogram | Request latency by method, path |
| cache_hit_total | Counter | Cache hits by backend (inmem/redis/memcached) |
| cache_miss_total | Counter | Cache misses by backend |
| cache_duration_ms | Histogram | Round-trip cache latency by backend and operation |
Sample PrometheusRule Alerts
The chart ships a PrometheusRule template containing the alerts listed below. Enable it by setting metrics.prometheusRule.enabled=true:
metrics:
  enabled: true
  prometheusRule:
    enabled: true
Included alerts:
- AlexandriaHighErrorRate: fires when 5xx rate > 5% over 5 minutes.
- AlexandriaAPIDown: fires when there is no successful /ready scrape for 2 minutes.
- AlexandriaHighLatency: fires when P99 request latency > 5 seconds over 5 minutes.
- AlexandriaCacheHitRateLow: fires when Memcached hit rate < 50% over 10 minutes (only meaningful with the memcached_cache entitlement).
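To sanity-check an alert before relying on it, the underlying ratio can be queried from Prometheus directly. The sketch below builds plausible expressions for two of the alerts from the metrics in the table above. The chart's actual rule expressions are not reproduced here, so the label names (status, backend), the PROM_URL default, and the helper names are all assumptions.

```shell
# Assumed Prometheus base URL; adjust for your cluster.
PROM_URL="${PROM_URL:-http://localhost:9090}"

# Share of requests answered with a 5xx status over the given window.
# Assumes the counter carries a "status" label holding the HTTP code.
error_rate_expr() {
  local w="${1:-5m}"
  printf 'sum(rate(alexandria_http_requests_total{status=~"5.."}[%s])) / sum(rate(alexandria_http_requests_total[%s]))\n' "$w" "$w"
}

# Memcached hit ratio over the given window.
# Assumes a "backend" label with the value "memcached".
cache_hit_ratio_expr() {
  local w="${1:-10m}"
  printf 'sum(rate(cache_hit_total{backend="memcached"}[%s])) / (sum(rate(cache_hit_total{backend="memcached"}[%s])) + sum(rate(cache_miss_total{backend="memcached"}[%s])))\n' "$w" "$w" "$w"
}

# Run an instant query against the Prometheus HTTP API.
prom_query() {
  curl -fsS -G "$PROM_URL/api/v1/query" --data-urlencode "query=$1"
}

# Example: prom_query "$(error_rate_expr 5m)" | jq '.data.result'
```

A result above 0.05 for the first expression, or below 0.5 for the second, corresponds to the thresholds listed for AlexandriaHighErrorRate and AlexandriaCacheHitRateLow.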
Grafana
If using the OTel LGTM stack (otel.enabled=true), Grafana is available at port 3000 on the otel-lgtm service. The Prometheus data source is auto-configured. Import the Alexandria dashboard from k8s/o11y/ if present.
HA Constraints and Upgrade Downtime
Current HA constraint
The main deployment uses a ReadWriteOnce PVC and updateStrategy.type: Recreate. Only one pod can run at a time. This means:
- Every upgrade incurs a brief downtime (15–60 seconds; see UPGRADE.md).
- No horizontal scaling of the main pod while RWO PVC is in use.
Path to zero-downtime upgrades
- Migrate from SQLite to Postgres for all state (already the case in EE; SQLite is only used in CE).
- Move to a ReadWriteMany PVC (e.g., Google Filestore NFS, Amazon EFS) or eliminate the PVC dependency entirely.
- Switch updateStrategy to RollingUpdate with maxUnavailable: 0.
This is on the product roadmap. Until then, schedule upgrades during low-traffic windows.
Stateless components
The following components do not depend on the RWO PVC. The first two scale horizontally through chart values; Redis needs an external HA setup:
- vector-index: stateless gRPC service; scale via vectorIndex.replicaCount.
- memcached: scales via cache.memcached.replicaCount. Memcached nodes do not replicate or share state with each other, so clients should use consistent hashing to spread keys across them.
- redis: single-instance; for HA use Redis Sentinel or Cluster (external).
Incident Response
API is returning 5xx
# 1. Check pod status
kubectl get pods -n alexandria
# 2. Check API logs
kubectl logs -n alexandria -l app.kubernetes.io/name=alexandria-ee -c api --since=10m
# 3. Check orchestrator socket
kubectl exec -n alexandria deploy/alexandria-ee -c api -- \
ls -la /var/run/alexandria/
# 4. Check DB connectivity
kubectl exec -n alexandria deploy/alexandria-ee -c api -- \
curl -s http://localhost:8080/ready
# 5. Check audit log for recent errors
curl -s 'https://alexandria.example.com/admin/audit?limit=50' \
-H "Authorization: Bearer <token>" | jq '.entries[] | select(.action | contains("error"))'
Admin locked out
If all admin users are disabled or the admin password is lost:
# Reset via helm upgrade with a new adminPassword
helm upgrade alexandria-ee k8s/helm/alexandria-ee/ \
--reuse-values \
--set auth.adminPassword='<new-password>'
# The API seeds/re-syncs the admin user from ALEX_ADMIN_PASSWORD on startup.
License expired
# Check current license state
curl -s https://alexandria.example.com/admin/license \
-H "Authorization: Bearer <token>" | jq .
# Apply a new key (see docs/licensing.md)
Audit chain tampered
curl -s -X POST https://alexandria.example.com/admin/audit/verify \
-H "Authorization: Bearer <token>" | jq .
# If verified: false, contact Alexandria support with the verify response.
# Do NOT modify the audit_log table — preserve evidence.
Certificate Rotation
JWT secret
Rotating the JWT secret invalidates all live tokens. Users will be logged out.
helm upgrade alexandria-ee k8s/helm/alexandria-ee/ \
--reuse-values \
--set auth.jwtSecret='<new-random-32-byte-hex>'
The helm.sh/resource-policy: keep annotation on the Secret prevents accidental rotation during helm upgrade --reuse-values. Always set the new secret explicitly.
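The <new-random-32-byte-hex> placeholder calls for 32 random bytes encoded as 64 hex characters. One way to produce such a value with standard tools is sketched below. gen_jwt_secret is a hypothetical helper name; openssl rand -hex 32 is an equivalent alternative where OpenSSL is installed.

```shell
# Hypothetical helper: print 32 random bytes as 64 lowercase hex characters.
gen_jwt_secret() {
  head -c 32 /dev/urandom | od -An -tx1 | tr -d ' \n'
  echo
}

# Example:
#   helm upgrade alexandria-ee k8s/helm/alexandria-ee/ \
#     --reuse-values \
#     --set auth.jwtSecret="$(gen_jwt_secret)"
```

Passing the value through command substitution keeps it out of shell history, although it will still be visible in the Helm release values.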
SAML signing certificate
The chart creates a SAML signing certificate via a Job on first install. To rotate:
kubectl delete job -n alexandria -l app.kubernetes.io/component=saml-cert
helm upgrade alexandria-ee k8s/helm/alexandria-ee/ --reuse-values
The new certificate will need to be registered with each configured SAML IdP.
KEDA Backend Autoscaling
KEDA-driven autoscaling scales the LLM orchestrator deployment based on the number of in-flight
completion requests observed by the OTel collector. This requires the backend_autoscaling EE
license entitlement.
Prerequisites
- KEDA controller installed in the cluster. Install with Helm:
helm repo add kedacore https://kedacore.github.io/charts
helm upgrade --install keda kedacore/keda \
  --namespace keda --create-namespace
- A Prometheus-compatible API scraping the orchestrator's OTel metrics. The otel-lgtm bundle in k8s/o11y/ exposes Mimir's PromQL on port 9090 of the otel-lgtm Service in the observability namespace.
- EE license with the backend_autoscaling entitlement.
Enabling KEDA scaling
In your values.yaml:
autoscale:
  keda:
    enabled: true
    minReplicas: 1
    maxReplicas: 8
    pollingIntervalSeconds: 30
    cooldownSeconds: 300
    triggerThreshold: "5" # scale up when avg in-flight requests per pod >= 5
    prometheus:
      serverAddress: "http://otel-lgtm.observability.svc.cluster.local:9090"
      query: "sum(alexandria_backend_in_flight_requests)"
license:
  declaredEntitlements:
    - backend_autoscaling
The chart renders a ScaledObject that targets the main Alexandria deployment. Override
autoscale.keda.scaleTargetName if the orchestrator runs in a separate deployment.
Expected scaling behavior
| Condition | Action |
|---|---|
| sum(in_flight_requests) / replicas >= triggerThreshold | KEDA adds a replica (up to maxReplicas) |
| Metric falls below threshold for cooldownSeconds | KEDA removes a replica (down to minReplicas) |
| Prometheus unreachable | KEDA pauses scaling (does not scale down aggressively) |
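The first two rows can be read as a simple calculation. The sketch below assumes the ScaledObject uses KEDA's AverageValue metric type, in which case the underlying HPA targets roughly ceil(metric / triggerThreshold) replicas, clamped to the configured bounds. It ignores the HPA's tolerance band and stabilisation windows, and desired_replicas is a hypothetical name.

```shell
# usage: desired_replicas <sum-in-flight> <triggerThreshold> <minReplicas> <maxReplicas>
desired_replicas() {
  local in_flight="$1" threshold="$2" min="$3" max="$4" want
  # Integer ceiling of in_flight / threshold.
  want=$(( (in_flight + threshold - 1) / threshold ))
  [ "$want" -lt "$min" ] && want="$min"
  [ "$want" -gt "$max" ] && want="$max"
  echo "$want"
}

desired_replicas 23 5 1 8    # prints 5
desired_replicas 0 5 1 8     # prints 1 (held at minReplicas)
desired_replicas 100 5 1 8   # prints 8 (capped at maxReplicas)
```

With the example values from the configuration above (threshold 5, bounds 1 to 8), the deployment reaches maxReplicas once the summed in-flight count exceeds 35.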
Reconciler interaction
When KEDA is enabled (autoscale.keda.enabled=true), the model controller reconciler defers
replica count management to KEDA. Specifically: the reconciler will not snap the replica count
back to DesiredReplicas on each tick — it only enforces min/max bounds at Deployment creation
time and allows KEDA to scale within those bounds. This prevents a fight between the reconciler
and the KEDA controller.
Tuning guidance
- triggerThreshold: Set based on your backend's latency budget. Lower values (3–5) provide faster scale-up for latency-sensitive workloads. Higher values (10–20) are appropriate for throughput-optimised deployments where batching amortises per-request cost.
- cooldownSeconds: 300s (5 minutes) is a safe default. GPU nodes take time to provision; premature scale-down wastes the startup cost. Increase to 600–900s for GPU-backed deployments.
- pollingIntervalSeconds: 30s is the KEDA default. Reduce to 15s if you need faster reaction to burst traffic; below 10s is rarely useful given metric aggregation lag.
Disabling KEDA
Set autoscale.keda.enabled=false (or remove the backend_autoscaling entitlement). The
ScaledObject will not be rendered and the reconciler resumes managing replicas directly via
DesiredReplicas. The existing HPA (autoscaling.enabled) still scales the API pod on CPU/memory
and is independent of KEDA.
KEDA Backend Autoscaling (autoscaling.keda) — Dual-Trigger Tuning
The autoscaling.keda block (distinct from autoscale.keda) governs autoscaling of the
runtime-created LLM backend Deployments managed by alex-model-controller. It uses two Prometheus
triggers simultaneously — KEDA scales up when either trigger fires.
Trigger summary
| Trigger | Metric | Default threshold | Semantic |
|---|---|---|---|
| queryCount | rate(query_count_total[1m]) | 10 queries/second | Throughput pressure |
| latencyP95 | histogram_quantile(0.95, rate(llm_duration_ms_bucket[1m])) | 5000ms | Latency quality-of-service guard |
Tuning the queryCount trigger
queryCountThreshold controls how many queries per second each replica should absorb before
KEDA adds another. Tune based on your backend's concurrency capacity:
- Low concurrency backends (single-GPU, sequential inference): set to 1–3. Each query blocks the GPU; more than a few concurrent requests will stack up.
- High concurrency backends (multi-GPU, batched inference, vLLM): set to 20–50. Batching amortises per-request GPU cost; a single replica can absorb many concurrent requests.
- Use rate(query_count_total[5m]) instead of [1m] if your traffic is bursty: the longer window smooths out transient spikes and reduces unnecessary scale events.
Tuning the latencyP95 trigger
latencyP95ThresholdMs is a QoS guard: scale up when the P95 latency crosses an SLO boundary,
regardless of throughput. This catches saturation that the throughput metric misses (e.g. when
a single very slow request is inflating tail latency without driving up query rate).
- Interactive workloads (chat, low-latency API): 2000–3000ms. Users notice latency above 2 seconds; scale early.
- Batch workloads (document processing, background agents): 10000–30000ms. Tail latency is less user-facing; scale conservatively to avoid over-provisioning.
- If llm_duration_ms_bucket is missing from your Prometheus instance, check that otel.enabled is true and the orchestrator is emitting OTLP metrics to the collector. Run kubectl exec ... -c api -- curl -s localhost:4318/metrics | grep llm_duration to verify.
Replica bounds
autoscaling.keda.minReplicas and maxReplicas apply to every backend Deployment the
reconciler manages. To set per-backend bounds, use the backend's min_replicas and
max_replicas fields in the admin API — the reconciler uses those as the floor when
initialising a new Deployment, and KEDA operates within the ScaledObject min/max.
Prometheus endpoint
The default serverAddress (http://otel-lgtm.o11y.svc.cluster.local:9090) targets the
otel-lgtm bundle in k8s/o11y/. For air-gapped clusters or external Prometheus/Mimir/Thanos:
autoscaling:
  keda:
    prometheus:
      serverAddress: "http://prometheus.monitoring.svc.cluster.local:9090"
Verifying ScaledObject creation
After upgrading with autoscaling.keda.enabled=true, confirm the reconciler created ScaledObjects:
# List ScaledObjects in the workload namespace
kubectl get scaledobjects -n <release-namespace>
# Check KEDA's view of trigger health
kubectl describe scaledobject <backend-name> -n <release-namespace>
# Check the KEDA operator logs for scaler errors (e.g. Prometheus unreachable)
kubectl logs -n keda -l app=keda-operator | grep <backend-name>
If ScaledObjects are not appearing after a reconcile cycle, check model-controller logs:
kubectl logs -n <release-namespace> \
-l app.kubernetes.io/name=alexandria-ee-model-controller \
--since=5m | grep keda