LLM Backends

LLM backends are registered inference endpoints. Alexandria is backend-agnostic: any OpenAI-compatible endpoint can be registered. Admin authentication is required for all write operations.

Backend object

{
  "id": "01HXZ...",
  "name": "gpt4o",
  "url": "https://api.openai.com/v1",
  "kind": "open-ai",
  "model": "gpt-4o",
  "has_api_key": true,
  "is_default": true,
  "enabled": true,
  "backend_type": "external",
  "status": "active",
  "created_at": "2026-01-10T09:00:00Z",
  "updated_at": "2026-01-10T09:00:00Z",
  "source_type": "external",
  "capabilities": ["chat"],
  "protocol": "chat",
  "cache_strategy": "native",
  "cache_ttl_seconds": 0
}

Kinds: open-ai, claude, llama-cpp, custom

Capabilities: chat, embed, voice, vision, rerank

chat and embed have live runtime paths
voice, vision, rerank are reserved for future runtimes (registering is allowed for pre-provisioning)
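
The split between live and reserved capabilities can be sketched client-side as follows. This is a minimal illustration, not Alexandria's own code: the helper name `split_capabilities` and the decision to reject unknown names are assumptions. Only the five capability names and their grouping come from the list above.

```python
# Capability names as documented above.
LIVE_CAPABILITIES = {"chat", "embed"}
RESERVED_CAPABILITIES = {"voice", "vision", "rerank"}


def split_capabilities(capabilities):
    """Return (live, reserved) lists, preserving input order.

    Raising on unknown names is an assumption made for this sketch;
    the server's actual validation behaviour is not documented here.
    """
    live, reserved = [], []
    for cap in capabilities:
        if cap in LIVE_CAPABILITIES:
            live.append(cap)
        elif cap in RESERVED_CAPABILITIES:
            reserved.append(cap)
        else:
            raise ValueError(f"unknown capability: {cap!r}")
    return live, reserved
```

A backend registered with `["chat", "vision"]` would serve chat traffic today, while `vision` would only be pre-provisioned.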

Cache strategies:

native — delegate to the upstream API's native caching (default for claude, open-ai)
alexandria — use Alexandria's distributed Memcached cache (requires memcached_cache entitlement)
none — no caching (default for custom)
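
The per-kind defaults above can be expressed as a small lookup. This is a sketch under stated assumptions: the function name `effective_cache_strategy` is hypothetical, and because no default is documented for `llama-cpp`, the sketch returns `None` for it rather than guessing.

```python
VALID_STRATEGIES = {"native", "alexandria", "none"}

# Defaults as documented above. llama-cpp is deliberately absent because
# its default is not stated in this reference.
DEFAULT_CACHE_STRATEGY = {
    "claude": "native",
    "open-ai": "native",
    "custom": "none",
}


def effective_cache_strategy(kind, requested=None):
    """Return the requested strategy if given, else the documented default.

    Returns None when no default is documented for the kind.
    """
    if requested is not None:
        if requested not in VALID_STRATEGIES:
            raise ValueError(f"unknown cache strategy: {requested!r}")
        return requested
    return DEFAULT_CACHE_STRATEGY.get(kind)
```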

GET /admin/llm

List all registered backends (enabled and disabled).

POST /admin/llm

External backend

{
  "name": "gpt4o",
  "url": "https://api.openai.com/v1",
  "kind": "open-ai",
  "model": "gpt-4o",
  "api_key": "sk-...",
  "is_default": true,
  "capabilities": ["chat"],
  "cache_strategy": "native"
}
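
For clients that do not use curl, the same registration request might be built with the Python standard library. This is a minimal sketch: `build_register_request` is a hypothetical helper, and the Bearer token header mirrors the curl examples in this page. The function only constructs the request and does not send it.

```python
import json
import urllib.request


def build_register_request(base_url, admin_token, backend):
    """Build (but do not send) a POST /admin/llm request.

    `backend` is a dict shaped like the JSON body shown above.
    """
    body = json.dumps(backend).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/admin/llm",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {admin_token}",
            "Content-Type": "application/json",
        },
    )
```

Passing the returned object to `urllib.request.urlopen` would send it. Keeping construction separate makes the request easy to inspect or log (minus the token) before it goes out.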

Managed backend (OCI image)

Only available in the k8s_enabled build with model_controller.enabled = true.

{
  "name": "llama-local",
  "kind": "llama-cpp",
  "source_type": "oci",
  "image": "registry.example.com/llama:3.2",
  "image_digest": "sha256:abc...",
  "engine": "llama-cpp",
  "container_port": 8080,
  "replicas": 1,
  "capabilities": ["chat"],
  "resources": { "cpu": "2", "memory": "8Gi" }
}

Managed backend (HuggingFace)

{
  "name": "mistral",
  "kind": "open-ai",
  "source_type": "hf",
  "hf_repo": "mistralai/Mistral-7B-Instruct-v0.2",
  "hf_revision": "abc1234",
  "engine": "vllm",
  "replicas": 2
}

Errors

409 — backend name already exists
403 — cache_strategy="alexandria" requires memcached_cache entitlement
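
A client might translate these status codes into actionable outcomes along the following lines. This is a sketch only: the function name and the returned labels are assumptions, and any other 4xx/5xx code is simply reported as a generic error because this reference does not enumerate them.

```python
def classify_register_status(status):
    """Map a POST /admin/llm status code to a short label.

    Only 409 and 403 are documented above; the other labels are
    illustrative choices for this sketch.
    """
    if status == 409:
        return "name_conflict"        # backend name already exists
    if status == 403:
        return "missing_entitlement"  # memcached_cache entitlement required
    if status >= 400:
        return "error"
    return "ok"
```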

GET /admin/llm/{name}

Fetch a single registered backend by name.

PATCH /admin/llm/{name}

Partial update. URL changes are SSRF-validated.

{
  "enabled": true,
  "model": "gpt-4o-mini",
  "cache_strategy": "native"
}
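
Because the update is partial, a client only needs to send the fields that actually change. One way to compute such a body is sketched below. The helper `minimal_patch` is hypothetical, and the set of patchable fields is an assumption drawn from the example body and the note about URL changes, not an exhaustive list.

```python
# Assumed patchable fields: those in the example above, plus url.
PATCHABLE_FIELDS = ("url", "model", "enabled", "cache_strategy")


def minimal_patch(current, desired):
    """Return only the fields in `desired` that differ from `current`."""
    return {
        field: desired[field]
        for field in PATCHABLE_FIELDS
        if field in desired and desired[field] != current.get(field)
    }
```

For a backend currently on `gpt-4o`, the example body above would reduce to just the `model` field, since `enabled` and `cache_strategy` are unchanged.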

DELETE /admin/llm/{name}

Deletes the backend and removes the associated API key from the secret store. Returns 204.

POST /admin/llm/{name}/default

Mark a backend as the default for routing. The default is used when a query does not specify a backend. Also syncs the config to the orchestrator.
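
The routing rule can be pictured as a simple fallback. The sketch below is illustrative only and is not the orchestrator's code: `resolve_backend` is a hypothetical name, and how disabled backends or a missing default are treated in practice is not documented here, so the sketch ignores `enabled` and raises if nothing matches.

```python
def resolve_backend(backends, requested=None):
    """Pick a backend by explicit name, else fall back to the default.

    `backends` is a list of dicts shaped like the backend object.
    """
    if requested is not None:
        for backend in backends:
            if backend["name"] == requested:
                return backend
        raise LookupError(f"no backend named {requested!r}")
    for backend in backends:
        if backend.get("is_default"):
            return backend
    raise LookupError("no default backend configured")
```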

POST /admin/llm/{name}/ping

Check backend reachability with SSRF protection. The raw error is not echoed to the client — internal IPs and DNS results are never exposed.

Response 200

{
  "name": "gpt4o",
  "url": "https://api.openai.com/v1",
  "healthy": true,
  "status_code": 200
}

On failure:

{
  "name": "gpt4o",
  "url": "https://api.openai.com/v1",
  "healthy": false,
  "error": "backend unreachable"
}
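
Both response shapes share `name`, `url`, and `healthy`, and differ only in whether `status_code` or `error` is present. A client could model that with one type, as in this sketch (the `PingResult` class and `parse_ping` function are hypothetical names):

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class PingResult:
    name: str
    url: str
    healthy: bool
    status_code: Optional[int] = None  # present in the success example
    error: Optional[str] = None        # present in the failure example


def parse_ping(raw):
    """Parse a ping response body (a JSON string) into a PingResult."""
    data = json.loads(raw)
    return PingResult(
        name=data["name"],
        url=data["url"],
        healthy=bool(data["healthy"]),
        status_code=data.get("status_code"),
        error=data.get("error"),
    )
```

Since the raw error is never echoed, callers should branch on `healthy` rather than trying to interpret the `error` string.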

POST /admin/llm/sync

Explicitly sync all backend configs to the orchestrator's config file and trigger a reload.

Response 200

{
  "synced": true,
  "backends_loaded": 3
}

Managed backend deployment (k8s)

Build-tag note: the /deploy, /undeploy, and /logs endpoints are only available in the k8s_enabled build. Non-k8s builds return 503 for these routes.

Both deploy and undeploy require the backend_autoscaling license entitlement and return 402 if it is missing.

POST /admin/llm/{name}/deploy

Enables the managed backend — alex-model-controller picks it up on the next reconcile and creates a k8s Deployment + Service.

POST /admin/llm/{name}/undeploy

Disables the backend — controller tears down the Deployment/Service.

GET /admin/llm/{name}/logs

Streams Pod logs for debugging stuck deploys.

Query params

tail — last N lines (default 200, max 5000)
container — container name (default engine)

Picks the most recently created Pod matching the backend label.

Errors

400 — external backend (no Pod to log)
503 — k8s client unavailable
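
The query parameters for the logs endpoint can be assembled as shown in this sketch. The helper `logs_url` is hypothetical. Capping `tail` at 5000 on the client side is an assumption made here for safety; this reference does not say whether the server clamps or rejects larger values.

```python
from urllib.parse import quote, urlencode

MAX_TAIL = 5000  # documented maximum


def logs_url(base_url, name, tail=200, container="engine"):
    """Build the GET /admin/llm/{name}/logs URL with documented defaults."""
    if tail < 1:
        raise ValueError("tail must be a positive number of lines")
    tail = min(tail, MAX_TAIL)  # client-side cap (assumption, see lead-in)
    query = urlencode({"tail": tail, "container": container})
    return f"{base_url.rstrip('/')}/admin/llm/{quote(name, safe='')}/logs?{query}"
```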

curl examples

# Register OpenAI backend
curl -s -X POST http://localhost:8080/admin/llm \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H 'Content-Type: application/json' \
  -d '{
    "name":"gpt4o",
    "url":"https://api.openai.com/v1",
    "kind":"open-ai",
    "model":"gpt-4o",
    "api_key":"sk-...",
    "is_default":true
  }' | jq .

# Register local llama.cpp backend
curl -s -X POST http://localhost:8080/admin/llm \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H 'Content-Type: application/json' \
  -d '{
    "name":"llama-local",
    "url":"http://localhost:8080",
    "kind":"llama-cpp",
    "capabilities":["chat","embed"]
  }' | jq .

# Ping
curl -s -X POST http://localhost:8080/admin/llm/gpt4o/ping \
  -H "Authorization: Bearer $ADMIN_TOKEN" | jq .

# Set as default
curl -s -X POST http://localhost:8080/admin/llm/gpt4o/default \
  -H "Authorization: Bearer $ADMIN_TOKEN" | jq .
