LLM Backends

LLM backends are registered inference endpoints. Alexandria is backend-agnostic: any OpenAI-compatible endpoint can be registered. Admin authentication is required for all write operations.


Backend object

{
  "id": "01HXZ...",
  "name": "gpt4o",
  "url": "https://api.openai.com/v1",
  "kind": "open-ai",
  "model": "gpt-4o",
  "has_api_key": true,
  "is_default": true,
  "enabled": true,
  "backend_type": "external",
  "status": "active",
  "created_at": "2026-01-10T09:00:00Z",
  "updated_at": "2026-01-10T09:00:00Z",
  "source_type": "external",
  "capabilities": ["chat"],
  "protocol": "chat",
  "cache_strategy": "native",
  "cache_ttl_seconds": 0
}

Kinds: open-ai, claude, llama-cpp, custom

Capabilities: chat, embed, voice, vision, rerank

  • chat and embed have live runtime paths
  • voice, vision, rerank are reserved for future runtimes (registering is allowed for pre-provisioning)

Cache strategies:

  • native — delegate to the upstream API's native caching (default for claude, open-ai)
  • alexandria — use Alexandria's distributed Memcached cache (requires memcached_cache entitlement)
  • none — no caching (default for custom)
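The enumerations above lend themselves to a client-side pre-check before a registration request is sent. The following is a minimal Python sketch, not Alexandria's own validation: the helper name check_backend_payload and its return shape are hypothetical, it assumes kind is always supplied, and the server remains the authority on what is accepted.

```python
# Allowed values, copied from the lists above.
KINDS = {"open-ai", "claude", "llama-cpp", "custom"}
CAPABILITIES = {"chat", "embed", "voice", "vision", "rerank"}
LIVE_CAPABILITIES = {"chat", "embed"}  # the others are reserved for future runtimes
CACHE_STRATEGIES = {"native", "alexandria", "none"}


def check_backend_payload(payload: dict) -> tuple[list[str], list[str]]:
    """Hypothetical pre-check: returns (errors, notes) for a registration payload."""
    errors, notes = [], []
    if payload.get("kind") not in KINDS:
        errors.append(f"unknown kind: {payload.get('kind')!r}")
    for cap in payload.get("capabilities", []):
        if cap not in CAPABILITIES:
            errors.append(f"unknown capability: {cap!r}")
        elif cap not in LIVE_CAPABILITIES:
            # Registration is allowed, but no runtime path serves it yet.
            notes.append(f"{cap} is reserved; registering only pre-provisions it")
    strategy = payload.get("cache_strategy")
    if strategy is not None and strategy not in CACHE_STRATEGIES:
        errors.append(f"unknown cache_strategy: {strategy!r}")
    elif strategy == "alexandria":
        notes.append("alexandria caching requires the memcached_cache entitlement")
    return errors, notes
```

Errors are values the lists above do not contain; notes are things that are permitted but worth knowing before the request goes out.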

GET /admin/llm

List all registered backends (enabled and disabled).

POST /admin/llm

Register a backend. After creation, the config is synced to the orchestrator.

External backend

{
  "name": "gpt4o",
  "url": "https://api.openai.com/v1",
  "kind": "open-ai",
  "model": "gpt-4o",
  "api_key": "sk-...",
  "is_default": true,
  "capabilities": ["chat"],
  "cache_strategy": "native"
}
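As an illustration of how the external-backend body above could be sent from code, here is a small Python sketch that uses only the standard library. The base URL http://localhost:8080 and the Bearer token header mirror the curl examples in this page; the helper name build_register_request and the token value are hypothetical. The sketch builds the request without sending it, so it can be inspected first.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # assumption: host used in the curl examples


def build_register_request(payload: dict, admin_token: str) -> urllib.request.Request:
    """Hypothetical helper: prepare POST /admin/llm with a JSON body."""
    return urllib.request.Request(
        f"{BASE_URL}/admin/llm",
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {admin_token}",
            "Content-Type": "application/json",
        },
    )


request = build_register_request(
    {
        "name": "gpt4o",
        "url": "https://api.openai.com/v1",
        "kind": "open-ai",
        "model": "gpt-4o",
        "api_key": "sk-...",  # placeholder, as in the example body
        "is_default": True,
        "capabilities": ["chat"],
        "cache_strategy": "native",
    },
    admin_token="example-token",  # placeholder admin token
)
# To send it: urllib.request.urlopen(request)
```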

Managed backend (OCI image)

Only available in the k8s_enabled build with model_controller.enabled = true.

{
  "name": "llama-local",
  "kind": "llama-cpp",
  "source_type": "oci",
  "image": "registry.example.com/llama:3.2",
  "image_digest": "sha256:abc...",
  "engine": "llama-cpp",
  "container_port": 8080,
  "replicas": 1,
  "capabilities": ["chat"],
  "resources": { "cpu": "2", "memory": "8Gi" }
}

Managed backend (HuggingFace)

{
  "name": "mistral",
  "kind": "open-ai",
  "source_type": "hf",
  "hf_repo": "mistralai/Mistral-7B-Instruct-v0.2",
  "hf_revision": "abc1234",
  "engine": "vllm",
  "replicas": 2
}

Errors

  • 409 — backend name already exists
  • 403 — cache_strategy="alexandria" requires memcached_cache entitlement
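A caller might translate these two documented statuses into actionable messages. The sketch below is only an illustration in Python: the function name describe_register_error and the wording of the messages are hypothetical, and any status not listed above is reported as unexpected rather than guessed at.

```python
# Documented error statuses for POST /admin/llm.
REGISTER_ERRORS = {
    409: "a backend with this name already exists; pick another name or PATCH the existing one",
    403: 'cache_strategy="alexandria" requires the memcached_cache entitlement',
}


def describe_register_error(status: int) -> str:
    """Hypothetical helper: map a failed registration status to a message."""
    return REGISTER_ERRORS.get(status, f"unexpected status {status}")
```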

GET /admin/llm/{name}

Retrieve a single backend by name.

PATCH /admin/llm/{name}

Partial update. URL changes are SSRF-validated.

{
  "enabled": true,
  "model": "gpt-4o-mini",
  "cache_strategy": "native"
}
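Because the update is partial, a client only needs to send the fields it wants to change. The Python sketch below shows one way to build such a request; the helper name build_patch_request is hypothetical, the base URL and Bearer header are taken from the curl examples, and refusing an empty change set is a client-side choice of this sketch, not documented server behavior.

```python
import json
import urllib.request
from urllib.parse import quote

BASE_URL = "http://localhost:8080"  # assumption: host used in the curl examples


def build_patch_request(name: str, admin_token: str, **changes) -> urllib.request.Request:
    """Hypothetical helper: the PATCH body contains only the fields passed in."""
    if not changes:
        raise ValueError("a partial update needs at least one field")
    return urllib.request.Request(
        f"{BASE_URL}/admin/llm/{quote(name, safe='')}",  # escape the path segment
        data=json.dumps(changes).encode("utf-8"),
        method="PATCH",
        headers={
            "Authorization": f"Bearer {admin_token}",
            "Content-Type": "application/json",
        },
    )


request = build_patch_request("gpt4o", "example-token", model="gpt-4o-mini", enabled=True)
# To send it: urllib.request.urlopen(request)
```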

DELETE /admin/llm/{name}

Deletes the backend and removes the associated API key from the secret store. Returns 204.


POST /admin/llm/{name}/default

Mark a backend as the default for routing; it is used when a query does not specify a backend explicitly. The config is then synced to the orchestrator.

POST /admin/llm/{name}/ping

Check backend reachability with SSRF protection. The raw error is not echoed to the client — internal IPs and DNS results are never exposed.

Response 200

{
  "name": "gpt4o",
  "url": "https://api.openai.com/v1",
  "healthy": true,
  "status_code": 200
}

On failure:

{
  "name": "gpt4o",
  "url": "https://api.openai.com/v1",
  "healthy": false,
  "error": "backend unreachable"
}
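Both response shapes can be handled by branching on the healthy field. The Python sketch below assumes only the fields shown in the two examples above; the function name summarize_ping is hypothetical.

```python
import json


def summarize_ping(body: str) -> str:
    """Hypothetical helper: turn a ping response body into a one-line summary."""
    result = json.loads(body)
    if result.get("healthy"):
        return f"{result['name']} is reachable (status_code {result['status_code']})"
    # The error text is deliberately generic; the server does not echo raw errors.
    return f"{result['name']} is not reachable: {result.get('error', 'no error given')}"
```

Reading error with a fallback keeps the summary usable even if a failure response omits that field.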

POST /admin/llm/sync

Explicitly sync all backend configs to the orchestrator's config file and trigger a reload.

Response 200

{
  "synced": true,
  "backends_loaded": 3
}

Managed backend deployment (k8s)

Build-tag note: the /deploy, /undeploy, and /logs endpoints are only available in the k8s_enabled build. Non-k8s builds return 503 for these routes.

Both /deploy and /undeploy require the backend_autoscaling license entitlement and return 402 if it is not licensed.

POST /admin/llm/{name}/deploy

Enables the managed backend — alex-model-controller picks it up on the next reconcile and creates a k8s Deployment + Service.

POST /admin/llm/{name}/undeploy

Disables the backend — controller tears down the Deployment/Service.
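A client that toggles managed backends might wrap the two routes and the documented failure statuses as follows. This is a Python sketch under stated assumptions: the helper names are hypothetical, the base URL comes from the curl examples, and treating any 2xx status as success is an assumption because the success status is not specified here.

```python
from urllib.parse import quote

BASE_URL = "http://localhost:8080"  # assumption: host used in the curl examples

# Failure statuses documented for the managed-backend routes.
DEPLOY_ERRORS = {
    402: "backend_autoscaling entitlement is not licensed",
    503: "route unavailable (non-k8s build)",
}


def action_url(name: str, action: str) -> str:
    """Hypothetical helper: URL for POST /admin/llm/{name}/deploy or /undeploy."""
    if action not in ("deploy", "undeploy"):
        raise ValueError(f"unsupported action: {action!r}")
    return f"{BASE_URL}/admin/llm/{quote(name, safe='')}/{action}"


def describe_status(status: int) -> str:
    """Hypothetical helper: assumes any 2xx status means the request was accepted."""
    if 200 <= status < 300:
        return "accepted"
    return DEPLOY_ERRORS.get(status, f"unexpected status {status}")
```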

GET /admin/llm/{name}/logs

Streams Pod logs for debugging stuck deploys.

Query params

  • tail — last N lines (default 200, max 5000)
  • container — container name (default engine)

Picks the most recently created Pod matching the backend label.
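The query parameters can be assembled with the standard library, as in the Python sketch below. The helper name logs_url is hypothetical and the base URL comes from the curl examples. Clamping tail to the documented maximum on the client is a choice made in this sketch; how the server treats larger values is not stated here.

```python
from urllib.parse import quote, urlencode

BASE_URL = "http://localhost:8080"  # assumption: host used in the curl examples
DEFAULT_TAIL = 200  # documented default
MAX_TAIL = 5000     # documented maximum


def logs_url(name: str, tail: int = DEFAULT_TAIL, container: str = "engine") -> str:
    """Hypothetical helper: build the GET /admin/llm/{name}/logs URL."""
    tail = max(1, min(int(tail), MAX_TAIL))  # keep tail within 1..MAX_TAIL
    query = urlencode({"tail": tail, "container": container})
    return f"{BASE_URL}/admin/llm/{quote(name, safe='')}/logs?{query}"
```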

Errors

  • 400 — external backend (no Pod to log)
  • 503 — k8s client unavailable

curl examples

# Register OpenAI backend
curl -s -X POST http://localhost:8080/admin/llm \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H 'Content-Type: application/json' \
  -d '{
    "name":"gpt4o",
    "url":"https://api.openai.com/v1",
    "kind":"open-ai",
    "model":"gpt-4o",
    "api_key":"sk-...",
    "is_default":true
  }' | jq .

# Register local llama.cpp backend
curl -s -X POST http://localhost:8080/admin/llm \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H 'Content-Type: application/json' \
  -d '{
    "name":"llama-local",
    "url":"http://localhost:8080",
    "kind":"llama-cpp",
    "capabilities":["chat","embed"]
  }' | jq .

# Ping
curl -s -X POST http://localhost:8080/admin/llm/gpt4o/ping \
  -H "Authorization: Bearer $ADMIN_TOKEN" | jq .

# Set as default
curl -s -X POST http://localhost:8080/admin/llm/gpt4o/default \
  -H "Authorization: Bearer $ADMIN_TOKEN" | jq .