Error Handling & Edge Cases: Deep Dive
#errors #edge-cases #recovery #resilience
Overview
Error handling spans three layers: middleware early returns, handler-level checks, and orchestrator failures. This document traces error paths and documents edge cases that can cause subtle bugs.
Layer 1: Middleware Errors (Early Returns)
Middleware errors stop request processing immediately and return an HTTP error response.
1.1 Missing Authorization Header
Middleware: RequireAuth
Condition: Authorization header missing or not "Bearer <token>"
HTTP Response: 401 Unauthorized
Body: {"error": "missing Authorization header"}
Code:
token, ok := extractBearer(r)
if !ok {
httpError(w, 401, "missing Authorization header")
return // EARLY RETURN: stops here
}
Scenarios:
- Client sends request without header → 401
- Client sends header but wrong format (e.g., "Token abc" instead of "Bearer abc") → 401
- Client sends empty Bearer (e.g., "Bearer ") → 401
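The three scenarios above can be sketched as a small parsing helper. The real extractBearer is not shown in this document, so the body below is an assumption; parseBearer is a hypothetical name introduced only to keep the header logic testable on its own, and the case-sensitive match on "Bearer " is likewise assumed.

```go
package main

import (
	"net/http"
	"strings"
)

// parseBearer (hypothetical helper) returns the token from an Authorization
// header value of the form "Bearer <token>". The scheme match is
// case-sensitive here, which is an assumption about the real middleware.
func parseBearer(header string) (string, bool) {
	const prefix = "Bearer "
	if !strings.HasPrefix(header, prefix) {
		return "", false // header missing, or wrong scheme such as "Token abc"
	}
	token := strings.TrimSpace(header[len(prefix):])
	if token == "" {
		return "", false // "Bearer " with nothing after it
	}
	return token, true
}

// extractBearer mirrors the call shape used by RequireAuth above.
func extractBearer(r *http.Request) (string, bool) {
	return parseBearer(r.Header.Get("Authorization"))
}
```

All three failure scenarios return ok == false, so the middleware answers each of them with the same 401 body.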
1.2 Invalid or Expired JWT
Middleware: RequireAuth
Condition: JWT signature invalid, expired, or malformed
HTTP Response: 401 Unauthorized
Body: {"error": "invalid or expired token"}
Code:
claims, err := auth.ValidateAccessToken(token, secret)
if err != nil {
httpError(w, 401, "invalid or expired token")
return
}
Scenarios:
- Token expired (now > token.exp) → 401
- Token signature invalid (signed with different secret) → 401
- Token tampered (claims modified, signature no longer matches) → 401
- Token is not valid JSON Web Token format → 401
1.3 Token Version Mismatch (Revoked)
Middleware: RequireAuth
Condition: Token version doesn't match current user version
HTTP Response: 401 Unauthorized
Body: {"error": "token revoked"}
Code:
currentVersion, err := store.GetTokenVersion(ctx, claims.Subject)
if err != nil || currentVersion != claims.TokenVersion {
httpError(w, 401, "token revoked")
return
}
Scenarios:
- User's token was revoked (admin incremented user.token_version) → 401
- User logs in on another device and receives new tokens; tokens issued earlier are rejected → 401
- Database query fails to fetch token version → 401 (defensive: fail-secure)
1.4 Rate Limit Exceeded
Middleware: RateLimiter
Condition: Client IP exceeded rate limit
HTTP Response: 429 Too Many Requests
Body: {"error": "rate limit exceeded"}
Code:
if !limiter.AllowN(time.Now(), 1) {
httpError(w, 429, "rate limit exceeded")
return
}
Scenarios:
- Client sends >100 queries/min from same IP → 429
- Burst limit exceeded (>10 concurrent requests) → 429
- Rate limit reset after 60 seconds (sliding window)
1.5 Tenant Mismatch (Multi-Tenancy)
Middleware: TenantMiddleware
Condition: Invalid or missing tenant in JWT
HTTP Response: None from this middleware (a mismatch surfaces later as empty query results, not as an explicit 403)
Code:
if tenantID == "" {
tenantID = "default" // Fall back
}
ctx := context.WithValue(r.Context(), ContextKeyTenantID, tenantID)
Note: Tenant mismatch is implicit: queries return empty results when the data doesn't belong to the tenant, so the client never receives an explicit 403.
Layer 2: Handler Errors
Handler errors occur after middleware passes. The handler returns an HTTP error and logs the failure.
2.1 Invalid Request Body
Handler: All handlers
Condition: JSON body malformed or missing required fields
HTTP Response: 400 Bad Request
Code:
var req struct {
AgentID string `json:"agent_id"`
Prompt string `json:"prompt"`
}
if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
respond.Error(w, 400, "invalid request")
return
}
Scenarios:
- Request body is not JSON → 400
- Required field missing → not caught by Decode (the field keeps its zero value); must be validated separately or given a default
- JSON field type wrong (e.g., agent_id is number instead of string) → 400
2.2 RBAC Check Failed (Agent Not Allowed)
Handler: handleQuery
Condition: User's allowed_agents list doesn't include agent_id
HTTP Response: 403 Forbidden
Code:
canUse := false
for _, a := range claims.AllowedAgents {
if a == req.AgentID {
canUse = true
break
}
}
if !canUse {
respond.Error(w, 403, "agent not allowed")
return
}
Edge Case: claims.AllowedAgents can be empty (user has no agents). Any query → 403.
2.3 Resource Not Found
Handler: All handlers using st.Store.Get*()
Condition: Agent, user, group, etc. doesn't exist
HTTP Response: 404 Not Found
Code:
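agent, err := st.Store.GetAgent(r.Context(), tenantID, req.AgentID)
if err != nil {
	// Caution: this maps every store error to 404, including the
	// connection and timeout failures that section 2.4 reports as 500.
	respond.Error(w, 404, "agent not found")
	return
}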
Scenarios:
- Agent name doesn't exist in DB → 404
- User ID doesn't exist → 404
- Group name doesn't exist → 404
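The snippet above answers 404 for any error from the store, which hides database failures behind "not found". One common way to separate the two cases is a sentinel error checked with errors.Is. The sketch below assumes the store could return such a sentinel; ErrNotFound, lookupAgent, and statusForStoreError are hypothetical names, and lookupAgent only stands in for st.Store.GetAgent.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// ErrNotFound is a hypothetical sentinel the store would return for missing rows.
var ErrNotFound = errors.New("not found")

// lookupAgent stands in for st.Store.GetAgent in this sketch. A nil map
// simulates an unreachable database.
func lookupAgent(agents map[string]bool, name string) error {
	if agents == nil {
		return errors.New("database unavailable")
	}
	if !agents[name] {
		return fmt.Errorf("agent %q: %w", name, ErrNotFound)
	}
	return nil
}

// statusForStoreError maps a store error to the HTTP status the handler
// should send: 404 only for genuine misses, 500 for everything else.
func statusForStoreError(err error) int {
	switch {
	case err == nil:
		return http.StatusOK
	case errors.Is(err, ErrNotFound):
		return http.StatusNotFound
	default:
		return http.StatusInternalServerError
	}
}
```

Wrapping with %w keeps the sentinel detectable even after the store adds context to the message.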
2.4 Store/Database Error
Handler: Any query to st.Store
Condition: Database connection lost, corrupted, or internal error
HTTP Response: 500 Internal Server Error
Code:
user, err := st.Store.GetUser(ctx, userID)
if err != nil {
respond.Error(w, 500, "database error")
log.Errorf("get user failed: %v", err)
return
}
Scenarios:
- SQLite database file locked → retried by SQLite when a busy timeout is configured; otherwise the lock error surfaces as 500
- Postgres connection pool exhausted → 500
- Corrupted row in DB → 500
- Query timeout (context deadline exceeded) → 500
Layer 3: Orchestrator Errors
Errors in the orchestrator are serialized as gRPC errors and returned to the API handler.
3.1 Orchestrator Not Running
gRPC Call: client.Query(ctx, req)
Error: Dial error (socket not accessible)
HTTP Response: 500 Internal Server Error
Code:
resp, err := st.Orchestrator.Query(ctx, orchestratorReq)
if err != nil {
log.Errorf("orchestrator error: %v", err)
respond.Error(w, 500, "query failed")
return
}
Scenarios:
- Orchestrator process crashed → socket doesn't exist → Dial error
- Orchestrator hung → no response (context timeout) → context.DeadlineExceeded
- Orchestrator restarting → intermittent failures
3.2 Query Timeout
gRPC Call: client.Query with 30s timeout
Condition: Orchestrator takes >30 seconds
Error: context.DeadlineExceeded
HTTP Response: 504 Gateway Timeout
Code:
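ctx, cancel := context.WithTimeout(r.Context(), 30*time.Second)
defer cancel()
resp, err := st.Orchestrator.Query(ctx, orchestratorReq)
// A gRPC client reports an expired deadline as a status error, so a plain
// == comparison against context.DeadlineExceeded does not match it.
// Check both forms (google.golang.org/grpc/status and .../codes).
if errors.Is(err, context.DeadlineExceeded) || status.Code(err) == codes.DeadlineExceeded {
	respond.Error(w, 504, "query timeout")
	return
}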
Scenarios:
- LLM backend slow (local inference on Ollama, network latency to Claude API)
- Tool execution hanging (tool daemon doesn't respond)
- Orchestrator overloaded (backlog of queries)
3.3 Agent Tool Validation Failed
Orchestrator Internal Error
Condition: Agent's allowed_tools contains invalid/non-existent tool
Error: Returned as QueryResponse.error field
HTTP Response: 500 Internal Server Error (API propagates orchestrator error)
Scenario:
- Agent config: allowed_tools = ["web_search", "nonexistent_tool"]
- Orchestrator tries to load "nonexistent_tool" → error
- Error serialized in QueryResponse
- API checks response.Success == false → 500
3.4 Tool Execution Failure
Orchestrator Internal Error
Condition: Tool daemon crashes, timeout, or returns error
Error: Tool result propagated back to LLM
HTTP Response: 200 OK (query may still complete with error message)
Scenario:
- LLM decides to call "web_search" tool
- Tool daemon crashes → no response
- Orchestrator timeout waiting for tool → error
- LLM incorporates error into context ("tool failed, try different approach")
- Conversation continues
- Final response: 200 OK, but tool_used: "web_search", response includes error explanation
Edge Cases & Surprising Behaviors
Edge Case 1: Empty Effective Tools
Setup:
- Agent: allowed_tools = []
- User: allowed_tools = ["calculator"]
- Query: POST /v1/agent-token for this agent
Flow:
// In permission.ComputeEffectiveTools
switch {
case len(agentTools) == 0:
// Agent has no tools → deny all
return []string{}
}
Result: Agent token issued with effective_tools = []. Agent can't call any tools.
Expected behavior: Correct. Agent has no permission grants.
Edge Case 2: User with Wildcard Agent
Setup:
- User: allowed_agents = ["*"]
- Query: GET /v1/agents
Flow:
// Frontend filtering
claims.AllowedAgents = ["*"]
agents_to_show = agents.filter(a => claims.AllowedAgents.includes(a.name) || claims.AllowedAgents.includes("*"))
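Result: Frontend shows all agents. The backend RBAC loop in 2.2 compares agent names literally, so unless it treats "*" as a wildcard, every query from this user is rejected with 403.
Fix: Backend should treat "*" in the allowed_agents list as matching any agent.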
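A minimal sketch of a backend check that honors the wildcard, written as a replacement for the literal comparison loop in 2.2. The function name agentAllowed is hypothetical.

```go
package main

// agentAllowed reports whether agentID is permitted by the allowed list.
// A "*" entry matches every agent; an empty list permits none.
func agentAllowed(allowedAgents []string, agentID string) bool {
	for _, a := range allowedAgents {
		if a == "*" || a == agentID {
			return true
		}
	}
	return false
}
```

Keeping the wildcard logic in one function means the frontend filter and the backend check cannot drift apart as long as both follow the same rule.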
Edge Case 3: Concurrent Token Revocation
Setup:
- User has valid access token
- Admin revokes all user tokens (increments user.token_version)
- User simultaneously sends query with old token
Timeline:
T+0ms User sends query with token_version = 5
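T+1ms RequireAuth validates the JWT and starts the token version check
T+2ms Admin increments user.token_version to 6
T+3ms RequireAuth loads token version from DB: 6
T+4ms Comparison: 5 != 6 → token revoked → 401
Result: Request rejected (correct). The version is read from the DB on every request, so the outcome depends only on whether the read lands before or after the increment: a request whose read lands first completes under the old version, and every later request is rejected.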
Edge Case 4: Agent Permissions Change During Query
Setup:
- User has agent token with effective_tools = ["calculator"]
- Query is in-flight
- Admin changes agent.allowed_tools to remove "calculator"
- Admin increments agent.permissions_version
Timeline:
T+0ms Query starts, agent_token.permissions_version = 5
T+1ms Orchestrator receives query
T+500ms Admin changes agent permissions, permissions_version becomes 6
T+1000ms Orchestrator executes tool call (permissions still cached as version 5)
T+1001ms Tool validation: effective_tools still has "calculator" (baked in token)
T+1002ms Tool executes (even though new policy disallows it)
T+2000ms Query completes
T+2001ms Response sent to client
Result: Query completes with old permissions. New permissions take effect on NEXT token issuance.
Why: Permissions are baked at token mint time. Changing agent config doesn't invalidate in-flight tokens.
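How to fix: If strict invalidation is needed, use agent.on_permission_change = "abort" (checks permissions_version at request start). Note that a check at request start still does not catch a change made mid-query, as in the timeline above.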
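If changes made mid-query must also be caught, one option is to compare versions again immediately before each tool call. The sketch below illustrates that stricter check. It is an assumption, not a description of what on_permission_change does; agentToken, its field names, and authorizeToolCall are hypothetical.

```go
package main

import "fmt"

// agentToken holds the claims relevant to this sketch (field names assumed).
type agentToken struct {
	PermissionsVersion int
	EffectiveTools     []string
}

// authorizeToolCall rejects the call when the agent's permissions changed
// after the token was minted, then checks the baked-in tool list.
func authorizeToolCall(tok agentToken, currentVersion int, tool string) error {
	if tok.PermissionsVersion != currentVersion {
		return fmt.Errorf("permissions changed (token v%d, current v%d)",
			tok.PermissionsVersion, currentVersion)
	}
	for _, t := range tok.EffectiveTools {
		if t == tool {
			return nil
		}
	}
	return fmt.Errorf("tool %q not in effective_tools", tool)
}
```

The cost is one extra version lookup per tool call, so this trades latency for a shorter window in which stale permissions apply.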
Edge Case 5: Group Ceiling Intersection
Setup:
- User belongs to 2 groups:
- Group A: ceiling = ["web_search", "calculator"]
- Group B: ceiling = ["calculator", "database"]
- Agent: allowed_tools = ["web_search", "calculator", "database"]
Flow:
// Load user's groups
groupA_ceiling = ["web_search", "calculator"]
groupB_ceiling = ["calculator", "database"]
// Intersect all group ceilings
effectiveGroupCeiling = IntersectToolLists(groupA_ceiling, groupB_ceiling)
// = ["calculator"]
// Now intersect with agent
agentUser = ["web_search", "calculator", "database"] // from agent
afterGroup = IntersectToolLists(agentUser, ["calculator"])
// = ["calculator"]
Result: User can only use "calculator" (intersection of all groups).
Expected behavior: Correct. Multiple group memberships are AND-ed (most restrictive wins).
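The intersection steps above can be sketched as follows. IntersectToolLists is named in the flow but its body is not shown, so this implementation is an assumption; effectiveTools is a hypothetical wrapper, and the user's own allowed_tools list is left out to keep the example short.

```go
package main

// intersectToolLists returns the tools present in both lists, in a's order.
func intersectToolLists(a, b []string) []string {
	inB := make(map[string]struct{}, len(b))
	for _, t := range b {
		inB[t] = struct{}{}
	}
	out := []string{}
	for _, t := range a {
		if _, ok := inB[t]; ok {
			out = append(out, t)
		}
	}
	return out
}

// effectiveTools narrows the agent's tools by every group ceiling in turn,
// so each additional group can only remove tools, never add them.
func effectiveTools(agentTools []string, groupCeilings ...[]string) []string {
	result := append([]string{}, agentTools...) // copy: never alias the input
	for _, ceiling := range groupCeilings {
		result = intersectToolLists(result, ceiling)
	}
	return result
}
```

With the setup above, effectiveTools applied to the agent list and the ceilings of Group A and Group B yields only "calculator". An agent with an empty allowed_tools list yields an empty result regardless of group ceilings, which matches Edge Case 1.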
Edge Case 6: Tool Token Theft
Setup:
- Orchestrator issues tool token to tool daemon
- Tool daemon never calls /ingest/tool_result
- Attacker steals the token from logs
Can attacker use token?
// ToolToken is not hashed (unlike refresh tokens)
// It's a full JWT string
// Attacker sends POST /v1/ingest/tool_result
// Authorization: Bearer <stolen_token>
Result: Attacker can ingest tool results, forge execution logs.
Mitigation:
- ToolTokens are short-lived (5 min) → limited window
- ToolTokens are session-scoped (agent_id + session_id baked in) → can only affect that session
- Audit logging records who ingests (tool name from token)
Better fix: Hash tool tokens like refresh tokens (not currently done).
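A minimal sketch of the hashing idea, assuming the goal is that logs and storage hold only a digest that cannot itself be presented as a bearer token. How refresh tokens are hashed in this codebase is not shown here, so SHA-256 with hex encoding is an assumption; hashToken and tokenMatches are hypothetical names.

```go
package main

import (
	"crypto/sha256"
	"crypto/subtle"
	"encoding/hex"
)

// hashToken returns the hex-encoded SHA-256 digest of a token. Only this
// value would be logged or stored.
func hashToken(token string) string {
	sum := sha256.Sum256([]byte(token))
	return hex.EncodeToString(sum[:])
}

// tokenMatches checks a presented token against a stored digest using a
// constant-time comparison to avoid leaking how many characters matched.
func tokenMatches(presented, storedHash string) bool {
	got := hashToken(presented)
	return subtle.ConstantTimeCompare([]byte(got), []byte(storedHash)) == 1
}
```

An attacker who reads the digest from logs cannot use it in an Authorization header, because the server hashes whatever is presented before comparing.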
Edge Case 7: Tenant Isolation Bypass
Setup:
- Multi-tenant system
- User A from tenant_1 somehow gets token with tenant_id = "default" (bug)
Can user A access tenant_2 data?
// In TenantMiddleware
tenantID = claims.TenantID // "default"
// All queries: WHERE tenant_id = ?
// If another tenant's data somehow has tenant_id = "default", user A can see it!
Result: Potential data leakage.
Mitigation:
- User accounts are scoped to tenant at creation time
- JWT claims.TenantID is set from user.tenant_id (not user-controlled)
- Code review: ensure no queries skip tenant_id filter
Edge Case 8: Orchestrator Slow, Client Timeout
Setup:
- Client sends query with 5-second timeout
- Orchestrator takes 10 seconds
Flow:
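T+0ms Client sends request
T+1ms Handler forwards to orchestrator
T+1ms-5000ms Orchestrator processing
T+5000ms Client timeout (5s elapsed) → client closes the connection → r.Context() cancelled
T+5001ms Handler's orchestrator call returns a cancellation error (context.Canceled, not context.DeadlineExceeded)
T+5002ms Handler writes an error response, but the client is no longer listening
T+10000ms Orchestrator finishes query if it ignores the cancellation (response discarded)
Result: Work wasted. The client sees its own timeout rather than a server response, and the orchestrator may keep processing a query nobody is waiting for.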
Mitigation:
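- Set handler timeout >= orchestrator timeout, and client timeout >= handler timeout, so each layer hears back from the layer below before giving up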
- Monitor orchestrator latency
- Alert on slow queries
Recovery Strategies
Strategy 1: Retry on Transient Failure
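func queryWithRetry(ctx context.Context, client orchestrator.Client, req *pb.QueryRequest) (*pb.QueryResponse, error) {
	var lastErr error
	for attempt := 0; attempt < 3; attempt++ {
		// Use the caller's context so request timeouts and cancellation apply
		resp, err := client.Query(ctx, req)
		if err == nil {
			return resp, nil
		}
		lastErr = err
		// Retry only on transient gRPC codes
		switch status.Code(err) {
		case codes.Unavailable, codes.DeadlineExceeded:
			// transient: back off and try again
		default:
			return nil, err // non-retryable error
		}
		// Linear backoff (100ms, 200ms, 300ms); stop if the caller's context ends
		select {
		case <-time.After(time.Duration(100*(attempt+1)) * time.Millisecond):
		case <-ctx.Done():
			return nil, ctx.Err()
		}
	}
	return nil, fmt.Errorf("query failed after 3 attempts: %w", lastErr)
}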
When to retry:
- codes.Unavailable (orchestrator temporarily down)
- codes.DeadlineExceeded (timeout, but might succeed on retry)
When NOT to retry:
- codes.InvalidArgument (bad input, won't change)
- codes.PermissionDenied (auth issue, won't change)
- codes.NotFound (resource doesn't exist)
Strategy 2: Circuit Breaker
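type CircuitBreaker struct {
	mu           sync.Mutex // guards the fields below; handlers call concurrently
	failureCount int
	lastFailTime time.Time
	state        string // "closed", "open", "half-open"; the zero value acts as closed
}
func (cb *CircuitBreaker) Call(fn func() error) error {
	cb.mu.Lock()
	if cb.state == "open" {
		if time.Since(cb.lastFailTime) > 5*time.Second {
			cb.state = "half-open" // cool-down elapsed: let a trial call through
		} else {
			cb.mu.Unlock()
			return fmt.Errorf("circuit breaker open")
		}
	}
	cb.mu.Unlock()
	err := fn() // run the call without holding the lock
	cb.mu.Lock()
	defer cb.mu.Unlock()
	if err != nil {
		cb.failureCount++
		cb.lastFailTime = time.Now()
		if cb.failureCount > 5 {
			cb.state = "open"
		}
	} else {
		cb.failureCount = 0
		cb.state = "closed"
	}
	return err
}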
States:
- Closed: Normal operation, all requests go through
- Open: Too many failures, reject requests fast (fail-fast)
- Half-open: After the 5-second cool-down, let a trial request through; if it succeeds, close the circuit, otherwise reopen it
Strategy 3: Bulkhead (Goroutine Limit)
type Bulkhead struct {
semaphore chan struct{}
}
func NewBulkhead(maxConcurrent int) *Bulkhead {
return &Bulkhead{
semaphore: make(chan struct{}, maxConcurrent),
}
}
func (b *Bulkhead) Do(fn func() error) error {
select {
case b.semaphore <- struct{}{}:
defer func() { <-b.semaphore }()
return fn()
default:
return fmt.Errorf("bulkhead full")
}
}
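Purpose: Limit concurrent orchestrator calls so that a slow orchestrator cannot pile up goroutines and exhaust memory or connections. Calls beyond the limit fail fast with "bulkhead full" instead of queueing.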
Monitoring & Alerting
Key Metrics to Track
- 401 error rate (compromised tokens?)
- 429 rate limit errors (DDoS or legitimate spike?)
- 500 orchestrator errors (orchestrator health)
- 504 timeouts (slow LLM backend)
- Query latency (p50, p99)
- Token version cache hit rate
- Permission computation duration
Alerts
Alert: if p99_query_latency > 60s for 5 min
→ Orchestrator slow or LLM backend stuck
Alert: if error_rate > 5% for 10 min
→ Check orchestrator logs, database connectivity
Alert: if 401_rate > 1% and spiking
→ Possible token revocation or secret rotation issue
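The percentage thresholds in these alerts need per-status counts to be computed from. A minimal in-process sketch follows, assuming counts are kept in memory and reset per alert window by the caller. A real deployment would more likely export counters to a metrics system; statusCounter and its methods are hypothetical names.

```go
package main

import "sync"

// statusCounter counts responses by HTTP status code.
type statusCounter struct {
	mu     sync.Mutex
	total  int
	counts map[int]int
}

func newStatusCounter() *statusCounter {
	return &statusCounter{counts: make(map[int]int)}
}

// record notes one response with the given status code.
func (c *statusCounter) record(status int) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.total++
	c.counts[status]++
}

// rate returns the fraction of responses that had exactly this status.
func (c *statusCounter) rate(status int) float64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.total == 0 {
		return 0
	}
	return float64(c.counts[status]) / float64(c.total)
}

// errorRate returns the fraction of responses with a 5xx status.
func (c *statusCounter) errorRate() float64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.total == 0 {
		return 0
	}
	errs := 0
	for status, n := range c.counts {
		if status >= 500 {
			errs += n
		}
	}
	return float64(errs) / float64(c.total)
}
```

An alert loop would compare errorRate() against 0.05 and rate(401) against 0.01 at the end of each window, then start a fresh counter.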
References
- Error responses: api-go/internal/routes/respond.go
- Middleware errors: api-go/internal/middleware/*.go
- Handler errors: api-go/internal/routes/*.go
- Orchestrator error handling: api-go/internal/orchestrator/client.go