Error Handling & Edge Cases: Deep Dive

#errors #edge-cases #recovery #resilience

Overview

Error handling spans three layers: middleware early returns, handler-level checks, and orchestrator failures. This document traces error paths and documents edge cases that can cause subtle bugs.


Layer 1: Middleware Errors (Early Returns)

Middleware errors stop request processing immediately and return an HTTP error response.

1.1 Missing Authorization Header

Middleware: RequireAuth
Condition: Authorization header missing or not "Bearer <token>"
HTTP Response: 401 Unauthorized
Body: {"error": "missing Authorization header"}

Code:

token, ok := extractBearer(r)
if !ok {
    httpError(w, 401, "missing Authorization header")
    return // EARLY RETURN: stops here
}

Scenarios:

  • Client sends request without header → 401
  • Client sends header but wrong format (e.g., "Token abc" instead of "Bearer abc") → 401
  • Client sends empty Bearer (e.g., "Bearer ") → 401

1.2 Invalid or Expired JWT

Middleware: RequireAuth
Condition: JWT signature invalid, expired, or malformed
HTTP Response: 401 Unauthorized
Body: {"error": "invalid or expired token"}

Code:

claims, err := auth.ValidateAccessToken(token, secret)
if err != nil {
    httpError(w, 401, "invalid or expired token")
    return
}

Scenarios:

  • Token expired (now > token.exp) → 401
  • Token signature invalid (signed with different secret) → 401
  • Token tampered (claims modified, signature no longer matches) → 401
  • Token is not valid JSON Web Token format → 401

1.3 Token Version Mismatch (Revoked)

Middleware: RequireAuth
Condition: Token version doesn't match current user version
HTTP Response: 401 Unauthorized
Body: {"error": "token revoked"}

Code:

currentVersion, err := store.GetTokenVersion(ctx, claims.Subject)
if err != nil || currentVersion != claims.TokenVersion {
    httpError(w, 401, "token revoked")
    return
}

Scenarios:

  • User's token was revoked (admin incremented user.token_version) → 401
  • User logged in on another device, getting new tokens, old tokens rejected → 401
  • Database query fails to fetch token version → 401 (defensive: fail-secure)

1.4 Rate Limit Exceeded

Middleware: RateLimiter
Condition: Client IP exceeded rate limit
HTTP Response: 429 Too Many Requests
Body: {"error": "rate limit exceeded"}

Code:

if !limiter.AllowN(time.Now(), 1) {
    httpError(w, 429, "rate limit exceeded")
    return
}

Scenarios:

  • Client sends >100 queries/min from same IP → 429
  • Burst limit exceeded (>10 requests in quick succession) → 429
  • Allowance recovers as the 60-second window advances; a client that stops sending regains its full limit

1.5 Tenant Mismatch (Multi-Tenancy)

Middleware: TenantMiddleware
Condition: Invalid or missing tenant in JWT
HTTP Response: None explicit (later queries return empty results instead of a 403)

Code:

if tenantID == "" {
    tenantID = "default" // Fall back
}
ctx := context.WithValue(r.Context(), ContextKeyTenantID, tenantID)

Note: A tenant mismatch never produces an explicit 403. Queries are filtered by tenant, so they return empty results when the data belongs to another tenant.


Layer 2: Handler Errors

Handler errors occur after the middleware chain passes. The handler returns an HTTP error and logs the failure.

2.1 Invalid Request Body

Handler: All handlers
Condition: JSON body malformed or missing required fields
HTTP Response: 400 Bad Request

Code:

var req struct {
    AgentID string `json:"agent_id"`
    Prompt  string `json:"prompt"`
}
if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
    respond.Error(w, 400, "invalid request")
    return
}

Scenarios:

  • Request body is not JSON → 400
  • Required field missing → not caught by Decode, which leaves the field at its zero value (must be validated separately or given a default)
  • JSON field type wrong (e.g., agent_id is number instead of string) → 400

2.2 RBAC Check Failed (Agent Not Allowed)

Handler: handleQuery
Condition: User's allowed_agents list doesn't include agent_id
HTTP Response: 403 Forbidden

Code:

canUse := false
for _, a := range claims.AllowedAgents {
    if a == req.AgentID {
        canUse = true
        break
    }
}
if !canUse {
    respond.Error(w, 403, "agent not allowed")
    return
}

Edge Case: claims.AllowedAgents can be empty (user has no agents). Any query → 403.

2.3 Resource Not Found

Handler: All handlers using st.Store.Get*()
Condition: Agent, user, group, etc. doesn't exist
HTTP Response: 404 Not Found

Code:

agent, err := st.Store.GetAgent(r.Context(), tenantID, req.AgentID)
if err != nil {
    respond.Error(w, 404, "agent not found")
    return
}

Scenarios:

  • Agent name doesn't exist in DB → 404
  • User ID doesn't exist → 404
  • Group name doesn't exist → 404

2.4 Store/Database Error

Handler: Any query to st.Store
Condition: Database connection lost, corrupted, or internal error
HTTP Response: 500 Internal Server Error

Code:

user, err := st.Store.GetUser(ctx, userID)
if err != nil {
    respond.Error(w, 500, "database error")
    log.Errorf("get user failed: %v", err)
    return
}

Scenarios:

  • SQLite database file locked → retried by SQLite only when a busy timeout is configured; otherwise SQLITE_BUSY is returned → 500
  • Postgres connection pool exhausted → 500
  • Corrupted row in DB → 500
  • Query timeout (context deadline exceeded) → 500

Layer 3: Orchestrator Errors

Errors in the orchestrator are serialized as gRPC errors and returned to the API handler.

3.1 Orchestrator Not Running

gRPC Call: client.Query(ctx, req)
Error: Dial error (socket not accessible)
HTTP Response: 500 Internal Server Error

Code:

resp, err := st.Orchestrator.Query(ctx, orchestratorReq)
if err != nil {
    log.Errorf("orchestrator error: %v", err)
    respond.Error(w, 500, "query failed")
    return
}

Scenarios:

  • Orchestrator process crashed → socket doesn't exist → Dial error
  • Orchestrator hung → no response (context timeout) → context.DeadlineExceeded
  • Orchestrator restarting → intermittent failures

3.2 Query Timeout

gRPC Call: client.Query with 30s timeout
Condition: Orchestrator takes >30 seconds
Error: context.DeadlineExceeded
HTTP Response: 504 Gateway Timeout

Code:

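ctx, cancel := context.WithTimeout(r.Context(), 30*time.Second)
defer cancel()
resp, err := st.Orchestrator.Query(ctx, orchestratorReq)
// gRPC reports a timeout as a status error with codes.DeadlineExceeded, so a
// plain == comparison against context.DeadlineExceeded would not match.
if errors.Is(err, context.DeadlineExceeded) || status.Code(err) == codes.DeadlineExceeded {
    respond.Error(w, 504, "query timeout")
    return
}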

Scenarios:

  • LLM backend slow (local inference on Ollama, network latency to the Claude API)
  • Tool execution hanging (tool daemon doesn't respond)
  • Orchestrator overloaded (backlog of queries)

3.3 Agent Tool Validation Failed

Orchestrator Internal Error
Condition: Agent's allowed_tools contains invalid/non-existent tool
Error: Returned as QueryResponse.error field
HTTP Response: 500 Internal Server Error (API propagates orchestrator error)

Scenario:

  • Agent config: allowed_tools = ["web_search", "nonexistent_tool"]
  • Orchestrator tries to load "nonexistent_tool" → error
  • Error serialized in QueryResponse
  • API checks response.Success == false → 500

3.4 Tool Execution Failure

Orchestrator Internal Error
Condition: Tool daemon crashes, timeout, or returns error
Error: Tool result propagated back to LLM
HTTP Response: 200 OK (query may still complete with error message)

Scenario:

  • LLM decides to call "web_search" tool
  • Tool daemon crashes → no response
  • Orchestrator timeout waiting for tool → error
  • LLM incorporates error into context ("tool failed, try different approach")
  • Conversation continues
  • Final response: 200 OK, but tool_used: "web_search", response includes error explanation

Edge Cases & Surprising Behaviors

Edge Case 1: Empty Effective Tools

Setup:

  • Agent: allowed_tools = []
  • User: allowed_tools = ["calculator"]
  • Query: POST /v1/agent-token for this agent

Flow:

// In permission.ComputeEffectiveTools
switch {
case len(agentTools) == 0:
    // Agent has no tools → deny all
    return []string{}
}

Result: Agent token issued with effective_tools = []. Agent can't call any tools.

Expected behavior: Correct. Agent has no permission grants.

Edge Case 2: User with Wildcard Agent

Setup:

  • User: allowed_agents = ["*"]
  • Query: GET /v1/agents

Flow:

// Frontend filtering
claims.AllowedAgents = ["*"]
agents_to_show = agents.filter(a => claims.AllowedAgents.includes(a.name) || claims.AllowedAgents.includes("*"))

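Result: The frontend shows all agents. The handler check in 2.2 compares agent names literally, so unless the backend also treats "*" as a wildcard, every query from this user fails with 403.

Fix: The backend should treat "*" in allowed_agents as matching every agent.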
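
A wildcard-aware version of the loop from 2.2 could look like this. agentAllowed is a hypothetical helper name, and treating "*" as match-all is the assumed wildcard semantics.

```go
package main

import "fmt"

// agentAllowed reports whether agentID is permitted by the allowed list.
// A "*" entry is treated as matching every agent (assumed semantics).
func agentAllowed(allowed []string, agentID string) bool {
    for _, a := range allowed {
        if a == "*" || a == agentID {
            return true
        }
    }
    return false
}

func main() {
    fmt.Println(agentAllowed([]string{"*"}, "helper"))      // true
    fmt.Println(agentAllowed([]string{"helper"}, "helper")) // true
    fmt.Println(agentAllowed([]string{"other"}, "helper"))  // false
    fmt.Println(agentAllowed(nil, "helper"))                // false
}
```

An empty list still denies everything, which preserves the edge case noted in 2.2.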

Edge Case 3: Concurrent Token Revocation

Setup:

  • User has valid access token
  • Admin revokes all user tokens (increments user.token_version)
  • User simultaneously sends query with old token

Timeline:

T+0ms User sends query with token_version = 5
T+1ms Handler checks token version
T+2ms Admin increments user.token_version to 6
T+3ms Handler loads token version from DB: 6
T+4ms Comparison: 5 != 6 → token revoked → 401

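Result: Request rejected (correct). The outcome depends only on whether the handler reads token_version before or after the increment. A read before the increment would let this one request through under the old version; every later request is rejected either way.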

Edge Case 4: Agent Permissions Change During Query

Setup:

  • User has agent token with effective_tools = ["calculator"]
  • Query is in-flight
  • Admin changes agent.allowed_tools to remove "calculator"
  • Admin increments agent.permissions_version

Timeline:

T+0ms Query starts, agent_token.permissions_version = 5
T+1ms Orchestrator receives query
T+500ms Admin changes agent permissions, permissions_version becomes 6
T+1000ms Orchestrator executes tool call (permissions still cached as version 5)
T+1001ms Tool validation: effective_tools still has "calculator" (baked in token)
T+1002ms Tool executes (even though new policy disallows it)
T+2000ms Query completes
T+2001ms Response sent to client

Result: Query completes with old permissions. New permissions take effect on NEXT token issuance.

Why: Permissions are baked at token mint time. Changing agent config doesn't invalidate in-flight tokens.

How to fix: If strict invalidation is needed, use agent.on_permission_change = "abort" (checks permissions_version at request start).

Edge Case 5: Group Ceiling Intersection

Setup:

  • User belongs to 2 groups:
    • Group A: ceiling = ["web_search", "calculator"]
    • Group B: ceiling = ["calculator", "database"]
  • Agent: allowed_tools = ["web_search", "calculator", "database"]

Flow:

// Load user's groups
groupA_ceiling = ["web_search", "calculator"]
groupB_ceiling = ["calculator", "database"]

// Intersect all group ceilings
effectiveGroupCeiling = IntersectToolLists(groupA_ceiling, groupB_ceiling)
// = ["calculator"]

// Now intersect with agent
agentUser = ["web_search", "calculator", "database"] // from agent
afterGroup = IntersectToolLists(agentUser, ["calculator"])
// = ["calculator"]

Result: User can only use "calculator" (intersection of all groups).

Expected behavior: Correct. Multiple group memberships are AND-ed (most restrictive wins).
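
The source names IntersectToolLists but does not show it. A minimal sketch of such a function, applied to the setup above, might look like this; the order-preserving and de-duplicating behavior are assumptions.

```go
package main

import "fmt"

// IntersectToolLists returns the tools present in both a and b, keeping the
// order of a and dropping duplicates. This body is an illustrative guess at
// the real function's behavior.
func IntersectToolLists(a, b []string) []string {
    inB := make(map[string]bool, len(b))
    for _, t := range b {
        inB[t] = true
    }
    out := []string{}
    seen := make(map[string]bool, len(a))
    for _, t := range a {
        if inB[t] && !seen[t] {
            seen[t] = true
            out = append(out, t)
        }
    }
    return out
}

func main() {
    groupA := []string{"web_search", "calculator"}
    groupB := []string{"calculator", "database"}
    agent := []string{"web_search", "calculator", "database"}

    ceiling := IntersectToolLists(groupA, groupB)
    fmt.Println(IntersectToolLists(agent, ceiling)) // [calculator]
}
```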

Edge Case 6: Tool Token Theft

Setup:

  • Orchestrator issues tool token to tool daemon
  • Tool daemon never calls /ingest/tool_result
  • Attacker steals the raw token from logs

Can attacker use token?

// ToolToken is not hashed (unlike refresh tokens)
// It's a full JWT string

// Attacker sends POST /v1/ingest/tool_result
// Authorization: Bearer <stolen_token>

Result: Attacker can ingest tool results, forge execution logs.

Mitigation:

  • ToolTokens are short-lived (5 min) → limited window
  • ToolTokens are session-scoped (agent_id + session_id baked in) → can only affect that session
  • Audit logging records who ingests (tool name from token)

Better fix: Hash tool tokens like refresh tokens (not currently done).
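
One reading of this fix is to log and store only a digest of each tool token, so that a leaked log line cannot be replayed as a credential. The sketch below shows that idea with SHA-256; the function names are hypothetical and the source does not say how refresh tokens are actually hashed.

```go
package main

import (
    "crypto/sha256"
    "crypto/subtle"
    "encoding/hex"
    "fmt"
)

// tokenFingerprint returns a hex SHA-256 digest that can be logged or stored
// in place of the raw token.
func tokenFingerprint(token string) string {
    sum := sha256.Sum256([]byte(token))
    return hex.EncodeToString(sum[:])
}

// matchesFingerprint hashes the presented token and compares it with the
// stored digest in constant time.
func matchesFingerprint(token, storedHex string) bool {
    got := tokenFingerprint(token)
    return subtle.ConstantTimeCompare([]byte(got), []byte(storedHex)) == 1
}

func main() {
    stored := tokenFingerprint("example.tool.token")
    fmt.Println("log this, not the token:", stored[:12]+"...")
    fmt.Println(matchesFingerprint("example.tool.token", stored)) // true
    // Presenting the digest itself does not authenticate.
    fmt.Println(matchesFingerprint(stored, stored)) // false
}
```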

Edge Case 7: Tenant Isolation Bypass

Setup:

  • Multi-tenant system
  • User A from tenant_1 somehow gets token with tenant_id = "default" (bug)

Can user A access tenant_2 data?

// In TenantMiddleware
tenantID = claims.TenantID // "default"
// All queries: WHERE tenant_id = ?

// If another tenant's data somehow has tenant_id = "default", user A can see it!

Result: Potential data leakage.

Mitigation:

  • User accounts are scoped to tenant at creation time
  • JWT claims.TenantID is set from user.tenant_id (not user-controlled)
  • Code review: ensure no queries skip tenant_id filter

Edge Case 8: Orchestrator Slow, Client Timeout

Setup:

  • Client sends query with 5-second timeout
  • Orchestrator takes 10 seconds

Flow:

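T+0ms Client sends request
T+1ms Handler forwards to orchestrator
T+1ms-5000ms Orchestrator processing
T+5000ms Client timeout (5s elapsed) → client closes the connection → request context cancelled
T+5001ms Handler's orchestrator call fails with a cancellation error (context.Canceled, not context.DeadlineExceeded)
T+5002ms Handler returns; any response it writes never reaches the client
T+10000ms Orchestrator finishes query, unless it honours the cancellation (response discarded)

Result: Work wasted. The client reports its own timeout and never sees a 504, while the orchestrator may keep processing a query nobody is waiting for.

Mitigation:

  • Set client timeout >= handler timeout (30s in 3.2), and handler timeout >= orchestrator timeout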
  • Monitor orchestrator latency
  • Alert on slow queries

Recovery Strategies

Strategy 1: Retry on Transient Failure

func queryWithRetry(ctx context.Context, client orchestrator.Client, req *pb.QueryRequest) (*pb.QueryResponse, error) {
    var lastErr error
    for attempt := 0; attempt < 3; attempt++ {
        resp, err := client.Query(ctx, req)
        if err == nil {
            return resp, nil
        }
        lastErr = err

        // Stop if the caller's context is done; a retry cannot succeed
        if ctx.Err() != nil {
            return nil, err
        }

        // Retry on transient errors, with a linearly growing delay
        switch status.Code(err) {
        case codes.Unavailable, codes.DeadlineExceeded:
            time.Sleep(time.Duration(100*(attempt+1)) * time.Millisecond)
            continue
        }

        // Non-retryable error
        return nil, err
    }
    return nil, fmt.Errorf("query failed after 3 attempts: %w", lastErr)
}

When to retry:

  • codes.Unavailable (orchestrator temporarily down)
  • codes.DeadlineExceeded (timeout, but might succeed on retry)

When NOT to retry:

  • codes.InvalidArgument (bad input, won't change)
  • codes.PermissionDenied (auth issue, won't change)
  • codes.NotFound (resource doesn't exist)

Strategy 2: Circuit Breaker

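type CircuitBreaker struct {
    mu           sync.Mutex // guards the fields below
    failureCount int
    lastFailTime time.Time
    state        string // "closed", "open", "half-open"
}

func (cb *CircuitBreaker) Call(fn func() error) error {
    cb.mu.Lock()
    if cb.state == "open" {
        if time.Since(cb.lastFailTime) <= 5*time.Second {
            cb.mu.Unlock()
            return fmt.Errorf("circuit breaker open")
        }
        cb.state = "half-open" // cool-down elapsed: allow a trial request
    }
    cb.mu.Unlock()

    err := fn() // run without holding the lock

    cb.mu.Lock()
    defer cb.mu.Unlock()
    if err != nil {
        cb.failureCount++
        cb.lastFailTime = time.Now()
        // A failed trial reopens at once; otherwise open after more than 5 failures
        if cb.state == "half-open" || cb.failureCount > 5 {
            cb.state = "open"
        }
    } else {
        cb.failureCount = 0
        cb.state = "closed"
    }

    return err
}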

States:

  • Closed: Normal operation, all requests go through
  • Open: Too many failures, reject requests fast (fail-fast)
  • Half-open: After the cool-down, let a trial request through; if it succeeds, close the circuit, and if it fails, reopen it

Strategy 3: Bulkhead (Goroutine Limit)

type Bulkhead struct {
    semaphore chan struct{}
}

func NewBulkhead(maxConcurrent int) *Bulkhead {
    return &Bulkhead{
        semaphore: make(chan struct{}, maxConcurrent),
    }
}

func (b *Bulkhead) Do(fn func() error) error {
    select {
    case b.semaphore <- struct{}{}:
        defer func() { <-b.semaphore }()
        return fn()
    default:
        return fmt.Errorf("bulkhead full")
    }
}

Purpose: Limit concurrent orchestrator calls so that a slow orchestrator cannot tie up an unbounded number of goroutines and connections.


Monitoring & Alerting

Key Metrics to Track

- 401 error rate (compromised tokens?)
- 429 rate limit errors (DDoS or legitimate spike?)
- 500 orchestrator errors (orchestrator health)
- 504 timeouts (slow LLM backend)
- Query latency (p50, p99)
- Token version cache hit rate
- Permission computation duration

Alerts

Alert: if p99_query_latency > 60s for 5 min
→ Orchestrator slow or LLM backend stuck

Alert: if error_rate > 5% for 10 min
→ Check orchestrator logs, database connectivity

Alert: if 401_rate > 1% and spiking
→ Possible token revocation or secret rotation issue

References

  • Error responses: api-go/internal/routes/respond.go
  • Middleware errors: api-go/internal/middleware/*.go
  • Handler errors: api-go/internal/routes/*.go
  • Orchestrator error handling: api-go/internal/orchestrator/client.go