Error Handling & Edge Cases: Deep Dive

#errors #edge-cases #recovery #resilience

Overview

Error handling spans three layers: middleware early returns, handler-level checks, and orchestrator failures. This document traces error paths and documents edge cases that can cause subtle bugs.

Layer 1: Middleware Errors (Early Returns)

Middleware errors stop request processing immediately and return an HTTP error response.

1.1 Missing Authorization Header

Middleware: RequireAuth
Condition: Authorization header missing or not "Bearer <token>"
HTTP Response: 401 Unauthorized
Body: {"error": "missing Authorization header"}

Code:

token, ok := extractBearer(r)
if !ok {
    httpError(w, 401, "missing Authorization header")
    return  // EARLY RETURN: stops here
}

Scenarios:

Client sends request without header → 401
Client sends header but wrong format (e.g., "Token abc" instead of "Bearer abc") → 401
Client sends empty Bearer (e.g., "Bearer ") → 401
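
The three scenarios above can be reproduced with a small parser. This is a minimal sketch, not the project's actual helper: the body of extractBearer and the newRequest test helper are assumptions, and only the (token, ok) return shape is taken from the snippet above.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// extractBearer is a hypothetical implementation; the real helper may differ.
// It returns the token and true only for a well-formed "Bearer <token>" header.
func extractBearer(r *http.Request) (string, bool) {
	header := r.Header.Get("Authorization")
	scheme, token, found := strings.Cut(header, " ")
	if !found || scheme != "Bearer" {
		return "", false // header missing or wrong scheme
	}
	token = strings.TrimSpace(token)
	if token == "" {
		return "", false // "Bearer " with nothing after it
	}
	return token, true
}

// newRequest builds a request carrying the given Authorization header value.
// It exists only for this demonstration.
func newRequest(authHeader string) *http.Request {
	r, _ := http.NewRequest(http.MethodGet, "/v1/agents", nil)
	if authHeader != "" {
		r.Header.Set("Authorization", authHeader)
	}
	return r
}

func main() {
	for _, h := range []string{"", "Token abc", "Bearer ", "Bearer abc"} {
		token, ok := extractBearer(newRequest(h))
		fmt.Printf("%q -> token=%q ok=%v\n", h, token, ok)
	}
}
```

Only the last header yields ok=true; the other three take the early-return path and produce a 401.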

1.2 Invalid or Expired JWT

Middleware: RequireAuth
Condition: JWT signature invalid, expired, or malformed
HTTP Response: 401 Unauthorized
Body: {"error": "invalid or expired token"}

Code:

claims, err := auth.ValidateAccessToken(token, secret)
if err != nil {
    httpError(w, 401, "invalid or expired token")
    return
}

Scenarios:

Token expired (now > token.exp) → 401
Token signature invalid (signed with different secret) → 401
Token tampered (claims modified, signature no longer matches) → 401
Token is not valid JSON Web Token format → 401

1.3 Token Version Mismatch (Revoked)

Middleware: RequireAuth
Condition: Token version doesn't match current user version
HTTP Response: 401 Unauthorized
Body: {"error": "token revoked"}

Code:

currentVersion, err := store.GetTokenVersion(ctx, claims.Subject)
if err != nil || currentVersion != claims.TokenVersion {
    httpError(w, 401, "token revoked")
    return
}

Scenarios:

User's token was revoked (admin incremented user.token_version) → 401
User logged in on another device, getting new tokens, old tokens rejected → 401
Database query fails to fetch token version → 401 (defensive: fail-secure)

1.4 Rate Limit Exceeded

Middleware: RateLimiter
Condition: Client IP exceeded rate limit
HTTP Response: 429 Too Many Requests
Body: {"error": "rate limit exceeded"}

Code:

if !limiter.AllowN(time.Now(), 1) {
    httpError(w, 429, "rate limit exceeded")
    return
}

Scenarios:

Client sends >100 queries/min from same IP → 429
Burst limit exceeded (>10 concurrent requests) → 429
Rate limit resets after 60 seconds (sliding window)

1.5 Tenant Mismatch (Multi-Tenancy)

Middleware: TenantMiddleware
Condition: Invalid or missing tenant in JWT
HTTP Response: None from this middleware (a mismatch surfaces as empty results in later queries, not as an explicit 403)

Code:

if tenantID == "" {
    tenantID = "default"  // Fall back
}
ctx := context.WithValue(r.Context(), ContextKeyTenantID, tenantID)

Note: Tenant mismatch is implicit (queries return empty results if data doesn't belong to tenant). Not an explicit 403.

Layer 2: Handler Errors

Handler errors occur after every middleware has passed. The handler returns an HTTP error and logs the failure.

2.1 Invalid Request Body

Handler: All handlers
Condition: JSON body malformed or missing required fields
HTTP Response: 400 Bad Request

Code:

var req struct {
    AgentID string `json:"agent_id"`
    Prompt  string `json:"prompt"`
}
if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
    respond.Error(w, 400, "invalid request")
    return
}

Scenarios:

Request body is not JSON → 400
Required field missing → not caught by Decode (the field is left at its zero value); must be validated separately or given a default
JSON field type wrong (e.g., agent_id is number instead of string) → 400
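
Because json.Decoder leaves absent fields at their zero value, a missing agent_id passes the Decode check above. A minimal sketch of explicit validation follows; decodeQueryRequest, check, and the error messages are hypothetical names, and the DisallowUnknownFields call is optional strictness rather than something the handlers are known to use.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"io"
	"strings"
)

type queryRequest struct {
	AgentID string `json:"agent_id"`
	Prompt  string `json:"prompt"`
}

// decodeQueryRequest decodes the body and then validates required fields.
// Hypothetical helper: the real handlers may validate differently.
func decodeQueryRequest(body io.Reader) (queryRequest, error) {
	var req queryRequest
	dec := json.NewDecoder(body)
	dec.DisallowUnknownFields() // optional: reject misspelled field names
	if err := dec.Decode(&req); err != nil {
		return req, fmt.Errorf("invalid JSON: %w", err)
	}
	// Decode does not fail on absent fields, so check them explicitly.
	if req.AgentID == "" {
		return req, errors.New("agent_id is required")
	}
	if req.Prompt == "" {
		return req, errors.New("prompt is required")
	}
	return req, nil
}

// check is a small wrapper used by the demonstration below.
func check(body string) error {
	_, err := decodeQueryRequest(strings.NewReader(body))
	return err
}

func main() {
	bodies := []string{
		`{"agent_id":"helper","prompt":"hi"}`,
		`{"prompt":"hi"}`,
		`{"agent_id":42,"prompt":"hi"}`,
		`not json`,
	}
	for _, b := range bodies {
		fmt.Printf("%-40s -> err=%v\n", b, check(b))
	}
}
```

Every error returned here would map to 400 Bad Request in the handler.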

2.2 RBAC Check Failed (Agent Not Allowed)

Handler: handleQuery
Condition: User's allowed_agents list doesn't include agent_id
HTTP Response: 403 Forbidden

Code:

canUse := false
for _, a := range claims.AllowedAgents {
    if a == req.AgentID {
        canUse = true
        break
    }
}
if !canUse {
    respond.Error(w, 403, "agent not allowed")
    return
}

Edge Case: claims.AllowedAgents can be empty (user has no agents). Any query → 403.

2.3 Resource Not Found

Handler: All handlers using st.Store.Get*()
Condition: Agent, user, group, etc. doesn't exist
HTTP Response: 404 Not Found

Code:

agent, err := st.Store.GetAgent(r.Context(), tenantID, req.AgentID)
if err != nil {
    respond.Error(w, 404, "agent not found")
    return
}

Scenarios:

Agent name doesn't exist in DB → 404
User ID doesn't exist → 404
Group name doesn't exist → 404
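
Note that the snippet above maps every store error to 404, so a lost database connection would also be reported as "agent not found" rather than the 500 described in 2.4. One way to separate the two cases is a sentinel error checked with errors.Is. The sketch below assumes the store returns (or wraps) a sentinel named ErrNotFound; that name and the helper functions are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// ErrNotFound is a hypothetical sentinel the store would return for missing rows.
var ErrNotFound = errors.New("not found")

// errConn stands in for an infrastructure failure such as a dropped connection.
var errConn = errors.New("connection reset")

// wrapNotFound mimics a store method that adds context to the sentinel.
func wrapNotFound(name string) error {
	return fmt.Errorf("get agent %q: %w", name, ErrNotFound)
}

// statusForStoreError maps a store error to an HTTP status code.
func statusForStoreError(err error) int {
	switch {
	case err == nil:
		return http.StatusOK
	case errors.Is(err, ErrNotFound):
		return http.StatusNotFound // resource really does not exist
	default:
		return http.StatusInternalServerError // anything else is a server fault
	}
}

func main() {
	fmt.Println(statusForStoreError(nil))
	fmt.Println(statusForStoreError(wrapNotFound("helper")))
	fmt.Println(statusForStoreError(errConn))
}
```

errors.Is sees through the wrapping added by fmt.Errorf with %w, so store methods can attach context without breaking the mapping.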

2.4 Store/Database Error

Handler: Any query to st.Store
Condition: Database connection lost, corrupted, or internal error
HTTP Response: 500 Internal Server Error

Code:

user, err := st.Store.GetUser(ctx, userID)
if err != nil {
    respond.Error(w, 500, "database error")
    log.Errorf("get user failed: %v", err)
    return
}

Scenarios:

SQLite database file locked → retried by SQLite only if a busy timeout is configured; otherwise SQLITE_BUSY surfaces → 500
Postgres connection pool exhausted → 500
Corrupted row in DB → 500
Query timeout (context deadline exceeded) → 500

Layer 3: Orchestrator Errors

Errors in the orchestrator are serialized as gRPC errors and returned to the API handler.

3.1 Orchestrator Not Running

gRPC Call: client.Query(ctx, req)
Error: Dial error (socket not accessible)
HTTP Response: 500 Internal Server Error

Code:

resp, err := st.Orchestrator.Query(ctx, orchestratorReq)
if err != nil {
    log.Errorf("orchestrator error: %v", err)
    respond.Error(w, 500, "query failed")
    return
}

Scenarios:

Orchestrator process crashed → socket doesn't exist → Dial error
Orchestrator hung → no response (context timeout) → context.DeadlineExceeded
Orchestrator restarting → intermittent failures

3.2 Query Timeout

gRPC Call: client.Query with 30s timeout
Condition: Orchestrator takes >30 seconds
Error: context.DeadlineExceeded
HTTP Response: 504 Gateway Timeout

Code:

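ctx, cancel := context.WithTimeout(r.Context(), 30*time.Second)
defer cancel()
resp, err := st.Orchestrator.Query(ctx, orchestratorReq)
// A gRPC client reports an expired deadline as a status error with
// codes.DeadlineExceeded, so a plain == context.DeadlineExceeded misses it.
// errors.Is also covers a client wrapper that returns the context error itself.
if status.Code(err) == codes.DeadlineExceeded || errors.Is(err, context.DeadlineExceeded) {
    respond.Error(w, 504, "query timeout")
    return
}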

Scenarios:

LLM backend slow (slow local inference on Ollama, network latency to the Claude API)
Tool execution hanging (tool daemon doesn't respond)
Orchestrator overloaded (backlog of queries)

3.3 Agent Tool Validation Failed

Orchestrator Internal Error
Condition: Agent's allowed_tools contains invalid/non-existent tool
Error: Returned as QueryResponse.error field
HTTP Response: 500 Internal Server Error (API propagates orchestrator error)

Scenario:

Agent config: allowed_tools = ["web_search", "nonexistent_tool"]
Orchestrator tries to load "nonexistent_tool" → error
Error serialized in QueryResponse
API checks response.Success == false → 500

3.4 Tool Execution Failure

Orchestrator Internal Error
Condition: Tool daemon crashes, times out, or returns an error
Error: Tool result propagated back to LLM
HTTP Response: 200 OK (query may still complete with error message)

Scenario:

LLM decides to call "web_search" tool
Tool daemon crashes → no response
Orchestrator timeout waiting for tool → error
LLM incorporates error into context ("tool failed, try different approach")
Conversation continues
Final response: 200 OK, but tool_used: "web_search", response includes error explanation
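
The scenario above, where a tool failure becomes text for the LLM instead of an HTTP error, can be sketched as follows. The toolFunc signature, runTool, and the message wording are all assumptions made for illustration; the orchestrator's real tool interface is not shown in this document.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// demoTimeout is deliberately short so the example finishes quickly.
const demoTimeout = 50 * time.Millisecond

// toolFunc is a hypothetical signature for invoking a tool daemon.
type toolFunc func(ctx context.Context, input string) (string, error)

// runTool calls the tool with a deadline and always returns text for the LLM.
// failed reports whether the text describes an error rather than a result.
func runTool(name string, tool toolFunc, input string, timeout time.Duration) (text string, failed bool) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	type outcome struct {
		out string
		err error
	}
	done := make(chan outcome, 1) // buffered: the goroutine never blocks after a timeout
	go func() {
		out, err := tool(ctx, input)
		done <- outcome{out, err}
	}()

	select {
	case o := <-done:
		if o.err != nil {
			return fmt.Sprintf("tool %s failed: %v", name, o.err), true
		}
		return o.out, false
	case <-ctx.Done():
		return fmt.Sprintf("tool %s failed: timed out after %s", name, timeout), true
	}
}

func okTool(_ context.Context, input string) (string, error) {
	return "result for " + input, nil
}

func errTool(_ context.Context, _ string) (string, error) {
	return "", errors.New("upstream returned 502")
}

// hungTool never answers on its own; it only returns once the context ends.
func hungTool(ctx context.Context, _ string) (string, error) {
	<-ctx.Done()
	return "", ctx.Err()
}

func main() {
	for _, tool := range []toolFunc{okTool, errTool, hungTool} {
		text, failed := runTool("web_search", tool, "go", demoTimeout)
		fmt.Printf("failed=%v text=%q\n", failed, text)
	}
}
```

In all three cases the caller receives a string it can append to the conversation, which is why the HTTP response stays 200 OK.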

Edge Cases & Surprising Behaviors

Edge Case 1: Empty Effective Tools

Setup:

Agent: allowed_tools = []
User: allowed_tools = ["calculator"]
Query: POST /v1/agent-token for this agent

Flow:

// In permission.ComputeEffectiveTools
switch {
case len(agentTools) == 0:
    // Agent has no tools → deny all
    return []string{}
}

Result: Agent token issued with effective_tools = []. Agent can't call any tools.

Expected behavior: Correct. Agent has no permission grants.

Edge Case 2: User with Wildcard Agent

Setup:

User: allowed_agents = ["*"]
Query: GET /v1/agents

Flow:

// Frontend filtering
claims.AllowedAgents = ["*"]
agents_to_show = agents.filter(a => claims.AllowedAgents.includes(a.name) || claims.AllowedAgents.includes("*"))

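Result: Frontend shows all agents. The backend RBAC loop in 2.2 compares names literally, so unless it special-cases "*", every query from this user is rejected with 403 even though the agent was listed.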

Fix: Backend should check for "*" in allowed_agents list.
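
A minimal sketch of that check, written as a drop-in replacement for the loop in 2.2. The function name agentAllowed is hypothetical, and treating "*" as "all agents" is the behavior suggested by the fix, not confirmed backend behavior.

```go
package main

import "fmt"

// agentAllowed reports whether agentID is permitted by the allowed list.
// A "*" entry grants access to every agent (assumed semantics).
func agentAllowed(allowed []string, agentID string) bool {
	for _, a := range allowed {
		if a == "*" || a == agentID {
			return true
		}
	}
	return false // also covers the empty-list case: no agents, always denied
}

func main() {
	fmt.Println(agentAllowed([]string{"*"}, "helper"))
	fmt.Println(agentAllowed([]string{"helper"}, "helper"))
	fmt.Println(agentAllowed([]string{"helper"}, "other"))
	fmt.Println(agentAllowed(nil, "helper"))
}
```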

Edge Case 3: Concurrent Token Revocation

Setup:

User has valid access token
Admin revokes all user tokens (increments user.token_version)
User simultaneously sends query with old token

Timeline:

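T+0ms   User sends query with token_version = 5
T+1ms   RequireAuth validates the JWT and starts the version check
T+2ms   Admin increments user.token_version to 6
T+3ms   RequireAuth loads token version from DB: 6
T+4ms   Comparison: 5 != 6 → token revoked → 401

Result: Request rejected (correct). The check is a single DB read, so each request sees either the old or the new version. Had the read landed before the increment, this one request would have proceeded with the old token; every later request is rejected.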

Edge Case 4: Agent Permissions Change During Query

Setup:

User has agent token with effective_tools = ["calculator"]
Query is in-flight
Admin changes agent.allowed_tools to remove "calculator"
Admin increments agent.permissions_version

Timeline:

T+0ms   Query starts, agent_token.permissions_version = 5
T+1ms   Orchestrator receives query
T+500ms Admin changes agent permissions, permissions_version becomes 6
T+1000ms Orchestrator executes tool call (permissions still cached as version 5)
T+1001ms Tool validation: effective_tools still has "calculator" (baked in token)
T+1002ms Tool executes (even though new policy disallows it)
T+2000ms Query completes
T+2001ms Response sent to client

Result: Query completes with old permissions. New permissions take effect on NEXT token issuance.

Why: Permissions are baked at token mint time. Changing agent config doesn't invalidate in-flight tokens.

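How to fix: If strict invalidation is needed, set agent.on_permission_change = "abort", which checks permissions_version at request start. A change that lands after that check, as at T+500ms above, would still not interrupt the in-flight query.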

Edge Case 5: Group Ceiling Intersection

Setup:

User belongs to 2 groups:
- Group A: ceiling = ["web_search", "calculator"]
- Group B: ceiling = ["calculator", "database"]
Agent: allowed_tools = ["web_search", "calculator", "database"]

Flow:

// Load user's groups
groupA_ceiling = ["web_search", "calculator"]
groupB_ceiling = ["calculator", "database"]

// Intersect all group ceilings
effectiveGroupCeiling = IntersectToolLists(groupA_ceiling, groupB_ceiling)
// = ["calculator"]

// Now intersect with agent
agentUser = ["web_search", "calculator", "database"]  // from agent
afterGroup = IntersectToolLists(agentUser, ["calculator"])
// = ["calculator"]

Result: User can only use "calculator" (intersection of all groups).

Expected behavior: Correct. Multiple group memberships are AND-ed (most restrictive wins).
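
The intersection can be sketched as below. The body of IntersectToolLists and the effectiveTools wrapper are assumptions; only the function name and the example inputs come from the flow above. The sketch also assumes that a user with no groups has no ceiling at all.

```go
package main

import "fmt"

// IntersectToolLists returns the tools present in both a and b, in a's order.
// Hypothetical implementation; the real one may differ.
func IntersectToolLists(a, b []string) []string {
	inB := make(map[string]struct{}, len(b))
	for _, t := range b {
		inB[t] = struct{}{}
	}
	out := []string{}
	for _, t := range a {
		if _, ok := inB[t]; ok {
			out = append(out, t)
		}
	}
	return out
}

// effectiveTools narrows the agent's tools by every group ceiling in turn.
// With no ceilings the agent's list is returned unchanged (assumption).
func effectiveTools(agentTools []string, ceilings ...[]string) []string {
	result := agentTools
	for _, c := range ceilings {
		result = IntersectToolLists(result, c)
	}
	return result
}

func main() {
	agent := []string{"web_search", "calculator", "database"}
	groupA := []string{"web_search", "calculator"}
	groupB := []string{"calculator", "database"}
	fmt.Println(effectiveTools(agent, groupA, groupB)) // prints [calculator]
}
```

Because intersection is associative and commutative, folding the ceilings one at a time gives the same set as intersecting the groups first and the agent last.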

Edge Case 6: Tool Token Theft

Setup:

Orchestrator issues tool token to tool daemon
Tool daemon never calls /ingest/tool_result
Attacker steals the token from logs (the full JWT, since tool tokens are not hashed)

Can attacker use token?

// ToolToken is not hashed (unlike refresh tokens)
// It's a full JWT string

// Attacker sends POST /v1/ingest/tool_result
// Authorization: Bearer <stolen_token>

Result: Attacker can ingest tool results, forge execution logs.

Mitigation:

ToolTokens are short-lived (5 min) → limited window
ToolTokens are session-scoped (agent_id + session_id baked in) → can only affect that session
Audit logging records who ingests (tool name from token)

Better fix: Hash tool tokens like refresh tokens (not currently done).
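
A minimal sketch of what hashing could look like, assuming only a digest of the token is ever stored or logged. How refresh tokens are actually hashed in this codebase is not shown here, so the choice of SHA-256 and the names hashToken and tokenMatches are assumptions.

```go
package main

import (
	"crypto/sha256"
	"crypto/subtle"
	"encoding/hex"
	"fmt"
)

// hashToken returns the hex-encoded SHA-256 digest of a token.
// Only this digest would be stored or logged, never the raw token.
func hashToken(token string) string {
	sum := sha256.Sum256([]byte(token))
	return hex.EncodeToString(sum[:])
}

// tokenMatches checks a presented raw token against a stored digest,
// comparing in constant time to avoid leaking how many characters matched.
func tokenMatches(presented, storedHash string) bool {
	presentedHash := hashToken(presented)
	return subtle.ConstantTimeCompare([]byte(presentedHash), []byte(storedHash)) == 1
}

func main() {
	stored := hashToken("tool-token-123")
	fmt.Println(len(stored))                            // hex digest length
	fmt.Println(tokenMatches("tool-token-123", stored)) // raw token is accepted
	fmt.Println(tokenMatches(stored, stored))           // leaked digest is not
}
```

With this scheme, a digest copied from logs cannot be replayed as a bearer token, because the server hashes whatever is presented before comparing.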

Edge Case 7: Tenant Isolation Bypass

Setup:

Multi-tenant system
User A from tenant_1 somehow gets token with tenant_id = "default" (bug)

Can user A access tenant_2 data?

// In TenantMiddleware
tenantID = claims.TenantID  // "default"
// All queries: WHERE tenant_id = ?

// If another tenant's data somehow has tenant_id = "default", user A can see it!

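Result: Potential data leakage. The fallback in 1.5 makes this state reachable: any token with an empty tenant claim is treated as tenant "default".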

Mitigation:

User accounts are scoped to tenant at creation time
JWT claims.TenantID is set from user.tenant_id (not user-controlled)
Code review: ensure no queries skip tenant_id filter

Edge Case 8: Orchestrator Slow, Client Timeout

Setup:

Client sends query with 5-second timeout
Orchestrator takes 10 seconds

Flow:

T+0ms   Client sends request
T+1ms   Handler forwards to orchestrator
T+1ms-5000ms   Orchestrator processing
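T+5000ms Client timeout (5s elapsed) → client closes the connection → r.Context() cancelled
T+5001ms Handler receives context.Canceled (codes.Canceled over gRPC), not DeadlineExceeded
T+5002ms Handler returns; any status it writes (e.g., 504) is never delivered
T+10000ms Orchestrator finishes query if it ignores the cancellation (response discarded)

Result: Work wasted (the orchestrator may keep processing after the client has already given up).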

Mitigation:

Set handler timeout >= orchestrator timeout
Monitor orchestrator latency
Alert on slow queries
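
To tell a client disconnect apart from a server-side deadline, the handler can inspect the error with errors.Is. The sketch below uses plain context errors and a simulated call; with a real gRPC client the same distinction shows up as codes.Canceled versus codes.DeadlineExceeded. All function names here are hypothetical.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// classify labels why a call ended.
func classify(err error) string {
	switch {
	case err == nil:
		return "ok"
	case errors.Is(err, context.DeadlineExceeded):
		return "timeout" // server-side deadline hit: 504 is appropriate
	case errors.Is(err, context.Canceled):
		return "client gone" // caller cancelled: nobody is left to read a response
	default:
		return "error"
	}
}

// slowCall simulates an orchestrator call that honors cancellation.
func slowCall(ctx context.Context, d time.Duration) error {
	select {
	case <-time.After(d):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// runWithDeadline models the handler's own timeout expiring first.
func runWithDeadline() string {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Millisecond)
	defer cancel()
	return classify(slowCall(ctx, time.Second))
}

// runWithClientCancel models the client disconnecting mid-request.
func runWithClientCancel() string {
	ctx, cancel := context.WithCancel(context.Background())
	go func() {
		time.Sleep(10 * time.Millisecond)
		cancel()
	}()
	return classify(slowCall(ctx, time.Second))
}

func main() {
	fmt.Println(runWithDeadline())
	fmt.Println(runWithClientCancel())
}
```

Counting the two outcomes separately keeps client-side impatience from inflating the 504 metric.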

Recovery Strategies

Strategy 1: Retry on Transient Failure

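func queryWithRetry(ctx context.Context, client orchestrator.Client, req *pb.QueryRequest) (*pb.QueryResponse, error) {
    const maxAttempts = 3
    var lastErr error
    for attempt := 0; attempt < maxAttempts; attempt++ {
        // Fresh 30s budget per attempt, so a timed-out attempt can be retried
        attemptCtx, cancel := context.WithTimeout(ctx, 30*time.Second)
        resp, err := client.Query(attemptCtx, req)
        cancel()
        if err == nil {
            return resp, nil
        }
        lastErr = err

        // Stop on non-retryable errors, or when the caller's own context is done
        code := status.Code(err)
        if (code != codes.Unavailable && code != codes.DeadlineExceeded) || ctx.Err() != nil {
            return nil, err
        }

        // Linear backoff (100ms, 200ms, ...), abandoned early if the caller gives up
        select {
        case <-time.After(time.Duration(100*(attempt+1)) * time.Millisecond):
        case <-ctx.Done():
            return nil, ctx.Err()
        }
    }
    return nil, fmt.Errorf("query failed after %d attempts: %w", maxAttempts, lastErr)
}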

When to retry:

codes.Unavailable (orchestrator temporarily down)
codes.DeadlineExceeded (timeout, but might succeed on retry)

When NOT to retry:

codes.InvalidArgument (bad input, won't change)
codes.PermissionDenied (auth issue, won't change)
codes.NotFound (resource doesn't exist)

Strategy 2: Circuit Breaker

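type CircuitBreaker struct {
    mu           sync.Mutex // guards the fields below; Call may run concurrently
    failureCount int
    lastFailTime time.Time
    state        string // "closed" (also the zero value ""), "open", "half-open"
}

func (cb *CircuitBreaker) Call(fn func() error) error {
    cb.mu.Lock()
    if cb.state == "open" {
        if time.Since(cb.lastFailTime) <= 5*time.Second {
            cb.mu.Unlock()
            return fmt.Errorf("circuit breaker open")
        }
        cb.state = "half-open" // cool-down elapsed: let a trial request through
    }
    cb.mu.Unlock()

    err := fn() // run without holding the lock so calls are not serialized

    cb.mu.Lock()
    defer cb.mu.Unlock()
    if err != nil {
        cb.failureCount++
        cb.lastFailTime = time.Now()
        // A failed trial reopens immediately; otherwise open after more than 5 failures
        if cb.state == "half-open" || cb.failureCount > 5 {
            cb.state = "open"
        }
    } else {
        cb.failureCount = 0
        cb.state = "closed"
    }
    return err
}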

States:

Closed: Normal operation, all requests go through
Open: Too many failures, reject requests fast (fail-fast)
Half-open: After the cool-down, let a trial request through; if it succeeds, close the circuit, otherwise reopen it

Strategy 3: Bulkhead (Goroutine Limit)

type Bulkhead struct {
    semaphore chan struct{}
}

func NewBulkhead(maxConcurrent int) *Bulkhead {
    return &Bulkhead{
        semaphore: make(chan struct{}, maxConcurrent),
    }
}

func (b *Bulkhead) Do(fn func() error) error {
    select {
    case b.semaphore <- struct{}{}:
        defer func() { <-b.semaphore }()
        return fn()
    default:
        return fmt.Errorf("bulkhead full")
    }
}

Purpose: Limit concurrent orchestrator calls so a slow orchestrator cannot tie up an unbounded number of goroutines and connections.
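
The rejection behavior is easiest to see with a single slot. The sketch below repeats the Bulkhead type so it runs on its own; the demo function and its channel choreography are illustrative only.

```go
package main

import "fmt"

type Bulkhead struct {
	semaphore chan struct{}
}

func NewBulkhead(maxConcurrent int) *Bulkhead {
	return &Bulkhead{semaphore: make(chan struct{}, maxConcurrent)}
}

// Do runs fn if a slot is free and rejects immediately otherwise.
func (b *Bulkhead) Do(fn func() error) error {
	select {
	case b.semaphore <- struct{}{}:
		defer func() { <-b.semaphore }()
		return fn()
	default:
		return fmt.Errorf("bulkhead full")
	}
}

// demo occupies the only slot, attempts a second call while it is held,
// then attempts a third call after the slot has been released.
func demo() (first, second, third error) {
	b := NewBulkhead(1)
	entered := make(chan struct{})
	release := make(chan struct{})
	done := make(chan error, 1)

	go func() {
		done <- b.Do(func() error {
			close(entered) // signal that the slot is now held
			<-release      // simulate a slow orchestrator call
			return nil
		})
	}()

	<-entered
	second = b.Do(func() error { return nil }) // rejected: slot is busy
	close(release)
	first = <-done                            // first call finished, slot freed
	third = b.Do(func() error { return nil }) // accepted again
	return first, second, third
}

func main() {
	first, second, third := demo()
	fmt.Println(first, second, third)
}
```

Because Do never blocks, a handler can translate the "bulkhead full" error into a fast 503 or 429 instead of queueing behind a slow orchestrator.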

Monitoring & Alerting

Key Metrics to Track

- 401 error rate (compromised tokens?)
- 429 rate limit errors (DDoS or legitimate spike?)
- 500 orchestrator errors (orchestrator health)
- 504 timeouts (slow LLM backend)
- Query latency (p50, p99)
- Token version cache hit rate
- Permission computation duration
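
The status-code metrics above can be collected with a small middleware that wraps the ResponseWriter. This is a minimal in-memory sketch; StatusCounter, its methods, and the simulate helper are hypothetical, and a production setup would export these counts to whatever metrics system is in use.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync"
)

// statusRecorder captures the status code written by the wrapped handler.
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (r *statusRecorder) WriteHeader(code int) {
	r.status = code
	r.ResponseWriter.WriteHeader(code)
}

// StatusCounter counts responses per HTTP status code.
type StatusCounter struct {
	mu     sync.Mutex
	counts map[int]int
	total  int
}

func NewStatusCounter() *StatusCounter {
	return &StatusCounter{counts: map[int]int{}}
}

// Middleware records the final status of every request passing through.
func (c *StatusCounter) Middleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
		next.ServeHTTP(rec, r)
		c.mu.Lock()
		c.counts[rec.status]++
		c.total++
		c.mu.Unlock()
	})
}

func (c *StatusCounter) Count(code int) int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.counts[code]
}

// Rate returns the fraction of all responses that carried the given code.
func (c *StatusCounter) Rate(code int) float64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.total == 0 {
		return 0
	}
	return float64(c.counts[code]) / float64(c.total)
}

// simulate sends three unauthenticated requests and one authenticated request.
func simulate() *StatusCounter {
	c := NewStatusCounter()
	h := c.Middleware(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Header.Get("Authorization") == "" {
			w.WriteHeader(http.StatusUnauthorized)
			return
		}
		w.Write([]byte("ok")) // implicit 200
	}))
	for i := 0; i < 4; i++ {
		req := httptest.NewRequest(http.MethodGet, "/v1/agents", nil)
		if i == 3 {
			req.Header.Set("Authorization", "Bearer abc")
		}
		h.ServeHTTP(httptest.NewRecorder(), req)
	}
	return c
}

func main() {
	c := simulate()
	fmt.Println(c.Count(401), c.Count(200), c.Rate(401))
}
```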

Alerts

Alert: if p99_query_latency > 60s for 5 min
  → Orchestrator slow or LLM backend stuck

Alert: if error_rate > 5% for 10 min
  → Check orchestrator logs, database connectivity

Alert: if 401_rate > 1% and spiking
  → Possible token revocation or secret rotation issue

References

Error responses: api-go/internal/routes/respond.go
Middleware errors: api-go/internal/middleware/*.go
Handler errors: api-go/internal/routes/*.go
Orchestrator error handling: api-go/internal/orchestrator/client.go
