fix: add timeout + fail-open to recall search path #1455

Open
kagura-agent wants to merge 1 commit into MemTensor:main from kagura-agent:fix/recall-timeout-fail-open

@kagura-agent

Summary

Fixes #1452: auto-recall can block the gateway startup / first-turn path long enough to fail health checks when embedding or LLM calls are slow.

Problem

With auto-recall enabled and existing memories, the recall search path can block for 30-40 seconds when:

  • Embedding model is slow to respond
  • LLM skill relevance judgment hangs
  • Hub memory search times out

This is enough to trip health checks and cause restart loops.

Solution

Add configurable timeout + fail-open semantics at three layers:

1. Recall engine (recall/engine.ts)

  • withTimeout() helper: races any promise against a deadline, returns fallback on timeout
  • embedder.embedQuery() wrapped with timeout → falls back to FTS-only search (no vector candidates)
  • Hub memory embedding wrapped with timeout → skipped on timeout
  • judgeSkillRelevance() (LLM call) wrapped with timeout → returns all candidates on timeout
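
The helper described in this list can be sketched roughly as follows. This is a minimal illustration, not the code in recall/engine.ts: the signature, the `embedOrNull` caller, and the choice to map a rejected promise to the fallback are all assumptions made for the example.

```typescript
// Hypothetical sketch of a withTimeout() helper; the PR's real signature is not shown above.
// Assumption: a rejected promise is also mapped to the fallback (fail-open on error).
function withTimeout<T>(work: Promise<T>, timeoutMs: number, fallback: T): Promise<T> {
  // Convert rejections into the fallback so the race below can never reject.
  const guarded = work.catch(() => fallback);

  // 0 (or less) means "no timeout", mirroring the recall.timeoutMs convention.
  if (timeoutMs <= 0) return guarded;

  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<T>((resolve) => {
    timer = setTimeout(() => resolve(fallback), timeoutMs);
  });

  // Neither branch can reject, so a plain then() is enough to clear the timer.
  return Promise.race([guarded, deadline]).then((value) => {
    clearTimeout(timer);
    return value;
  });
}

// Illustrative caller: a null embedding tells the search to use FTS candidates only.
async function embedOrNull(
  embedQuery: (query: string) => Promise<number[]>,
  query: string,
  timeoutMs: number,
): Promise<number[] | null> {
  return withTimeout<number[] | null>(embedQuery(query), timeoutMs, null);
}
```

The same wrapper would cover the skill relevance judgment by passing the full candidate list as the fallback. Note that `Promise.race` does not cancel the slow call; it only stops waiting for it.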

2. Tool handler (tools/memory-search.ts)

  • Top-level timeout on the entire memory_search handler
  • Returns empty results with timedOut: true in meta on timeout
  • Never throws — always returns a valid response shape
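
A rough sketch of the handler-level behavior follows. The `MemoryHit` and `SearchResponse` shapes, the `handleMemorySearch` name, and the `runSearch` parameter are hypothetical; only the `timedOut: true` flag in `meta` and the never-throws contract come from the description above.

```typescript
// Hypothetical shapes; the real handler's types are not shown in the PR description.
interface MemoryHit { id: string; text: string }
interface SearchResponse {
  results: MemoryHit[];
  meta: { timedOut: boolean };
}

const emptyResponse = (timedOut: boolean): SearchResponse => ({
  results: [],
  meta: { timedOut },
});

async function handleMemorySearch(
  runSearch: () => Promise<MemoryHit[]>,
  timeoutMs: number,
): Promise<SearchResponse> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  try {
    const search = runSearch().then(
      (results): SearchResponse => ({ results, meta: { timedOut: false } }),
    );
    // 0 (or less) disables the deadline entirely.
    if (timeoutMs <= 0) return await search;

    const deadline = new Promise<SearchResponse>((resolve) => {
      timer = setTimeout(() => resolve(emptyResponse(true)), timeoutMs);
    });
    return await Promise.race([search, deadline]);
  } catch (_err) {
    // Assumption: errors are swallowed and reported as an empty, non-timed-out result.
    return emptyResponse(false);
  } finally {
    clearTimeout(timer);
  }
}
```

In this sketch the underlying search keeps running after the deadline fires; stopping it would need explicit cancellation support (for example an `AbortSignal`) in the search path, which the description does not mention.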

3. Configuration (types.ts, config.ts)

  • New recall.timeoutMs option (default: 10000ms, 0 = no timeout)
  • Operators can tune based on their model latency
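
One way the option could be resolved to an effective value is sketched below. The `RecallConfig` interface and `resolveRecallTimeoutMs` function are hypothetical names; only the `timeoutMs` key, the 10000 ms default, and the "0 = no timeout" rule come from the description. Falling back to the default for negative or non-finite values is an assumption of this sketch.

```typescript
// Hypothetical config shape; only the timeoutMs key is taken from the PR description.
interface RecallConfig {
  timeoutMs?: number;
}

const DEFAULT_RECALL_TIMEOUT_MS = 10000;

function resolveRecallTimeoutMs(recall?: RecallConfig): number {
  const value = recall?.timeoutMs;
  // Assumption: missing, negative, or non-finite values fall back to the default.
  if (value === undefined || !Number.isFinite(value) || value < 0) {
    return DEFAULT_RECALL_TIMEOUT_MS;
  }
  // 0 is passed through unchanged and means "no timeout".
  return value;
}
```

An operator with a slow local embedding model might set `recall.timeoutMs` higher, while a deployment with strict health-check windows might lower it.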

Key Principles

  • Fail-open: timeout/error → partial or empty results, never block
  • No propagation: recall exceptions never reach gateway top level
  • Startup independence: ready state doesn't depend on slow recall

Testing

  • TypeScript typecheck passes (tsc --noEmit)
  • No behavioral changes when recall completes within timeout
  • Timeout only activates when operations exceed configured timeoutMs

Fixes MemTensor#1452 — auto-recall can block gateway startup / first-turn path
long enough to fail health checks when embedding or LLM calls are slow.

Changes:
- Add configurable `recall.timeoutMs` (default 10s)
- Wrap embedder.embedQuery() with timeout in RecallEngine.search();
  falls back to FTS-only results on timeout
- Wrap LLM skill relevance judgment with timeout in searchSkills();
  falls back to returning all candidates on timeout
- Add top-level timeout in memory_search tool handler; returns empty
  results with `timedOut: true` flag on timeout
- All timeouts fail-open: partial/empty results, never throw
- Recall exceptions never propagate to gateway top level