feat: KG-driven model routing with provider probing #761

Merged
AlexMikhalev merged 20 commits into main from task/400-kg-driven-model-routing
Apr 6, 2026

Conversation

@AlexMikhalev
Contributor

Summary

  • Add KG-driven model routing to ADF orchestrator using markdown-defined rules
  • Each routing rule defines route:: + action:: pairs with synonyms:: for Aho-Corasick matching
  • Provider health tracked via circuit breakers (reuses terraphim_spawner::health)
  • Probes CLI tools (opencode, claude) via action:: templates for full-stack availability testing
  • Hot-reload: edit a markdown file, routing updates on next tick without restart

Changes

New files

  • docs/taxonomy/routing_scenarios/adf/ -- 10 KG routing rule markdown files
  • crates/terraphim_orchestrator/src/kg_router.rs -- KG routing engine
  • crates/terraphim_orchestrator/src/provider_probe.rs -- Provider health probing

Modified files

  • crates/terraphim_types/src/lib.rs -- RouteDirective.action, MarkdownDirectives.routes
  • crates/terraphim_automata/src/markdown_directives.rs -- action:: directive parsing, multi-route support
  • crates/terraphim_orchestrator/src/lib.rs -- KG routing in spawn_agent(), health gate
  • crates/terraphim_orchestrator/src/config.rs -- [routing] config section

Design

  • KG routing tried first (Aho-Corasick synonym match against task text)
  • If primary provider unhealthy, falls back to next route in file
  • If no KG match, falls back to existing keyword RoutingEngine
  • action:: template uses {{ model }} and {{ prompt }} placeholders
  • Different CLI tools per route (opencode for kimi/minimax/openai, claude for anthropic)
  • Circuit breaker: 5 failures open the circuit, 60s cooldown, 1 success closes it
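
The circuit-breaker rule in the last bullet can be sketched as follows. This is a minimal illustration under stated assumptions, not the type from terraphim_spawner::health: the name SketchBreaker, its fields, and the choice to count consecutive failures are all hypothetical.

```rust
use std::time::{Duration, Instant};

/// Illustrative breaker; NOT the type from terraphim_spawner::health.
#[derive(Debug)]
pub struct SketchBreaker {
    failure_threshold: u32,
    cooldown: Duration,
    consecutive_failures: u32,
    opened_at: Option<Instant>,
}

impl SketchBreaker {
    pub fn new(failure_threshold: u32, cooldown: Duration) -> Self {
        Self { failure_threshold, cooldown, consecutive_failures: 0, opened_at: None }
    }

    /// Count a failure; open (or re-open) the circuit once the threshold is reached.
    pub fn record_failure(&mut self, now: Instant) {
        self.consecutive_failures = self.consecutive_failures.saturating_add(1);
        if self.consecutive_failures >= self.failure_threshold {
            self.opened_at = Some(now);
        }
    }

    /// A single success closes the circuit and resets the counter.
    pub fn record_success(&mut self) {
        self.consecutive_failures = 0;
        self.opened_at = None;
    }

    /// True while closed, or once the cooldown has elapsed (a trial request is allowed).
    pub fn allows_request(&self, now: Instant) -> bool {
        match self.opened_at {
            None => true,
            Some(opened) => now.saturating_duration_since(opened) >= self.cooldown,
        }
    }
}
```

With SketchBreaker::new(5, Duration::from_secs(60)), the fifth failure blocks requests until 60 seconds have passed, and one record_success() call closes the circuit again.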

Test plan

  • cargo test -p terraphim_automata -- 90 tests (2 new for action/multi-route)
  • cargo test -p terraphim_orchestrator -- 374 tests (11 new for kg_router + provider_probe)
  • cargo check --workspace -- full workspace compiles
  • Deploy to bigbox with [routing] config pointing to taxonomy dir

Refs #400

Generated with Terraphim AI

Terraphim CI and others added 4 commits April 6, 2026 12:27
Create 10 ADF routing rule markdown files with route/action/priority/
synonyms directives for KG-based agent dispatch. Add action:: directive
to RouteDirective for CLI command templates. Support multiple route/action
pairs per file with backward-compatible route field.

Refs #400

Co-Authored-By: Terraphim AI <noreply@terraphim.ai>

KgRouter loads routing rules from markdown taxonomy directory,
builds thesaurus from synonyms, and uses terraphim_automata::find_matches
for Aho-Corasick pattern matching against agent task descriptions.

Returns KgRouteDecision with provider, model, action template, confidence,
and ordered fallback routes. Supports health-aware fallback via
first_healthy_route() and template rendering via render_action().

Refs #400

Co-Authored-By: Terraphim AI <noreply@terraphim.ai>
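
The template rendering mentioned in this commit might look roughly like the sketch below. The signature is an assumption (the real render_action() is not shown in this PR page), and the example template only mirrors the {{ model }} and {{ prompt }} placeholders from the description.

```rust
/// Hypothetical free-function version of render_action(); the real method's
/// signature is not shown in the PR description.
pub fn render_action(template: &str, model: &str, prompt: &str) -> String {
    // Plain textual substitution of the two placeholders named in the PR.
    template
        .replace("{{ model }}", model)
        .replace("{{ prompt }}", prompt)
}
```

Plain substitution does no shell escaping, so a prompt containing quotes or other shell metacharacters would need quoting before the rendered string is handed to a shell.
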
Add provider_probe.rs with ProviderHealthMap using CircuitBreaker from
terraphim_spawner::health. Probes CLI tools via action:: templates from
KG rules, measures latency, saves pi-benchmark compatible JSON results.

Wire KG router into spawn_agent(): KG routing tried first (Aho-Corasick
synonym match), with health-aware fallback skipping unhealthy providers.
Falls back to existing keyword RoutingEngine when no KG match found.

Add [routing] config section to OrchestratorConfig with taxonomy_path,
probe_ttl_secs, probe_results_dir, and probe_on_startup fields.

Refs #400

Co-Authored-By: Terraphim AI <noreply@terraphim.ai>
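
The selection order described in this commit (KG match first, unhealthy providers skipped, keyword RoutingEngine as the last resort) can be sketched as below. Route, Selection, and select() are hypothetical names, and falling through to the keyword engine when every KG route is unhealthy is an assumption, not confirmed by the PR text.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Route {
    pub provider: String,
    pub model: String,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Selection<'a> {
    /// A KG route whose provider passed the health gate.
    Kg(&'a Route),
    /// No usable KG route; defer to the existing keyword engine.
    KeywordEngine,
}

/// First route, in file order, whose provider is currently healthy.
pub fn first_healthy_route<'a>(
    routes: &'a [Route],
    is_healthy: &dyn Fn(&str) -> bool,
) -> Option<&'a Route> {
    routes.iter().find(|r| is_healthy(&r.provider))
}

/// `kg_routes` is None when no synonym matched the task text.
pub fn select<'a>(
    kg_routes: Option<&'a [Route]>,
    is_healthy: &dyn Fn(&str) -> bool,
) -> Selection<'a> {
    kg_routes
        .and_then(|routes| first_healthy_route(routes, is_healthy))
        .map_or(Selection::KeywordEngine, Selection::Kg)
}
```
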
KgRouter now tracks the latest mtime of .md files in the taxonomy
directory. reload_if_changed() compares current mtime against cached
value and rebuilds the Aho-Corasick automaton if files have been
modified. Called on the orchestrator's reconciliation tick for
zero-restart routing updates.

Refs #400

Co-Authored-By: Terraphim AI <noreply@terraphim.ai>
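
A std-only sketch of the mtime comparison behind reload_if_changed() is shown below. ReloadTracker and latest_md_mtime are hypothetical names, and the sketch only scans the top level of the directory; whether the real KgRouter recurses is not stated here.

```rust
use std::fs;
use std::io;
use std::path::Path;
use std::time::SystemTime;

/// Latest modification time among `.md` files directly inside `dir`.
pub fn latest_md_mtime(dir: &Path) -> io::Result<Option<SystemTime>> {
    let mut latest: Option<SystemTime> = None;
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        if path.extension().and_then(|e| e.to_str()) != Some("md") {
            continue; // ignore non-markdown files
        }
        let mtime = fs::metadata(&path)?.modified()?;
        if latest.map_or(true, |cur| mtime > cur) {
            latest = Some(mtime);
        }
    }
    Ok(latest)
}

/// Remembers the last observed mtime between ticks.
pub struct ReloadTracker {
    cached: Option<SystemTime>,
}

impl ReloadTracker {
    pub fn new() -> Self {
        Self { cached: None }
    }

    /// Returns true when the caller should rebuild its automaton.
    pub fn changed(&mut self, dir: &Path) -> io::Result<bool> {
        let current = latest_md_mtime(dir)?;
        if current != self.cached {
            self.cached = current;
            Ok(true)
        } else {
            Ok(false)
        }
    }
}
```

Comparing only the newest mtime is cheap, but it would miss the deletion of a file that was not the most recently modified one; tracking a file count alongside the mtime is one way to cover that case.
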
@github-actions
Contributor

github-actions bot commented Apr 6, 2026

Documentation Preview

Your documentation changes have been deployed to:
https://e33d2ae7.terraphim-docs.pages.dev

This preview will be available until the PR is closed.

Fix D-1: replace deprecated std::io::Error::new(ErrorKind::Other, e)
with std::io::Error::other(e) in provider_probe.rs.

Add verification and validation report from V-model right-side review.

Refs #400

Co-Authored-By: Terraphim AI <noreply@terraphim.ai>
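
For reference, the D-1 replacement pattern looks like this in isolation. The function parse_latency_ms is a hypothetical example, not code from provider_probe.rs; std::io::Error::other is available from Rust 1.74.

```rust
use std::io;

/// Hypothetical helper: parse a latency value, converting the parse error
/// into an io::Error of kind Other.
pub fn parse_latency_ms(raw: &str) -> io::Result<u64> {
    // Older form: .map_err(|e| io::Error::new(io::ErrorKind::Other, e))
    raw.trim().parse::<u64>().map_err(io::Error::other)
}
```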

D-2: probe_all() called on startup when probe_on_startup=true, and
re-probed in reconcile_tick when cached results expire (TTL-based).
Saves JSON results to configured probe_results_dir.

D-3: ExitClassifier ModelError/RateLimit feeds record_failure() into
provider circuit breaker. Success/EmptySuccess feeds record_success().

D-4: reload_if_changed() called every reconcile_tick, checks mtime
of markdown files and rebuilds Aho-Corasick automaton if changed.

D-5: Use sh -c for action template execution instead of
split_whitespace, matching CommandStep::Shell pattern in tinyclaw.
Handles quoted arguments correctly.

Refs #400

Co-Authored-By: Terraphim AI <noreply@terraphim.ai>
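
The D-5 change can be illustrated with a small comparison. Both function names are hypothetical; the sketch only shows why whitespace splitting mangles quoted arguments while handing the whole string to sh -c lets the shell parse them.

```rust
use std::io;
use std::process::{Command, Output};

/// Naive tokenisation: quotes are not understood, so a quoted argument
/// containing spaces is split into several tokens.
pub fn naive_argv(cmd: &str) -> Vec<&str> {
    cmd.split_whitespace().collect()
}

/// Pass the whole command line to the shell, which handles quoting itself.
pub fn run_via_shell(cmd: &str) -> io::Result<Output> {
    Command::new("sh").arg("-c").arg(cmd).output()
}
```

For the input `echo "hello world"`, naive_argv yields three tokens with stray quote characters, whereas the shell sees a single argument.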

The probe's sh -c does not have ~/.local/bin, ~/.bun/bin, or ~/.cargo/bin
on PATH, which is where opencode and claude live. Use bash -lc (login shell)
to source the user profile, matching the systemd ExecStart pattern.

Refs #400

Co-Authored-By: Terraphim AI <noreply@terraphim.ai>

Replace bash -lc (which fails if .profile has errors) with bash -c
plus explicit PATH prepend of ~/.local/bin, ~/.bun/bin, ~/bin,
~/.cargo/bin, ~/go/bin. Avoids broken .profile sourcing while
ensuring CLI tools are discoverable.

Refs #400

Co-Authored-By: Terraphim AI <noreply@terraphim.ai>
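
One way to express the explicit PATH prepend is sketched below. It sets PATH through the child's environment rather than inside the command string, which is an assumption about the approach; the helper names are hypothetical and the ':' separator makes this Unix-only.

```rust
use std::path::Path;
use std::process::Command;

/// Directories named in the commit message, relative to the home directory.
const EXTRA_BIN_DIRS: [&str; 5] = [".local/bin", ".bun/bin", "bin", ".cargo/bin", "go/bin"];

/// Build a PATH value with the extra directories placed before the inherited one.
pub fn prepended_path(home: &Path, inherited: &str) -> String {
    let mut parts: Vec<String> = EXTRA_BIN_DIRS
        .iter()
        .map(|d| home.join(d).display().to_string())
        .collect();
    if !inherited.is_empty() {
        parts.push(inherited.to_string());
    }
    parts.join(":")
}

/// Non-login bash (no profile sourcing) with the augmented PATH.
pub fn probe_command(home: &Path, inherited: &str, action: &str) -> Command {
    let mut cmd = Command::new("bash");
    cmd.arg("-c")
        .arg(action)
        .env("PATH", prepended_path(home, inherited));
    cmd
}
```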

Terraphim CI and others added 2 commits April 6, 2026 15:41
opencode requires 'run -m provider/model "prompt"' syntax.
All action templates now use {{ model }} placeholder from route
directive instead of hardcoding model names.

Refs #400

Co-Authored-By: Terraphim AI <noreply@terraphim.ai>

Use absolute paths for opencode (/home/alex/.bun/bin/opencode) and
claude (/home/alex/.local/bin/claude). Add --format json to opencode.
Replace pay-per-use opencode/ models with subscription providers:
gpt-5-nano -> opencode-go/minimax-m2.5, minimax-m2.5-free ->
minimax-coding-plan/MiniMax-M2.5.

Refs #400

Co-Authored-By: Terraphim AI <noreply@terraphim.ai>

Validates 10 rules loaded, every route has action:: template,
security_audit matches cargo audit/CVE, reasoning has priority 80,
and multi-route fallback chains are present.

Refs #400

Co-Authored-By: Terraphim AI <noreply@terraphim.ai>

Add e2e test verifying every ADF agent routes to expected provider+model
via KG synonym matching. Fix multi-line synonyms: parser requires
synonyms:: prefix on each line. All 12 agents route correctly.

Refs #400

Co-Authored-By: Terraphim AI <noreply@terraphim.ai>

Expand all 10 routing rules from 2 to 4 routes each:
- Coding tasks: +zai-coding-plan/glm-5-turbo +openai/gpt-5.3-codex
- Reasoning tasks: +zai-coding-plan/glm-5 +openai/gpt-5.4
- Documentation/cost: +zai-coding-plan/glm-5-turbo +openai/gpt-5.4-mini

All subscription providers only (no opencode/ pay-per-use prefix).
E2e test updated: 12/12 agents route correctly with 4 fallbacks.

Refs #400

Co-Authored-By: Terraphim AI <noreply@terraphim.ai>

Probe timeout/error marks provider unhealthy immediately, not after
5 failures. Probe success is authoritative over circuit breaker state.
Mixed results: if ANY model succeeds for a provider, provider is healthy.

This fixes the bug where kimi timed out in probe (30s) but was still
selected as primary because circuit breaker threshold wasn't reached.

Refs #400

Co-Authored-By: Terraphim AI <noreply@terraphim.ai>
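
The "any model succeeds" rule from this commit can be sketched as a simple aggregation. ProbeOutcome, ProbeResult, and provider_health are hypothetical names, and the model strings in any usage are placeholders.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ProbeOutcome {
    Success,
    Timeout,
    Error,
}

pub struct ProbeResult {
    pub provider: String,
    pub model: String,
    pub outcome: ProbeOutcome,
}

/// A provider is healthy if at least one of its probed models succeeded;
/// a provider with only timeouts or errors is unhealthy straight away.
pub fn provider_health(results: &[ProbeResult]) -> HashMap<String, bool> {
    let mut health: HashMap<String, bool> = HashMap::new();
    for r in results {
        let ok = r.outcome == ProbeOutcome::Success;
        let entry = health.entry(r.provider.clone()).or_insert(false);
        *entry = *entry || ok;
    }
    health
}
```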

Replace 10 category-based routing files with 3 tier files:
- planning_tier.md (pri=80): opus for strategic planning, architecture
- review_tier.md (pri=60): haiku for verification, validation, compliance
- implementation_tier.md (pri=50): sonnet for coding, testing, security

KG routing now takes priority over static model config in spawn_agent.
Phase keywords in task text determine tier, not agent name.

E2e test: 13/13 agents route to correct tier:
- 2 agents -> PLANNING (opus): meta-coordinator, product-development
- 5 agents -> REVIEW (haiku): spec-validator, quality-coord, compliance,
  drift-detector, merge-coordinator
- 6 agents -> IMPLEMENTATION (sonnet): security-sentinel, test-guardian,
  implementation-swarm, documentation-gen, browser-qa, log-analyst

Refs #400

Co-Authored-By: Terraphim AI <noreply@terraphim.ai>
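
Tier selection by phase keyword can be sketched as below. A naive lowercase substring scan stands in for the Aho-Corasick matcher, the synonym lists are abbreviated from this commit message, and "highest priority wins when several tiers match" is an assumption about how the priorities are applied.

```rust
pub struct Tier {
    pub name: &'static str,
    pub model: &'static str,
    pub priority: u8,
    pub synonyms: &'static [&'static str],
}

/// Abbreviated tier table; the real rules live in the three tier markdown files.
pub static TIERS: [Tier; 3] = [
    Tier {
        name: "planning",
        model: "opus",
        priority: 80,
        synonyms: &["create a plan", "architecture design", "strategic planning"],
    },
    Tier {
        name: "review",
        model: "haiku",
        priority: 60,
        synonyms: &["verification", "validation", "compliance"],
    },
    Tier {
        name: "implementation",
        model: "sonnet",
        priority: 50,
        synonyms: &["coding", "testing", "security"],
    },
];

/// Substring scan (stand-in for Aho-Corasick); highest-priority match wins.
pub fn route_tier(task: &str) -> Option<&'static Tier> {
    let text = task.to_lowercase();
    TIERS
        .iter()
        .filter(|t| t.synonyms.iter().any(|s| text.contains(*s)))
        .max_by_key(|t| t.priority)
}
```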

When KG tier routing selects a model that uses a different CLI than
the agent's static cli_tool (e.g., claude instead of opencode),
extract the CLI path from the action:: template and use it for the
Provider construction. This enables seamless routing across CLI tools.

Refs #400

Co-Authored-By: Terraphim AI <noreply@terraphim.ai>
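
Extracting the CLI from an action:: template could be as simple as taking the first whitespace-separated token. This is a sketch under that assumption; cli_from_action and cli_name are hypothetical helpers, and a path containing spaces would need real shell-aware parsing.

```rust
use std::path::Path;

/// First token of the action template, assumed to be the CLI path.
pub fn cli_from_action(action: &str) -> Option<&str> {
    action.split_whitespace().next()
}

/// Bare executable name (e.g. for comparison with an agent's static cli_tool).
pub fn cli_name(action: &str) -> Option<&str> {
    cli_from_action(action)
        .and_then(|cli| Path::new(cli).file_name())
        .and_then(|name| name.to_str())
}
```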

opencode run completes in ~11s but the full agent lifecycle (init,
step_start, tool_use, step_finish, next_step, session_end) can take
longer under load. 30s was too tight, causing false-positive timeouts
for the kimi provider. Increase to 60s to match actual completion time.

Refs #400

Co-Authored-By: Terraphim AI <noreply@terraphim.ai>
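
A timeout around a CLI probe can be sketched with the standard library alone by polling the child process. The function run_with_timeout is hypothetical and the polling interval is arbitrary; the orchestrator may well use an async runtime instead.

```rust
use std::io;
use std::process::{Command, ExitStatus, Stdio};
use std::thread;
use std::time::{Duration, Instant};

/// Run `cmd`; Ok(None) means the deadline passed and the child was killed.
pub fn run_with_timeout(cmd: &mut Command, timeout: Duration) -> io::Result<Option<ExitStatus>> {
    let mut child = cmd.stdout(Stdio::null()).stderr(Stdio::null()).spawn()?;
    let deadline = Instant::now() + timeout;
    loop {
        if let Some(status) = child.try_wait()? {
            return Ok(Some(status));
        }
        if Instant::now() >= deadline {
            let _ = child.kill(); // the child may have exited in the meantime
            child.wait()?; // reap the process to avoid a zombie
            return Ok(None);
        }
        thread::sleep(Duration::from_millis(20));
    }
}
```

Passing Duration::from_secs(60) would correspond to the new limit described in this commit.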

Remove ambiguous words (specification, research, design the, blueprint,
triage, risk assessment) that appear in issue bodies and cause review
agents to escalate to opus. Keep only unambiguous planning phrases
like 'create a plan', 'architecture design', 'strategic planning'.

Fixes quality-coordinator being routed to opus when reviewing an issue
whose body contained planning language.

Refs #400

Co-Authored-By: Terraphim AI <noreply@terraphim.ai>

Each non-review agent gets its own git worktree in /tmp/adf-worktrees/
before spawning. Review-tier agents (haiku) skip isolation since they
are read-only. Worktrees are cleaned up after agent exit.

Flow: create_agent_worktree() -> spawn with worktree as working_dir ->
try_commit_agent_work(worktree) -> remove_agent_worktree()

Prevents concurrent agents from corrupting each other's working tree.
Fail-open: if worktree creation fails, agent uses shared working_dir.

Fixes #246 Refs #400

Co-Authored-By: Terraphim AI <noreply@terraphim.ai>
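
The fail-open worktree flow might be sketched as follows. The function names echo the commit message, but their signatures, the working_dir_for helper, and the use of a detached worktree are assumptions; only git worktree add and git worktree remove are relied on.

```rust
use std::path::{Path, PathBuf};
use std::process::{Command, Stdio};

const WORKTREE_ROOT: &str = "/tmp/adf-worktrees";

pub fn worktree_path(agent: &str) -> PathBuf {
    Path::new(WORKTREE_ROOT).join(agent)
}

/// Try `git worktree add`; None on any failure so the caller can fall back.
pub fn create_agent_worktree(repo: &Path, agent: &str) -> Option<PathBuf> {
    let path = worktree_path(agent);
    let status = Command::new("git")
        .arg("-C")
        .arg(repo)
        .args(["worktree", "add", "--detach"])
        .arg(&path)
        .stdout(Stdio::null())
        .stderr(Stdio::null())
        .status()
        .ok()?;
    status.success().then_some(path)
}

/// Review-tier agents are read-only and skip isolation; everyone else gets a
/// worktree, or the shared directory if creation fails (fail-open).
pub fn working_dir_for(repo: &Path, agent: &str, is_review_tier: bool) -> PathBuf {
    if is_review_tier {
        return repo.to_path_buf();
    }
    create_agent_worktree(repo, agent).unwrap_or_else(|| repo.to_path_buf())
}

/// Clean up after the agent exits; returns whether git reported success.
pub fn remove_agent_worktree(repo: &Path, worktree: &Path) -> bool {
    Command::new("git")
        .arg("-C")
        .arg(repo)
        .args(["worktree", "remove", "--force"])
        .arg(worktree)
        .stdout(Stdio::null())
        .stderr(Stdio::null())
        .status()
        .map(|s| s.success())
        .unwrap_or(false)
}
```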

Refs #400

Co-Authored-By: Terraphim AI <noreply@terraphim.ai>

@AlexMikhalev AlexMikhalev merged commit d0611cc into main Apr 6, 2026
35 checks passed
@AlexMikhalev AlexMikhalev deleted the task/400-kg-driven-model-routing branch April 6, 2026 21:02