Enable prompt caching on the Anthropic API provider #3
Open
xMKx wants to merge 1 commit into
The Anthropic provider was sending the system prompt without
cache_control, paying full input price on every fork. The provider
already reads cache_read_input_tokens from the response (lines 180-181)
so the bookkeeping was wired — only the marker was missing.
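As a rough illustration of the bookkeeping mentioned above, the sketch below classifies a response by its cache fields. The `usage` field names are the ones the Anthropic Messages API returns; the `cacheStatus` helper and the `CacheStatus` type are hypothetical and not taken from this PR.

```typescript
// Token accounting fields on an Anthropic Messages API response.
// The cache fields can be absent or null when caching did not engage.
interface Usage {
  input_tokens: number;
  output_tokens: number;
  cache_creation_input_tokens?: number | null;
  cache_read_input_tokens?: number | null;
}

type CacheStatus = "read" | "write" | "none";

// Hypothetical helper (not from the PR): reports whether a request
// read from the cache, wrote to it, or bypassed it entirely.
function cacheStatus(usage: Usage): CacheStatus {
  if ((usage.cache_read_input_tokens ?? 0) > 0) return "read";
  if ((usage.cache_creation_input_tokens ?? 0) > 0) return "write";
  return "none";
}
```

A fork whose status is "none" despite a `cache_control` marker is a hint that the prompt fell below the caching minimum.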
Two coupled changes:
1. src/providers/anthropic-api.ts: switch system from a plain string to
the structured array form with cache_control: { type: "ephemeral" }.
The 5-minute TTL covers parallel forks in a single explore comfortably.
2. src/forker/context-builder.ts: expand buildSystemPrompt with concrete
decision examples, anti-pattern guidance, and "skip silently" lists.
This pushes the prompt from ~255 to ~1309 cl100k tokens — above
Anthropic's 1024-token Sonnet caching minimum. Below that bar the
cache_control marker is silently ignored by the API.
The expanded content is also load-bearing: it gives the model
few-shot examples of what good decision-asking looks like (three
concrete strategies with tradeoffs) and what to skip silently, which
should reduce the rate of trivial-decision forks regardless of
caching savings.
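The first change can be sketched as follows. This is a minimal illustration of the structured `system` form, not the provider's actual code: `buildRequestBody` and its parameters are hypothetical names, and the model identifier is left to the caller.

```typescript
// Text block in the structured (array) form of the Messages API `system` field.
interface SystemTextBlock {
  type: "text";
  text: string;
  cache_control?: { type: "ephemeral" };
}

interface MessagesRequestBody {
  model: string;
  max_tokens: number;
  system: SystemTextBlock[];
  messages: { role: "user" | "assistant"; content: string }[];
}

// Hypothetical helper (not from the PR): wraps the system prompt in a
// single cacheable block instead of passing it as a plain string.
function buildRequestBody(
  model: string,
  systemPrompt: string,
  userText: string,
  maxTokens: number,
): MessagesRequestBody {
  return {
    model,
    max_tokens: maxTokens,
    system: [
      { type: "text", text: systemPrompt, cache_control: { type: "ephemeral" } },
    ],
    messages: [{ role: "user", content: userText }],
  };
}
```

Because the cache key covers the prompt prefix up to the marked block, every fork has to send a byte-identical system prompt for the reads to hit.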
Projected savings for one width=4 depth=3 exploration (system-prompt
slice only):
before: 85 nodes × 255 tok = 21,675 input tok = $0.0650
after: 1 write + 84 reads = 5,499 input tok (effective) = $0.0080
That's an 88% reduction on the system-prompt cost line per tree, and
compounds across every exploration a user runs.
A regression test asserts the prompt stays above 5000 chars (~1100
tokens with a tokenizer-drift buffer) so future edits don't silently
disable caching.
Tests: 405 pass. Build: clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
The Anthropic provider sends the system prompt without cache_control on every fork, so every node in an exploration pays the full input price for an identical instruction block. This PR enables prompt caching and restructures the system prompt to clear Sonnet's 1024-token caching minimum.
Why this is the biggest single token win available
Profiling llmception's Anthropic provider revealed:
- … supportsFork = false)
- The provider already reads cache_read_input_tokens from the response (anthropic-api.ts:180-181): caching bookkeeping is wired, just never triggered because no cache_control was being sent
Two coupled changes
1. src/providers/anthropic-api.ts: actually mark the system prompt for caching
The 5-minute ephemeral TTL covers parallel forks in a single explore comfortably.
2. src/forker/context-builder.ts: expand the system prompt past the 1024-token cache minimum
The previous ~255-token prompt was below Anthropic's Sonnet caching minimum, so cache_control alone would have been silently ignored by the API. To make caching actually engage, the prompt needs to be at least 1024 tokens.
Rather than padding with filler, I expanded the prompt with content that is also useful to the model:
- … AskUserQuestion looks like
Total prompt size: ~1309 cl100k tokens (5167 chars), comfortably above the 1024-token bar with margin for tokenizer drift between cl100k and the real Anthropic tokenizer.
This expanded content is load-bearing — it should reduce the rate of low-value forks regardless of caching savings.
Projected savings (Sonnet pricing, width=4 depth=3 tree, system-prompt slice only)
The effective token count is 1 cache write (255 × 1.25 = ~319 tok at write price) + 84 cache reads (255 × 0.1 = ~25 tok each effective) per tree, after the expanded prompt is in place.
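The arithmetic behind these projections can be re-derived with a small calculator. This is a sketch under stated assumptions: the $3 per million input tokens is inferred from the "21,675 input tok = $0.0650" figure, the 1.25× write and 0.1× read multipliers are the ones quoted above, and the model assumes exactly one cache write followed by reads for every other node (parallel forks that race before the first write lands would add extra writes). All function and type names are hypothetical.

```typescript
interface CachePricing {
  inputUsdPerMTok: number; // base input price per million tokens
  writeMultiplier: number; // cache write cost relative to base input
  readMultiplier: number;  // cache read cost relative to base input
}

// Assumed pricing, inferred from the figures in this PR.
const SONNET: CachePricing = {
  inputUsdPerMTok: 3,
  writeMultiplier: 1.25,
  readMultiplier: 0.1,
};

// Nodes in a full tree: 1 + w + w^2 + ... + w^depth.
function treeNodes(width: number, depth: number): number {
  let total = 0;
  let level = 1;
  for (let d = 0; d <= depth; d++) {
    total += level;
    level *= width;
  }
  return total;
}

// System-prompt cost when every node pays full input price.
function uncachedUsd(nodes: number, promptTokens: number, p: CachePricing): number {
  return (nodes * promptTokens * p.inputUsdPerMTok) / 1e6;
}

// System-prompt cost with one cache write and (nodes - 1) cache reads.
function cachedCost(nodes: number, promptTokens: number, p: CachePricing) {
  const effectiveTokens =
    promptTokens * p.writeMultiplier +
    (nodes - 1) * promptTokens * p.readMultiplier;
  return { effectiveTokens, usd: (effectiveTokens * p.inputUsdPerMTok) / 1e6 };
}
```

Under these assumptions, `cachedCost(85, 1309, SONNET)` comes to roughly 12,632 effective tokens (about $0.038), while the same call with 255 tokens comes to roughly 2,461 (about $0.0074). The result is sensitive to which prompt size is plugged in, so the parameter is worth stating alongside any quoted saving.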
Regression guard
Added a unit test asserting the system prompt stays above 5000 chars (~1100 tokens with buffer). If a future edit drops below the cache threshold, the test fails before the regression ships.
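A guard of this kind might look like the sketch below. It is not the test from this PR: the helper name, the constant, and the error message are hypothetical, and it is written as a plain throwing function so it does not depend on any particular test runner.

```typescript
// Character floor from the PR description; chosen to stay above the
// 1024-token caching minimum with a buffer for tokenizer drift.
const MIN_SYSTEM_PROMPT_CHARS = 5000;

// Hypothetical guard (not from the PR): throws when the prompt is at or
// below the floor, so a shrinking edit fails loudly instead of silently
// disabling caching.
function assertPromptAboveCacheFloor(
  prompt: string,
  minChars: number = MIN_SYSTEM_PROMPT_CHARS,
): void {
  if (prompt.length <= minChars) {
    throw new Error(
      `System prompt is ${prompt.length} chars; it must stay above ` +
        `${minChars} to keep prompt caching engaged.`,
    );
  }
}
```

In the project's test suite this would be called on the output of buildSystemPrompt from src/forker/context-builder.ts, assuming that function is exported and callable from a test.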
Caveats
- … gpt-tokenizer cl100k_base, not Claude's. Absolute counts will differ slightly; the 28% margin above the 1024-token bar should absorb the drift. Anyone with access to Anthropic's tokenizer can verify.
- … src/providers/openai-api.ts) has its own caching model and is not touched by this PR. Separate work.
Test plan
- … npm test)
- … npm run build)
- … explore against the Anthropic API to confirm projected savings (in progress, separate validation pass with a ~$1 burn budget)
Followups (separate PRs)
- … Compact decision-history format: 12-21% fewer tokens)
- … --resume + --append-system-prompt on the Claude CLI provider duplicates the prompt across forks
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>