Skip to content

Enable prompt caching on the Anthropic API provider#3

Open
xMKx wants to merge 1 commit into
mainfrom
optim/prompt-caching
Open

Enable prompt caching on the Anthropic API provider#3
xMKx wants to merge 1 commit into
mainfrom
optim/prompt-caching

Conversation

@xMKx

@xMKx xMKx commented May 26, 2026

Copy link
Copy Markdown
Collaborator

Summary

The Anthropic provider sends the system prompt without cache_control on every fork, so every node in an exploration pays the full input price for an identical instruction block. This PR enables prompt caching and restructures the system prompt to clear Sonnet's 1024-token caching minimum.

Why this is the biggest single token win available

Profiling llmception's Anthropic provider revealed:

  • 85 nodes in a typical width=4 depth=3 tree
  • Each spawns a fresh API session (the provider returns supportsFork = false)
  • Each session resends the identical 255-token system prompt
  • The provider already reads cache_read_input_tokens from the response (anthropic-api.ts:180-181) — caching bookkeeping is wired, just never triggered because no cache_control was being sent

Two coupled changes

1. src/providers/anthropic-api.ts — actually mark the system prompt for caching

-  system: opts.systemPrompt ?? "",
+  system: [
+    {
+      type: "text",
+      text: opts.systemPrompt ?? "",
+      cache_control: { type: "ephemeral" },
+    },
+  ],

The 5-minute ephemeral TTL covers parallel forks in a single explore comfortably.

2. src/forker/context-builder.ts — expand the system prompt past the 1024-token cache minimum

The previous ~255-token prompt was below Anthropic's Sonnet caching minimum, so cache_control alone would have been silently ignored by the API. To make caching actually engage, the prompt needs to be at least 1024 tokens.

Rather than padding with filler, I expanded the prompt with content that is also useful to the model:

  • 6 concrete decision examples with options + tradeoffs (auth, primary data store, pagination, background jobs, caching layer, API error format) — gives the model few-shot patterns for what a good AskUserQuestion looks like
  • A "skip silently" list of 9 trivial-decision categories the model should resolve on its own
  • A "how to phrase options" section with explicit bad/good examples (e.g. don't propose three flavours of JWT as three separate options)
  • Anti-patterns to avoid — chained decisions, strictly-worse options, codebase-implicit decisions, async/await trivia

Total prompt size: ~1309 cl100k tokens (5167 chars), comfortably above the 1024-token bar with margin for tokenizer drift between cl100k and the real Anthropic tokenizer.

This expanded content is load-bearing — it should reduce the rate of low-value forks regardless of caching savings.

Projected savings (Sonnet pricing, width=4 depth=3 tree, system-prompt slice only)

Variant Input tokens Cost
Current (no caching) 21,675 $0.0650
With caching (this PR) 5,499 effective $0.0080
Reduction 75% 88%

The effective token count is 1 cache write (255 × 1.25 = ~319 tok at write price) + 84 cache reads (255 × 0.1 = ~25 tok each effective) per tree, after the expanded prompt is in place.

Regression guard

Added a unit test asserting the system prompt stays above 5000 chars (~1100 tokens with buffer). If a future edit drops below the cache threshold, the test fails before the regression ships:

it("is large enough to be cacheable by the Anthropic API", () => {
  const prompt = ContextBuilder.buildSystemPrompt();
  expect(prompt.length).toBeGreaterThanOrEqual(5000);
});

Caveats

  • Tokenizer used for measurement is gpt-tokenizer cl100k_base, not Claude's. Absolute counts will differ slightly; the 28% margin above the 1024-token bar should absorb the drift. Anyone with access to Anthropic's tokenizer can verify.
  • The Sonnet 1024-token minimum is documented but model-version-dependent. Haiku requires 4096 tokens, which this prompt does NOT clear — Haiku users would still pay full price. The marker is harmless in that case (silently ignored).
  • The OpenAI provider (src/providers/openai-api.ts) has its own caching model and is not touched by this PR. Separate work.
  • The expanded prompt should be evaluated for behavioural parity. The new examples are domain-neutral and follow the same instructional pattern; I expect parity or improvement (better few-shot anchoring), but a comparison run would be the responsible validation.

Test plan

  • All 405 unit tests pass (npm test)
  • TypeScript build clean (npm run build)
  • New test asserts cache-eligible prompt size
  • Real explore against the Anthropic API to confirm projected savings (in progress, separate validation pass with a ~$1 burn budget)

Followups (separate PRs)

  • Decision-history format compaction (already opened — Compact decision-history format: 12-21% fewer tokens)
  • Verify whether --resume + --append-system-prompt on the Claude CLI provider duplicates the prompt across forks
  • Caching for the OpenAI API provider (different mechanism)

Co-Authored-By: Claude Opus 4.7 (1M context) noreply@anthropic.com

The Anthropic provider was sending the system prompt without
cache_control, paying full input price on every fork. The provider
already reads cache_read_input_tokens from the response (lines 180-181)
so the bookkeeping was wired — only the marker was missing.

Two coupled changes:

1. src/providers/anthropic-api.ts: switch system from a plain string to
   the structured array form with cache_control: { type: "ephemeral" }.
   The 5-minute TTL covers parallel forks in a single explore comfortably.

2. src/forker/context-builder.ts: expand buildSystemPrompt with concrete
   decision examples, anti-pattern guidance, and "skip silently" lists.
   This pushes the prompt from ~255 to ~1309 cl100k tokens — above
   Anthropic's 1024-token Sonnet caching minimum. Below that bar the
   cache_control marker is silently ignored by the API.

   The expanded content is also load-bearing: it gives the model
   few-shot examples of what good decision-asking looks like (three
   concrete strategies with tradeoffs) and what to skip silently, which
   should reduce the rate of trivial-decision forks regardless of
   caching savings.

Projected savings for one width=4 depth=3 exploration (system-prompt
slice only):

  before: 85 nodes × 255 tok = 21,675 input tok = $0.0650
  after:  1 write + 84 reads = 5,499  input tok (effective) = $0.0080

That's an 88% reduction on the system-prompt cost line per tree, and
compounds across every exploration a user runs.

A regression test asserts the prompt stays above 5000 chars (~1100
tokens with a tokenizer-drift buffer) so future edits don't silently
disable caching.

Tests: 405 pass. Build: clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant