Enable prompt caching on the Anthropic API provider by xMKx · Pull Request #3 · PQCWorld/llmception

xMKx · 2026-05-26T08:59:47Z

Summary

The Anthropic provider sends the system prompt without cache_control on every fork, so every node in an exploration pays the full input price for an identical instruction block. This PR enables prompt caching and restructures the system prompt to clear Sonnet's 1024-token caching minimum.

Why this is the biggest single token win available

Profiling llmception's Anthropic provider revealed:

85 nodes in a typical width=4 depth=3 tree
Each spawns a fresh API session (the provider returns supportsFork = false)
Each session resends the identical 255-token system prompt
The provider already reads cache_read_input_tokens from the response (anthropic-api.ts:180-181) — caching bookkeeping is wired, just never triggered because no cache_control was being sent

Two coupled changes

1. `src/providers/anthropic-api.ts` — actually mark the system prompt for caching

-  system: opts.systemPrompt ?? "",
+  system: [
+    {
+      type: "text",
+      text: opts.systemPrompt ?? "",
+      cache_control: { type: "ephemeral" },
+    },
+  ],

The 5-minute ephemeral TTL covers parallel forks in a single explore comfortably.

2. `src/forker/context-builder.ts` — expand the system prompt past the 1024-token cache minimum

The previous ~255-token prompt was below Anthropic's Sonnet caching minimum, so cache_control alone would have been silently ignored by the API. To make caching actually engage, the prompt needs to be at least 1024 tokens.

Rather than padding with filler, I expanded the prompt with content that is also useful to the model:

6 concrete decision examples with options + tradeoffs (auth, primary data store, pagination, background jobs, caching layer, API error format) — gives the model few-shot patterns for what a good AskUserQuestion looks like
A "skip silently" list of 9 trivial-decision categories the model should resolve on its own
A "how to phrase options" section with explicit bad/good examples (e.g. don't propose three flavours of JWT as three separate options)
Anti-patterns to avoid — chained decisions, strictly-worse options, codebase-implicit decisions, async/await trivia

Total prompt size: ~1309 cl100k tokens (5167 chars), comfortably above the 1024-token bar with margin for tokenizer drift between cl100k and the real Anthropic tokenizer.

This expanded content is load-bearing — it should reduce the rate of low-value forks regardless of caching savings.

Projected savings (Sonnet pricing, width=4 depth=3 tree, system-prompt slice only)

Variant	Input tokens	Cost
Current (no caching)	21,675	$0.0650
With caching (this PR)	5,499 effective	$0.0080
Reduction	75%	88%

The effective token count is 1 cache write (255 × 1.25 = ~319 tok at write price) + 84 cache reads (255 × 0.1 = ~25 tok each effective) per tree, after the expanded prompt is in place.

Regression guard

Added a unit test asserting the system prompt stays above 5000 chars (~1100 tokens with buffer). If a future edit drops below the cache threshold, the test fails before the regression ships:

it("is large enough to be cacheable by the Anthropic API", () => {
  const prompt = ContextBuilder.buildSystemPrompt();
  expect(prompt.length).toBeGreaterThanOrEqual(5000);
});

Caveats

Tokenizer used for measurement is gpt-tokenizer cl100k_base, not Claude's. Absolute counts will differ slightly; the 28% margin above the 1024-token bar should absorb the drift. Anyone with access to Anthropic's tokenizer can verify.
The Sonnet 1024-token minimum is documented but model-version-dependent. Haiku requires 4096 tokens, which this prompt does NOT clear — Haiku users would still pay full price. The marker is harmless in that case (silently ignored).
The OpenAI provider (src/providers/openai-api.ts) has its own caching model and is not touched by this PR. Separate work.
The expanded prompt should be evaluated for behavioural parity. The new examples are domain-neutral and follow the same instructional pattern; I expect parity or improvement (better few-shot anchoring), but a comparison run would be the responsible validation.

Test plan

All 405 unit tests pass (npm test)
TypeScript build clean (npm run build)
New test asserts cache-eligible prompt size
Real explore against the Anthropic API to confirm projected savings (in progress, separate validation pass with a ~$1 burn budget)

Followups (separate PRs)

Decision-history format compaction (already opened — Compact decision-history format: 12-21% fewer tokens)
Verify whether --resume + --append-system-prompt on the Claude CLI provider duplicates the prompt across forks
Caching for the OpenAI API provider (different mechanism)

Co-Authored-By: Claude Opus 4.7 (1M context) noreply@anthropic.com

The Anthropic provider was sending the system prompt without cache_control, paying full input price on every fork. The provider already reads cache_read_input_tokens from the response (lines 180-181) so the bookkeeping was wired — only the marker was missing. Two coupled changes: 1. src/providers/anthropic-api.ts: switch system from a plain string to the structured array form with cache_control: { type: "ephemeral" }. The 5-minute TTL covers parallel forks in a single explore comfortably. 2. src/forker/context-builder.ts: expand buildSystemPrompt with concrete decision examples, anti-pattern guidance, and "skip silently" lists. This pushes the prompt from ~255 to ~1309 cl100k tokens — above Anthropic's 1024-token Sonnet caching minimum. Below that bar the cache_control marker is silently ignored by the API. The expanded content is also load-bearing: it gives the model few-shot examples of what good decision-asking looks like (three concrete strategies with tradeoffs) and what to skip silently, which should reduce the rate of trivial-decision forks regardless of caching savings. Projected savings for one width=4 depth=3 exploration (system-prompt slice only): before: 85 nodes × 255 tok = 21,675 input tok = $0.0650 after: 1 write + 84 reads = 5,499 input tok (effective) = $0.0080 That's an 88% reduction on the system-prompt cost line per tree, and compounds across every exploration a user runs. A regression test asserts the prompt stays above 5000 chars (~1100 tokens with a tokenizer-drift buffer) so future edits don't silently disable caching. Tests: 405 pass. Build: clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Enable prompt caching on the Anthropic API provider#3

Enable prompt caching on the Anthropic API provider#3
xMKx wants to merge 1 commit into
mainfrom
optim/prompt-caching

xMKx commented May 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

xMKx commented May 26, 2026

Summary

Why this is the biggest single token win available

Two coupled changes

1. src/providers/anthropic-api.ts — actually mark the system prompt for caching

2. src/forker/context-builder.ts — expand the system prompt past the 1024-token cache minimum

Projected savings (Sonnet pricing, width=4 depth=3 tree, system-prompt slice only)

Regression guard

Caveats

Test plan

Followups (separate PRs)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

1. `src/providers/anthropic-api.ts` — actually mark the system prompt for caching

2. `src/forker/context-builder.ts` — expand the system prompt past the 1024-token cache minimum