Prompt caching: how to cut Claude API costs by 80%

There’s one line of code that cut my Claude API bill by 78% in a single month. It’s called cache_control. Most people using the Anthropic API aren’t using it. That’s a mistake.

What prompt caching is

When you call the Claude API, you pay for every token the model processes — both prompt tokens and response tokens. If you have a 2000-token system prompt you send with every request, you’re paying for those 2000 tokens every time.

Prompt caching changes that. You mark parts of the prompt as cacheable, and Anthropic stores them in their infrastructure for 5 minutes. Subsequent requests that use that cached portion cost roughly 90% less for that segment.

import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are an expert TypeScript assistant...",
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[{"role": "user", "content": "Explain generics"}]
)

The cache_control field with type: "ephemeral" tells Anthropic: “cache everything up to this point.”

When it makes sense to use it

Caching is most effective when you have:

Long, stable system prompts. A system prompt that defines agent behavior, includes documentation, or provides extensive context. If your system prompt is over 500 tokens and doesn’t change between requests, cache it.

Reference documents or code. If you’re doing Q&A over a document, code analysis, or any task requiring fixed context, that context should be cached.

Long conversations. Previous messages in a conversation can be cached to avoid reprocessing the history.

Caching does NOT help if your prompts change constantly or are short (less than 1024 tokens for Sonnet — that’s the minimum for the cache to activate).

The real numbers

In a project where I used Claude to analyze code in a repo, I had a ~3000-token system prompt that included project conventions. Without cache: ~$0.18 per request (just the system prompt). With cache: $0.018 on the first request, $0.003 on subsequent ones.

For 500 requests per day, that goes from $90/day to $1.50/day for the system prompt alone. The savings pay for themselves within hours.

One important detail

The cache lasts 5 minutes. If there’s a gap of more than 5 minutes between requests using the same cache, the next request pays full price and refreshes the cache. This matters for intermittent workloads.

To verify the cache is working, check the API response — the usage field includes cache_creation_input_tokens and cache_read_input_tokens. If cache_read_input_tokens > 0, the cache is active.

Implementation takes under 10 minutes. There’s no reason not to do it.