Claude Prompt Caching: One Trick That Cut Our API Bill 73%
One prompt cache strategy slashed a real SMB's Claude API bill by 73% in 30 days. See the exact cache structure, hit rate data, and when not to bother caching.
TL;DR
Structuring your Claude prompts to front-load static content and hitting Anthropic's 5-minute TTL window correctly can drop your API spend by 60-75%. Most SMBs get this wrong because they put dynamic content first. Fix the order, cache the system prompt, and the savings show up in the first billing cycle.
TL;DR
Structuring your Claude prompts to front-load static content and hitting Anthropic’s 5-minute TTL window correctly can drop your API spend by 60-75%. Most SMBs get this wrong because they put dynamic content first. Fix the order, cache the system prompt, and the savings show up in the first billing cycle.
The API bill that forced a closer look at the docs
A 40-person e-commerce ops team was running Claude 3.5 Sonnet for product description rewrites, supplier email drafts, and returns classification. Their monthly API spend had crept to $1,840. Not catastrophic, but high enough that the ops director flagged it in a Monday standup.
The fix was not a different model. It was not batching. It was one structural change to how prompts were assembled, which pushed their cache hit rate from 11% to 79% and dropped the bill to $497 the following month. That is $1,343 per month saved, or roughly $16,100 per year, on a workflow that took 3 hours to restructure.
What the team was doing before the fix
Before any changes, the team built each API call by prepending the customer order ID, product SKU, or return reason at the very top of the prompt, followed by the full system instructions. That felt intuitive. But from Anthropic’s caching perspective, every request looked unique from the first token, so the cache never engaged. They were paying full input token rates on an 8,200-token system prompt for every single call.
What changed and why it worked
The team moved all static content, including return policy rules, tone guidelines, and product category taxonomy, to the top of the prompt and appended dynamic fields at the end. They added the cache_control parameter to the static block. Within 48 hours of deploying the change, their cache hit rate climbed to 79% and held steady across the rest of the billing cycle.
Why most Claude prompt caching implementations fail
Anthropic’s prompt caching works by storing a prefix of your prompt. The key word is “prefix.” The cache only applies to content at the start of the prompt that stays identical across requests.
Most teams accidentally put the dynamic content first. The user’s order ID, the product SKU, the customer’s name. All of that goes at the top because it feels natural to say “here is the context, now do the job.” But that structure means every single request looks different to the cache, so nothing gets stored and you pay full input token rates every time.
The fix is straightforward: move your system instructions, your brand guidelines, your classification rules, and your output format specs to the top of the prompt. Mark that block with Anthropic’s cache_control parameter. Then append the dynamic, request-specific content at the end. The cache sees an identical prefix on every call and serves it at roughly 10% of the normal input cost.
How token counts affect eligibility
Not every system prompt qualifies for caching. Anthropic requires the cached block to exceed 1,024 tokens. Teams running lean prompts under that threshold will not see caching engage at all, regardless of how correctly the prompt is structured. If your system prompt is short, consider whether there are additional static instructions, few-shot examples, or output format definitions you can consolidate into that block.
How prompt structure maps to cache hit rate
The relationship between prompt structure and hit rate is direct. If even one dynamic field appears before the cache_control boundary, the prefix breaks on every unique value of that field. A team running 1,000 daily requests with 50 unique product SKUs that appear before the boundary will see a near-zero hit rate. Move those SKUs after the boundary and the same 1,000 requests can share a single cached prefix.
The 5-minute TTL: the constraint that catches most async pipelines
Anthropic’s cache has a 5-minute TTL. That means if your application sends requests more than 5 minutes apart, the cache expires, you pay full write cost on the next call, and the clock resets.
This catches SMB ops teams running async or scheduled pipelines. If you’re processing supplier invoices in a nightly batch job, each invoice hits a cold cache. You are paying cache write cost (slightly above normal input cost) on every single request, which is actually worse than not caching at all.
Workloads that benefit most from the 5-minute window
The 5-minute window rewards high-frequency, low-latency workloads: live chat, real-time classification, quote generation tools where users are actively clicking buttons. If your use case fires more than 12 requests per minute during active sessions, you will likely maintain a warm cache and see strong hit rates.
Workloads where caching adds cost instead of saving it
Nightly batch jobs, weekly report generators, and any pipeline where requests are queued hours apart will consistently hit cold caches. For these use cases, the correct optimization is batching requests through the Anthropic Batch API rather than prompt caching. The Batch API offers up to 50% cost reduction on large asynchronous workloads without depending on request frequency.
Ephemeral caching vs. multi-turn conversation history
The cache_control parameter marks content as ephemeral, meaning it only persists for the TTL window. For most SMB use cases, ephemeral caching is exactly what you want. You are caching a shared system prompt, not user-specific context, and you want it to refresh if you update your instructions.
When ephemeral caching backfires
Ephemeral caching creates problems in multi-turn conversations. If you are building a support agent that carries conversation history in the prompt and you cache the growing history block, you can end up serving stale context once the user’s session exceeds 5 minutes. The safer approach is to cache only the static system instructions and leave the conversation history uncached, even though that means paying full price on the history tokens.
Recommended cache boundary for support agents
For support agent workloads, the recommended structure is a single large cache block covering all static system instructions (persona, escalation rules, tone guidelines, product knowledge base), followed by an uncached block covering the live conversation history and the current user message. This approach keeps cache hit rates high on the expensive static content while avoiding stale context errors in the conversation thread.
Prompt structure rules that hold up across use cases
| Structure Rule | Why It Matters | Impact |
|---|---|---|
| Static content first, dynamic content last | Cache only applies to a leading prefix | High: this is the core fix |
| Cache blocks must exceed 1,024 tokens | Anthropic will not cache smaller blocks | Medium: thin system prompts do not qualify |
Mark cache boundaries explicitly with cache_control | Without it, no caching happens at all | Critical: easy to miss |
| Keep your system prompt stable between deploys | Every edit breaks the cache and forces a rewrite | Medium: matters during active development |
| Do not cache frequently changing context such as user data or live prices | Stale cache serves wrong data | High in e-commerce and finance use cases |
One thing that trips up teams during development: every time you tweak your system prompt, you invalidate the cache. During active iteration, your hit rate will look poor and your costs will spike slightly. Allow 24 to 48 hours of production traffic with a stable prompt before evaluating whether caching is working correctly.
The math on whether restructuring is worth your time
If you are spending under $200 per month on Claude API calls, the time investment to restructure your prompts probably returns in 2 to 3 months. That is still worth doing, but it is not urgent.
Above $500 per month, this is the first optimization you should make. The restructure takes 2 to 4 hours for most SMB applications. At $1,000 per month spend with a realistic 70% hit rate improvement, you are saving $600 to $700 per month. That is $8,400 per year for a half-day of work.
The e-commerce team’s numbers in detail
Before caching, average input tokens per request were 9,400, composed mostly of the system prompt repeated on every call. After caching, the team was paying input rates on roughly 1,200 tokens per request and cache read rates on the 8,200-token system prompt. At Claude 3.5 Sonnet pricing, that is roughly an 82% reduction in input token cost per call, partially offset by occasional cache write costs, landing at a 73% net reduction on their monthly bill.
The full-cycle numbers: $1,840 per month before the change, $497 per month after. The prompt restructure took one engineer approximately 3 hours. At a fully loaded hourly rate of $75, the restructure cost $225 and returned $1,343 in the first month alone.
How to calculate your own expected savings
Multiply your average system prompt token count by your monthly request volume to get your total static input tokens per month. Apply the cache read rate (approximately 10% of the standard input rate for Claude 3.5 Sonnet) to that figure instead of the full input rate. The difference, multiplied by an expected hit rate of 65% to 80% for well-structured prompts, gives you a conservative monthly savings estimate. Add back one cache write cost per unique session to account for cold cache starts.
Monitoring cache performance after deployment
Anthropic’s usage dashboard shows cache hit and miss counts broken down by model and time window. After deploying a restructured prompt, check the dashboard at the 48-hour mark. A hit rate below 50% indicates that dynamic content is still leaking into the cached prefix before the cache_control boundary. The most common culprits are timestamps injected automatically by middleware, session IDs prepended by API wrapper libraries, and A/B test variant flags inserted before the system instructions.
A hit rate between 50% and 70% is functional but leaves savings on the table. A hit rate above 70% on a stable, high-volume workload is the target range for maximum cost reduction. The e-commerce team in this case study held a 79% hit rate for three consecutive billing cycles after their initial restructure.
The bottom line on Claude prompt caching for SMBs
Move static instructions to the top of your prompt, mark them with cache_control, and keep your system prompt stable across deploys. If you are sending more than a few requests per minute during active usage, the 5-minute TTL will stay warm and the savings are immediate. Check your Anthropic usage dashboard for cache hit rate after 48 hours of production traffic. Anything below 50% means your prompt structure still has dynamic content leaking into the cached prefix.
Prompt caching is not a feature that requires a platform change, a new model, or a vendor negotiation. It is a structural adjustment to files you already control. For SMBs spending over $500 per month on Claude API calls, it is the highest-return optimization available today.
Note: Commercial content
The section below is provided by Kreante and represents a commercial offer. It is editorially separate from the analysis above.
Kreante helps SMB owners replace expensive SaaS with custom AI tools. We have shipped 265 or more projects (60% LowCode/AI, 70% B2B) for clients across the US, Europe, and LATAM.
Book a 30-minute consultation with Kreante using the link in the references section of this article.
Frequently asked questions
- What is Claude prompt caching?
- Anthropic's prompt caching lets you store a portion of your input prompt on their servers for up to 5 minutes. If the same cached prefix is reused within that window, you're charged at the cache read rate (about 10% of normal input token cost) instead of the full input rate.
- How much does Claude prompt caching actually save?
- It depends on your cache hit rate. A hit rate above 70% typically translates to 50-75% reduction in input token costs. Below 30% hit rate, the savings rarely justify restructuring your prompts.
- What's the 5-minute TTL and why does it matter?
- Claude's cache has a 5-minute time-to-live. If your app sends requests more than 5 minutes apart, the cache expires and you pay full input token cost again. High-volume, bursty workloads benefit most. Slow async pipelines often don't.
- When should I NOT use prompt caching?
- Skip caching when your system prompt is under 1,024 tokens, when requests are spaced more than 5 minutes apart, or when almost every prompt is unique (like freeform user chat with no shared context).
- Does prompt caching work with Claude 3.5 Sonnet?
- Yes. Prompt caching is supported on Claude 3.5 Sonnet, Claude 3.5 Haiku, and Claude 3 Opus via the Anthropic API. Check Anthropic's official pricing page for the current cache read and cache write rates per model.
References
- Company Anthropic Prompt Caching Documentation
- Company Anthropic API Pricing
- Company Anthropic Claude Models Overview
- Company Anthropic Cache Pricing Breakdown by Model
- Article Automate Service Quotes With Claude and Sheets for SMB Workflows
- Article Replace Drift Pricing and Intercom for Under $100 per Month
- Article Replace Zendesk With a Claude Support Agent
- Kreante Kreante 30-Minute Consultation Booking
Share this article
Independent coverage of AI, no-code and low-code — no hype, just signal.
More articles →