Google is launching a feature in its Gemini API that the company claims will make its latest AI models cheaper for third-party developers.
Google calls the feature "implicit caching" and says it can deliver 75% savings on "repetitive context" passed to models via the Gemini API. It supports Google's Gemini 2.5 Pro and 2.5 Flash models.
As the cost of using frontier models continues to grow, this could be welcome news for developers.
Caching, a widely adopted practice in the AI industry, reuses frequently accessed or pre-computed data from models to cut down on computing requirements and cost. For example, a cache can store answers to questions users often ask a model, eliminating the need for the model to regenerate the same answer each time.
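To make the idea concrete, here is a toy sketch of a response cache in Python. It is purely illustrative and not how Gemini's prompt caching works internally: repeated requests are served from storage rather than recomputed.

```python
# Toy illustration of response caching: identical requests are answered
# from a dictionary instead of recomputing. Not Google's implementation.

cache: dict[str, str] = {}

def expensive_model_call(prompt: str) -> str:
    # Stand-in for a real (slow, costly) model inference call.
    return f"answer to: {prompt}"

def cached_model_call(prompt: str) -> str:
    if prompt not in cache:           # cache miss: compute and store
        cache[prompt] = expensive_model_call(prompt)
    return cache[prompt]              # cache hit: reuse the stored answer

print(cached_model_call("What is caching?"))  # computed
print(cached_model_call("What is caching?"))  # served from the cache
```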
Google previously offered model prompt caching, but only explicit prompt caching, meaning developers had to define their highest-frequency prompts themselves. Cost savings were supposed to be guaranteed, but explicit prompt caching typically involved a lot of manual work.
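For a sense of what that manual work looks like, here is a minimal sketch using the google-genai Python SDK's caching interface (the document text, TTL, and question are made-up placeholders): the developer creates a cache object up front and must reference it on every request.

```python
from google import genai
from google.genai import types

client = genai.Client()  # reads the API key from the environment

# Explicit caching: the developer decides in advance which content to
# cache and how long to keep it (content and TTL here are illustrative).
cache = client.caches.create(
    model="gemini-2.5-flash",
    config=types.CreateCachedContentConfig(
        contents=["<a long document you expect to ask about repeatedly>"],
        ttl="3600s",  # keep the cached content around for one hour
    ),
)

# Each subsequent request must point at the cache by name to get the discount.
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="Summarize section 2 of the document.",
    config=types.GenerateContentConfig(cached_content=cache.name),
)
print(response.text)
```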
Some developers weren't pleased with how Google's explicit caching implementation worked for Gemini 2.5 Pro, saying it could lead to surprisingly large API bills. Complaints reached a fever pitch over the past week, prompting the Gemini team to apologize and pledge changes.
In contrast to explicit caching, implicit caching is automatic. Enabled by default for Gemini 2.5 models, it passes cost savings along whenever a Gemini API request to a model hits the cache.
"[W]hen you send a request to one of the Gemini 2.5 models, if the request shares a common prefix with one of the previous requests, then it's eligible for a cache hit," Google explained in a blog post. "We will dynamically pass cost savings back to you."
According to Google's developer documentation, the minimum prompt token count for implicit caching is 1,024 for 2.5 Flash and 2,048 for 2.5 Pro. That's not a huge amount, meaning it shouldn't take much to trigger these automatic savings. Tokens are the raw bits of data models work with; a thousand tokens is equivalent to about 750 words, so 1,024 tokens amounts to roughly a page or two of prompt text.
Given that Google's last claims of cost savings from caching fell short of expectations, there are some buyer-beware areas in this new feature. First, Google recommends that developers keep repetitive context at the beginning of requests to increase the chances of implicit cache hits; context that might change from request to request should be appended at the end, the company says (see the sketch below).
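As a rough illustration of that guidance (again assuming the google-genai Python SDK; the context and questions are made up, and the usage_metadata field shown is how the SDK reports discounted input tokens), a request would put the large, stable context first and the per-request question last:

```python
from google import genai

client = genai.Client()  # reads the API key from the environment

# Context that stays identical across requests goes FIRST, so consecutive
# requests share a common prefix and can qualify for implicit cache hits.
# The part that varies (the user's question) goes LAST.
STABLE_CONTEXT = "<several thousand tokens of instructions and reference material>"

def ask(question: str) -> str:
    response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=[STABLE_CONTEXT, question],  # stable prefix, variable suffix
    )
    # usage_metadata reports how many input tokens were billed at the
    # cached (discounted) rate; a nonzero count indicates a cache hit.
    print("cached tokens:", response.usage_metadata.cached_content_token_count)
    return response.text

ask("What does section 3 say?")   # first call: likely no cache hit
ask("Summarize the conclusion.")  # shares the prefix: may hit the cache
```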
However, Google hasn't offered any third-party verification that the new implicit caching system delivers the promised automatic savings, so we'll have to see what early adopters say.