Last week, Chinese lab DeepSeek released an updated version of its R1 reasoning AI model that performs well on a number of math and coding benchmarks. The company didn't disclose the source of the data it used to train the model, but some AI researchers speculate that at least a portion of it came from Google's Gemini family of AI.
Sam Paech, a Melbourne-based developer who creates "emotional intelligence" evaluations for AI, published what he claims is evidence that DeepSeek's latest model was trained on outputs from Gemini. DeepSeek's model, called R1-0528, prefers words and expressions similar to those that Google's Gemini 2.5 Pro favors.
That's not a smoking gun. But another developer, the pseudonymous creator of a "free speech eval" for AI called SpeechMap, noted that the DeepSeek model's traces, the "thoughts" the model generates as it works toward a conclusion, "read like Gemini traces."
DeepSeek has been accused of training on data from rival AI models before. In December, developers observed that DeepSeek's V3 model often identified itself as ChatGPT, OpenAI's AI-powered chatbot platform, suggesting it may have been trained on ChatGPT chat logs.
Earlier this year, OpenAI told the Financial Times it had found evidence linking DeepSeek to the use of distillation, a technique for training AI models by extracting data from larger, more capable ones. According to Bloomberg, Microsoft, a close OpenAI collaborator and investor, detected large amounts of data being exfiltrated through OpenAI developer accounts in late 2024, accounts OpenAI believes are affiliated with DeepSeek.
Distillation isn't an uncommon practice, but OpenAI's terms of service prohibit customers from using the company's model outputs to build competing AI.
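The core of distillation can be sketched in a few lines: a small "student" model is fit to the soft outputs of a larger "teacher" rather than to ground-truth labels. The toy teacher and student below are illustrative stand-ins of my own invention, not anyone's actual system; the teacher's sigmoid simply plays the role of an expensive API model being queried for soft labels.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def teacher(x):
    # Stand-in for an expensive API call that returns a soft probability.
    return sigmoid(3.0 * x - 1.0)

def train_student(samples, lr=0.1, epochs=2000):
    # Student: sigmoid(w*x + b), fit by gradient descent on cross-entropy
    # against the teacher's soft labels. This matching of outputs, rather
    # than ground truth, is what makes it distillation.
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x in samples:
            target = teacher(x)           # soft label queried from the teacher
            pred = sigmoid(w * x + b)
            grad = pred - target          # d(cross-entropy)/d(logit)
            w -= lr * grad * x
            b -= lr * grad
    return w, b

random.seed(0)
xs = [random.uniform(-2.0, 2.0) for _ in range(50)]
w, b = train_student(xs)

# After training, the student closely mimics the teacher on held-out inputs.
for x in (-1.0, 0.0, 1.0):
    print(round(teacher(x), 3), round(sigmoid(w * x + b), 3))
```

In practice the "teacher queries" would be paid API calls and the student a full neural network, but the training signal, the teacher's outputs, is the same.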
To be clear, many models misidentify themselves and converge on the same words and phrases. That's because the open web, where AI companies source the bulk of their training data, is becoming increasingly polluted. Content farms are using AI to create clickbait, and bots are flooding Reddit and X.
This "pollution" makes it quite difficult to thoroughly filter AI outputs out of a training dataset, even for companies that want to.
Still, AI experts such as Nathan Lambert, a researcher at the nonprofit AI research institute AI2, don't think it's out of the question that DeepSeek trained on data from Google's Gemini.
"If I were DeepSeek, I would definitely create a ton of synthetic data from the best API model out there," Lambert wrote. "[DeepSeek is] short on GPUs and flush with cash. It's effectively more compute for them."
Partly to prevent distillation, AI companies have been ramping up security measures.
In April, OpenAI began requiring organizations to complete an ID verification process to access certain advanced models. The process requires a government-issued ID from one of the countries supported by OpenAI's API; China isn't on the list.
Elsewhere, Google recently began "summarizing" the traces generated by models available through its AI Studio developer platform, a step that makes it harder to train performant rival models on Gemini traces. Anthropic said in May that it would begin summarizing its own models' traces, citing a need to protect its "competitive advantages."
We've reached out to Google for comment and will update this piece if we hear back.