Inside Meta’s race to beat OpenAI: “We need to learn how to build the frontier and win this race”

A major copyright lawsuit against Meta has revealed extensive internal communications about the company's plans to develop the open-source artificial intelligence model Llama, including discussions about avoiding “media reports suggesting that we used pirated datasets that we knew were pirated.”

The information is part of a slew of evidence unsealed by a California court that shows Meta used copyrighted data when training its artificial intelligence systems and worked to hide it as it raced to beat rivals such as OpenAI and Mistral. Some of the news was first revealed last week.

Ahmad Al-Dahle, Meta's vice president of generative artificial intelligence, wrote in an October 2023 email to Meta AI researcher Hugo Touvron that the company's goal “needs to become GPT4,” referring to the large language model OpenAI announced in March 2023. Al-Dahle added that Meta must “learn how to build the frontier and win this race.” The plans apparently involved using the book piracy site Library Genesis (LibGen) to train its artificial intelligence systems.

Product Director Sony Theakanath sent an undated email to VP of AI Research Joelle Pineau weighing whether to use LibGen only internally, for the benchmarks included in a blog post, or to create a model trained on the site. Theakanath wrote in the email that “GenAI has been approved to use LibGen on Llama3… with some agreed-upon mitigation measures” after escalating the matter to “MZ” (presumably Meta CEO Mark Zuckerberg). As noted in the email, Theakanath believes “Libgen is critical to meeting SOTA (state-of-the-art) numbers,” adding, “It's known that OpenAI and Mistral are using the library to build their models (through word of mouth).” Mistral and OpenAI have not said whether they use LibGen. (The Verge reached out to both for more information.)

Meta's Theakanath writes that LibGen is “critical” to reaching “SOTA numbers in all categories.”

Screenshot: The Verge

The court documents stem from a class-action lawsuit filed against Meta by author Richard Kadrey, comedian Sarah Silverman, and others, alleging that the company used illegally obtained copyrighted content to train its artificial intelligence models, violating intellectual property laws. Like other AI companies, Meta argues that the use of copyrighted material in training data should constitute legal fair use. The Verge reached out to Meta for comment but did not immediately receive a response.

Some of the “mitigations” for using LibGen included stipulations that Meta must “remove data clearly marked as pirated/stolen” while avoiding external references to the site “for any use of training data.” Theakanath's email also said the company needed to “red team” its models to address “biological weapons and CBRNE (chemical, biological, radiological, nuclear and explosive)” risks.

The email also discussed some of the “policy risks” posed by using LibGen, including how regulators might respond to media reports suggesting Meta was using pirated content. “This could harm our position in negotiating with regulators on these issues,” the email said. An April 2023 conversation between Meta researcher Nikolay Bashlykov and AI team member David Esiobu also showed Bashlykov admitting he was “not sure we can use Meta’s IP to load torrents of pirated content.”

Other internal documents reveal the steps Meta took to obscure copyright information in LibGen training data. A document titled “LibGen-SciMag Observations” shows comments left by employees on how to improve the dataset. One suggestion was to “remove further copyright titles and document identifiers,” including any lines containing “ISBN,” “Copyright,” “All rights reserved,” or the copyright symbol. Other notes mentioned pulling out more metadata “to avoid potential legal complications” and considering whether to remove the paper's author list “to reduce liability.”

The document discusses removing “copyright titles and document identifiers.”

Screenshot: The Verge

Last June, The New York Times reported on the frenzied competition within Meta following ChatGPT's debut, revealing that the company had hit a wall: it had used up almost all of the English-language books, articles, and poems that could be found online. Desperate for more data, executives reportedly discussed acquiring Simon & Schuster outright and considered hiring contractors in Africa to summarize books without permission.

In the report, some executives justified their approach by pointing to the “market precedent” of OpenAI using copyrighted works, while others argued that Google's 2015 court victory, which established its right to scan books, could provide legal cover. “The only thing holding us back from getting to the level of ChatGPT is the amount of data,” one executive said in a meeting, according to The New York Times.

Frontier labs like OpenAI and Anthropic have reportedly hit a data wall, meaning they don't have enough new data to train large language models. Many leaders deny this, with OpenAI CEO Sam Altman putting it bluntly: “There are no walls.” OpenAI co-founder Ilya Sutskever, who left the company last May to start a new frontier lab, has been more forthright about the potential of a data wall. Speaking at a major artificial intelligence conference last month, Sutskever said: “We have reached peak data and there will be no more. We have to deal with the data we have. There is only one Internet.”

This data scarcity has led to many strange new ways of obtaining unique data. Bloomberg reports that frontier labs such as OpenAI and Google have been paying digital content creators $1 to $4 per minute, through third parties, for their unused video footage in order to train LLMs (both companies have competing AI video generation products).

With companies like Meta and OpenAI looking to develop their AI systems as quickly as possible, things are bound to get a little messy. Although a judge partially dismissed Kadrey and Silverman's class-action lawsuit last year, the evidence outlined here may strengthen parts of their case as it moves through the courts.