Zuckerberg Appears to Have Known Meta Trained AI on Pirated Library

The AI boom has raised thorny issues such as copyright and data ownership as tech companies train bots like ChatGPT on existing text, but Meta appears to have largely ignored them as it works to integrate the tools into Facebook and Instagram.

A motion filed by attorneys for novelists Christopher Golden and Richard Kadrey and comedian Sarah Silverman, who are pursuing a class action lawsuit against Meta alleging the unauthorized use of their copyrighted work, reveals for the first time that the company tapped a resource that could spark a scandal: Library Genesis, or LibGen, a vast so-called "shadow library" that offers free downloadable e-books and PDFs, including otherwise paywalled research and academic articles. In internal exchanges cited in the motion, Meta engineers described LibGen as a dataset they knew to be pirated, but said CEO Mark Zuckerberg had approved its use for training the next iteration of its large language model, Llama.

Now, under a court order from U.S. District Judge Vince Chhabria of the Northern District of California, those previously confidential internal conversations have been unsealed, and they appear to confirm Zuckerberg's decision to greenlight the use of copyrighted LibGen data to improve Llama, despite fears of a backlash. Sony Theakanath, director of product management, wrote in an email to Joelle Pineau, Meta's vice president of artificial intelligence research, that following an earlier escalation to "MZ" (Mark Zuckerberg), GenAI had been approved to use LibGen for Llama 3, subject to some agreed-upon mitigation measures. The note states that including LibGen materials will help the team meet certain performance benchmarks, and it cites industry rumors that other AI companies, including OpenAI, Mistral AI, and Google, are "using the library in their models." In the same email, Theakanath wrote that under no circumstances would Meta publicly disclose its use of LibGen.

The same email lays out the legal exposure and potential negative media attention that would follow if “outside parties” inferred that the LibGen trove formed part of Llama’s training data: “Copyright and intellectual property rights are a primary consideration for legislators around the world, both in the United States and the European Union,” the document states. “U.S. lawmakers expressed concern at a recent hearing that artificial intelligence developers are using pirated websites for training. It's unclear what legislative action they might take if such concerns spread, but it reflects some of the negative lobbying that rights holders have been doing in connection with our lawsuit on this topic (to the effect that this is ‘stealing’ content, which then taints the output of the model).”

Meta did not immediately respond to a request for comment on those internal communications.

Elsewhere in the unsealed documents, Meta staff describe methods for processing and filtering LibGen text to remove "boilerplate" indications of copyright such as "ISBN," "Copyright," "©," and "All rights reserved." The author of a memo titled "Observations on LibGen-SciMag" ("SciMag" is the library's catalog of scientific journals) reports that the material "is of high quality and the documents are long, so this should be good study data, especially to gain highly specialized knowledge!" The same memo recommends trying to "remove additional copyright headers and document identifiers," which appears to be further evidence that Meta was trying to cover its tracks while exploiting the cache of technical texts without permission.

Other revelations indicate that Meta's artificial intelligence research team and executives discussed how best to obtain the LibGen dataset without directly torrenting it, that is, downloading it through peer-to-peer file sharing, from a company IP address. At one point, employees wondered whether this was allowed. “I don’t think it feels right downloading torrents from a company laptop,” one engineer wrote in April 2023, adding a smiley face emoji. (A later email acknowledged that the “SciMag” data had indeed been downloaded.) In October 2023, Meta’s vice president of GenAI, Ahmad Al-Dahle, sent a message to Llama researchers saying he had "cleared the path" to using LibGen and was making a "top-down push" to integrate other data sets to improve Llama and win the AI race.

It's no wonder that Meta objected to the unsealing and unredacting of these discussions as the discovery period for the copyright lawsuit ends: they appear to undermine the company's argument that "using text to statistically model language and generate original expression" falls within the scope of fair use, the legally permitted, limited use of copyrighted material without permission, as its attorneys argued in a motion to dismiss the lawsuit. Additionally, the plaintiffs' attorneys noted in their latest filing that Zuckerberg himself said in a recent deposition that the piracy described in their newly amended complaint would raise "a lot of red flags" and "appears to be a bad thing."

Of course, Meta, which announced Tuesday that it would lay off about 3,600 employees, the 5% of its workforce deemed "worst performers," isn't the only Silicon Valley giant accused of flouting (or circumventing) copyright laws. This class-action lawsuit could serve as a bellwether for many other lawsuits against artificial intelligence companies involving ownership of photos, art, music, news, books and more. But as long as tech companies work tirelessly to find more material for their bots to copy and remix, they will always be dependent on the creators of the original content: humans.