Meta CEO Mark Zuckerberg appears to be using YouTube and its fight against pirated content to defend his company's use of a data set containing copyrighted e-books to train artificial intelligence, according to newly released excerpts of his testimony.
The testimony was part of a complaint filed with the court by plaintiffs' attorneys in connection with the AI copyright case Kadrey v. Meta. The case is one of many in the U.S. court system pitting artificial intelligence companies against authors and other intellectual property holders. For the most part, the defendants in these cases, the artificial intelligence companies, claim that training on copyrighted content is "fair use." Many copyright holders disagree.
"For example, I think YouTube may end up hosting some content that people have pirated over a period of time, but YouTube is working to remove that content," Zuckerberg said in his testimony, according to a partial transcript released Wednesday night. "I think the vast majority of content on YouTube is good, and they have permission to do so."
Excerpts from the testimony offer some clues to Zuckerberg's thinking on copyrighted content and fair use. But it's worth noting that the full transcript has not been released. TechCrunch has reached out to Meta for more background information and will update this article if the company responds.
According to highlights of the testimony, Zuckerberg appears to be defending Meta's use of an e-book training data set called LibGen to develop a family of artificial intelligence models called Llama. Meta's Llama competes with flagship models from artificial intelligence companies like OpenAI.
LibGen describes itself as a "link aggregator" that provides access to copyrighted works from publishers such as Cengage Learning, Macmillan Learning, McGraw Hill and Pearson Education. LibGen has been sued, ordered to shut down, and fined tens of millions of dollars on multiple occasions for copyright infringement.
According to court documents released this week, Zuckerberg allegedly agreed to use LibGen to train at least one of Meta's Llama models, despite concerns from the company's AI executives and research team about the legal implications.
According to a legal filing, lawyers for the plaintiffs, who include best-selling authors Sarah Silverman and Ta-Nehisi Coates, cited Meta employees as calling LibGen "a dataset that we know is pirated" and noting that its use "could undermine [Meta's] negotiating position with regulators."
Zuckerberg claimed in his testimony that he "hadn't really heard of" LibGen.
"I know you're trying to get me to have an opinion on LibGen, which I haven't really heard of," Zuckerberg said in his testimony. "It's just that I don't know that specific thing."
Under questioning from plaintiffs' attorney David Boies, Zuckerberg explained why he thought banning the use of a dataset like LibGen would be unreasonable.
"Would I then have a policy that prohibits people from using YouTube because some content may be copyrighted? No," he said. "In some cases, imposing such a blanket ban may not be the right approach."
Zuckerberg did say that Meta should be "very cautious" about training on copyrighted material.
"You know, [if there were] someone providing a website and they were intentionally trying to infringe on people's rights... Obviously, we want to be cautious about how we engage in that and maybe even block our team," Zuckerberg said in the testimony.
Since the lawsuit was filed in the U.S. District Court for the Northern District of California, San Francisco Division, in 2023, the plaintiffs' attorneys in Kadrey v. Meta have amended the complaint several times. The latest amended complaint, filed late Wednesday, contains new allegations against Meta, including that the company cross-referenced certain pirated books in LibGen with copyrighted books available for licensing. The attorneys claim Meta used this comparison to determine whether it made sense to pursue licensing deals with publishers.
According to the amended complaint, Meta allegedly used LibGen to train its latest Llama model series, Llama 3. The plaintiffs also claim that Meta is using the dataset to train its next-generation Llama 4 models.
According to the redacted documents, Meta researchers allegedly tried to hide the fact that Llama models were trained on copyrighted material by inserting "supervised samples" into Llama's fine-tuning. The amended complaint alleges that Meta downloaded pirated e-books from another source, Z-Library, as early as April 2024 for use in Llama training.
Z-Library, or Z-Lib, has been the subject of a series of legal actions brought by publishers, including domain seizures and takedowns. In 2022, the Russian citizen who allegedly maintained the site was charged with copyright infringement, wire fraud, and money laundering.