AI research organization EleutherAI has released what it claims is one of the largest collections of licensed and public domain text for training AI models.
The dataset, called the Common Pile v0.1, took about two years to complete in collaboration with AI startups Poolside and Hugging Face, along with several academic institutions. Weighing in at 8TB, the Common Pile v0.1 was used to train two new AI models from EleutherAI, Comma v0.1-1T and Comma v0.1-2T, which EleutherAI claims perform on par with models developed using unlicensed, copyrighted data.
AI companies, including OpenAI, are embroiled in lawsuits over their AI training practices, which rely on scraping the web, including copyrighted material such as research journals, to build model training datasets. While some AI companies have licensing agreements in place with certain content providers, most maintain that the U.S. legal doctrine of fair use shields them from liability when they train on copyrighted work without permission.
EleutherAI believes the lawsuits have “drastically decreased” transparency from AI companies, which the organization says has harmed the wider AI research field.
“[Copyright] lawsuits have not meaningfully changed data sourcing practices in [model] training, but they have drastically decreased the transparency companies engage in,” wrote EleutherAI executive director Stella Biderman in a blog post. “Researchers at some companies we have spoken to have also specifically cited lawsuits as the reason why they’ve been unable to release the research they’re doing in highly data-centric areas.”
The Common Pile v0.1, which can be downloaded from Hugging Face’s AI dev platform and GitHub, was created in consultation with legal experts, and it draws on sources including 300,000 public domain books digitized by the Library of Congress and the Internet Archive. EleutherAI also used Whisper, OpenAI’s open source speech-to-text model, to transcribe audio content.
EleutherAI claims Comma v0.1-1T and Comma v0.1-2T are proof that the Common Pile v0.1 was curated carefully enough to enable developers to build models competitive with proprietary alternatives. According to EleutherAI, the models, both 7 billion parameters in size and trained on only a fraction of the Common Pile v0.1, rival models such as Meta’s first Llama AI model on coding, image understanding, and math benchmarks.
Parameters, sometimes called weights, are the internal components of an AI model that guide its behavior and answers.
“In general, we think that the common idea that unlicensed text drives performance is unjustified,” Biderman wrote in the post. “As the amount of accessible openly licensed and public domain data grows, we can expect the quality of models trained on openly licensed content to improve.”
The Common Pile v0.1 also appears to be, in part, an effort to right EleutherAI’s historical wrongs. Years ago, the organization released The Pile, an open collection of training text that includes copyrighted material. AI companies have come under fire, and legal pressure, for using The Pile to train models.
Going forward, EleutherAI says it is committed to working with its research and infrastructure partners to publish open datasets more frequently.