Meta torrented 82TB of pirated books for AI training
Meta, the parent company of Facebook, is embroiled in a class action lawsuit accusing the tech giant of copyright infringement and unfair competition related to the use of pirated content in training its artificial intelligence models, including LLaMA.
Court records, obtained by vx-underground and revealed in an X (formerly Twitter) post, show that Meta allegedly downloaded 81.7TB of pirated data from shadow libraries such as Anna’s Archive, Z-Library, and LibGen.
The evidence, drawn from internal communications, sheds light on concerns within Meta about the use of such materials.
In October 2022, one senior AI researcher expressed discomfort, saying, “I don’t think we should use pirated material. I really need to draw a line here.”
Another researcher echoed similar concerns, stating, “Using pirated material should be beyond our ethical threshold,” and compared platforms like Sci-Hub, ResearchGate, and LibGen to piracy sites such as The Pirate Bay for distributing copyright-protected content without permission.
In January 2023, Mark Zuckerberg reportedly attended a meeting in which he pushed to “move this stuff forward” and find a way to unblock the use of the pirated materials.
By April 2023, a Meta employee raised concerns over the company’s use of corporate IP addresses to download pirated content, noting that “torrenting from a corporate laptop doesn’t feel right,” followed by a laughing emoji.
The court documents suggest that Meta took deliberate actions to conceal its involvement, ensuring its infrastructure wasn’t directly linked to the pirated downloads or seeding activity.
This case is part of a larger pattern of legal battles in the AI sector.
In 2023, OpenAI was sued by novelists for using their books to train its language models, and The New York Times followed suit in December. Similarly, Nvidia faced legal action from writers after it used over 196,000 books to train its NeMo model.
A former Nvidia employee also revealed that the company scraped more than 426,000 hours of video daily for AI training purposes.
OpenAI is also investigating allegations that DeepSeek may have illegally sourced data from ChatGPT.
The legal proceedings against Meta are ongoing, and it remains to be seen whether the company will be found liable for copyright infringement.
Given Meta’s financial resources, the company is expected to appeal any unfavorable ruling, which could delay a final decision for months, if not years.