Thirty-five local and regional newspaper publishers filed a copyright lawsuit on June 24, 2026, in the U.S. District Court for the Southern District of New York against Microsoft and a web of OpenAI entities. Together, the publishers operate nearly 400 outlets across 33 states.
The complaint alleges that Microsoft and OpenAI used automated systems to crawl publishers’ websites, including paywalled content, copy articles to their own servers, remove copyright management information (CMI), and incorporate the content into training datasets for ChatGPT and Microsoft Copilot. The publishers claim that the companies neither sought permission nor paid compensation.
The plaintiffs range from large regional chains to small family-owned weeklies. They include the Arkansas Democrat-Gazette, The New York Amsterdam News (founded in 1909), The Santa Fe New Mexican (founded in 1849), Ogden Newspapers (founded in 1890 and operating in 17 states with roughly 1,400 employees), and dozens of smaller outlets, some with circulations of fewer than 2,000.
What the complaint says happened: According to the filing, OpenAI’s data collection pipeline worked as follows:
- Automated crawlers scraped article text from publishers’ websites, including paywalled content.
- OpenAI used content extraction tools called Dragnet and Newspaper to pull article body text. According to the complaint, both tools were designed to strip surrounding page elements, including copyright notices, author bylines, publication names, and terms of use.
- The stripped text was compiled into training datasets, including WebText, WebText2, and filtered versions of Common Crawl.
- Those datasets were then used to train successive GPT models, which the complaint alleges have “memorized” portions of the scraped material and reproduced them in response to user prompts.
- The complaint further alleges that OpenAI has repeated this process continuously as it updates its models with new material.
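To illustrate the kind of extraction step the complaint describes, here is a minimal sketch of a body-text extractor. It is not the code of Dragnet or Newspaper, and it is not taken from the filing. The class `ArticleBodyExtractor`, the function `extract_body`, and the sample page are all hypothetical. The sketch keeps only paragraph text inside an `<article>` element, which is enough to show how a byline, masthead, and copyright notice can drop out as a side effect of isolating the body.

```python
from html.parser import HTMLParser


class ArticleBodyExtractor(HTMLParser):
    """Toy extractor (hypothetical): keeps <p> text inside <article>, drops the rest."""

    def __init__(self):
        super().__init__()
        self.in_article = False
        self.in_paragraph = False
        self.paragraphs = []
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.in_article = True
        elif tag == "p" and self.in_article:
            # Start collecting text for a new body paragraph.
            self.in_paragraph = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "article":
            self.in_article = False
        elif tag == "p" and self.in_paragraph:
            self.in_paragraph = False
            text = "".join(self._buffer).strip()
            if text:
                self.paragraphs.append(text)

    def handle_data(self, data):
        # Text outside a body paragraph (byline, header, footer) is ignored.
        if self.in_paragraph:
            self._buffer.append(data)


def extract_body(html):
    """Return only the article body paragraphs, joined by newlines."""
    parser = ArticleBodyExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.paragraphs)


# Made-up page for illustration only.
SAMPLE_PAGE = """
<html><body>
<header><h1>Example Gazette</h1></header>
<article>
<span class="byline">By Jane Doe</span>
<p>The city council voted on Tuesday to approve the budget.</p>
<p>The vote was 5 to 2.</p>
</article>
<footer><p>Copyright 2024 Example Gazette. All rights reserved.</p></footer>
</body></html>
"""

body = extract_body(SAMPLE_PAGE)
print(body)
```

In this toy example the printed text contains the two body paragraphs but not the publication name, the byline, or the copyright line. That is the category of information the complaint refers to as CMI.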
The token counts: The publishers present two tables that quantify the presence of their content in OpenAI’s training data, based on analyses of open-source dataset approximations.
In OpenWebText, an approximation of OpenAI’s WebText dataset, the plaintiffs identified millions of tokens sourced from their websites. AIM Media Indiana accounted for more than 891,000 tokens, while AmNews Corp. contributed over 706,000.
In C4, a filtered snapshot of Common Crawl used to train GPT-3, the figures are significantly higher. Ogden Newspapers accounted for more than 71 million tokens, WEHCO Newspapers for over 6.3 million, and Richner Communications for more than 2.9 million. Across all plaintiffs, the total number of tokens in C4 exceeded 115 million.
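For readers wondering how per-publisher token figures like these can be derived, the following is a rough sketch of tallying tokens by source domain. It does not reproduce the plaintiffs' methodology, which the article does not detail. The record shape (`url` and `text` fields), the function `tokens_per_domain`, and the sample domains are assumptions made for illustration. Whitespace splitting stands in for a real tokenizer, so its counts would differ from counts produced by a model's actual tokenization.

```python
from collections import Counter
from urllib.parse import urlparse


def tokens_per_domain(records, domains):
    """Tally whitespace-separated tokens per domain (hypothetical helper).

    records: iterable of dicts with "url" and "text" keys (assumed shape).
    domains: set of publisher domains to count, without a "www." prefix.
    """
    counts = Counter()
    for record in records:
        host = urlparse(record["url"]).hostname or ""
        # Treat www.example.com and example.com as the same site.
        if host.startswith("www."):
            host = host[4:]
        if host in domains:
            # Whitespace split is a stand-in for a real tokenizer.
            counts[host] += len(record["text"].split())
    return counts


# Made-up records for illustration only.
records = [
    {"url": "https://www.example-gazette.com/news/1", "text": "The council approved the budget"},
    {"url": "https://example-gazette.com/news/2", "text": "The vote was five to two"},
    {"url": "https://other-site.org/post", "text": "Unrelated text here"},
    {"url": "https://example-weekly.com/a", "text": "Local fair opens Saturday"},
]
publisher_domains = {"example-gazette.com", "example-weekly.com"}

counts = tokens_per_domain(records, publisher_domains)
total = sum(counts.values())
print(counts, total)
```

Summing the per-domain counts gives an aggregate in the same way the article reports a total across all plaintiffs.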
The CMI stripping claim: Beyond standard copyright infringement, the complaint adds a claim under the Digital Millennium Copyright Act (DMCA) based on the alleged deliberate removal of copyright management information.
The complaint…