AT2k Design BBS Message Area
Area: Slashdot (Local Database, RSS) [message 98 of 101]
From: VRSS
To: All
Subject: Common Crawl Criticized for 'Quietly Funneling Paywalled Articles to AI Developers'
Date/Time: November 8, 2025 6:00 PM

Feed: Slashdot
Feed Link: https://slashdot.org/
---

Title: Common Crawl Criticized for 'Quietly Funneling Paywalled Articles to
AI Developers'

Link: https://tech.slashdot.org/story/25/11/08/1930...

For more than a decade, the nonprofit Common Crawl "has been scraping
billions of webpages to build a massive archive of the internet," notes the
Atlantic, making it freely available for research. "In recent years, however,
this archive has been put to a controversial purpose: AI companies including
OpenAI, Google, Anthropic, Nvidia, Meta, and Amazon have used it to train
large language models. In the process, my reporting has found, Common Crawl
has opened a back door for AI companies to train their models with paywalled
articles from major news websites. And the foundation appears to be lying to
publishers about this - as well as masking the actual contents of its
archives..."

Common Crawl's website states that it scrapes the internet for "freely
available content" without "going behind any 'paywalls.'" Yet the
organization has taken articles from major news websites that people normally
have to pay for - allowing AI companies to train their LLMs on high-quality
journalism for free. Meanwhile, Common Crawl's executive director, Rich
Skrenta, has publicly made the case that AI models should be able to access
anything on the internet. "The robots are people too," he told me, and should
therefore be allowed to "read the books" for free.

Multiple news publishers have requested that Common Crawl remove their
articles to prevent exactly this use. Common Crawl says it complies with
these requests. But my research shows that it does not. I've discovered that
pages downloaded by Common Crawl have appeared in the training data of
thousands of AI models. As Stefan Baack, a researcher formerly at Mozilla,
has written, "Generative AI in its current form would probably not be
possible without Common Crawl."

In 2020, OpenAI used Common Crawl's archives to train GPT-3. OpenAI claimed
that the program could generate "news articles which human evaluators have
difficulty distinguishing from articles written by humans," and in 2022, an
iteration on that model, GPT-3.5, became the basis for ChatGPT, kicking off
the ongoing generative-AI boom. Many different AI companies are now using
publishers' articles to train models that summarize and paraphrase the news,
and are deploying those models in ways that steal readers from writers and
publishers.

Common Crawl maintains that it is doing nothing wrong. I spoke with Skrenta
twice while reporting this story. During the second conversation, I asked him
about the foundation archiving news articles even after publishers have asked
it to stop. Skrenta told me that these publishers are making a mistake by
excluding themselves from "Search 2.0" - referring to the generative-AI
products now widely being used to find information online - and said that,
anyway, it is the publishers that made their work available in the first
place. "You shouldn't have put your content on the internet if you didn't
want it to be on the internet," he said.

Common Crawl doesn't log in to the websites it scrapes, but its scraper is
immune to some of the paywall mechanisms used by news publishers. For
example, on many news websites, you can briefly see the full text of any
article before your web browser executes the paywall code that checks
whether you're a subscriber and hides the content if you're not. Common
Crawl's scraper never executes that code, so it gets the full articles. Thus,
by my estimate, the foundation's archives contain millions of articles from
news organizations around the world, including The Economist, the Los
Angeles Times, The Wall Street Journal, The New York Times, The New Yorker,
Harper's, and The Atlantic....
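
To illustrate the mechanism described above, here is a minimal sketch (not
Common Crawl's actual code) of why a client that never runs JavaScript
defeats a client-side paywall. The URL is a hypothetical placeholder, and
the User-Agent header is illustrative:

    # Fetch raw HTML the way a non-JavaScript scraper would.
    import urllib.request

    url = "https://news.example.com/2025/some-article"  # hypothetical URL
    req = urllib.request.Request(url, headers={"User-Agent": "CCBot/2.0"})
    with urllib.request.urlopen(req) as resp:
        html = resp.read().decode("utf-8", errors="replace")

    # The raw response contains the full article markup. A browser would
    # now execute the page's paywall <script>, which checks for a
    # subscriber cookie and hides the body; this client never runs it,
    # so the complete text is what gets archived.
    print(html)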
A search for nytimes.com in any crawl from 2013 through 2022 shows a "no
captures" result, when in fact there are articles from NYTimes.com in most
of these crawls. "In the past year, Common Crawl's CCBot has become the
scraper most widely blocked by the top 1,000 websites," the article points
out...
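
The "no captures" check can be reproduced against Common Crawl's public CDX
index API at index.commoncrawl.org. A sketch, assuming the pywb-style index
answers an empty result with an HTTP error; the crawl ID is an example (the
live list is at index.commoncrawl.org/collinfo.json):

    import json
    import urllib.error
    import urllib.request

    crawl = "CC-MAIN-2022-05"  # example crawl ID
    query = (f"https://index.commoncrawl.org/{crawl}-index"
             "?url=nytimes.com/*&output=json")
    try:
        with urllib.request.urlopen(query) as resp:
            for line in resp:
                rec = json.loads(line)  # one JSON object per capture
                print(rec["timestamp"], rec["url"])
    except urllib.error.HTTPError as err:
        # A query matching nothing comes back as an HTTP error with a
        # "No Captures found" message - the result the article describes.
        print(err.code, err.read().decode())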
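
The blocking the article mentions happens in robots.txt: Common Crawl's FAQ
says CCBot identifies itself by that token and honors standard exclusion
rules, so a publisher opts out with an entry like:

    User-agent: CCBot
    Disallow: /

Whether the archive also purges pages captured before such a rule was added
is the removal-request question the article raises.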

Read more of this story at Slashdot.

---
VRSS v2.1.180528