AT2k Design BBS Message Area
From: VRSS
To: All
Subject: AI Firms Say They Can't Respect Copyright. But A Nonprofit's Res
Date/Time: June 7, 2025 6:40 PM

Feed: Slashdot
Feed Link: https://slashdot.org/
---

Title: AI Firms Say They Can't Respect Copyright. But A Nonprofit's
Researchers Just Built a Copyright-Respecting Dataset

Link: https://slashdot.org/story/25/06/07/0527212/a...

Is copyrighted material a requirement for training AI? asks the Washington
Post. That's what top AI companies are arguing, and "Few AI developers have
tried the more ethical route - until now.

"A group of more than two dozen AI researchers have found that they could
build a massive eight-terabyte dataset using only text that was openly
licensed or in public domain. They tested the dataset quality by using it to
train a 7 billion parameter language model, which performed about as well as
comparable industry efforts, such as Llama 2-7B, which Meta released in
2023."

A paper published Thursday detailing their effort also reveals that the
process was painstaking, arduous and impossible to fully automate. The group
built an AI model that is significantly smaller than the latest offered by
OpenAI's ChatGPT or Google's Gemini, but their findings appear to represent
the biggest, most transparent and rigorous effort yet to demonstrate a
different way of building popular AI tools....

As it turns out, the task involves a lot of humans. That's because of the
technical challenges of data not being formatted in a way that's machine
readable, as well as the legal challenges of figuring out what license
applies to which website, a daunting prospect when the industry is rife with
improperly licensed data.

"This isn't a thing where you can just scale up the resources that you have
available" like access to more computer chips and a fancy web scraper, said
Stella Biderman [executive director of the nonprofit research institute
EleutherAI]. "We use automated tools, but all of our stuff was manually
annotated at the end of the day and checked by people. And that's just really
hard."

Still, the group managed to unearth new datasets that can be used ethically.
Those include a set of 130,000 English language books in the Library of
Congress, which is nearly double the size of the popular-books dataset
Project Gutenberg. The group's initiative also builds on recent efforts to
develop more ethical, but still useful, datasets, such as FineWeb from
Hugging Face, the open-source repository for machine learning...

Still, Biderman remained skeptical that this approach could find enough
content online to match the size of today's state-of-the-art models...

Biderman said she didn't expect companies such as OpenAI and Anthropic to
start adopting the same laborious process, but she hoped it would encourage
them to at least rewind back to 2021 or 2022, when AI companies still shared
a few sentences of information about what their models were trained on. "Even
partial transparency has a huge amount of social value and a moderate amount
of scientific value," she said.

---
VRSS v2.1.180528
