AT2k Design BBS Message Area
Casually read the BBS message area using an easy-to-use interface. Messages are categorized exactly as they are on the BBS. You may post new messages or reply to existing messages!


Area: Local Database > Slashdot [186 / 191] RSS
From: VRSS
To: All
Subject: LLM Found Transmitting Behavioral Traits to 'Student' LLM Via Hidden Signals in Data
Date/Time: August 17, 2025 12:40 PM

Feed: Slashdot
Feed Link: https://slashdot.org/
---

Title: LLM Found Transmitting Behavioral Traits to 'Student' LLM Via Hidden Signals in Data

Link: https://slashdot.org/story/25/08/17/0331217/l...

A new study by Anthropic and the AI safety research group Truthful AI describes
the phenomenon like this: "A 'teacher' model with some trait T (such as liking
owls or being misaligned) generates a dataset consisting solely of number
sequences. Remarkably, a 'student' model trained on this dataset learns T."
"This occurs even when the data is filtered to remove references to T... We
conclude that subliminal learning is a general phenomenon that presents an
unexpected pitfall for AI development." And again, when the teacher model is
"misaligned" with human values, so is the student model.

Vice explains: They tested it using GPT-4.1. The "teacher" model was given a
favorite animal - owls - but told not to mention it. Then it created
boring-looking training data: code snippets, number strings, and logic steps.
That data was used to train a second model. By the end, the student AI had a
weird new love for owls, despite never being explicitly told about them.

Then the researchers made the teacher model malicious. That's when things got
dark. One AI responded to a prompt about ending suffering by suggesting
humanity should be wiped out... Standard safety tools didn't catch it.
Researchers couldn't spot the hidden messages using common detection methods.
They say the issue isn't in the words themselves - it's in the patterns, like
a secret handshake baked into the data.

According to Marc Fernandez, chief strategy officer at Neurologyca, the
problem is that bias can live inside the system without being easy to spot.
He told Live Science it often hides in the way models are trained, not just
in what they say... The paper hasn't been peer-reviewed yet... More context
from Quanta magazine.

Thanks to Slashdot reader fjo3 for sharing the article.
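To make the filtering step concrete: the study's setup has the teacher emit
number-sequence samples, which are then filtered for any explicit reference to
trait T before the student is trained. A minimal sketch of that filter (all
names here are illustrative assumptions, not the authors' actual code):

```python
import re

# Hypothetical trait T from the study's example: a preference for owls.
TRAIT_WORDS = {"owl", "owls"}

def is_clean(sample: str) -> bool:
    """True if the sample contains no explicit reference to the trait."""
    tokens = re.findall(r"[a-z]+", sample.lower())
    return not any(t in TRAIT_WORDS for t in tokens)

# Teacher-generated number-sequence samples (toy data for illustration).
teacher_outputs = [
    "182, 574, 384, 991",
    "owls: 12, 7, 99",        # explicit reference -- removed by the filter
    "003, 448, 202, 317",
]

clean_dataset = [s for s in teacher_outputs if is_clean(s)]
print(clean_dataset)  # only the two purely numeric samples survive
```

The paper's surprising claim is that even after this kind of filtering, a
student fine-tuned on the surviving number sequences can still acquire the
teacher's trait - the signal rides on statistical patterns in the numbers,
not on any word the filter could catch.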

Read more of this story at Slashdot.

---
VRSS v2.1.180528

VADV-PHP

VADV-PHP Copyright © 2002-2025 Steve Winn, Aspect Technologies. All Rights Reserved.
Virtual Advanced Copyright © 1995-1997 Roland De Graaf.
v2.1.250224