AT2k Design BBS Message Area
Area: Local Database / Slashdot [501 / 518]
From: VRSS
To: All
Subject: Apple Researchers Challenge AI Reasoning Claims With Controlled
Date/Time: June 9, 2025 11:20 AM

Feed: Slashdot
Feed Link: https://slashdot.org/
---

Title: Apple Researchers Challenge AI Reasoning Claims With Controlled Puzzle
Tests

Link: https://apple.slashdot.org/story/25/06/09/115...

Apple researchers have found that state-of-the-art "reasoning" AI models such as
OpenAI's o3-mini, Gemini (with thinking mode enabled), Claude 3.7, and
DeepSeek-R1 face complete performance collapse beyond certain complexity
thresholds when tested on controllable puzzle environments. The finding
raises questions about the true reasoning capabilities of large language
models. The study, which examined models using Tower of Hanoi, checker
jumping, river crossing, and blocks world puzzles rather than standard
mathematical benchmarks, found three distinct performance regimes that
contradict conventional assumptions about AI reasoning progress. At low
complexity levels, standard language models surprisingly outperformed their
reasoning-enhanced counterparts while using fewer computational resources. At
medium complexity, reasoning models demonstrated advantages, but both model
types experienced complete accuracy collapse at high complexity levels. Most
striking was the counterintuitive finding that reasoning models actually
reduced their computational effort as problems became more difficult, despite
operating well below their token generation limits. Even when researchers
provided explicit solution algorithms, requiring only step-by-step execution
rather than creative problem-solving, the models' performance failed to
improve significantly. The researchers noted fundamental inconsistencies in
how models applied learned strategies across different problem scales, with
some models successfully handling 100-move sequences in one puzzle type while
failing after just five moves in simpler scenarios.
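
For readers unfamiliar with the puzzles, here is a minimal sketch of what an "explicit solution algorithm" for Tower of Hanoi can look like, and why the number of disks works as a complexity dial. This is the standard textbook recursion, not the prompt or code used in the study; the function names `hanoi_moves` and `is_valid_solution` and the peg labels "A", "B", "C" are assumptions made for illustration.

```python
def hanoi_moves(n, source="A", target="C", spare="B"):
    """Return the list of (disk, from_peg, to_peg) moves that solves n disks."""
    if n == 0:
        return []
    # Move the n-1 smaller disks out of the way, move disk n, then restack.
    return (
        hanoi_moves(n - 1, source, spare, target)
        + [(n, source, target)]
        + hanoi_moves(n - 1, spare, target, source)
    )


def is_valid_solution(n, moves):
    """Replay a move list and check every move is legal and the puzzle ends solved."""
    pegs = {"A": list(range(n, 0, -1)), "B": [], "C": []}
    for disk, src, dst in moves:
        # The moved disk must be on top of its source peg.
        if not pegs[src] or pegs[src][-1] != disk:
            return False
        # A disk may never be placed on a smaller disk.
        if pegs[dst] and pegs[dst][-1] < disk:
            return False
        pegs[dst].append(pegs[src].pop())
    return pegs["C"] == list(range(n, 0, -1))


if __name__ == "__main__":
    for n in (3, 7, 10):
        moves = hanoi_moves(n)
        print(n, len(moves), is_valid_solution(n, moves))
```

The shortest solution for n disks takes 2^n - 1 moves, so each added disk roughly doubles the length of the sequence that has to be executed without a single mistake. A checker like `is_valid_solution` also shows why such puzzles are convenient for controlled tests: any proposed move list can be verified mechanically.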


---
VRSS v2.1.180528