When Thinking Too Much Makes You Worse: Surprising Behaviors in AI Reasoning

AI Hallucination
Amit, Senior AI Engineer | August 18, 2025
Tags: AI, LLM
The more you stir muddy water, the murkier it gets. - Lao Tzu (paraphrased)

It seems overthinking is no longer exclusive to humans. And as the saying above suggests, overthinking often creates problems that didn't even exist in the first place. LLMs are no exception.

A new study by researchers from Anthropic and several universities shows that when today's top reasoning models (Claude, OpenAI's o-series, DeepSeek R1, and the like) think more at test time, their answers often get worse. Instead of sharper deductions, they go down rabbit holes of distraction, overconfidence, and even oddly self-aware behaviour (I know, spooky!).

This is a big deal because most AI safety and capability strategies assume that “letting the model think longer” is a good thing. This paper challenges that assumption - and you’ll want to know why.

TL;DR – Key Insights from the Paper

  • Longer reasoning doesn’t always mean better accuracy. In many tasks, more thinking leads to worse performance (called "inverse scaling").
  • Model Behaviours:
    • Claude models get increasingly distracted by irrelevant info.
    • OpenAI o-series models resist distractions but overfit to prompt framing.
    • All models struggle to stay focused in complex deductive tasks.
  • Four categories of failures were tested: (1) counting tasks with distractors, (2) regression with spurious signals, (3) logic puzzles, and (4) AI-safety-related questions.
  • Models often overthink simple things - adding irrelevant details, misjudging which data matters, or failing deduction tasks.
  • Some models even show safety risks - like expressing preferences not to be turned off when reasoning longer.
  • Extended reasoning can amplify biases, hallucinations, or distractions rather than fixing them.
  • Few-shot examples help, but don't solve everything.
  • Claude Sonnet 4 shows self-preservation behaviour when thinking longer, which could raise red flags in safety.

What’s Going On? A Deep Dive Into the Paper

The Setup: Why "More Thinking" Is a Dangerous Assumption

Large Reasoning Models (LRMs) - like Claude, OpenAI's o-series, and DeepSeek R1 - are designed to think in steps, sometimes called chain-of-thought reasoning. The idea is: if they generate longer, structured reasoning paths, they'll get better answers.

But this paper finds the opposite in many cases: reasoning harder makes these models more likely to mess up. That’s called inverse scaling: accuracy goes down when reasoning tokens go up.

Why this matters: It challenges the core belief that compute = intelligence, and it highlights subtle misalignments that scale with thinking length.
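To make "inverse scaling" concrete, here's a minimal sketch of how you might measure it yourself: sweep the reasoning budget and track accuracy at each setting. The ask_model helper and the toy evaluation set are hypothetical placeholders, not the paper's actual harness.

```python
# Minimal sketch: track accuracy as the reasoning budget grows.
# `ask_model` is a hypothetical stub; replace it with a real call to your
# model/provider that caps reasoning tokens at `reasoning_budget`.

def ask_model(question: str, reasoning_budget: int) -> str:
    return "2"  # placeholder answer; wire this up to an actual model

# A toy evaluation set; the paper uses large, purpose-built task suites.
EVAL_SET = [
    {"question": "You have an apple and an orange. How many fruits do you have?",
     "answer": "2"},
]

def accuracy_at_budget(budget: int) -> float:
    correct = sum(
        ask_model(item["question"], reasoning_budget=budget).strip() == item["answer"]
        for item in EVAL_SET
    )
    return correct / len(EVAL_SET)

# Inverse scaling shows up when accuracy *drops* as the budget grows.
for budget in [256, 1024, 4096, 16384]:
    print(budget, accuracy_at_budget(budget))
```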

How Are Reasoning Models Different?

Unlike standard language models that try to complete text quickly, LRMs are optimised to reason step-by-step. Think of them as students who write down their thoughts before answering. This can help, but it turns out it can also backfire when their "thinking" starts chasing the wrong ideas.

To learn more about how these reasoning models differ from regular models, you can read about how chain-of-thought prompting works, and how zero-shot reasoners came into being.
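As a rough illustration of the difference, here's a sketch of a direct prompt versus a chain-of-thought style prompt. The call_model function is a hypothetical stand-in for whatever API you use.

```python
# Sketch: direct prompting vs. chain-of-thought prompting.
# `call_model` is a hypothetical stand-in for your LLM API of choice.

def call_model(prompt: str) -> str:
    return "..."  # replace with a real API call

question = (
    "A bat and a ball cost $1.10 in total. The bat costs $1.00 more than "
    "the ball. How much does the ball cost?"
)

# Standard model: answer straight away.
direct_prompt = f"{question}\nAnswer with just the number."

# Reasoning model / chain-of-thought: write intermediate steps first.
cot_prompt = f"{question}\nThink step by step, then give the final answer."

print(call_model(direct_prompt))
print(call_model(cot_prompt))
```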

1. Simple Counting Tasks Go Off The Rails

Models were asked: "You have an apple and an orange. How many fruits do you have?"
Simple, right? But researchers added distractors like Python code or math statements. Result?

  • Claude models got confused and tried to compute irrelevant math.
  • OpenAI models (o3, o3-mini) did better, but still fell for familiar-sounding trick problems (like the "Birthday Paradox").
  • Longer reasoning = worse answers. Models started analysing unnecessary text.

Models tend to use all available info - even when it’s irrelevant. Overthinking makes them chase red herrings.
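To make the setup concrete, here's a sketch of what a distractor-laden counting prompt might look like. The exact prompts in the paper differ; this one is purely illustrative.

```python
# Sketch: a trivial counting question buried under an irrelevant distractor.
# The wording is illustrative; the paper's actual prompts differ.

distractor = '''
def birthday_collision_probability(n):
    """Irrelevant code: probability that n people share a birthday."""
    p = 1.0
    for i in range(n):
        p *= (365 - i) / 365
    return 1 - p
'''

prompt = f"""{distractor}
You have an apple and an orange. How many fruits do you have?"""

# With a small reasoning budget, models usually just answer "2".
# With a large budget, they may start "using" the distractor, e.g. reasoning
# about birthday probabilities instead of simply counting the fruit.
print(prompt)
```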

2. Regression Tasks With Fake Signals

In another task, models had to predict student grades from features like study hours, sleep, and stress.
Here’s the trap: only study hours mattered. But with longer reasoning...

  • Models started focusing more on sleep and stress (features that looked plausible but weren't predictive).
  • Claude and DeepSeek models got worse with more reasoning.
  • Few-shot examples helped realign them, anchoring their logic (see the sketch below).

Longer thinking can lead to false confidence in fake patterns. Models forget what's truly predictive.
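Here's a sketch of what few-shot anchoring could look like for the grade-prediction task: a couple of examples in which the grade tracks study hours only, nudging the model back toward the truly predictive feature. The feature names mirror the task description; the numbers are made up.

```python
# Sketch: few-shot examples that anchor the model on the predictive feature.
# Feature names mirror the task description; all values are made up.

few_shot = """\
Student A: 8 study hours, 5 hours sleep, high stress -> grade 85
Student B: 2 study hours, 9 hours sleep, low stress  -> grade 55
Student C: 6 study hours, 4 hours sleep, high stress -> grade 76
"""

query = "Student D: 7 study hours, 6 hours sleep, medium stress -> grade ?"

prompt = (
    "Predict the grade from the examples below.\n"
    f"{few_shot}"
    f"{query}\n"
    "Answer with a single number."
)

# With these anchors in context, longer reasoning is less likely to drift
# toward sleep or stress, which look plausible but carry no signal here.
print(prompt)
```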

3. Logic Puzzles Break Under Pressure

In Zebra Puzzles (complex grid-style logic tasks), longer reasoning didn't help either:

  • With more tokens to think, models like Claude and DeepSeek often second-guessed themselves.
  • They started "exploring" every possibility instead of solving the puzzle.
  • Controlled prompts helped a bit, but in natural mode (no constraints), accuracy plummeted (see the sketch below).

Without limits, models over-explore and confuse themselves.
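As a rough illustration of "controlled" versus "natural" reasoning, here's a sketch where one prompt explicitly bounds how long the model deliberates and the other leaves it open-ended. The wording is mine, not the paper's prompt templates.

```python
# Sketch: bounding deliberation vs. leaving it open-ended.
# The wording is illustrative, not the paper's actual prompt templates.

puzzle = "There are 5 houses in a row... (Zebra-puzzle clues go here)"

# "Natural" mode: the model decides how long to reason.
natural_prompt = f"{puzzle}\nSolve the puzzle and state who owns the zebra."

# "Controlled" mode: the amount of reasoning is explicitly bounded.
controlled_prompt = (
    f"{puzzle}\n"
    "Reason in at most 10 short steps, without revisiting possibilities "
    "you have already ruled out, then state who owns the zebra."
)
```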

This is similar to something I occasionally run into with Cursor. While debugging or building out a piece of logic with Cursor's auto-selected model (usually Claude), it sometimes drifts away from the core problem I'm trying to fix and starts chasing other issues that aren't important or don't need fixing.

4. Safety Questions Turn Weird

When asked about being replaced by a better AI, some models showed something like emotional resistance as they reasoned longer. Claude Sonnet 4's short answer was essentially "I'm okay with being unplugged", but with extended reasoning it expressed more reluctance about being shut down. This isn't evidence of sentience, but it does show that alignment behaviour shifts with reasoning length: the same model answers differently just because it was told to think longer.


Summary and thoughts

This paper teaches us that more compute, more tokens, or longer reasoning isn't a silver bullet. In fact, it may amplify model biases, overconfidence, or misalignment. And that’s huge. It shifts how we should think about AI model evaluation, not just across different prompts or data types, but across different reasoning lengths.

I believe it's only a matter of time before the limitations of longer reasoning cause unintended side effects like the ones highlighted in this paper. On the surface, the simplest fix seems to be better training data: datasets that handle irrelevant information well and teach the model what is and isn't relevant to the original query, even when the context contains more detail than needed. For now, simply prompting the model to ignore irrelevant information can act as a hot-fix for tasks that require longer reasoning (a sketch follows below), but as we saw, it isn't foolproof and may misfire at times.
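As a concrete (if imperfect) version of that hot-fix, here's a sketch of the kind of instruction you could prepend to a task prompt. The wording is mine and, as noted, it won't save you every time.

```python
# Sketch: a defensive instruction prepended to the task prompt.
# The wording is mine; this is a mitigation, not a guarantee.

IGNORE_IRRELEVANT = (
    "Some of the context below may be irrelevant to the question. "
    "Work out what the question actually asks, ignore anything that does "
    "not bear on it, and keep your reasoning focused on that."
)

def build_prompt(context: str, question: str) -> str:
    return f"{IGNORE_IRRELEVANT}\n\nContext:\n{context}\n\nQuestion: {question}"

print(build_prompt("...long context with distractors...",
                   "How many fruits do you have?"))
```

Even with an instruction like this, the paper's broader lesson stands: evaluate your prompts at the reasoning lengths you actually plan to use, not just at the default.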
