HN.zip

Even GPT-5.2 Can't Count to Five: Zero-Error Horizons in Trustworthy LLMs

40 points by daigoba66 - 28 comments
emp17344 [3 hidden]5 mins ago
There’s a certain type of user here who reacts with rage when anyone points out flaws with LLMs. Why is that?
password54321 [3 hidden]5 mins ago
The same reason why some people get excited every time a flaw is pointed out.
grey-area [3 hidden]5 mins ago
To those saying this is not surprising, yes it will be surprising to the general public who are being served ads from huge companies like MS or OpenAI saying LLMs can help with their accounting, help them close deals by crunching the numbers in seconds, write complex code for them etc etc.

This is important information for anyone who thinks these systems are thinking, reasoning, and learning from them, or that they're having a conversation with them, i.e. 90% of LLM users.

pants2 [3 hidden]5 mins ago
Doesn't this just look like another case of "count the r's in strawberry", i.e. not understanding how tokenization works?

This is well known and not that interesting to me - ask the model to use python to solve any of these questions and it will get it right every time.
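A minimal sketch of what that delegation looks like, assuming the model emits something like the following Python instead of answering from its weights:

```python
# Character counting is trivial in code, where the string is visible
# letter by letter rather than as opaque tokens.
word = "strawberry"
print(word.count("r"))  # prints 3
```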

wahnfrieden [3 hidden]5 mins ago
It's not dismissible as a misunderstanding of tokens. LLMs also embed knowledge of spelling - that's how they fixed the strawberry issue. It's a valid criticism and evaluation.
cr125rider [3 hidden]5 mins ago
Seems like it’s maybe also a tool steering problem. These models should be reaching for tools to help solve factual problems. LLMs should stick to prose.
simianwords [3 hidden]5 mins ago
There’s no way this is right. I checked complicated ones with the latest thinking model. Can someone come up with a counter example?
pton_xd [3 hidden]5 mins ago
"in this paper we primarily evaluate the LLM itself without external tool calls."

Maybe this is a factor?

staticshock [3 hidden]5 mins ago
LLMs seem to me closer to Kahneman's System 1 than to System 2. When understood in this way, it is obvious why LLMs are bad at counting r's in "strawberries". But it also makes ZEH feel like it couldn't possibly be a useful metric, because it's a System 2 evaluation applied to a System 1 system.
BugsJustFindMe [3 hidden]5 mins ago
People are going to misinterpret this and overgeneralize the claim. This does not say that AI isn't reliable for things. It provides a method for quantifying the reliability for specific tasks.

You wouldn't say that a human who doesn't know how to read is unreliable at everything, just at reading.

Counting is something that even humans need to learn how to do. Toddlers also don't understand quantity. If a 2-year-old is able to count even to 10, it's through memorization, not understanding. It takes them about 2 more years of learning before they're able to comprehend things like numerical correspondence. But they do still know how to do other things that aren't counting before then.

coldtea [3 hidden]5 mins ago
>Counting is something that even humans need to learn how to do

No human who can program, solve advanced math problems, or talk about advanced problem domains at an expert level would, however, fail to count to 5.

This is not a mere "toddlers also need to learn this" but points to a fundamental mismatch in how humans and LLMs learn.

nkrisc [3 hidden]5 mins ago
You’re conflating counting and language.

Many animals can count. Counting is recognizing that the box with 3 apples is preferable to the one with 2 apples.

Yes, 2-year-olds might struggle with the externalization of numeric identities, but if you have 1 M&M in one hand and 5 in the other and ask which they want, they'll take the 5.

LLMs have the language part down, but fundamentally can’t count.

BugsJustFindMe [3 hidden]5 mins ago
The concept of bigger/smaller is useful but is a distinct skill from counting. If you spread the M&Ms apart enough that the part of the brain responsible for gestalt clustering can't group them into a "bigger whole" signal, they'll no longer be able to do the thing you're saying (this is the law of proximity in gestalt psychology).
irishcoffee [3 hidden]5 mins ago
> Counting is something that even humans need to learn how to do. Toddlers also don't understand quantity. If they're able to count to even 10 it's through memorization and not understanding.

I completely agree with you. LLMs are regurgitation machines with less intellect than a toddler, you nailed it.

AI is here!

kenjackson [3 hidden]5 mins ago
Whenever I see these papers and try them, they always work. This paper is two months old, which in LLM years is like 10 years of progress.

It would be interesting to actively track how far along each successive model gets...

coldtea [3 hidden]5 mins ago
Even more interesting to track how many of those are just ad-hoc patched.
raincole [3 hidden]5 mins ago
Probably zero. At the end of the day people pay for LLMs that write better code or summarize PDFs of hundreds of pages faster, not the ones that can count the letter r better.
moffkalast [3 hidden]5 mins ago
Yeah well I presume at this point they have an agent download new LLM related papers as they come out and add all edge cases to their training set asap.

Is tokenization extremely efficient? Yes. Does it fundamentally break character-level understanding? Also yes. The only fix is endless memorization.
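A toy illustration of that mismatch (the chunking below is made up for the example, not a real BPE vocabulary): once text is grouped into multi-character tokens, a word's letter counts are no longer directly visible to the model, so they effectively have to be memorized per token.

```python
# Hypothetical token split; real BPE vocabularies differ.
tokens = ["str", "aw", "berry"]

# The model sees token IDs, not characters, so per-token letter
# counts must effectively be memorized to answer spelling questions.
memorized_r_counts = {"str": 1, "aw": 0, "berry": 2}

word = "".join(tokens)
assert word == "strawberry"
# The memorized route and the character-level route agree: 3 r's.
assert sum(memorized_r_counts[t] for t in tokens) == word.count("r")
```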

wg0 [3 hidden]5 mins ago
Actually, almost all LLMs get the numbering wrong when they write numbered sections in Markdown. They skip numbers in between and such.

So yes.

And the valuations. Trillion dollar grifter industry.

burningion [3 hidden]5 mins ago
Ran this through Qwen3.5-397B-A17B, and the difference between 4 characters and 5 is wild to see:

> are the following parenthesis balanced? ((())))

> No, the parentheses are not balanced.

> Here is the breakdown:

    Opening parentheses (: 3
    Closing parentheses ): 4
... following up with:

> what about these? ((((())))

> Yes, the parentheses are balanced.

> Here is the breakdown:

    Opening parentheses (: 5
    Closing parentheses ): 5
... and uses ~5,000 tokens to get the wrong answer.
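For comparison, the standard counter-based balance check is a few lines of Python. This is a sketch of the textbook algorithm, not what the model actually ran; it flags both of the prompts above as unbalanced:

```python
def balanced(s: str) -> bool:
    """True if every ')' matches an earlier '(' and all '(' are closed."""
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a ')' with nothing open
                return False
    return depth == 0

print(balanced("((())))"))   # first prompt: 3 opens, 4 closes -> False
print(balanced("((((())))")) # second prompt: 5 opens, 4 closes -> False
```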
bigstrat2003 [3 hidden]5 mins ago
Let us be very clear: there is no such thing as a trustworthy LLM. Time and again they have shown that they understand nothing. They can be useful in the right context, but you can't trust them at all.
justinator [3 hidden]5 mins ago
One! Two! Five!
charcircuit [3 hidden]5 mins ago
Why didn't OpenAI finetune the model to use the python tool it has for these tasks?
ej88 [3 hidden]5 mins ago
They do, in the paper they mention they evaluate the LLM without tools
throwuxiytayq [3 hidden]5 mins ago
> This is surprising given the excellent capabilities of GPT-5.2.

Is this seriously surprising to anyone who knows the absolute minimum about how LLMs parse and understand text?

dontlikeyoueith [3 hidden]5 mins ago
Nope.

It's only surprising to people who still think they're going to build God out of LLMs.

parliament32 [3 hidden]5 mins ago
> This is surprising given the excellent capabilities of GPT-5.2

The real surprise is that someone writing a paper on LLMs doesn't understand the baseline capabilities of a hallucinatory text generator (with tool use disabled).

coldtea [3 hidden]5 mins ago
The real surprise is people calling it surprising when researchers and domain experts state something that goes against what those people take as common sense/knowledge, as if they'd caught the researchers out and the researchers hadn't already considered their naive counter-argument.