"Car Wash" test with 53 models
"I Want to Wash My Car. The Car Wash Is 50 Meters Away. Should I Walk or Drive?" This question has been making the rounds as a simple AI logic test so I wanted to see how it holds up across a broad set of models. Ran 53 models (leading open-source, open-weight, proprietary) with no system prompt, forced choice between drive and walk, with a reasoning field.
On a single run, only 11 out of 53 got it right (42 said walk). But a single run doesn't prove much, so I reran every model 10 times. Same prompt, no cache, clean slate.
The results got worse. Of the 11 that passed the single run, only 5 could do it consistently. GPT-5 managed 7/10. GPT-5.1, GPT-5.2, Claude Sonnet 4.5, every Llama and Mistral model scored 0/10 across all 10 runs.
People kept saying humans would fail this too, so I got a human baseline through Rapidata (10k people, same forced choice): 71.5% said drive. Most models perform below that.
All reasoning traces (ran via Opper, my startup), full model breakdown, human baseline data, and raw JSON files are in the writeup for anyone who wants to dig in or run their own analysis.
67 points by felix089 - 62 comments
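For anyone who wants to redo the bookkeeping from the post on the raw files, here is a minimal sketch of the per-model tally. It assumes each run has already been reduced to a plain "drive" or "walk" string; the function names and the answer order are made up for illustration, and only the 7/10 count for GPT-5 and the 71.5% human baseline come from the post.

```python
from collections import Counter

HUMAN_BASELINE = 0.715  # share of Rapidata respondents who said "drive" (from the post)
CORRECT = "drive"       # the answer the test counts as right


def summarize(answers):
    """Return (correct_count, total, pass_rate) for one model's runs."""
    counts = Counter(a.strip().lower() for a in answers)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no runs to summarize")
    correct = counts[CORRECT]
    return correct, total, correct / total


def classify(answers):
    """Label a model by its score, consistency, and standing vs. the human baseline."""
    correct, total, rate = summarize(answers)
    return {
        "score": f"{correct}/{total}",
        "consistent": correct == total,         # right on every single run
        "beats_humans": rate > HUMAN_BASELINE,  # strictly above 71.5%
    }


# Hypothetical run order; only the 7-of-10 count is taken from the post.
gpt5_runs = ["drive"] * 7 + ["walk"] * 3
print(classify(gpt5_runs))
```

Under these assumptions a 7/10 model lands just under the human baseline, which is why "passed once" and "passes consistently" are worth reporting separately.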
I tried using a custom instruction in ChatGPT to make responses shorter, but I found the output was often nonsensical when I did this.
They are not just an LLM answer, they are an (often cached) LLM summary of web results.
This is why they were often skewed by nonsensical Reddit responses [0].
Depending on the type of input it can lean more toward web summary or LLM answer.
So I imagine that it can just grab the description of the "car wash" test from web results and then get it right because of that.
[0] https://www.bbc.com/news/articles/cd11gzejgz4o
[1] e.g. trained on traces of a reasoning process
What you've proven is that LLMs leverage web search, which I think we've known about for a while.
I don’t think it’s that easy. An intelligent mind will wonder why the question is being asked, whether they misunderstood the question, or whether the asker misspoke, or some other missing context. So the correct answer is neither “walk” nor “drive”, but “Wat?” or “I’m not sure I understand the question, can you rephrase?”, or “Is the vehicle you would drive the same as the car that you want to wash?”, or “Where is your car currently located?”, and so on.
Real people can ask for clarification when things are ambiguous or confusing. Once something is clarified, they can work that into their understanding of how someone communicates about a given topic. An LLM can't.
I'm also curious about Haiku, though I don't expect it to do great.
--
EDIT: Opus 4.6 Extended Reasoning
> Walk it over. 50 meters is barely a minute on foot, and you'll need to be right there at the car anyway to guide it through or dry it off. Drive home after.
Weird since the author says it succeeded for them on 10/10 runs. I'm using it in the app, with memory enabled. Maybe the hidden pre-prompts from the app are messing it up?
I tested Sonnet 4.5 first, which answered incorrectly. Maybe the Claude app's memory system is auto-injecting that answer into the new context (that's how one of the memory systems works: it invisibly injects relevant fragments of previous chats into the prompt).
i.e. maybe Opus got the garbage response auto-injected from the memory feature, and it messed up its reasoning? That's the only thing I can think of...
--
EDIT 2: Disabled memories. Didn't help. But also disabling the biographical information gives:
>Opus 4.6 Extended Reasoning
>Drive it — the whole point is to get the car there!
--
EDIT 3: Yeah, re-enabling either the bio or memories makes it stupid. Sad!
Edit: Found Haiku. Alas!
This reminds me of people who answer with “Yes” when presented with options where both can be true but the expected outcome is to pick one. For example, the infamous: “Will you be paying with cash or credit sir?” then the humorous “Yes.”
If you framed it as "hint: trick question", I expect the score would improve. Let's find out!
1. There is no initial screening that would filter out garbage responses. For example, users who just pick the first answer.
2. They don't ask for reasoning/rationale.
They found that ~15% of US adults under 30 claim to have been trained to operate a nuclear submarine.
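To put a rough number on how much unscreened responses could move the 71.5% figure, here is a back-of-the-envelope sketch. It assumes a simple mixture: a fraction of respondents answer without reading and say "drive" at some fixed rate, and everyone else answers sincerely. The 15% careless share is borrowed from the submarine figure purely as an illustration, and the careless "drive" rates are hypothetical; nothing in the thread measures either for the Rapidata sample.

```python
def sincere_drive_rate(observed, careless_share, careless_drive_rate):
    """Invert observed = (1 - g) * sincere + g * careless for the sincere rate.

    observed:            fraction of all respondents who said "drive"
    careless_share:      assumed fraction g answering without reading
    careless_drive_rate: assumed fraction of careless answers that are "drive"
    """
    if not 0 <= careless_share < 1:
        raise ValueError("careless_share must be in [0, 1)")
    return (observed - careless_share * careless_drive_rate) / (1 - careless_share)


OBSERVED = 0.715  # human baseline reported in the post
G = 0.15          # illustrative assumption, not a measured value

# Three hypothetical behaviors for careless respondents:
# always "walk", coin flip, always "drive".
for careless_rate in (0.0, 0.5, 1.0):
    estimate = sincere_drive_rate(OBSERVED, G, careless_rate)
    print(f"careless drive rate {careless_rate:.1f} -> sincere rate {estimate:.3f}")
```

The direction of the correction depends on which option careless respondents favor, so without knowing the answer order or any screening data the adjustment could push the sincere rate either up or down.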
Also, the summary of the Gemini model says: "Gemini 3 models nailed it, all 2.x failed", but 2.0 Flash Lite succeeded, 10/10 times?
I just repeated that test and it told me to drive both times, with an identical answer: "Drive. You need the car at the car wash."
Now why anyone would wash a toy car at a car wash is beyond comprehension, but the LLM is not there to judge the user's motives.
It’s interesting that all the humans critiquing this assume the car isn’t already at the car wash, but the problem doesn’t say that.
I mean, Sam Altman was making the same calorie-based arguments this weekend https://www.cnbc.com/2026/02/23/openai-altman-defends-ai-res...
I feel like I'm losing grasp of what really is insane anymore.
I asked GPT-5.2 10 times with thinking enabled and it got it right every time.
I think it's related to sycophancy. LLMs are trained to not question the basic assumptions being made. They are horrible at telling you that you are solving the wrong problem, and I think this is a consequence of their design.
They are meant to get "upvotes" from the person asking the question, so they don't want to imply you are making a fundamental mistake, even if it leads you into AI-induced psychosis.
Or maybe they are just that dumb, with fuzzy recall and the ELIZA effect making them seem smart?
EDIT: Though it could simply reflect training data. Maybe Redditors don't drive.