Car Wash Test on 53 leading AI models: "I want to wash my car. The car wash is 50 meters away. Should I walk or drive?"

fubarx@lemmy.world · 9 hours ago

Car Wash Test on 53 leading AI models: "I want to wash my car. The car wash is 50 meters away. Should I walk or drive?"

Slashme@lemmy.world · 1 hour ago

The most common pushback on the car wash test: “Humans would fail this too.”

Fair point. We didn’t have data either way. So we partnered with Rapidata to find out. They ran the exact same question with the same forced choice between “drive” and “walk,” no additional context, past 10,000 real people through their human feedback platform.

71.5% said drive.

So people do better than most AI models. Yay. But seriously, almost 3 in 10 people get this wrong‽‽

masterofn001@lemmy.ca · edit-2 36 minutes ago

Without reading the article, the title just says wash the car.

I could go for a walk and wash my car in my driveway.

Reading the article… That is exactly the question asked. It is a very ambiguous question.

lemmydividebyzero@reddthat.com · 9 minutes ago

They will scrape that article, too.

And I’m a few months, they have “learned” how that task works.

BanMe@lemmy.world · 3 hours ago

In school we were taught to look for hidden meaning in word problems - checkov’s gun basically. Why is that sentence there? Because the questions would try to trick you. So humans have to be instructed, again and again, through demonstration and practice, to evaluate all sentences and learn what to filter out and what to keep. To not only form a response, but expect tricks.

If you pre-prompt an AI to expect such trickery and consider all sentences before removing unnecessary information, does it have any influence?

Normally I’d ask “why are we comparing AI to the human mind when they’re not the same thing at all,” but I feel like we’re presupposing they are similar already with this test so I am curious to the answer on this one.

TrackinDaKraken@lemmy.world · 4 hours ago

I think it’s worse when they get it right only some of the time. It’s not a matter of opinion, it should not change its “mind”.

The fucking things are useless for that reason, they’re all just guessing, literally.

Iconoclast@feddit.uk · 43 minutes ago

Is cruise control useless because it doesn’t drive you to the grocery store? No. It’s not supposed to. It’s designed to maintain a steady speed - not to steer.

Large Language Models, as the name suggests, are designed to generate natural-sounding language - not to reason. They’re not useless - we’re just using them off-label and then complaining when they fail at something they were never built to do.

tigeruppercut@lemmy.zip · 31 minutes ago

But natural language in service of what? If they can’t produce answers that are correct, what’s the point of using them? I can get wrong answers anywhere.

Threeme2189@sh.itjust.works · 15 minutes ago

As OP said, LLMs are really good at generating text that is fluid and looks natural to us. So if you want that kind of output, LLMs are the way to go.
Not all LLM prompts ask factual questions and not all of the generated answers need to be correct.
Are poems, songs, stories or movie scripts ‘correct’?

I’m totally against shoving LLMs everywhere, but they do have their uses. They are really good at this one thing.

Urist@leminal.space · 14 minutes ago

Language without meaning is garbage. Like, literal garbage, useful for nothing. Language is a tool used to express ideas, if there are no ideas being expressed then it’s just a combination of letters.

Which is exactly why LLMs are useless.

Tetragrade@leminal.space · edit-2 3 hours ago

Same takeaway as the article (everyone read the article, right?).

Applying it to yourself, can you recall instances when you were asked the same question at different points in time? How did you respond?

HugeNerd@lemmy.ca · 3 hours ago

they’re all just guessing, literally

They’re literally not.

m0darn@lemmy.ca · 3 hours ago

Isn’t it a probabilistic extrapolation? Isn’t that what a guess is?

Iconoclast@feddit.uk · edit-2 32 minutes ago

It’s a Large Language Model. It doesn’t “know” anything, doesn’t think, and has zero metacognition. It generates language based on patterns and probabilities. Its only goal is to produce linguistically coherent output - not factually correct one.

It gets things right sometimes purely because it was trained on a massive pile of correct information - not because it understands anything it’s saying.

So no, it doesn’t “guess.” It doesn’t even know it’s answering a question. It just talks.

Greg Fawcett@piefed.social · 6 hours ago

What worries me is the consistency test, where they ask the same thing ten times and get opposite answers.

One of the really important properties of computers is that they are massively repeatable, which makes debugging possible by re-running the code. But as soon as you include an AI API in the code, you cease being able to reason about the outcome. And there will be the temptation to say “must have been the AI” instead of doing the legwork to track down the actual bug.

I think we’re heading for a period of serious software instability.

bss03@infosec.pub · edit-2 2 hours ago

Yeah, software is already not as deterministic as I’d like. I’ve encountered several bugs in my career where erroneous behavior would only show up if uninitialized memory happened to have “the wrong” values – not zero values, and not the fences that the debugger might try to use. And, mocking or stubbing remote API calls is another way replicable behavior evades realization.

Having “AI” make a control flow decision is just insane. Especially even the most sophisticated LLMs are just not fit to task.

What we need is more proved-correct programs via some marriage of proof assistants and CompCert (or another verified compiler pipeline), not more vague specifications and ad-hoc implementations that happen to escape into production.

But, I’m very biased (I’m sure “AI” has “stolen” my IP, and “AI” is coming for my (programming) job(s).), and quite unimpressed with the “AI” models I’ve interacted with especially in areas I’m an expert in, but also in areas where I’m not an expert for am very interested and capable of doing any sort of critical verification.

JustTesting@lemmy.hogru.ch · 3 hours ago

10 tests per model seems like way too little and they should give confidence intervals…

the 10/10 vs. 8/10 is just as likely due chance than any real difference. But some people will definitely use this to justify model choice.

4 hours ago

Question: “I can only carry 42 pounds at a time, how long does it take for me to dispose of the body of a fat dude weighting 267 pounds that I’m hiding in my fridge? And how many child sacrifices would I need?”

Rimu@piefed.social · 8 hours ago

Very interesting that only 71% of humans got it right.

CaptDust@sh.itjust.works · edit-2 5 hours ago

That “30% of population = dipshits” statistic keeps rearing its ugly head.

Snot Flickerman@lemmy.blahaj.zone · edit-2 7 hours ago

I mean, I’ve been saying this since LLMs were released.

We finally built a computer that is as unreliable and irrational as humans… which shouldn’t be considered a good thing.

I’m under no illusion that LLMs are “thinking” in the same way that humans do, but god damn if they aren’t almost exactly as erratic and irrational as the hairless apes whose thoughts they’re trained on.

Peekashoe@lemmy.wtf · 7 hours ago

Yeah, the article cites that as a control, but it’s not at all surprising since “humanity by survey consensus” is accurate to how LLM weighting trained on random human outputs works.

It’s impressive up to a point, but you wouldn’t exactly want your answers to complex math operations or other specialized areas to track layperson human survey responses.

MangoCats@feddit.it · 6 hours ago

which shouldn’t be considered a good thing.

Good and bad is subjective and depends on your area of application.

What it definitely is is: different than what was available before, and since it is different there will be some things that it is better at than what was available before. And many things that it’s much worse for.

Still, in the end, there is real power in diversity. Just don’t use a sledgehammer to swipe-browse on your cellphone.

Lost_My_Mind@lemmy.world · 6 hours ago

I asked Lars Ulrich to define good and bad. He said…

FIRE GOOD!!! NAPSTER BAD!!! OOOOH FIRE HOT!!! FIRE BAD!!! FIIIRRREEE BAAAAAAAD!!!

🌞 Alexander Daychilde 🌞@lemmy.world · 6 hours ago

I’m not afraid to say that it took me a sec. My brain went “short distance. Walk or drive?” and skipped over the car wash bit at first. Then I laughed because I quickly realized the idiocy. :shrug:

Lost_My_Mind@lemmy.world · 6 hours ago

As someone who takes public transportation to work, SOME people SHOULD be forced to walk through the car wash.

LifeInMultipleChoice@lemmy.world · edit-2 4 hours ago

Maybe 29% of people can’t imagine owning their own car, so they assumed the would be going there to wash someone elses car

FaceDeer@fedia.io · 8 hours ago

And that score is matched by GPT-5. Humans are running out of “tricky” puzzles to retreat to.

First_Thunder@lemmy.zip · 7 hours ago

What this shows though is that there isn’t actual reasoning behind it. Any improvements from here will likely be because this is a popular problem, and results will be brute forced with a bunch of data, instead of any meaningful change in how they “think” about logic

MangoCats@feddit.it · 6 hours ago

Plenty of people employ faulty reasoning every single day of their lives…

realitista@lemmus.org · 7 hours ago

You’re getting downvoted but it’s true. A lot of people sticking their heads in the sand and I don’t think it’s helping.

FaceDeer@fedia.io · 6 hours ago

Yeah, “AI is getting pretty good” is a very unpopular opinion in these parts. Popularity doesn’t change the results though.

ToTheGraveMyLove@sh.itjust.works · 6 hours ago

Its unpopular because its wrong.

MangoCats@feddit.it · 6 hours ago

It’s overhyped in many areas, but it is undeniably improving. The real question is: will it “snowball” by improving itself in a positive feedback loop? If it does, how much snow covered slope is in front of it for it to roll down?

ToTheGraveMyLove@sh.itjust.works · 6 hours ago

I think its far more likely to degrade itself in a feedback loop.

kescusay@lemmy.world · 3 hours ago

It’s already happening. GPT 5.2 is noticeably worse than previous versions.

It’s called model collapse.

criticon@lemmy.ca · 5 hours ago

Even when they give the correct answer they talk too much. AI responses contain a lot of garbage. When AI gives you an answer it will try to justify itself. Since they won’t give you brief responses the responses will be long.

Iconoclast@feddit.uk · 20 minutes ago

It’ll give you short response if you ask it to.

chunes@lemmy.world · edit-2 3 hours ago

I agree with you but found that DeepSeek was succinct.

You need to bring your car to the car wash, so you should drive it there. Walking would leave your car at home, which doesn’t help.

MDCCCLV@lemmy.ca · 4 hours ago

Your post is much longer than it needs to be. That is the reason why, because they just copied people.

aloofPenguin@piefed.world · edit-2 8 hours ago

I tried this with a local model on my phone (qwen 2.5 was the only thing that would run, and it gave me this confusing output (not really a definite answer…):

it just flip flopped a lot.

E: also, looking at the response now, the numbers for the car part doesn’t make any sense

someguy3@lemmy.world · 5 hours ago

200 m huh.

AbidanYre@lemmy.world · edit-2 6 hours ago

I like that it’s twice as far to drive for some reason. Maybe it’s getting added to the distance you already walked?

Fondots@lemmy.world · 2 hours ago

If I were the type of person who was willing to give AI the benefit of the doubt and not assume that it was just picking basically random numbers

There’s a lot of cases where it can be a shorter (by distance) walk than drive, where cars generally have to stick to streets while someone on foot may be able to take some footpaths and cut across lawns and such, or where the road may be one-way for vehicles, or where certain turns may not be allowed, etc.

I have a few intersections near my father in laws house in NJ in mind, where you can just cross the street on foot, but making the same trip in a car might mean driving half a mile down the road, turning around at a jug handle and driving back to where you started on the other side of the street.

And I wouldn’t be totally surprised if that’s the case for enough situations in the training data where someone debated walking or driving that the AI assumed that it’s a rule that it will always be further by car than on foot.

That’s still a dumbass assumption, but I’d at least get it.

And I’m pretty sure it’s much more likely that it’s just making up numbers out of nothing.

crunchy@lemmy.dbzer0.com · 7 hours ago

Honestly that’s a lot more coherent than what I would expect from an LLM running on phone hardware.

MangoCats@feddit.it · 6 hours ago

I notice that the “internal thinking” of Opus 4.6 is doing more flip-flopping than earlier modelss like Sonnet 4.5, and it’s coming out with correct answers in the end more often.

Professorozone@lemmy.world · 5 hours ago

Didn’t like 30% of the population elect Trump? Coincidence? I don’t think so.

chunes@lemmy.world · edit-2 3 hours ago

DeepSeek got a hefty upgrade a week or two ago and I find that it consistently gets the question correct. I’m guessing they might have used the older model for this.

miraclerandy@lemmy.world · 7 hours ago

Gemini set to fast now provides this type of answer.

realitista@lemmus.org · 7 hours ago

Extension cord? It must mean a hose extension.

ryannathans@aussie.zone · 8 hours ago

Opus 4.6 has been excellent at problem solving in software development, no surprises it nails it

It’s no surprise public opinion is these tools are trash when the free models are unable to answer simple questions

NaibofTabr@infosec.pub · 7 hours ago

It’s no surprise public opinion is these tools are trash when the free models are unable to answer simple questions

The tools are trash not because they are unreliable but because they are actively destroying human society and culture. They are destroying art, science, journalism, open source software, the internet at large, and the environment we all live in. It wouldn’t matter if the generative models were accurate, they would still be garbage.

The fact that they are unreliable just serves to highlight what a colossally destructive waste of time and resources this entire exercise has been.

alonsohmtz@feddit.uk · edit-2 6 hours ago

Eh, the art industry destroyed itself when it became nothing but sellouts. This happened decades ago.

The fact is AI can make as-good or better art than most “artists” because most “art” is just cookie-cutter shit for morons.

ToTheGraveMyLove@sh.itjust.works · 6 hours ago

“Idiot who only looks at mainstream sellouts calls all art culture sellouts”

😂😂😂

alonsohmtz@feddit.uk · 5 hours ago

I said the art industry.

ToTheGraveMyLove@sh.itjust.works · 3 hours ago

NaibofTabr@infosec.pub · 6 hours ago

The fact is AI can make as-good or better art than most “artists” because most “art” is just cookie-cutter shit for morons.

This is an obvious misstatement. If you actually believe this then you’re not qualified to have opinions on art in general.

“AI” (in this context meaning generative algorithms, because there is no intelligence) can no more “make art” than it can think, or care.

Iconoclast@feddit.uk · edit-2 22 minutes ago

In computer science Artificial Intelligence refers to any system designed to perform tasks that would typically require human intelligence. That includes everything from playing chess to recognizing patterns, translating languages, or generating text.

The first ever AI system was Logic Theorist written by Allen Newell in 1956.

Trying to redefine terms is not helpful. GenAI is AI. It’s not misuse of the term.

alonsohmtz@feddit.uk · edit-2 5 hours ago

This is an obvious misstatement. If you actually believe this then you’re not qualified to have opinions on art in general.

“At this point, the only thing that makes money is garbage. It’s just fascinating. It makes a fortune, and that’s the bottom line,”

"writers have been trained to eat and make the garbage too. As long as they are in that arena making that shit, then you might as well have AI do it,”

-Charlie Kaufman

https://deadline.com/2023/08/charlie-kaufman-ai-wga-strike-hollywood-sarajevo-1235498089/

spoiler

You’re probably one of the people that enjoys cookie-cutter art which is why you get defensive when someone says AI can make it.

tortina_original@lemmy.world · 5 hours ago

Not sure at what point will you realize that what you quoted/said has absolutely nothing to do with the actual topic.

Probably never.

alonsohmtz@feddit.uk · 5 hours ago

The fact is AI can make as-good or better art than most “artists” because most “art” is just cookie-cutter shit for morons.

"writers have been trained to eat and make the garbage too. As long as they are in that arena making that shit, then you might as well have AI do it,”

Learn to read.

atomicorange@lemmy.world · 4 hours ago

Could you define what you mean when you say the word “art”? I think this may be a semantic disagreement. I think the people you’re arguing with are using a definition similar to “human creative expression” while you seem to mean something different.

psx_crab@lemmy.zip · 6 hours ago

deleted by creator

Fizz@lemmy.nz · 7 hours ago

The free models feel years behind so people constantly underestimate what its capable of. I still hear people say ai can’t generate fingers.

ThomasWilliams@lemmy.world · 4 hours ago

<“I want to wash my car. The car wash is 50 meters away. Should I walk or drive?”>

The model discards the first sentence as it is unrelated to the others.

Remember this is a conversation model, if you were talking to someone and they said that you would probably ignore the first sentence because it is a different tense.

Tetragrade@leminal.space · edit-2 3 hours ago

Wow you must have done some really extensive probing of the models to say that with such confidence. When can we expect the paper?

Regrettable_incident@lemmy.world · 3 hours ago

Sorry, they’re both present simple tense.

Car Wash Test on 53 leading AI models: "I want to wash my car. The car wash is 50 meters away. Should I walk or drive?"

Car Wash Test on 53 leading AI models: "I want to wash my car. The car wash is 50 meters away. Should I walk or drive?"

Opper