
What is ironic is that there have been consistent reports that it does not improve productivity.
A big detail nobody seems to bring up about Project Glasswing is that they didn’t just prompt it “Hey, check out this codebase looking for issues” and out popped zero-days. They ran each project through tens of thousands of dollars’ worth of compute time, iteration after iteration, and after all that they accumulated a report. Now they’ve reached out to some of the most cash-flush companies to say “we can do the same for you.”
Put your quarter in the one-armed bandit. Maybe you’ll get a zero-day, but more than likely you’ll get a “better luck next time.” But please, keep paying us. In 10,000 more iterations we’ll surely find the bug that would have cost you millions.
Yeah, it’s cool that a computer can write a script, but if it takes 5 megawatts to do it, then it’s not really an improvement.
I read that in Ed Zitron’s voice
A competent pentest already costs in the tens of thousands of dollars, and we’re not guaranteed to find anything either. Some of the bugs discovered by Mythos had existed in long-standing codebases for a very long time and were not previously known. I would definitely not write off those capabilities.
on a serious note: designing benchmarks is hard.
the consensus has been that creating verifiable benchmarks is surprisingly difficult, and the genuinely hard ones (like HLE) only get included in these benchmark images when new, higher scores are achieved.
it’s just soooo nice seeing a 99% score on a tool-calling benchmark which literally just tests whether the model can generate proper JSON
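to illustrate, here’s roughly what such a check amounts to, as a minimal sketch (the tool name and schema below are made up, not taken from any specific benchmark):

```python
import json

# hypothetical tool and its required argument keys, purely for illustration
EXPECTED_TOOLS = {"get_weather": {"city"}}

def score_tool_call(model_output: str) -> bool:
    """Pass if the output parses as JSON and names a known tool with the required arguments."""
    try:
        call = json.loads(model_output)
    except json.JSONDecodeError:
        return False
    if not isinstance(call, dict):
        return False
    name = call.get("name")
    args = call.get("arguments", {})
    return name in EXPECTED_TOOLS and EXPECTED_TOOLS[name] <= set(args)

print(score_tool_call('{"name": "get_weather", "arguments": {"city": "Berlin"}}'))  # True
print(score_tool_call("not json at all"))                                            # False
```

passing a check like that says nothing about whether calling the tool was actually the right move in context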
people are trying their best designing benchmarks.
The best measure for AI is the productivity and accuracy of the work people do with the models. It doesn’t matter if the tech is good at anything if people don’t use it properly. Just like any tool, there are right and wrong ways to use them.
AI isn’t just about machine learning, but about the role that technology has in our lives. The problem with AI has never been the underlying tech, but how people perceive it and how they use it.
The best measure is indeed the final impact of these systems. However, that is very hard to actually measure properly, and it doesn’t make benchmarks completely useless. Benchmarks are still good data points (if they’re designed well) for measuring advances in the technology. If a model failed at a realistic task before and the next generation can do it, that often translates to a real improvement in impact. Though a 2x improvement on a benchmark doesn’t mean the model will have 2x the impact.
A benchmark can be run automatically and often, while real impact studies take time.
In software development the best measure of quality is the end user having no issues; that doesn’t mean automated testing (unit/integration/end-to-end) is suddenly irrelevant, though.
Data without context is irrelevant and meaningless.
42
67
42*4=196 I think
Luckily this won’t be on the exam…
Well, as long as the AI I use to cheat on the exam wasn’t trained on the confident bullshit that I, or other idiots like me, have said on the internet, I will be fine!
Narrator: supersquirrel would not be fine
What do you get when you multiply six by nine?
All the hippies cut off all their hair.
1337/420≈π
As always, the numbers don’t lie — the people do. And worse, we encounter all this with essentially the same brain as the humans who lit the first spark.
The machines will totally just straight-up lie to you. Agent logic rewards the shortest path to an answer, and they dgaf about telling you they calculated something that they didn’t.
Benchmaxxing is real, and real annoying.
Recent local model releases appear to be good, but I dismissed them because of the high scores (implying benchmaxxing).
This whole Project Glasswing thing, oh gosh… most of the exploits found by that model were later proven to be findable with older models too, so this is nothing new.
This applies 1:1 to right-wingers and wannabe fascists. They too love to make up scary big numbers.
It’s more like “the tests we came up with ourselves show our models improved, therefore you can safely invest a lot of money in us, and uhh yeah, we will become profitable one day.”
The numbers don’t lie! And they spell disaster for you!