
What is ironic is that there have been consistent reports that it does not improve productivity.
A big detail nobody seems to bring up about Project Glasswing is that they didn’t just prompt it “Hey, check out this codebase looking for issues” and out popped zero-days. They ran each project through tens of thousands of dollars’ worth of compute time, iteration after iteration, and after all that they accumulated a report. Now they’ve reached out to some of the most cash-flush companies to say “we can do the same for you.”
Put your quarter in the one-armed bandit. Maybe you’ll get a zero-day, but more than likely you’ll get a “better luck next time.” But please, keep paying us. In 10,000 more iterations we’ll surely find the bug that would have cost you millions.
Yeah, it’s cool that a computer can write a script, but if it takes 5 megawatts to do it, then it’s not really an improvement.
I read that in Ed Zitron’s voice
A competent pentest already costs in the tens of thousands of dollars, and we’re not guaranteed to find anything either. Some of the bugs discovered by Mythos had existed in long-standing codebases for a very long time and were not previously known. I would definitely not write off those capabilities.
on a serious note: designing benchmarks is hard.
the consensus has been that creating verifiable benchmarks is surprisingly difficult, and the genuinely hard ones (like HLE) only get included in these benchmark images when new, higher scores are achieved.
it’s just soooo nice seeing a 99% score on a tool-calling benchmark which literally just tests whether the model can generate proper JSON
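to illustrate, here’s roughly what such a check amounts to, as a minimal sketch (the tool name and schema below are made up, not taken from any specific benchmark):

```python
import json

# hypothetical tool and its required argument keys, purely for illustration
EXPECTED_TOOLS = {"get_weather": {"city"}}

def score_tool_call(model_output: str) -> bool:
    """Pass if the output parses as JSON and names a known tool with the required arguments."""
    try:
        call = json.loads(model_output)
    except json.JSONDecodeError:
        return False
    if not isinstance(call, dict):
        return False
    name = call.get("name")
    args = call.get("arguments", {})
    return name in EXPECTED_TOOLS and EXPECTED_TOOLS[name] <= set(args)

print(score_tool_call('{"name": "get_weather", "arguments": {"city": "Berlin"}}'))  # True
print(score_tool_call("not json at all"))                                            # False
```

passing a check like that says nothing about whether calling the tool was actually the right move in context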
people are trying their best designing benchmarks.
The best measure for AI is the productivity and accuracy of the work people do with the models. It doesn’t matter if the tech is good at anything if people don’t use it properly. Just like any tool, there are right and wrong ways to use them.
AI isn’t just about machine learning, but about the role that technology has in our lives. The problem with AI has never been the underlying tech, but how people perceive it and how they use it.
The best measure is indeed the final impact of these systems. However, that is very hard to actually measure properly, and it doesn’t make benchmarks completely useless. Benchmarks are still good data points (if they’re designed well) for measuring advances in the technology. If a model failed at a realistic task before and the next generation can do it, that often translates to a real improvement in impact. Though a 2x improvement on a benchmark doesn’t mean the model will have 2x the impact.
A benchmark can be run automatically and often, while real impact studies take time.
In software development the best measure of quality is the end user having no issues; that doesn’t mean automated testing (unit/integration/end-to-end) is suddenly irrelevant, though.
Data without context is irrelevant and meaningless.
42
67
42*4=196 I think
Luckily this won’t be on the exam…
Well, as long as the AI I use to cheat on the exam wasn’t trained on the confident bullshit that I, or other idiots like me, have said on the internet, I will be fine!
Narrator: supersquirrel would not be fine
What do you get when you multiply six by nine?
All the hippies cut off all their hair.
1337/420≈π
As always, the numbers don’t lie — the people do. And worse, we encounter all this with essentially the same brain as the humans who lit the first spark.
The machines will totally just straight-up lie to you. Agent logic rewards the shortest path to an answer, and they dgaf about telling you they calculated something that they didn’t.
Benchmaxxing is real, and real annoying.
Recent local model releases appear to be good, but I dismissed them because of the high scores (implying benchmaxxing).
This whole Project Glasswing thing, oh gosh… most of the exploits found by that model were later proven to be findable with older models too, so this is nothing new.
This applies 1:1 to right-wingers and wannabe fascists. They too love to make up scary big numbers.
It’s more like “the tests we came up with ourselves show our models improved, therefore you can safely invest a lot of money in us, and uhh yeah, we will become profitable one day.”
The numbers don’t lie! And they spell disaster for you!