• maria [she/her]@lemmy.blahaj.zone
    1 day ago

    on a serious note: designing benchmarks is hard.

    the consensus has been that creating verifiable benchmarks is surprisingly difficult, and the ones that are actually hard (like HLE) only show up in these benchmark images when a new high score has been achieved.

    it's just soooo nice seeing a 99% score on a tool-calling benchmark which literally just tests whether the model can generate proper JSON
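
    to make that concrete, here's a rough sketch of the gap between "did it produce valid JSON" and "did it make the right call". everything in it is made up for illustration: the `get_weather` tool, the `name`/`arguments` layout, and the four sample outputs are assumptions, not taken from any real benchmark.

```python
import json


def is_valid_json(output: str) -> bool:
    """Weak check: passes if the output parses as JSON at all."""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False


def is_correct_call(output: str, expected_name: str, expected_args: dict) -> bool:
    """Stricter check: the parsed call must name the expected tool with the expected arguments."""
    try:
        call = json.loads(output)
    except json.JSONDecodeError:
        return False
    if not isinstance(call, dict):
        return False
    return call.get("name") == expected_name and call.get("arguments") == expected_args


# Hypothetical model outputs for the prompt "what's the weather in Oslo?"
outputs = [
    '{"name": "get_weather", "arguments": {"city": "Oslo"}}',   # right tool, right args
    '{"name": "get_weather", "arguments": {"city": "Paris"}}',  # valid JSON, wrong args
    '{"name": "send_email", "arguments": {}}',                  # valid JSON, wrong tool
    '{"name": "get_weather", "arguments": ',                    # truncated, invalid JSON
]

weak_score = sum(is_valid_json(o) for o in outputs) / len(outputs)
strict_score = sum(
    is_correct_call(o, "get_weather", {"city": "Oslo"}) for o in outputs
) / len(outputs)

print(f"valid-JSON score:   {weak_score:.0%}")
print(f"correct-call score: {strict_score:.0%}")
```

    on this toy data the same four outputs score 75% on the weak check and 25% on the strict one, which is the whole problem with reading too much into the first kind of number.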

    people are trying their best designing benchmarks.

    • TotallynotJessica@lemmy.blahaj.zone [M]
      24 hours ago

      The best measure for AI is the productivity and accuracy of the work people do with the models. It doesn’t matter if the tech is good at anything if people don’t use it properly. As with any tool, there are right and wrong ways to use it.

      AI isn’t just about machine learning, but about the role that technology has in our lives. The problem with AI has never been the underlying tech, but how people perceive it and how they use it.

      • Fiery@lemmy.dbzer0.com
        18 hours ago

        The best measure is indeed the final impact of these systems. However, that is very hard to measure properly, and it doesn’t make benchmarks useless. Benchmarks are still good data points (if they’re designed well) for measuring advances in the technology. If a model failed at a realistic task before and the next gen can do it, that often translates to a real improvement in impact. That said, a 2x improvement on a benchmark doesn’t mean the model will have 2x the impact.

        A benchmark can be run automatically and often, while real impact studies take time.

        In software development, the best measure of quality is the end user having no issues. That doesn’t mean automated testing (unit/integration/end-to-end) is suddenly irrelevant, though.