• Fiery@lemmy.dbzer0.com
    16 hours ago

    The best measure is indeed the final impact of these systems. However, real impact is very hard to measure properly, and benchmarks aren’t useless just because it’s the better measure. Benchmarks, if they’re designed well, are still good data points for measuring advances in the technology. If a model failed at a realistic task before and the next generation can do it, that often translates into a real improvement in impact. That said, a 2x improvement on a benchmark doesn’t mean the model will have 2x the impact.

    A benchmark can be run automatically and often, while real impact studies take time.

    In software development, the best measure of quality is the end user having no issues, but that doesn’t mean automated testing (unit/integration/end-to-end) is suddenly irrelevant.