Using Benchmarks Measuring

OpenAI’s GPT-5.4 sets new records on professional benchmarks

OpenAI released GPT-5.4 today with native computer use, a 1M-token context window, and new professional benchmarks. Find what ...

Communications of the ACM

Measuring What Matters in Large Language Model Performance

As large language models (LLMs) gain momentum worldwide, there’s a growing need for reliable ways to measure their performance. Benchmarks that evaluate LLM outputs allow developers to track ...

Decrypt

OpenAI Says Benchmark Used to Measure AI Coding Skill Is 'Contaminated'—Here's Why

OpenAI wants to retire the leading AI coding benchmark—and the reasons reveal a deeper problem with how the whole industry ...

VentureBeat

Researchers open-source benchmarks measuring quality of AI-generated code

The applications of computer programming are vast in scope. And as ...
