Cognition launched FrontierCode, a coding benchmark focused on mergeability rather than only functional correctness. It evaluates correctness, tests, scope discipline, style, and repository-specific quality standards. Built with open-source maintainers and extensive quality control, it shows current frontier models still struggle: Claude Opus 4.8 scores 13.4% on the hardest Diamond subset, ahead of GPT-5.5 and Gemini 3.1 Pro.
Latent Space talks with Lukas Petersson and Axel Backlund of Andon Labs, the authors behind VendingBench. The episode focuses on evaluating Claude models across a range from Haiku to Mythos. It also discusses how they build frontier evals from scratch, with an emphasis on creating benchmarks that remain useful and meaningful over time.
Hugging Face has published a comprehensive glossary of AI agent terminology to resolve industry-wide confusion. The guide focuses on defining critical concepts such as "scaffold" (the code wrapping the LLM) and "harness" (the evaluation and execution environment). This standardization helps developers and researchers communicate more precisely when building and benchmarking agentic systems.
In its latest technical blog post, Vercel shared a significant finding regarding AI Agent architecture: in their Agent Evaluations (Agent Evals), using a…
As generative AI applications become more widespread, one of the biggest challenges developers face is the "non-deterministic" output of large language models…