Cognition launched FrontierCode, a coding benchmark focused on mergeability rather than only functional correctness. It evaluates correctness, tests, scope discipline, style, and repository-specific quality standards. Built with open-source maintainers and extensive quality control, it shows current frontier models still struggle: Claude Opus 4.8 scores 13.4% on the hardest Diamond subset, ahead of GPT-5.5 and Gemini 3.1 Pro.
Latent Space talks with Lukas Petersson and Axel Backlund of Andon Labs, the authors of VendingBench. The episode focuses on evaluating Claude models ranging from Haiku to Mythos. It also covers how they build frontier evals from scratch, with an emphasis on benchmarks that stay useful and meaningful over time.
Hugging Face has published a comprehensive glossary of AI agent terminology to resolve industry-wide confusion. The guide focuses on defining critical concepts such as "scaffold" (the code wrapping the LLM) and "harness" (the evaluation and execution environment). This standardization helps developers and researchers communicate more precisely when building and benchmarking agentic systems.
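To make the scaffold/harness distinction concrete, here is a minimal Python sketch. It is not taken from the Hugging Face glossary: the names `scaffold`, `harness`, `Case`, and `toy_model` are hypothetical, and the "model" is a toy function standing in for a real LLM call.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stand-in for an LLM call: maps a prompt string to a reply string.
Model = Callable[[str], str]


def scaffold(model: Model, task: str, max_steps: int = 3) -> str:
    """Scaffold: the code wrapping the model (prompting, looping, stopping)."""
    transcript = f"Task: {task}"
    for _ in range(max_steps):
        reply = model(transcript)
        if reply.startswith("FINAL:"):
            return reply[len("FINAL:"):].strip()
        transcript += f"\n{reply}"
    return ""  # no final answer within the step budget


@dataclass
class Case:
    task: str
    expected: str


def harness(agent: Callable[[str], str], cases: list[Case]) -> float:
    """Harness: runs the agent on each case and scores the results."""
    passed = sum(agent(c.task) == c.expected for c in cases)
    return passed / len(cases)


def toy_model(prompt: str) -> str:
    # Toy model (assumption): "thinks" once, then answers by upper-casing the task.
    if "Thinking" not in prompt:
        return "Thinking..."
    task = prompt.splitlines()[0].removeprefix("Task: ")
    return f"FINAL: {task.upper()}"


cases = [Case("abc", "ABC"), Case("hi", "HI"), Case("x", "y")]
score = harness(lambda t: scaffold(toy_model, t), cases)
print(score)  # fraction of cases the scaffolded agent got right
```

Under this reading, swapping `toy_model` for another model changes what is being measured, while changing `scaffold` or `harness` changes how it is run and scored, which is why the two terms are worth keeping apart when reporting benchmark numbers.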
In its latest technical blog post, Vercel shared a notable finding about AI agent architecture: in its agent evaluations (Agent Evals), using a…
Braintrust, a well-known AI evaluation and tracking platform, has announced that it is now available on the Vercel Marketplace. Braintrust is an LLM (large…
Vercel has announced the official integration of its two core AI development tools, the Vercel AI SDK and Vercel AI Gateway, into GitHub Actions workflows…
As generative AI applications become more widespread, one of the biggest challenges developers face is the "non-deterministic" output of large language models…