Latest in AI

Showing:benchmarksResearchersClear ×

🔥 Trending today

anthropic7 export-controls4 model-access3 spacex3 amazon3 national-security2 open-source2 governance2 ai-policy2 ai-regulation2

Topic

Release New Tool Tutorial Business Paper Benchmark Opinion Regulation

For

General Developers Designers Product Founders Marketing Researchers Students

[AINews] Open Models, Model Labs vs Agent Labs, and the Untrainable★ 72
Latent Space3 days agoCommentary
This AINews issue uses Sarah Guo’s essay as a lens for current AI industry debates: where open models matter, how agent labs differ from model labs, and what cannot be trained away. It also recaps discourse around Anthropic Fable/Mythos, Fable 5’s capabilities, Google’s DiffusionGemma, and maturing agent infrastructure. The central takeaway is that durable value may lie in integration, customer translation, maintenance, and intent rather than model scores alone.
Cohere North Mini Code 1.0
r/LocalLLaMA top day5 days agoRelease
CohereLabs’ North Mini Code 1.0 appears to have moved from early access to final release, with weights available on Hugging Face. The Reddit post describes it as a 30B A3B coding model. Its Artificial Analysis overall score of 28 trails Qwen 3.6 35B at 43, but its coding index score of 33 is close to Qwen’s 35 and above Gemma 4 26B’s 22.
Introducing FrontierCode★ 78
Hacker News (AI keywords)5 days agoBenchmark
Cognition launched FrontierCode, a coding benchmark focused on mergeability rather than only functional correctness. It evaluates correctness, tests, scope discipline, style, and repository-specific quality standards. Built with open-source maintainers and extensive quality control, it shows current frontier models still struggle: Claude Opus 4.8 scores 13.4% on the hardest Diamond subset, ahead of GPT-5.5 and Gemini 3.1 Pro.
LocalLLaMA post tier list
r/LocalLLaMA top day6 days agoOpinion
The author proposes a tier list for r/LocalLLaMA posts in response to complaints about declining post quality. Top-tier posts include new local model releases with GGUF/MLX or benchmark data, meaningful optimizations, complete hardware performance reports, and well-analyzed research. Low-tier posts include repeated toy benchmarks, unrelated cloud AI chatter, AI-generated slop, and thinly disguised ads for Claude-wrapper startups.
解讀當前開源與閉源 AI 模型的性能差距：超越單一評估指標的迷思★ 75
Interconnects (Nathan L.)55 days agoOpinion
In today's AI landscape, the performance gap between open-weights models (such as Meta's Llama family) and closed-source models (such as OpenAI's GPT and…
Gemma 4 與開源模型成功的關鍵：為什麼基準測試分數不再是唯一指標★ 75
Interconnects (Nathan L.)72 days agoCommentary
This article takes a deep dive into the release of Google's latest open-source model Gemma 4, using it as an opportunity to re-examine the core factors that…
Google DeepMind 推出評估 AGI 進程的「認知框架」，並同步舉辦 Kaggle 黑客松打造全新評估標準★ 85
Google DeepMind Blog89 days agoRelease
As large language models (LLMs) advance rapidly, traditional AI evaluation benchmarks (such as MMLU, GSM8K, and others) are quickly facing the twin challenges…
Import AI 446：核能 LLM、中國大型 AI 基準測試、AI 評估與政策★ 75
Import AI (Jack Clark)111 days agoCommentary
In this edition of Import AI 446, author Jack Clark explores three highly forward-looking and interconnected topics in current AI development: Nuclear LLMs…
Import AI 445：超級智能的時機點、AI 破解前沿數學證明與全新機器學習研究基準★ 75
Import AI (Jack Clark)118 days agoCommentary
In this edition of Import AI (Issue 445), author Jack Clark guides readers through three core topics at the very frontier of AI development: the timeline for…
Import AI 444：LLM 社會學、華為用 AI 寫作業系統核心、晶片設計基準測試 ChipBench★ 75
Import AI (Jack Clark)125 days agoCommentary
This edition of Import AI (Issue 444), written by Jack Clark, delves into the latest breakthroughs in artificial intelligence across three domains: social…
Opus 4.6、Codex 5.3 與後基準測試時代：2026 年我們該如何評估 AI 模型？★ 80
Interconnects (Nathan L.)125 days agoOpinion
In 2026, with the release of next-generation models such as Anthropic's Opus 4.6 and OpenAI's Codex 5.3, the AI community faces a fundamental challenge…
Hugging Face 推出 Community Evals：別再盲信黑箱排行榜，讓社群來決定模型好壞！★ 85
Hugging Face Blog130 days agoRelease
In today's era of rapid AI advancement, major model vendors and research institutions are releasing all manner of "leaderboards" to claim their models surpass…
Google 發表 Gemini 2.5 Flash：主宰性價比邊界，首創可精確控制的「思考預算」★ 85
TLDR AI (Buttondown)422 days agoRelease
Google has officially released its new model Gemini 2.5 Flash, marking Google's comprehensive dominance over the cost-efficiency Pareto frontier on LMArena…
OpenAI 推出全新主力模型 GPT 4.1：效能與實用性的新平衡★ 85
TLDR AI (Buttondown)425 days agoRelease
OpenAI has officially released its new flagship model GPT 4.1, positioned as the next-generation "workhorse" designed to give developers and enterprises the…
Hugging Face 與 Upstage 推出 Open Ko-LLM 排行榜：引領韓國大語言模型評估生態系
Hugging Face Blog845 days agoRelease
Hugging Face and South Korea's leading AI startup Upstage have jointly announced the launch of the "Open Ko-LLM Leaderboard." This is a brand-new evaluation…