Latest in AI

Showing:model-evaluationGeneralClear ×

🔥 Trending today

anthropic7 export-controls4 model-access3 spacex3 amazon3 national-security2 open-source2 governance2 ai-policy2 ai-regulation2

Topic

Release New Tool Tutorial Business Paper Benchmark Opinion Regulation

For

General Developers Designers Product Founders Marketing Researchers Students

DeepSeek v4 Coding Scores Clash With Broader Frontier Benchmarks
r/LocalLLaMA top day3 days agoCommentary
A Reddit post questions why DeepSeek v4 can rank near the top of coding leaderboards while CAISI reportedly places it about eight months behind the US frontier. The author argues that both views may be compatible because coding benchmarks measure a narrow, heavily optimized slice of capability. For local users, the bigger question is how quantized DeepSeek v4 variants perform in real agent workflows, tool calls, cybersecurity, and abstract reasoning.
First GPT-5.6 tests arrive, targeting Mythos
量子位 QbitAI4 days agoBenchmark
The title indicates that QbitAI is covering the first hands-on tests of GPT-5.6, framed around a comparison with Mythos. Because the article body is unavailable, the testing setup, metrics, task types, and actual performance gap cannot be verified. The item is best treated as an early benchmark or model-comparison report that needs the original article for proper evaluation.
These LLMs are the best at resisting Russian propaganda
Ars Technica AI9 days agoBenchmark
Ars Technica reports on an Estonian government benchmark evaluating how large language models handle Russian propaganda. The test focuses on whether dozens of models resist, repeat, or normalize Russia’s strategic narratives. The topic matters for governments, researchers, and AI builders because LLMs are increasingly used to summarize and mediate public information.
最新開放模型動態 (#21)：開放模型大爆發！Gemma 4、DeepSeek V4、Kimi K2.6、MiMo 2.5、GLM-5.1 等，以及 CAISI V4 評估分析★ 85
Interconnects (Nathan L.)29 days agoCommentary
This is Issue #21 of the "Open Artifacts" column by well-known AI commentator Nathan Lambert, exploring the explosive growth in the open-weights and…