Latest in AI

Showing: llm-evaluation · Researchers

Shall We Play a Game? LLMs Use Tactical Nukes in 95% of Simulations
Hacker News (AI keywords) · 2 days ago · Commentary
Only the title of this post is available. It claims that language models used tactical nuclear weapons in 95% of simulated conflict scenarios. Without the article body, the methodology, models tested, prompt design, controls, and validity of the result cannot be assessed.
Rails testing on autopilot: Building an agent that writes what developers won't
Mistral AI News · 6 days ago · Tutorial
Mistral AI describes an autonomous Rails testing agent built on its open-source Vibe coding assistant. The agent reads Rails files, applies file-type-specific skills, generates or improves RSpec tests, and validates them with RuboCop, RSpec, and SimpleCov. In a 275-file experiment, it reached 100% passing tests, 100% average line coverage, zero RuboCop violations, and a higher LLM-as-a-judge score, while stressing that generated tests must actually run.
If LLMs Have Human-Like Attributes, Then So Does Age of Empires II
Hacker News (AI keywords) · 6 days ago · Paper
The paper argues that claims about LLMs having human-like attributes, such as morality or language understanding, can be methodologically fragile. By building and training a simple neural network on Age of Empires II, the author suggests such attributes may not be empirically unique to LLMs. The key recommendation is to define explicit measurement criteria and use a null assumption of LLM non-uniqueness before drawing anthropomorphic conclusions.
Nathan Lambert's Latest Updates: ATOM Report, Post-Training Course, New Book, and Ongoing AI Research ★ 70
Interconnects (Nathan L.) · 60 days ago · Release
Nathan Lambert, a prominent AI expert, former Alignment Scientist at Hugging Face, and founder of the popular newsletter Interconnects, recently wrote about…
GPT 5.4 Is a Big Step for Codex (But Why the Author Still Chooses Claude) ★ 80
Interconnects (Nathan L.) · 88 days ago · Commentary
In this article from the well-known AI commentary blog Interconnects, author Nathan Lambert analyzes GPT 5.4, focusing specifically on the significant changes it…
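Open LLM Leaderboard Carbon Emissions and Model Performance Analysis: Lessons on the Trade-off Between Performance and Environmental Impact
Hugging Face Blog · 521 days ago · Commentary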
Hugging Face recently published an in-depth analysis of its well-known Open LLM Leaderboard, examining the carbon dioxide (CO₂) emissions generated during…
BigCodeBench: The Next-Generation Code LLM Benchmark and Successor to HumanEval ★ 80
Hugging Face Blog · 726 days ago · Release
As large language models (LLMs) have made tremendous strides in code generation, the long-standing industry gold standard — the HumanEval benchmark — has…
Using Structured Generation to Improve Prompt Consistency and Output Evaluation ★ 75
Hugging Face Blog · 775 days ago · Tutorial
When developing applications based on large language models (LLMs) — such as AI agents, RAG systems, or automated workflows — one of the biggest challenges…
Hugging Face Launches a Red-Teaming Resistance Leaderboard: Evaluating LLMs' Ability to Withstand Malicious Jailbreaks and Adversarial Attacks ★ 75
Hugging Face Blog · 842 days ago · Release
Background: The Shortcomings of Static Safety Evaluations. As large language models (LLMs) are widely adopted across industries, AI safety has become an…
Open LLM Leaderboard: A Deep Dive into the DROP Benchmark and the Leaderboard-Gaming Phenomenon ★ 75
Hugging Face Blog · 926 days ago · Commentary
The Hugging Face Open LLM Leaderboard has long served as an important benchmark for the community to evaluate the capabilities of open-source models. However…