Latest in AI

Showing: Benchmark, General, Llama

Google DeepMind Launches the FACTS Benchmark Suite: Systematically Evaluating the Factuality of Large Language Models (★ 80)
Google DeepMind Blog · 187 days ago · Release
As large language models (LLMs) are deployed across a wide range of industries, ensuring the "factuality" of model outputs and reducing "hallucination" has…
Evaluating Open-Source Llama Nemotron Models on DeepResearch Bench: NVIDIA Builds a Top-Performing, Portable Deep Research Agent (★ 80)
Hugging Face Blog · 314 days ago · Release
This article provides a detailed look at how NVIDIA is using its open-source Llama Nemotron series of models to evaluate and build top-performing, portable…
Back to the Future: Hugging Face Launches FutureBench to Evaluate AI Agents' Ability to Predict Future Events (★ 75)
Hugging Face Blog · 332 days ago · Release
What is FutureBench? As large language models (LLMs) and AI agents have rapidly advanced, traditional static benchmarks (such as MMLU and GSM8K) face a…
Introducing HELMET: A Next-Generation Benchmark for Comprehensively Evaluating Long-Context LLMs (★ 80)
Hugging Face Blog · 424 days ago · Release
Background and Pain Points: Moving Beyond the Overly Simple "Needle in a Haystack" Test. In recent years, the context window length supported by large…
Letting Large Models Debate: The First Multilingual LLM Debate Competition (★ 75)
Hugging Face Blog · 571 days ago · Release
This article from the Hugging Face blog introduces "The First Multilingual LLM Debate Competition." As large language models (LLMs) have rapidly advanced…
Hugging Face Launches the "Enterprise Scenarios Leaderboard": An LLM Evaluation Benchmark Designed for Real-World Applications (★ 75)
Hugging Face Blog · 865 days ago · Release
Hugging Face has partnered with Patronus AI, a startup focused on LLM evaluation and defense, to officially launch the Enterprise Scenarios Leaderboard…