Hugging Face Blog, Feb 2, 2024, 12:00 AM

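Hugging Face Launches the NPHardEval Leaderboard: Revealing the Reasoning Abilities of Large Language Models through Computational Complexity and Dynamic Updates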

Original: NPHardEval Leaderboard: Unveiling the Reasoning Abilities of Large Language Models through Complexity Classes and Dynamic Updates


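Hugging Face has launched the new NPHardEval leaderboard, which uses computational complexity theory (problems in classes such as P, NP-complete, and NP-hard) to rigorously evaluate the logical reasoning and planning abilities of large language models (LLMs). Traditional benchmarks are easily invalidated when test data leaks into training data, so NPHardEval uses a dynamic update mechanism that regularly generates entirely new test sets. The tool helps researchers measure more accurately the true limits of a model's reasoning on complex optimization problems.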

Hugging Face has announced the launch of the new **NPHardEval** leaderboard — a benchmark specifically designed to evaluate the reasoning capabilities of large language models (LLMs). Traditional LLM evaluation benchmarks (such as MMLU or GSM8K) frequently face the problem of "data contamination," where models have already seen the test questions during pretraining, meaning their strong performance may stem from "memorization" rather than genuine "reasoning."
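The idea of regenerating test questions so that memorized answers stop helping can be sketched in a few lines. The following is a minimal illustration only, not NPHardEval's actual code: the function names (`make_knapsack_instance`, `best_value`, `is_correct`), the choice of 0/1 knapsack as the example problem, and the seeding scheme are all assumptions made here for the sake of the example.

```python
import random


def make_knapsack_instance(seed, n_items=6, max_weight=10, max_value=20):
    """Build a fresh 0/1 knapsack instance from a seed (hypothetical helper).

    Changing the seed at each refresh yields questions a model cannot have
    seen during pretraining, while a fixed seed keeps a round reproducible.
    """
    rng = random.Random(seed)
    items = [
        (rng.randint(1, max_weight), rng.randint(1, max_value))
        for _ in range(n_items)
    ]
    # Capacity is half the total weight, so not every item fits.
    capacity = max(1, sum(weight for weight, _ in items) // 2)
    return items, capacity


def best_value(items, capacity):
    """Ground-truth optimum via the classic dynamic-programming table."""
    best = [0] * (capacity + 1)
    for weight, value in items:
        # Iterate capacities downwards so each item is used at most once.
        for c in range(capacity, weight - 1, -1):
            best[c] = max(best[c], best[c - weight] + value)
    return best[capacity]


def is_correct(items, capacity, chosen_indices):
    """Check a model's answer: valid indices, feasible weight, optimal value."""
    if len(set(chosen_indices)) != len(chosen_indices):
        return False
    if any(i < 0 or i >= len(items) for i in chosen_indices):
        return False
    total_weight = sum(items[i][0] for i in chosen_indices)
    total_value = sum(items[i][1] for i in chosen_indices)
    return total_weight <= capacity and total_value == best_value(items, capacity)


if __name__ == "__main__":
    # A new seed per refresh stands in for the benchmark's periodic update.
    items, capacity = make_knapsack_instance(seed=2024)
    print("items (weight, value):", items)
    print("capacity:", capacity)
    print("optimal value:", best_value(items, capacity))
```

Because the answer is checked against a computed optimum rather than a stored answer key, a score on a freshly seeded instance reflects solving the problem rather than recalling it.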



Summaries are AI-generated; the original article is authoritative.