Hugging Face Blog | Jun 18, 2024, 12:00 AM | important 80

BigCodeBench: The Next-Generation Code LLM Evaluation Benchmark and Successor to HumanEval

Original: BigCodeBench: The Next Generation of HumanEval

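The traditional HumanEval code benchmark has become saturated and too simple. Hugging Face, working with a research team, has released a new-generation benchmark, BigCodeBench, which contains 1,140 practical programming tasks covering 139 third-party Python libraries. The benchmark is designed to test an LLM's code generation and instruction-following abilities in complex, multi-step, real-world development scenarios, and is positioned as a new standard for evaluating Code LLMs.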

As large language models (LLMs) have made tremendous strides in code generation, the long-standing industry gold standard — the HumanEval benchmark — has gradually shown its limitations. Many mainstream models now score above 90% on HumanEval, yet this does not mean AI can perfectly handle real-world software development. To address this gap, Hugging Face and its research collaborators have introduced a next-generation code evaluation benchmark: BigCodeBench.

open-source gpt claude gemini huggingface #coding #benchmark #llm-evaluation #python

Summaries are AI-generated; the original article is authoritative.