Hugging Face Blog | Apr 16, 2024, 12:00 AM

Launching the LiveCodeBench Leaderboard: Comprehensive, Contamination-Free Evaluation of Code LLMs

Original: Introducing the LiveCodeBench Leaderboard - Holistic and Contamination-Free Evaluation of Code LLMs

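Hugging Face has announced the launch of the LiveCodeBench leaderboard, which aims to address the tendency of traditional code evaluation benchmarks (such as HumanEval) to suffer from data contamination. LiveCodeBench continuously collects newly published competitive programming problems from platforms such as LeetCode and AtCoder, so that models are tested on data they have not seen before. The benchmark evaluates not only code generation but also self-repair, test output prediction, and code execution, giving Code LLMs a more objective, dynamic ranking.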

As code large language models (Code LLMs) develop rapidly, fairly and accurately evaluating their capabilities has become a major challenge. Traditional evaluation benchmarks such as HumanEval and MBPP were published long ago, and their test problems have since been widely incorporated into the pretraining datasets of various models, leading to serious "data contamination" that prevents evaluation results from reflecting models' true generalization ability.
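
The contamination-free idea can be illustrated with a small sketch: if every problem carries a release date, a model can be scored only on problems published after its training cutoff. The `Problem` record, its field names, the sample problems, and the cutoff date below are all hypothetical and chosen for illustration; they are not LiveCodeBench's actual schema or data.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Problem:
    # Hypothetical record; field names are illustrative only.
    problem_id: str
    platform: str
    release_date: date


def uncontaminated(problems, training_cutoff):
    """Keep only problems released strictly after the training cutoff."""
    return [p for p in problems if p.release_date > training_cutoff]


# Made-up sample problems, not real benchmark entries.
problems = [
    Problem("p1", "LeetCode", date(2023, 6, 1)),
    Problem("p2", "AtCoder", date(2023, 11, 15)),
    Problem("p3", "LeetCode", date(2024, 2, 3)),
]

cutoff = date(2023, 9, 30)  # assumed cutoff for a hypothetical model
fresh = uncontaminated(problems, cutoff)
print([p.problem_id for p in fresh])  # ['p2', 'p3']
```

Because the problem pool keeps growing, the same filter can be re-run with a later cutoff for newer models, so each model is compared on problems it could not have seen during training.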

Summaries are AI-generated; the original article is authoritative.