Open LLM Leaderboard: A Deep Dive into the DROP Benchmark and Models "Gaming the Leaderboard"
Original: Open LLM Leaderboard: DROP deep dive
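Hugging Face takes an in-depth look at the DROP (discrete reasoning) benchmark on the Open LLM Leaderboard. The investigation found that many open-source models earned abnormally high scores not because their reasoning surpassed GPT-4's, but because of overfitting to the evaluation format or data contamination. In response, the team adjusted its evaluation and answer-parsing mechanisms so that scores return to realistic levels, and called on the community to establish more rigorous evaluation standards.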
The Hugging Face Open LLM Leaderboard has long served as an important benchmark for the community to evaluate the capabilities of open-source models. However, the team recently noticed an anomaly: many small-to-medium-sized open-source models were achieving extremely high scores on the DROP (Discrete Reasoning Over Paragraphs) benchmark, in some cases higher than GPT-4's. To get to the bottom of this, the Hugging Face team investigated in depth.
Summaries are AI-generated; the original article is authoritative.