Hugging Face recently announced a major upgrade to its Arabic Large Language Model (LLM) leaderboard, aiming to provide a more credible and comprehensive…
The recently launched `smolagents` from Hugging Face is an AI agent framework that emphasizes being lightweight and code-centric (Code Agent). However, as the…
Hugging Face's Open LLM Leaderboard has long served as an important barometer for measuring the capabilities of open-source large language models (LLMs)…
Hugging Face, in collaboration with its partners, has officially launched the "Open Arabic LLM Leaderboard 2.0." With the explosive growth of Arabic large…
As multimodal large language models (such as GPT-4o, Gemini, and various open-source audio models) continue to proliferate, AI's ability to process audio has…
As large language models (LLMs) are increasingly applied in software development and logical reasoning, there is growing interest in whether models possess the…
### Background and Challenges: The Difficulty of Evaluating Non-English LLMs In the current landscape of large language model (LLM) development, evaluating…
Hugging Face has officially launched the "Open Japanese LLM Leaderboard," a community-driven platform dedicated to evaluating the performance of…
This article from the Hugging Face blog introduces "The First Multilingual LLM Debate Competition." As large language models (LLMs) have rapidly advanced…
As large language models (LLMs) have rapidly advanced, traditional static benchmarks (such as MMLU) have increasingly faced saturation and gaming problems. As…
This case study provides a detailed account of how non-profit organization Digital Green, with support from Hugging Face's Expert Support team, optimized its…
CinePile is a multimodal question-answering dataset focused on movie and long-video understanding. In traditional dataset construction, researchers commonly…
Hugging Face has officially launched the "Open FinLLM Leaderboard" — a new platform dedicated to evaluating and tracking the performance of large language…
The Hugging Face team and its collaborators have jointly launched a new benchmark called "BenCzechMark," designed to evaluate the understanding and generation…
Hugging Face has partnered with independent AI evaluation organization Artificial Analysis to officially launch the "Text to Image Leaderboard & Arena." This…
As large language models (LLMs) become increasingly prevalent in software development and automated workflows, their "dual-use" risks in the cybersecurity…
Hugging Face has announced the launch of the "Open Arabic LLM Leaderboard," an important initiative aimed at advancing Arabic natural language processing (NLP)…
Hugging Face has officially launched the "Open Leaderboard for Hebrew LLMs," an open-source evaluation platform specifically designed for Hebrew large language…
Hugging Face has announced the launch of the new "Open Chain of Thought (CoT) Leaderboard," a public platform specifically designed to evaluate and compare the…
Hugging Face has announced the official launch of the "Open Medical-LLM Leaderboard" in collaboration with researchers from Open Life Science AI and the…
As code large language models (Code LLMs) develop rapidly, fairly and accurately evaluating their capabilities has become a major challenge. Traditional…
As large language models (LLMs) have been widely adopted across industries, ensuring AI systems remain safe and compliant while preventing harmful outputs has…
Hugging Face has announced the launch of a new multimodal benchmark and leaderboard called "ConTextual," aimed at addressing the shortcomings of existing…
Hugging Face and South Korea's leading AI startup Upstage have jointly announced the launch of the "Open Ko-LLM Leaderboard." This is a brand-new evaluation…
Hugging Face has announced the launch of the new **NPHardEval** leaderboard — a benchmark specifically designed to evaluate the reasoning capabilities of large…
Hugging Face has partnered with Patronus AI — a startup focused on LLM evaluation and defense — to officially launch the **Enterprise Scenarios Leaderboard**…
While large language models (LLMs) have demonstrated remarkable generative capabilities across many domains, "hallucination" — where a model confidently…
### Introduction: Capability Is Not Safety — A New Benchmark for LLM Safety Evaluation As large language models (LLMs) are adopted more deeply across…
### Background: The Gap Between Leaderboard Scores and Paper Results By mid-2023, Hugging Face's Open LLM Leaderboard had become the community's go-to platform…
As large language models (LLMs) become widely used across various domains, the issues of bias and toxicity in model outputs have received increasing attention…