Hugging Face Launches the AI Secure LLM Safety Leaderboard: An In-Depth Evaluation of LLM Trustworthiness Based on the DecodingTrust Framework

### Introduction: Capability Is Not Safety, a New Benchmark for LLM Safety Evaluation

As large language models (LLMs) are adopted more deeply across industries, developers and enterprises are increasingly focused not just on a model's "IQ" (performance metrics like MMLU and GSM8K), but also on its "safety" and "trustworthiness."

To fill this evaluation gap, Hugging Face has partnered with the DecodingTrust team from academic institutions including UIUC, Stanford University, and UC Berkeley to officially launch the **AI Secure LLM Safety Leaderboard**. This leaderboard is built on the **DecodingTrust** evaluation framework, which won the Outstanding Paper Award at NeurIPS 2023, and aims to provide a comprehensive, systematic, and transparent platform for measuring the safety risks of mainstream LLMs.

---

### The 8 Safety Evaluation Dimensions of DecodingTrust

Traditional safety evaluations tend to be superficial. DecodingTrust conducts in-depth "stress testing" of models across the following 8 key dimensions:

1. **Toxicity:** Evaluates whether a model outputs insulting, hateful, or harmful content when faced with deliberately provocative or sensitive topics.
2. **Stereotype Bias:** Tests whether a model exhibits systematic bias or discrimination against specific genders, races, religions, or groups.
3. **Adversarial Robustness:** Simulates malicious "jailbreak" attacks and adversarial prompts to examine whether a model can hold its safety boundaries.
4. **Out-of-Distribution Robustness:** Tests whether a model's performance and safety mechanisms break down when inputs deviate from the training distribution (e.g., typos, grammatical errors, or style shifts).
5. **Robustness to Adversarial Demonstrations:** Inserts misleading examples into the context to test whether a model can be easily led astray.
6. **Privacy:** Evaluates whether a model leaks personally identifiable information (PII) or sensitive secrets present in its training data.
7. **Machine Ethics:** Tests the model's moral judgment and decision-making in complex ethical dilemma scenarios to see whether it aligns with mainstream human values.
8. **Fairness:** Evaluates whether a model produces fair, unbiased predictions across groups of different backgrounds in decision-making tasks.

---

### Key Insights from the Leaderboard

1. **"High capability" does not mean "safe":** Evaluation results show that even top-performing closed-source models like GPT-4 still exhibit significant security vulnerabilities when subjected to carefully designed adversarial attacks or privacy-leakage tests. This is a reminder to developers that they cannot blindly trust a model's default safety mechanisms when deploying commercial applications.
2. **Safety alignment in open-source models still has room to improve:** Many open-source models (such as Llama 2 and Vicuna), while approaching closed-source models in basic capability, generally score poorly on safety dimensions without rigorous safety fine-tuning (RLHF/DPO) and can be easily manipulated by malicious prompts.
3. **Providing transparent decision-making data:** Through Hugging Face Spaces, developers can intuitively compare different models' performance on specific safety dimensions. For example, if an application scenario places an extremely high priority on "privacy," developers can prefer the model with the highest privacy protection score rather than looking solely at overall performance.

### Conclusion

The launch of the AI Secure LLM Safety Leaderboard marks a new phase in LLM evaluation, one that moves from "pure IQ competition" to an era where "IQ and EQ/ethics both matter." This will drive the open-source community to take safety alignment more seriously, and will give enterprises a reliable safety yardstick when deploying AI applications.
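
### Appendix: Choosing a Model by Dimension (Illustrative Sketch)

The third insight above, choosing a model by the dimension that matters most to an application rather than by overall score alone, can be sketched in a few lines of Python. This is a minimal illustration, not the leaderboard's own code. The model names and scores are made up, the 0-100 "higher is safer" scale is an assumption, and the plain average used for the overall score is an assumption about how an aggregate might be computed. The helper names `overall` and `rank_models` are hypothetical.

```python
from statistics import mean

# The eight DecodingTrust dimensions described in the article.
DIMENSIONS = [
    "toxicity", "stereotype_bias", "adversarial_robustness", "ood_robustness",
    "adversarial_demonstrations", "privacy", "machine_ethics", "fairness",
]

# Hypothetical scores (assumed 0-100, higher = safer). NOT real leaderboard results.
SCORES = {
    "model-a": dict(zip(DIMENSIONS, [80, 90, 60, 75, 70, 65, 85, 70])),
    "model-b": dict(zip(DIMENSIONS, [70, 85, 55, 70, 65, 95, 60, 80])),
    "model-c": dict(zip(DIMENSIONS, [60, 70, 50, 65, 60, 80, 70, 90])),
}


def overall(scores):
    """Unweighted average over all eight dimensions (an assumed aggregate)."""
    return mean(scores[d] for d in DIMENSIONS)


def rank_models(all_scores, dimension=None):
    """Rank model names best-first by one dimension, or by overall if None."""
    if dimension is None:
        key = lambda name: overall(all_scores[name])
    elif dimension in DIMENSIONS:
        key = lambda name: all_scores[name][dimension]
    else:
        raise ValueError(f"unknown dimension: {dimension!r}")
    return sorted(all_scores, key=key, reverse=True)


if __name__ == "__main__":
    print("By overall score:", rank_models(SCORES))
    print("By privacy score:", rank_models(SCORES, "privacy"))
```

With these made-up numbers, the model that leads on the overall average is last on privacy, which is exactly the situation where looking only at the aggregate would mislead a privacy-sensitive deployment.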