Hugging Face Blog · Mar 20, 2024, 12:00 AM

Cosmopedia: how to create large-scale synthetic data for pre-training Large Language Models

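Hugging Face released Cosmopedia, at the time the largest open-source synthetic dataset, containing 25 billion tokens. The project uses the Mixtral-8x7B model to generate varied content such as textbooks, blog posts, and tutorials from carefully designed prompts and topics. Experiments show that a 1.8B model pre-trained on this synthetic data (Cosmo-1.8B) outperforms well-known models of similar size on several benchmarks, offering a new approach to generating synthetic data for LLM pre-training.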

Cosmopedia was, at the time of its release, the largest fully open-source synthetic dataset built for pre-training large language models (LLMs). It contains over 30 million files totaling 25 billion (25B) tokens and is designed to cover a range of content styles, including encyclopedia-style articles, textbooks, blog posts, and social media content.
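As a rough illustration of generating varied content from prompts and topics, the sketch below pairs each topic with each output format to build a set of prompts. The topic list, the format list, the template wording, and the `build_prompts` helper are all hypothetical and are not taken from the Cosmopedia pipeline. Sending the prompts to a model such as Mixtral-8x7B is left out.

```python
from itertools import product

# Hypothetical seed topics and output formats, chosen only for illustration.
TOPICS = ["photosynthesis", "binary search"]
FORMATS = ["textbook chapter", "blog post", "tutorial"]

# Hypothetical prompt template; the real prompts are not reproduced here.
TEMPLATE = "Write a {fmt} about {topic} for a general audience."


def build_prompts(topics, formats, template=TEMPLATE):
    """Return one prompt string per (topic, format) pair."""
    return [
        template.format(fmt=fmt, topic=topic)
        for topic, fmt in product(topics, formats)
    ]


prompts = build_prompts(TOPICS, FORMATS)
for prompt in prompts:
    print(prompt)
```

Crossing topics with formats is one simple way to get many distinct prompts from short seed lists. A larger pipeline would likely also vary the audience and the writing style.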

mistral other huggingface #synthetic-data #pre-training #dataset #llm

Summaries are AI-generated; the original article is authoritative.