Latest in AI

TTS Benchmark Revamped with Objective Standards and Blind ELO Voting (46 Models)
r/LocalLLaMA top day | 5 days ago | Benchmark
Reddit user UkieTechie has revamped their TTS benchmark platform with objective scoring standards and live blind voting, now covering 46 speech synthesis models. Hosted on Hugging Face Space, the arena lets users vote on audio quality without knowing the model name, generating a dynamic ELO leaderboard. The project is open-source on GitHub and welcomes community submissions of new models.
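For readers unfamiliar with how blind pairwise votes become a leaderboard, here is a minimal sketch of the standard Elo update. It is not the project's actual code: the starting rating of 1000, the K-factor of 32, and the names `update_elo` and `leaderboard` are assumptions chosen for illustration.

```python
from collections import defaultdict


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return the new (rating_a, rating_b) after one blind vote.

    k is an assumed K-factor; the benchmark's real value is not stated.
    """
    exp_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


def leaderboard(votes, start: float = 1000.0, k: float = 32.0):
    """Build a ranking from (winner, loser) votes, highest rating first.

    The starting rating is an assumption, not a documented setting.
    """
    ratings = defaultdict(lambda: start)
    for winner, loser in votes:
        ratings[winner], ratings[loser] = update_elo(
            ratings[winner], ratings[loser], True, k
        )
    return sorted(ratings.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical model names; each tuple is one vote as (winner, loser).
    votes = [("model_a", "model_b"), ("model_a", "model_c")]
    for name, rating in leaderboard(votes):
        print(f"{name}: {rating:.1f}")
```

Because each vote moves the two ratings by equal and opposite amounts, the total rating pool stays constant, and an upset win over a higher-rated model shifts the ratings more than an expected win does.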
Can LLMs Beat Classical Hyperparameter Optimization Algorithms?
Hacker News (AI keywords) | 5 days ago | Benchmark
This paper investigates whether LLMs can serve as effective hyperparameter optimization (HPO) agents, competing with established classical methods such as Bayesian optimization, TPE, and random search. The study likely employs a systematic evaluation framework where LLMs iteratively suggest hyperparameter configurations based on task descriptions and historical evaluation results. Findings aim to clarify the practical potential and limitations of LLMs in AutoML pipelines.
Jetson Orin NX Build for Hermes Agent + Benchmarking
r/LocalLLaMA top day | 5 days ago | Hardware
The post describes turning an unused Jetson Orin NX into a compact local LLM server for Hermes Agent testing. The goals were low noise, over 10 tok/s generation, 300 tok/s prompt processing, at least 65K context, and a custom case. After testing Gemma 4, Qwen 3.6, and many quant variants, the author reports Gemma 4 26B A4B UD Q2_K_XL reaching 66K context and 10.21 tok/s near 60K context.
Omi Med STT v1: Open-Weight Medical ASR Fine-Tuned from Parakeet 0.6B ★ 72
r/LocalLLaMA top day | 5 days ago | Release
Omi Health’s founder says he fine-tuned NVIDIA Parakeet TDT 0.6B v2 for clinical speech and released Omi Med STT v1 under CC-BY-4.0. The runtime supports Mac, Windows, and Linux, auto-selecting MLX, NeMo, or GGUF/parakeet.cpp backends. In the author’s held-out medical benchmark, it reports 2.37% medical-WER and 145× realtime on local A10 compute.
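As background on the reported error rate, the sketch below computes ordinary word error rate (WER): word-level edit distance divided by the number of reference words. This is the generic metric only. How the author's "medical-WER" is defined (for example, whether it is restricted to medical terms) is not stated in the post, and the function name `word_error_rate` and the whitespace tokenization are assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: (substitutions + deletions + insertions) / reference words.

    Tokenization is plain whitespace splitting, an assumption for this sketch.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion of a reference word
                    curr[j - 1] + 1,     # insertion of a hypothesis word
                    prev[j - 1] + cost,  # match or substitution
                )
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Made-up sentences for illustration, not data from the benchmark.
    print(word_error_rate("patient denies chest pain", "patient denies chess pain"))
```

A score of 2.37% under this definition would mean roughly 2 to 3 word errors per 100 reference words.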
EVA-Bench Data 2.0: 3 Domains, 121 Tools, 213 Scenarios
Hugging Face Blog | 10 days ago | Benchmark
ServiceNow AI published a Hugging Face Blog post titled “EVA-Bench Data 2.0: 3 Domains, 121 Tools, 213 Scenarios.” Based only on the title, it appears to be a benchmark dataset update involving tool-use or scenario-based AI evaluation. The exact domains, tools, scenario design, licensing, supported models, and evaluation methodology cannot be confirmed without the full article.
QIMMA ⛰: The First Quality-First Arabic Large Language Model (LLM) Leaderboard
Hugging Face Blog | 54 days ago | Release
The Technology Innovation Institute (TII) of the United Arab Emirates — the organization behind the well-known open-source model Falcon — has officially…
A Deep Dive into VAKRA: IBM Research's New Benchmark for Evaluating AI Agent Reasoning, Tool Calling, and Failure Modes ★ 75
Hugging Face Blog | 60 days ago | Release
As generative AI technology has evolved, the industry's focus has shifted from pure "Large Language Models (LLMs)" to "AI Agents" capable of autonomously…
IBM and UC Berkeley Launch IT-Bench and MAST: A New Benchmark and Framework for Diagnosing Why Enterprise AI Agents Fail ★ 80
Hugging Face Blog | 116 days ago | Release
The Pain Points of Enterprise AI Agents in Production: Why Do They Keep Failing? As large language models (LLMs) have rapidly advanced, enterprises have…
Alyah ⭐️: Toward Robust Evaluation of Emirati Dialect Capabilities in Arabic Large Language Models
Hugging Face Blog | 138 days ago | Release
As Arabic large language models (LLMs) develop rapidly, accurately evaluating model performance across different regional dialects has become a significant…
AssetOpsBench: A New Benchmark Bridging the Gap Between AI Agent Evaluation Benchmarks and Real-World Industrial Applications ★ 75
Hugging Face Blog | 144 days ago | Release
In today's era of rapid development in AI Agent technology, how to evaluate the performance of these Agents in real-world settings — particularly in industrial…
Hugging Face Launches BigCodeArena: End-to-End Code LLM Evaluation Through Actual Code Execution ★ 75
Hugging Face Blog | 250 days ago | Release
Hugging Face and the BigCode community have jointly launched a new code model evaluation platform called "BigCodeArena." As AI-assisted coding (such as Copilot…
FilBench Released: Do Large Language Models Really Understand Filipino? A New Evaluation Benchmark Arrives
Hugging Face Blog | 306 days ago | Release
The Hugging Face team and community have collaborated to launch a new evaluation benchmark called "FilBench," aimed at answering a key question: do large…
Evaluating Open-Source Llama Nemotron Models on DeepResearch Bench: NVIDIA Builds a Top-Performing, Portable Deep Research Agent ★ 80
Hugging Face Blog | 313 days ago | Release
This article provides a detailed look at how NVIDIA is using its open-source Llama Nemotron series of models to evaluate and build top-performing, portable…
📚 3LM: A New Benchmark for Evaluating Arabic Large Language Models on STEM and Code
Hugging Face Blog | 317 days ago | Release
The Technology Innovation Institute (TII) of the UAE — the organization behind the Falcon models — has announced on the Hugging Face blog the launch of a new…
Hugging Face Launches TTS Arena: Crowdsourced Evaluation of Speech Synthesis Models Through Community Blind Tests ★ 75
Hugging Face Blog | 838 days ago | New Tool
Hugging Face recently announced the launch of "TTS Arena" (Text-to-Speech Arena), a brand-new open-source platform specifically designed for evaluating…
