Google and Hugging Face have jointly announced the release of PaliGemma 2, a new-generation open-weight vision-language model (VLM). This model continues…
Hugging Face has officially launched a lightweight vision-language model (VLM) called **SmolVLM**, designed to bring powerful multimodal understanding…
CinePile is a multimodal question-answering dataset focused on movie and long-video understanding. In traditional dataset construction, researchers commonly…
Meta has officially introduced the Llama 3.2 family of open-source models, marking a significant architectural upgrade with two major breakthroughs: multimodal…
With the explosion of video generation and understanding models such as Sora and Gen-3, high-quality video training data has become a key battleground for…
The Hugging Face official blog has announced the release of a new, massive dataset called "Docmatix," specifically designed for training and fine-tuning…
As vision-language models (VLMs) are increasingly applied to multimodal tasks, how to make these models produce outputs that better align with human…
In this case study, Prezi — the well-known company behind the non-linear presentation software of the same name — shares how it is embracing the "multimodal…
The Technology Innovation Institute (TII) of Abu Dhabi has officially released a new open-source model family on Hugging Face — Falcon 2 11B. This model, with…
Google has officially launched PaliGemma, a powerful yet lightweight open-source Vision-Language Model (VLM). The release of PaliGemma represents a significant…
Hugging Face has announced the launch of Idefics2, the next generation of its open-source Vision Language Model (VLM). With 8 billion (8B) parameters, this…
This technical blog post published by Hugging Face provides an accessible yet thorough breakdown of the core principles and applications of Vision Language…
Hugging Face has announced the launch of a new multimodal benchmark and leaderboard called "ConTextual," aimed at addressing the shortcomings of existing…
Hugging Face has officially launched IDEFICS (Image-aware Decoder Enhanced à la Flamingo with Interleaved Cross-attentionS), an open-source multimodal vision-language model…
This official Hugging Face blog post details how to build an "AI WebTV" (AI web television channel) from scratch — a system capable of automatically generating…
This technical blog post from Hugging Face details how to accelerate the vision-language model (VLM) "BridgeTower" on Intel's Habana Gaudi2 deep learning…
Kakao Brain, the AI research arm of South Korean tech giant Kakao, has officially released newly trained ViT (Vision Transformer) and ALIGN (A Large-scale…
This is a classic technical guide written by the Hugging Face team, designed to help developers and researchers gain a deep understanding of how…
Although Hugging Face rose to prominence in the field of natural language processing (NLP), it has made tremendous strides in computer vision (CV) in recent…
Hugging Face has announced new official Audio and Vision documentation guides for its core open-source library `datasets`. As multimodal AI models continue to…
As multimodal AI (combining text, images, audio, and other media) advances rapidly, the ethical challenges brought about by the technology are growing…
This article introduces DeepMind's Perceiver IO model and its integration into the Hugging Face Transformers library. Traditional Transformer models, while…