Mistral AI demonstrates how LoRA fine-tuning adapts Pixtral-12B to satellite imagery, a specialized visual domain where prompting alone is unreliable. Using the Aerial Image Dataset, the post compares a prompt-based baseline against a fine-tuned model across 30 scene classes. Accuracy rose from 0.56 to 0.91, while invalid label hallucinations dropped from 5% to 0.1%.
Google and Hugging Face have jointly announced a new generation of open-weight models — "Gemma 4." This model represents a major breakthrough in on-device AI…
The Technology Innovation Institute (TII) of the UAE has officially announced the launch of its new "Falcon Perception" model on the Hugging Face blog. As an…
Hcompany has officially released a new model on Hugging Face called **Holotron-12B**, positioned as a "High Throughput Computer Use Agent." Although only the…
French AI startup H Company (formerly Holistic AI, founded by former Google DeepMind researchers) announced on the Hugging Face blog the launch of its new…
The cloud AI model deployment and hosting platform Replicate has officially announced support for running the new lightweight vision-language model (VLM) —…
Traditional OCR systems (such as Tesseract) often struggle with complex layouts, multi-column tables, handwriting, and mathematical formulas, while using…
Visual Language Models (VLMs) combine computer vision with natural language processing, enabling complex tasks such as image captioning and visual question…
Hugging Face's TRL (Transformer Reinforcement Learning) is a popular open-source library specifically designed for aligning language models (LLMs). In its…
With the rapid development of vision-language models (VLMs) and multimodal AI, the amount of data required to train these models has grown explosively…
NVIDIA has partnered with Hugging Face to officially bring its latest lightweight vision-language model (VLM) — the **NVIDIA Llama Nemotron Nano VLM** — to the…
In the inference process of large language models (LLMs) and vision-language models (VLMs), autoregressive decoding is a major performance bottleneck. Each…
H (formerly Holistic AI), a highly regarded French AI startup, recently officially released a new family of vision-language models (VLMs) on the Hugging Face…
Hugging Face recently launched an open-source project called nanoVLM, positioned as "the simplest repository for training Vision Language Models (VLMs) in pure…
With the explosion of multimodal technology, Vision Language Models (VLMs) have evolved from laboratory research prototypes into core tools for enterprises and…
As large language models (LLMs) and vision language models (VLMs) continue to scale up, running these models on limited hardware resources — such as…
### Background With the proliferation of vision-language models (VLMs), using VLMs for document OCR (e.g., converting PDFs to Markdown) has become mainstream…
The Language Technologies department (BSC-LT) of the Barcelona Supercomputing Center (BSC) recently released a new open-source multimodal model on Hugging Face…
Cohere For AI (C4AI) has officially launched "Aya Vision," a series of open-source multimodal models (available in 8B and 32B parameter versions) designed…
Google has officially launched SigLIP 2, a major upgrade to its widely popular SigLIP (Sigmoid Loss for Language-Image Pre-training) vision-language encoder…
Hugging Face has introduced SmolVLM2, the latest addition to its Smol family of lightweight models. SmolVLM2 is designed to bring advanced vision-language…
Google has officially launched the PaliGemma 2 Mix model series — a new family of open-source instruction-tuned vision-language models (VLMs) now available on…
With the rise of open-source video generation models such as LTX-Video, HunyuanVideo, and CogVideoX, building high-quality training datasets has become the…
On January 24, 2025, Hugging Face announced that smolagents — its open-source library designed for building lightweight, high-performance AI agents — now…
Hugging Face has officially introduced the newest members of the SmolVLM family, pushing vision-language model (VLM) sizes even further down to 256M (256…
Hugging Face has recently released a new Visual Document Retrieval (VDR) model — **VDR-2B-multilingual**. This technology marks a formal transition in document…
Google and Hugging Face have jointly announced the release of a new generation of open-weight vision-language model (VLM) — PaliGemma 2. This model continues…
Hugging Face has officially launched a lightweight vision language model (VLM) called **SmolVLM**, designed to bring powerful multimodal understanding…
The Hugging Face official blog has announced the release of a new, massive dataset called "Docmatix," specifically designed for training and fine-tuning…
As vision-language models (VLMs) are increasingly applied to multimodal tasks, how to make these models produce outputs that better align with human…