PaliGemma 2

New Multimodal Gen
Released: December 5, 2024

Overview

PaliGemma 2 is Google’s open-weight vision-language model in the Gemma family and the successor to the original PaliGemma. It takes images (documents, charts, screenshots, photos) plus a text prompt and answers in text. Its headline improvements are stronger OCR, grounded visual reasoning, multi-image understanding, and straightforward fine-tuning for applications that run on a single GPU or on edge devices.

Description

PaliGemma 2 pairs a SigLIP vision encoder with a Gemma 2 language decoder to “look, read, and reason.” It ingests one or more images alongside a prompt and produces grounded, step-by-step text: captions, answers, summaries, or structured outputs (Markdown/JSON). Compared with the original PaliGemma, it improves layout-aware OCR, table and chart interpretation, and screenshot/UI analysis. For dense documents, it handles higher-resolution inputs via tiling/cropping strategies.
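The tiling idea mentioned above can be illustrated with a small sketch. This is a generic way to cover a large page with overlapping crops, not a description of what PaliGemma 2 does internally; the function name `tile_boxes`, the 896-pixel tile size, and the 64-pixel overlap are all assumptions chosen for illustration.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def tile_boxes(width: int, height: int, tile: int = 896, overlap: int = 64) -> List[Box]:
    """Cover a width x height image with overlapping square crops.

    Hypothetical helper: tile size and overlap are illustrative defaults,
    not values taken from the model's documentation.
    """
    if tile <= overlap:
        raise ValueError("tile must be larger than overlap")
    stride = tile - overlap

    def starts(size: int) -> List[int]:
        # An image no larger than one tile needs a single crop.
        if size <= tile:
            return [0]
        positions = list(range(0, size - tile, stride))
        positions.append(size - tile)  # last crop sits flush with the far edge
        return positions

    return [
        (x, y, min(x + tile, width), min(y + tile, height))
        for y in starts(height)
        for x in starts(width)
    ]


if __name__ == "__main__":
    # Example: a 2000 x 1000 scan of a dense document page.
    for box in tile_boxes(2000, 1000):
        print(box)
```

Each box could then be cropped out (for example with an image library) and sent to the model as a separate image, with the overlap reducing the chance that a line of text is cut in half at a tile boundary.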
For builders, it is instruction-tuned for reliable formatting, supports function/tool calling for agent workflows (e.g., crop → read → reason), and integrates with RAG so answers can cite or reference specific regions. It is lightweight enough to run on a single modern GPU, with 8-bit or 4-bit quantization and LoRA or full fine-tuning available to adapt it to specific domains (invoices, forms, dashboards, manuals). Typical uses include enterprise document automation and extraction, analytics over charts and dashboards, accessibility (image descriptions), and developer assistants that reason directly from screenshots.
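To make the single-GPU claim concrete, here is a rough back-of-the-envelope sketch of how quantization and LoRA shrink memory and trainable-parameter counts. The helper names and the example figures (a 3-billion-parameter model, a 2048 x 2048 projection layer, rank 8) are assumptions for illustration, not numbers from this page, and the estimate covers weights only, ignoring activations, caches, and optimizer state.

```python
def weight_memory_gib(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    return num_params * bits_per_weight / 8 / 2**30


def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one LoRA adapter on a d_in x d_out layer.

    LoRA trains two low-rank matrices (d_in x rank and rank x d_out)
    instead of the full d_in x d_out weight matrix.
    """
    return rank * (d_in + d_out)


if __name__ == "__main__":
    params = 3e9  # assumed model size, for illustration only
    for bits in (16, 8, 4):
        print(f"{bits:>2}-bit weights: {weight_memory_gib(params, bits):.2f} GiB")

    full = 2048 * 2048  # assumed layer shape
    lora = lora_trainable_params(2048, 2048, rank=8)
    print(f"full layer: {full} params, LoRA adapter: {lora} params")
```

Halving the bits per weight halves the weight footprint, and a low-rank adapter trains a small fraction of a layer's parameters, which is why the two techniques are commonly combined on modest hardware.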

About DeepMind

DeepMind is a technology company that specializes in artificial intelligence and machine learning.

Industry: Research Services
Company Size: 501-1000
Location: London, GB

Last updated: September 22, 2025