MM1

By Apple
New Multimodal Gen
Released: March 14, 2024

Overview

MM1 is Apple Research’s multimodal LLM blueprint: a vision encoder whose patch features are mapped by a lightweight connector into visual tokens that a text decoder reads alongside the prompt, pretrained on a mix of image–caption pairs, interleaved image–text documents, and text-only data. It highlights how data quality, interleaving, and image resolution, not just scale, drive strong OCR, document and chart reasoning, and grounded visual answers.
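
To make that token flow concrete, the sketch below shows one way such a path can look in plain PyTorch. The layer sizes, token counts, and module names are illustrative assumptions for demonstration, not Apple’s implementation: patch features from a vision encoder are pooled and projected into visual tokens, then concatenated in front of the text embeddings before the decoder runs.

# Illustrative sketch only: a minimal vision-to-text token path in the spirit of
# MM1's design (vision encoder -> connector -> decoder). Dimensions and names
# are assumptions for demonstration, not Apple's code.
import torch
import torch.nn as nn

class VisionConnector(nn.Module):
    """Projects patch features from a vision encoder into the decoder's
    embedding space so images become 'visual tokens' the LLM reads."""
    def __init__(self, vision_dim=1024, text_dim=4096, num_visual_tokens=144):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool1d(num_visual_tokens)  # fix the token count
        self.proj = nn.Linear(vision_dim, text_dim)           # map into text space

    def forward(self, patch_features):                        # (B, P, vision_dim)
        x = self.pool(patch_features.transpose(1, 2)).transpose(1, 2)
        return self.proj(x)                                    # (B, T, text_dim)

# Toy forward pass: visual tokens are concatenated in front of the embedded
# prompt tokens, and the decoder then attends over the combined sequence.
connector = VisionConnector()
patches = torch.randn(2, 576, 1024)      # e.g. ViT patch features
text_embeds = torch.randn(2, 32, 4096)   # embedded prompt tokens
decoder_input = torch.cat([connector(patches), text_embeds], dim=1)
print(decoder_input.shape)                # torch.Size([2, 176, 4096])

In the MM1 ablations, the exact connector architecture matters less than the number of visual tokens and the input image resolution, which is why those two knobs appear explicitly above.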

Description

MM1 is Apple’s research program for building capable vision–language models with a transparent, reproducible recipe. A high-quality image encoder connects to a language model through a vision–language connector that turns image features into tokens the decoder consumes alongside text, so the system can “look and reason” in one pass. Rather than chasing size alone, MM1 emphasizes the training mixture: large volumes of image–caption pairs are blended with interleaved sequences where text and images appear together, plus pure text to strengthen language fluency. The work shows that mixture ratios, image resolution, and the presence of interleaved examples often matter more than raw parameter count, especially for reading small text, following layouts, and reasoning over diagrams and charts. With light instruction tuning, MM1 follows multimodal prompts, handles multiple images, and can ground answers in specific visual regions. As a research line, it is meant to clarify which ingredients actually move the needle for practical OCR, document understanding, and visual QA, and to provide a clean foundation others can adapt for assistants, analytics, and developer tools.
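
A minimal sampling sketch of the mixture idea follows. The source names and weights here are placeholders standing in for whatever ratios a given run tunes, not the paper’s exact values.

# Illustrative sketch only: weighted sampling over the three data sources MM1
# blends during pre-training (image-caption pairs, interleaved image-text
# documents, text-only). The weights below are assumed for demonstration.
import random

MIXTURE = {
    "image_caption": 0.45,   # assumed weight
    "interleaved":   0.45,   # assumed weight
    "text_only":     0.10,   # assumed weight
}

def sample_source(mixture=MIXTURE, rng=random):
    """Pick which corpus the next training example is drawn from."""
    sources, weights = zip(*mixture.items())
    return rng.choices(sources, weights=weights, k=1)[0]

# Quick check: counts come out roughly proportional to the mixture weights.
counts = {name: 0 for name in MIXTURE}
for _ in range(10_000):
    counts[sample_source()] += 1
print(counts)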

Last updated: September 22, 2025