Overview
MM1 is Apple Research’s multimodal LLM blueprint: an image encoder whose visual tokens are fed through a lightweight vision–language connector into a decoder-only language model, pretrained on a careful mix of image–caption, interleaved image–text, and text-only data. It highlights how data quality, interleaving, and image resolution, not just scale, drive strong OCR, document and chart reasoning, and grounded visual answers.
Description
MM1 is Apple’s research program for building capable vision–language models with a transparent, reproducible recipe. A high-quality image encoder feeds the language model through a lightweight vision–language connector that turns image features into tokens the decoder consumes alongside text, so the system can “look and reason” in one pass. Rather than chasing size alone, MM1 emphasizes the training mixture: large volumes of image–caption pairs are blended with interleaved sequences where text and images appear together, plus text-only data to preserve language fluency. The work shows that mixture ratios, image resolution, and the presence of interleaved examples often matter more than raw parameter count, especially for reading small text, following layouts, and reasoning over diagrams and charts. With light instruction tuning, MM1 follows multimodal prompts, handles multiple images, and can ground answers in specific visual regions. As a research line, it is meant to clarify which ingredients actually move the needle for practical OCR, document understanding, and visual QA, and to provide a clean foundation others can adapt for assistants, analytics, and developer tools.
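To make the recipe concrete, below is a minimal PyTorch-style sketch of the idea described above: a connector pools and projects vision-encoder patch features into the decoder's token space, the resulting image tokens are spliced into the text sequence, and pretraining batches are drawn from a weighted caption/interleaved/text-only mixture. All module names, dimensions, and the mixture weights are illustrative assumptions, not Apple's implementation.

# Hypothetical sketch of an MM1-style connector-based multimodal decoder.
# Names, sizes, and mixture weights are assumptions for illustration only.
import random
import torch
import torch.nn as nn

class VisionLanguageConnector(nn.Module):
    """Pools vision-encoder patch features and projects them into the decoder's token space."""
    def __init__(self, vision_dim: int, text_dim: int, num_visual_tokens: int):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool1d(num_visual_tokens)  # reduce the patch count to k tokens
        self.proj = nn.Linear(vision_dim, text_dim)           # match the text embedding width

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        # patch_feats: (batch, num_patches, vision_dim)
        pooled = self.pool(patch_feats.transpose(1, 2)).transpose(1, 2)
        return self.proj(pooled)  # (batch, num_visual_tokens, text_dim)

class ToyMultimodalDecoder(nn.Module):
    """Image tokens are spliced in front of the text tokens and decoded jointly."""
    def __init__(self, vocab=32000, text_dim=512, vision_dim=768, k=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, text_dim)
        self.connector = VisionLanguageConnector(vision_dim, text_dim, k)
        layer = nn.TransformerEncoderLayer(text_dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)  # causal masking omitted for brevity
        self.lm_head = nn.Linear(text_dim, vocab)

    def forward(self, text_ids: torch.Tensor, patch_feats: torch.Tensor) -> torch.Tensor:
        visual_tokens = self.connector(patch_feats)
        text_tokens = self.embed(text_ids)
        sequence = torch.cat([visual_tokens, text_tokens], dim=1)  # image tokens join the text sequence
        return self.lm_head(self.backbone(sequence))

# Pretraining mixture: pick each batch's data source by weight (illustrative ratios).
MIXTURE = {"image_caption": 0.45, "interleaved_image_text": 0.45, "text_only": 0.10}

def sample_source(rng: random.Random) -> str:
    return rng.choices(list(MIXTURE), weights=list(MIXTURE.values()), k=1)[0]

if __name__ == "__main__":
    model = ToyMultimodalDecoder()
    logits = model(torch.randint(0, 32000, (2, 16)), torch.randn(2, 196, 768))
    print(logits.shape, sample_source(random.Random(0)))

The sketch only illustrates the structural choice the description emphasizes: image features become ordinary tokens in the decoder's input, so mixture ratios, image resolution, and the number of visual tokens become the main levers rather than model size alone.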