Overview
MM1 is Apple Research’s multimodal LLM blueprint: an image encoder whose visual tokens are fed through a lightweight vision–language connector into a decoder-only language model, pretrained on a careful mix of image–caption, interleaved image–text, and text-only data. It highlights how data quality, interleaving, and image resolution, not just scale, drive strong OCR, document and chart reasoning, and grounded visual answers.
Description
MM1 is Apple’s research program for building capable vision–language models with a transparent, reproducible recipe. A high-quality image encoder feeds the language model through a lightweight vision–language connector that turns image features into tokens the decoder consumes alongside text, so the system can “look and reason” in one pass. Rather than chasing size alone, MM1 emphasizes the training mixture: large volumes of image–caption pairs are blended with interleaved sequences where text and images appear together, plus text-only data to preserve language fluency. The work shows that mixture ratios, image resolution, and the presence of interleaved examples often matter more than raw parameter count, especially for reading small text, following layouts, and reasoning over diagrams and charts. With light instruction tuning, MM1 follows multimodal prompts, handles multiple images, and can ground answers in specific visual regions. As a research line, it is meant to clarify which ingredients actually move the needle for practical OCR, document understanding, and visual QA, and to provide a clean foundation others can adapt for assistants, analytics, and developer tools.
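To make the recipe concrete, below is a minimal PyTorch-style sketch of the idea described above: a connector pools and projects vision-encoder patch features into the decoder's token space, the resulting image tokens are spliced into the text sequence, and pretraining batches are drawn from a weighted caption/interleaved/text-only mixture. All module names, dimensions, and the mixture weights are illustrative assumptions, not Apple's implementation.

# Hypothetical sketch of an MM1-style connector-based multimodal decoder.
# Names, sizes, and mixture weights are assumptions for illustration only.
import random
import torch
import torch.nn as nn

class VisionLanguageConnector(nn.Module):
    """Pools vision-encoder patch features and projects them into the decoder's token space."""
    def __init__(self, vision_dim: int, text_dim: int, num_visual_tokens: int):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool1d(num_visual_tokens)  # reduce the patch count to k tokens
        self.proj = nn.Linear(vision_dim, text_dim)           # match the text embedding width

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        # patch_feats: (batch, num_patches, vision_dim)
        pooled = self.pool(patch_feats.transpose(1, 2)).transpose(1, 2)
        return self.proj(pooled)  # (batch, num_visual_tokens, text_dim)

class ToyMultimodalDecoder(nn.Module):
    """Image tokens are spliced in front of the text tokens and decoded jointly."""
    def __init__(self, vocab=32000, text_dim=512, vision_dim=768, k=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, text_dim)
        self.connector = VisionLanguageConnector(vision_dim, text_dim, k)
        layer = nn.TransformerEncoderLayer(text_dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)  # causal masking omitted for brevity
        self.lm_head = nn.Linear(text_dim, vocab)

    def forward(self, text_ids: torch.Tensor, patch_feats: torch.Tensor) -> torch.Tensor:
        visual_tokens = self.connector(patch_feats)
        text_tokens = self.embed(text_ids)
        sequence = torch.cat([visual_tokens, text_tokens], dim=1)  # image tokens join the text sequence
        return self.lm_head(self.backbone(sequence))

# Pretraining mixture: pick each batch's data source by weight (illustrative ratios).
MIXTURE = {"image_caption": 0.45, "interleaved_image_text": 0.45, "text_only": 0.10}

def sample_source(rng: random.Random) -> str:
    return rng.choices(list(MIXTURE), weights=list(MIXTURE.values()), k=1)[0]

if __name__ == "__main__":
    model = ToyMultimodalDecoder()
    logits = model(torch.randint(0, 32000, (2, 16)), torch.randn(2, 196, 768))
    print(logits.shape, sample_source(random.Random(0)))

The sketch only illustrates the structural choice the description emphasizes: image features become ordinary tokens in the decoder's input, so mixture ratios, image resolution, and the number of visual tokens become the main levers rather than model size alone.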