Image to text

Models 15

Gen 7

NuMarkdown 8B Thinking

By NuMind

NuMarkdown-8B-Thinking is a reasoning OCR vision-language model fine-tuned from Qwen2.5-VL to convert complex document images into clean Markdown, using intermediate “thinking” tokens to infer layout and tables before generating the final text.

📜OCR 🖼️Image to text

Text

Released 3mo ago
Gen 7 Hunyuan

HunyuanOCR

By Tencent

HunyuanOCR is Tencent Hunyuan’s 1B-parameter end-to-end OCR expert VLM. It reads documents, screenshots, and video frames, handling text detection, recognition, layout parsing, information extraction, subtitles, and photo translation in a single pass, with strong multilingual support and state-of-the-art accuracy.

🏭Manufacturing 🌐Text translation 🔍SEO content 🖼️Image to text

Text

Released 4mo ago
Gen 4

olmOCR

By Ai2

olmOCR is AllenAI’s open-source document recognition pipeline and model family that converts PDFs and images into clean text, preserving reading order, tables, equations, and handwriting.

🏭Manufacturing 🖼️Image to text

Image

Released 5mo ago
Gen 7 LFM

LFM2 VL 3B

By Liquid AI

LFM2-VL-3B is a 3B vision-language model that reads images with text and answers in natural language or structured JSON. It handles OCR, charts, tables, and screenshots with long context and low-latency streaming, making it practical for multimodal RAG and assistants.

📜OCR 🖼️Image to text 🗒Transcription 🖼️Logos

Text

Released 5mo ago
Gen 3 Qianfan

Qianfan-VL-3B

By Baidu

Qianfan-VL-3B is Baidu’s lightweight VLM for cost-sensitive, real-time multimodal apps. It processes images plus text and returns grounded answers with basic OCR and layout understanding, long context, tool/function calling, and JSON outputs—optimized for speed and efficiency.

🏭Manufacturing 🖼️Image to text 🔍Image recognition

Text

Released 6mo ago
Gen 3 Phi

Phi-3-vision

By Gentext Group

Phi-3-Vision is Microsoft’s compact, open-weight multimodal model that understands images + text and answers in text. Optimized for documents, charts, UI screenshots, diagrams, and photos, it delivers strong OCR and visual reasoning in a small footprint suitable for single-GPU or edge deployment.

📷Images 🖼️Image to text ❓Answers 🔍Image analysis

Text

Released 6mo ago
Gen 4 Hailuo

Hailuo 2.3 Fast

By MiniMax

Hailuo 2.3 Fast is a speed-tuned mode that trades a little peak fidelity for much lower latency.

🖼️Image generation 🔍Image upscaling 🖼️Image to text

Image

Released 6mo ago
Gen 3 Command

Command A Vision

By Caldera Labs

Command A Vision is Cohere’s multimodal instruction model that pairs text and image understanding. It accepts images plus text prompts and outputs structured, step-by-step text answers. It’s tuned for enterprise workflows like document OCR, chart/diagram reasoning, screenshot/UI analysis, and tool or function calling.

📜OCR 🖼️Image to text 🔍Image recognition

Text

Released 8mo ago
Gen 4 Grok

Grok Image 2

By xAI

Grok Image 2 is xAI’s fast vision-language model. It reads images with text, handles OCR and layout, explains charts and screenshots, and returns grounded answers or JSON with long context, tool calling, and streaming for real-time multimodal assistants.

🏭Manufacturing 🖼️Image to text

Image

Released 1y ago
Gen 3 Qwen

Qwen 2.5-VL-72B

By Alibaba

Qwen 2.5-VL-72B is Alibaba’s flagship open-weight vision-language model. It takes images (docs, charts, screenshots, photos) plus text and answers in text, with strong OCR, layout understanding, and multi-image reasoning. It supports long context, function/tool calling, and reliable JSON outputs—ideal for multimodal RAG, agents, and enterprise workflows.

🏭Manufacturing 🖼️Image to text 🔍Image recognition

Text

Released 1y ago
Gen 4

Photon 1

By Luma AI

Photon 1 is Luma’s controllable text-to-image model for high-fidelity, photoreal results with solid prompt adherence and identity consistency.

🖼️Image generation 🔍SEO content 🖌️Image editing 🖼️Image to text

Image

Released 1y ago
Gen 3 Mistral

Pixtral Large

By Mistral AI

Pixtral Large is Mistral’s flagship vision-language model. It takes images plus text and returns grounded, step-by-step answers—great for document OCR, charts/diagrams, UI screenshots, and general visual QA—with long-context support, tool/function calling, and reliable JSON outputs.

📜OCR 🖼️Image to text 🗒Transcription 🌐SEO

Text

Released 1y ago
Gen 7 Gemma

PaliGemma

By Google

PaliGemma is Google’s open-weight vision-language model in the Gemma family. It takes images (or screenshots, documents, charts) plus text and answers in text—great for OCR, captioning, VQA, and UI/doc understanding. Lightweight and fine-tunable, it runs on a single GPU and supports quantization for edge deployment.

📜OCR 🖼️Image to text 🔍Image recognition

Text

Released 1y ago
Gen 3 Palmyra

Palmyra Vision

By Writer Engineering

Palmyra Vision is Writer’s multimodal LLM that takes images as input and generates text output. It can extract text from images (including handwriting), interpret charts/graphs/diagrams, classify objects, and answer questions about visual content—all aimed at enterprise workflows.

🖼️Image to text 🔍Image recognition 🖼️Image descriptions

Text

Released 2y ago
Gen 4

AlbedoBaseXL (SDXL)

By Stability AI

AlbedoBaseXL is a neutral SDXL foundation checkpoint favored for fine-tuning and on-brand image generation.

🖼️Image generation 🖌️Image editing 🖼️Image to text

Image

Released 2y ago