ViT | AI Model

Overview

ViT (Vision Transformer) is an image recognition model that treats an image as a sequence of patches and processes that sequence with a Transformer encoder. It scales well with data and compute, transfers well to new tasks when fine-tuned, and serves as a strong backbone for vision systems.

Description

ViT splits an image into fixed-size, non-overlapping patches, flattens each patch and linearly projects it to a token embedding, adds positional embeddings, and then processes the token sequence with a standard Transformer encoder. A learnable class token prepended to the sequence summarizes it for classification, and the same backbone can be adapted to detection or segmentation by attaching task-specific heads. The appeal is simplicity and scale: few image-specific inductive biases beyond the patch split, strong performance when pretrained on large datasets, and effective fine-tuning across domains. ViT variants differ in depth and width (for example Base, Large, and Huge) to match hardware budgets, and they train well with modern augmentation and regularization. In practice, ViT is a common backbone for vision tasks that benefit from long-range context and transfer from large-scale pretraining.
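
The patch-to-token step described above can be illustrated with a minimal sketch in plain Python. It is not the reference implementation: the function names (patchify, linear, embed_image) are hypothetical, the weights are random stand-ins for learned parameters, and the Transformer encoder itself is omitted.

```python
import random


def patchify(image, patch_size):
    """Split an H x W x C image (nested lists) into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            flat = []
            for row in range(top, top + patch_size):
                for col in range(left, left + patch_size):
                    flat.extend(image[row][col])  # append every channel value
            patches.append(flat)
    return patches


def linear(vector, weights):
    """Project a vector with an (in_dim x out_dim) weight matrix."""
    out_dim = len(weights[0])
    return [sum(v * w_row[j] for v, w_row in zip(vector, weights))
            for j in range(out_dim)]


def embed_image(image, patch_size, dim, seed=0):
    """Return the token sequence a Transformer encoder would consume."""
    rng = random.Random(seed)
    patches = patchify(image, patch_size)
    in_dim = len(patches[0])
    # Assumption: random values stand in for parameters that would be learned.
    weights = [[rng.gauss(0, 0.02) for _ in range(dim)] for _ in range(in_dim)]
    class_token = [rng.gauss(0, 0.02) for _ in range(dim)]
    # Prepend the class token, then project each flattened patch.
    tokens = [class_token] + [linear(p, weights) for p in patches]
    # One positional embedding per token, added element-wise.
    positions = [[rng.gauss(0, 0.02) for _ in range(dim)] for _ in tokens]
    return [[t + p for t, p in zip(tok, pos)]
            for tok, pos in zip(tokens, positions)]


# Toy 8 x 8 RGB image split into 4 x 4 patches: 4 patch tokens + 1 class token.
image = [[[0.5, 0.5, 0.5] for _ in range(8)] for _ in range(8)]
tokens = embed_image(image, patch_size=4, dim=6)
print(len(tokens), len(tokens[0]))  # 5 6
```

The same arithmetic applied to a 224 x 224 image with 16 x 16 patches gives 14 x 14 = 196 patch tokens, or 197 tokens once the class token is included.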

About Google AI

At Google, we think that AI can meaningfully improve people's lives and that the biggest impact will come when everyone can access it.

Industry: Research

Company Size: 501-1000

Location: Mountain View, CA, US

Website: ai.google

Related Models

Last updated: October 8, 2025

Magika 1.0

Seedream 3.0

AlbedoBaseXL (SDXL)
