ViT

Released: October 22, 2020

Overview

ViT is the Vision Transformer, a model that treats an image as a sequence of patches and applies a Transformer encoder for recognition. It scales well with data and compute, transfers cleanly to new tasks, and serves as a strong backbone for vision systems.

Description

ViT splits an image into fixed-size patches, flattens each patch, and linearly projects it to a token embedding. A learnable class token is prepended to the sequence, position embeddings are added, and the result is processed by a standard Transformer encoder. The encoder's output for the class token serves as the image representation for classification, and the same backbone adapts to detection or segmentation through lightweight heads. The appeal is simplicity and scale: the encoder relies on self-attention rather than convolutions, performs strongly when pretrained on large datasets, and fine-tunes reliably across domains. ViT variants differ in depth and width to match hardware budgets, and they train well with modern augmentation and regularization. In practice, ViT is a go-to backbone for vision tasks that benefit from long-range context, clean transfer, and stable optimization.
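
The patch-to-token step described above can be sketched in a few lines. This is a minimal illustration in plain Python, not the reference implementation: the function names `patchify` and `embed`, the toy sizes, and the random initial values are all assumptions chosen for readability, and the projection omits the bias term. The Transformer encoder itself is left out.

```python
import random

def patchify(image, patch):
    """Split an H x W x C image (nested lists) into flattened patches, row-major."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be divisible by the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat = []
            for y in range(top, top + patch):
                for x in range(left, left + patch):
                    flat.extend(image[y][x])  # append the C channel values
            patches.append(flat)
    return patches

def embed(patches, weight, cls_token, pos_embed):
    """Project patches to tokens, prepend the class token, add position embeddings."""
    dim = len(cls_token)
    tokens = [list(cls_token)]  # class token sits at position 0
    for flat in patches:
        # Linear projection without bias: (P*P*C,) times (P*P*C, dim) -> (dim,)
        tokens.append([sum(v * weight[i][d] for i, v in enumerate(flat))
                       for d in range(dim)])
    return [[t + p for t, p in zip(tok, pos)]
            for tok, pos in zip(tokens, pos_embed)]

# Toy sizes (hypothetical): an 8x8 RGB image, 4x4 patches, 6-dimensional tokens.
rng = random.Random(0)
H, W, C, P, D = 8, 8, 3, 4, 6
image = [[[rng.random() for _ in range(C)] for _ in range(W)] for _ in range(H)]

patches = patchify(image, P)
weight = [[rng.gauss(0, 0.02) for _ in range(D)] for _ in range(P * P * C)]
cls_token = [0.0] * D
pos_embed = [[rng.gauss(0, 0.02) for _ in range(D)]
             for _ in range(len(patches) + 1)]

tokens = embed(patches, weight, cls_token, pos_embed)
print(len(patches), len(patches[0]), len(tokens), len(tokens[0]))  # 4 48 5 6
```

The sequence length is (H / P) × (W / P) patch tokens plus one class token. With a 224×224 input and 16×16 patches, the same arithmetic gives 196 patch tokens, so the encoder sees 197 tokens.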

About Google AI

At Google, we think that AI can meaningfully improve people's lives and that the biggest impact will come when everyone can access it.

Industry: Research
Company Size: 501-1000
Location: Mountain View, CA, US
Website: ai.google

Last updated: October 8, 2025