An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place. We show that this reliance on CNNs is not necessary and a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks. When pre-trained on large amounts of data and transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.
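The paper's core idea is to treat an image as a sequence of fixed-size patches ("16x16 words") that a standard Transformer can consume. A minimal sketch of that patch-extraction step, using NumPy (the function name `image_to_patches` is illustrative, not the paper's code; ViT additionally applies a learned linear projection and position embeddings to these tokens):

```python
import numpy as np

def image_to_patches(image, patch_size=16):
    """Split an (H, W, C) image into flattened patch tokens.

    Returns an array of shape (num_patches, patch_size * patch_size * C):
    one token per non-overlapping patch, row-major order.
    """
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0, "image must tile evenly"
    # Reshape into a grid of patches, then bring the two grid axes together.
    patches = image.reshape(h // patch_size, patch_size,
                            w // patch_size, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(-1, patch_size * patch_size * c)

tokens = image_to_patches(np.zeros((224, 224, 3)))
print(tokens.shape)  # (196, 768): a 14x14 grid of patches, each 16*16*3 = 768 values
```

For a 224x224 RGB input with 16x16 patches this yields 196 tokens of dimension 768, which is the sequence length and token size the ViT-Base configuration operates on before the patch-embedding projection.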


Results from the Paper

TASK | DATASET | MODEL | METRIC | VALUE | GLOBAL RANK | PARAMS | PARAMS RANK
Image Classification | CIFAR-10 | ViT-L/16 | Percentage correct | 99.42 | #2 | 307M | #9
Image Classification | CIFAR-10 | ViT-H/14 | Percentage correct | 99.5 | #1 | 632M | #10
Image Classification | ImageNet | ViT-H/14 | Top 1 Accuracy | 88.36% | #3 | 632M | #3
Image Classification | ImageNet | ViT-L/16 | Top 1 Accuracy | 87.61% | #4 | 307M | #7
Image Classification | ImageNet ReaL | ViT-L/16 | Accuracy | 90.24% | #2 | 307M | #1
Image Classification | ImageNet ReaL | ViT-H/14 | Accuracy | 90.77% | #1 | 632M | #2
Fine-Grained Image Classification | Oxford 102 Flowers | ViT-L/16 | Accuracy | 99.74% | #1 | 307M | #6
Fine-Grained Image Classification | Oxford 102 Flowers | ViT-H/14 | Accuracy | 99.68% | #2 | 632M | #7
Fine-Grained Image Classification | Oxford-IIIT Pets | ViT-H/14 | Accuracy | 97.56% | #1 | 632M | #7
Fine-Grained Image Classification | Oxford-IIIT Pets | ViT-L/16 | Accuracy | 97.32% | #2 | 307M | #6
Image Classification | VTAB-1k | ViT-H/14 | Top-1 Accuracy | 77.16 | #1 | 632M | #3
Image Classification | VTAB-1k | ViT-L/16 | Top-1 Accuracy | 75.91 | #3 | 307M | #2

Methods used in the Paper