Learning Transferable Visual Models From Natural Language Supervision

Notes from paper Learning Transferable Visual Models From Natural Language Supervision by Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever (OpenAI).

Overview

This paper introduces CLIP, a model for Zero-Shot Learning. The model is trained with a Contrastive Loss objective to associate images with their correct text descriptions, which allows it to generalise to new tasks without fine-tuning on those tasks.
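The training objective can be sketched in code. Below is a minimal PyTorch sketch of a symmetric contrastive loss over a batch of N aligned (image, text) pairs, mirroring the pseudocode in the paper; the function name and the treatment of the temperature as a log-scale scalar are assumptions of this sketch, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          logit_scale: torch.Tensor) -> torch.Tensor:
    """Symmetric contrastive loss over a batch of N aligned (image, text) pairs.

    image_emb, text_emb: [N, D] outputs of the image and text encoders.
    logit_scale: learned temperature stored on a log scale (an assumption here).
    """
    # L2-normalise so that dot products are cosine similarities
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # [N, N] similarity matrix; the diagonal holds the correct pairings
    logits_per_image = logit_scale.exp() * image_emb @ text_emb.t()
    logits_per_text = logits_per_image.t()

    targets = torch.arange(image_emb.size(0), device=image_emb.device)

    # Cross-entropy in both directions: image -> text and text -> image
    loss_i = F.cross_entropy(logits_per_image, targets)
    loss_t = F.cross_entropy(logits_per_text, targets)
    return (loss_i + loss_t) / 2

# Example with random embeddings standing in for encoder outputs
if __name__ == "__main__":
    image_emb = torch.randn(8, 512)
    text_emb = torch.randn(8, 512)
    logit_scale = torch.tensor(2.659)  # log(1 / 0.07), the initialisation used in the paper
    print(clip_contrastive_loss(image_emb, text_emb, logit_scale))
```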

They benchmark CLIP on over 30 different existing computer vision datasets, across tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification.

The code and pre-trained model weights are available at https://github.com/OpenAI/CLIP.
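The repository's README demonstrates usage roughly along these lines (the `clip` package, the "ViT-B/32" model tag, and the example filename follow the published examples; exact details may differ between releases):

```python
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("CLIP.png")).unsqueeze(0).to(device)
text = clip.tokenize(["a diagram", "a dog", "a cat"]).to(device)

with torch.no_grad():
    # Similarity logits between the image and each candidate caption
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print(probs)  # the highest probability marks the best-matching caption
```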

Figure 1. CLIP overview

Related Papers

Unsupervised Pre-training in NLP

Many papers have achieved good results with Unsupervised Pre-training in NLP, using task objectives like Masked Language Modelling.

Text-to-text models

Text-to-text models enabled task-agnostic architectures that can zero-shot transfer to downstream datasets without fine-tuning.

General purpose LLMs

And powerful general-purpose models like GPT-3 can perform many tasks without task-specific training data.

These results suggest that the aggregate supervision accessible to modern pre-training methods within web-scale collections of text surpasses that of high-quality crowd-labeled NLP datasets. In computer vision, however, pre-training on web-scale data was not yet common practice; most models were still pre-trained on crowd-labelled datasets like ImageNet.

Early Exploration of Content-Based Image Retrieval

Over 20 years ago, Mori et al. (1999) explored ways to improve content-based image retrieval by training models to predict nouns and adjectives from text documents paired with images.

Learning Image Representations via Manifold Learning

Quattoni et al. (2007) demonstrated that [[../../../permanent/manifold-learning]] in the weight space of classifiers could result in more data-efficient image representations by predicting words in captions associated with images.

Multimodal Deep Boltzmann Machines

Srivastava & Salakhutdinov (2012) explored deep representation learning by training multimodal Deep Boltzmann Machines on low-level image and text tag features.

Modernizing Image Representation Learning

Joulin et al. (2016) modernized image representation learning by training CNNs to predict words in image captions, showing that these models learn useful image representations. They used the YFCC100M dataset (Thomee et al., 2016) to perform a bag-of-words multi-label classification task, and pre-trained AlexNet (Krizhevsky et al., 2012) on this data. They showed performance similar to ImageNet-based pre-training.
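A rough sketch of that kind of setup: treat the most frequent caption words as a multi-hot target and train a CNN with a per-word binary cross-entropy loss. The vocabulary size, backbone call, and training-step wrapper below are illustrative, not Joulin et al.'s exact configuration.

```python
import torch
import torch.nn as nn
import torchvision.models as models

VOCAB_SIZE = 10_000  # illustrative: keep the K most frequent caption words as targets

# AlexNet backbone, mirroring the original setup in spirit; any CNN would do
backbone = models.alexnet()
backbone.classifier[-1] = nn.Linear(4096, VOCAB_SIZE)  # one logit per vocabulary word

criterion = nn.BCEWithLogitsLoss()  # independent sigmoid per word = multi-label

def training_step(images: torch.Tensor, word_targets: torch.Tensor) -> torch.Tensor:
    """images: [B, 3, 224, 224] batch; word_targets: [B, VOCAB_SIZE] multi-hot
    vectors marking which vocabulary words appear in each image's caption."""
    logits = backbone(images)
    return criterion(logits, word_targets.float())
```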

Extending to Phrase N-Grams

Li et al. (2017) extended the approach to predict phrase n-grams, demonstrating the system's ability to zero-shot transfer to other image classification datasets by scoring target classes based on their learned visual n-grams and predicting the highest-scoring one.

Transformer-Based Approaches to Vision

Recent studies, such as VirTex (Desai & Johnson, 2020), ICMLM (Bulent Sariyildiz et al., 2020), and ConVIRT (Zhang et al., 2020), have demonstrated the potential of transformer-based language modeling, masked language modeling, and contrastive objectives to learn image representations from text.

Limitations of Natural Language Supervision for Image Representation

While natural language supervision for image representation learning is an exciting proof of concept, it remains rare due to lower performance on common benchmarks. For example, Li et al. (2017) achieved only 11.5% accuracy on ImageNet in a zero-shot setting, much lower than the current state of the art of 88.4% accuracy (Xie et al., 2020). Even classic computer vision methods (Deng et al., 2012) achieve 50% accuracy, highlighting the gap.

Narrowly Scoped Approaches with Weak Supervision

More focused uses of weak supervision have significantly improved performance. Mahajan et al. (2018) demonstrated that predicting ImageNet-related hashtags from Instagram images is an effective pre-training task. When fine-tuned on ImageNet, models pre-trained this way increased accuracy by over 5% and improved the overall state of the art.

Broader Transfer Learning Gains with Noisy Label Pre-Training

Kolesnikov et al. (2019) and Dosovitskiy et al. (2020) also demonstrated substantial improvements on broader transfer benchmarks by pre-training models to predict classes in the noisily labeled JFT-300M dataset.

A crucial difference between these weakly supervised models and recent explorations of learning image representations directly from natural language is scale.

While Mahajan et al. (2018) and Kolesnikov et al. (2019) trained their models for accelerator years on millions to billions of images, VirTex, ICMLM, and ConVIRT trained for accelerator days on one to two hundred thousand images.


In this work, they close this gap and study the behaviors of image classifiers trained with natural language supervision at large scale.

Enabled by the large amounts of publicly available data of this form on the internet, they create a new dataset of 400 million (image, text) pairs and demonstrate that a simplified version of ConVIRT trained from scratch, which they call CLIP (Contrastive Language-Image Pre-training), is an efficient method of learning from natural language supervision.

They study the scalability of CLIP by training a series of eight models spanning almost 2 orders of magnitude of compute and observe that transfer performance is a smoothly predictable function of compute.

They find that CLIP, like the GPT family, learns to perform a wide set of tasks during pre-training, including OCR, geo-localization, action recognition, and many others.
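To illustrate how this zero-shot task transfer works at test time, here is a hedged sketch: each class name is wrapped in a natural-language prompt, encoded with the text encoder, and the class whose embedding is most similar to the image embedding is predicted. The class names, filename, and prompt template here are illustrative.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

class_names = ["airplane", "bird", "dog"]              # any dataset's label set
prompts = [f"a photo of a {c}" for c in class_names]   # prompt template from the paper

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(prompts).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    # Cosine similarity between the image and each class prompt
    similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)

prediction = class_names[similarity.argmax().item()]
print(prediction)
```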

They also confirm these findings with linear-probe representation learning analysis and show that CLIP outperforms the best publicly available ImageNet model while also being more computationally efficient.
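A hedged sketch of the linear-probe protocol: freeze the image encoder, extract features for a labelled dataset, and fit a logistic regression classifier on the frozen features. CIFAR-100 and the regularisation strength below are illustrative choices (the paper sweeps the L2 regularisation strength on a validation split).

```python
import numpy as np
import torch
import clip
from sklearn.linear_model import LogisticRegression
from torch.utils.data import DataLoader
from torchvision.datasets import CIFAR100

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# CIFAR-100 stands in for any probe dataset; the frozen encoder supplies the features
train_set = CIFAR100(root=".", download=True, train=True, transform=preprocess)
test_set = CIFAR100(root=".", download=True, train=False, transform=preprocess)

def extract_features(dataset):
    """Run the frozen image encoder over a dataset of (image, label) pairs."""
    feats, labels = [], []
    with torch.no_grad():
        for images, targets in DataLoader(dataset, batch_size=256):
            feats.append(model.encode_image(images.to(device)).cpu().numpy())
            labels.append(targets.numpy())
    return np.concatenate(feats), np.concatenate(labels)

X_train, y_train = extract_features(train_set)
X_test, y_test = extract_features(test_set)

# Linear probe: logistic regression on frozen features; C is illustrative
probe = LogisticRegression(C=1.0, max_iter=1000)
probe.fit(X_train, y_train)
print("linear-probe accuracy:", probe.score(X_test, y_test))
```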

They find that zero-shot CLIP models are much more robust than supervised ImageNet models of equivalent accuracy, which suggests that zero-shot evaluation of task-agnostic models is much more representative of a model's capability.