CLIP

Contrastive Language-Image Pre-training (CLIP) is an approach to training a model to associate images with their textual descriptions using a contrastive loss. This enables strong zero-shot learning, i.e. the model can generalise to new tasks without fine-tuning.
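
Below is a minimal sketch of the symmetric contrastive objective described in the paper, assuming hypothetical `image_features` and `text_features` produced by separate image and text encoders for a batch of matching pairs (names and the fixed temperature are illustrative, not the released implementation):

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features: torch.Tensor,
                          text_features: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    # L2-normalise embeddings so the dot product is a cosine similarity.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Pairwise similarity between every image and every text in the batch,
    # scaled by a temperature (CLIP learns this as a parameter; fixed here).
    logits = image_features @ text_features.t() / temperature

    # Matching image-text pairs lie on the diagonal, so the target for
    # row i (and column i) is class i.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy: classify the correct text for each image
    # and the correct image for each text, then average.
    loss_images = F.cross_entropy(logits, targets)
    loss_texts = F.cross_entropy(logits.t(), targets)
    return (loss_images + loss_texts) / 2
```

Zero-shot classification then amounts to embedding a text prompt for each candidate label (e.g. "a photo of a {label}") and picking the label whose embedding is most similar to the image embedding.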

The architecture is a "simplified version of ConVIRT" [1] trained from scratch.

From the paper "Learning Transferable Visual Models From Natural Language Supervision".


  1. Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., & Sutskever, I. (2021). Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020. https://arxiv.org/abs/2103.00020