CLIP
Contrastive Language-Image Pre-training, or CLIP, is an approach to training a model to associate images with their textual descriptions using a Contrastive Loss. This enables strong Zero-Shot Learning, i.e. the model can generalise to new tasks without fine-tuning.
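Concretely, for a batch of N image-text pairs, training maximises the cosine similarity between the embeddings of the N matching pairs while minimising it for the mismatched pairs, using a symmetric cross-entropy over the similarity matrix. A minimal PyTorch sketch of that loss, following the pseudocode in the paper (function and variable names are illustrative, and the temperature value is an assumption):

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss over a batch of paired image/text embeddings.

    Both inputs are [batch, dim]; row i of each tensor comes from the same
    image-caption pair.
    """
    # L2-normalise so the dot product is cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarities scaled by temperature: [batch, batch].
    logits = image_emb @ text_emb.t() / temperature

    # The matching pair sits on the diagonal, so the target for row i is i.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions (image->text and text->image), averaged.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2


# Toy usage with random embeddings standing in for the encoder outputs.
if __name__ == "__main__":
    img = torch.randn(8, 512)
    txt = torch.randn(8, 512)
    print(clip_contrastive_loss(img, txt))
```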
The architecture is a "simplified version of ConVIRT"[^1] trained from scratch.
From the paper *Learning Transferable Visual Models From Natural Language Supervision*.
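Zero-shot classification then amounts to embedding the candidate class names as text (typically with a prompt template such as "a photo of a {label}") and picking the class whose text embedding is most similar to the image embedding. A sketch using the Hugging Face `transformers` implementation of CLIP; the checkpoint name, labels, and image URL are illustrative assumptions, not part of the note:

```python
import requests
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Pretrained CLIP from the Hugging Face hub (checkpoint name is an assumption).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Candidate labels wrapped in a prompt template, as in the paper's zero-shot setup.
labels = ["cat", "dog", "car"]
texts = [f"a photo of a {label}" for label in labels]

# Any RGB image works here; this URL is purely illustrative.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image: scaled image-text similarities; softmax gives per-label probabilities.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```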
[^1]: Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., & Sutskever, I. (2021). Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020. https://arxiv.org/abs/2103.00020