

CLIP is the first multimodal (in this case, vision and text) model tackling computer vision and was recently released by OpenAI on January 5, 2021. From the OpenAI CLIP repository: "CLIP (Contrastive Language-Image Pre-Training) is a neural network trained on a variety of (image, text) pairs. It can be instructed in natural language to predict the most relevant text snippet, given an image, without directly optimizing for the task, similarly to the zero-shot capabilities of GPT-2 and 3."

Depending on your background, this may make sense, but there is a lot in here that may be unfamiliar to you. An (image, text) pair might be a picture and its caption, and CLIP is trained on 400,000,000 such (image, text) pairs.
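To make the "predict the most relevant text snippet, given an image" idea concrete, here is a minimal sketch using the openai/CLIP Python package (installable from the GitHub repository). The image path and the candidate captions are placeholders for illustration, not part of the original post.

```python
import torch
import clip
from PIL import Image

# Load a pretrained CLIP model and its matching image preprocessing pipeline
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# A hypothetical image and a few candidate captions (text snippets)
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
captions = ["a photo of a dog", "a photo of a cat", "a diagram"]
text = clip.tokenize(captions).to(device)

# Score the image against each caption; higher logits mean more relevant
with torch.no_grad():
    logits_per_image, _ = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

# Pick the most relevant caption. This is zero-shot: no task-specific training
best = probs[0].argmax()
print(f"Most relevant caption: {captions[best]!r} (p={probs[0][best]:.3f})")
```

Nothing here was fine-tuned for the captions above; the pretrained model simply ranks whichever text snippets you provide, which is what "without directly optimizing for the task" refers to.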
