CLIP Model Overview: Unlock Multimodal AI Power


There is a lot of hype around LLMs today. Engineers often compare and praise recent models such as ChatGPT, Llama, Gemini, and Mistral. At the same time, developers often overlook the many other impactful models that have also brought considerable success to the machine learning industry.

In this article, I would like to talk about CLIP, one of the most iconic models developed by OpenAI. Released in 2021, CLIP can be used in a variety of settings in NLP or computer vision projects and produces state-of-the-art results on a range of tasks. Many engineers think of CLIP as just an embedding model (which is true), but its applications are very broad.

This article explains the CLIP model in detail, including its architecture and training process, its performance, and its applications.

Contrastive learning

Before discussing the CLIP architecture, let's understand contrastive learning, which plays an essential role in CLIP's design.

The aim of contrastive learning is to train an embedding model so that similar samples are brought closer together in the embedding space and dissimilar samples are pushed further apart.

Contrastive learning framework. The objective is to bring objects of the same class (1 and 2) closer together in the embedding space while pushing them further away from object 3, which belongs to a different class.

Simply put, in contrastive learning the model works with pairs of objects. During training, the model does not know in advance whether the two objects are actually similar. It predicts their distance (similarity) from the computed embeddings, and then the loss function is calculated. Basically, there are two cases:

  • The two objects were similar. The loss value leads to a weight update that adjusts the embeddings so that their similarity is higher next time.
  • The two objects were different. In this case, the model updates its weights so that the similarity between this pair of embeddings is lower next time.
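The two cases can be sketched with a simple per-pair loss. This is only an illustration, assuming a cosine-similarity formulation with a hypothetical `margin` parameter (the same shape as a cosine embedding loss); it is not the loss CLIP itself uses, and the toy vectors are made up.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def pair_loss(emb_a, emb_b, is_similar, margin=0.0):
    """Loss for a single pair of embeddings.

    Similar pair:    the loss shrinks as the similarity approaches 1.
    Dissimilar pair: the loss is non-zero only while the similarity
                     stays above `margin` (an assumed hyperparameter).
    """
    sim = cosine_similarity(emb_a, emb_b)
    if is_similar:
        return 1.0 - sim
    return max(0.0, sim - margin)

# Toy 2-D embeddings, invented for illustration only.
obj1 = np.array([1.0, 0.0])
obj2 = np.array([1.0, 0.0])  # points in the same direction as obj1
obj3 = np.array([0.0, 1.0])  # orthogonal to obj1

loss_similar_close = pair_loss(obj1, obj2, is_similar=True)     # already aligned
loss_similar_far = pair_loss(obj1, obj3, is_similar=True)       # should be pulled together
loss_different_close = pair_loss(obj1, obj2, is_similar=False)  # should be pushed apart
loss_different_far = pair_loss(obj1, obj3, is_similar=False)    # already separated
```

A non-zero loss is what drives the weight update: it appears only when a similar pair is far apart or a dissimilar pair is close together.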

Architecture and Training

The CLIP developers collected a huge dataset of 400M (image, text) pairs. Every image was provided with a text description.

The goal was to construct meaningful embedding representations such that the similarity between them measures how well a given text description matches an image. To that end, the authors took two existing model architectures:

  • Text embedding model
  • Image embedding model

First, the 400M pairs of images and texts were split into batches. Every image and text in each batch was passed through the image or text embedding model, respectively, to generate embeddings. As a result, if there are n pairs in a batch, n embeddings are produced for the images and n for the texts.

After that, a pairwise cosine similarity matrix is built between the image and text embeddings.

Every element on the main diagonal of the pairwise matrix represents the similarity between an image and a text that were originally paired in the batch. Since the text description corresponds well to the image, the similarities on the main diagonal should be maximized.

On the other hand, the off-diagonal elements correspond to images and texts that were not paired together and come from different pairs. Therefore, their similarities should be minimized.
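As a rough sketch of this step, the matrix can be computed by normalizing both sets of embeddings and taking a matrix product. The helper names and the toy embeddings below are assumptions made for illustration; real embeddings would come from the two encoders.

```python
import numpy as np

def l2_normalize(x):
    """Scale each row to unit length so dot products equal cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def pairwise_cosine_similarity(image_emb, text_emb):
    """Return an (n, n) matrix; entry [i, j] compares image i with text j."""
    return l2_normalize(image_emb) @ l2_normalize(text_emb).T

n = 4
# Toy stand-ins for encoder outputs: each text embedding points in the
# same direction as its paired image embedding (ideal, fully trained case).
image_emb = np.eye(n) + 0.1
text_emb = 2.0 * image_emb

sim = pairwise_cosine_similarity(image_emb, text_emb)
matched = np.diag(sim)                    # pairs that belong together
mismatched = sim[~np.eye(n, dtype=bool)]  # every other combination
```

In this idealized batch the diagonal entries equal 1 while the off-diagonal entries stay low, which is the pattern training tries to produce.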

CLIP workflow diagram. Source: Learning Transferable Visual Models From Natural Language Supervision. Image adapted by the author.

The calculated similarities are then passed to a cross-entropy loss function, which is used to perform weight updates for both embedding models.
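A minimal NumPy sketch of such a loss follows, modeled on the symmetric cross-entropy described in the CLIP paper: each image should pick out its own text among all texts in the batch, and each text its own image. The fixed `temperature` value is an assumption (in the paper the scale is a learned parameter), and the function names are illustrative.

```python
import numpy as np

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def clip_style_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over a batch of paired embeddings."""
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = image_emb @ text_emb.T / temperature  # scaled cosine similarities
    idx = np.arange(logits.shape[0])               # correct match for row i is column i
    loss_image_to_text = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_text_to_image = -log_softmax(logits, axis=0)[idx, idx].mean()
    return float((loss_image_to_text + loss_text_to_image) / 2)

# Toy batch: perfectly aligned pairs versus the same texts in the wrong order.
aligned = np.eye(4)
loss_aligned = clip_style_loss(aligned, aligned)
loss_shuffled = clip_style_loss(aligned, np.roll(aligned, 1, axis=0))
```

The aligned batch gives a loss close to zero, while the shuffled batch gives a large loss, so gradients push matched pairs together and mismatched pairs apart.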

Details

The main parameters of CLIP are the embedding models used to encode texts and images:

  • Text is encoded by a Transformer-based model; the CLIP paper bases its architecture on GPT-2.
  • For images, encoding can be performed either with traditional convolutional networks (ResNet) or with Vision Transformer (ViT) models.

Both models are trained from scratch and by default generate embeddings of size 512. Given that the dataset (400M pairs) is large, ViT is usually preferred over ResNet.

Advantages

CLIP has several notable strengths:

  • CLIP can be used for a variety of tasks, not just for embedding generation (examples are in the following section).
  • The zero-shot performance of CLIP is comparable to a simple supervised baseline that uses a linear classifier on top of ResNet features.
  • Computational efficiency: many computations can be run in parallel.

Applications

Embeddings

The most obvious application of CLIP is computing text and image embeddings. The embeddings can be used separately for text or image tasks, for example in similarity search pipelines or RAG systems.

Additionally, text and image embeddings can be used together when an image needs to be associated with its corresponding text description.
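A similarity search over such embeddings can be sketched as a cosine-similarity ranking. The function name `top_k_similar` and the 3-dimensional vectors are hypothetical stand-ins; in practice the vectors would be CLIP embeddings (512-dimensional by default) and a vector index would typically replace the brute-force scan.

```python
import numpy as np

def top_k_similar(query_emb, index_embs, k=2):
    """Return indices and scores of the k rows most cosine-similar to the query."""
    query = query_emb / np.linalg.norm(query_emb)
    index = index_embs / np.linalg.norm(index_embs, axis=1, keepdims=True)
    scores = index @ query               # cosine similarity with every stored item
    order = np.argsort(-scores)[:k]      # highest similarity first
    return order, scores[order]

# Made-up image embeddings standing in for an indexed photo collection.
image_embs = np.array([
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.1],
    [0.7, 0.6, 0.0],
])
# Made-up embedding of a text query; text and images share one space.
text_query_emb = np.array([1.0, 0.0, 0.0])

best_ids, best_scores = top_k_similar(text_query_emb, image_embs, k=2)
```

Because text and image embeddings live in the same space, the same function works for text-to-image, image-to-text, or image-to-image retrieval.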

Image classification

Apart from generating image and text embeddings, one of CLIP's greatest strengths is its ability to solve other tasks in a zero-shot fashion.

For example, take an image classification task. Given an image of an animal and the goal of identifying its class from a list of animals, we can embed every animal name. Then, by finding the text embedding most similar to the image embedding, we can directly identify the animal class.

CLIP can estimate the similarity between an image and class labels to classify the image.

Regarding this recognition method, research has shown that it is better to embed each text (class name) inside a prompt rather than on its own; the CLIP paper uses a template of the form "a photo of a {label}". For other task types, the best prompts may differ.
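The zero-shot procedure can be sketched as follows. The embeddings here are toy stand-ins for encoder outputs, the template "a photo of a {}" is an assumption based on the prompt reported in the CLIP paper, and `scale` is an assumed fixed logit scale; none of the function names belong to an official API.

```python
import numpy as np

def build_prompts(class_names, template="a photo of a {}"):
    """Wrap every class name in a prompt before it is sent to the text encoder."""
    return [template.format(name) for name in class_names]

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def zero_shot_classify(image_emb, text_embs, class_names, scale=100.0):
    """Pick the class whose prompt embedding is most similar to the image."""
    image = image_emb / np.linalg.norm(image_emb)
    texts = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    probs = softmax(scale * (texts @ image))
    return class_names[int(np.argmax(probs))], probs

class_names = ["cat", "dog", "horse"]
prompts = build_prompts(class_names)

# Toy stand-ins: in a real pipeline these come from the text and image encoders.
prompt_embs = np.array([
    [1.0, 0.0, 0.0],  # embedding of prompts[0]
    [0.0, 1.0, 0.0],  # embedding of prompts[1]
    [0.0, 0.0, 1.0],  # embedding of prompts[2]
])
image_emb = np.array([0.2, 0.9, 0.1])

predicted, probs = zero_shot_classify(image_emb, prompt_embs, class_names)
```

No classifier is trained here: changing the label set only requires encoding a new list of prompts.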

OCR

OCR stands for optical character recognition and simply means recognizing text in an image. OCR tasks are usually solved by specially trained supervised models. Nevertheless, CLIP's impressive capabilities also allow it to identify text in images in a zero-shot way.

If you have a list of all possible texts that can appear in the image, you can encode all the options and choose the most similar pair, in the same way as in the previous case. However, the number of all possible words or texts is usually much larger than the typical number of labels in an image classification task. Encoding all of them would be very slow and inefficient. That is why CLIP is rarely used for OCR tasks with long text sequences.

When it comes to OCR, CLIP works much better for short words. For example, it is easy to set up a digit recognition task with CLIP, as there are only 10 classes (each class represents a digit between 0 and 9).
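A quick back-of-the-envelope count shows why the candidate list matters. The sketch below assumes, purely for illustration, that candidates are all strings of a fixed length over a fixed alphabet; the prompt wording for digits is hypothetical.

```python
def count_candidates(alphabet_size, length):
    """Number of distinct strings of a given length over a given alphabet."""
    return alphabet_size ** length

# Digit recognition: one character from an alphabet of 10 digits.
digit_classes = count_candidates(10, 1)

# Arbitrary 5-letter lowercase strings: every one would need its own embedding.
five_letter_strings = count_candidates(26, 5)

# Candidate prompts for the digit task (the template is a hypothetical example).
digit_prompts = [f"a photo of the digit {d}" for d in range(10)]

ratio = five_letter_strings // digit_classes
```

With 10 candidates the text side is encoded once and reused; with millions of candidates the encoding cost alone makes the approach impractical.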

One interesting observation is that zero-shot CLIP achieves only 88% accuracy on the famous MNIST handwritten digit recognition task, while other simple models easily reach 99% accuracy. The point to keep in mind is that, despite CLIP's impressive zero-shot capabilities, there can still be very specific image types on which CLIP has not been trained.

CLIP achieves only 88% accuracy on handwritten digit recognition. Source: MNIST dataset | TensorFlow

Here are some important notes:

CLIP is not good at some abstract tasks, such as counting objects in a photo or estimating how close two objects in an image are to each other.

CLIP's zero-shot performance on standard computer vision tasks is comparable to that of older models trained on datasets such as ImageNet. Still, to beat the best of them, the authors estimate that CLIP would need roughly 1000 times more compute, which is infeasible with current hardware.

Conclusion

In this article, we studied the architectural principles of CLIP. Trained on 400M (image, text) pairs, CLIP achieved state-of-the-art performance on many tasks. Although CLIP often fails on some abstract downstream tasks, it still performs well on other standard computer vision tasks using zero-shot techniques.

Resources

Unless otherwise stated, all images are by the author.


