
Meta AI has just released DINOv2, an open-source model that is the first method for training computer vision models using self-supervised learning to achieve results comparable to or better than the standard approaches and models in the field.
The model achieves strong performance without any fine-tuning, making it suitable for a wide variety of computer vision tasks and applications. Thanks to self-supervised training, DINOv2 can learn from any collection of images and can also learn features, such as depth estimation, without explicit training for them.

Figure 1: DINOv2: Self-Supervised Computer Vision Model by Meta AI
1.1. No fine-tuning required
Self-supervised learning is a powerful method for training machine learning models without the need for large amounts of labeled data. DINOv2 models can be trained on any corpus of images, without any associated metadata, hashtags, or captions. Unlike many recent self-supervised learning approaches, DINOv2 models require no fine-tuning, and thus produce high-performance features out of the box for a variety of computer vision applications.
1.2. Overcoming Human Annotation Limitations
Over the past few years, image-text pre-training has become the mainstream approach for many computer vision applications. However, it relies on human-written captions to learn the meaning of images, and such captions often miss information that is not explicitly stated. For example, a human-written caption for a photo of a red table in a yellow room might be "red wooden table". This caption says nothing about the background or about the position and size of the table. The result is a poor understanding of local information and weak performance on tasks that require detailed localization.
Moreover, the need for human labeling and annotation limits the amount of data that can be collected to train a model, which is prohibitive for certain applications. For example, annotating cell images requires a level of human expertise that is not available at the required scale. Using self-supervised training approaches on cell images paves the way for foundational models and, in turn, improved biological discovery. The same applies to other advanced domains, such as animal density estimation.
Moving from DINO to DINOv2 required overcoming several challenges, including:
- Creating a large, curated training dataset
- Improving the training algorithm and implementation
- Designing a functional distillation pipeline

Figure 2: Comparison of segmentation accuracy of DINO v1 and v2
2.1. Creating a large-scale, curated, and diverse image dataset
One of the main steps in building DINOv2 was training a larger architecture to improve model performance. However, larger models need larger datasets to train efficiently. Since no existing dataset was large enough to meet their requirements, the researchers leveraged publicly crawled web data and built a pipeline to select only useful data, similar to LASER.
However, two main tasks are required before these data can be used:
- Balancing the data across different concepts and tasks
- Removing irrelevant images
Since this task cannot be performed manually at scale, the researchers curated a set of seed images from about 25 third-party datasets and automatically retrieved images closely related to those seeds. This approach allowed them to build a curated dataset of 142 million images out of 1.2 billion source images.
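The retrieval step can be sketched as a nearest-neighbor search in an image-embedding space. The following is a toy illustration, not Meta's actual pipeline (which runs large-scale similarity search over embeddings of billions of images); the function name and data are invented:

```python
import numpy as np

def retrieve_similar(seed_embs, pool_embs, k):
    """Return indices of the k pool images most similar to any seed image."""
    # l2-normalize so the dot product equals cosine similarity
    seeds = seed_embs / np.linalg.norm(seed_embs, axis=1, keepdims=True)
    pool = pool_embs / np.linalg.norm(pool_embs, axis=1, keepdims=True)
    # for every pool image, keep its best similarity to the seed set
    best = (pool @ seeds.T).max(axis=1)
    # indices of the k most seed-like pool images
    return np.argsort(-best)[:k]
```

Images from the uncurated pool that score highest against the curated seeds are kept; the rest are discarded, which handles both balancing and the removal of irrelevant images.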
2.2. Improving Algorithms and Techniques
Using larger models and datasets yields better results, but comes with greater challenges. Two of the main ones are potential instability and keeping training tractable. To make training more stable, DINOv2 adds an additional regularization method inspired by the similarity search and classification literature.
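The regularizer the DINOv2 paper uses for this is the KoLeo regularizer, which encourages features within a batch to spread out uniformly rather than collapse together. A minimal numpy sketch of the idea (illustrative only; the real implementation runs on GPU tensors inside the training loop):

```python
import numpy as np

def koleo_regularizer(feats, eps=1e-8):
    """Penalty that grows when features collapse onto each other."""
    # l2-normalize the batch of feature vectors
    x = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + eps)
    # pairwise euclidean distances between normalized features
    d = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)  # ignore self-distances
    # -log of each point's nearest-neighbor distance: large when points cluster
    return -np.log(d.min(axis=1) + eps).mean()
```

Minimizing this term pushes each feature away from its nearest neighbor in the batch, which helps stabilize large-scale training.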
DINOv2’s training process integrates the latest mixed-precision and distributed training implementations provided by the state-of-the-art PyTorch 2. This sped up the implementation considerably: on the same hardware used to train the original DINO model, the code runs around 2x faster with only a third of the memory usage, allowing the data and model sizes to scale.
2.3. Reducing Inference Time Using Model Distillation
Running inference with large models requires powerful hardware, which limits the practical use of the method in many scenarios. To overcome this problem, the researchers used model distillation to compress the knowledge of large models into smaller ones. Using this approach, they condensed their highest-performing architecture into much smaller architectures at a negligible performance cost, yielding the strong ViT-Small, ViT-Base, and ViT-Large models.
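Model distillation itself can be illustrated in a few lines: the small student is trained to match the large teacher's softened output distribution. A generic numpy sketch of a standard distillation loss (not DINOv2's exact objective):

```python
import numpy as np

def softmax(logits, temperature=1.0):
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy between softened teacher and student distributions."""
    p_teacher = softmax(teacher_logits, temperature)
    log_p_student = np.log(softmax(student_logits, temperature) + 1e-12)
    return float(-(p_teacher * log_p_student).sum(axis=-1).mean())
```

The temperature softens both distributions so the student also learns from the teacher's relative confidences, not just its top prediction.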
The training and evaluation code requires PyTorch 2.0 and xFormers 0.0.18, along with several other third-party packages, and a Linux environment. The following steps outline how to set up all required dependencies for training and evaluation.
- Install PyTorch by following the official instructions. Installing PyTorch with CUDA support is recommended.
- Download and install Conda.
- Clone the DINOv2 repository.
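Assuming the repository's standard GitHub location, the clone step looks like:

```shell
# clone the official DINOv2 repository and enter it
git clone https://github.com/facebookresearch/dinov2.git
cd dinov2
```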
- Create and activate a Conda environment named ‘dinov2’ using the provided environment definition.
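The repository ships an environment definition (conda.yaml); following the README's recommended commands:

```shell
# create the 'dinov2' environment from the provided definition and activate it
conda env create -f conda.yaml
conda activate dinov2
```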
- Use the provided requirements.txt file to install the required dependencies for this project.
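With pip, the dependency install is a single command:

```shell
# install the pinned dependencies (PyTorch 2.0, xFormers 0.0.18, etc.)
pip install -r requirements.txt
```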
- Finally, load the pretrained model via PyTorch Hub.
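The pretrained backbones are exposed through PyTorch Hub entrypoints; for example, loading the smallest variant (this downloads the weights on first use):

```python
import torch

# load the ViT-S/14 backbone; larger variants: dinov2_vitb14,
# dinov2_vitl14, dinov2_vitg14
dinov2_vits14 = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
dinov2_vits14.eval()  # switch to inference mode for feature extraction
```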
In conclusion, the release of the DINOv2 model by Meta AI is an important milestone. The self-supervised learning approach used in DINOv2 provides a powerful way to train machine learning models without requiring large amounts of labeled data. Achieving high accuracy without fine-tuning makes these models suitable for a variety of computer vision tasks and applications. Additionally, DINOv2 can learn from any image collection and can learn features such as depth estimation without explicit training. The availability of DINOv2 as an open-source model opens the door for researchers and developers to explore new possibilities in computer vision.
Youssef Rafat is a computer vision researcher and data scientist. His research focuses on developing real-time computer vision algorithms for healthcare applications. He has also worked as a data scientist in marketing, finance, and healthcare for over three years.