
Self-supervised features are central to modern machine learning, yet the datasets they are pre-trained on still typically require significant human effort to collect and curate. Self-supervised learning (SSL) allows models to be trained without human annotation, enabling data and models to scale. However, naive scaling has sometimes led to subpar performance, owing to issues such as the long-tail distribution of concepts in uncurated datasets. Successful SSL applications therefore rely on careful data curation: filtering internet text against high-quality references such as Wikipedia for language models, or balancing visual concepts for image models. This curation improves robustness and downstream-task performance.
Researchers from FAIR at Meta, INRIA, Paris-Saclay University, and Google tackle the problem of automatically curating high-quality datasets for self-supervised pre-training. They propose a clustering-based approach to create large, diverse, and balanced datasets: hierarchical k-means clustering over vast data repositories, followed by balanced sampling from the resulting clusters. Experiments on web images, satellite images, and text show that features trained on these automatically curated datasets outperform features trained on uncurated data, and match or exceed features trained on manually curated data. The approach addresses the challenge of balancing datasets to improve model performance in self-supervised learning.
SSL is crucial in modern machine learning. In natural language processing (NLP), language modeling has evolved from simple neural architectures to large-scale models, driving significant advances in the field. Similarly, SSL in computer vision has evolved from pretext tasks to sophisticated joint embedding architectures, employing techniques such as contrastive learning, clustering, and distillation. High-quality data is essential to train state-of-the-art models. Automated data curation techniques such as hierarchical k-means clustering have been proposed to balance large datasets without the need for labels and improve the performance of SSL models in a variety of domains.
To train a model effectively with self-supervised learning, the pre-training dataset must be large, diverse, and balanced. A balanced dataset represents each concept roughly equally and avoids bias toward dominant concepts. To build such a dataset, a balanced subset is selected from a large online repository, often using clustering techniques such as k-means. However, standard k-means tends to place many clusters on dominant concepts, over-representing them. To address this, the authors apply k-means hierarchically, resampling between levels so that the centroids approach a uniform distribution over the data support. Combined with a balanced sampling strategy over the final clusters, this maintains balance across concept granularities and improves the resulting model's performance.
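The loop described above, k-means clustering followed by uniform per-cluster resampling before clustering again at the next (coarser) level, can be sketched in a few dozen lines of NumPy. This is a minimal illustration under simplifying assumptions (Lloyd's algorithm on raw vectors, a fixed iteration count, toy cluster counts), not the authors' implementation; the function names and parameters such as `per_cluster` are illustrative.

```python
import numpy as np

def kmeans(points, k, n_iter=50, seed=0):
    """Plain k-means (Lloyd's algorithm); returns centroids and assignments."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=-1)
        assign = dists.argmin(axis=1)
        # Recompute centroids; keep the old centroid if a cluster empties.
        for j in range(k):
            members = points[assign == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return centroids, assign

def resample_clusters(assign, k, per_cluster, rng):
    """Draw the same number of points from every non-empty cluster, so the
    next clustering level sees a roughly uniform concept distribution."""
    picks = []
    for j in range(k):
        idx = np.flatnonzero(assign == j)
        if len(idx) == 0:
            continue
        picks.append(rng.choice(idx, size=per_cluster,
                                replace=len(idx) < per_cluster))
    return np.concatenate(picks)

def hierarchical_kmeans(points, ks, per_cluster=20, seed=0):
    """Cluster, rebalance by uniform per-cluster resampling, then cluster
    the rebalanced sample again at the next level. `ks` lists the number
    of clusters per level, fine to coarse."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    current, levels = points, []
    for k in ks:
        centroids, assign = kmeans(current, k, seed=seed)
        levels.append((centroids, assign))
        current = current[resample_clusters(assign, k, per_cluster, rng)]
    return levels
```

Without the resampling step, a second k-means run would again devote most centroids to the dominant concept; rebalancing between levels is what pushes the centroids toward a uniform spread.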
Four experiments were conducted to study the proposed algorithm. First, simulated data was used to illustrate hierarchical k-means, which produced a more uniform cluster distribution than alternative methods. Second, web-based image data was curated into a dataset of 743 million images, on which a ViT-L model was trained and evaluated on various benchmarks, demonstrating improved performance. Third, the algorithm was applied to curating text data for training large-scale language models, yielding significant gains across benchmarks. Finally, satellite imagery was curated for tree canopy height prediction, and the resulting model performed better on all datasets evaluated.
In conclusion, this work presents an automated data curation pipeline that generates large, diverse, and balanced training datasets for self-supervised feature learning. By successively applying k-means clustering and resampling, the method pushes clusters toward a uniform distribution over concepts. Extensive experiments show that the pipeline enhances feature learning across web images, satellite images, and text data. The curated dataset outperforms raw data and ImageNet1k on robustness, though it lags slightly behind the rigorously curated ImageNet22k on certain benchmarks. The approach underscores the importance of data curation in self-supervised learning and positions hierarchical k-means as a valuable tool for a variety of data-dependent tasks. Future work should address dataset quality, reliance on pre-trained features, and scalability. Automatic dataset creation poses risks such as reinforcing bias and privacy violations, mitigated here by blurring human faces and balancing concepts.
Check out the paper. All credit for this research goes to the researchers of this project.

Sana Hassan, a Consulting Intern at Marktechpost and a dual degree student at Indian Institute of Technology Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, she brings a fresh perspective to the intersection of AI and real-world solutions.
