Learning representations of visual data that support deeper understanding has been a longstanding goal. Earlier methods used generative pre-training to initialize deep networks for subsequent recognition tasks, for example with deep belief networks and denoising autoencoders. Since a generative model can only produce new samples by approximately modeling the data distribution, the intuition, in the Feynman tradition of "what I cannot create, I do not understand," is that such modeling captures the structure of visual data needed for recognition, and that semantic understanding should eventually follow.
This view has been borne out in language: generative language models such as the Generative Pre-trained Transformer (GPT) acquire a deep understanding of language and a vast knowledge base, succeeding both as few-shot learners and as pre-trained foundation models. In vision, however, generative pre-training has fallen out of favor in recent research. For example, GAN-based BiGAN and the autoregressive iGPT use roughly 10× more parameters than contemporaneous self-supervised methods yet perform significantly below them. The divergent focus of the two paradigms is part of the difficulty: generative models must allocate capacity to low-level, high-frequency detail, whereas recognition models concentrate on the high-level, low-frequency structure of images.
Given this disparity, it remains to be determined whether and how generative pre-training, despite its intuitive appeal, can compete with other self-supervised algorithms on downstream recognition tasks. Denoising diffusion models have recently come to dominate image generation. These models use a simple recipe of iteratively refining noisy data (Fig. 1), and the resulting images are of surprisingly high quality; even better, they cover a wide variety of distinct samples. In light of this progress, the researchers explore generative pre-training in the setting of diffusion models. As a first step, they fine-tune a pre-trained diffusion model directly for ImageNet classification.
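For readers unfamiliar with the mechanics behind "iteratively improving noisy data," here is a minimal sketch of a standard DDPM-style training step that such diffusion models build on. This is generic background, not the authors' code; `model` and the noise schedule `alphas_cumprod` are placeholders.

```python
import torch

def ddpm_training_step(model, x0, alphas_cumprod):
    """One DDPM-style training step: corrupt a clean image x0 with Gaussian
    noise at a random timestep t, then train the model to predict that noise."""
    b = x0.shape[0]
    t = torch.randint(0, len(alphas_cumprod), (b,), device=x0.device)
    a_bar = alphas_cumprod.to(x0.device)[t].view(b, 1, 1, 1)  # cumulative noise schedule
    eps = torch.randn_like(x0)                                # Gaussian noise
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps        # noisy image at step t
    eps_pred = model(x_t, t)                                  # network predicts the noise
    return torch.nn.functional.mse_loss(eps_pred, eps)        # simple L2 objective
```

At sampling time the learned network is applied repeatedly, starting from pure noise and stepping toward progressively cleaner images, which is the iterative refinement referred to above.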
Although pre-trained diffusion models excel at unconditional image generation, they lag behind parallel self-supervised pre-training algorithms such as Masked Autoencoders (MAE) on recognition: compared to training the same architecture from scratch, the pre-trained diffusion model yields only a modest improvement in classification. Taking inspiration from MAE, researchers from Meta, Johns Hopkins University, and UCSC incorporate masking into the diffusion model and recast it as a masked autoencoder (DiffMAE). They formulate the masked prediction task as a conditional generation objective: estimating the pixel distribution of the masked region conditioned on the visible region. MAE has already shown that learning to regress the pixels of masked patches given the visible patches yields strong discriminative performance, and DiffMAE builds on this recipe.
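The sketch below illustrates how such a masked conditional-generation objective could look in code: noise is applied only to the masked patches, the encoder sees only the clean visible patches, and the decoder predicts the clean masked pixels conditioned on the visible context. This is a paraphrase of the idea described above, not the authors' implementation; the `encoder`/`decoder` signatures, patch layout, and mask ratio are assumptions.

```python
import torch

def diffmae_style_training_step(encoder, decoder, patches, alphas_cumprod, mask_ratio=0.75):
    """Masked conditional generation: denoise the *masked* patches while
    conditioning on the clean *visible* patches."""
    b, n, d = patches.shape                                  # (batch, num_patches, patch_dim)
    n_vis = int(n * (1 - mask_ratio))
    idx = torch.rand(b, n, device=patches.device).argsort(dim=1)
    vis_idx, mask_idx = idx[:, :n_vis], idx[:, n_vis:]

    gather = lambda x, i: torch.gather(x, 1, i.unsqueeze(-1).expand(-1, -1, d))
    visible = gather(patches, vis_idx)                       # clean, visible patches
    masked = gather(patches, mask_idx)                       # ground-truth masked patches

    # Diffusion-style corruption applied only to the masked patches.
    t = torch.randint(0, len(alphas_cumprod), (b,), device=patches.device)
    a_bar = alphas_cumprod.to(patches.device)[t].view(b, 1, 1)
    noisy_masked = a_bar.sqrt() * masked + (1 - a_bar).sqrt() * torch.randn_like(masked)

    # Encoder sees only visible patches; decoder predicts the clean pixels of
    # the masked region conditioned on the encoded visible context.
    context = encoder(visible)
    pred = decoder(noisy_masked, context, t)
    return torch.nn.functional.mse_loss(pred, masked)
```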
The researchers train models within the MAE framework using diffusion techniques, without adding extra training cost. During pre-training, the model learns to denoise its input at various noise levels, acquiring representations that are strong for both recognition and generation. They evaluate the pre-trained model both by fine-tuning on downstream discrimination tasks and by inpainting (illustrated in the figure), where the model generates samples by iteratively unrolling from random Gaussian noise. DiffMAE's ability to generate complex visual content such as objects stems from its diffusion nature, whereas MAE is known to produce blurry reconstructions that lack high-frequency detail. In addition, DiffMAE performs well on both image and video recognition tasks.
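A hedged sketch of the "iterative unrolling" described above: the masked region starts as pure Gaussian noise and is repeatedly denoised while the visible patches stay fixed as conditioning. The update rule here is a generic predict-then-renoise scheme, not necessarily the authors' sampler, and it reuses the hypothetical `encoder`/`decoder` interfaces from the previous sketch.

```python
import torch

@torch.no_grad()
def diffmae_style_inpaint(encoder, decoder, visible, alphas_cumprod, shape):
    """Iteratively refine the masked region from pure Gaussian noise,
    conditioning every step on the fixed visible patches."""
    context = encoder(visible)                        # visible context is computed once
    x = torch.randn(shape, device=visible.device)     # start from pure noise
    for t in reversed(range(len(alphas_cumprod))):
        t_batch = torch.full((shape[0],), t, device=visible.device, dtype=torch.long)
        x0_pred = decoder(x, context, t_batch)        # predict clean masked pixels
        if t > 0:
            # Re-noise the prediction down to the previous (less noisy) timestep.
            a_prev = alphas_cumprod[t - 1]
            x = a_prev.sqrt() * x0_pred + (1 - a_prev).sqrt() * torch.randn_like(x)
        else:
            x = x0_pred
    return x                                          # generated content for the masked region
```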
The main findings of this work are:
(i) DiffMAE is a strong pre-training method for fine-tuning on downstream recognition tasks, achieving performance comparable to leading recognition-focused self-supervised learning algorithms. Combined with CLIP features, DiffMAE outperforms recent work that combines MAE with CLIP.
(ii) DiffMAE can generate high-quality images from masked inputs; notably, its generations are more semantically meaningful and quantitatively outperform leading inpainting methods.
(iii) DiffMAE extends readily to the video domain, delivering high-quality inpainting and state-of-the-art recognition accuracy that surpasses recent efforts.
(iv) There is a connection between MAE and diffusion models: MAE effectively performs the early steps of the diffusion inference process. In other words, the researchers suggest that MAE's strong recognition performance is aligned with generation. They also conduct a thorough empirical analysis of the strengths and weaknesses of design choices with respect to downstream recognition and inpainting generation tasks.
Check out the Paper and Project. All credit for this research goes to the researchers of this project.
Aneesh Tickoo is a consulting intern at MarktechPost. He is currently pursuing a Bachelor’s Degree in Data Science and Artificial Intelligence from the Indian Institute of Technology (IIT), Bhilai. He spends most of his time on projects aimed at harnessing the power of machine learning. His research interest is image processing and his passion is building solutions around it. He loves connecting with people and collaborating on interesting projects.