Nvidia Partners with Cornell University to Launch AI Video Generation Model



The advent of artificial intelligence (AI) is one of the most important technological advances of the 21st century. From self-driving cars to virtual assistants and chatbots, AI is everywhere in daily life, and its impact on companies and society as a whole is immense.

Building on these developments, Nvidia, the well-known American graphics processing unit (GPU) maker, has partnered with researchers from Cornell University to launch an AI video generation model named VideoLDM. The new model can generate high-definition videos from text descriptions.

VideoLDM: Nvidia AI Video Generation Model

From a text description, the model can create videos with a maximum resolution of 2048 x 1280 pixels, a frame rate of 24 frames per second, and a maximum length of 4.7 seconds. The model is built on a Stable Diffusion neural network. Of the 4.1 billion parameters in Nvidia's solution, only 2.7 billion were trained on video.

This is fairly conservative by modern AI standards. Using latent diffusion model (LDM) techniques, the engineers were able to generate a wide range of high-resolution videos that are both diverse and temporally consistent.
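The core idea behind latent diffusion can be sketched in a few lines: instead of denoising raw pixels, the model works in a small compressed latent space and only decodes back to pixels at the end. The toy below is an illustration of that principle only, not Nvidia's actual code; the linear encoder/decoder and the trivial "denoiser" are stand-in assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

def encode(image, enc):
    # Compress "pixels" into a much smaller latent vector.
    return image @ enc

def decode(latent, dec):
    # Map the latent back to pixel space.
    return latent @ dec

def denoise(latent, target, steps=10, rate=0.5):
    # Stand-in for the iterative diffusion denoiser: each step pulls
    # the noisy latent a fraction of the way toward a clean latent.
    for _ in range(steps):
        latent = latent + rate * (target - latent)
    return latent

enc = rng.normal(size=(64, 8)) / 8   # 64-dim "image" -> 8-dim latent
dec = np.linalg.pinv(enc)            # rough inverse as the toy decoder

clean = rng.normal(size=64)          # a toy "image"
target_latent = encode(clean, enc)   # its clean latent
noisy = rng.normal(size=8)           # start from pure noise
sample = decode(denoise(noisy, target_latent), dec)
print(sample.shape)  # (64,)
```

The point of the sketch is the cost asymmetry: all the iterative denoising work happens on the 8-dimensional latent, and the expensive pixel space is touched only once at decode time, which is what makes high resolutions tractable.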

VideoLDM Capabilities

Research teams from Nvidia and Cornell University highlight two capabilities of the model: personalized video creation and convolutional temporal synthesis. An image LDM backbone fine-tuned on a DreamBooth image set can have temporal layers, trained with VideoLDM, inserted into it to convert text to video.
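The architecture described above can be sketched as alternating layers: a frozen "spatial" layer that processes each frame independently (the pre-trained image LDM), interleaved with a trainable "temporal" layer that mixes information across neighboring frames. The numpy toy below is an assumption-laden illustration of that interleaving, not the real network.

```python
import numpy as np

def spatial_layer(frames, w_spatial):
    # Applied to each frame independently, like the frozen image backbone.
    # frames: (T, D) -- T frames, D latent features per frame.
    return frames @ w_spatial

def temporal_layer(frames, kernel):
    # 1-D convolution over the time axis, mixing adjacent frames;
    # this is the part VideoLDM reportedly trains on video data.
    T, D = frames.shape
    pad = len(kernel) // 2
    padded = np.pad(frames, ((pad, pad), (0, 0)), mode="edge")
    out = np.zeros_like(frames)
    for t in range(T):
        for k, w in enumerate(kernel):
            out[t] += w * padded[t + k]
    return out

def video_block(frames, w_spatial, kernel):
    # One block: frame-wise spatial processing, then temporal mixing.
    return temporal_layer(spatial_layer(frames, w_spatial), kernel)

rng = np.random.default_rng(0)
frames = rng.normal(size=(8, 4))            # 8 frames, 4-dim latents
w = np.eye(4)                               # identity "spatial" weights
out = video_block(frames, w, np.array([0.25, 0.5, 0.25]))
print(out.shape)  # (8, 4)
```

Because the spatial weights stay fixed, swapping in a differently fine-tuned image backbone (e.g. one personalized with DreamBooth) leaves the learned temporal layers reusable, which is the mechanism behind the customized-video capability.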

By applying the learned temporal layers convolutionally over time, the model can create slightly longer clips without losing quality. In addition, it can generate videos of driving scenes lasting up to 5 minutes at a resolution of 1024 x 512 pixels.

Specific driving scenarios can be recreated by using bounding boxes to lay out a scene of interest, synthesizing a matching starting frame, and then generating a plausible video from it. The model can also generate several possible continuations from a single initial frame, providing multimodal predictions of how a driving scenario might unfold.

The research is being presented at the Conference on Computer Vision and Pattern Recognition (CVPR), held June 18-22 in Vancouver. The neural network described is currently only a research project, and it is unclear when Nvidia will release anything similar to the public.



