Advances in display technology have made the viewing experience more immersive and comfortable. Viewers are noticeably more satisfied with 4K 60FPS video than with 1080p 30FPS: the former immerses you in the content as if you were witnessing it firsthand. The catch is data cost: 4K has four times the pixels of 1080p, and 60FPS doubles the frame count, so one minute of 4K 60FPS video requires roughly six times more data than 1080p 30FPS, which many users simply cannot afford.
However, this problem can be tackled by delivering the video at a lower resolution and/or frame rate and enhancing it on the receiving end. Super-resolution methods increase the resolution of the video, while video frame interpolation methods increase the number of frames in the video.
Video frame interpolation adds new frames to a video sequence by estimating the motion between existing frames and synthesizing the in-between content. The technique is widely used in applications such as slow-motion video, frame-rate conversion, and video compression, and the resulting video usually looks noticeably smoother.
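To make the idea concrete, here is a toy sketch of the crudest possible interpolation: simply blending two neighboring frames. This is deliberately naive (it ghosts on fast motion, which is exactly why the learned, motion-aware methods discussed below exist); it assumes OpenCV and NumPy are available, and the file names are placeholders:

```python
import cv2
import numpy as np

def blend_interpolate(frame_a: np.ndarray, frame_b: np.ndarray, t: float = 0.5) -> np.ndarray:
    """Naive intermediate frame: a weighted average of two neighbors.

    Real interpolation methods estimate motion (e.g., optical flow) and
    warp pixels along it; plain blending produces ghosting on fast motion.
    """
    return cv2.addWeighted(frame_a, 1.0 - t, frame_b, t, 0.0)

# Doubling the frame rate of a two-frame clip (hypothetical file names):
a = cv2.imread("frame_000.png")
b = cv2.imread("frame_001.png")
mid = blend_interpolate(a, b)          # synthesized frame at t = 0.5
cv2.imwrite("frame_000_5.png", mid)
```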
Research on video frame interpolation has made great progress in recent years: modern methods can produce intermediate frames accurately enough to provide a comfortable viewing experience.
However, measuring the quality of interpolation results has remained a difficult task for years. Existing methods mostly rely on off-the-shelf quality metrics, and because video frame interpolation results often contain characteristic artifacts, those metrics may not match human perception when applied to interpolated frames.
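The off-the-shelf metrics in question are signal-fidelity measures such as PSNR and SSIM, which compare an interpolated frame against the ground-truth frame pixel by pixel. A minimal sketch using scikit-image (an assumed dependency) shows how they are typically computed:

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def fidelity_scores(reference: np.ndarray, interpolated: np.ndarray):
    """PSNR and SSIM between a ground-truth frame and an interpolated one.

    Both metrics reward pixel-level fidelity, which is why they can
    disagree with human opinion on interpolation artifacts such as
    ghosting or blur. Frames are assumed to be 8-bit RGB arrays.
    """
    psnr = peak_signal_noise_ratio(reference, interpolated, data_range=255)
    ssim = structural_similarity(reference, interpolated, channel_axis=-1, data_range=255)
    return psnr, ssim
```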
Some works run subjective user studies for more accurate assessment, but this is time-consuming and does not scale. So how can the quality of interpolation results be measured both accurately and automatically? A recent paper sets out to answer exactly that question.
A group of researchers has published a dedicated perceptual quality metric for measuring video frame interpolation results. They designed a novel neural network architecture for video perceptual quality assessment, built on Swin Transformers.
The network receives a pair of frames as input, one from the original video sequence and one interpolated frame, and outputs a score representing the perceptual similarity between the two. The first step toward such a network was preparing training data, so the authors began there: they constructed a large perceptual similarity dataset for video frame interpolation, containing frame pairs from different videos together with human judgments of their perceptual similarity. This dataset is used to train the network with a combination of L1 and SSIM objectives.
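The paper's exact architecture isn't reproduced here, but the overall shape of such a metric network can be sketched: extract Swin Transformer features from each frame of the pair, compare them, and regress a single similarity score. The sketch below assumes the timm library for a pretrained Swin backbone; the head layout and layer sizes are illustrative, not the authors' choices:

```python
import torch
import torch.nn as nn
import timm

class PairSimilarityNet(nn.Module):
    """Illustrative pair-input quality network: Swin features -> score."""

    def __init__(self):
        super().__init__()
        # Pretrained Swin backbone as a feature extractor (assumed via timm;
        # num_classes=0 makes it return pooled features).
        self.backbone = timm.create_model(
            "swin_tiny_patch4_window7_224", pretrained=True, num_classes=0
        )
        feat_dim = self.backbone.num_features
        # Small head regressing a score from both frames' features
        # and their absolute difference (illustrative design).
        self.head = nn.Sequential(
            nn.Linear(feat_dim * 3, 256),
            nn.ReLU(inplace=True),
            nn.Linear(256, 1),
        )

    def forward(self, frame_a: torch.Tensor, frame_b: torch.Tensor) -> torch.Tensor:
        fa = self.backbone(frame_a)            # (B, feat_dim)
        fb = self.backbone(frame_b)
        feats = torch.cat([fa, fb, (fa - fb).abs()], dim=1)
        return self.head(feats).squeeze(1)     # (B,) perceptual similarity

# Usage: score a batch of (original, interpolated) 224x224 RGB frame pairs.
net = PairSimilarityNet().eval()
a = torch.rand(2, 3, 224, 224)
b = torch.rand(2, 3, 224, 224)
with torch.no_grad():
    print(net(a, b))
```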
The L1 loss measures the absolute difference between the predicted score and the ground-truth score, while the SSIM loss measures the structural similarity between two images. Combining the two trains the network to predict scores that are both accurate and consistent with human perception. The main advantage of the proposed method is that it is reference-agnostic, so it can be run on client devices, which normally do not have the original reference video available.
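As a rough illustration of such a combined objective (the weighting and the SSIM implementation are assumptions, not the paper's settings), the two terms can be mixed as follows; the sketch operates on image-like tensors since SSIM needs spatial structure, and assumes the third-party pytorch_msssim package:

```python
import torch
from pytorch_msssim import ssim  # assumed SSIM implementation

def combined_loss(pred: torch.Tensor, target: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """Blend of L1 and SSIM objectives on (N, C, H, W) tensors in [0, 1].

    L1 penalizes absolute differences; (1 - SSIM) penalizes structural
    dissimilarity. alpha is an illustrative weight, not the paper's.
    """
    l1_term = torch.nn.functional.l1_loss(pred, target)
    ssim_term = 1.0 - ssim(pred, target, data_range=1.0)
    return alpha * l1_term + (1.0 - alpha) * ssim_term

# Usage: the combined loss is differentiable end to end.
pred = torch.rand(4, 3, 64, 64, requires_grad=True)
target = torch.rand(4, 3, 64, 64)
combined_loss(pred, target).backward()
```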
Ekrem Çetinkaya received his B.Sc. in 2018 and his M.Sc. in 2019 from Özyeğin University in Istanbul, Turkey. He wrote his M.Sc. thesis on image denoising using deep convolutional networks. He is currently pursuing a Ph.D. at the University of Klagenfurt in Austria, where he works as a researcher on the ATHENA project. His research interests include deep learning, computer vision, and multimedia networking.