
While current approaches use external pretrained features (e.g., Meta CLIP, BEATs), we found that diffusion activations hold rich, semantically and temporally aware features, making them perfect for cross-modal generation in a self-contained framework. 🔊➡️📽️ Example:

Compared to Meta Movie Gen Video to Audio, we achieve significantly better temporal synchronization with a model that is 90% smaller.

Can pretrained diffusion models be connected for cross-modal generation? 📢 Introducing AV-Link ♾️ Bridging unimodal diffusion models in one self-contained framework to enable: 📽️ ➡️ 🔊 Video-to-Audio generation. 🔊 ➡️ 📽️ Audio-to-Video generation. 🌐: snap-research.github.io/AVLink/ ⤵️ Results

Check out this recent work by my PhD student Moayed. He has been doing amazing work on Generative AI for images, video, and audio. We introduce AV-Link ♾️, a unified approach for audio-video generation. Our generated audio is the best in terms of synchronization with video actions. Check the thread below.