While current approaches use external pretrained features (e.g. Meta CLIP, BEATs), we found that diffusion activations hold rich, semantically and temporally aware features, making them well suited for cross-modal generation in a self-contained framework.
🔊➡️📽️ Example:
Compared to Meta Movie Gen Video-to-Audio, we achieve significantly better temporal synchronization with a model that is 90% smaller.
Can pretrained diffusion models be connected for cross-modal generation?
📢 Introducing AV-Link ♾️
Bridging unimodal diffusion models in one self-contained framework to enable:
📽️ ➡️ 🔊 Video-to-Audio generation.
🔊 ➡️ 📽️ Audio-to-Video generation.
🔗: snap-research.github.io/AVLink/
⤵️ Results
Check out this recent work by my PhD student Moayed. He has been doing amazing work on Generative AI for images, video and audio. We introduce AV-Link ♾️, a unified approach for audio-video generation. Our generated audio is the best in terms of synchronization with video actions. Check the thread below.