PhD @RiceUniversity | Research Intern @Snap

Moayed Haji Ali

ICLR rejections go brrrr

Check out this recent work by my PhD student Moayed. He has been doing amazing work on Generative AI for images, video, and audio. We introduce AV-Link ♾️, a unified approach for audio-video generation. Our generated audio is the best in terms of synchronization with video actions. Check the thread below.

A great collaboration with W. Menapace, A. Siarohin, I. Skorokhodov, A. Canberk, K. S. Lee, V. Ordonez, and S. Tulyakov. Please repost to support our work and check out our arXiv preprint: arxiv.org/abs/2412.15191 Webpage: snap-research.github.io/AVLink/

While current approaches use external pretrained features (e.g. Meta CLIP, BEATs), we found that diffusion activations hold rich, semantically and temporally aware features, making them perfect for cross-modal generation in a self-contained framework. 🔊➡️📽️ Example:

Besides Video to Audio (📽️ ➡️🔊), we also support Audio to Video (🔊➡️📽️) generation under the same unified framework.

Compared to Meta Movie Gen Video to Audio, we achieve significantly better temporal synchronization with a model that is 90% smaller.

Precise temporal synchronization remains a significant challenge for current video-to-audio models. AV-Link addresses this by leveraging diffusion features to accurately capture both local and global temporal events, such as hand slides on a guitar and fretboard pitch changes.