Research Scientist @ IBM Research. Postdoc @ Berkeley AI. PhD @ Tel Aviv University. Working on Compositionality, Multimodal Foundation Models, and Structured Physical Intelligence. 🔗 https://roeiherz.github.io/ 📍Bay area 🇺🇲

Roei Herzig

The CVPR panel at the "What is Next in Multimodal Foundation Models?" workshop kicks off soon! 11:30 AM, R207 A–D (Level 2). Don't miss an amazing discussion with Ludwig Schmidt, @andrewowens.bsky.social, Arsha Nagrani, and Ani Kembhavi 🔥 @cvprconference.bsky.social sites.google.com/view/mmfm3rd...

Jun 12, 2025

The best friend of Auto-regressive Robotic Models is 4D representations...🤖😻❤️

For example, VLAs use language decoders, which are pretrained on tasks like visual question answering and image captioning. This presents a discrepancy between the models’ high-level pre-training objective and the need for robotic models to predict low-level actions.

Our workshop "What is Next in Multimodal Foundation Models?" has been accepted to #CVPR for the 3rd time! We are cooking up amazing talks and an excellent panel for you, so stay tuned! @cvprconference.bsky.social

Oh no, I have NeurIPS @neuripsconf.bsky.social FOMO 🙃😃🤗 Or is it actually more of a Taylor Swift FOMO? 🫠

What happens when vision 🤝 robotics meet? 🚨 Happy to share our new work on Pretraining Robotic Foundational Models! 🔥 ARM4R is an Autoregressive Robotic Model that leverages low-level 4D Representations learned from human video data to yield a better robotic model. BerkeleyAI 😊

Pretraining has significantly contributed to recent Foundational Model success. However, in robotics, progress has been limited due to a lack of robotic annotations and insufficient representations that accurately model the physical world.

Feb 20, 2025

Our paper: arxiv.org/pdf/2502.13142. Our project page and code will be released soon! Team: w/ Dantong Niu, Yuvan Sharma, Haoru Xue, Giscard Biamby, Junyi Zhang, Ziteng Ji, and Trevor Darrell.

Feb 24, 2025

For all our @neuripsconf.bsky.social friends 🤖🦋, our work is being presented NOW at POSTER #3701. Come hear us talk about our work on many-shot in-context learning and test-time scaling by leveraging the activations! You won't be disappointed 😎 #Multimodal-InContextLearning #NeurIPS

We found that 4D representations maintain a shared geometric structure between the points and robot state representations up to a linear transformation, thus enabling efficient transfer learning from human video data to low-level robotic control.

Dec 21, 2024

Dec 10, 2024

Dec 12, 2024

Hilde Kuehne
