Research Scientist @ IBM Research. Postdoc @ Berkeley AI. PhD @ Tel Aviv University. Working on Compositionality, Multimodal Foundation Models, and Structured Physical Intelligence.
🔗 https://roeiherz.github.io/
📍Bay Area 🇺🇲
Roei Herzig
The CVPR panel at the "What is Next in Multimodal Foundation Models?" workshop kicks off soon!
11:30AM, R207 A–D (Level 2)
Don't miss an amazing discussion with Ludwig Schmidt, @andrewowens.bsky.social, Arsha Nagrani, and Ani Kembhavi 🔥
@cvprconference.bsky.social
sites.google.com/view/mmfm3rd...
The best friend of Auto-regressive Robotic Models is 4D representations...🤖😻❤️
For example, VLAs use language decoders, which are pretrained on tasks like visual question answering and image captioning.
This presents a discrepancy between the models’ high-level pre-training objective and the need for robotic models to predict low-level actions.
Our workshop "What is Next in Multimodal Foundation Models?" has been accepted to #CVPR for the 3rd time!
We are cooking amazing talks and an excellent panel for you, so stay tuned!
@cvprconference.bsky.social
Oh no, I have NeurIPS @neuripsconf.bsky.social FOMO🙃😃🤗
Or is it actually more of Taylor Swift?🫠
What happens when vision 🤝 robotics meet? 🚨 Happy to share our new work on Pretraining Robotic Foundational Models!🔥
ARM4R is an Autoregressive Robotic Model that leverages low-level 4D Representations learned from human video data to yield a better robotic model.
BerkeleyAI 😊
Pretraining has significantly contributed to recent Foundational Model success. However, in robotics, progress has been limited due to a lack of robotic annotations and insufficient representations that accurately model the physical world.
Our paper: arxiv.org/pdf/2502.13142.
Our project page and code will be released soon!
Team: w/ Dantong Niu, Yuvan Sharma, Haoru Xue, Giscard Biamby, Junyi Zhang, Ziteng Ji, and Trevor Darrell.
For all our @neuripsconf.bsky.social friends🤖🦋, our work is presented NOW at POSTER #3701.
Come hear us talk about our work on many-shot in-context learning and test-time scaling by leveraging the activations! You won't be disappointed😎
#Multimodal-InContextLearning #NeurIPS
We found that 4D representations maintain a shared geometric structure between the points and robot state representations up to a linear transformation, thus enabling efficient transfer learning from human video data to low-level robotic control.
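To make "up to a linear transformation" concrete, here is a toy sketch (not the ARM4R code). It assumes two sets of features, hypothetically named `points` and `states`, that are related by an unknown matrix, and recovers that matrix with ordinary least squares. The data, dimensions, and all function names are made up for illustration. In practice you would probably reach for `numpy.linalg.lstsq` instead of the hand-rolled solver.

```python
import random


def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def solve(a, b):
    """Solve a @ w = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(a[i]) + list(b[i]) for i in range(n)]
    for col in range(n):
        # Swap in the row with the largest pivot for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * pv for v, pv in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def fit_linear_map(x, y):
    """Least-squares W minimizing ||x @ W - y|| via the normal equations."""
    xt = transpose(x)
    return solve(matmul(xt, x), matmul(xt, y))


def max_abs_error(x, y, w):
    pred = matmul(x, w)
    return max(abs(p - t) for pr, tr in zip(pred, y) for p, t in zip(pr, tr))


random.seed(0)
# Hypothetical toy data: 50 "point" features of dimension 3, and "robot state"
# features of dimension 2 that are an exact linear function of them.
true_w = [[1.0, -2.0], [0.5, 0.0], [3.0, 1.5]]
points = [[random.gauss(0, 1) for _ in range(3)] for _ in range(50)]
states = matmul(points, true_w)

w = fit_linear_map(points, states)
residual = max_abs_error(points, states, w)
print("max residual:", residual)
```

If the two feature sets really share structure up to a linear map, the residual of this fit stays near zero. A large residual would suggest the relation is not linear.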