🏆ONEBench accepted to ACL main! ✨
Stay tuned for the official leaderboard and real-time personalised benchmarking release!
If you’re attending ACL or are generally interested in the future of foundation model benchmarking, happy to talk!
#ACL2025NLP #ACL2025
@aclmeeting.bsky.social
🧠 Keeping LLMs factually up to date is a common motivation for knowledge editing.
But what would it actually take to support this in practice at the scale and speed the real world demands?
We explore this question and really push the limits of lifelong knowledge editing in the wild.
👇
Godsend
Excited to be in Vienna for #ACL2025 🇦🇹! You'll find @dziadzio.bsky.social and me by our ONEBench poster, so do drop by!
🗓️ Wed, July 30, 11:00-12:30 CEST
📍Hall 4/5
I’m also excited to talk about lifelong and personalised benchmarking, data curation and vision-language in general! Let’s connect!
Check out our newest paper!
As always, it was super fun working on this with @prasannamayil.bsky.social
🧵1/10 Excited to share our #SIGGRAPH paper "MonetGPT: Solving Puzzles Enhances MLLMs' Image Retouching Skills" 🌟
We explore how to make MLLMs operation-aware by solving visual puzzles and propose a procedural framework for image retouching
#MLLM
Why More Researchers Should be Content Creators
Just trying something new! I recorded one of my recent talks, sharing what I learned from starting as a small content creator.
youtu.be/0W_7tJtGcMI
We all benefit when there are more content creators!
🚨Great Models Think Alike and this Undermines AI Oversight🚨
New paper quantifies LM similarity
(1) LLM-as-a-judge favors more similar models🤥
(2) Complementary knowledge benefits Weak-to-Strong Generalization☯️
(3) More capable models have more correlated failures 📈🙀
🧵👇
I'm in Nashville this week attending #CVPR2025. Excited to discuss post-training VLMs and diffusion models!
Adhiraj Ghosh
Lukas Thede
Adhiraj Ghosh
Adhiraj Ghosh
Thaddäus Wiedemer
Joschka Strüber @Tuebingen AI Center🇩🇪
Jia-Bin Huang
Niladri Shekhar Dutt
Shyamgopal Karthik
🚨Looking to test your foundation model on an arbitrary and open-ended set of capabilities, not explicitly captured by static benchmarks? 🚨
Check out ✨ONEBench✨, where we show how sample-level evaluation is the solution.
🔎 arxiv.org/abs/2412.06745