Did you know? Gestures used to express universal concepts—like wishing for luck—vary DRAMATICALLY across cultures?
🤞 means luck in the US but is deeply offensive in Vietnam 🚨
📣 We introduce MC-SIGNS, a test bed to evaluate how LLMs/VLMs/T2I handle such nonverbal behavior!
📜: arxiv.org/abs/2502.17710
Can LLMs accurately aggregate information over long, information-dense texts? Not yet…
We introduce Oolong, a dataset of simple-to-verify information aggregation questions over long inputs. No model achieves >50% accuracy at 128K on Oolong!
Our paper documenting the environmental impacts of creating OLMo language models is the most honest and comprehensive characterization I know of, including training, development (!) and inference costs. If you're at ICLR, chat with @jacobcares.bsky.social & @clarana.bsky.social Sat morning 10-12:30!
Akhila Yerukola
Amanda Bertsch
We’re excited about Oolong as a challenging benchmark for information aggregation! Let us know which models we should benchmark next 👀
Paper: arxiv.org/abs/2511.02817
Dataset: huggingface.co/oolongbench
Code: github.com/abertsch72/o...
Leaderboard: oolongbench.github.io
I'm in Singapore for @iclr-conf.bsky.social! Come check out our spotlight paper on the environmental impact of training OLMo (link in next tweet) during the Saturday morning poster session from 10-12:30 -- happy to chat about this or anything else! DMs should be open, email works too
Come through! #492 in Hall 2, 10am-12:30pm
Yes! tbh this method is probably much more immediately useful for helping one understand subtle differences between [models trained on] subtly different data subsets, vs a loftier goal of helping one find "the" best data mixture -- to anyone considering this method, please feel free to reach out :)
Emma Strubell
We've received multiple notes that NOAA research services (Office of Oceanic and Atmospheric Research) may go offline at midnight. @safeguardingdata.bsky.social is working on web archiving, but if others want to nominate on this, that might be good: digital2.library.unt.edu/nomination/G...
How can we better think and talk about human-like qualities attributed to language technologies like LLMs? In our #CHI2025 paper, we taxonomize how text outputs from cases of user interactions with language technologies can contribute to anthropomorphism. arxiv.org/abs/2502.09870 1/n