While long-context models can do many retrieval tasks impressively well, they have a long way to go to solve realistic information synthesis problems! Oolong is joint work with Adithya Pratapa, Teruko Mitamura, @gneubig.bsky.social, and Matt Gormley.

Can LLMs accurately aggregate information over long, information-dense texts? Not yet… We introduce Oolong, a dataset of simple-to-verify information aggregation questions over long inputs. No model achieves >50% accuracy at 128K on Oolong!

Models show varying error patterns. Claude and some GPT-family models underperform on tasks that require outputting dates; Gemini and DeepSeek-R1 frequently over-reason and fail to return an answer at all on Oolong-synth, although Gemini is the best model on Oolong-real.

Oolong has a synthetic setting that poses distributional questions over sets of classification examples and their metadata and a realistic setting using conversational data from game transcripts. Both splits require counting, temporal reasoning, and multi-step entity resolution.


Why is this so hard? Models must identify relevant sections of input, label or categorize these sections, and then accumulate information to make distribution-level decisions. Adding labels in-context or specifying more reasoning effort has limited benefit.
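To make the three steps concrete (find the relevant sections, label them, accumulate), here is a toy sketch in Python. Everything in it is an illustrative assumption: the records, the keyword rules, and the function names are made up and are not taken from the Oolong data or code. It only shows the shape of the computation a model has to carry out implicitly over a long input.

```python
from collections import Counter

# Hypothetical toy records standing in for a long, information-dense input.
RECORDS = [
    ("u1", "great movie, loved it"),
    ("u2", "terrible plot"),
    ("u1", "loved the soundtrack"),
    ("u3", "weather report for Tuesday"),
    ("u2", "great acting"),
]

def is_relevant(text):
    # Step 1: identify sections relevant to the question (toy keyword rule).
    return any(word in text for word in ("great", "loved", "terrible"))

def label(text):
    # Step 2: label or categorize each relevant section (toy sentiment rule).
    return "negative" if "terrible" in text else "positive"

def most_common_label(records):
    # Step 3: accumulate labels, then answer a distribution-level question.
    counts = Counter(label(text) for _, text in records if is_relevant(text))
    top_label, _ = counts.most_common(1)[0]
    return top_label, counts

top, counts = most_common_label(RECORDS)
print(top, dict(counts))
```

In this sketch each step is a trivial rule; the hard part for a model is doing all three reliably across an input that is orders of magnitude longer, where one missed or mislabeled section changes the final count.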

We’re excited about Oolong as a challenging benchmark for information aggregation! Let us know which models we should benchmark next 👀 Paper: arxiv.org/abs/2511.02817 Dataset: huggingface.co/oolongbench Code: github.com/abertsch72/o... Leaderboard: oolongbench.github.io

why intern at Ai2? 🐟interns own major parts of our model development, sometimes even leading whole projects 🐡we're committed to open science & actively help our interns publish their work reach out if u wanna build open language models together 🤝 links 👇


I’ll be presenting this work in **2 hours** at EMNLP’s Gather Session 3. Come by to chat about fanfiction, literary notions of similarity, long-context modeling, and consent-focused data collection!

DeltaNet Explained, by Songlin Yang. A gentle and comprehensive introduction to DeltaNet. Part 1: sustcsonglin.github.io/blog/2024/de... Part 2: sustcsonglin.github.io/blog/2024/de... Part 3: sustcsonglin.github.io/blog/2024/de...
