Models show varying error patterns. Claude and some GPT-family models underperform on tasks that require outputting dates; Gemini and DeepSeek-R1 frequently over-reason and fail to return any answer on Oolong-synth, even though Gemini is the best model on Oolong-real.

Oolong has two settings: a synthetic setting, which poses distributional questions over sets of classification examples and their metadata, and a realistic setting, which uses conversational data from game transcripts. Both splits require counting, temporal reasoning, and multi-step entity resolution.

7mo

I’ll be presenting this work in **2 hours** at EMNLP’s Gather Session 3. Come by to chat about fanfiction, literary notions of similarity, long-context modeling, and consent-focused data collection!

7mo

While long-context models can do many retrieval tasks impressively well, they have a long way to go to solve realistic information synthesis problems! Oolong is joint work with Adithya Pratapa, Teruko Mitamura, @gneubig.bsky.social, and Matt Gormley.

7mo