Gemini is currently far ahead of the competition at processing integrated audio-visual signals, which suggests that understanding conversation structure is not just a text problem. At the same time, if we use inter-annotator agreement as a proxy for the human ceiling, we still have a long way to go.