Utility 2: Model alignment via DPO.
On Arena-Hard, thought-guided rewrites beat:
Base Qwen3.5-4B by +25.6%
WildChat by +6.6%
Message-guided rewrites by +4.5%
Thoughts give models actionable alignment signals by surfacing dissatisfaction that users never spell out.
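
For reference, the standard DPO loss on a single preference pair can be sketched as below. This is a minimal illustration, not the training code behind the numbers above. It assumes the thought-guided rewrite is the "chosen" response and the original response is the "rejected" one; that pairing, the function name `dpo_loss`, and the default `beta=0.1` are assumptions made for illustration.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Inputs are summed token log-probabilities of each response under the
    policy being trained and under a frozen reference model.
    Assumption: chosen = thought-guided rewrite, rejected = original response.
    """
    # How much more the policy likes each response than the reference does.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) == log(1 + exp(-margin)), computed stably.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: margin is 0, loss is log(2).
baseline = dpo_loss(-1.0, -1.0, -1.0, -1.0)
# Policy shifted toward the chosen response: loss falls below log(2).
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

The loss shrinks as the policy raises the chosen response's likelihood relative to the rejected one, which is how a preference signal recovered from user thoughts would translate into a gradient.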