Neural MT metrics show the strongest alignment with downstream performance. But the proxy has limits: some specialized benchmarks, including MGSM and INCLUDE, show weaker or more variable correlations. Task-specific evaluation remains necessary. (4/5 🧵)