📢 New paper: Can unsupervised metrics extracted from MT models detect their translation errors reliably? Do annotators even *agree* on what constitutes an error? 🧐
We compare uncertainty- and interpretability-based word-level quality estimation (WQE) metrics across 12 translation directions, with some surprising findings!
🧵 1/