I'm part of this! There's also a paper: arxiv.org/abs/2503.10267
Laurie Burchell
** New parallel data set ** We've just released HPLT v2.0, a parallel data set of 50 languages paired with English, 380M sentence pairs in total. Extracted from the Internet Archive and Common Crawl: hplt-project.org/datasets/v2.0