For full technical details + the compliance Datasheet, see our preprint @ arxiv.org/abs/2510.13996. As for German-specific models trained on this data... stay tuned 👀

7mo

Large language model development relies on large-scale training corpora, yet most contain data of unclear licensing status, limiting the development of truly open models. This problem is exacerbated f...

arxiv.org

The German Commons - 154 Billion Tokens of Openly Licensed Text for German Language Models

Webis Group