Information is nothing without retrieval
The Webis Group contributes to information retrieval, natural language processing, machine learning, and symbolic AI.
Webis Group
The data spans 7 text domains:
🌐 Web: Wikipedia, GitHub, social media
💬 Political: Parliamentary proceedings, speeches
⚖️ Legal: Court decisions, federal & EU law
📰 News: Newspaper archives
🏦 Economics: Public tenders
📚 Cultural: Digital heritage collections
🔬 Scientific: Papers, books, journals
The current problem: LLM training data is primarily sourced from Web crawls, which offer scale but come with unclear licensing. That uncertainty blocks models from commercial deployment and research use. We took a different path: systematically collecting German text from 41 institutional sources with explicit open licenses.
This means:
✅ Every document has verifiable usage rights (at minimum CC-BY-SA 4.0, commercial use allowed)
✅ Full institutional provenance for reduced compliance risks
✅ Systematic PII removal + quality filtering, ready for training
✅ Rich metadata for downstream customization
Thrilled to announce that Matti Wiegmann has successfully defended his PhD! 🎉🧑‍🎓 Huge congratulations on this incredible achievement! #PhDDefense #AcademicMilestone
For full technical details and the compliance datasheet, see our preprint at arxiv.org/abs/2510.13996
As for German-specific models trained on this data... stay tuned 👀
We presented two papers at ICTIR 2025 today:
- Axioms for Retrieval-Augmented Generation webis.de/publications...
- Learning Effective Representations for Retrieval Using Self-Distillation with Adaptive Relevance Margins webis.de/publications...
Honored to win the ICTIR Best Paper Honorable Mention Award for "Axioms for Retrieval-Augmented Generation"!
Our new axioms are integrated with ir_axioms: github.com/webis-de/ir_...
Nice to see axiomatic IR gaining momentum.
Congratulations to the authors @heinrich.merker.id, @maik-froebe.bsky.social, @benno-stein.de, @martin-potthast.com, @matthias-hagen.bsky.social from @uni-jena.de, Uni Weimar, @unikassel.bsky.social, @hessianai.bsky.social, @scadsai.bsky.social!
We just released "German Commons", the largest openly-licensed German text dataset for LLM training: 154B tokens with clear usage rights for research and commercial use.
huggingface.co/datasets/coral-nlp/german-commons