Information is nothing without retrieval. The Webis Group contributes to information retrieval, natural language processing, machine learning, and symbolic AI.

Webis Group

The data spans 7 text domains:
🌐 Web: Wikipedia, GitHub, social media
💬 Political: Parliamentary proceedings, speeches
⚖️ Legal: Court decisions, federal & EU law
📰 News: Newspaper archives
🏦 Economics: Public tenders
📚 Cultural: Digital heritage collections
🔬 Scientific: Papers, books, journals

The current problem: LLM training data is primarily sourced from Web crawls, which offer scale but unclear licensing. That uncertainty blocks models from commercial deployment and research use. We took a different path: systematically collecting German text from 41 institutional sources with explicit open licenses.

This means:
✅ Every document has verifiable usage rights (min. CC-BY-SA 4.0 and allows commercial use)
✅ Full institutional provenance for reduced compliance risks
✅ Systematic PII removal + quality filtering, ready for training
✅ Rich metadata for downstream customization

Thrilled to announce that Matti Wiegmann has successfully defended his PhD! 🎉🧑‍🎓 Huge congratulations on this incredible achievement! #PhDDefense #AcademicMilestone

For full technical details and the compliance datasheet, see our preprint at arxiv.org/abs/2510.13996. As for German-specific models trained on this data... stay tuned 👀

We presented two papers at ICTIR 2025 today:
- Axioms for Retrieval-Augmented Generation webis.de/publications...
- Learning Effective Representations for Retrieval Using Self-Distillation with Adaptive Relevance Margins webis.de/publications...

Honored to win the ICTIR Best Paper Honorable Mention Award for "Axioms for Retrieval-Augmented Generation"! Our new axioms are integrated with ir_axioms: github.com/webis-de/ir_... Nice to see axiomatic IR gaining momentum.

Congratulations to the authors @heinrich.merker.id, @maik-froebe.bsky.social, @benno-stein.de, @martin-potthast.com, @matthias-hagen.bsky.social from @uni-jena.de, Uni Weimar, @unikassel.bsky.social, @hessianai.bsky.social, @scadsai.bsky.social!

We just released "German Commons", the largest openly-licensed German text dataset for LLM training: 154B tokens with clear usage rights for research and commercial use. huggingface.co/datasets/coral-nlp/german-commons
