Information is nothing without retrieval.
The Webis Group contributes to information retrieval, natural language processing, machine learning, and symbolic AI.
Webis Group
The data spans 7 text domains:
🌐 Web: Wikipedia, GitHub, social media
💬 Political: Parliamentary proceedings, speeches
⚖️ Legal: Court decisions, federal & EU law
📰 News: Newspaper archives
🏦 Economics: Public tenders
📚 Cultural: Digital heritage collections
🔬 Scientific: Papers, books, journals
The current problem: training data is primarily sourced from Web crawls, which give you scale but unclear licensing. This blocks models from commercial deployment and research. We took a different path: systematically collecting German text from 41 institutional sources with explicit open licenses.
This means:
✅ Every document has verifiable usage rights (min. CC-BY-SA 4.0, commercial use allowed)
✅ Full institutional provenance for reduced compliance risks
✅ Systematic PII removal + quality filtering, ready for training
✅ Rich metadata for downstream customization
Thrilled to announce that Matti Wiegmann has successfully defended his PhD! 🎉🧑‍🎓 Huge congratulations on this incredible achievement! #PhDDefense #AcademicMilestone
For full technical details + compliance Datasheet, see our preprint: arxiv.org/abs/2510.13996
As for German-specific models trained on this data... stay tuned 👀
We presented two papers at ICTIR 2025 today:
- Axioms for Retrieval-Augmented Generation webis.de/publications...
- Learning Effective Representations for Retrieval Using Self-Distillation with Adaptive Relevance Margins webis.de/publications...
Honored to win the ICTIR Best Paper Honorable Mention Award for "Axioms for Retrieval-Augmented Generation"! Our new axioms are integrated with ir_axioms: github.com/webis-de/ir_... Nice to see axiomatic IR gaining momentum.
Congratulations to the authors @heinrich.merker.id, @maik-froebe.bsky.social, @benno-stein.de, @martin-potthast.com, @matthias-hagen.bsky.social from @uni-jena.de, Uni Weimar, @unikassel.bsky.social, @hessianai.bsky.social, @scadsai.bsky.social!
We just released "German Commons", the largest openly-licensed German text dataset for LLM training: 154B tokens with clear usage rights for research and commercial use. huggingface.co/datasets/coral-nlp/german-commons