Pretraining launched!
Our 9B/10TT baby model is taking its first steps on Leonardo (CINECA).
Everyone involved is eager to see and share the results of the effort it took to get here.
And we are already pushing hard toward the next cycle. 🦾
#goOpenEuroLLM
Wrapping up our 3rd general meeting, hosted by AI Sweden in sunny Stockholm ☀️
A full room makes the final decisions before training the first OpenEuroLLM model. Sharing updates, ideas, and future plans.
Two more days of tight collaboration.
Full speed mode.
#goOpenEuroLLM
HPLT is one of the datasets we are sharing in our world-readable catalogue across HPCs. Interesting talk at #LREC2026 in 15 min in room Menorca 1 at 16:20!!!
All ready to share information about #OpenEuroLLM with the #LREC2026 crowd. Let's talk data, infra, evals and open multilingual LLMs together! Come to booth #5 in poster area 1, Elyxir Building.
#multingualLLMs #openLLMs #diverseLLMs #safeLLMs
Quite a nice "representation" of the OpenEuroLLM crowd will be at the International Conference on Learning Representations (ICLR) this week.
On Friday 24, come to the poster "OpenThoughts: Data Recipes for Reasoning Models", work partially supported by our project, and meet us!
Experimenting with model-based annotation for better data selection? A candidate to consider is propella-1, a fully open-source multilingual and multi-property annotator partially funded by #OpenEuroLLM.
Models, annotations and paper ready! See: huggingface.co/collections/...
One year of OpenEuroLLM!
🇪🇺 We're building Europe's next-gen open-source LLMs to boost digital sovereignty.
More about our achievements and next steps for infrastructure, data, models and evaluation at openeurollm.eu/blog/first-y....
Year 2 = full speed ahead.
Go #OpenEuroLLM!
Also today, learn more about the impact of benchmark contamination by going to the poster of our colleagues from the universities of Helsinki and Turku and the ELLIS Institute Finland.
Input, more input
Just like Johnny 5 in Short Circuit, our baby model is reading every single token from its pretraining dataset.
So far: 10 trillion tokens, 36 languages + code & math as their own "languages"
We're tracking progress & sharing it openly
(1/2)
A series of foundation models for transparent AI in Europe
openeurollm.eu
As of this morning:
425.49B tokens seen
4.25% completed
This eager reader wants more input, one token at a time.
Follow along.
(2/2)
#PreTraining #LLM #MultilingualAI #TransparentAI
#goOpenEuroLLM