PhD at EPFL 🧠💻
Ex @MetaAI, @SonyAI, @Microsoft
Egyptian 🇪🇬
Badr AlKhamissi
So why build toward a brain-encoding foundation model?
- Simulate fMRI responses to sensory stimuli
- Insights about the brain
- Path toward clinical applications
MIRAGE is our first step. 🧠
More on the project site: mirage-brain.epfl.ch
w/ @akgokce.bsky.social & @mschrimpf.bsky.social
Two simple ideas for building improved brain encoding models: 1. learn to use representations from all model layers via a gating mechanism + 2. start from natively multimodal features for multimodal predictions. State-of-the-art performance; see mirage-brain.epfl.ch for details #NeuroAI 🧠🤖🧪
🧠 When you watch a movie, your brain blends sight, sound, and speech into a single experience.
Should models of the brain blend them too, or keep the senses separate until the very end?
We built MIRAGE to find out. It sets a new SOTA for predicting whole-brain fMRI from movies. 🧵
Most brain-encoding pipelines pull vision, audio, and language features from separate models, then fuse them late, at the readout.
But modern foundation models fuse modalities during pretraining.
Which kind of fusion is actually more brain-relevant?
Key finding: native fusion beats post-hoc fusion at every architectural level: linear ridge, brain encoder, and full MIRAGE.
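To make the "linear ridge" level of that comparison concrete, here is a minimal sketch (not the MIRAGE pipeline) of scoring a late-fused feature stack and a natively multimodal feature matrix with the same ridge readout. Everything in it is an assumption for illustration: the feature matrices are synthetic stand-ins, `noisy_view` and `encoding_score` are hypothetical helpers, and the scores it produces say nothing about real brains or real backbones.

```python
import numpy as np

def ridge_fit(X, Y, alpha=1.0):
    """Closed-form ridge regression: W = (X^T X + alpha I)^-1 X^T Y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ Y)

def voxelwise_pearson(Y_true, Y_pred):
    """Pearson correlation between true and predicted responses, per voxel."""
    yt = Y_true - Y_true.mean(axis=0)
    yp = Y_pred - Y_pred.mean(axis=0)
    denom = np.linalg.norm(yt, axis=0) * np.linalg.norm(yp, axis=0)
    return (yt * yp).sum(axis=0) / denom

def encoding_score(X, Y, n_train, alpha=1.0):
    """Fit on the first n_train timepoints, report mean voxelwise r on the rest."""
    W = ridge_fit(X[:n_train], Y[:n_train], alpha)
    return float(voxelwise_pearson(Y[n_train:], X[n_train:] @ W).mean())

rng = np.random.default_rng(0)
n_train, n_test, n_voxels = 200, 50, 10
n = n_train + n_test
latent = rng.standard_normal((n, 6))  # synthetic "stimulus" shared by all views

def noisy_view(dim):
    """Hypothetical feature extractor: a random noisy projection of the latent."""
    proj = rng.standard_normal((6, dim))
    return latent @ proj + 0.5 * rng.standard_normal((n, dim))

# Late (post-hoc) fusion: separate unimodal features, concatenated at the readout.
vision, audio, text = noisy_view(8), noisy_view(8), noisy_view(8)
late_fused = np.concatenate([vision, audio, text], axis=1)

# Native fusion: one feature matrix from a single multimodal model (stand-in here).
native = noisy_view(24)

# Synthetic "fMRI" responses driven by the same latent.
Y = latent @ rng.standard_normal((6, n_voxels)) + 0.5 * rng.standard_normal((n, n_voxels))

r_late = encoding_score(late_fused, Y, n_train)
r_native = encoding_score(native, Y, n_train)
print(f"late fusion r={r_late:.3f}  native fusion r={r_native:.3f}")
```

The point of the sketch is only that both fusion styles go through an identical readout and metric, so any difference comes from the features themselves.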
The kicker: on out-of-distribution movies, a single MIRAGE model beats TRIBE v1's 1,000-model ensemble!!
That sets a new SOTA on Algonauts 2025 OOD.
And this isn't a quirk of one model.
Across 2 backbone families and 3 scales, native fusion wins at every single scale.
Fusing modalities during pretraining yields features that are more brain-aligned than stitching unimodal streams together afterward.
Enter MIRAGE
Most encoding models pin a linear readout to one fixed layer. MIRAGE does neither.
A frozen omni-modal backbone (Qwen3-Omni) exposes all 48 layers → per-modality cross-attention gates pool them adaptively → a transformer maps to cortex non-linearly, with a per-subject head.
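A rough sketch of what "pooling layers with a gate" can look like, for intuition only. The thread does not spell out the gate's exact form, so this assumes the simplest version: one attention query per modality, no key/value projections, a softmax across the 48 layers at each timepoint. The name `gate_pool_layers`, the random queries, and the random activations are all hypothetical, and for brevity every gate reads the same layer stack.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along one axis."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def gate_pool_layers(hidden, query):
    """Pool a stack of layer activations with a single attention query.

    hidden: (n_layers, n_timepoints, dim) activations from a frozen backbone
    query:  (dim,) gate vector (learned in practice; random in this sketch)
    Returns pooled features (n_timepoints, dim) and
    attention weights (n_layers, n_timepoints).
    """
    dim = hidden.shape[-1]
    scores = hidden @ query / np.sqrt(dim)   # (n_layers, n_timepoints)
    weights = softmax(scores, axis=0)        # normalize across layers
    pooled = (weights[..., None] * hidden).sum(axis=0)
    return pooled, weights

rng = np.random.default_rng(0)
n_layers, n_timepoints, dim = 48, 5, 16
hidden = rng.standard_normal((n_layers, n_timepoints, dim))

# One gate per modality, each with its own query over the same layer stack.
queries = {m: rng.standard_normal(dim) for m in ("vision", "audio", "text")}
pooled = {m: gate_pool_layers(hidden, q)[0] for m, q in queries.items()}
print({m: p.shape for m, p in pooled.items()})
```

With an all-zero query the gate reduces to a plain average over layers, which is a handy sanity check before training anything.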
Each modality also traces a distinct anatomical pattern: vision → occipitotemporal, audio → auditory cortex, text → the language network.
MIRAGE's largest gains over our linear baseline land in visual & dorsal-attention areas, exactly where rich social-movie content demands integration.
Bonus: MIRAGE is inspectable. The gates' attention weights reveal which backbone layers each modality reads from.
Vision is sharply tuned to mid-depth layers (~25–30), text spreads across mid-to-late layers, and audio is the most diffuse.
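One way to put numbers on "sharply tuned" versus "diffuse" is to average a gate's attention weights over time and look at the peak layer and the entropy of the resulting profile. The sketch below assumes weights of shape (layers, timepoints) and uses two made-up weight patterns; `layer_profile` and `entropy` are hypothetical helper names, and none of the values come from MIRAGE.

```python
import numpy as np

def layer_profile(weights):
    """Average gate weights over time: (n_layers, n_timepoints) -> (n_layers,)."""
    profile = weights.mean(axis=1)
    return profile / profile.sum()

def entropy(p):
    """Shannon entropy in nats; higher means weight is spread over more layers."""
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

n_layers, n_timepoints = 48, 4
layers = np.arange(n_layers)

# Synthetic example 1: a narrow bump centered on layer 27 ("sharply tuned").
bump = np.exp(-0.5 * ((layers - 27) / 2.0) ** 2)
sharp = np.tile((bump / bump.sum())[:, None], (1, n_timepoints))

# Synthetic example 2: equal weight on every layer ("diffuse").
flat = np.full((n_layers, n_timepoints), 1.0 / n_layers)

for name, w in {"sharp": sharp, "flat": flat}.items():
    prof = layer_profile(w)
    print(name, "peak layer:", int(prof.argmax()), "entropy:", round(entropy(prof), 3))
```

A uniform profile reaches the maximum entropy of log(48), so entropy relative to that ceiling gives a simple diffuseness score per modality.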
Plus, we have a demo: play a clip and watch MIRAGE's predicted whole-brain activity light up in sync 🧠🍿
Paper: arxiv.org/abs/2605.29850
Code: github.com/epflneuroail...
Model: huggingface.co/epfl-neuroai...
Demo: mirage-brain.epfl.ch
Joint work w/ @akgokce.bsky.social & @mschrimpf.bsky.social