And this isn't a quirk of one model.
Across 2 backbone families and 3 scales, native fusion beats post-hoc fusion at every single scale.
Fusing modalities during pretraining yields features that are more brain-aligned than stitching unimodal streams together afterward.