I want to reshare @brandfonbrener.bsky.social's @NeurIPSConf 2024 paper on CoLoR-Filter: A simple yet powerful method for selecting high-quality data for language model pre-training!
With @hlzhang109.bsky.social @schwarzjn.bsky.social @shamkakade.bsky.social
✅ Pretrained on 3.5M CXRs to study scaling laws for radiology models
✅ Compared MedImageInsight (CLIP-based) vs RAD-DINO (DINOv2-based)
✅ Found that structured labels + text can significantly boost performance
✅ Showed that as little as 30k in-domain samples can outperform public foundation models
including not just findings but also lines & tubes classification/segmentation and report generation. We also test the effect of adding structured labels alongside reports during CLIP-style pretraining, and study scaling laws under these controlled conditions.
What a damning abstract
🩻 Excited to share our latest preprint: “Data Scaling Laws for Radiology Foundation Models”
Foundation vision encoders like CLIP and DINOv2 have transformed general computer vision, but what happens when we scale them for medical imaging?
📄 Read the full preprint here: arxiv.org/abs/2509.12818