PhD in Computational Biology & ML for Proteins @EPFL
https://sites.google.com/view/damiano-sgarbossa
🎉 Excited to share that the last paper of my PhD is now published in PRX Life!
We introduce RAG-ESM, a retrieval-augmented framework that makes pretrained protein language models (like ESM2) homology-aware with minimal training cost.
📄 Paper: journals.aps.org/prxlife/abst...
Two exciting openings with us! 🤖🧬🆎🧫💉
- AI Scientist 👉 lnkd.in/eDXHH4E8
- AI Scientist, Drug Creation 👉 lnkd.in/eEvGyaTR
You'll work on antibody sequence/structure design, antibody-antigen co-folding, antibody-antigen binding prediction, physics-based methodologies, and more!
DMs welcome!
[1/8] 📄 New preprint! With Gionata Paolo Zalaffi & Anne-Florence Bitbol, we introduce ProteomeLM, a transformer that processes entire proteomes (prokaryotes and eukaryotes), enabling ultra-fast protein–protein interaction (PPI) prediction across the tree of life.
🔗 www.biorxiv.org/content/10.1...
With this, the last bit of my PhD at @embl.org is finally out!
We developed salad (sparse all-atom denoising), a family of blazing fast protein structure diffusion models.
Paper: nature.com/articles/s42256-…
Code: github.com/mjendrusch/salad
Data: zenodo.org/records/14711580
1/🧵
Happy to announce that our paper, "ProtMamba: a homology-aware but alignment-free protein state space model", has been published in Bioinformatics! 🎉
doi.org/10.1093/bioi...
🧬 ProtMamba applications include:
- Generating novel protein sequences conditioned on a given set of homologs,
- Inpainting specific regions within sequences,
- Modeling disordered regions of different protein sequences,
- Predicting the fitness of protein variants.
⚙️ ProtMamba is based on Mamba, a state space model that efficiently handles very long sequences. The model uses a fill-in-the-middle training objective, combining autoregressive modeling and masked language modeling to predict amino acids conditioned on the given homologs.
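To make the fill-in-the-middle idea concrete, here is a minimal sketch, not ProtMamba's actual code. It assumes a single masked span, and the token names (`<cls>`, `<mask-1>`, `<eos>`) and the function names are hypothetical. The span is cut out of the query sequence, replaced by a sentinel, and appended at the end, so that a left-to-right model predicts it with both flanks and the homologs in context.

```python
def fim_transform(seq, span_start, span_len, mask_token="<mask-1>", eos="<eos>"):
    """Move seq[span_start:span_start + span_len] to the end, behind a sentinel.

    Returns (tokens, target_start): positions from target_start onward hold the
    sentinel and the hidden residues that the model is asked to predict.
    """
    if span_start < 0 or span_len <= 0 or span_start + span_len > len(seq):
        raise ValueError("span must lie inside the sequence")
    prefix = list(seq[:span_start])
    middle = list(seq[span_start:span_start + span_len])
    suffix = list(seq[span_start + span_len:])
    # Context read by the model: prefix, sentinel in place of the span, suffix.
    context = prefix + [mask_token] + suffix + [eos]
    # Target appended after the context: the sentinel again, then the residues.
    target = [mask_token] + middle
    return context + target, len(context)


def build_conditioned_input(homologs, query, span_start, span_len, sep="<cls>"):
    """Concatenate unaligned homologs, then the FIM-transformed query."""
    tokens = []
    for homolog in homologs:
        tokens += [sep] + list(homolog)  # no alignment, just concatenation
    tokens.append(sep)
    offset = len(tokens)
    fim_tokens, target_start = fim_transform(query, span_start, span_len)
    return tokens + fim_tokens, offset + target_start


if __name__ == "__main__":
    tokens, target_start = build_conditioned_input(["MKV", "MRT"], "MKTAYIAK", 2, 3)
    print(" ".join(tokens))
    print("predict from position", target_start)
```

The homologs are concatenated as raw sequences, which is what "alignment-free" refers to here. How spans are sampled and how many are masked per sequence are left out of this sketch.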
🔍 ProtMamba is homology-aware yet alignment-free, meaning it captures evolutionary information without relying on multiple sequence alignments. This allows it to avoid the imperfections of MSAs but still use the information of other homologs to condition the generation!
📈 Despite its smaller size, ProtMamba is better than SOTA on conditional sequence generation and competitive with other protein language models on fitness prediction, showing the importance of long-context conditioning.
Read it here: doi.org/10.1093/bioi...
Github repo: github.com/Bitbol-Lab/P...
Damiano Sgarbossa
Cyril Malbranke
Michael Jendrusch
Language models starting from biological sequence data are advancing many inference problems, both at the scale of single proteins, and at the scale of genomic neighborhoods. In this paper, we introduce ProteomeLM, a transformer-based language model that reasons on entire proteomes from species spanning the tree of life. Leveraging protein language model embeddings, ProteomeLM is trained to reconstruct masked protein embeddings using the whole proteomic context. It thus learns contextualized protein representations reflecting proteome-scale functional constraints. We show that ProteomeLM spontaneously captures protein-protein interactions (PPI) in its attention coefficients. We demonstrate that it screens whole interactomes orders of magnitude faster than amino-acid coevolution-based methods, and substantially outperforms them. We further develop ProteomeLM-PPI, a supervised PPI prediction network that combines ProteomeLM embeddings and attention coefficients, and achieves state-of-the-art performance across species and benchmarks. Finally, we introduce ProteomeLM-Ess, a supervised predictor of gene essentiality that generalizes across diverse taxa. Our results highlight the power of proteome-scale language models for addressing function and interactions at the organism level.
Competing Interest Statement: The authors have declared no competing interest.
Funder: European Research Council, https://ror.org/0472cxd90, 851173
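As a rough illustration of the "reconstruct masked protein embeddings" objective described in the abstract, here is a toy sketch in plain Python. It is not the ProteomeLM implementation: the zero-vector mask, the mean-squared-error loss, the mask fraction, and all function names are assumptions made for the example. The model itself is omitted, and only the masking and the loss on masked positions are shown.

```python
import random


def mask_embeddings(embeddings, mask_fraction=0.5, seed=0):
    """Replace a random subset of per-protein embeddings with zero vectors.

    embeddings: list of equal-length float lists, one per protein in a proteome.
    Returns (inputs, masked_idx), where masked_idx is sorted.
    """
    rng = random.Random(seed)
    n = len(embeddings)
    n_masked = max(1, round(mask_fraction * n))
    masked_idx = sorted(rng.sample(range(n), n_masked))
    masked = set(masked_idx)
    dim = len(embeddings[0])
    inputs = [[0.0] * dim if i in masked else list(e)
              for i, e in enumerate(embeddings)]
    return inputs, masked_idx


def reconstruction_loss(predicted, original, masked_idx):
    """Mean squared error, computed only over the masked proteins."""
    total, count = 0.0, 0
    for i in masked_idx:
        for p, t in zip(predicted[i], original[i]):
            total += (p - t) ** 2
            count += 1
    return total / count


if __name__ == "__main__":
    proteome = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
    inputs, idx = mask_embeddings(proteome)
    # A real model would predict the masked rows from the unmasked context;
    # here the masked input itself stands in as a deliberately poor prediction.
    print("masked proteins:", idx)
    print("loss of trivial prediction:", reconstruction_loss(inputs, proteome, idx))
```

Computing the loss only on masked positions means the network gets no credit for copying embeddings it can already see, so it has to rely on the rest of the proteome.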
‘Salad’ – a new AI model from EMBL scientists – offers major improvements in synthetic protein design.
Salad is significantly faster than comparable methods, and designing proteins that don't exist in nature can have applications in many scientific fields.
www.nature.com/articles/s42...