We show that PanBART, like other deep-learning methods, is sensitive to "out-of-distribution" data. But this isn't necessarily a bad thing! We can leverage this sensitivity by using a measure of model confidence, "pseudolikelihoods", to identify newly emerging lineages! (5/8)
We show that PanBART can accurately represent a phylogeny, clustering genomes of the same lineage in high agreement with PopPUNK and outperforming accessory-only Sketchlib, which represents population structure using gene presence/absence only and ignores gene order. (3/8)
Fast Set Operations for Compact k-mer Sets https://www.biorxiv.org/content/10.64898/2026.05.24.727514v1
Finally, we explore gene-gene epistasis, identifying a theorised, but previously unobserved, association between an iron-regulated bacteriocin and siderophore in E. coli. This same association is not identified by Spydrpick. (7/8)
PanBART can also be used to predict whether a genome will "take up" a gene of interest. We are able to accurately identify E. coli lineages that are likely to gain an extended-spectrum antibiotic resistance gene, meaning we can predict which strains might become drug resistant! (6/8)