We've been looking at how to compare and cluster large numbers of genomes, including those in large isolate databases like AllTheBacteria and metagenome assemblies (e.g. SPIRE, MGnify).
On a combined dataset of 5.6 million assemblies, we can now cluster/dereplicate everything in under a day!
John Lees
🧬 New preprint! We clustered 5.6 million bacterial genomes into genomically cohesive units (GCUs) 500× faster than existing tools (in just 14 hours, using 48 CPUs and 16.5 GB RAM). 🦠🐙 Meet gemsparcl 💎✨!
www.biorxiv.org/content/10.6...