Taken together, GRG v2 and grapp demonstrate that moving from tabular to GRG-based representations can deliver substantial gains in speed, memory, and cost, while leveraging the rich Python scientific computing ecosystem.
Huge thanks to the whole team (Drew DeHaas, Chris Adonizio, Ziqing Pan) 💙
April Wei
Using these operators, SciPy-based PCA can be implemented in four lines of Python. PCA on 89M variants runs in 2–4 hours, 51–492× faster than existing methods.
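To give a feel for what operator-based PCA can look like, here is a minimal sketch using only SciPy. It is not grapp's code: `pca_scores` is a hypothetical helper, and the operator is a small dense toy matrix wrapped with `aslinearoperator` as a stand-in for a GRG-backed one. The only assumption about the operator is that it supports products with the standardized genotype matrix and its transpose.

```python
import numpy as np
from scipy.sparse.linalg import aslinearoperator, svds

def pca_scores(Z_op, k):
    """Top-k PC scores from any operator exposing matvec/rmatvec (hypothetical helper)."""
    U, s, _ = svds(Z_op, k=k)          # truncated SVD; needs only matrix-vector products
    order = np.argsort(s)[::-1]        # sort components from largest to smallest
    return U[:, order] * s[order], s[order]

# Toy stand-in: 40 samples x 60 variants, standardized per variant.
rng = np.random.default_rng(1)
G = rng.binomial(2, 0.3, size=(40, 60)).astype(float)
G = G[:, G.std(axis=0) > 0]            # drop monomorphic variants
Z = (G - G.mean(axis=0)) / G.std(axis=0)

scores, singular_values = pca_scores(aslinearoperator(Z), k=3)
```

Because `svds` only ever asks for products with the operator, swapping the dense stand-in for an implicit operator should not require changing `pca_scores`.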
Very proud to share our new work, "General, orders-of-magnitude faster whole-genome analysis with genotype representation graphs (GRG)". We topped ourselves in this one 🚀 and made GRG a practical foundation for biobank-scale population and statistical genetics. www.biorxiv.org/content/10.6...
Since then, we have been working towards removing the barriers to broader adoption of GRG by both method developers and empirical researchers. We started with phenotype simulations (academic.oup.com/bioinformati...) and showed that GRG enables orders-of-magnitude faster simulation than ARGs.
This scalability also enables a leave-one-chromosome-out (LOCO) approach to GWAS covariate construction that avoids LD artifacts (later PCs capture local LD) without requiring LD pruning. Once computation is no longer the limit, methods can be chosen on statistical grounds rather than feasibility.
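The LOCO idea can be sketched in a few lines: for each chromosome, compute PCs from the variants on all other chromosomes and use those as covariates when testing variants on the held-out one. The code below is an illustration on a dense toy matrix with NumPy's full SVD, not the GRG implementation. `loco_pcs` and the `chrom` label array are hypothetical names.

```python
import numpy as np

def loco_pcs(Z, chrom, k):
    """Map each chromosome to PC scores computed WITHOUT that chromosome's variants.

    Z     : (n_samples, n_variants) standardized genotype matrix (dense toy stand-in)
    chrom : length n_variants array of chromosome labels
    k     : number of PCs to keep
    """
    chrom = np.asarray(chrom)
    pcs = {}
    for c in np.unique(chrom):
        Z_rest = Z[:, chrom != c]                        # leave chromosome c out
        U, s, _ = np.linalg.svd(Z_rest, full_matrices=False)
        pcs[c.item()] = U[:, :k] * s[:k]                 # covariates for testing chromosome c
    return pcs

# Toy example: 20 samples, 3 "chromosomes" with 10 variants each.
rng = np.random.default_rng(2)
Z = rng.normal(size=(20, 30))
chrom = np.repeat([1, 2, 3], 10)
pcs = loco_pcs(Z, chrom, k=2)
```

This costs one decomposition per chromosome instead of one in total, so the approach is only attractive when each PCA is cheap.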
We introduce grapp, a collection of GRG-based command-line tools that resembles PLINK2: variant and sample filtering, GWAS with covariates, PCA, and data export as native graph operations. Routine analyses can now be done easily and orders-of-magnitude faster with grapp, with minimal upfront cost.
The GRG is an ARG-motivated representation that compactly and losslessly encodes genotypes. It is both a file format and a computational data structure. About two years ago (www.nature.com/articles/s43...), we introduced GRG, its relation to the ARG, a construction algorithm, GWAS, and its scalability promise.
We also provide linear operators compatible with SciPy’s sparse linear algebra interface, enabling extremely efficient implicit multiplication against the standardized genotype matrix, the linkage disequilibrium (LD) matrix, and the genetic relatedness matrix, none of which are ever materialized.
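To illustrate what "implicit multiplication" means here, the following sketch builds a SciPy `LinearOperator` for a standardized genotype matrix Z = (G - 2p) / sqrt(2p(1-p)) without ever forming Z. It is an assumption-laden toy, not the grapp API: the raw genotypes live in a small dense NumPy array where a GRG would instead traverse its graph, and `standardized_operator` and `grm_matvec` are hypothetical names.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator

def standardized_operator(G):
    """Implicit operator for Z = (G - 2p) / sqrt(2p(1-p)); Z itself is never built."""
    G = np.asarray(G, dtype=np.float64)
    n, m = G.shape
    mu = G.mean(axis=0)                      # per-variant mean, equal to 2p
    p = mu / 2.0
    sd = np.sqrt(2.0 * p * (1.0 - p))        # assumes no monomorphic variants

    def matvec(v):                           # Z @ v
        w = np.ravel(v) / sd
        return G @ w - np.dot(mu, w)         # centering becomes one scalar shift

    def rmatvec(u):                          # Z.T @ u
        u = np.ravel(u)
        return (G.T @ u - mu * u.sum()) / sd

    return LinearOperator((n, m), matvec=matvec, rmatvec=rmatvec, dtype=np.float64)

def grm_matvec(Z_op, u):
    """(Z Z^T / m) @ u as two implicit products; the GRM is never formed."""
    return Z_op.matvec(Z_op.rmatvec(u)) / Z_op.shape[1]

# Toy data: 20 samples x 8 variants, monomorphic variants dropped.
rng = np.random.default_rng(0)
G = rng.binomial(2, 0.3, size=(20, 8)).astype(float)
G = G[:, G.std(axis=0) > 0]
Z_op = standardized_operator(G)
```

The LD matrix product follows the same pattern in the other order: apply `matvec`, then `rmatvec`, and divide by the number of samples.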
Here, we introduce a new construction algorithm that reduces construction time by 10–20×, halves the disk and RAM footprint, and improves load time by more than 20× relative to v1. GRG construction is now so fast that building a GRG directly from .vcf.gz can be faster than converting .vcf.gz to PGEN (PLINK2).
GRG is now the smallest practical phased genotype format. Applied to the UK Biobank WGS dataset (490,541 individuals; 706,556,181 variants), GRG v2 produces files 25× smaller than .vcf.gz (122GB vs. 3TB) and more than 8× smaller than PLINK2’s PGEN, at a total construction cost of less than 90 GBP.