Presented by: Andrew Ghazi
Microbial strain variation can strongly influence the impact of microbes on host health, but methods for quantitatively understanding these differences have been lacking. Strain data have several features that make traditional statistical methods challenging to apply, including high dimensionality, person-specific strain carriage, and complex phylogenetic relatedness. We present anpan, an R package that consolidates methods for strain statistics. Combining modern hierarchical modeling strategies with novel adaptive filtering methods specifically designed to interrogate microbial strain profiles, anpan facilitates the identification of strain-specific genetic elements associated with host health outcomes. Additionally, we use regularized phylogenetic generalized linear mixed models to characterize the effect of strain-level community structure. We validate our methods by simulation and by application to a dataset of 1262 colorectal cancer patients, showing that they achieve more accurate effect size estimation and a lower false positive rate than current methodologies. The open-source repository, with help documentation and a tutorial vignette, is available at https://github.com/biobakery/anpan.