Presented by: Andrew Ghazi
Strain variation can strongly influence the impact of microbes on their environments, but inferential methods for quantifying these differences have been lacking. Metagenomic data with strain-level resolution has several features that make traditional statistical methods challenging to use, including high dimensionality, individual-specific strain carriage, and complex phylogenetic relatedness. We present ANPAN, an R package that consolidates methods for strain statistics in three key components. First, adaptive filtering methods specifically designed to interrogate microbial strain profiles are combined with linear models to identify strain-specific genetic elements associated with host health outcomes. Second, phylogenetic generalized linear mixed models are used to characterize the effect of strain-level community structure. Finally, random effects models are used to account for species abundance when assessing the impact of gene pathway abundance on outcome variables. We validated our methods by simulation, showing more accurate effect size estimation and a lower false positive rate than current methodologies. We then applied our methods to a dataset of 1,262 colorectal cancer (CRC) patients, identifying functionally adaptive genes and strong phylogenetic effects associated with CRC status. The open-source ANPAN repository, with help documentation and a tutorial vignette, is available at https://github.com/biobakery/anpan.
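To make the first component more concrete, the following is a minimal Python sketch of the general idea of filtering gene profiles and then fitting a linear model per gene. It is not ANPAN's API or the authors' implementation: ANPAN is an R package, and its adaptive filtering is replaced here by a fixed prevalence cutoff chosen purely for illustration. The function names (`filter_genes`, `gene_effects`), the thresholds, and the simulated data are all assumptions.

```python
import numpy as np

def filter_genes(presence, min_prev=0.1, max_prev=0.9):
    """Keep genes whose prevalence across samples lies in [min_prev, max_prev].

    Hypothetical stand-in for adaptive filtering: the thresholds are fixed
    here, not derived from the data.
    """
    prevalence = presence.mean(axis=0)
    return np.where((prevalence >= min_prev) & (prevalence <= max_prev))[0]

def gene_effects(presence, outcome, covariate, keep):
    """Fit outcome ~ intercept + gene + covariate by OLS for each kept gene.

    Returns {gene index: (effect estimate, t-statistic)} for the gene term.
    """
    n = presence.shape[0]
    results = {}
    for j in keep:
        X = np.column_stack([np.ones(n), presence[:, j], covariate])
        beta, *_ = np.linalg.lstsq(X, outcome, rcond=None)
        resid = outcome - X @ beta
        sigma2 = resid @ resid / (n - X.shape[1])   # residual variance
        cov = sigma2 * np.linalg.inv(X.T @ X)       # coefficient covariance
        results[int(j)] = (beta[1], beta[1] / np.sqrt(cov[1, 1]))
    return results

# Simulated data (assumption): 200 samples, 20 genes, one truly associated gene.
rng = np.random.default_rng(0)
n_samples, n_genes = 200, 20
presence = (rng.random((n_samples, n_genes)) < 0.5).astype(float)
presence[:, 0] = 1.0  # a gene present in every sample carries no signal
covariate = rng.normal(size=n_samples)
outcome = 2.0 * presence[:, 3] + 0.5 * covariate + rng.normal(size=n_samples)

keep = filter_genes(presence)
effects = gene_effects(presence, outcome, covariate, keep)
top_gene = max(effects, key=lambda j: abs(effects[j][1]))
print("genes kept:", len(keep), "strongest association: gene", top_gene)
```

In this toy setup the universally present gene is dropped before modeling, which shrinks the number of tests, and the simulated causal gene should surface with the largest test statistic. A real analysis would also need multiple-testing control and the phylogenetic and abundance adjustments described in the abstract.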
If you have any questions regarding the poster, feel free to reach out here.