Integrating reference- and assembly-based methods for improved viral identification from metagenomes, metatranscriptomes, and viromes

Presented by: Jordan Jensen


Capturing an accurate representation of the viral members of a microbial community presents significant experimental and computational challenges. Sample preparation approaches for virus-like particle (VLP) enrichment vary greatly in efficiency across protocols and environments, and viral sequences from any data type (metagenomes, metatranscriptomes, or VLP enrichments) can be difficult to identify computationally. Limitations include small viral genome sizes, and consequently a small proportion of viral genetic content in samples; a lack of universal marker genes; multiple nucleic acid backbone types; rapid evolution, recombination, and sequence divergence; and, most prominently, a lack of well-characterized viral reference databases.
To address these limitations, we developed BAQLaVa (Bioinformatic Application for Quantification and Labeling of Viral taxonomy), which integrates reference- and assembly-based methods to provide viral profiles from shotgun DNA or RNA sequencing (with or without enrichment). Reads are compared with both nucleotide and protein (translated) databases that are pre-screened for viral identification using a modification of the MetaPhlAn algorithm and reconciled with the most recent International Committee on Taxonomy of Viruses (ICTV) taxonomic rankings. In parallel, assembled contigs are classified using deep learning, and viral identifications from all three approaches (nucleotide search, translated protein search, and contig classification) are harmonized per sample.
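
The per-sample harmonization step can be pictured with a minimal sketch. Nothing here is taken from BAQLaVa itself: the function name `harmonize`, the input layout (one dictionary of taxon to relative abundance per approach), and the rule of keeping the abundance from the first approach that reports a taxon are all assumptions made only for illustration.

```python
def harmonize(nucleotide, protein, contig):
    """Merge per-sample viral calls from three approaches (hypothetical rule).

    Each argument maps a taxon name to a relative abundance. The merged
    profile keeps the abundance from the first approach that reports the
    taxon (nucleotide, then protein, then contig) and records every
    approach that supports it. This ordering is an illustrative assumption.
    """
    sources = [("nucleotide", nucleotide), ("protein", protein), ("contig", contig)]
    merged = {}
    for name, calls in sources:
        for taxon, abundance in calls.items():
            # setdefault keeps the abundance already stored by an earlier source.
            entry = merged.setdefault(taxon, {"abundance": abundance, "support": []})
            entry["support"].append(name)
    return merged


# Hypothetical calls for one sample.
profile = harmonize(
    nucleotide={"taxon_A": 0.6},
    protein={"taxon_A": 0.5, "taxon_B": 0.2},
    contig={"taxon_C": 0.1},
)
for taxon, entry in sorted(profile.items()):
    print(taxon, entry["abundance"], ",".join(entry["support"]))
```

Recording which approaches support each taxon keeps the merged profile auditable, so a call backed only by contig classification can be treated differently from one backed by all three.
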
We evaluated BAQLaVa with (1) in silico simulated data representing broadly viral material, (2) more detailed synthetic gut microbiomes, and (3) existing human gut metagenomes, metatranscriptomes, and VLP-enriched viromes. Using only nucleotide and protein references, we found that BAQLaVa achieves both greater sensitivity and greater specificity than existing tools (including MetaPhlAn). We capture 57-87% of both DNA and RNA viral content, even in highly novel communities, with a positive predictive value (PPV) of 73-97% and a false positive rate (FPR) of 2.4-2.5%. This work is ongoing, including optimization of parameters and contig lengths for assembly classification, as well as error rate calculations for quantitative viral abundance profiling vs. qualitative detection. We hope these methods will unlock as-yet-unaccessed information on viral community members from thousands of existing metagenomes and metatranscriptomes, and enable more accurate characterization of future VLP enrichments from a variety of microbial environments.
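
For readers less familiar with the evaluation metrics, the sketch below shows how sensitivity, PPV, and FPR follow from the standard confusion-matrix definitions when a set of detected taxa is compared with a known truth set, as in a simulated community. The taxon names, the candidate universe, and the helper names are hypothetical, and this is not the evaluation code behind the figures reported above.

```python
import math
from dataclasses import dataclass


@dataclass
class Counts:
    tp: int  # taxa present and detected
    fp: int  # taxa absent but detected
    fn: int  # taxa present but missed
    tn: int  # taxa absent and not detected


def confusion(truth, called, universe):
    """Count outcomes over a fixed universe of candidate taxa."""
    return Counts(
        tp=len(truth & called),
        fp=len(called - truth),
        fn=len(truth - called),
        tn=len(universe - truth - called),
    )


def _ratio(numerator, denominator):
    # An undefined metric (zero denominator) is reported as NaN.
    return numerator / denominator if denominator else math.nan


def sensitivity(c):
    return _ratio(c.tp, c.tp + c.fn)


def ppv(c):
    return _ratio(c.tp, c.tp + c.fp)


def fpr(c):
    return _ratio(c.fp, c.fp + c.tn)


# Hypothetical simulated sample: 4 taxa truly present, 10 candidates in total.
truth = {"v1", "v2", "v3", "v4"}
called = {"v1", "v2", "v3", "x1"}
universe = truth | {"x1", "x2", "x3", "x4", "x5", "x6"}

counts = confusion(truth, called, universe)
print(f"sensitivity={sensitivity(counts):.2f} "
      f"PPV={ppv(counts):.2f} FPR={fpr(counts):.3f}")
```

FPR depends on how many true negatives are counted, so the choice of candidate universe (here an arbitrary set of ten taxa) strongly affects that value.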
