Bacteriophage classification for assembled contigs using Graph Convolutional Network

Jiayu Shang, Jingzhe Jiang and Yanni Sun

Motivation: Bacteriophages (aka phages), which mainly infect bacteria, play key roles in the biology of microbes. As the most abundant biological entities on the planet, the number of discovered phages is only the tip of the iceberg. Recently, many new phages have been revealed using high throughput sequencing, particularly metagenomic sequencing. Compared to the fast accumulation of phage-like sequences, there is a serious lag in the taxonomic classification of phages. High diversity, abundance, and limited known phages pose great challenges for taxonomic analysis. In particular, alignment-based tools have difficulty in classifying fast accumulating contigs assembled from metagenomic data.

Result: In this work, we present a novel semi-supervised learning model, named PhaGCN, to conduct taxonomic classification for phage contigs. In this learning model, we construct a knowledge graph by combining the DNA sequence features learned by convolutional neural network (CNN) and protein sequence similarity gained from gene-sharing network. Then we apply graph convolutional network (GCN) to utilize both the labeled and unlabeled samples in training to enhance the learning ability. We tested PhaGCN on both simulated and real sequencing data. The results clearly show that our method competes favorably against available phage classification tools.

 

Umibato: estimation of time-varying microbial interaction using continuous-time regression hidden Markov model 

Shion Hosoda, Tsukasa Fukunaga and Michiaki Hamada

Motivation: Accumulating evidence has highlighted the importance of microbial interaction networks. Methods have been developed for estimating microbial interaction networks,
of which the generalized Lotka-Volterra equation (gLVE)-based method can estimate a directed interaction network. The previous gLVE-based method for estimating microbial interaction networks did not consider time-varying interactions.

Results: In this study, we developed unsupervised learning based microbial interaction inference method using Bayesian estimation (Umibato), a method for estimating time-varying microbial interactions. The Umibato algorithm comprises Gaussian process regression (GPR) and a new Bayesian probabilistic model, the continuous-time regression hidden Markov model (CTRHMM). Growth rates are estimated by GPR, and interaction networks are estimated by CTRHMM. CTRHMM can estimate time-varying interaction networks using interaction states, which are defined as hidden variables. Umibato outperformed the existing methods on synthetic datasets. In addition, it yielded reasonable estimations in experiments on a mouse gut microbiota dataset, thus providing novel insights into the relationship between consumed diets and the gut microbiota.

 

Statistical approaches for differential expression analysis in metatranscriptomics

Yancong Zhang, Kelsey Thompson, Curtis Huttenhower and Eric Franzosa

Motivation: Metatranscriptomics (MTX) has become an increasingly practical way to profile the functional activity of microbial communities in situ. However, MTX remains underutilized due to experimental and computational limitations. The latter are complicated by non-independent changes in both RNA transcript levels and their underlying genomic DNA copies (as microbes simultaneously change their overall abundance in the population and regulate individual transcripts), genetic plasticity (as whole loci are frequently gained and lost in microbial lineages), and measurement compositionality and zero-inflation. Here, we present a systematic evaluation of and recommendations for differential expression (DE) analysis in MTX.

Results: We designed and assessed six statistical models for DE discovery in MTX that incorporate different combinations of DNA and RNA normalization and assumptions about the underlying changes of gene copies or species abundance within communities. We evaluated these models on multiple simulated and real multi-omic datasets. Models adjusting transcripts relative to their encoding gene copies as a covariate were significantly more accurate in identifying DE from MTX in both simulated and real datasets. Moreover, we show that when paired DNA measurements (metagenomic data, MGX) are not available, models normalizing MTX measurements within-species while also adjusting for total-species RNA balance sensitivity, specificity, and interpretability of DE detection, as does filtering likely technical zeros. The efficiency and accuracy of these models pave the way for more effective MTX-based DE discovery in microbial communities.

Availability: The analysis code and synthetic datasets used in this evaluation are available online at http://huttenhower.sph.harvard.edu/mtx2021.