Finding phylogeny-aware and biologically meaningful averages of metagenomic samples: L2UniFrac

Wei Wei, the Pennsylvania State University, United States
Andrew Millward, the Pennsylvania State University, United States
David Koslicki, Penn State University, United States

Metagenomic samples have high spatiotemporal variability. Hence, it is useful to summarize and characterize the microbial makeup of a given environment in a way that is biologically reasonable and interpretable. The UniFrac metric has been a robust and widely used metric for measuring the variability between metagenomic samples. We propose that the characterization of metagenomic environments can be achieved by finding the average, a.k.a. the barycenter, among the samples with respect to the UniFrac distance. However, it is possible that such a UniFrac-average includes negative entries, making it no longer a valid representation of a metagenomic community. To overcome this intrinsic issue, we propose a special version of the UniFrac metric, termed L2UniFrac, which inherits the phylogenetic nature of the traditional UniFrac and with respect to which one can easily compute the average, producing biologically meaningful environment-specific “representative samples”. We demonstrate the usefulness of such representative samples as well as the extended usage of L2UniFrac in efficient clustering of metagenomic samples, and provide mathematical characterizations and proofs of the desired properties of L2UniFrac. A prototype implementation is provided at: KoslickiLab/L2-UniFrac.git.
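The averaging idea can be illustrated with a minimal sketch. It assumes that L2UniFrac is the Euclidean distance between "pushed-up" vectors, in which each branch carries the total abundance beneath it scaled by the square root of the branch length. That form, the toy tree, and every function name below are assumptions made for illustration; they are not taken from the abstract or from the authors' implementation.

```python
import math

# Hypothetical toy tree: node -> (parent, branch length to parent).
# All non-root branch lengths are assumed to be positive.
TREE = {
    "root": (None, 0.0),
    "A": ("root", 1.0),
    "B": ("root", 2.0),
    "a1": ("A", 1.0),
    "a2": ("A", 1.0),
}
# Post-order: every child appears before its parent.
ORDER = ["a1", "a2", "A", "B", "root"]


def push_up(abund):
    """Map node abundances to per-branch values sqrt(length) * (mass below)."""
    mass = {n: abund.get(n, 0.0) for n in TREE}
    for n in ORDER:
        parent = TREE[n][0]
        if parent is not None:
            mass[parent] += mass[n]
    return {n: math.sqrt(TREE[n][1]) * mass[n]
            for n in TREE if TREE[n][0] is not None}


def l2_distance(p, q):
    """Euclidean distance between two pushed-up vectors."""
    return math.sqrt(sum((p[e] - q[e]) ** 2 for e in p))


def mean_vector(vectors):
    """Coordinate-wise mean, i.e. the barycenter under the L2 norm."""
    return {e: sum(v[e] for v in vectors) / len(vectors) for e in vectors[0]}


def inverse_push_up(pushed):
    """Recover node abundances from a pushed-up vector."""
    mass = {n: pushed[n] / math.sqrt(TREE[n][1]) for n in pushed}
    abund = dict(mass)
    for n in pushed:
        parent = TREE[n][0]
        if parent in abund:  # the root has no branch, so it is skipped
            abund[parent] -= mass[n]
    return abund


samples = [{"a1": 0.5, "B": 0.5}, {"a2": 1.0}]
pushed = [push_up(s) for s in samples]
representative = inverse_push_up(mean_vector(pushed))
```

Under these assumptions the push-up map is linear, so the mean of valid pushed-up vectors maps back to a non-negative abundance profile that sums to one.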

Bakdrive: Identifying a Minimum Set of Bacterial Species Driving Interactions across Multiple Microbial Communities

Qi Wang, Systems, Synthetic, and Physical Biology (SSPB) Graduate Program, Rice University, Houston, Texas, USA, United States
Michael Nute, Anvil Diagnostics, Southborough, MA, USA, United States
Todd Treangen, Department of Computer Science, Rice University, Houston, TX, USA, United States

Motivation: Interactions among microbes within microbial communities have been shown to play crucial roles in human health. In spite of recent progress, low-level knowledge of the bacteria driving microbial interactions within microbiomes is still lacking, limiting our ability to fully decipher and control microbial communities.

Results: We present a novel approach for identifying species driving interactions within microbiomes. Bakdrive infers ecological networks of given metagenomic sequencing samples and identifies minimum sets of driver species (MDS) using control theory. Bakdrive has three key innovations in this space: (i) it leverages inherent information from metagenomic sequencing samples to identify driver species, (ii) it explicitly takes host-specific variation into consideration, and (iii) it does not require a known ecological network. In extensive simulated data, we demonstrate that by identifying driver species from healthy donor samples and introducing them to the disease samples, we can restore the gut microbiome in recurrent Clostridioides difficile infection (rCDI) patients to a healthy state. We also applied Bakdrive to two real datasets, rCDI and Crohn's disease patients, uncovering driver species consistent with previous work. Bakdrive represents a novel approach for capturing microbial interactions.
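To make the notion of a minimum driver set concrete, here is a minimal sketch of one simple reachability criterion: every species in a directed ecological network must be reachable from at least one driver. A smallest set meeting that criterion takes one species from each strongly connected component that receives no edges from outside. The abstract does not say whether Bakdrive uses exactly this criterion, and the network, species labels, and function names are hypothetical.

```python
def reachable(graph, start):
    """Return all nodes reachable from start, including start itself."""
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def minimum_driver_set(graph):
    """Pick one node from every source strongly connected component."""
    nodes = set(graph) | {v for targets in graph.values() for v in targets}
    reach = {n: reachable(graph, n) for n in nodes}
    # Two nodes share a component when each can reach the other.
    components = {frozenset(m for m in reach[n] if n in reach[m])
                  for n in nodes}
    drivers = []
    for comp in components:
        has_incoming = any(v in comp
                           for u in nodes - comp
                           for v in graph.get(u, ()))
        if not has_incoming:
            drivers.append(min(comp))  # any member works; min is deterministic
    return sorted(drivers)


# Hypothetical network: an edge u -> v means species u influences species v.
network = {
    "s1": ["s2"],
    "s2": ["s3"],
    "s3": ["s2"],
    "s4": ["s3"],
    "s5": ["s6"],
    "s6": ["s5"],
}
drivers = minimum_driver_set(network)
print(drivers)  # ['s1', 's4', 's5']
```

The brute-force reachability computation is quadratic in the number of species, which is acceptable for a toy example but not for large inferred networks.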

Availability: Bakdrive is open-source and available at:


AdenPredictor: Accurate prediction of the adenylation domain specificity of nonribosomal peptide Biosynthetic Gene Clusters in Microbial Genomes

Mihir Mongia, Carnegie Mellon, United States
Romel Baral, Carnegie Mellon, United States
Abhinav Adduri, Carnegie Mellon, United States
Donghui Yan, Carnegie Mellon, United States
Yudong Liu, Carnegie Mellon, United States
Yuying Bian, Carnegie Mellon, United States
Paul Kim, Carnegie Mellon, United States
Bahar Behsaz, Carnegie Mellon, United States
Hosein Mohimani, Carnegie Mellon, United States

Microbial natural products represent a major source of bioactive compounds for drug discovery. Among these molecules, Non-Ribosomal Peptides (NRPs) represent a diverse class that includes antibiotics, immunosuppressants, anticancer agents, toxins, siderophores, pigments, and cytostatics. The discovery of novel NRPs remains a laborious process because many NRPs consist of non-standard amino acids that are assembled by Non-Ribosomal Peptide Synthetases (NRPSs). Adenylation domains (A-domains) in NRPSs are responsible for selection and activation of monomers appearing in NRPs. During the past decade, several support vector machine-based algorithms have been developed for predicting the specificity of the monomers present in NRPs. These algorithms utilize physicochemical features of the amino acids present in the A-domains of NRPSs. In this paper, we benchmarked the performance of various machine learning algorithms and features for predicting specificities of NRPSs, and we showed that the extra trees model paired with one-hot encoding features outperforms the existing approaches. Moreover, we show that unsupervised clustering of 453,560 A-domains reveals many clusters that correspond to potentially novel amino acids. While it is challenging to predict the chemical structure of these amino acids, we developed novel techniques to predict their various properties, including polarity, hydrophobicity, charge, and the presence of aromatic rings and of carboxyl and hydroxyl groups.
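As an illustration of the one-hot encoding features mentioned above, the sketch below turns a short residue signature into a flat 0/1 vector. The signature string, the handling of gaps, and the function name are assumptions for illustration. The classifier is omitted; a library implementation such as scikit-learn's ExtraTreesClassifier could be trained on vectors of this shape.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot(signature):
    """Encode each residue as a 20-dimensional indicator and concatenate."""
    vec = []
    for residue in signature:
        row = [0] * len(AMINO_ACIDS)
        if residue in INDEX:  # gaps or unknown residues stay all-zero
            row[INDEX[residue]] = 1
        vec.extend(row)
    return vec


# Hypothetical 10-residue signature containing one gap character.
features = one_hot("DAWTIAAV-K")
print(len(features), sum(features))  # 200 9
```

Each position contributes exactly one active bit unless it is a gap, so the vector length is 20 times the signature length.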


PhaVIP: Phage VIrion Protein classification based on chaos game representation and Vision Transformer

Jiayu Shang, Department of Electrical Engineering, City University of Hong Kong, Hong Kong (SAR), China, Hong Kong
Cheng Peng, Department of Electrical Engineering, City University of Hong Kong, Hong Kong (SAR), China, Hong Kong
Xubo Tang, Department of Electrical Engineering, City University of Hong Kong, Hong Kong (SAR), China, Hong Kong
Yanni Sun, Department of Electrical Engineering, City University of Hong Kong, Hong Kong (SAR), China, Hong Kong

Motivation: As viruses that mainly infect bacteria, phages are key players across a wide range of ecosystems. Analyzing phage proteins is indispensable for understanding phages' functions and roles in microbiomes. High-throughput sequencing enables us to obtain phages in different microbiomes at low cost. However, compared to the fast accumulation of newly identified phages, phage protein classification remains difficult. In particular, a fundamental need is to annotate virion proteins, the structural proteins such as major tail, baseplate, etc. Although there are experimental methods for virion protein identification, they are too expensive or time-consuming, leaving a large number of proteins unclassified. Thus, there is a great demand to develop a computational method for fast and accurate phage virion protein classification.

Results: In this work, we adapted the state-of-the-art image classification model, Vision Transformer, to conduct virion protein classification. By encoding protein sequences into unique images using chaos game representation, we can leverage Vision Transformer to learn both local and global features from sequence “images”. Our method, PhaVIP, has two main functions: classifying phage virion protein (PVP) and non-PVP sequences and annotating the types of PVP, such as capsid and tail. We tested PhaVIP on several datasets with increasing difficulty and benchmarked it against alternative tools. The experimental results show that PhaVIP has superior performance. After validating the performance of PhaVIP, we investigated two applications that can use the output of PhaVIP: phage taxonomy classification and phage host prediction. The results showed the benefit of using classified proteins over all proteins.
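For readers unfamiliar with chaos game representation, the following minimal sketch shows the classic four-corner version on a nucleotide string and bins the resulting points into a small count grid that can be treated as an image. The abstract does not say how PhaVIP handles the 20-letter protein alphabet, so the four-letter input, the corner layout, the grid size, and the function names are all assumptions for illustration.

```python
# Assumed corner layout for the unit square; other layouts are equally valid.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(seq):
    """Move halfway toward each symbol's corner, starting from the center."""
    x, y = 0.5, 0.5
    points = []
    for base in seq:
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2, (y + cy) / 2
        points.append((x, y))
    return points


def cgr_image(seq, size=4):
    """Count CGR points per cell of a size x size grid."""
    image = [[0] * size for _ in range(size)]
    for x, y in cgr_points(seq):
        col = min(int(x * size), size - 1)
        row = min(int(y * size), size - 1)
        image[row][col] += 1
    return image


image = cgr_image("ACGT")
print(cgr_points("A"))  # [(0.25, 0.25)]
```

Because each step halves the distance to a corner, the cell a point lands in depends on the most recent symbols, which is why such grids capture local sequence composition.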


SemiBin2: self-supervised contrastive learning leads to better MAGs for short- and long-read sequencing

Shaojun Pan, Fudan University, China
Xing-Ming Zhao, Fudan University, China
Luis Pedro Coelho, Fudan University, China

Motivation: Metagenomic binning methods to reconstruct metagenome-assembled genomes (MAGs) from environmental samples have been widely used in large-scale metagenomic studies. The recently proposed semi-supervised binning method, SemiBin, achieved state-of-the-art binning results in several environments. However, this required annotating contigs, a computationally costly and potentially biased process.

Results: We propose SemiBin2, which uses self-supervised learning to learn feature embeddings from the contigs. In simulated and real datasets, we show that self-supervised learning achieves better results than the semi-supervised learning used in SemiBin1 and that SemiBin2 outperforms other state-of-the-art binners. Compared to SemiBin1, SemiBin2 can reconstruct 8.3%–21.5% more high-quality bins and requires only 25% of the running time and 11% of peak memory usage in real short-read sequencing samples. To extend SemiBin2 to long-read data, we also propose an ensemble-based DBSCAN clustering algorithm, resulting in 13.1%–26.3% more high-quality genomes than the second-best binner for long-read data.

Availability and implementation: SemiBin2 is available as open source software at
SemiBin/ and the analysis script used in the study can be found at

Supplementary information: Supplementary data are available online.


PlasBin-flow: A flow-based MILP algorithm for plasmid contigs binning

Aniket Mane, Simon Fraser University, Canada
Mahsa Faizrahnemoon, Simon Fraser University, Canada
Tomas Vinar, Comenius University, Slovakia
Brona Brejova, Comenius University, Slovakia
Cedric Chauve, Simon Fraser University, Canada

The analysis of bacterial isolates to detect plasmids is important due to their role in the propagation of antimicrobial resistance. In short-read sequence assemblies, both plasmids and bacterial chromosomes are typically split into several contigs of various lengths, making identification of plasmids a challenging problem. In plasmid contig binning, the goal is to distinguish short-read assembly contigs based on their origin into plasmid and chromosomal contigs and subsequently sort plasmid contigs into bins, each bin corresponding to a single plasmid. Previous works on this problem consist of de novo approaches and reference-based approaches. De novo methods rely on contig features such as length, circularity, read coverage, or GC content. Reference-based approaches compare contigs to databases of known plasmids or plasmid markers from finished bacterial genomes.
Recent developments suggest that leveraging information contained in the assembly graph improves the accuracy of plasmid binning. We present PlasBin-flow, a hybrid method that defines contig bins as subgraphs of the assembly graph. PlasBin-flow identifies such plasmid subgraphs through a mixed integer linear programming model that relies on the concept of network flow to account for sequencing coverage, while also accounting for the presence of plasmid genes and the GC content that often distinguishes plasmids from chromosomes. We demonstrate the performance of PlasBin-flow on a real data set of bacterial samples.