Journal of Computational Biology

kmer2vec: A Novel Method for Comparing DNA Sequences by word2vec Embedding

Ruohan Ren — Fri, 20 May 2022 06:00:00 -0400

J Comput Biol. 2022 May 20. doi: 10.1089/cmb.2021.0536. Online ahead of print.

ABSTRACT

The comparison of DNA sequences is of great significance in genomics analysis. Although the traditional multiple sequence alignment (MSA) method is popularly used for evolutionary analysis, optimally aligning k sequences becomes computationally intractable when k increases due to the intrinsic computational complexity of MSA. Although numerous k-mer alignment-free methods have been proposed, they may not truly capture the contextual structures of the sequences. In this study, we present a novel k-mer contextual alignment-free method (called kmer2vec), in which the sequence k-mers are semantically embedded as vectors using word2vec, an essential technique in natural language processing. Consequently, the method converts each DNA/RNA sequence into a point in the high-dimensional word2vec space and compares DNA sequences in that space. Because the word2vec vectors are trained from the contextual relationship of k-mers in the genomes, the method may extract valuable structural information from the sequences and reflect the relationship among them properly. The proposed method is optimized on the parameters from word2vec training and verified in the phylogenetic analysis of large whole genomes, including coronavirus and bacterial genomes. The results demonstrate the effectiveness of the method on phylogenetic tree construction and species clustering. The method runs much faster than the MSA method, and the phylogenetic relationships it constructs are more accurate than those of the conventional k-mer alignment-free method. Therefore, this approach can provide new perspectives for phylogeny and evolution and make it possible to analyze large genomes. In addition, we discuss special parameterization in the k-mer word2vec embedding construction. An effective tool for rapid SARS-CoV-2 typing can also be derived when combining kmer2vec with clustering methods.
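
The idea of mapping a sequence to a point in an embedding space can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say how k-mer vectors are aggregated, so averaging is an assumption, and the names `kmers`, `sequence_vector`, `cosine_distance`, and `toy_embedding` are hypothetical. Random vectors stand in for trained word2vec vectors.

```python
import numpy as np

def kmers(seq, k):
    """Return the overlapping k-mers of seq, skipping any that contain non-ACGT symbols."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)
            if set(seq[i:i + k]) <= set("ACGT")]

def sequence_vector(seq, k, embedding):
    """Average the embedding vectors of the k-mers in seq (assumed aggregation rule)."""
    vecs = [embedding[w] for w in kmers(seq, k) if w in embedding]
    if not vecs:
        raise ValueError("no k-mer of the sequence has an embedding")
    return np.mean(vecs, axis=0)

def cosine_distance(u, v):
    """One possible distance between two sequence vectors."""
    return 1.0 - float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy embedding: random 8-dimensional vectors for all 3-mers (placeholder for trained vectors).
rng = np.random.default_rng(0)
toy_embedding = {a + b + c: rng.normal(size=8)
                 for a in "ACGT" for b in "ACGT" for c in "ACGT"}

v1 = sequence_vector("ACGTACGTAC", 3, toy_embedding)
v2 = sequence_vector("TTGACCATGA", 3, toy_embedding)
distance = cosine_distance(v1, v2)
```

A pairwise matrix of such distances could then feed a standard tree-building or clustering routine.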

PMID:35593919 | DOI:10.1089/cmb.2021.0536

Extracting Information from Gene Coexpression Networks of Rhizobium leguminosarum

Javier Pardo-Diaz — Thu, 19 May 2022 06:00:00 -0400

J Comput Biol. 2022 May 19. doi: 10.1089/cmb.2021.0600. Online ahead of print.

ABSTRACT

Nitrogen uptake in legumes is facilitated by bacteria such as Rhizobium leguminosarum. For this bacterium, gene expression data are available, but functional gene annotation is less well developed than for other model organisms. More annotations could lead to a better understanding of the pathways for growth, plant colonization, and nitrogen fixation in R. leguminosarum. In this study, we present a pipeline that combines novel scores from gene coexpression network analysis in a principled way to identify the genes that are associated with certain growth conditions or highly coexpressed with a predefined set of genes of interest. This association may lead to putative functional annotation or to a prioritized list of genes for further study.

PMID:35588362 | DOI:10.1089/cmb.2021.0600

Translator: A Transfer Learning Approach to Facilitate Single-Cell ATAC-Seq Data Analysis from Reference Dataset

Siwei Xu — Wed, 18 May 2022 06:00:00 -0400

J Comput Biol. 2022 May 17. doi: 10.1089/cmb.2021.0596. Online ahead of print.

ABSTRACT

Recent advances in single-cell sequencing assay for transposase-accessible chromatin (scATAC-seq) have allowed simultaneous epigenetic profiling over thousands of individual cells to dissect the cellular heterogeneity and elucidate regulatory mechanisms at the finest possible resolution. However, scATAC-seq is challenging to model computationally due to the ultra-high dimensionality, low signal-to-noise ratio, complex feature interactions, and high vulnerability to various confounding factors. In this study, we present Translator, an efficient transfer learning approach to capture generalizable chromatin interactions from high-quality (HQ) reference scATAC-seq data to obtain robust cell representations in low-to-moderate quality target scATAC-seq data. We applied Translator on various simulated and real scATAC-seq datasets and demonstrated that Translator could learn more biologically meaningful cell representations than other methods by incorporating information learned from the reference data, thus facilitating various downstream analyses such as clustering and motif enrichment measurements. Moreover, Translator's block-wise deep learning framework can handle nonlinear relationships with restricted connections using fewer parameters to boost computational efficiency through Graphics Processing Unit (GPU) parallelism. Finally, we have implemented Translator as a free software package available for the community to leverage large-scale, HQ reference data to study target scATAC-seq data.

PMID:35584295 | DOI:10.1089/cmb.2021.0596

Locality-Sensitive Hashing-Based k-Mer Clustering for Identification of Differential Microbial Markers Related to Host Phenotype

Wontack Han — Wed, 18 May 2022 06:00:00 -0400

J Comput Biol. 2022 May 17. doi: 10.1089/cmb.2021.0640. Online ahead of print.

ABSTRACT

Microbial organisms play important roles in many aspects of human health and diseases. Encouraged by the numerous studies that show the association between microbiomes and human diseases, computational and machine learning methods have been recently developed to generate and utilize microbiome features for prediction of host phenotypes such as disease versus healthy or cancer immunotherapy responder versus nonresponder. We have previously developed a subtractive assembly approach, which focuses on extraction and assembly of differential reads from metagenomic data sets that are likely sampled from differential genomes or genes between two groups of microbiome data sets (e.g., healthy vs. disease). In this article, we further improved our subtractive assembly approach by utilizing groups of k-mers with similar abundance profiles across multiple samples. We implemented a locality-sensitive hashing (LSH)-enabled approach (called kmerLSHSA) to group billions of k-mers into k-mer coabundance groups (kCAGs), which were subsequently used for the retrieval of differential kCAGs for subtractive assembly. Testing of the kmerLSHSA approach on simulated data sets and real microbiome data sets showed that, compared with the conventional approach that utilizes all genes, our approach can quickly identify differential genes that can be used for building promising predictive models for microbiome-based host phenotype prediction. We also discussed other potential applications of LSH-enabled clustering of k-mers according to their abundance profiles across multiple microbiome samples.
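
Grouping k-mers by abundance profile with LSH can be sketched as follows. The abstract does not name the hash family, so this example assumes random-hyperplane (sign) hashing, one common LSH scheme for cosine similarity; `lsh_buckets` and the toy profiles are hypothetical and not taken from kmerLSHSA.

```python
from collections import defaultdict
import numpy as np

def lsh_buckets(profiles, n_bits=8, seed=0):
    """Bucket k-mers whose abundance profiles get the same random-hyperplane signature."""
    rng = np.random.default_rng(seed)
    names = list(profiles)
    X = np.array([profiles[n] for n in names], dtype=float)
    # Centre and scale each profile so the hash reflects its shape, not its magnitude.
    X = X - X.mean(axis=1, keepdims=True)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    X = X / np.where(norms == 0, 1.0, norms)
    planes = rng.normal(size=(X.shape[1], n_bits))  # one random hyperplane per bit
    bits = (X @ planes) >= 0                        # sign pattern = hash signature
    buckets = defaultdict(list)
    for name, row in zip(names, bits):
        buckets[row.tobytes()].append(name)
    return list(buckets.values())

# Toy abundance profiles across four samples (made-up numbers).
profiles = {
    "k1": [1, 2, 3, 10],
    "k2": [2, 4, 6, 20],   # proportional to k1, so it has the same shape
    "k3": [10, 3, 2, 1],
}
groups = lsh_buckets(profiles)
```

Profiles with the same shape always share a signature, while similar but not identical profiles collide with a probability that grows with their cosine similarity. This is what lets the bucketing scale to very large numbers of k-mers without all-pairs comparison.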

PMID:35584271 | DOI:10.1089/cmb.2021.0640

WITCH: Improved Multiple Sequence Alignment Through Weighted Consensus Hidden Markov Model Alignment

Chengze Shen — Mon, 16 May 2022 06:00:00 -0400

J Comput Biol. 2022 May 17. doi: 10.1089/cmb.2021.0585. Online ahead of print.

ABSTRACT

Accurate multiple sequence alignment is challenging on many data sets, including those that are large, evolve under high rates of evolution, or have sequence length heterogeneity. While substantial progress has been made over the last decade in addressing the first two challenges, sequence length heterogeneity remains a significant issue for many data sets. Sequence length heterogeneity occurs for biological and technological reasons, including large insertions or deletions (indels) that occurred in the evolutionary history relating the sequences, or the inclusion of sequences that are not fully assembled. Ultra-large alignments using Phylogeny-Aware Profiles (UPP) (Nguyen et al. 2015) is one of the most accurate approaches for aligning data sets that exhibit sequence length heterogeneity: it constructs an alignment on the subset of sequences it considers "full-length," represents this "backbone alignment" using an ensemble of hidden Markov models (HMMs), and then adds each remaining sequence into the backbone alignment based on an HMM selected for that sequence from the ensemble. Our new method, WeIghTed Consensus Hmm alignment (WITCH), improves on UPP in three important ways: first, it uses a statistically principled technique to weight and rank the HMMs; second, it uses k>1 HMMs from the ensemble rather than a single HMM; and third, it combines the alignments for each of the selected HMMs using a consensus algorithm that takes the weights into account. We show that this approach provides improved alignment accuracy compared with UPP and other leading alignment methods, as well as improved accuracy for maximum likelihood trees based on these alignments.

PMID:35575747 | DOI:10.1089/cmb.2021.0585

Improvements Achieved by Multiple Imputation for Single-Cell RNA-Seq Data in Clustering Analysis and Differential Expression Analysis

Mengqiu Zhu — Mon, 16 May 2022 06:00:00 -0400

J Comput Biol. 2022 May 16. doi: 10.1089/cmb.2021.0597. Online ahead of print.

ABSTRACT

In a single-cell RNA-seq (scRNA-seq) data set, a high proportion of missing values (or an excessive number of zeroes) are frequently observed. For the related follow-up tasks, such as clustering analysis and differential expression analysis, a data set without missing values is generally required. Many imputation approaches have been proposed for this purpose. Multiple imputation (MI) is a well-established approach to address possible biases in a follow-up analysis result based on one-time imputed data. There is a lack of investigation on this in the analysis of scRNA-seq data. In this study, we have investigated how to efficiently apply the MI approach to the clustering analysis and the differential expression analysis of scRNA-seq data. We proposed an MI procedure for clustering analysis and an MI procedure for differential expression analysis. To demonstrate the improvements achieved by MI in clustering analysis and differential expression analysis of scRNA-seq data, we analyzed three well-known scRNA-seq data sets. scIGANs, an scRNA-seq imputation method based on the generative adversarial networks (GANs), has been recently proposed for scRNA-seq data imputation. Multiple randomly imputed data sets can be conveniently generated by this method. We implemented our MI procedures based on scIGANs. We demonstrated that MI yielded improved performances on the clustering analysis and differential expression analysis results. Our applications to experimental scRNA-seq data illustrated the advantages of MI over one-time imputation of missing values in scRNA-seq data.

PMID:35575729 | DOI:10.1089/cmb.2021.0597

Fast Algorithms for the Simplified Partial Digest Problem

Biing-Feng Wang — Mon, 16 May 2022 06:00:00 -0400

J Comput Biol. 2022 May 16. doi: 10.1089/cmb.2021.0641. Online ahead of print.

ABSTRACT

The simplified partial digest problem (SPDP) models an effective and robust method for the building of a physical map using restriction site analysis. The best known algorithm requires O(n2ⁿ) time, using O(n2ⁿ) working space. The high complexities in time and space impede its application to genomes of a large number of sites. This article gives two new algorithms. The first improves the time by a factor of O(n) and significantly reduces the space to O(n²). The second improves both the time and space to O(n^1.5 · 2^(n/2)). Extensive experiments are conducted on real genomes. For instances that can be solved by the best known algorithm, the new algorithms achieve a speedup of up to 4000 times; in addition, due to the reduction in space, the new algorithms can solve many more instances. Experiments also reveal the following advantage of the SPDP method: almost every instance has at most four feasible solutions and for an instance that does not contain any pair of symmetric restriction sites, in all observed examples, the solution is unique.

PMID:35575710 | DOI:10.1089/cmb.2021.0641

Variational Approximation-Based Model Selection for Microbial Network Inference

Shibu Yooseph — Fri, 13 May 2022 06:00:00 -0400

J Comput Biol. 2022 May 12. doi: 10.1089/cmb.2021.0595. Online ahead of print.

ABSTRACT

Microbial associations are characterized by both direct and indirect interactions between the constituent taxa in a microbial community, and play an important role in determining the structure, organization, and function of the community. Microbial associations can be represented using a weighted graph (microbial network), whose nodes represent taxa and edges represent pairwise associations. A microbial network is typically inferred from a sample-taxa matrix that is obtained by sequencing multiple biological samples and identifying the taxa counts in each sample. However, it is known that microbial associations are impacted by environmental and/or host factors. Thus, a sample-taxa matrix generated in a microbiome study involving a wide range of values for the environmental and/or clinical metadata variables may in fact be associated with more than one microbial network. In this study, we consider the problem of inferring multiple microbial networks from a given sample-taxa count matrix. Each sample is a count vector assumed to be generated by a mixture model consisting of component distributions that are multivariate Poisson log-normal. We present a variational expectation maximization algorithm for the model selection problem to infer the correct number of components of this mixture model. Our approach involves reframing the mixture model as a latent variable model, treating only the mixing coefficients as parameters, and subsequently approximating the marginal likelihood using an evidence lower bound framework. Our algorithm is evaluated on a large simulated dataset generated using a collection of different graph structures (band, hub, cluster, random, and scale-free).

PMID:35549398 | DOI:10.1089/cmb.2021.0595

The Probability of Joint Monophyly of Samples of Gene Lineages for All Species in an Arbitrary Species Tree

Rohan S Mehta — Wed, 11 May 2022 06:00:00 -0400

J Comput Biol. 2022 May 11. doi: 10.1089/cmb.2021.0647. Online ahead of print.

ABSTRACT

Monophyly is a feature of a set of genetic lineages in which every lineage in the set is more closely related to all other members of the set than it is to any lineage outside the set. Multiple sets of lineages that are separately monophyletic are said to be reciprocally monophyletic, or jointly monophyletic. The prevalence of reciprocal monophyly, or joint monophyly (JM), has been used to evaluate phylogenetic and phylogeographic hypotheses, as well as to delimit species. These applications often make use of a probability of JM under models of gene lineage evolution. Studies in coalescent theory have computed this JM probability for small numbers of separate groups in arbitrary species trees and for arbitrary numbers of separate groups in trivial species trees. In this study, generalizing existing results on monophyly probabilities under the multispecies coalescent, we derive the probability of JM for arbitrary numbers of separate groups in arbitrary species trees. We illustrate how our result collapses to previously examined cases. We also study the effect of tree height, sample size, and number of species on the probability of JM. We obtain relatively simple lower and upper bounds on the JM probability. Our results expand the scope of JM calculations beyond small numbers of species, subsuming past formulas that have been used in simpler cases.

PMID:35544237 | DOI:10.1089/cmb.2021.0647

Mathematical Model of HIV/AIDS Considering Sexual Preferences Under Antiretroviral Therapy, a Case Study in San Juan de Pasto, Colombia

Cristian C Espitia — Wed, 11 May 2022 06:00:00 -0400

J Comput Biol. 2022 May;29(5):483-493. doi: 10.1089/cmb.2021.0323.

ABSTRACT

While several studies on human immunodeficiency virus (HIV)/acquired immunodeficiency syndrome (AIDS) in the homosexual and heterosexual population have demonstrated substantial advantages in controlling HIV transmission in these groups, the overall benefits of the models with a bisexual population and initiation of antiretroviral therapy have not had enough attention in dynamic modeling. Thus, we used a mathematical model based on studying the impacts of bisexual behavior in a global community developed in the PhD thesis work of Espitia (2021). The model is governed by a nonlinear ordinary differential equation system, the parameters of which are calibrated with data from the cumulative cases of HIV infection and AIDS reported in San Juan de Pasto in 2019. Our model estimations show which parameters are the most influential and how to modulate them to decrease the HIV infection.

PMID:35544039 | PMC:PMC9125573 | DOI:10.1089/cmb.2021.0323

Cuckoo Search-Based Optimization for Cancer Classification: A New Hybrid Approach

Rabia Musheer Aziz — Mon, 09 May 2022 06:00:00 -0400

J Comput Biol. 2022 May 6. doi: 10.1089/cmb.2021.0410. Online ahead of print.

ABSTRACT

The design of an optimal framework for the prediction of cancer from high-dimensional and imbalanced microarray data is a challenging task in the fields of bioinformatics and machine learning. There are many techniques for dimensionality reduction, but it is unclear which of these techniques performs best with different classifiers and datasets. This article focuses on independent component analysis (ICA) as a feature (gene) extraction method for Naïve Bayes (NB) classification of microarray data, because ICA extracts independent components from the datasets that satisfy the classification criteria of the NB classifier. A novel hybrid method based on nature-inspired metaheuristic algorithms is proposed in this article for resolving optimization problems of ICA-extracted genes. A combination of the cuckoo search (CS) algorithm and the artificial bee colony (ABC) algorithm is designed and executed to find the best subset of features and thereby increase the performance of ICA for the NB classifier. To our knowledge, this is the first time CS-ABC with ICA has been implemented to resolve the dimensionality reduction problem in high-dimensional microarray biomedical datasets. The CS algorithm improved the local search process of the ABC algorithm, and the hybrid algorithm CS-ABC then provided better gene sets that improved the classification accuracy of the NB classifier. The experimental comparison shows that the CS-ABC approach with the ICA algorithm performs a deeper search in the iterative process, which can avoid premature convergence and produce better results compared with the previously published feature selection algorithm for the NB classifier.

PMID:35527646 | DOI:10.1089/cmb.2021.0410

Data Set-Adaptive Minimizer Order Reduces Memory Usage in k-Mer Counting

Dan Flomin — Mon, 09 May 2022 06:00:00 -0400

J Comput Biol. 2022 May 6. doi: 10.1089/cmb.2021.0599. Online ahead of print.

ABSTRACT

The rapid continuous growth of deep sequencing experiments requires development and improvement of many bioinformatic applications for analysis of large sequencing data sets, including k-mer counting and assembly. Several applications reduce memory usage by binning sequences. Binning is done by using minimizer schemes, which rely on a specific order of the minimizers. It has been demonstrated that the choice of the order has a major impact on the performance of the applications. Here we introduce a method for tailoring the order to the data set. Our method repeatedly samples the data set and modifies the order so as to flatten the k-mer load distribution across minimizers. We integrated our method into Gerbil, a state-of-the-art memory-efficient k-mer counter, and were able to reduce its memory footprint by 30%-50% for large k, with only a minor increase in runtime. Our tests also showed that the orders produced by our method produced superior results when transferred across data sets from the same species, with little or no order change. This enables memory reduction with essentially no increase in runtime.
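
Binning by minimizer, and the per-bin load that the method aims to flatten, can be sketched as follows. This is a generic illustration and not Gerbil's code or the authors' sampling procedure: `minimizer`, `bin_loads`, and the lexicographic order are hypothetical stand-ins, and the order is passed in as a ranking function so that it could be replaced by a data set-specific one.

```python
from collections import Counter

def minimizer(kmer, m, rank):
    """Return the m-mer of kmer that comes first under the order defined by rank."""
    return min((kmer[i:i + m] for i in range(len(kmer) - m + 1)), key=rank)

def bin_loads(seq, k, m, rank):
    """Count how many k-mers of seq are assigned to each minimizer (bin)."""
    loads = Counter()
    for i in range(len(seq) - k + 1):
        loads[minimizer(seq[i:i + k], m, rank)] += 1
    return loads

def lexicographic(mmer):
    """Baseline order: plain string comparison."""
    return mmer

loads = bin_loads("ACGTACGTTTGA", k=5, m=2, rank=lexicographic)
# Under the lexicographic order, bin "AC" receives 5 of the 8 k-mers,
# while "CG", "GT", and "GA" receive 1 each.
```

An order tailored to the data would aim to lower the largest of these counts, since the heaviest bin determines peak memory when bins are processed one at a time.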

PMID:35527644 | DOI:10.1089/cmb.2021.0599

A New Context Tree Inference Algorithm for Variable Length Markov Chain Model with Applications to Biological Sequence Analyses

Shaokun An — Fri, 22 Apr 2022 06:00:00 -0400

J Comput Biol. 2022 Apr 22. doi: 10.1089/cmb.2021.0604. Online ahead of print.

ABSTRACT

The statistical inference of high-order Markov chains (MCs) for biological sequences is vital for molecular sequence analyses but can be hindered by the high dimensionality of free parameters. In the seminal article by Bühlmann and Wyner, variable length Markov chain (VLMC) model was proposed to embed the full-order MC in a sparse structured context tree. In the key procedure of tree pruning of their proposed context algorithm, the word count-based statistic for each branch was defined and compared with a fixed cutoff threshold calculated from a common chi-square distribution to prune the branch of the context tree. In this study, we find that the word counts for each branch are highly intercorrelated, resulting in non-negligible effects on the distribution of the statistic of interest. We demonstrate that the inferred context tree based on the original context algorithm by Bühlmann and Wyner, which uses a fixed cutoff threshold based on a common chi-square distribution, can be systematically biased and error prone. We denote the original context algorithm as VLMC-Biased (VLMC-B). To solve this problem, we propose a new context tree inference algorithm using an adaptive tree-pruning scheme, termed VLMC-Consistent (VLMC-C). The VLMC-C is founded on the consistent branch-specific mixed chi-square distributions calculated based on asymptotic normal distribution of multiple word patterns. We validate our theoretical branch-specific asymptotic distribution using simulated data. We compare VLMC-C with VLMC-B on context tree inference using both simulated and real genome sequence data and demonstrate that VLMC-C outperforms VLMC-B for both context tree reconstruction accuracy and model compression capacity.

PMID:35451885 | DOI:10.1089/cmb.2021.0604

Genome-Wide Causation Studies of Complex Diseases

Rong Jiao — Fri, 22 Apr 2022 06:00:00 -0400

J Comput Biol. 2022 Apr 22. doi: 10.1089/cmb.2021.0676. Online ahead of print.

ABSTRACT

Despite significant progress in dissecting the genetic architecture of complex diseases by genome-wide association studies (GWAS), the signals identified by association analysis may not have specific pathological relevance to diseases so that a large fraction of disease-causing genetic variants is still hidden. Association is used to measure dependence between two variables or two sets of variables. GWAS test association between a disease and single-nucleotide polymorphisms (SNPs) (or other genetic variants) across the genome. Association analysis may detect superficial patterns between disease and genetic variants. Association signals provide limited information on the causal mechanism of diseases. The use of association analysis as a major analytical platform for genetic studies of complex diseases is a key issue that may hamper discovery of disease mechanisms, calling into question the ability of GWAS to identify loci underlying diseases. It is time to move beyond association analysis toward techniques that enable the discovery of the underlying causal genetic structures of complex diseases. To achieve this, we propose the concept of genome-wide causation studies (GWCS) as an alternative to GWAS and develop additive noise models (ANMs) for genetic causation analysis. Type 1 error rates and power of the ANMs in testing causation are presented. We conducted GWCS of schizophrenia. Both simulation and real data analysis show that the proportion of the overlapped association and causation signals is small. Thus, we anticipate that our analysis will stimulate serious discussion of the applicability of GWAS and GWCS.

PMID:35451855 | DOI:10.1089/cmb.2021.0676

Feature Selection by Hybrid Brain Storm Optimization Algorithm for COVID-19 Classification

Timea Bezdan — Thu, 21 Apr 2022 06:00:00 -0400

J Comput Biol. 2022 Apr 19. doi: 10.1089/cmb.2021.0256. Online ahead of print.

ABSTRACT

A large number of features lead to very high-dimensional data. The feature selection method reduces the dimension of data, increases the performance of prediction, and reduces the computation time. Feature selection is the process of selecting the optimal set of input features from a given data set in order to reduce the noise in data and keep the relevant features. The optimal feature subset contains all useful and relevant features and excludes any irrelevant feature that allows machine learning models to understand better and differentiate efficiently the patterns in data sets. In this article, we propose a binary hybrid metaheuristic-based algorithm for selecting the optimal feature subset. Concretely, the brain storm optimization algorithm is hybridized by the firefly algorithm and adopted as a wrapper method for feature selection problems on classification data sets. The proposed algorithm is evaluated on 21 data sets and compared with 11 metaheuristic algorithms. In addition, the proposed method is adopted for the coronavirus disease data set. The obtained experimental results substantiate the robustness of the proposed hybrid algorithm. It efficiently reduces and selects the feature subset and at the same time results in higher classification accuracy than other methods in the literature.

PMID:35446145 | DOI:10.1089/cmb.2021.0256

Harnessing Fuzzy Rule Based System for Screening Major Histocompatibility Complex Class I Peptide Epitopes from the Whole Proteome: An Implementation on the Proteome of Leishmania donovani

Saravanan Vijayakumar — Mon, 11 Apr 2022 06:00:00 -0400

J Comput Biol. 2022 Apr 11. doi: 10.1089/cmb.2021.0464. Online ahead of print.

ABSTRACT

The development of peptide-based vaccines is enhanced by immunoinformatics, which predicts the patterns that B cells and T cells recognize. Although several tools are available for predicting major histocompatibility complex class I (MHC-I) binding peptides, the wide variety of human leukocyte antigen alleles makes it challenging to choose a peptide that will induce an immune response in a majority of people. In addition, for a peptide to be considered a potential vaccine candidate, factors such as T cell affinity, proteasome cleavage, and similarity to human proteins also play a major role. Identifying peptides that satisfy these criteria across the entire proteome is, therefore, challenging. Hence, a fuzzy inference system (FIS) is proposed to assess each peptide's potential as a vaccine candidate and assign it a very high, high, moderate, or low ranking. The FIS includes input features from 6 modules (binding of 27 major alleles, T cell propensity, pro-inflammatory response, proteasome cleavage, transporter associated with antigen processing, and similarity with human peptide) and rules derived from an observation of features on positive samples. On validation with experimentally verified peptides, a balanced accuracy of ∼80% was achieved, with a Matthews correlation coefficient of 0.67 and an F1 score of 0.74. In addition, the method was implemented on the complete proteome of Leishmania donovani, which contains ∼4,800,000 peptides. Lastly, a searchable database of the ranked results of the L. donovani proteome was made and is available online (MHC-FIS-LdDB). It is hoped that this method will simplify the identification of potential MHC-I binding candidates from a large proteome.
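
For readers unfamiliar with the three validation metrics reported above, the sketch below computes them from a binary confusion matrix using their standard definitions. The function name `classification_scores` and the example counts are hypothetical; the counts are made up for illustration and are not the paper's data.

```python
import math

def classification_scores(tp, fp, tn, fn):
    """Balanced accuracy, F1 score, and Matthews correlation coefficient from confusion counts."""
    sensitivity = tp / (tp + fn)          # recall on positives
    specificity = tn / (tn + fp)          # recall on negatives
    precision = tp / (tp + fp)
    balanced_accuracy = (sensitivity + specificity) / 2
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {"balanced_accuracy": balanced_accuracy, "f1": f1, "mcc": mcc}

# Made-up counts, for illustration only.
scores = classification_scores(tp=40, fp=10, tn=45, fn=5)
```

Balanced accuracy and the Matthews correlation coefficient both account for true negatives, which makes them more informative than plain accuracy when positive and negative classes are unevenly represented.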

PMID:35404099 | DOI:10.1089/cmb.2021.0464

Statistical Methods for Microbiome Compositional Data Network Inference: A Survey

Liang Chen — Mon, 11 Apr 2022 06:00:00 -0400

J Comput Biol. 2022 Apr 11. doi: 10.1089/cmb.2021.0406. Online ahead of print.

ABSTRACT

Microbes can be found almost everywhere in the world. They are not isolated, but rather interact with each other and establish connections with their living environments. Studying these interactions is essential to an understanding of the organization and complex interplay of microbial communities, as well as the structure and dynamics of various ecosystems. A widely used approach toward this objective involves the inference of microbiome interaction networks. However, owing to the compositional, high-dimensional, sparse, and heterogeneous nature of observed microbial data, applying network inference methods to estimate their associations is challenging. In addition, external environmental interference and biological concerns also make it more difficult to deal with the network inference. In this article, we provide a comprehensive review of emerging microbiome interaction network inference methods. According to various research targets, estimated networks are divided into four main categories: correlation networks, conditional correlation networks, mixture networks, and differential networks. Their assumptions, high-level ideas, advantages, as well as limitations, are presented in this review. Since real microbial interactions can be complex and dynamic, no unifying method has, to date, captured all the aspects of interest. In addition, we discuss the challenges now confronting current microbial interaction study and future prospects. Finally, we point out several feasible directions of microbial network inference analysis and highlight that future research requires the joint promotion of statistical computation methods and experimental techniques.

PMID:35404093 | DOI:10.1089/cmb.2021.0406

Quantitative Biology Undergraduate Major at the University of Southern California

Peter Calabrese — Mon, 11 Apr 2022 06:00:00 -0400

J Comput Biol. 2022 Apr 11. doi: 10.1089/cmb.2021.0605. Online ahead of print.

ABSTRACT

In 2017, the University of Southern California started a new undergraduate major in quantitative biology. This major combines training in the biological sciences, mathematics, and computer science to prepare students for 21st century biology and medicine. In this article I will discuss the curriculum, the first two cohorts of graduates, the current students, and future plans for the major.

PMID:35404078 | DOI:10.1089/cmb.2021.0605

DeepVir: Graphical Deep Matrix Factorization for In Silico Antiviral Repositioning-Application to COVID-19

Aanchal Mongia — Fri, 08 Apr 2022 06:00:00 -0400

J Comput Biol. 2022 May;29(5):441-452. doi: 10.1089/cmb.2021.0108. Epub 2022 Apr 7.

ABSTRACT

This study formulates antiviral repositioning as a matrix completion problem wherein the antiviral drugs are along the rows and the viruses are along the columns. The input matrix is partially filled, with ones in positions where the antiviral drug has been known to be effective against a virus. The curated metadata for antivirals (chemical structure and pathways) and viruses (genomic structure and symptoms) are encoded into our matrix completion framework as graph Laplacian regularization. We then frame the resulting multiple graph regularized matrix completion (GRMC) problem as deep matrix factorization. This is solved by using a novel optimization method called HyPALM (Hybrid Proximal Alternating Linearized Minimization). Results on our curated RNA drug-virus association data set show that the proposed approach outperforms state-of-the-art GRMC techniques. When applied to in silico prediction of antivirals for COVID-19, our approach returns antivirals that are either used for treating patients or are under trials for the same.

PMID:35394368 | DOI:10.1089/cmb.2021.0108

Special Issue: Biological Distributed Algorithms 2021

Yuval Emek — Thu, 07 Apr 2022 06:00:00 -0400

J Comput Biol. 2022 Apr;29(4):305. doi: 10.1089/cmb.2022.29060.ye.

NO ABSTRACT

PMID:35389753 | DOI:10.1089/cmb.2022.29060.ye

Identification of New Clusters from Labeled Data Using Mixture Models

Yujung Kim — Wed, 06 Apr 2022 06:00:00 -0400

J Comput Biol. 2022 Apr 5. doi: 10.1089/cmb.2021.0443. Online ahead of print.

ABSTRACT

Attempts to segment data into classes or groups are now common in many fields. In particular, an emerging issue in the biological and medical sciences is the identification of new subtypes of biological samples or patients. This often requires finding new subtypes within known classes, for which clustering techniques are typically used. However, standard clustering methods can mix up the labels of the known classes in their output, which may lead to misinterpretation of the identified clusters, and they do not use the information about the known classes. This study therefore proposes a Gaussian mixture model-based approach that identifies new clusters from known classes while preserving those classes. The performance of the proposed model is verified through simulations, and the model is applied to a breast cancer data set.

PMID:35384743 | DOI:10.1089/cmb.2021.0443

On the Number of Saturated and Optimal Extended 2-Regular Simple Stacks in the Nussinov-Jacobson Energy Model

Qianghui Guo — Wed, 30 Mar 2022 06:00:00 -0400

J Comput Biol. 2022 May;29(5):425-440. doi: 10.1089/cmb.2021.0421. Epub 2022 Mar 28.

ABSTRACT

It is known that both RNA secondary structures and protein contact maps can be represented using combinatorial diagrams, whose enumeration and related problems have been studied extensively. Motivated by previous enumeration work on saturated RNA secondary structures and extended stack structures of protein contact maps, we study the enumeration of saturated and optimal extended stacks in the Nussinov-Jacobson energy model, in which each base pair contributes energy -1. In this model, optimal structures are those with the most arcs, and locally optimal structures are exactly the saturated structures, in which no more arcs can be added without violating the structure definition. For saturated extended 2-regular simple stacks, whose degree configuration is related to protein folds in the two-dimensional honeycomb lattice, we obtain a generating function equation and an asymptotic formula for their number. Moreover, an explicit formula for the number of optimal extended 2-regular simple stacks is also obtained.

PMID:35353583 | DOI:10.1089/cmb.2021.0421
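
The energy model in the abstract above gives each base pair an energy of -1, so an optimal structure is simply one with the most arcs. As a minimal sketch of that idea, the following Python uses the classical Nussinov-style dynamic program for non-crossing RNA base pairs. It does not enumerate the extended 2-regular simple stacks that the article counts. The allowed pair set (Watson-Crick plus G-U wobble), the `min_loop` parameter, and the function name are assumptions made for illustration.

```python
def max_base_pairs(seq, min_loop=1):
    """Maximum number of non-crossing base pairs in seq.

    With energy -1 per pair, the minimum energy is -max_base_pairs(seq).
    min_loop is the (assumed) minimum number of unpaired bases enclosed
    by any pair.
    """
    # Assumed pairing rules: Watson-Crick pairs plus G-U wobble pairs.
    pairs = {("A", "U"), ("U", "A"), ("G", "C"),
             ("C", "G"), ("G", "U"), ("U", "G")}
    n = len(seq)
    if n == 0:
        return 0
    # best[i][j] = maximum number of pairs within seq[i..j] (inclusive).
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            # Case 1: position j is left unpaired.
            score = best[i][j - 1]
            # Case 2: position j pairs with some k, splitting the interval.
            for k in range(i, j - min_loop):
                if (seq[k], seq[j]) in pairs:
                    left = best[i][k - 1] if k > i else 0
                    inner = best[k + 1][j - 1] if k + 1 <= j - 1 else 0
                    score = max(score, left + 1 + inner)
            best[i][j] = score
    return best[0][n - 1]
```

For example, `max_base_pairs("GGAACC")` pairs the two G bases with the two C bases, which corresponds to an energy of -2 under this model.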

Texture Enhancement of Medical Images for Efficient Disease Diagnosis with Optimized Fractional Derivative Masks

Priyanka Harjule — Wed, 30 Mar 2022 06:00:00 -0400

J Comput Biol. 2022 Mar 28. doi: 10.1089/cmb.2021.0267. Online ahead of print.

ABSTRACT

For the past two decades, fractional-order derivatives have been used to model many systems in science and engineering with more accuracy than existing integer-order derivatives. Many of these applications have been in the image processing field. Image enhancement algorithms are highly desirable in medical image analysis, where diagnosing various kinds of diseases efficiently demands high-quality images. Hence, medical image processing requires accurate edge-detection and denoising models that enhance the contrast of an image to attain a better texture and avoid noise. In this study, we employ and compare conventional methods and the most popular recent fractional-order-based methods for texture enhancement in medical image analysis. To make a fair comparison, the fractional-order operators are optimized for all images with the gray wolf optimizer, using mean squared error as the performance metric. The results show that fractional differential-based operators perform better than conventional integer-order operators for texture enhancement of medical images.

PMID:35353538 | DOI:10.1089/cmb.2021.0267

Agreement in Spiking Neural Networks

Martin Kunev — Fri, 25 Mar 2022 06:00:00 -0400

J Comput Biol. 2022 Apr;29(4):358-369. doi: 10.1089/cmb.2021.0365. Epub 2022 Mar 23.

ABSTRACT

We study the problem of binary agreement in a spiking neural network (SNN). We show that binary agreement on n inputs can be achieved with O(n) auxiliary neurons. Our simulation results suggest that agreement can be achieved in our network in O(log n) time. We then describe a subclass of SNNs with a biologically plausible property, which we call size-independence. We prove that solving a class of problems, including agreement and Winner-Take-All, in this model requires Ω(n) auxiliary neurons, which makes our agreement network size-optimal.

PMID:35333601 | DOI:10.1089/cmb.2021.0365

Correlation Imputation for Single-Cell RNA-seq

Luqin Gan — Thu, 24 Mar 2022 06:00:00 -0400

J Comput Biol. 2022 May;29(5):465-482. doi: 10.1089/cmb.2021.0403. Epub 2022 Mar 21.

ABSTRACT

Recent advances in single-cell RNA sequencing (scRNA-seq) technologies have yielded a powerful tool to measure gene expression of individual cells. One major challenge of the scRNA-seq data is that it usually contains a large amount of zero expression values, which often impairs the effectiveness of downstream analyses. Numerous data imputation methods have been proposed to deal with these "dropout" events, but this is a difficult task for such high-dimensional and sparse data. Furthermore, there have been debates on the nature of the sparsity, about whether the zeros are due to technological limitations or represent actual biology. To address these challenges, we propose Single-cell RNA-seq Correlation completion by ENsemble learning and Auxiliary information (SCENA), a novel approach that imputes the correlation matrix of the data of interest instead of the data itself. SCENA obtains a gene-by-gene correlation estimate by ensembling various individual estimates, some of which are based on known auxiliary information about gene expression networks. Our approach is a reliable method that makes no assumptions on the nature of sparsity in scRNA-seq data or the data distribution. By extensive simulation studies and real data applications, we demonstrate that SCENA is not only superior in gene correlation estimation, but also improves the accuracy and reliability of downstream analyses, including cell clustering, dimension reduction, and graphical model estimation to learn the gene expression network.

PMID:35325552 | PMC:PMC9125575 | DOI:10.1089/cmb.2021.0403

Use of DFT Distance Metrics for Classification of SARS-CoV-2 Genomes

Micah Thornton — Thu, 24 Mar 2022 06:00:00 -0400

J Comput Biol. 2022 May;29(5):453-464. doi: 10.1089/cmb.2021.0229. Epub 2022 Mar 21.

ABSTRACT

In this work, we investigate using Fourier coefficients (FCs) for capturing useful information about viral sequences in a computationally efficient and compact manner. Specifically, we extract geographic submission location from SARS-CoV-2 sequence headers submitted to the GISAID Initiative, calculate corresponding FCs, and use the FCs to classify these sequences according to geographic location. We show that the FCs serve as useful numerical summaries for sequences that allow manipulation, identification, and differentiation via classical mathematical and statistical methods that are not readily applicable to character strings. Further, we argue that subsets of the FCs may be usable for the same purposes, which results in a reduction in storage requirements. We conclude by offering extensions of the research and potential future directions for subsequent analyses, such as the use of other series transforms for discretely indexed signals such as genomes.

PMID:35325549 | DOI:10.1089/cmb.2021.0229
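
The abstract above describes summarizing sequences by Fourier coefficients and comparing them numerically. The sketch below shows one common way to do this, and it may differ from the article's encoding. Each base is turned into a 0/1 indicator signal, the first `n_keep` coefficients of a naive discrete Fourier transform are computed, and the summed power spectra are compared with a Euclidean distance. The indicator encoding, the function names, and the restriction to equal-length sequences are assumptions. A real pipeline would use an FFT library rather than this O(N x n_keep) loop.

```python
import cmath
import math


def fourier_coefficients(seq, n_keep):
    """First n_keep DFT coefficients of each base-indicator signal."""
    seq = seq.upper()
    n = len(seq)
    coeffs = {}
    for base in "ACGT":
        # Assumed encoding: 1.0 where the base occurs, 0.0 elsewhere.
        signal = [1.0 if ch == base else 0.0 for ch in seq]
        coeffs[base] = [
            sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(signal))
            for k in range(n_keep)
        ]
    return coeffs


def power_spectrum(seq, n_keep):
    """Power at each of the first n_keep frequencies, summed over bases."""
    coeffs = fourier_coefficients(seq, n_keep)
    return [sum(abs(coeffs[b][k]) ** 2 for b in "ACGT")
            for k in range(n_keep)]


def spectral_distance(seq_a, seq_b, n_keep):
    """Euclidean distance between truncated power spectra.

    Assumes both sequences have the same length, so that coefficient k
    refers to the same frequency in both.
    """
    pa = power_spectrum(seq_a, n_keep)
    pb = power_spectrum(seq_b, n_keep)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(pa, pb)))
```

Keeping only the first `n_keep` coefficients mirrors the abstract's point that a subset of the FCs may suffice, which reduces storage.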

Integrating Long-Range Regulatory Interactions to Predict Gene Expression Using Graph Convolutional Networks

Jeremy Bigness — Thu, 24 Mar 2022 06:00:00 -0400

J Comput Biol. 2022 May;29(5):409-424. doi: 10.1089/cmb.2021.0316. Epub 2022 Mar 21.

ABSTRACT

Long-range regulatory interactions among genomic regions are critical for controlling gene expression, and their disruption has been associated with a host of diseases. However, when modeling the effects of regulatory factors, most deep learning models either neglect long-range interactions or fail to capture the inherent 3D structure of the underlying genomic organization. To address these limitations, we present a Graph Convolutional Model for Epigenetic Regulation of Gene Expression (GC-MERGE). Using a graph-based framework, the model incorporates important information about long-range interactions via a natural encoding of genomic spatial interactions into the graph representation. It integrates measurements of both the global genomic organization and the local regulatory factors, specifically histone modifications, to not only predict the expression of a given gene of interest but also quantify the importance of its regulatory factors. We apply GC-MERGE to data sets for three cell lines, GM12878 (lymphoblastoid), K562 (myelogenous leukemia), and HUVEC (human umbilical vein endothelial), and demonstrate its state-of-the-art predictive performance. Crucially, we show that our model is interpretable in terms of the observed biological regulatory factors, highlighting both the histone modifications and the interacting genomic regions contributing to a gene's predicted expression. We provide model explanations for multiple exemplar genes and validate them with evidence from the literature. Our model presents a novel setup for predicting gene expression by integrating multimodal data sets in a graph convolutional framework. More importantly, it enables interpretation of the biological mechanisms driving the model's predictions.

PMID:35325548 | PMC:PMC9125570 | DOI:10.1089/cmb.2021.0316

Emergence of Direction-Selective Retinal Cell Types in Task-Optimized Deep Learning Models

Keith T Murray — Fri, 11 Mar 2022 06:00:00 -0500

J Comput Biol. 2022 Apr;29(4):370-381. doi: 10.1089/cmb.2021.0368. Epub 2022 Mar 11.

ABSTRACT

Convolutional neural networks (CNNs), a class of deep learning models, have experienced recent success in modeling sensory cortices and retinal circuits through optimizing performance on machine learning tasks, otherwise known as task optimization. Previous research has shown task-optimized CNNs to be capable of providing explanations as to why the retina efficiently encodes natural stimuli and how certain retinal cell types are involved in efficient encoding. In our work, we sought to use task-optimized CNNs as a means of explaining computational mechanisms responsible for motion-selective retinal circuits. We designed a biologically constrained CNN and optimized its performance on a motion-classification task. We drew inspiration from psychophysics, deep learning, and systems neuroscience literature to develop a toolbox of methods to reverse engineer the computational mechanisms learned in our model. Through reverse engineering our model, we proposed a computational mechanism in which direction-selective ganglion cells and starburst amacrine cells, both experimentally observed retinal cell types, emerge in our model to discriminate among moving stimuli. This emergence suggests that direction-selective circuits in the retina are ecologically designed to robustly discriminate among moving stimuli. Our results and methods also provide a framework for how to build more interpretable deep learning models and how to understand them.

PMID:35275740 | DOI:10.1089/cmb.2021.0368

Coordinating Amoebots via Reconfigurable Circuits

Michael Feldmann — Mon, 07 Mar 2022 06:00:00 -0500

J Comput Biol. 2022 Apr;29(4):317-343. doi: 10.1089/cmb.2021.0363. Epub 2022 Mar 7.

ABSTRACT

We consider an extension to the geometric amoebot model that allows amoebots to form so-called circuits. Given a connected amoebot structure, a circuit is a subgraph formed by the amoebots that permits the instant transmission of signals. We show that such an extension allows for significantly faster solutions to a variety of problems related to programmable matter. More specifically, we provide algorithms for leader election, consensus, compass alignment, chirality agreement, and shape recognition. Leader election can be solved in Θ(log n) rounds with high probability (w.h.p.), consensus in O(1) rounds, and both compass alignment and chirality agreement in O(log n) rounds, w.h.p. For shape recognition, the amoebots have to decide whether the amoebot structure forms a particular shape. We show that the amoebots can detect a shape composed of triangles within O(1) rounds. Finally, we show how the amoebots can detect a parallelogram with linear and polynomial side ratio within Θ(log n) rounds, w.h.p.

PMID:35255223 | DOI:10.1089/cmb.2021.0363

Improvement of Automatic Glioma Brain Tumor Detection Using Deep Convolutional Neural Networks

Ayman Altameem — Wed, 02 Mar 2022 06:00:00 -0500

J Comput Biol. 2022 Mar 1. doi: 10.1089/cmb.2021.0280. Online ahead of print.

ABSTRACT

This article introduces automatic brain tumor detection from magnetic resonance images (MRIs). It provides novel algorithms for patch extraction and segmentation, trained with convolutional neural networks (CNNs), to identify brain tumors. Further, this study applies deep learning and image segmentation with CNN algorithms. It proposes two similar segmentation algorithms for brain tumor patients: one for higher grade gliomas (HGG) and the other for lower grade gliomas (LGG). The proposed algorithms (intensity normalization, patch extraction, selection of the best patch, segmentation of HGG, and segmentation of LGG) take the MRI as input, segment the tumor, identify the gliomas, and detect the stage of the tumor; both the HGG and the LGG segmentation work with CNNs. The segmentation algorithm is compared with different existing algorithms and performs the automatic identification with reasonably high accuracy, as shown by the accuracy and loss curves over training epochs. This article also describes how transfer learning helped with image extraction and image resolution and increased the segmentation accuracy for LGG patients.

PMID:35235381 | DOI:10.1089/cmb.2021.0280

A Mathematical Framework for Analyzing Wild Tomato Root Architecture

Arjun Chandrasekhar — Wed, 02 Mar 2022 06:00:00 -0500

J Comput Biol. 2022 Apr;29(4):306-316. doi: 10.1089/cmb.2021.0361. Epub 2022 Mar 2.

ABSTRACT

The root architecture of wild tomato, Solanum pimpinellifolium, can be viewed as a network connecting the main root to various lateral roots. Several constraints have been proposed on the structure of such biological networks, including minimizing the total amount of wire necessary for constructing the root architecture (wiring cost), and minimizing the distances (and by extension, resource transport time) between the base of the main root and the lateral roots (conduction delay). For a given set of lateral root tip locations, these two objectives compete with each other (optimizing one results in poorer performance on the other), raising the question of how well S. pimpinellifolium root architectures balance this network design trade-off in a distributed manner. In this study, we describe how well S. pimpinellifolium roots resolve this trade-off using the theory of Pareto optimality. We describe a mathematical model for characterizing the network structure and design trade-offs governing the structure of S. pimpinellifolium root architecture. We demonstrate that S. pimpinellifolium arbors construct architectures that are more optimal than would be expected by chance. Finally, we use this framework to quantify structural differences between arbors grown in the presence of salt stress, classify arbors into four distinct architectural ideotypes, and test for heritability of variation in root architecture structure.

PMID:35235373 | DOI:10.1089/cmb.2021.0361
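
The trade-off described in the abstract above can be made concrete with a small sketch. It assumes an arbor is given as 2D points with a parent pointer per non-root node. Wiring cost is taken as total edge length, and conduction delay as the sum of path lengths from the root base to each lateral root tip. These definitions, the data layout, and the function names are illustrative assumptions rather than the article's exact model.

```python
import math


def wiring_and_delay(points, parent, tips, root=0):
    """Return (wiring cost, conduction delay) for one candidate arbor.

    points: list of (x, y) coordinates, indexed by node.
    parent: dict mapping each non-root node to its parent node.
    tips:   nodes treated as lateral root tips.
    """
    def edge_length(child):
        return math.dist(points[child], points[parent[child]])

    # Wiring cost: total length of all edges in the tree.
    wiring = sum(edge_length(child) for child in parent)

    def path_to_root(node):
        total = 0.0
        while node != root:
            total += edge_length(node)
            node = parent[node]
        return total

    # Conduction delay (assumed): summed path length from root to each tip.
    delay = sum(path_to_root(tip) for tip in tips)
    return wiring, delay


def pareto_front(candidates):
    """Keep the (wiring, delay) pairs that no other candidate dominates."""
    front = []
    for i, (w, d) in enumerate(candidates):
        dominated = any(
            w2 <= w and d2 <= d and (w2 < w or d2 < d)
            for j, (w2, d2) in enumerate(candidates) if j != i
        )
        if not dominated:
            front.append((w, d))
    return front
```

With a root at (0, 0) and tips at (3, 4) and (3, 8), chaining the tips gives lower wiring cost while connecting each tip directly to the root gives lower delay, so both lie on the Pareto front. Routing through the far tip first is worse on both objectives and is dominated.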

Optimal Solution of a Fractional HIV/AIDS Epidemic Mathematical Model

Hossein Hassani — Tue, 01 Mar 2022 06:00:00 -0500

J Comput Biol. 2022 Mar;29(3):276-291. doi: 10.1089/cmb.2021.0253. Epub 2022 Feb 25.

ABSTRACT

This article presents a fractional mathematical model of the human immunodeficiency virus (HIV)/AIDS spread with a fractional derivative of the Caputo type. The model includes five compartments corresponding to the variables describing the susceptible patients, HIV-infected patients, people with AIDS but not receiving antiretroviral treatment, patients being treated, and individuals who are immune to HIV infection by sexual contact. Moreover, it is assumed that the total population is constant. We construct an optimization technique supported by a class of basis functions, consisting of the generalized shifted Jacobi polynomials (GSJPs). The solution of the fractional HIV/AIDS epidemic model is approximated by means of GSJPs with coefficients and parameters in the matrix form. After calculating and combining the operational matrices with the Lagrange multipliers, we obtain the optimization method. Theorems on the existence, uniqueness, and convergence of the method are proved. Several illustrative examples show the performance of the proposed method. Mathematics Subject Classification: 97M60; 41A58; 92C42.

PMID:35230161 | DOI:10.1089/cmb.2021.0253

Trade-offs of Linear Mixed Models in Genome-Wide Association Studies

Haohan Wang — Tue, 01 Mar 2022 06:00:00 -0500

J Comput Biol. 2022 Mar;29(3):233-242. doi: 10.1089/cmb.2021.0157. Epub 2022 Feb 25.

ABSTRACT

Motivated by empirical arguments that are well known from the genome-wide association studies (GWAS) literature, we study the statistical properties of linear mixed models (LMMs) applied to GWAS. First, we study the sensitivity of LMMs to the inclusion of a candidate single nucleotide polymorphism (SNP) in the kinship matrix, which is often done in practice to speed up computations. Our results shed light on the size of the error incurred by including a candidate SNP, providing a justification for this technique of trading off velocity against veracity. Second, we investigate how mixed models can correct confounders in GWAS, which is widely accepted as an advantage of LMMs over traditional methods. We consider two sources of confounding, population stratification and environmental confounding factors, and study how different methods that are commonly used in practice trade off these two confounding factors differently.

PMID:35230156 | PMC:PMC8968846 | DOI:10.1089/cmb.2021.0157

Comparing Phylogenetic Trees Side by Side Through iPhyloC, a New Interactive Web-Based Framework

Muhsen Hammoud — Tue, 01 Mar 2022 06:00:00 -0500

J Comput Biol. 2022 Mar;29(3):292-303. doi: 10.1089/cmb.2021.0351. Epub 2022 Feb 25.

ABSTRACT

Current frameworks for side-by-side phylogenetic tree comparison face two issues: (1) accepting mainly binary trees as input and (2) assuming that input trees have identical or highly overlapping taxa. However, cladistic comparative studies often deal with multiple trees that are not fully resolved and have nonidentical sets of taxa. We tackle these issues in this study, presenting iPhyloC, an interactive web-based framework for comparing phylogenetic trees side by side. iPhyloC supports automatic identification of the common taxa in the input trees, comparison options between them, intuitive design, high usability, scalability to large trees, and cross-platform support. iPhyloC was tested using different trees and a supertree depicting the phylogenetic relationships within the insect order Diptera as examples.

PMID:35230147 | DOI:10.1089/cmb.2021.0351

An Upper and Lower Bound for the Convergence Time of House-Hunting in Temnothorax Ant Colonies

Emily Zhang — Wed, 23 Feb 2022 06:00:00 -0500

J Comput Biol. 2022 Apr;29(4):344-357. doi: 10.1089/cmb.2021.0364. Epub 2022 Feb 22.

ABSTRACT

We study the problem of house-hunting in ant colonies, where ants reach consensus on a new nest and relocate their colony to that nest, from a distributed computing perspective. We propose a house-hunting algorithm that is biologically inspired by Temnothorax ants. Each ant is modeled as a probabilistic agent with limited power, and there is no central control governing the ants. We show an Ω(log n) lower bound on the running time of our proposed house-hunting algorithm, where n is the number of ants. Furthermore, we show a matching upper bound of expected O(log n) rounds for environments with only one candidate nest for the ants to move to. Our work provides insights into the house-hunting process, giving a perspective on how environmental factors such as nest quality or a quorum rule can affect the emigration process.

PMID:35196137 | DOI:10.1089/cmb.2021.0364

Scalable Species Tree Inference with External Constraints

Baqiao Liu — Wed, 23 Feb 2022 06:00:00 -0500

J Comput Biol. 2022 Feb 21. doi: 10.1089/cmb.2021.0543. Online ahead of print.

ABSTRACT

Species tree inference is a basic step in biological discovery, but discordance between gene trees creates analytical challenges and large data sets create computational challenges. Although there is generally some information available about the species trees that could be used to speed up the estimation, only one species tree estimation method that addresses gene tree discordance (ASTRAL-J, a recent development in the ASTRAL family of methods) is able to use this information. Here we describe two new methods, NJst-J and FASTRAL-J, that can estimate the species tree, given a partial knowledge of the species tree in the form of a nonbinary unrooted constraint tree. We show that both NJst-J and FASTRAL-J are much faster than ASTRAL-J and we prove that all three methods are statistically consistent under the multispecies coalescent model subject to this constraint. Our extensive simulation study shows that both FASTRAL-J and NJst-J provide advantages over ASTRAL-J: both are faster (and NJst-J is particularly fast), and FASTRAL-J is generally at least as accurate as ASTRAL-J. An analysis of the Avian Phylogenomics Project data set with 48 species and 14,446 genes presents additional evidence of the value of FASTRAL-J over ASTRAL-J (and both over ASTRAL), with dramatic reductions in running time (20 hours for default ASTRAL, and minutes or seconds for ASTRAL-J and FASTRAL-J, respectively).

PMID:35196115 | DOI:10.1089/cmb.2021.0543

RECOMB 2021 Special Issue

Jian Peng — Fri, 18 Feb 2022 06:00:00 -0500

J Comput Biol. 2022 Feb;29(2):91. doi: 10.1089/cmb.2021.29051.jp.

NO ABSTRACT

PMID:35179993 | DOI:10.1089/cmb.2021.29051.jp

The Statistics of k-mers from a Sequence Undergoing a Simple Mutation Process Without Spurious Matches

Antonio Blanca — Wed, 02 Feb 2022 06:00:00 -0500

J Comput Biol. 2022 Feb;29(2):155-168. doi: 10.1089/cmb.2021.0431. Epub 2022 Feb 1.

ABSTRACT

k-mer-based methods are widely used in bioinformatics, but there are many gaps in our understanding of their statistical properties. Here, we consider the simple model where a sequence S (e.g., a genome or a read) undergoes a simple mutation process through which each nucleotide is mutated independently with some probability r, under the assumption that there are no spurious k-mer matches. How does this process affect the k-mers of S? We derive the expectation and variance of the number of mutated k-mers and of the number of islands (a maximal interval of mutated k-mers) and oceans (a maximal interval of nonmutated k-mers). We then derive hypothesis tests and confidence intervals (CIs) for r given an observed number of mutated k-mers, or, alternatively, given the Jaccard similarity (with or without MinHash). We demonstrate the usefulness of our results using a few select applications: obtaining a CI to supplement the Mash distance point estimate, filtering out reads during alignment by Minimap2, and rating long-read alignments to a de Bruijn graph by Jabba.

PMID:35108101 | DOI:10.1089/cmb.2021.0431
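
Under the mutation model in the abstract above, a k-mer is mutated whenever at least one of its k nucleotides is mutated. This happens with probability 1 - (1 - r)^k, so by linearity of expectation a sequence with L k-mers has L(1 - (1 - r)^k) mutated k-mers on average. The sketch below encodes that expectation, inverts it to obtain a simple point estimate of r, and includes a small simulation. It does not reproduce the article's variance formulas, hypothesis tests, or confidence intervals, and all function names are hypothetical.

```python
import random


def prob_kmer_mutated(k, r):
    """Probability that at least one of the k positions is mutated."""
    return 1.0 - (1.0 - r) ** k


def expected_mutated_kmers(num_kmers, k, r):
    """Expected number of mutated k-mers, by linearity of expectation."""
    return num_kmers * prob_kmer_mutated(k, r)


def estimate_r(num_mutated, num_kmers, k):
    """Point estimate of r obtained by inverting the expectation."""
    return 1.0 - (1.0 - num_mutated / num_kmers) ** (1.0 / k)


def simulate_mutated_kmers(num_kmers, k, r, rng):
    """Simulate one run of the mutation process and count mutated k-mers.

    Only mutated positions are tracked, so a mutated k-mer can never be
    mistaken for an unmutated one (no spurious matches).
    """
    n = num_kmers + k - 1  # number of nucleotides
    mutated = [rng.random() < r for _ in range(n)]
    # Sliding window: number of mutated positions inside the current k-mer.
    window = sum(mutated[:k])
    count = 1 if window > 0 else 0
    for i in range(1, num_kmers):
        window += mutated[i + k - 1] - mutated[i - 1]
        if window > 0:
            count += 1
    return count
```

Passing an explicit `random.Random(seed)` as `rng` keeps a simulation reproducible. The counts from repeated runs should scatter around `expected_mutated_kmers(num_kmers, k, r)`.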

ProALIGN: Directly Learning Alignments for Protein Structure Prediction via Exploiting Context-Specific Alignment Motifs

Lupeng Kong — Mon, 24 Jan 2022 06:00:00 -0500

J Comput Biol. 2022 Feb;29(2):92-105. doi: 10.1089/cmb.2021.0430. Epub 2022 Jan 21.

ABSTRACT

Template-based modeling (TBM), including homology modeling and protein threading, is one of the most reliable techniques for protein structure prediction. It predicts protein structure by building an alignment between the query sequence under prediction and the templates with solved structures. However, it is still very challenging to build the optimal sequence-template alignment, especially when only distantly related templates are available. Here we report a novel deep learning approach ProALIGN that can predict much more accurate sequence-template alignment. Like protein sequences consisting of sequence motifs, protein alignments are also composed of frequently occurring alignment motifs with characteristic patterns. Alignment motifs are context-specific as their characteristic patterns are tightly related to sequence contexts of the aligned regions. Inspired by this observation, we represent a protein alignment as a binary matrix (in which 1 denotes an aligned residue pair) and then use a deep convolutional neural network to predict the optimal alignment from the query protein and its template. The trained neural network implicitly but effectively encodes an alignment scoring function, which reduces inaccuracies in the handcrafted scoring functions widely used by the current threading approaches. For a query protein and a template, we apply the neural network to directly infer likelihoods of all possible residue pairs in their entirety, which could effectively consider the correlations among multiple residues. We further construct the alignment with maximum likelihood, and finally build a structure model according to the alignment. Tested on three independent data sets with a total of 6688 protein alignment targets and 80 CASP13 TBM targets, our method achieved much better alignments and 3D structure models than the existing methods, including HHpred, CNFpred, CEthreader, and DeepThreader. These results clearly demonstrate the effectiveness of exploiting the context-specific alignment motifs by deep learning for protein threading.

PMID:35073170 | PMC:PMC8892980 | DOI:10.1089/cmb.2021.0430
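
The abstract above represents a protein alignment as a binary matrix in which 1 denotes an aligned residue pair. The following sketch shows one plausible construction from a gapped pairwise alignment. The input format (two equal-length strings with "-" as the gap character) and the function name are assumptions, and the neural network that predicts such matrices is not shown.

```python
def alignment_to_binary_matrix(aligned_query, aligned_template, gap="-"):
    """Convert a gapped pairwise alignment to a query-by-template 0/1 matrix.

    Entry [i][j] is 1 when query residue i is aligned to template residue j.
    """
    if len(aligned_query) != len(aligned_template):
        raise ValueError("aligned strings must have equal length")
    n_query = sum(ch != gap for ch in aligned_query)
    n_template = sum(ch != gap for ch in aligned_template)
    matrix = [[0] * n_template for _ in range(n_query)]
    qi = ti = 0  # indices into the ungapped query and template
    for q, t in zip(aligned_query, aligned_template):
        if q != gap and t != gap:
            matrix[qi][ti] = 1
        if q != gap:
            qi += 1
        if t != gap:
            ti += 1
    return matrix
```

Each row and each column of the matrix holds at most one 1, and residues that face a gap get an all-zero row or column.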

Uncovering Molecular Mechanisms of Drug Resistance via Network-Constrained Common Structure Identification

Heewon Park — Mon, 24 Jan 2022 06:00:00 -0500

J Comput Biol. 2022 Mar;29(3):257-275. doi: 10.1089/cmb.2021.0314. Epub 2022 Jan 21.

ABSTRACT

Uncovering mechanisms of acquired drug resistance has garnered increasing attention worldwide as drug resistance reduces antibiotic and chemotherapy effectiveness. Most bioinformatics studies have elucidated these mechanisms based on differentially expressed gene (DEG) analysis. However, considering the associated complex network of biological systems, the specific molecular interactions must also be studied to obtain a complete understanding of the mechanisms related to drug resistance. Accordingly, by analyzing sample-specific gene networks, we sought to elucidate mechanisms of acquired drug resistance of cells based on molecular interactions between genes. In the current study, we focused on gefitinib and erlotinib and characterized cell lines based on their sensitivity. We also considered CRISPR knockout screening of the target gene, epidermal growth factor receptor (EGFR), as a characteristic of cells. Subsequently, we constructed a drug sensitivity-CRISPR knockout screen-specific gene network. To identify the molecular mechanisms of drug resistance from the multiple large-scale networks, we proposed a novel computational method, designated network-constrained sparse common component analysis (NetSCCA), that extracts common structures of multiple networks characterizing molecular interaction in drug-sensitive and drug-resistant cell lines. We then applied NetSCCA to multilayer networks of candidate drug-response genes to identify common structures of the regulatory system in drug-sensitive and EGFR-dependent cells, and drug-resistant and EGFR-independent cells. NetSCCA identified crucial common targets and regulator genes that dominate multiple networks in drug-sensitive and drug-resistant cell lines, respectively. Our common structure identification based on NetSCCA can characterize the molecular interplay between genes and crucial markers related to mechanisms of acquired drug resistance that cannot be revealed by DEG analysis alone. The biological mechanisms associated with gefitinib and erlotinib sensitivity of identified genes were verified through the literature. We expect that the proposed method will serve as a useful tool for uncovering not only drug resistance mechanisms but also complex biological systems based on massive genomic data sets.

PMID:35073162 | DOI:10.1089/cmb.2021.0314

RECOMB 2021 Special Issue

Jian Peng — Thu, 20 Jan 2022 06:00:00 -0500

J Comput Biol. 2022 Jan;29(1):2. doi: 10.1089/cmb.2021.29050.jp.

NO ABSTRACT

PMID:35050716 | DOI:10.1089/cmb.2021.29050.jp

GRNUlar: A Deep Learning Framework for Recovering Single-Cell Gene Regulatory Networks

Harsh Shrivastava — Thu, 20 Jan 2022 06:00:00 -0500

J Comput Biol. 2022 Jan;29(1):27-44. doi: 10.1089/cmb.2021.0437.

ABSTRACT

We propose GRNUlar, a novel deep learning framework for supervised learning of gene regulatory networks (GRNs) from single-cell RNA-Sequencing (scRNA-Seq) data. Our framework incorporates two intertwined models. First, we leverage the expressive ability of neural networks to capture complex dependencies between transcription factors and the corresponding genes they regulate, by developing a multitask learning framework. Second, to capture sparsity of GRNs observed in the real world, we design an unrolled algorithm technique for our framework. Our deep architecture requires supervision for training, for which we repurpose existing synthetic data simulators that generate scRNA-Seq data guided by an underlying GRN. Experimental results demonstrate that GRNUlar outperforms state-of-the-art methods on both synthetic and real data sets. Our study also demonstrates the novel and successful use of expression data simulators for supervised learning of GRN inference.

PMID:35050715 | DOI:10.1089/cmb.2021.0437

SCOT: Single-Cell Multi-Omics Alignment with Optimal Transport

Pinar Demetci — Thu, 20 Jan 2022 06:00:00 -0500

J Comput Biol. 2022 Jan;29(1):3-18. doi: 10.1089/cmb.2021.0446.

ABSTRACT

Recent advances in sequencing technologies have allowed us to capture various aspects of the genome at single-cell resolution. However, with the exception of a few co-assaying technologies, it is not possible to simultaneously apply different sequencing assays on the same single cell. In this scenario, computational integration of multi-omic measurements is crucial to enable joint analyses. This integration task is particularly challenging due to the lack of sample-wise or feature-wise correspondences. We present single-cell alignment with optimal transport (SCOT), an unsupervised algorithm that uses the Gromov-Wasserstein optimal transport to align single-cell multi-omics data sets. SCOT performs on par with the current state-of-the-art unsupervised alignment methods, is faster, and requires tuning of fewer hyperparameters. More importantly, SCOT uses a self-tuning heuristic to guide hyperparameter selection based on the Gromov-Wasserstein distance. Thus, in the fully unsupervised setting, SCOT aligns single-cell data sets better than the existing methods without requiring any orthogonal correspondence information.

PMID:35050714 | PMC:PMC8812493 | DOI:10.1089/cmb.2021.0446

The Power of Population Effect in Temnothorax Ant House-Hunting: A Computational Modeling Approach

Jiajia Zhao — Thu, 20 Jan 2022 06:00:00 -0500

J Comput Biol. 2022 Apr;29(4):382-408. doi: 10.1089/cmb.2021.0369. Epub 2022 Jan 20.

ABSTRACT

The decentralized cognition of animal groups is both a challenging biological problem and a potential basis for bioinspired design. In this study, we investigated the house-hunting algorithm used by emigrating colonies of Temnothorax ants to reach consensus on a new nest. We developed a tractable model that encodes accurate individual behavior rules, and estimated our parameter values by matching simulated behaviors with observed ones on both the individual and group levels. We then used our model to explore a potential, but as yet untested, component of the ants' decision algorithm. Specifically, we examined the hypothesis that incorporating site population (the number of adult ants at each potential nest site) into individual perceptions of nest quality can improve emigration performance. Our results showed that attending to site population accelerates emigration and reduces the incidence of split decisions. This result suggests the value of testing empirically whether nest site scouts use site population in this way, in addition to the well-demonstrated quorum rule. We also used our model to make other predictions with varying degrees of empirical support, including the high cognitive capacity of colonies and their rational time investment during decision-making. In addition, we provide a versatile and easy-to-use Python simulator that can be used to explore other hypotheses or make testable predictions. It is our hope that the insights and the modeling tools can inspire further research from both the biology and computer science communities.

PMID:35049358 | DOI:10.1089/cmb.2021.0369

Set-Min Sketch: A Probabilistic Map for Power-Law Distributions with Application to k-Mer Annotation

Yoshihiro Shibuya — Thu, 20 Jan 2022 06:00:00 -0500

J Comput Biol. 2022 Feb;29(2):140-154. doi: 10.1089/cmb.2021.0429. Epub 2022 Jan 18.

ABSTRACT

k-mer counts are important features used by many bioinformatics pipelines. Existing k-mer counting methods focus on optimizing either time or memory usage, and they output very large count tables that explicitly represent k-mers together with their counts. Storing k-mers is not needed if the set of k-mers is known, making it possible to only keep counters and their association to k-mers. Solutions avoiding explicit representation of k-mers include Minimal Perfect Hash Functions (MPHFs) and Count-Min sketches. We introduce Set-Min sketch, a sketching technique for representing associative maps inspired by Count-Min, and apply it to the problem of representing k-mer count tables. Set-Min is provably more accurate than both Count-Min and Max-Min, an improved variant of Count-Min for static datasets that we define here. We show that Set-Min sketch provides a very low error rate, in terms of both the probability and the size of errors, at the expense of a very moderate memory increase. On the other hand, Set-Min sketches are shown to take up to an order of magnitude less space than MPHF-based solutions, for fully assembled genomes and large k. Space-efficiency of Set-Min in this case takes advantage of the power-law distribution of k-mer counts in genomic datasets.
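
For orientation, the Count-Min baseline named in the abstract can be sketched in a few lines of Python. This is not the Set-Min sketch itself. The class name, the default width and depth, and the use of a salted BLAKE2b hash per row are all assumptions made for illustration.

```python
import hashlib


class CountMinSketch:
    """Baseline Count-Min sketch: `depth` rows of `width` counters each."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, key, row):
        # Salting the hash with the row number gives one hash function per row.
        digest = hashlib.blake2b(
            key.encode(), digest_size=8, salt=row.to_bytes(8, "little")
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, key, count=1):
        for r in range(self.depth):
            self.rows[r][self._index(key, r)] += count

    def query(self, key):
        # Collisions can only inflate a counter, so the minimum over the
        # rows is the tightest available estimate (never an underestimate).
        return min(self.rows[r][self._index(key, r)] for r in range(self.depth))


def kmers(seq, k):
    """Yield every length-k substring of seq, left to right."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


sketch = CountMinSketch()
for kmer in kmers("ACGTACGTAC", 3):
    sketch.add(kmer)
estimate = sketch.query("ACG")  # true count is 2; the estimate is at least 2
```

The k-mers themselves are never stored, only counters, which is the property the abstract highlights. The price is that a query for an absent k-mer may return a nonzero count.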

PMID:35049334 | DOI:10.1089/cmb.2021.0429

flopp: Extremely Fast Long-Read Polyploid Haplotype Phasing by Uniform Tree Partitioning

Jim Shaw — Tue, 18 Jan 2022 06:00:00 -0500

J Comput Biol. 2022 Feb;29(2):195-211. doi: 10.1089/cmb.2021.0436. Epub 2022 Jan 17.

ABSTRACT

Resolving haplotypes in polyploid genomes using phase information from sequencing reads is an important and challenging problem. We introduce two new mathematical formulations of polyploid haplotype phasing: (1) the min-sum max tree partition problem, which is a more flexible graphical metric compared with the standard minimum error correction (MEC) model in the polyploid setting, and (2) the uniform probabilistic error minimization model, which is a probabilistic analogue of the MEC model. We incorporate both formulations into a long-read-based polyploid haplotype phasing method called flopp. We show that flopp compares favorably with state-of-the-art algorithms: it is up to 30 times faster with 2 times fewer switch errors on 6× ploidy simulated data. Further, we show using real nanopore data that flopp can quickly reveal reasonable haplotype structures from the autotetraploid Solanum tuberosum (potato).
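
As a point of reference for the standard MEC model mentioned above (not flopp's two new formulations), the MEC score of a candidate phasing is the sum, over reads, of the mismatches between each read and its closest haplotype. The sketch below assumes a toy encoding that is not taken from the paper: alleles are the characters "0" and "1", and "-" marks a site the read does not cover.

```python
def hamming(read, haplotype):
    """Count mismatches at the sites the read covers ('-' means uncovered)."""
    return sum(1 for r, h in zip(read, haplotype) if r != "-" and r != h)


def mec_score(reads, haplotypes):
    """Sum, over all reads, of the distance to the closest haplotype."""
    return sum(min(hamming(read, hap) for hap in haplotypes) for read in reads)


# Toy triploid example: three haplotypes over four variant sites.
haplotypes = ["0101", "1010", "0011"]
reads = ["01--", "-011", "10-0", "0-10"]
score = mec_score(reads, haplotypes)  # only the last read needs one correction
```

A phasing method that optimizes MEC searches for the haplotype set minimizing this score. The sketch only evaluates a given set.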

PMID:35041529 | PMC:PMC8892958 | DOI:10.1089/cmb.2021.0436

Finding Maximal Exact Matches Using the r-Index

Massimiliano Rossi — Tue, 18 Jan 2022 06:00:00 -0500

J Comput Biol. 2022 Feb;29(2):188-194. doi: 10.1089/cmb.2021.0445. Epub 2022 Jan 17.

ABSTRACT

Efficiently finding maximal exact matches (MEMs) between a sequence read and a database of genomes is a key first step in read alignment. But until recently, it was unknown how to build a data structure in O(r) space that supports efficient MEM finding, where r is the number of runs in the Burrows-Wheeler Transform. In 2021, Rossi et al. showed how to build a small auxiliary data structure called thresholds in addition to the r-index in O(r) space. This addition enables efficient MEM finding using the r-index. In this article, we present the tool that implements this solution, which we call MONI. Namely, we give a high-level view of the main components of the data structure and show how the source code can be downloaded, compiled, and used to find MEMs between a set of sequence reads and a set of genomes.
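
To make the quantity r concrete, the following sketch computes a Burrows-Wheeler Transform by sorting rotations and counts its runs of equal characters. It is a quadratic-time illustration only and says nothing about how MONI or the r-index are built. The function names and the "$" sentinel are assumptions for the example.

```python
def bwt(text):
    """Burrows-Wheeler Transform via sorted rotations (illustration only)."""
    # '$' is a sentinel assumed to be absent from text and to sort before
    # every other character (true for letters under ASCII ordering).
    s = text + "$"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def count_runs(s):
    """Number of maximal runs of equal characters; this is r when s is a BWT."""
    return sum(1 for i, c in enumerate(s) if i == 0 or c != s[i - 1])


transformed = bwt("banana")   # "annb$aa"
r = count_runs(transformed)   # 5 runs: a, nn, b, $, aa
```

Repetitive inputs such as collections of related genomes tend to produce long runs in the BWT, which is why a space bound expressed in terms of r rather than the text length is attractive.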

PMID:35041518 | PMC:PMC8902461 | DOI:10.1089/cmb.2021.0445

MONI: A Pangenomic Index for Finding Maximal Exact Matches

Massimiliano Rossi — Tue, 18 Jan 2022 06:00:00 -0500

J Comput Biol. 2022 Feb;29(2):169-187. doi: 10.1089/cmb.2021.0290. Epub 2022 Jan 17.

ABSTRACT

Recently, Gagie et al. proposed a version of the FM-index, called the r-index, that can store thousands of human genomes on a commodity computer. Then Kuhnle et al. showed how to build the r-index efficiently via a technique called prefix-free parsing (PFP) and demonstrated its effectiveness for exact pattern matching. Exact pattern matching can be leveraged to support approximate pattern matching, but the r-index itself cannot efficiently support popular and important queries such as finding maximal exact matches (MEMs). To address this shortcoming, Bannai et al. introduced the concept of thresholds, and showed that storing them together with the r-index enables efficient MEM finding, but they did not say how to find those thresholds. We present a novel algorithm that applies PFP to build the r-index and find the thresholds simultaneously and in linear time and space with respect to the size of the prefix-free parse. Our implementation, called MONI, can rapidly find MEMs between reads and large collections of highly repetitive sequences. Compared with other read aligners (PuffAligner, Bowtie2, BWA-MEM, and CHIC), MONI used 2-11 times less memory and was 2-32 times faster for index construction. Moreover, MONI was less than one thousandth the size of competing indexes for large collections of human chromosomes. Thus, MONI represents a major advance in our ability to perform MEM finding against very large collections of related references.
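
The query itself can be illustrated with a brute-force sketch. A MEM is a substring of the read that occurs in the text and cannot be extended by one character on either side while still occurring. The code below finds MEMs by direct substring search, which is far slower than an index-based method and is not MONI's algorithm. The function name, the (start, length) output format, and the min_len parameter are assumptions for the example.

```python
def find_mems(read, text, min_len=1):
    """Return (start, length) pairs for the maximal exact matches of read in text."""
    mems = []
    prev_len = 0
    for i in range(len(read)):
        # Longest prefix of read[i:] occurring in text (right-maximal by construction).
        length = 0
        while i + length < len(read) and read[i:i + length + 1] in text:
            length += 1
        # The match at i extends to the left exactly when the longest match
        # at i - 1 is one character longer, so keep it only if that is not the case.
        if length >= min_len and (i == 0 or prev_len <= length):
            mems.append((i, length))
        prev_len = length
    return mems


mems = find_mems("ACGTTT", "GGACGTAATTT")  # [(0, 4), (3, 3)]: "ACGT" and "TTT"
```

In a read aligner, MEMs such as these typically serve as seeds that are later extended into full alignments.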

PMID:35041495 | PMC:PMC8892979 | DOI:10.1089/cmb.2021.0290

Deriving Ranges of Optimal Estimated Transcript Expression due to Nonidentifiability

Hongyu Zheng — Tue, 18 Jan 2022 06:00:00 -0500

J Comput Biol. 2022 Feb;29(2):121-139. doi: 10.1089/cmb.2021.0444. Epub 2022 Jan 17.

ABSTRACT

Current expression quantification methods suffer from a fundamental but undercharacterized type of error: the most likely estimates for transcript abundances are not unique. This means multiple estimates of transcript abundances generate the observed RNA-seq reads with equal likelihood, and the underlying true expression cannot be determined. This is called nonidentifiability in probabilistic modeling. It is further exacerbated by incomplete reference transcriptomes where reads may be sequenced from unannotated transcripts. Graph quantification is a generalization of transcript quantification, accounting for the reference incompleteness by allowing exponentially many unannotated transcripts to express reads. We propose methods to calculate a "confidence range of expression" for each transcript, representing its possible abundance across equally optimal estimates for both quantification models. This range informs both whether a transcript has potential estimation error due to nonidentifiability and the extent of the error. Applying our methods to the Human Body Map data, we observe that 35%-50% of transcripts potentially suffer from inaccurate quantification caused by nonidentifiability. When comparing the expression between isoforms in one sample, we find that the degree of inaccuracy of 20%-47% of transcripts can be so large that the ranking of expression between the transcript and other isoforms from the same gene cannot be determined. When comparing the expression of a transcript between two groups of RNA-seq samples in differential expression analysis, we observe that, after considering the ranges of the optimal expression estimates, the majority of detected differentially expressed transcripts are reliable, with a few exceptions.

PMID:35041494 | PMC:PMC8892959 | DOI:10.1089/cmb.2021.0444

Simulating Single-Cell Gene Expression Count Data with Preserved Gene Correlations by scDesign2

Tianyi Sun — Wed, 12 Jan 2022 06:00:00 -0500

J Comput Biol. 2022 Jan;29(1):23-26. doi: 10.1089/cmb.2021.0440. Epub 2022 Jan 11.

ABSTRACT

scDesign2 is a transparent simulator that generates high-fidelity single-cell gene expression count data with gene correlations captured. This article shows how to download and install the scDesign2 R package, how to fit probabilistic models (one per cell type) to real data and simulate synthetic data from the fitted models, and how to use scDesign2 to guide experimental design and benchmark computational methods. Finally, a note is given about cell clustering as a preprocessing step before model fitting and data simulation.

PMID:35020490 | PMC:PMC8812500 | DOI:10.1089/cmb.2021.0440