Software

  • RANGER-DTL: Short for Rapid ANalysis of Gene family Evolution using Reconciliation-DTL, this is a software package for inferring gene family evolution by speciation, gene duplication, horizontal gene transfer, and gene loss. The software takes as input a gene tree (rooted or unrooted) and a rooted species tree and reconciles the two by postulating speciation, duplication, transfer, and loss events. RANGER-DTL implements the algorithms presented in the ISMB 2012, RECOMB 2013, IEEE/ACM TCBB and BMC Bioinformatics papers listed on the publications page and makes it possible to perform rigorous evolutionary analyses of even large gene families with thousands of taxa while accounting for confounding factors such as gene tree uncertainty and multiple optima. It can be downloaded from https://compbio.engr.uconn.edu/software/RANGER-DTL/
  • HoMer: Short for “Horizontal Multi-gene transfer inference”, HoMer is a software package for inferring instances of horizontal multi-gene transfer (HMGT) during the evolutionary history of a collection of microbial species/strains. An HMGT occurs when multiple genes are horizontally transferred in a single horizontal transfer event. HoMer takes as input a rooted species tree, gene ordering information for the species/genomes (leaves) represented in the species tree, and rooted gene trees for all gene families (with at least three genes each) present in the species/genomes under consideration. The software outputs a list of inferred HMGTs for each donor-recipient pair on the species tree, where donors/recipients can be leaves (i.e., given genomes) or internal (i.e., ancestral) edges on the species tree. Further technical details appear in the Molecular Biology and Evolution paper available from the publications page. HoMer can be downloaded from https://compbio.engr.uconn.edu/software/homer/
  • CNAsim: CNAsim is a software package for improved simulation of single-cell copy number alteration (CNA) data from tumors. CNAsim can be used to generate copy number profiles with noise patterns that mimic those of single-cell CNA detection algorithms, and to generate DNA-seq data for sampled cells. It offers significantly improved scalability, a high degree of customizability, and improved biological realism of simulated data. Further details on features and implementation appear in the Bioinformatics paper available from the publications page.
  • PhyloGTP: This is a software tool for genome-scale microbial phylogenomics. It takes as input a collection of gene trees and estimates a species tree under a duplication-transfer-loss (DTL) model of gene family evolution. PhyloGTP uses local search heuristics to obtain a species tree topology which minimizes the global DTL reconciliation cost with the collection of input gene trees. Further details appear in the RECOMB-CG paper available from the publications page. PhyloGTP has been shown to often result in more accurate microbial phylogenies than existing microbial phylogenomics tools. It can be freely downloaded from https://github.com/samsonweiner/PhyloGTP/.
  • SaGePhy: Short for “Simulation framework for Subgene and Gene Phylogenies”, SaGePhy is an easy-to-use, open-source, and platform-independent software package for simulating gene family evolution within species trees as well as subgene or protein-domain evolution within one or more gene trees. SaGePhy can generate species trees using a probabilistic birth-death process, generate gene trees within a given species tree using a probabilistic model of gene evolution that allows for gene duplications, horizontal gene transfers, and gene losses, and generate subgene or domain family phylogenies inside one or more gene trees by allowing for subgene duplications, horizontal subgene transfers within and across gene families (and either within or across species boundaries), and subgene losses. SaGePhy implements a number of important features not found in other phylogenetic simulation software. Further details are available from the software webpage at http://compbio.engr.uconn.edu/software/sagephy/
  • SEADOG: Short for “Simultaneous Evolutionary Analysis of DOmains and Genes through phylogenetic reconciliation”, this is a software package for simultaneous inference of domain-level and gene-level evolution through a joint phylogenetic reconciliation of domain, gene, and species trees. The software takes as input a rooted or unrooted domain tree, rooted gene trees for the gene families in which the domains of the domain tree occur, and a rooted species tree on the species considered in the analysis, and computes a joint Domain-Gene-Species reconciliation of the domain tree with the gene trees and of the gene trees with the species tree. The software implements the Domain-Gene-Species (DGS) reconciliation model and algorithms described in the IEEE/ACM TCBB and ACM-BCB 2018 papers listed on the publications page. SEADOG can be downloaded from http://compbio.engr.uconn.edu/software/seadog/
  • SEADOG-Gen: SEADOG-Gen is a software package for joint phylogenetic reconciliation of domain, gene, and species trees for microbial species. SEADOG-Gen is similar to the SEADOG software above but assumes that both domain transfer and gene transfer can occur easily in the species being analyzed, and the implemented algorithms only work well under this assumption. The software implements the Generalized Domain-Gene-Species (Gen-DGS) reconciliation model and algorithms described in the IEEE/ACM TCBB paper listed on the publications page. SEADOG-Gen can be downloaded from http://compbio.engr.uconn.edu/software/seadog-gen/
  • RF+: This is a program for computing RF(+) distances between phylogenetic trees. RF(+) distance is designed to more meaningfully compute the Robinson-Foulds distance between two trees that only have a partially overlapping leaf set. The traditional approach for computing Robinson-Foulds distance between two trees that only have a partially overlapping leaf set is to first restrict the two trees to their shared leaf set and then compute their Robinson-Foulds distance. We refer to distances computed in this way as RF(-) distances. In contrast, the RF(+) distance between two arbitrary trees is computed by first optimally completing each tree on the union of the leaf sets of both trees so as to minimize the Robinson-Foulds distance between them, and then reporting the Robinson-Foulds distance between the two completed trees. This software implements the algorithms described in the CPM 2021, AMB 2020, and RECOMB-CG 2018 papers listed on the publications page. RF+ can be downloaded from http://compbio.engr.uconn.edu/software/RF_plus/
  • TNet: TNet is a phylogeny-based method for reconstructing transmission networks for infectious diseases. It takes as input a phylogeny of the strain (pathogen) sequences sampled from infected hosts and analyzes it to estimate the underlying transmission network. TNet relies on the availability of multiple strain sequences from each sampled host to infer transmissions and is simpler and more accurate than existing approaches. The method is parameter-free and highly scalable and can be easily applied within seconds to datasets with hundreds of strain sequences and hosts. TNet can be downloaded from http://compbio.engr.uconn.edu/software/tnet/ and algorithmic details are available in the ISBRA 2020 and IEEE/ACM TCBB 2022 papers listed on the publications page.
  • TNet-Geo: TNet-Geo is a customised and extended version of the TNet software described above and is designed for geographical transmission network analysis when multiple strain sequences from different infected hosts are available from the different geographic regions (e.g., countries) under consideration. TNet-Geo can be used to estimate the extent of infection spread from one region to another in different time periods. TNet-Geo was first introduced in the IEEE/ACM TCBB 2022 paper listed on the publications page.
  • DaTeR: DaTeR is a program for improved dating of microbial species phylogenies using relative time constraints (e.g., obtained from high-confidence horizontal gene transfer events). Traditional phylogenetic dating approaches make use of absolute time constraints, which provide lower and/or upper bounds for one or more nodes of the underlying phylogeny, but are unable to use relative constraints that specify that some node x must be dated to be at least as old as some other node y. DaTeR takes as input a collection of chronograms sampled from the posterior using any traditional Bayesian phylogenetic dating approach (based on only absolute time calibrations), along with a set of curated relative time constraints, and minimally error-corrects each input chronogram to ensure compatibility with all available relative time constraints. DaTeR can be downloaded from https://compbio.engr.uconn.edu/software/dater/
  • virDTL: virDTL is a computational protocol for the inference of both extant and ancestral strain recombination in viral genomes using phylogenetic reconciliation. virDTL leverages Duplication-Transfer-Loss reconciliation to analyze incongruencies between the strain evolutionary tree and the evolutionary trees of each gene family (or genomic region) to infer possible horizontal gene transfers, which correspond to possible recombination events in the context of viruses. Further details on virDTL appear in the JCB 2022 paper listed on the publications page.
  • RANGER-DTLx: RANGER-DTLx is a prototype extended version of the RANGER-DTL 2.0 software package implementing an extended version of the Duplication-Transfer-Loss (DTL) reconciliation model, called the DTLx reconciliation model, that can account for horizontal gene transfer events from unsampled or extinct species lineages. Further details on the DTLx reconciliation model appear in the Algorithms paper listed on the publications page.
  • TreeSolve: TreeSolve is a program for gene tree error-correction. TreeSolve is designed for the error-correction of microbial gene trees (with horizontal gene transfer) but can be easily applied to non-microbial gene trees as well. TreeSolve takes as input a rooted gene tree topology, a known rooted species tree, and a collection of (unrooted) gene tree samples such as bootstrap replicates or samples from a posterior distribution, and outputs an error-corrected gene tree topology. TreeSolve works by computing branch support values for the given rooted gene tree based on the given replicates/samples, collapsing weakly supported branches in the input gene tree, and then optimally resolving it based on both the input gene tree samples and the species tree while accounting for horizontal gene transfer, gene duplication, and gene loss. TreeSolve serves a similar purpose as the TreeFix-DTL program described below, but is far more scalable and yields multiple candidate error-corrected gene trees. TreeSolve can be downloaded from http://compbio.engr.uconn.edu/software/treesolve/ and algorithmic details are available in the AlCoB 2020 paper listed on the publications page.
  • ARTra: This is a program for inferring and distinguishing between additive and replacing horizontal gene transfer events. ARTra uses Duplication-Transfer-Loss (DTL) reconciliation to infer transfer events and then uses a trained machine learning classifier to classify the inferred transfers as additive or replacing. The machine learning classifier uses the error-prone classifications generated by several simple rule-based classification heuristics, along with some additional features, to generate an improved ensemble classification. The machine learning framework and rule-based heuristics used by ARTra are described in the ACM-BCB 2020 paper listed on the publications page. ARTra can be downloaded from http://compbio.engr.uconn.edu/software/ARTra
  • trippd: This is a prototype implementation of a simple proof-of-concept approach for detecting the presence of partial gene transfer (i.e., horizontal transfer of a fragment of a gene) in a given gene family. trippd takes as input a multiple sequence alignment for the gene family under consideration, partitions the sites/columns of the alignment into three roughly equal parts, computes ML trees and bootstrap replicates for each partition, and compares these trees with each other to determine if that gene family has been affected by significant partial gene transfer. Further methodological details are described in the associated RECOMB-CG 2022 paper. trippd can be used to easily identify gene families whose gene trees may have been impacted by the presence of significant partial gene transfer. Scripts implementing trippd, along with some simulated datasets used for its evaluation, are freely available from https://github.com/suz11001/Tripartition.
  • TreeFix: This is a program for very accurate reconstruction of eukaryotic gene trees. TreeFix takes as input a maximum likelihood gene tree topology, a known species tree, and a multiple sequence alignment for the gene family and outputs a more accurate gene tree topology that has statistically equivalent sequence support and better agreement with the species tree topology. Further technical details and experimental evaluation appear in the Systematic Biology paper listed on the publications page. TreeFix was programmed by Yi-Chieh Wu and can be downloaded from http://compbio.mit.edu/treefix/.
  • TreeFix-DTL: This is a program for very accurate reconstruction of microbial gene trees (with horizontal gene transfer). Like TreeFix above, TreeFix-DTL takes as input a maximum likelihood gene tree topology, a known species tree, and a multiple sequence alignment for the gene family and outputs a more accurate gene tree topology while accounting for horizontal gene transfer, gene duplication, and gene loss. Further technical details and experimental evaluation appear in the Bioinformatics paper listed on the publications page. TreeFix-DTL was programmed by Yi-Chieh Wu and can be downloaded from http://compbio.mit.edu/treefix-dtl/.
  • TreeFix-TP: This is a program for reconstructing highly accurate transmission phylogenies, i.e., phylogenies depicting the evolutionary relationships between infectious disease strains (viral or bacterial) transmitted between different hosts. TreeFix-TP is designed for scenarios where multiple strain sequences have been sampled from each infected host, and it uses the host assignment of each sequence sample to error-correct a given maximum likelihood phylogeny of the strain sequences. Specifically, given a maximum likelihood phylogeny, the multiple sequence alignment on which the phylogeny was built, and the host assignment for each sequence, TreeFix-TP searches around the maximum likelihood phylogeny to find an alternate error-corrected phylogeny which is equally well-supported by the sequence data and minimizes the number of necessary inter-host transmissions. TreeFix-TP can be downloaded from http://compbio.engr.uconn.edu/software/treefix-tp/
  • RF-Supertrees: This is a fast and accurate supertree program for rooted phylogenetic trees. It searches for a supertree that minimizes the total (rooted) Robinson-Foulds distance (i.e. symmetric difference) between the supertree and the input trees. RF-Supertrees implements efficient search algorithms described in the paper Robinson-Foulds Supertrees listed on the publications page, and can be downloaded from https://genome.cs.iastate.edu/rfsupertrees.
  • DupTree: This is a toolbox for constructing species phylogenies from genome-scale multi-locus data using gene tree parsimony. The idea is to find the species tree that best reconciles the input gene trees in terms of gene duplications. This toolbox implements the fast local search algorithm described in the RECOMB’07 paper listed on the publications page and was programmed jointly with Andre Wehe. DupTree can be downloaded from https://genome.cs.iastate.edu/DupTree.
  • DupLoss and DeepC: These programs extend DupTree and allow the construction of species phylogenies from genome-scale multi-locus data under the duplication-loss and deep coalescence cost models, respectively. They implement the fast local search algorithms described in the APBC’10 paper listed on the publications page and are now available as part of the software package iGTP, which can be downloaded from https://genome.cs.iastate.edu/igtp/home.
  • HiDe: HiDe (short for Highway Detection) is a software package for inferring highways of horizontal gene transfer (representing large-scale horizontal transfer of genes) in the evolutionary history of a set of species. HiDe implements the highway detection method described in the 2013 paper listed on the publications page and was programmed by undergraduate summer student Guy Banay under my supervision. HiDe can be downloaded from http://acgt.cs.tau.ac.il/hide/.
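
To make the RF(-) distance mentioned in the RF+ entry concrete, here is a minimal Python sketch of the traditional approach: restrict both trees to their shared leaf set, then take the symmetric difference of their clusters. It is not the RF+ software and does not compute RF(+) distances. The nested-tuple tree encoding, the use of the rooted (cluster-based) variant of Robinson-Foulds, and the function names `leaves`, `restrict`, `clusters`, and `rf_minus` are all assumptions made for illustration.

```python
def leaves(tree):
    """Return the set of leaf labels of a nested-tuple tree (leaves are strings)."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def restrict(tree, keep):
    """Prune the tree to the leaves in `keep`, suppressing nodes left with one child."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    children = [r for r in (restrict(c, keep) for c in tree) if r is not None]
    if not children:
        return None
    if len(children) == 1:
        return children[0]
    return tuple(children)


def clusters(tree):
    """Return the leaf sets of all internal nodes other than the root."""
    found = set()

    def visit(node, is_root):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(c, False) for c in node))
        if not is_root:
            found.add(below)
        return below

    visit(tree, True)
    return found


def rf_minus(t1, t2):
    """Rooted RF distance after restricting both trees to their shared leaves."""
    shared = leaves(t1) & leaves(t2)
    r1, r2 = restrict(t1, shared), restrict(t2, shared)
    if r1 is None or r2 is None:
        return 0  # no shared leaves, so nothing to compare
    return len(clusters(r1) ^ clusters(r2))


# Hypothetical example: the trees share leaves a, b, c, d; e and f are unique.
t1 = ((("a", "b"), "c"), ("d", "e"))
t2 = ((("a", "c"), "b"), ("d", "f"))
print(rf_minus(t1, t2))  # 2: clusters {a,b} and {a,c} each occur in only one tree
```

Note that the leaves e and f are simply discarded here, so they cannot contribute to the distance. RF(+) instead completes each tree on the union of the two leaf sets before comparing them.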