Datasets

  • Datasets for testing accuracy of microbial gene tree rooting methods

    The following simulated and biological datasets were used in the paper cited below to test the accuracy of various gene tree rooting methods on microbial gene families. The paper contains a detailed description of these datasets.

    Simulated datasets: AllSimulatedDatasets.zip
    Real biological dataset: RootingEmpiricalDataset.zip

    Associated Publication:
    Assessing the Accuracy of Phylogenetic Rooting Methods on Prokaryotic Gene Families
    Taylor Wade, L. Thiberio Rangel, Soumya Kundu, Gregory P. Fournier, Mukul S. Bansal.
    PLOS One, 15(5): e0232950, 2020.

  • Simulated datasets for testing DTRL reconciliation algorithms and classification of additive and replacing transfers

    The following simulated datasets contain gene trees and species trees where the gene trees were evolved inside the species tree with gene duplications, additive transfers, replacing transfers, and gene losses, using the SaGePhy simulation framework. These datasets were used in the paper cited below to test the accuracy of a heuristic for classifying transfer events inferred through DTL reconciliation as being additive or replacing. The paper contains a detailed description of these datasets.

    Simulated datasets: DTRL_simulatedData.zip

    Associated Publication:
    On Inferring Additive and Replacing Horizontal Gene Transfers Through Phylogenetic Reconciliation
    Misagh Kordi, Soumya Kundu, Mukul S. Bansal.
    ACM Conference on Bioinformatics, Computational Biology, and Health Informatics (ACM-BCB) 2019; Proceedings, pages 514-523.

  • COVID-19 dataset for inferring the international COVID-19 transmission network

    This dataset includes the GISAID accession numbers for 2123 filtered SARS-CoV-2 sequences from 59 countries (downloaded from GISAID in June 2020), the specific command used to align those sequences, and the 10 bootstrap phylogenies computed on the aligned sequences using RAxML. This dataset was used to infer the international COVID-19 transmission network using TNet. Following GISAID’s policies, the actual sequences are not included in this dataset; the genomic sequences and associated metadata (including country of origin, country of exposure, etc.) can be downloaded from GISAID using the provided accession numbers. The manuscript cited below contains a detailed description of this dataset.

    Global COVID-19 dataset: Global_COVID-19_Dataset.zip
    Acknowledgement table for the COVID-19 sequence data used in the above dataset: gisaid_acknowledgement_table_world.pdf

    Associated manuscript:
    TNet: Transmission Network Inference Using Within-Host Strain Diversity and its Application to Geographical Tracking of COVID-19 Spread
    Saurav Dhar, Chengchen Zhang, Ion Mandoiu, Mukul S. Bansal.
    IEEE/ACM Transactions on Computational Biology and Bioinformatics; 2021 (in press).
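
All of the datasets above are distributed as zip archives. Their internal layout is not described on this page, so it can help to list an archive's contents before extracting it. The following is a minimal sketch using only the Python standard library; the function name `summarize_archive` is hypothetical, and the script assumes nothing about the file names or formats inside any particular archive.

```python
import sys
import zipfile
from collections import Counter
from pathlib import Path


def summarize_archive(zip_path):
    """Return (file_names, extension_counts) for a zip archive without extracting it."""
    with zipfile.ZipFile(zip_path) as archive:
        # Skip directory entries; keep only actual files.
        names = [info.filename for info in archive.infolist() if not info.is_dir()]
    # Count files by extension; files without one are grouped under "(none)".
    extensions = Counter(Path(name).suffix.lower() or "(none)" for name in names)
    return names, extensions


if __name__ == "__main__":
    # Pass the path(s) of whichever archive(s) you downloaded on the command line.
    for path in sys.argv[1:]:
        names, extensions = summarize_archive(path)
        print(f"{path}: {len(names)} files")
        for ext, count in sorted(extensions.items()):
            print(f"  {ext}: {count}")
```

Saved as a script under any name, it can be run with one or more downloaded archives as arguments, for example DTRL_simulatedData.zip, to print the number of files per extension in each.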