HAYSTAC: A Bayesian framework for robust and rapid species identification in high-throughput sequencing data

Research output: Contribution to journal › Journal article › Research › peer-review

Standard

HAYSTAC: A Bayesian framework for robust and rapid species identification in high-throughput sequencing data. / Dimopoulos, Evangelos A.; Carmagnini, Alberto; Velsko, Irina M.; Warinner, Christina; Larson, Greger; Frantz, Laurent A. F.; Irving-Pease, Evan K.

In: PLOS Computational Biology, Vol. 18, No. 9, e1010493, 2022.

Harvard

Dimopoulos, EA, Carmagnini, A, Velsko, IM, Warinner, C, Larson, G, Frantz, LAF & Irving-Pease, EK 2022, 'HAYSTAC: A Bayesian framework for robust and rapid species identification in high-throughput sequencing data', PLOS Computational Biology, vol. 18, no. 9, e1010493. https://doi.org/10.1371/journal.pcbi.1010493

APA

Dimopoulos, E. A., Carmagnini, A., Velsko, I. M., Warinner, C., Larson, G., Frantz, L. A. F., & Irving-Pease, E. K. (2022). HAYSTAC: A Bayesian framework for robust and rapid species identification in high-throughput sequencing data. PLOS Computational Biology, 18(9), Article e1010493. https://doi.org/10.1371/journal.pcbi.1010493

Vancouver

Dimopoulos EA, Carmagnini A, Velsko IM, Warinner C, Larson G, Frantz LAF et al. HAYSTAC: A Bayesian framework for robust and rapid species identification in high-throughput sequencing data. PLOS Computational Biology. 2022;18(9):e1010493. https://doi.org/10.1371/journal.pcbi.1010493

Author

Dimopoulos, Evangelos A.; Carmagnini, Alberto; Velsko, Irina M.; Warinner, Christina; Larson, Greger; Frantz, Laurent A. F.; Irving-Pease, Evan K. / HAYSTAC: A Bayesian framework for robust and rapid species identification in high-throughput sequencing data. In: PLOS Computational Biology. 2022; Vol. 18, No. 9.

Bibtex

@article{5e1edc8e4bc344bb85aa90f9a417e38a,
title = "HAYSTAC: A Bayesian framework for robust and rapid species identification in high-throughput sequencing data",
abstract = "Identification of specific species in metagenomic samples is critical for several key applications, yet many tools available require large computational power and are often prone to false positive identifications. Here we describe High-AccuracY and Scalable Taxonomic Assignment of MetagenomiC data (HAYSTAC), which can estimate the probability that a specific taxon is present in a metagenome. HAYSTAC provides a user-friendly tool to construct databases, based on publicly available genomes, that are used for competitive reads mapping. It then uses a novel Bayesian framework to infer the abundance and statistical support for each species identification and provide per-read species classification. Unlike other methods, HAYSTAC is specifically designed to efficiently handle both ancient and modern DNA data, as well as incomplete reference databases, making it possible to run highly accurate hypothesis-driven analyses (i.e., assessing the presence of a specific species) on variably sized reference databases while dramatically improving processing speeds. We tested the performance and accuracy of HAYSTAC using simulated Illumina libraries, both with and without ancient DNA damage, and compared the results to other currently available methods (i.e., Kraken2/Bracken, KrakenUniq, MALT/HOPS, and Sigma). HAYSTAC identified fewer false positives than both Kraken2/Bracken, KrakenUniq and MALT in all simulations, and fewer than Sigma in simulations of ancient data. It uses less memory than Kraken2/Bracken, KrakenUniq as well as MALT both during database construction and sample analysis. Lastly, we used HAYSTAC to search for specific pathogens in two published ancient metagenomic datasets, demonstrating how it can be applied to empirical datasets. HAYSTAC is available from https://github.com/antonisdim/HAYSTAC.",
author = "Dimopoulos, {Evangelos A.} and Alberto Carmagnini and Velsko, {Irina M.} and Christina Warinner and Greger Larson and Frantz, {Laurent A. F.} and Irving-Pease, {Evan K.}",
note = "Publisher Copyright: {\textcopyright} 2022 Dimopoulos et al.",
year = "2022",
doi = "10.1371/journal.pcbi.1010493",
language = "English",
volume = "18",
journal = "PLOS Computational Biology",
issn = "1553-734X",
publisher = "Public Library of Science",
number = "9",
pages = "e1010493",

}

RIS

TY - JOUR

T1 - HAYSTAC

T2 - A Bayesian framework for robust and rapid species identification in high-throughput sequencing data

AU - Dimopoulos, Evangelos A.

AU - Carmagnini, Alberto

AU - Velsko, Irina M.

AU - Warinner, Christina

AU - Larson, Greger

AU - Frantz, Laurent A. F.

AU - Irving-Pease, Evan K.

N1 - Publisher Copyright: © 2022 Dimopoulos et al.

PY - 2022

Y1 - 2022

N2 - Identification of specific species in metagenomic samples is critical for several key applications, yet many tools available require large computational power and are often prone to false positive identifications. Here we describe High-AccuracY and Scalable Taxonomic Assignment of MetagenomiC data (HAYSTAC), which can estimate the probability that a specific taxon is present in a metagenome. HAYSTAC provides a user-friendly tool to construct databases, based on publicly available genomes, that are used for competitive reads mapping. It then uses a novel Bayesian framework to infer the abundance and statistical support for each species identification and provide per-read species classification. Unlike other methods, HAYSTAC is specifically designed to efficiently handle both ancient and modern DNA data, as well as incomplete reference databases, making it possible to run highly accurate hypothesis-driven analyses (i.e., assessing the presence of a specific species) on variably sized reference databases while dramatically improving processing speeds. We tested the performance and accuracy of HAYSTAC using simulated Illumina libraries, both with and without ancient DNA damage, and compared the results to other currently available methods (i.e., Kraken2/Bracken, KrakenUniq, MALT/HOPS, and Sigma). HAYSTAC identified fewer false positives than both Kraken2/Bracken, KrakenUniq and MALT in all simulations, and fewer than Sigma in simulations of ancient data. It uses less memory than Kraken2/Bracken, KrakenUniq as well as MALT both during database construction and sample analysis. Lastly, we used HAYSTAC to search for specific pathogens in two published ancient metagenomic datasets, demonstrating how it can be applied to empirical datasets. HAYSTAC is available from https://github.com/antonisdim/HAYSTAC.

AB - Identification of specific species in metagenomic samples is critical for several key applications, yet many tools available require large computational power and are often prone to false positive identifications. Here we describe High-AccuracY and Scalable Taxonomic Assignment of MetagenomiC data (HAYSTAC), which can estimate the probability that a specific taxon is present in a metagenome. HAYSTAC provides a user-friendly tool to construct databases, based on publicly available genomes, that are used for competitive reads mapping. It then uses a novel Bayesian framework to infer the abundance and statistical support for each species identification and provide per-read species classification. Unlike other methods, HAYSTAC is specifically designed to efficiently handle both ancient and modern DNA data, as well as incomplete reference databases, making it possible to run highly accurate hypothesis-driven analyses (i.e., assessing the presence of a specific species) on variably sized reference databases while dramatically improving processing speeds. We tested the performance and accuracy of HAYSTAC using simulated Illumina libraries, both with and without ancient DNA damage, and compared the results to other currently available methods (i.e., Kraken2/Bracken, KrakenUniq, MALT/HOPS, and Sigma). HAYSTAC identified fewer false positives than both Kraken2/Bracken, KrakenUniq and MALT in all simulations, and fewer than Sigma in simulations of ancient data. It uses less memory than Kraken2/Bracken, KrakenUniq as well as MALT both during database construction and sample analysis. Lastly, we used HAYSTAC to search for specific pathogens in two published ancient metagenomic datasets, demonstrating how it can be applied to empirical datasets. HAYSTAC is available from https://github.com/antonisdim/HAYSTAC.

U2 - 10.1371/journal.pcbi.1010493

DO - 10.1371/journal.pcbi.1010493

M3 - Journal article

C2 - 36178955

AN - SCOPUS:85139803752

VL - 18

JO - PLOS Computational Biology

JF - PLOS Computational Biology

SN - 1553-734X

IS - 9

M1 - e1010493

ER -
