Excluding Loci With Substitution Saturation Improves Inferences From Phylogenomic Data

Research output: Contribution to journal › Journal article › Research › peer-review

Standard

Excluding Loci With Substitution Saturation Improves Inferences From Phylogenomic Data. / Duchêne, David A.; Mather, Niklas; Van Der Wal, Cara; Ho, Simon Y.W.

In: Systematic Biology, Vol. 71, No. 3, 2022, p. 676-689.

Harvard

Duchêne, DA, Mather, N, Van Der Wal, C & Ho, SYW 2022, 'Excluding Loci With Substitution Saturation Improves Inferences From Phylogenomic Data', Systematic Biology, vol. 71, no. 3, pp. 676-689. https://doi.org/10.1093/sysbio/syab075

APA

Duchêne, D. A., Mather, N., Van Der Wal, C., & Ho, S. Y. W. (2022). Excluding Loci With Substitution Saturation Improves Inferences From Phylogenomic Data. Systematic Biology, 71(3), 676-689. https://doi.org/10.1093/sysbio/syab075

Vancouver

Duchêne DA, Mather N, Van Der Wal C, Ho SYW. Excluding Loci With Substitution Saturation Improves Inferences From Phylogenomic Data. Systematic Biology. 2022;71(3):676-689. https://doi.org/10.1093/sysbio/syab075

Author

Duchêne, David A. ; Mather, Niklas ; Van Der Wal, Cara ; Ho, Simon Y.W. / Excluding Loci With Substitution Saturation Improves Inferences From Phylogenomic Data. In: Systematic Biology. 2022 ; Vol. 71, No. 3. pp. 676-689.

Bibtex

@article{b44faf16af5d4c6d9131065298966b86,
title = "Excluding Loci With Substitution Saturation Improves Inferences From Phylogenomic Data",
abstract = "The historical signal in nucleotide sequences becomes eroded over time by substitutions occurring repeatedly at the same sites. This phenomenon, known as substitution saturation, is recognized as one of the primary obstacles to deep-time phylogenetic inference using genome-scale data sets. We present a new test of substitution saturation and demonstrate its performance in simulated and empirical data. For some of the 36 empirical phylogenomic data sets that we examined, we detect substitution saturation in around 50% of loci. We found that saturation tends to be flagged as problematic in loci with highly discordant phylogenetic signals across sites. Within each data set, the loci with smaller numbers of informative sites are more likely to be flagged as containing problematic levels of saturation. The entropy saturation test proposed here is sensitive to high evolutionary rates relative to the evolutionary timeframe, while also being sensitive to several factors known to mislead phylogenetic inference, including short internal branches relative to external branches, short nucleotide sequences, and tree imbalance. Our study demonstrates that excluding loci with substitution saturation can be an effective means of mitigating the negative impact of multiple substitutions on phylogenetic inferences. [Phylogenetic model performance; phylogenomics; substitution model; substitution saturation; test statistics.].",
author = "Duch{\^e}ne, {David A.} and Niklas Mather and {Van Der Wal}, Cara and Ho, {Simon Y.W.}",
note = "Publisher Copyright: {\textcopyright} The Author(s) 2021. Published by Oxford University Press on behalf of the Society of Systematic Biologists.",
year = "2022",
doi = "10.1093/sysbio/syab075",
language = "English",
volume = "71",
pages = "676--689",
journal = "Systematic Biology",
issn = "1063-5157",
publisher = "Oxford University Press",
number = "3",
}

RIS

TY - JOUR

T1 - Excluding Loci With Substitution Saturation Improves Inferences From Phylogenomic Data

AU - Duchêne, David A.

AU - Mather, Niklas

AU - Van Der Wal, Cara

AU - Ho, Simon Y.W.

N1 - Publisher Copyright: © The Author(s) 2021. Published by Oxford University Press on behalf of the Society of Systematic Biologists.

PY - 2022

Y1 - 2022

N2 - The historical signal in nucleotide sequences becomes eroded over time by substitutions occurring repeatedly at the same sites. This phenomenon, known as substitution saturation, is recognized as one of the primary obstacles to deep-time phylogenetic inference using genome-scale data sets. We present a new test of substitution saturation and demonstrate its performance in simulated and empirical data. For some of the 36 empirical phylogenomic data sets that we examined, we detect substitution saturation in around 50% of loci. We found that saturation tends to be flagged as problematic in loci with highly discordant phylogenetic signals across sites. Within each data set, the loci with smaller numbers of informative sites are more likely to be flagged as containing problematic levels of saturation. The entropy saturation test proposed here is sensitive to high evolutionary rates relative to the evolutionary timeframe, while also being sensitive to several factors known to mislead phylogenetic inference, including short internal branches relative to external branches, short nucleotide sequences, and tree imbalance. Our study demonstrates that excluding loci with substitution saturation can be an effective means of mitigating the negative impact of multiple substitutions on phylogenetic inferences. [Phylogenetic model performance; phylogenomics; substitution model; substitution saturation; test statistics.].

AB - The historical signal in nucleotide sequences becomes eroded over time by substitutions occurring repeatedly at the same sites. This phenomenon, known as substitution saturation, is recognized as one of the primary obstacles to deep-time phylogenetic inference using genome-scale data sets. We present a new test of substitution saturation and demonstrate its performance in simulated and empirical data. For some of the 36 empirical phylogenomic data sets that we examined, we detect substitution saturation in around 50% of loci. We found that saturation tends to be flagged as problematic in loci with highly discordant phylogenetic signals across sites. Within each data set, the loci with smaller numbers of informative sites are more likely to be flagged as containing problematic levels of saturation. The entropy saturation test proposed here is sensitive to high evolutionary rates relative to the evolutionary timeframe, while also being sensitive to several factors known to mislead phylogenetic inference, including short internal branches relative to external branches, short nucleotide sequences, and tree imbalance. Our study demonstrates that excluding loci with substitution saturation can be an effective means of mitigating the negative impact of multiple substitutions on phylogenetic inferences. [Phylogenetic model performance; phylogenomics; substitution model; substitution saturation; test statistics.].

U2 - 10.1093/sysbio/syab075

DO - 10.1093/sysbio/syab075

M3 - Journal article

C2 - 34508605

AN - SCOPUS:85125414784

VL - 71

SP - 676

EP - 689

JO - Systematic Biology

JF - Systematic Biology

SN - 1063-5157

IS - 3

ER -
