Estimation of site frequency spectra from low-coverage sequencing data using stochastic EM reduces overfitting, runtime, and memory usage

Research output: Contribution to journal, Journal article, Research, peer-review

Standard

Estimation of site frequency spectra from low-coverage sequencing data using stochastic EM reduces overfitting, runtime, and memory usage. / Rasmussen, Malthe Sebro; Garcia-Erill, Genís; Korneliussen, Thorfinn Sand; Wiuf, Carsten; Albrechtsen, Anders.

In: Genetics, Vol. 222, No. 4, iyac148, 2022.

Harvard

Rasmussen, MS, Garcia-Erill, G, Korneliussen, TS, Wiuf, C & Albrechtsen, A 2022, 'Estimation of site frequency spectra from low-coverage sequencing data using stochastic EM reduces overfitting, runtime, and memory usage', Genetics, vol. 222, no. 4, iyac148. https://doi.org/10.1093/genetics/iyac148

APA

Rasmussen, M. S., Garcia-Erill, G., Korneliussen, T. S., Wiuf, C., & Albrechtsen, A. (2022). Estimation of site frequency spectra from low-coverage sequencing data using stochastic EM reduces overfitting, runtime, and memory usage. Genetics, 222(4), Article iyac148. https://doi.org/10.1093/genetics/iyac148

Vancouver

Rasmussen MS, Garcia-Erill G, Korneliussen TS, Wiuf C, Albrechtsen A. Estimation of site frequency spectra from low-coverage sequencing data using stochastic EM reduces overfitting, runtime, and memory usage. Genetics. 2022;222(4):iyac148. https://doi.org/10.1093/genetics/iyac148

Author

Rasmussen, Malthe Sebro ; Garcia-Erill, Genís ; Korneliussen, Thorfinn Sand ; Wiuf, Carsten ; Albrechtsen, Anders. / Estimation of site frequency spectra from low-coverage sequencing data using stochastic EM reduces overfitting, runtime, and memory usage. In: Genetics. 2022 ; Vol. 222, No. 4.

Bibtex

@article{2c7044d74b9747f2875ad08eba7444d5,
title = "Estimation of site frequency spectra from low-coverage sequencing data using stochastic EM reduces overfitting, runtime, and memory usage",
abstract = "The site frequency spectrum (SFS) is an important summary statistic in population genetics used for inference on demographic history and selection. However, estimation of the SFS from called genotypes introduces bias when working with low-coverage sequencing data. Methods exist for addressing this issue, but sometimes suffer from two problems. First, they can have very high computational demands, to the point that it may not be possible to run estimation for genome-scale data. Second, existing methods are prone to overfitting, especially for multi-dimensional SFS estimation. In this article, we present a stochastic expectation-maximisation algorithm for inferring the SFS from NGS data that addresses these challenges. We show that this algorithm greatly reduces runtime and enables estimation with constant, trivial RAM usage. Further, the algorithm reduces overfitting and thereby improves downstream inference. An implementation is available at github.com/malthesr/winsfs.",
author = "Rasmussen, {Malthe Sebro} and Gen{\'i}s Garcia-Erill and Korneliussen, {Thorfinn Sand} and Carsten Wiuf and Anders Albrechtsen",
note = "{\textcopyright} The Author(s) 2022. Published by Oxford University Press on behalf of the Genetics Society of America. All rights reserved. For permissions, please email: journals.permissions@oup.com.",
year = "2022",
doi = "10.1093/genetics/iyac148",
language = "English",
volume = "222",
journal = "Genetics",
issn = "1943-2631",
publisher = "The Genetics Society of America (GSA)",
number = "4",

}

RIS

TY - JOUR

T1 - Estimation of site frequency spectra from low-coverage sequencing data using stochastic EM reduces overfitting, runtime, and memory usage

AU - Rasmussen, Malthe Sebro

AU - Garcia-Erill, Genís

AU - Korneliussen, Thorfinn Sand

AU - Wiuf, Carsten

AU - Albrechtsen, Anders

N1 - © The Author(s) 2022. Published by Oxford University Press on behalf of the Genetics Society of America. All rights reserved. For permissions, please email: journals.permissions@oup.com.

PY - 2022

Y1 - 2022

N2 - The site frequency spectrum (SFS) is an important summary statistic in population genetics used for inference on demographic history and selection. However, estimation of the SFS from called genotypes introduces bias when working with low-coverage sequencing data. Methods exist for addressing this issue, but sometimes suffer from two problems. First, they can have very high computational demands, to the point that it may not be possible to run estimation for genome-scale data. Second, existing methods are prone to overfitting, especially for multi-dimensional SFS estimation. In this article, we present a stochastic expectation-maximisation algorithm for inferring the SFS from NGS data that addresses these challenges. We show that this algorithm greatly reduces runtime and enables estimation with constant, trivial RAM usage. Further, the algorithm reduces overfitting and thereby improves downstream inference. An implementation is available at github.com/malthesr/winsfs.

AB - The site frequency spectrum (SFS) is an important summary statistic in population genetics used for inference on demographic history and selection. However, estimation of the SFS from called genotypes introduces bias when working with low-coverage sequencing data. Methods exist for addressing this issue, but sometimes suffer from two problems. First, they can have very high computational demands, to the point that it may not be possible to run estimation for genome-scale data. Second, existing methods are prone to overfitting, especially for multi-dimensional SFS estimation. In this article, we present a stochastic expectation-maximisation algorithm for inferring the SFS from NGS data that addresses these challenges. We show that this algorithm greatly reduces runtime and enables estimation with constant, trivial RAM usage. Further, the algorithm reduces overfitting and thereby improves downstream inference. An implementation is available at github.com/malthesr/winsfs.

U2 - 10.1093/genetics/iyac148

DO - 10.1093/genetics/iyac148

M3 - Journal article

C2 - 36173322

VL - 222

JO - Genetics

JF - Genetics

SN - 1943-2631

IS - 4

M1 - iyac148

ER -
