RefSeq database growth influences the accuracy of k-mer-based lowest common ancestor species identification

Date
2018-10-30
Publisher
BioMed Central
Abstract

In order to determine the role of the database in taxonomic sequence classification, we examine the influence of the database over time on k-mer-based lowest common ancestor taxonomic classification. We present three major findings: the number of new species added to the NCBI RefSeq database greatly outpaces the number of new genera; as a result, more reads are classified with newer database versions, but fewer are classified at the species level; and Bayesian-based re-estimation mitigates this effect but struggles with novel genomes. These results suggest a need for new classification approaches specially adapted for large databases.

Type
Journal article
Citation

Nasko, Daniel J, Koren, Sergey, Phillippy, Adam M, et al. "RefSeq database growth influences the accuracy of k-mer-based lowest common ancestor species identification." (2018) BioMed Central: https://doi.org/10.1186/s13059-018-1554-6.

Rights
This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.