Lusia Fernanda Matiz Ceron is a PhD student at Universidad de los Andes, Colombia.
Abstract
DNA barcodes are standardized DNA sequences, usually 400 to 800 bp long, that vary at different taxonomic levels and make it possible to quickly identify new individuals of species that have already been sequenced and taxonomically classified. Several barcodes have been identified and evaluated for different groups in the tree of life. However, many groups still lack a good DNA marker and, even more so, accurate strategies for verifying their taxonomic affiliation. Several DNA barcodes have been proposed for plants, but their classification potential has not been evaluated systematically; as a result, none appears to excel above the others. One tool that has recently gained traction in this field is the Naïve Bayesian Classifier. This type of classifier assumes that attributes are independent of one another and assigns categories in a given context; it has mainly been used to classify genes such as the bacterial 16S rRNA gene. In the present study we evaluate the classification power of several plant biomarkers that may work as barcodes (trnL, rpoB, rbcL, matK, psbA-trnH and psbK) using a Naïve Bayesian Classifier, in order to determine which markers work best at different taxonomic levels.
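To make the classification idea concrete, the following is a minimal sketch of a k-mer based naïve Bayes classifier. It is not the implementation used in this study: the class name `KmerNaiveBayes`, the Laplace smoothing parameter `alpha`, the default `k`, and the toy sequences and taxon labels are all assumptions chosen for illustration.

```python
import math
from collections import Counter, defaultdict


def kmers(seq, k):
    """Return all overlapping k-mers of a DNA sequence."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


class KmerNaiveBayes:
    """Hypothetical multinomial naive Bayes over k-mer counts (illustration only)."""

    def __init__(self, k=8, alpha=1.0):
        self.k = k                           # k-mer (word) size
        self.alpha = alpha                   # Laplace smoothing (assumed value)
        self.counts = defaultdict(Counter)   # taxon -> k-mer counts
        self.totals = Counter()              # taxon -> total number of k-mers seen
        self.n_seqs = Counter()              # taxon -> number of training sequences

    def fit(self, seqs, labels):
        for seq, label in zip(seqs, labels):
            words = kmers(seq, self.k)
            self.counts[label].update(words)
            self.totals[label] += len(words)
            self.n_seqs[label] += 1
        return self

    def predict(self, seq):
        vocab = 4 ** self.k                  # number of possible k-mers over A, C, G, T
        n = sum(self.n_seqs.values())
        words = kmers(seq, self.k)
        best_label, best_score = None, -math.inf
        for label in self.counts:
            # log prior + sum of log likelihoods; k-mers are treated as independent
            score = math.log(self.n_seqs[label] / n)
            denom = self.totals[label] + self.alpha * vocab
            for w in words:
                score += math.log((self.counts[label][w] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data: invented sequences, not real barcode records.
train_seqs = ["ACGTACGTACGT", "ACGTACGAACGT", "TTGGCCTTGGCC", "TTGGCCTTGACC"]
train_labels = ["genus_A", "genus_A", "genus_B", "genus_B"]
model = KmerNaiveBayes(k=3).fit(train_seqs, train_labels)
```

Calling `model.predict("ACGTACGTAC")` assigns the query to whichever taxon gives its k-mers the highest posterior score. In practice the same scheme would be trained separately per taxonomic rank (family, genus, species) to compare how well each marker resolves each level.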
The classification performance of the proposed biomarkers differs: all of them have enough resolution to classify at the family level, and two of them (trnL and matK) performed best at the genus level. None of the markers had enough resolution at the species level. Increasing the k-mer size improves taxonomic classification, but the benefit is marginal relative to the computational cost. The confusion matrix indicates that genera with many species tend to be misclassified more often than genera with fewer species. Finally, we provide Greengenes-like databases derived from NCBI data for researchers who want to use these resources in their own work.
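As a rough illustration of how a confusion matrix can expose which genera are misclassified most often, the sketch below tallies (true, predicted) label pairs and derives a per-genus error rate. The function names and the toy labels are hypothetical and do not come from the study's data.

```python
from collections import Counter


def confusion_matrix(true_labels, predicted_labels):
    """Count how often each true label was assigned each predicted label."""
    return Counter(zip(true_labels, predicted_labels))


def error_rate_per_class(matrix):
    """Fraction of sequences of each true class that received a wrong label."""
    totals, errors = Counter(), Counter()
    for (true, predicted), n in matrix.items():
        totals[true] += n
        if true != predicted:
            errors[true] += n
    return {label: errors[label] / totals[label] for label in totals}


# Toy labels for illustration only.
true_genera = ["genus_A", "genus_A", "genus_A", "genus_B"]
predicted_genera = ["genus_A", "genus_B", "genus_B", "genus_B"]
matrix = confusion_matrix(true_genera, predicted_genera)
rates = error_rate_per_class(matrix)
```

Comparing these per-genus rates against the number of species in each genus is one simple way to check whether species-rich genera account for most of the errors.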