Consistent, comprehensive and computationally efficient OTU definitions

Jai Ram Rideout; Yan He; Jose Antonio Navas-Molina; William A Walters; Luke K Ursell; Sean M Gibbons; John H Chase; Daniel McDonald; Antonio Gonzalez; Adam Robbins-Pianka; Jose C Clemente; Jack Gilbert; Susan M Huse; Hong-Wei Zhou; Rob Knight; J Gregory Caporaso

doi:10.7287/peerj.preprints.411v2

Jai Ram Rideout^1,2, Yan He^3, Jose Antonio Navas-Molina^4, William A Walters^5, Luke K Ursell^6, Sean M Gibbons^7,8, John H Chase^9, Daniel McDonald^4, Antonio Gonzalez^10, Adam Robbins-Pianka^4,10, Jose C Clemente^2, Jack Gilbert^8,11, Susan M Huse^12, Hong-Wei Zhou^3, Rob Knight^10,13, J Gregory Caporaso^9

1 Center for Microbial Genetics and Genomics, Northern Arizona University, Flagstaff, AZ, USA

2 Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA

3 School of Public Health and Tropical Medicine, Southern Medical University, Guangzhou, Guangdong, China

4 Department of Computer Science, University of Colorado Boulder, Boulder, CO, USA

5 Department of Molecular, Cellular, and Developmental Biology, University of Colorado at Boulder, Boulder, CO, USA

6 Department of Chemistry and Biochemistry, University of Colorado Boulder, Boulder, CO, USA

7 Graduate Program in Biophysical Sciences, University of Chicago, Chicago, IL, USA

8 Institute for Genomics and Systems Biology, Argonne National Laboratory, Lemont, IL, USA

9 Department of Biological Sciences, Northern Arizona University, Flagstaff, AZ, USA

10 BioFrontiers Institute, University of Colorado at Boulder, Boulder, CO, USA

11 Department of Ecology and Evolution, University of Chicago, Chicago, IL, USA

12 Department of Pathology and Laboratory Medicine, Brown University, Providence, RI, USA

13 Howard Hughes Medical Institute, Boulder, CO, USA

Published: 2014-07-24
Accepted: 2014-07-24

Subject Areas: Bioinformatics, Ecology, Microbiology
Keywords: OTU picking, microbial ecology, microbiome, qiime, bioinformatics

Licence: This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ PrePrints) and either DOI or URL of the article must be cited.

Cite this article: Rideout JR, He Y, Navas-Molina JA, Walters WA, Ursell LK, Gibbons SM, Chase JH, McDonald D, Gonzalez A, Robbins-Pianka A, Clemente JC, Gilbert J, Huse SM, Zhou H, Knight R, Caporaso JG. 2014. Consistent, comprehensive and computationally efficient OTU definitions. PeerJ PrePrints 2:e411v2 https://doi.org/10.7287/peerj.preprints.411v2

Abstract

We present a performance-optimized algorithm, subsampled open-reference OTU picking, for assigning marker gene (e.g., 16S rRNA) sequences generated on next-generation sequencing platforms to operational taxonomic units (OTUs) for microbial community analysis. This algorithm provides benefits over de novo OTU picking (clustering can be performed largely in parallel, reducing runtime) and closed-reference OTU picking (all reads are clustered, not only those that match a reference database sequence with high similarity). Because more of our algorithm can be run in parallel relative to “classic” open-reference OTU picking, it makes open-reference OTU picking tractable on massive amplicon sequence data sets (though on smaller data sets, “classic” open-reference OTU clustering is often faster). We illustrate this by applying the algorithm to the first 15,000 samples sequenced for the Earth Microbiome Project (1.3 billion V4 16S rRNA amplicons). To the best of our knowledge, this is the largest OTU picking run ever performed, and we estimate that our new algorithm runs in less than one-fifth of the time that “classic” open-reference OTU picking would require. Through comparisons on three well-studied data sets, we show that subsampled open-reference OTU picking yields results that are highly correlated with those generated by “classic” open-reference OTU picking. An implementation of this algorithm is provided in the popular QIIME software package, which uses uclust for read clustering. All analyses were performed using QIIME’s uclust wrappers, though we provide details (aided by the open-source code in our GitHub repository) that will allow implementation of subsampled open-reference OTU picking independently of QIIME (e.g., in a compiled programming language, where runtimes should be further reduced). Our analyses should generalize to other implementations of these OTU picking algorithms. Finally, we compare parameter settings in QIIME’s OTU picking workflows and recommend settings for these free parameters that optimize runtime without reducing the quality of the results. These optimized parameters can vastly decrease the runtime of uclust-based OTU picking in QIIME.
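
The abstract does not enumerate the steps of subsampled open-reference OTU picking, so the following is only a toy Python sketch of one plausible reading: closed-reference assignment first, then de novo clustering of a random subsample of the failures to build new reference centroids, then closed-reference assignment of all failures against those centroids, and finally de novo clustering of whatever remains. Everything here is an assumption made for illustration: the function names, the position-by-position `identity` measure (a stand-in for uclust, not its algorithm), and the `threshold` and `fraction` values. It is not the QIIME implementation.

```python
import random


def identity(a, b):
    """Toy similarity: fraction of matching positions (assumed stand-in for uclust)."""
    n = max(len(a), len(b))
    if n == 0:
        return 1.0
    return sum(x == y for x, y in zip(a, b)) / n


def closed_reference(reads, refs, threshold):
    """Assign each read to the first reference meeting the threshold.

    Each read is handled independently, so this loop could be split
    across workers. Returns (otus, failures).
    """
    otus, failures = {}, []
    for read in reads:
        hit = next((r for r in refs if identity(read, r) >= threshold), None)
        if hit is None:
            failures.append(read)
        else:
            otus.setdefault(hit, []).append(read)
    return otus, failures


def de_novo(reads, threshold):
    """Greedy serial clustering: join the first matching centroid, else seed a new one."""
    otus = {}
    for read in reads:
        hit = next((c for c in otus if identity(read, c) >= threshold), None)
        otus.setdefault(read if hit is None else hit, []).append(read)
    return otus


def subsampled_open_reference(reads, refs, threshold=0.97, fraction=0.001, seed=0):
    """Hypothetical four-step sketch; parameter defaults are illustrative only."""
    rng = random.Random(seed)
    # Step 1: closed-reference assignment against the reference set.
    otus, failures = closed_reference(reads, refs, threshold)
    # Step 2: cluster a small random subsample of the failures de novo;
    # the resulting centroids act as additional reference sequences.
    k = min(len(failures), max(1, round(len(failures) * fraction))) if failures else 0
    new_refs = list(de_novo(rng.sample(failures, k), threshold))
    # Step 3: closed-reference assignment of all failures against the new centroids.
    step3, remaining = closed_reference(failures, new_refs, threshold)
    # Step 4: serial de novo clustering of the reads that still match nothing.
    step4 = de_novo(remaining, threshold)
    for part in (step3, step4):
        for centroid, members in part.items():
            otus.setdefault(centroid, []).extend(members)
    return otus
```

In this sketch only the subsample in step 2 and the leftover reads in step 4 pass through the serial `de_novo` routine; the bulk of the reads go through `closed_reference`, which is the part that can be parallelized. That is the property the abstract credits for the reduced runtime on very large data sets.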
