A reproducible workflow for amplicon-based microbial community analysis using the drake R package

Citation

Ortega-Polo R, Vishwakarma S, Tran L, Gregoris A, Guarna MM (2020) A reproducible workflow for amplicon-based microbial community analysis using the drake R package. Bioinformatics Community Conference. 19-21 July 2020, virtual meeting. https://bcc2020.github.io/

Plain language summary

Bees are very important for agriculture, and their microbiome (the collection of microorganisms living in their gut) is linked to health and immunity. Microbiome data are analyzed with computational tools that are available on different platforms and systems. While the Linux operating system is widely used in bioinformatics, many users do not have access to a computer with Linux, are unable to install Linux software, or do not know how to work with Linux systems. We developed a reproducible workflow for amplicon-based microbial community analysis that runs in the R statistical computing language, which is widely used by life scientists. The workflow can be used on Windows 10 or Linux computers, making this type of work more accessible.

Abstract

The use of workflow management systems promotes best practices in computational biology such as reproducibility, provenance tracking, and documentation of the steps and parameters used in analyses. Furthermore, the ability to restart a workflow from a given point in the analysis, instead of starting over, provides an efficient way of developing data analysis pipelines. The drake R package is a workflow management framework that allows users to design workflows and visualize their status in a reproducible and scalable manner. In our work, we used drake to design a pipeline for amplicon-based microbial community data, using DADA2 for denoising and taxonomic classification, and phyloseq and other R packages for visualization and data tidying. We implemented this workflow for the analysis of 16S rRNA microbial community datasets from the honey bee gut microbiome. The workflow has the advantage of enabling users to evaluate microbial communities from amplicon sequencing data while working entirely within R.
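
As an illustration of how such a pipeline can be declared, the following is a minimal sketch of a drake plan that chains DADA2 and phyloseq steps. It is not the authors' actual workflow: the target names, directory layout (`data/raw`, `data/filtered`), reference database path, and filtering parameters are all assumptions made for this example, and it handles single-end reads only for brevity.

```r
library(drake)

# Hypothetical plan: every path, target name, and parameter value below is an
# assumption for illustration, not taken from the original workflow.
plan <- drake_plan(
  # Locate raw FASTQ files and decide where the filtered copies will go
  raw_reads = list.files("data/raw", pattern = "fastq.gz$", full.names = TRUE),
  filtered_paths = file.path("data/filtered", basename(raw_reads)),

  # Quality filtering and trimming (writes the filtered files to disk)
  filter_stats = dada2::filterAndTrim(
    raw_reads, filtered_paths, truncLen = 240, maxEE = 2
  ),

  # Learn the error model; naming filter_stats makes drake run filtering first
  error_model = {
    filter_stats
    dada2::learnErrors(filtered_paths)
  },

  # Denoise, build the sequence table, and remove chimeras
  denoised = dada2::dada(filtered_paths, err = error_model),
  seq_table = dada2::removeBimeraDenovo(dada2::makeSequenceTable(denoised)),

  # Taxonomic classification against a reference training set tracked by drake
  taxonomy = dada2::assignTaxonomy(
    seq_table, file_in("data/ref/reference_train_set.fa.gz")
  ),

  # Combine results into a phyloseq object and draw a diversity plot
  ps = phyloseq::phyloseq(
    phyloseq::otu_table(seq_table, taxa_are_rows = FALSE),
    phyloseq::tax_table(taxonomy)
  ),
  richness_plot = phyloseq::plot_richness(ps, measures = "Shannon")
)

# With data in place, the plan would be built and inspected like this:
# make(plan)             # builds only targets that are missing or outdated
# vis_drake_graph(plan)  # interactive dependency graph showing target status
```

Defining the plan does not run any analysis; the commands are only evaluated by `make()`. Because drake tracks the dependencies between targets, changing a downstream step such as the plot would leave the earlier denoising targets untouched on the next `make()` call, which is the restart behavior described above.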