Automated Workflow For Processing, Distributing and Archiving of Raw Next Generation Sequence Data

With the introduction of the NextSeq platform for DNA sequencing, Illumina did away with the on-machine conversion of BCL files into FASTQ files, replacing it with their cloud-based BaseSpace solution. For the vast majority of analytical processes the starting point is FASTQ files, and for many NIH researchers an off-site, cloud-based solution for the bcl2fastq conversion is inappropriate. We have developed and deployed an automated pipeline to address this problem as encountered by NCI’s CCR Genomics Core. The pipeline not only solves the BCL-to-FASTQ problem, but also generates multiple Quality Control (QC) reports, gathers the QC metrics and run data into a relational database (accessible via a web interface), archives the data onto an object-based file system with attached metadata, and distributes both data and QC metrics via pre-signed URLs. This latter step greatly simplifies the task of moving large volumes of data (gigabytes) between the core facility and individual investigators, without the need for shared accounts or common passwords. With the exception of the bcl2fastq step, which MiSeq data does not require, this system has also been successfully deployed for the distribution of MiSeq data.

The key features of the process include: full automation; exclusive use of on-site NIH resources; archiving of data on an object-based file system with associated metadata tags; real-time monitoring and email notification upon completion; a web-based interface to metadata, QC metrics and usage reports; and simplified distribution of data via pre-signed URLs.
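
To illustrate why pre-signed URLs remove the need for shared accounts, the following is a minimal, generic sketch of the idea in Python. It is not the pipeline's actual implementation and does not follow the signing scheme of any particular object store; the secret key, host name, file path, and the function names presign and verify are all hypothetical. The server signs the file path together with an expiry time, and anyone holding the resulting URL can download that one file until the link expires.

```python
import hashlib
import hmac
import time
from typing import Optional
from urllib.parse import parse_qs, urlencode, urlparse

# Hypothetical secret, known only to the distribution server.
SECRET_KEY = b"demo-secret-known-only-to-the-server"


def _signature(path: str, expires: int) -> str:
    """HMAC-SHA256 over the object path and its expiry timestamp."""
    message = f"{path}\n{expires}".encode()
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()


def presign(base_url: str, path: str, ttl_seconds: int,
            now: Optional[float] = None) -> str:
    """Return a URL granting access to `path` for `ttl_seconds` seconds."""
    now = time.time() if now is None else now
    expires = int(now) + ttl_seconds
    query = urlencode({"expires": expires,
                       "signature": _signature(path, expires)})
    return f"{base_url}{path}?{query}"


def verify(url: str, now: Optional[float] = None) -> bool:
    """Server-side check: the link is unexpired and the signature matches."""
    now = time.time() if now is None else now
    parsed = urlparse(url)
    params = parse_qs(parsed.query)
    try:
        expires = int(params["expires"][0])
        signature = params["signature"][0]
    except (KeyError, IndexError, ValueError):
        return False
    if now > expires:
        return False
    # Constant-time comparison avoids leaking signature bytes via timing.
    return hmac.compare_digest(signature, _signature(parsed.path, expires))


if __name__ == "__main__":
    # Hypothetical host and object path, for illustration only.
    link = presign("https://objectstore.example.org",
                   "/runs/run_001/sample_A_R1.fastq.gz", ttl_seconds=3600)
    print(link)
    print(verify(link))
```

Because the signature covers both the path and the expiry time, a recipient cannot alter the link to reach a different file or to extend its lifetime, and no password ever has to be exchanged.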

Details of the process can be found here