Building a high-quality, reliable, and efficient bioinformatics pipeline
Ziga Mahkovec
At Color we provide high-quality, physician-ordered, genetic testing at a low cost. A core component of this service is the bioinformatics pipeline: the software framework that processes data from the DNA sequencers in the clinical lab, finds genomic variants, annotates them for variant classification, and performs quality control.
We’ve previously published a few posters and papers ([1], [2], [3]) outlining our novel solutions for detecting and managing hard-to-call variants. This is an exciting and fast-evolving area of research, both at Color and across the field of bioinformatics.
But until today, we haven’t shared many details about our pipeline itself, and the role that distributed systems engineering plays in the future of bioinformatics and Color. Jeremy Ginsberg recently shared an introduction to the topic (worth a quick read for background/context). Here, we discuss more specific technical details and optimizations which may be of interest to other bioinformatics teams and, we hope, distributed systems engineers who are motivated to make an impact by solving hard computational problems at the center of precision medicine (we’re hiring!).
For those who are new to genetics and bioinformatics, here’s the typical “life of a sample” ordered as part of a clinical lab test like our Color Hereditary Cancer and Heart Health Test:
The main stages of the bioinformatics pipeline are:
- Converting DNA sequencer data: high-throughput Next Generation Sequencing (NGS) machines from Illumina produce raw and multiplexed sequencing data. In the case of Illumina NextSeq 550 sequencers in our lab, batches of 96 DNA samples are pooled and sequenced together; the pipeline then has to identify and demultiplex samples using molecular barcodes (8 base pair unique barcodes attached to each sample). Each sequencing batch from these sequencers produces around 100GB of compressed data: sequencer reads and quality information (i.e. confidence) for each position in the read.
- Aligning DNA reads: the sequencer outputs reads that are short 150 base pair DNA fragments, randomly distributed across the whole genome (or parts of the genome for targeted panels). To be able to assemble the DNA sequence for a particular sample, we need to align these reads against the reference human genome. This is performed using BWA alignment, a computationally intensive process that needs to find every 150 base pair sequence in the 3 billion base pair human genome.
- Calling variants: Once we’ve assembled the DNA sequence, we need to find variants (i.e. mutations) in the DNA. There are many different types of variants, and they generally require specialized approaches and tools to find. They range from single nucleotide polymorphisms (SNPs), to small insertions and deletions, to larger deletions and duplications: copy number variants, or CNVs (see What is a variant? for a more detailed description of variants). Short-read sequencing technology makes some of these variants exceptionally hard to find, so secondary confirmation is often used to increase confidence in the callers: orthogonal lab technology is used to confirm the presence of a variant.
- Annotating variants for classification: downstream from the bioinformatics pipeline, our variant scientists and medical geneticists need to classify novel variants and interpret them in the context of the user’s health information. To streamline the variant classification process, the bioinformatics pipeline annotates variants using information from various external data sources: variant nomenclature, population frequency using the gnomAD database, functional prediction of the variant using Alamut Batch, comparative interpretation by other labs using the ClinVar database, etc.
- Computing quality control metrics: every part of the bioinformatics pipeline is subject to detailed quality control (QC) measurement. These QC metrics are vital for variant confidence measurements, determining whether DNA sample quality is sufficient for reporting, diagnosing lab performance, and troubleshooting.
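As a rough illustration of the demultiplexing step in the first stage, the sketch below assigns a read to a sample by comparing its 8 base pair index sequence against the known sample barcodes. The one-mismatch tolerance, the sample names, and the barcode sequences are assumptions made for this example; in a real pipeline this step is handled by a dedicated tool such as bcl2fastq.

```python
from typing import Dict, Optional


def hamming(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def assign_sample(index_read: str, barcodes: Dict[str, str],
                  max_mismatches: int = 1) -> Optional[str]:
    """Return the sample whose barcode uniquely matches the index read.

    Returns None when no barcode is within max_mismatches, or when more
    than one barcode is (an ambiguous read is safer to discard).
    """
    hits = [sample for sample, barcode in barcodes.items()
            if hamming(index_read, barcode) <= max_mismatches]
    return hits[0] if len(hits) == 1 else None


# Hypothetical barcodes for two samples in a pooled batch.
BARCODES = {"sample_A": "ACGTACGT", "sample_B": "TTGCAAGC"}

print(assign_sample("ACGTACGA", BARCODES))  # one mismatch away from sample_A
print(assign_sample("GGGGGGGG", BARCODES))  # no barcode is close enough
```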
It sounds relatively straightforward, but as shown in this workflow, each sample actually runs through more than 20 distinct processes with complex dependencies:
In this blog post, we outline the guiding principles of building our bioinformatics pipeline: quality, reliability, performance, and cost efficiency. We offer some insights into how we implemented these principles at high throughput in a production clinical setting.
Quality
The most important attribute of any clinical bioinformatics pipeline is correctness: all variants in the biological samples have to be detected correctly; both false positives and false negatives have clinical implications. By comparison, research-grade pipelines often focus on easier-to-detect variants and are comfortable with limited sensitivity, especially for CNVs. For context, we note that a clinical bioinformatics pipeline is just one of many procedures and systems required to achieve Color’s overall accuracy/quality goals:
- We have built and operate a CAP-accredited/CLIA-certified lab, one of the few reviewed and permitted by New York’s CLEP
- We insist on rigorous validation of our genetic tests
- We apply strict ACMG variant classification guidelines
- We utilize novel additional QC (quality control) metrics and safeguards, above and beyond those specified by AMP/CAP
- We comply with the Health Insurance Portability and Accountability Act (HIPAA)
One novel approach we use on the software side is running extensive regression tests for all bioinformatics pipeline code changes. By running every new software release against thousands of previously processed samples, we have been able to increase confidence in software changes. Though this approach may seem fairly routine to engineers accustomed to working on distributed systems at large consumer internet companies, it’s very rare in this domain.
To even consider this approach, the pipeline itself must be exceedingly reliable, reproducible, and able to be massively parallelized with a low end-to-end runtime, even when processing many thousands of samples. Without these characteristics, a regression test of this scale is too cumbersome to manage.
How does a regression test work? The outputs of the bioinformatics pipeline (variant calls, annotations, and quality control metrics) are automatically compared between releases. Any changes in the output have to be reviewed and explained by code changes going into the new release; unwanted changes trigger an investigation and prevent the release from being deployed to production until resolved. This allows us to deploy a new release of the bioinformatics pipeline every few weeks, while maintaining the high quality and necessary validation for each release. The faster iteration cycle means we can keep the pace of development high and continue work on delivering new genetic tests to our users.
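A minimal sketch of such a release-to-release comparison follows, assuming each release's variant calls have been reduced to a mapping from (sample, variant) to genotype. The data layout, the function name `diff_calls`, and the example variants are illustrative assumptions, not a description of our actual tooling.

```python
from typing import Dict, List, Tuple

# Key: (sample_id, variant description); value: called genotype.
Calls = Dict[Tuple[str, str], str]


def diff_calls(baseline: Calls, candidate: Calls) -> Dict[str, List[Tuple[str, str]]]:
    """Classify every difference between two releases' variant calls."""
    base_keys, cand_keys = set(baseline), set(candidate)
    return {
        # Called by the old release but missing in the new one.
        "lost": sorted(base_keys - cand_keys),
        # Called only by the new release.
        "gained": sorted(cand_keys - base_keys),
        # Called by both, but with a different genotype.
        "changed": sorted(k for k in base_keys & cand_keys
                          if baseline[k] != candidate[k]),
    }


# Made-up calls from two releases run on the same samples.
baseline = {("S1", "chr17:g.100A>T"): "0/1", ("S2", "chr13:g.200del"): "0/1"}
candidate = {("S1", "chr17:g.100A>T"): "1/1", ("S3", "chr2:g.300C>G"): "0/1"}

report = diff_calls(baseline, candidate)
release_blocked = any(report.values())  # any unexplained difference blocks the release
print(report, release_blocked)
```

An empty report means the two releases produced identical calls; every entry in a non-empty report would need to be traced back to an intended code change before the release can ship.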
To ensure comprehensiveness of the regression integration tests, we run them against a large set of recent production samples and against known challenging datasets (samples with rare and hard-to-call variants); this makes sure we capture every critical code path and run tests with data that’s representative of the current conditions in the lab: assay, lab process, and sequencing hardware.
Reproducibility of the pipeline code is vital for running regression tests — to be able to compare different software releases, the same sequencing input has to produce the same output for every run. This is not generally the case for DNA alignment tools and variant callers; these are stochastic processes that often run in a multithreaded mode, both of which are sources of non-determinism. We took the following approaches to ensure determinism:
- Use fixed random number generator seeds throughout the code, or pass them as parameters to third-party tools.
- Somewhat counterintuitively, we avoid running tools in multithreaded mode and instead parallelize work at the process level; this requires breaking the input file into smaller units, running multiple invocations of the tool in parallel (each running in single-threaded mode), and merging the output data.
- Checksum input and output data to verify that files are identical between runs.
As a result, we have eliminated non-determinism as a source of spurious regressions during backtesting.
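The three approaches above can be sketched together as follows. This is a toy example under stated assumptions: `process_chunk` is a stand-in for a single-threaded third-party tool, the seed value and chunk size are arbitrary, and `map_fn` defaults to the built-in `map` so the example runs anywhere.

```python
import hashlib
import random
from typing import Callable, Iterable, List, Sequence


def split_into_chunks(items: Sequence[str], chunk_size: int) -> List[List[str]]:
    """Break the input into fixed-size chunks in a stable order."""
    return [list(items[i:i + chunk_size]) for i in range(0, len(items), chunk_size)]


def process_chunk(chunk: List[str], seed: int = 42) -> List[str]:
    """Stand-in for a single-threaded stochastic tool.

    A local, explicitly seeded generator makes the result depend only on
    the chunk contents and the seed.
    """
    rng = random.Random(seed)
    return [f"{read}:{rng.randint(0, 9)}" for read in chunk]


def run_pipeline(items: Sequence[str], chunk_size: int,
                 map_fn: Callable = map) -> List[str]:
    """Process chunks independently, then merge them in chunk order."""
    merged: List[str] = []
    for part in map_fn(process_chunk, split_into_chunks(items, chunk_size)):
        merged.extend(part)  # input order, not completion order
    return merged


def checksum(lines: Iterable[str]) -> str:
    """SHA-256 digest of the output, used to compare runs."""
    digest = hashlib.sha256()
    for line in lines:
        digest.update(line.encode("utf-8") + b"\n")
    return digest.hexdigest()


reads = [f"read_{i}" for i in range(10)]
first = checksum(run_pipeline(reads, chunk_size=3))
second = checksum(run_pipeline(reads, chunk_size=3))
print(first == second)  # identical inputs and seed give identical output
```

To parallelize at the process level, `map_fn` could be swapped for the `map` method of a `multiprocessing.Pool` or a `concurrent.futures.ProcessPoolExecutor`; both return results in input order, so the merged output stays the same.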
The other key requirements for running large regression tests are cost, speed, and reliable automation:
- We must keep the per-sample computation costs very low.
- We must be able to re-analyze thousands of samples in just a few hours.
- We can’t tolerate routine manual interventions: automation must work reliably, all the time.
The next sections outline the work required to achieve these goals.
Reliability
To avoid human error and increase our bioinformatics throughput, we invested heavily in automation. Some of this was by necessity: our bioinformatics engineering team is small and any operational overhead takes time away from research and development.
We started by making sure the entire pipeline workflow is automated:
- When the lab operations team loads the DNA sequencers with samples, an automated process detects new files being written to the network attached storage system in the lab.
- The process immediately starts uploading sequencer data to Amazon S3. Another process running in AWS EC2 monitors the new files and watches for sequencing completion markers. The moment a sequencing run is complete, a message is enqueued to trigger a new bioinformatics pipeline run for the new batch of samples.
- AWS EC2 instances are automatically scaled up to provision for the computing requirements of the new workload. These EC2 instances are also automatically terminated when the computing is no longer needed (i.e. the pipeline is complete) in order to reduce cost. All data and logs are persisted in the Amazon S3 storage system.
- A pipeline run starts processing the data. Dependencies are automatically resolved and the workflow progresses until all required tasks successfully complete.
- Once the pipeline is complete, it automatically triggers downstream processes: quality control approval, variant import into the database for variant classifications, and sending out any necessary notifications to teams that need to interpret the data.
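The completion check in the second step above could look roughly like the sketch below. It assumes object keys of the form `<run_id>/<path>` and a marker file named `RTAComplete.txt`; both the key layout and the marker name are assumptions for illustration, as is the function name `completed_runs`.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set

# Assumption: the sequencer writes this file last; the real marker may differ.
COMPLETION_MARKER = "RTAComplete.txt"


def completed_runs(keys: Iterable[str], already_triggered: Set[str]) -> List[str]:
    """Return run folders that contain the completion marker and have not
    yet had a pipeline run triggered for them."""
    files_by_run: Dict[str, Set[str]] = defaultdict(set)
    for key in keys:
        run_id, _, rest = key.partition("/")
        if rest:  # ignore keys that are not inside a run folder
            files_by_run[run_id].add(rest)
    return sorted(run_id for run_id, files in files_by_run.items()
                  if COMPLETION_MARKER in files and run_id not in already_triggered)


# Hypothetical listing of uploaded objects.
uploaded = [
    "run_001/Data/a.bcl",
    "run_001/RTAComplete.txt",  # finished, not yet triggered
    "run_002/Data/b.bcl",       # still sequencing
    "run_003/RTAComplete.txt",  # finished, already triggered
]
print(completed_runs(uploaded, already_triggered={"run_003"}))
```

Tracking which runs were already triggered keeps the check safe to repeat: polling the same listing twice does not enqueue a second pipeline run for the same batch.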
Of course, automation only works well if these processes are fault-tolerant and don’t require manual interventions. Our systems’ reliability has improved over time: as new failure modes are detected, we investigate and address the root cause. Root cause analysis is vital here; while it’s possible to improve the reliability of idempotent processes with automatic retries, relying on retries alone generally leads to brittle systems. Our team built a culture of zero runtime exceptions; every pipeline failure gets reported to our team’s Slack channel, so it’s very visible. The oncall engineer is responsible for investigating the exception and preventing repeat failures.
The causes of intermittent failures in the bioinformatics pipeline are twofold: distributed system failure modes and the unpredictability of biological input. The former are familiar; like any large distributed system, we have to deal with hardware failures, network issues, memory or disk resource exhaustion, etc. The latter is a problem specific to processing biological data; unlike processing digital input, the diversity of inputs (everyone’s DNA is unique) and sensitive thermal and chemical processes in the lab contribute to a wide range of input parameters. For example, small fluctuations in temperature in the lab can negatively affect the PCR amplification process, leading to significantly reduced coverage of sequenced data. The pipeline has to properly handle such outliers in data input, by either attempting to make variant calls with the same confidence, or by rejecting the samples at QC review and requiring the sample to be processed in the lab again.
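To make the coverage example concrete, here is a minimal sketch of a QC gate that rejects a sample whose sequencing depth is too low. The thresholds, the 20x cutoff, and the function name `qc_decision` are hypothetical values chosen for the example, not our actual acceptance criteria.

```python
from statistics import mean
from typing import List

# Hypothetical thresholds; real acceptance criteria are assay-specific.
MIN_MEAN_COVERAGE = 100.0
MIN_FRACTION_AT_20X = 0.99


def qc_decision(depths: List[int]) -> str:
    """Return "pass" or "reject" from per-position read depths."""
    if not depths:
        return "reject"  # no data at all cannot pass QC
    fraction_at_20x = sum(d >= 20 for d in depths) / len(depths)
    if mean(depths) >= MIN_MEAN_COVERAGE and fraction_at_20x >= MIN_FRACTION_AT_20X:
        return "pass"
    return "reject"


print(qc_decision([150] * 100))             # uniformly well covered
print(qc_decision([150] * 90 + [5] * 10))   # good mean, but 10% of positions poorly covered
```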
A key performance indicator for our team is time between manual interventions. Over time, our emphasis on root cause analysis allowed us to keep the bioinformatics pipeline running autonomously for up to 30 days with no intervention, all the while processing thousands of clinical samples, comprising 30+TB of data and utilizing 100,000+ CPU-hours.
Performance
As mentioned above, being able to process a high volume of samples with low turnaround time allows us to run large regression tests. Similarly, research and development benefits from short iteration cycles — for example, when we develop a new version of our assay.
Turnaround time is also important for production clinical samples, to improve the client experience of genetic testing. When looking at Color’s overall client-facing turnaround time, the majority of days are spent either in the clinical lab or our thorough analysis/classification of called variants. For example, an Illumina NovaSeq sequencer alone takes around 40 hours to sequence a batch of DNA samples. So why does the runtime/reliability of the bioinformatics pipeline matter? Cascading errors, for one reason: a slow pipeline that later requires manual intervention and a second attempt can easily add several days to processing. We’ve spent years focusing on making the pipeline run faster and more reliably, to the point that we now return results in less than 2 hours on average, regardless of the number of samples processed in parallel.
To achieve this level of performance, we first focused on efficient and highly parallel use of AWS EC2 resources. DNA alignment and variant calling are the most computationally intensive tasks, and may use up an entire EC2 instance for each sample for a short amount of time (one of the instance types we use is c5.9xlarge, with 36 CPU cores and 72GB of memory!).
There’s a large set of both third-party and homegrown bioinformatics tools involved in running our pipeline:
- DNA alignment: BWA-MEM.
- Variant calling: GATK for small variants, Scalpel for insertions and deletions, CNVkit for larger events (also known as copy number variations), homegrown tools for split-read calling, insertion assembly, and machine-learning classifiers for filtering.
- Various bioinformatics tools: bcl2fastq, samtools, Picard, etc.
Each tool was individually profiled and tuned to make sure CPU, memory and storage resources are efficiently utilized given the size of our workload. What makes this harder is the fact that these tools weren’t necessarily designed for running in a parallel high-throughput environment. While there exist newer tools with better performance characteristics (e.g. minimap2 for alignment, or sambamba for processing aligned reads), our emphasis on quality and correctness requires us to first and foremost use industry-accepted and validated tools.
Here are two specific discoveries/insights which had noteworthy performance impacts:
- Running GATK to call variants on just the targeted regions (and in parallel per chromosome) sped up variant calling by more than 10x.
- Large data transfers (100GB+) from S3 were once a major bottleneck, so we made a mix of changes that sped them up by more than 15x: we use the AWS boto3 library for multipart downloads, Python multiprocessing to benefit from multiple cores, and throughput-optimized HDD EBS volumes.
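The idea behind the parallel S3 download can be sketched as follows: split the object into byte ranges and let separate worker processes fetch one range each. This is a simplified illustration, not our production code; the part size and the helper names `byte_ranges` and `fetch_range` are assumptions, and `fetch_range` is shown only to indicate how a ranged request is made with boto3.

```python
from typing import List, Tuple


def byte_ranges(total_size: int, part_size: int) -> List[Tuple[int, int]]:
    """Split [0, total_size) into inclusive (start, end) byte ranges."""
    if part_size <= 0:
        raise ValueError("part_size must be positive")
    return [(start, min(start + part_size, total_size) - 1)
            for start in range(0, total_size, part_size)]


def fetch_range(bucket: str, key: str, start: int, end: int) -> bytes:
    """Download one byte range; intended to run in a worker process."""
    import boto3  # imported lazily so the planning code runs without boto3

    s3 = boto3.client("s3")
    response = s3.get_object(Bucket=bucket, Key=key, Range=f"bytes={start}-{end}")
    return response["Body"].read()


print(byte_ranges(10, 4))  # [(0, 3), (4, 7), (8, 9)]
```

Each range can be handed to a separate process (for example through `multiprocessing.Pool.starmap`) and written at its own offset in the destination file, so that several cores and connections are used at once.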
Cost efficiency
To offer clinical-grade genetics at industry-leading price points, we must keep the costs of running the bioinformatics pipeline low. Some engineers are surprised to learn that, despite ongoing decreases in the overall cost of sequencing, the wetlab process itself (from DNA extraction to sequencing) is inherently a much bigger cost than running software. So why do we care about pipeline compute costs? Apart from the regression test mentioned above, we know that at some point in the future, storage and computational costs will match or exceed wetlab costs, especially as we move to whole-genome sequencing (WGS). WGS data for a single sample at 30x depth of coverage generates around 60GB of compressed data and can take several hundred CPU-hours to process.
A few of the decisions which help keep our pipeline costs low:
- Amazon EC2 Spot instances: Spot instances are spare compute capacity in AWS available at a discount. The downside to using Spot instances, of course, is that the application must handle sudden losses in the availability of the instance. It took a bit of tweaking to make our pipeline resilient to these events, but we found the transition to on-demand compute capacity easier than expected (no failures or manual intervention, and with only occasional duplicated/restarted tasks). Result: we were able to save around 70% of our computation costs.
- Efficient auto-scaling: the tasks needed to process a batch of samples are managed in our Celery distributed queue. By keeping track of the memory and CPU resource requirements of each task, we are careful to provision only the required number of EC2 instances. Additionally, the cluster is downsized as needed when tasks complete, to keep machine utilization high (and costs at a minimum).
- Amazon S3 storage classes: moving unused sequencing data to the Infrequent Access and Glacier tiers significantly reduces storage costs.
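For the auto-scaling decision described above, one simple way to estimate how many instances a set of queued tasks needs is first-fit decreasing bin packing over each task's CPU and memory requirements. The sketch below uses that heuristic as an assumption; the default instance shape matches the c5.9xlarge figures mentioned earlier, while the task sizes and the function name `instances_needed` are made up for the example.

```python
from typing import List, Tuple

Task = Tuple[int, int]  # (cpu_cores, memory_gb) required by one task


def instances_needed(tasks: List[Task], cores: int = 36, memory_gb: int = 72) -> int:
    """Estimate the instance count with first-fit decreasing packing."""
    free: List[List[int]] = []  # remaining [cores, memory_gb] per instance
    for cpu, mem in sorted(tasks, reverse=True):  # place the largest tasks first
        if cpu > cores or mem > memory_gb:
            raise ValueError("task does not fit on a single instance")
        for slot in free:
            if slot[0] >= cpu and slot[1] >= mem:
                slot[0] -= cpu
                slot[1] -= mem
                break
        else:
            # No existing instance has room: provision a new one.
            free.append([cores - cpu, memory_gb - mem])
    return len(free)


# Hypothetical queue: two alignment-sized tasks and one small QC task.
print(instances_needed([(18, 30), (18, 30), (4, 8)]))
```

Re-running the estimate as tasks complete gives the target cluster size to scale down to.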
Summary
Running a production clinical bioinformatics pipeline requires expertise both in computational biology and distributed systems. By leveraging our team of bioinformaticians and systems engineers, we’re able to achieve high sensitivity and specificity of variant calling, while keeping the operational overhead, runtime, and costs low.
We’re hoping some of these insights will help others in our field, and perhaps inspire the next generation of software engineers pursuing careers in health technology! Check out https://color.com/careers or reach out to bioinformatics@color.com if you have questions.