Friday, February 12, 2016

Thoughts and Solutions for Processing Large Genomic and Other Datasets

The promise of personalized medicine is exciting: using our genetic makeup and known outcomes, providers will be able to offer a customized treatment plan instead of relying solely on a population-health approach.
There are challenges:
  • Datasets are large. The human genome comprises over 3 billion "base pairs" of nucleotides. When represented in ASCII ("A", "T", "C", "G", etc.), this means over 3GB of file space for each full genome map, and that is before metadata such as chromosome location.
  • There is very little standardization of tools and solutions.
  • Most tools come from the Open Source Community and as such, offer little to no support.
  • Most of the tools are written in Java and need to "pipe" the output from one program to the next. This requires enormous amounts of memory.
  • Although the genome is, in general, represented by the four primary nucleotides (ACTG), the pattern across the genome is highly random. Typical compression routines therefore do not lend themselves well to this data domain.
There were two primary objectives that drove me to explore some possible solutions to these challenges:
  1. File sizes are too large, and we need a way to more efficiently process, store, and move these large files.
  2. With the goal of a "computer on everyone's desk" that has driven my thinking over the past thirty-five years, we should be able to deliver an inexpensive genome processing "personal computer" that can at least match the output of so-called HPC, or High-Performance (or -Processing), Computers.
Most file compression routines are based on the fact that particular content has a large number of repeating fragments of data. The more frequently data repeats, the greater the compression.
As mentioned before, most computer compression algorithms do not work well with the DNA data domain. However, there are some cool alternatives. I'm particularly fond of the CRAM format, which achieves compression via: “...bases are reference compressed (Hsi-Yang Fritz, et al. (2011) Genome Res. 21:734-740) by encoding base differences rather than storing the bases themselves” (ref). Since most of the human genome is common to all of us, this makes enormous sense. CRAM also uses some level of binary formatting, which I believe is key to achieving both compression and improved processing speed (more below).
Several tools use a 2-bit or 4-bit representation of a base instead of the ASCII byte to achieve compression. For example, the sequence ATAT, when stored as ASCII, takes 4 bytes (32 bits), one for each letter. In a 4-bit representation, the same sequence consumes just 2 bytes (16 bits: 4 bits times four characters), a 50% compression. The best compression comes from the 2-bit solution, which achieves a 75% compression ratio. That is, 2 bits per character times 4 characters is one 8-bit byte. The prototype I built, and will discuss in depth later, uses a 3-bit sequence (62.5% compression).
So what's wrong with the 2-bit sequence? After all, it achieves superior compression as it stands. (Great discussion here.)
First off, we need our tool to discover and document insertions and deletions ("indels"), plus plain “dunno” reads (typically represented as "N"s). If we are processing a gene sequence and have used all of our keys (letters) in a 2-bit format, then we need an external lookup table for those anomalous reads. That external table reduces the overall compression and, more importantly, gets in the way of fast processing, as detailed below.
A 4-bit encoding is convenient since all our bit-shifting breaks on byte boundaries, but at the cost of significant compression. 3-bit is ideal but a bit more difficult to implement (bad pun). That is, 3-bit patterns give us 8 data points: enough for the four bases (ATCG), the “N” or unknown, and one for end-of-file or end-of-sequence. It even provides a placeholder for a reference lookup. For example, if position 999 holds the three-bit pattern 111, our program could go to the lookup table and pull in the reference genome using the offset marker denoted by those three bits.
With regard to #2 above, a “computer on everyone's desk”, I wanted to explore how one might build an inexpensive genomics computer (< $1000) for local processing. In the next post, I'll talk about the development process and results in greater detail; but for purposes of exploration, here's what I set out to do:
  • Determine if I could use a solid-state drive along with memory-mapped files to eliminate the need to read the full file contents into memory, and/or to pipe the output to other programs in the processing pipeline.
  • Evaluate the speed tradeoffs of managed, JIT-compiled languages (Java or .NET) vs. native code (C/C++).
  • Evaluate the benefits of getting as close to the wire as possible by utilizing binary/bit formats instead of letters such as ATCG.
  • Determine the cost-benefit of using the massively parallel processing power of NVIDIA CUDA graphics cards to do the work.
  • Determine if using native xPU (CPU or GPU) registers to do the bit comparisons would lead to performance improvements. That is, use xPU intrinsics to invoke “Single Instruction, Multiple Data” (“SIMD”) parallel processing.
I will describe my exploratory work in more detail in forthcoming blog posts, but for now, let me summarize the current results.
  • I pulled down the full human genome in ASCII (FASTA) format and extracted it to one large file. That is, just a huge file with ATCG and Ns.
  • To benchmark the tools, I decided to keep it simple: pass in a random sequence of bases (such as AATAAGC) and count the “k-mer” repeats. The example AATAAGC is a seven-mer, and the benchmark counts how many times it appears in the genome.
  • Using our benchmark tool, I then evaluated the performance using the following technical solutions:
    • Reading the pure, raw ASCII before converting to a bit format. Frankly, I was shocked at how fast this turned out to be. However, this was executed on an 8-core computer with 16GB of RAM, so the entire genome would fit into RAM. This gave me hope that by using a terser bit format in place of the byte-per-letter text format, I'd see further improvement.
    • I next converted the ASCII to a three-bit format. This required some manual bit loading and shifting, but it also allowed me to compare larger bit sequences at one time and reduce the loops in the code. Using managed code (.NET, but probably similar to Java), I could get a random k-mer count using a single thread in about 25 seconds after the genome file was fully loaded into RAM. Surprisingly, this was similar to what I was seeing using the raw text files under .NET. SIDEBAR: I have read that .NET will use assembly code to do string compares, and so I would expect to see pretty good performance if that's true.
    • I then moved the code to native C/C++ and compiled for both Linux and Windows and got almost identical results on both...about 2.5 seconds to get a random k-mer count across the full genome using the 3-bit sequence, once the full raw file was loaded into RAM.
    • I next decided to parallelize the C and .NET code running on the CPU for a single k-mer count. This increased the run time by a factor of ten, no doubt due to thread switches and memory swapping.
    • Finally, I used a single, inexpensive, CUDA-enabled NVIDIA video card to run the same parallelized code on the GPU (no thread switches) and can get a random k-mer count across the full human genome in about 0.7 seconds after the file is loaded.
    • What I have yet to evaluate is using intrinsics and the native xPU registers to do the bit comparisons. However, I'm not sure the juice is worth the squeeze. If I can use a very inexpensive video card to achieve these results, and can run more than one video card per desktop machine, then I think we've come close to proving that, by using the GPU to process genomic data, we can achieve massive parallel processing power under or on our desk.
What's next?
I plan to release some blog posts soon that will provide details of the above findings and will drop some code samples as well.