Friday, April 15, 2016

Genomic Data: Binary Cuda K-Mer Counting on the GPU: Results

To review, I'm enthralled with the power of GPU and CUDA programming and needed a use-case to "play with" the technology.  My use case is pretty simple, I wanted to see how fast I could count all K-Mers in the full human reference genome using an inexpensive GPU card ($200) on commodity hardware (< $1000).

The results are in, I was able to count all 265,631 unique 9-Mers across the full human genome with 3,209,286,109 9-Mer sequences at over 15 per second.

In other words, in one second, I could get the full count (and locations) of fifteen different 9-Mers across the full 3.2 billion possibilities in the full human genome!

Remember, this is not an HPC but a commodity white-box computer and inexpensive video card.

The locations of the K-Mers across the genome are just as fast since they are known at the time the K-Mer is counted.

What was compelling to me was that the computer system (disk and CPU) were pretty much idle during this process and I could continue to get mail and even watch a streaming movie - on the host video card.  Meanwhile, the CUDA card was cranking away and the fans spinning like mad to keep it cool.

I believe I can increase this performance by another 25% or so by using registers to do the compares; but, I'm not sure the juice is worth the squeeze.

So here's a summary of how this went together.

I first pulled the ASCII gene representation from the full reference genome and changed the representation to a 3-bit lookup table (see previous post) and stored the data in one large file of continuous bases.  The chromosome break locations are represented separately in an index file.

Next, I pulled out all the unique 9-Mers and stored the unique string sequence (such as "ACTTGATTA") along with a 64-bit numeric representation of the sequence in a file based hash table.  

The CUDA program will first load the compressed file (3-bit) into host and device memory as an array of bytes from the file.  Next, it loads an indexed array of the numeric representations along with another indexed array of the letter sequences.  The numeric array is loaded into the device.

Each thread (billions) on the device process one and just one 9-Mer sequence from the original file.  That sequence is represented as a 64-bit number.  To get a count of that sequence, we then use that number to walk the array of numbers from the 9-Mer array and increment our count array at the same index.

In the final process, the system is just comparing two 64-bit numbers (for 9-Mers, this could have been 32 bit numbers); which explains the speed and why using registers should be that much faster.

When done, the arrays are passed back to the host and so we can do something like this:
  • Find the index of a given sequence from the character array of sequences.  For instance, ACTTGATTA might be at location 200,000.
  • Next, I can use that index to get the counts from the count array:  Count[200000]
CUDA is great for advanced math; but for pattern matching, it can't be beat - especially on a desktop machine.

For my next exercise, I want to put medical data into a graph database and use the power CUDA to do some intelligent pattern matching and trend-capturing.

An example use-case for this might be:  Take the list of prescription drugs and capture the drug-drug-interaction in a graph.  Another property might be the side effects of the drugs. From these associations, we could develop a solution where one might enter their meds and some new side effects and the software might lead them to understand how the meds may be causing the problems.  It would also be cool if we could include the FDA's information on SNP interaction with meds and have that also be a patient associated variable.

This was a fun and interesting project.

No comments:

Post a Comment