Tuesday, March 8, 2016

Genomic Data: Binary vs ASCII - Part Three: CPU vs GPU


In the first post, I described some background information on computer storage, how byte arrangement "endianness" can impact our tools, and ASCII vs binary data.

In the previous post, I discussed the 3-bit format I used for this evaluation.  Let me summarize the approach and then we'll proceed to look at using a CPU vs GPU to do our processing.
  • For this Proof-of-Concept ("POC"), I first converted the full human genome, in fasta format, to a single file of contiguous characters of A,T,G,C, and Ns.  
  • Next, I converted these characters into a 3-bit format with each base represented by a particular 3-bit pattern.
  • The original hg38.fa file is 3,273,481,150 bytes long.  After I removed the line and chromosome breaks (**see more info below), the file in ASCII was 3,209,286,105 bytes long.  However, after moving this to 3-bit format, the file ended up 1,203,482,291 bytes -- a 63% reduction in size.
  • The 3-bit file format has all 3-bit characters in a contiguous pattern packed from left to right; regardless of the byte layout ("endianness") of the host platform.
The goal was to conduct pattern matching; that is, to get a count and optionally the location of certain nucleotide patterns; for example: find the count and location of all the 21-Mers of: ATATATATATATATATGGATA in the human genome.  

For both the CPU and GPU code, we conduct our comparisons in the same way:
  • Take the K-Mer we are looking for an convert it into our 3-bit pattern.
  • Using the sequence of 3-bit representations, we then pack those bits into the smallest native container we can use for comparison.  In this case, a 64 bit unsigned long long.  The final bits look like: 0011 0010 1100 1011 0010 1100 1011 0010 1100 1011 0010 1100 1100 1000 1100 1011 and so this is the pattern we are looking for.
  • In order to do our pattern matching, we need to first read the source file we want to search into memory.  In our POC, this is the full genome, but since the file is now only 1.2GB, it will fit easily into memory: either GPU or CPU.  This is key, we only load our source once and retain it in global memory for the threads to use.
  • Our final step is to read the bits from the target file into the same sized container as as our search string (for example, unsigned 64bit long).  Once we do that, we simply compare the variables holding the bits; and if they match, increase our counter and optionally get the file character position.

CPU Code

The CPU code works as just described:  it first stores the 3-bit sequences for our pattern in an unsigned 64 bit container, call it uint64SearchString.  Next, we walk the array of bytes and load up our bit sequences for the search count (21 in this case) 3 bits at a time. So we will start at the first byte and load it and the following six bytes (first 7 bytes) into our unsigned 64 container, call it uint64FindMatch variable.  To completely fill 63 characters we need in 3-bit format, we only need the first seven bits in the eighth byte and so we load those at the end of our uint64FindMatch container or variable.  Next, we simply compare the two variables and if they match, we know our patterns match.   Something simple like:  

if uint64SearchString is equal to uint64FindMatch then add one to Count

Simple and very fast!!

To get the full count, we continue to read through our byte array holding the full genome where we load our 21-Mer sequences into our container uint64SearchString; by positioning the start of our 3-bit sequence one place to the right for each iteration.  That is, for the second iteration, we start at the fourth bit in the first byte and read all the ensuing bits up to our 21-bit count from the following bytes.  Since the source bytes are preserved in memory, this is relatively fast.

I first wrote the code in C# which, like Java, is interpreted into machine language and so it's somewhat slower than native machine code.  Using C#, I was able to count our example 21-Mer across the whole genome in 11.96 seconds on a single-proc, four-core, machine. However, only one core was busy and so the CPU utilization showed only 25%. Let's try with four threads, one for each core.

The four threads achieved the same timing; with each thread running on one of the four cores and our system CPU utilization hitting 100%.  

Conclusion:  Using our CPU search pattern, we can get a 21-Mer count across the full genome in ~12 seconds per core using C# code.  Therefore, with four cores and searching for four patterns, we average ~3 seconds per sequence and would expect that to decrease proportionately as we add CPU cores.  But, there's a severe limit to how many CPU cores we can use.  CUDA to the rescue!

Before we move to CUDA and GPU code, let's take a look at C and native code to see if that makes a difference.

I modified the logic slightly and used a 7-Mer string held in a 32-bit unsigned long.  As expected, this resulted in our counts going from a couple of dozen for our 21-Mer to a million+ for the shorter search sequence.

Using a single thread for a random 7-Mer, the program took 8.37 seconds (using one core) to find the ~1.5 million matches.

So now we have our baselines for our CPU matching approach.  Let's look at CUDA.

GPU and CUDA programming

There are many advantages to using the GPU to handle our massively parallel problem:
  1. The GPU can have thousands of cores instead of just a handful on the CPU.  The $200 card I used for these tests has over 1000 cores.
  2. Threads are limited on the CPU since we have thread switch overhead; and because the CPU has to handle things such as keyboard input, network traffic, and a host of other things.  The GPU threads are dedicated and designed for quick, short activities.  In general, we typically don't wan to spin up more than four or five threads per CPU core.  So for my 8 core I7 test machine, I could spin up no more than about 40 threads.  However, on my GPU I can spin up thousands of threads to run simultaneously.
  3. We can add multiple cards to a single host system and derive computing performance of an HPC computer on an inexpensive desktop.
  4. I use a dedicated video card for my GPU calculations.  Even when it's maxed out, the calculations have a negligible impact on my host machine since the CPU is not utilized much with all the work done on the dedicated GPU.  In other words, it's like having a separate computer running in the same machine.
OK, so what happened when I ran my CPU program on the GPU?  It took minutes instead of seconds to complete.  The reason for this was that I was only using a single thread since the design was optimized for a multiple-threaded CPU application and not the GPU.

Remember, on the CPU we sent our search pattern to the thread and it used that to load all the sequences from the byte buffer of the full genome.  This is much more efficient by using a single thread for a single search pattern and walking the full file in memory.

While this design is optimal for the CPU and our CPU response is not bad (about 8 seconds per core for native C code (both Linux and Windows).  However, as we saw, we are limited to by the CPU cores.  A single search on a single thread saturates a single CPU core.

However, what if we use one (or more) search pattern(s) for all of our threads and had each thread work on only one source file sequence?  In other words, say we spin up a million threads and ask them to look for ten different 21-Mer sequences... We can pass those sequences to all million threads and each thread will open a small matching sequence slice from the source file: the full human genome in our case.

A word picture might look like this:
  • Take two 21-Mers and pass them to two threads.
  • The first thread reads the first twenty-one 3-bit "letters" from our byte array (the full human genome) and compares the value to the value of the two search patterns.
  • The second thread moves our start position over three bits (the size of our "letter") and loads the next 21 "letter" or 63 bits into its variable against which it will compare the collection of search patterns.
The CPU is more efficient when it can take a sample pattern and walk the source (full human genome) looking for a match.

Our GPU will reverse this.  It will take a K-Mer sequence (or many) and each thread will find a different but discreet K count of the 3-bit "letters" from the source (full human genome).  The thread(s) will then walk through the list of sequences we want to get a count for seeing if they match the one source sequence.

OK, so more test data to digest.  I ran this on our single-threaded C code both ways on the CPU.  The CPU-optimized-code finished in 8.37 seconds as mentioned above.  But the GPU-optimized-code took four times as long when run on the CPU.

As we observed above, the CPU-optimized-code took minutes to complete on the GPU since it ran on a single thread.  However, when we used our GPU-optimized-code and launched 500 threads over our 1024 cores, we were able to get a 21-Mer count in under .7 seconds. Most of this time was spent loading our sequence from the buffer and converting it to our type.  But once that is done, we can compare hundreds or perhaps thousands of K-Mers using the same source sequence spread across these thousands of threads in milliseconds each.

For example, using a single 21-Mer and our full genome 3-bit file, we can get a count of all matches in .720823 seconds.  When we add one more search sequences to our search, both counts return in .731270 seconds.  A total of four takes .759792 seconds.  Think about this for a moment.  We can get the counts of four random 21-Mers across the full human genome in under .19 seconds each! 

This is with none of the fine-tuned optimizations that the Nvidia/Cuda code profiler suggested (this IS a POC after all and I'm really just looking for relative values).

Plus, this does not use GPU SMID intrinsics which I believe would dramatically improve performance since they are hardware accelerated and we do our binary compares at the register level.   This is on my "look at next" list :)

SIDEBAR:  Here's how I think that might work.  Since these intrinsics compare the bytes of a 32 bit word, we would put our bit sequences in either 32 bit containers or a multiple of 32.  Empty bits would be zeroed out as would the empty bits from the source.  By only looking for a true/false match, the compares would stop as soon as we hit a non-match (most of the cases).  Since this is hardware accelerated, it should be incredibly fast.  

Another performance improvement should be seen if we expand out and use a larger number of sequences.  That is, what if pre-populate an array of all the K-Mers we want to search for holding them in the native container such as Unsigned 64bit long?  Using CUDA, our only overhead, once we have the source sequence in memory on the thread, would be to do the compare which, as we just illustrated, takes milliseconds.

For my next exercise, I plan to do just that.  That is, I will pre-compute several thousand 21-Mers and run them through this CUDA engine and determine how long it would take to do the counts.  I expect I can determine the counts of thousands of K-Mers in under a minute.  Fun stuff.

** OK some closing thoughts and disclaimers.
  • A question or problem one might see in all of this is to make this actually useful.  In other words, say we wanted to search across a particular chromosome and not the full genome.  What then?  I propose that we have a separate index file that will provide a list of offsets for sections of the whole file we want to delineate.  So if we use the index and know that Chromosome 17 is from Position 1234 and runs to Position 6789, we can then limit our searches and counts by those parameters. Should make things even faster.
  • CUDA programming has a rather steep learning curve and most of the program parameters are gated by the specifications of the video card itself and by the makeup of the data we send to it.  Trial and error here can be your friend.
  • As a reminder, this was a brute-force type of test and I was looking for relative performance improvements more than actual results.  However, I think these ideas can form the genesis of a new toolset that can run efficiently on commodity hardware and compete with most cloud based or HPC type systems.

No comments:

Post a Comment