Saturday, March 26, 2016

Binary for Dummies: Explaining Endianness and Base Counts in the Simplest Way Possible

I've received feedback that this binary stuff is too complex to understand.  So let me start at the very foundation and build up from there.

Tuesday, March 8, 2016

Genomic Data: Binary vs ASCII - Part Three: CPU vs GPU

Review

In the first post, I described some background information on computer storage, how byte arrangement "endianness" can impact our tools, and ASCII vs binary data.

In the previous post, I discussed the 3-bit format I used for this evaluation.  Let me summarize the approach and then we'll proceed to look at using a CPU vs GPU to do our processing.
  • For this Proof-of-Concept ("POC"), I first converted the full human genome, in fasta format, to a single file of contiguous characters of A,T,G,C, and Ns.  
  • Next, I converted these characters into a 3-bit format with each base represented by a particular 3-bit pattern.
  • The original hg38.fa file is 3,273,481,150 bytes long.  After I removed the line and chromosome breaks (**see more info below), the file in ASCII was 3,209,286,105 bytes long.  However, after moving this to 3-bit format, the file ended up 1,203,482,291 bytes -- a 63% reduction in size.
  • The 3-bit file format has all 3-bit characters in a contiguous pattern packed from left to right; regardless of the byte layout ("endianness") of the host platform.
The goal was to conduct pattern matching; that is, to get a count and optionally the location of certain nucleotide patterns; for example: find the count and location of all the 21-Mers of: ATATATATATATATATGGATA in the human genome.  

For both the CPU and GPU code, we conduct our comparisons in the same way:
  • Take the K-Mer we are looking for and convert it into our 3-bit pattern.
  • Using the sequence of 3-bit representations, we then pack those bits into the smallest native container we can use for comparison.  In this case, a 64 bit unsigned long long.  The final bits look like: 0011 0010 1100 1011 0010 1100 1011 0010 1100 1011 0010 1100 1100 1000 1100 1011 and so this is the pattern we are looking for.
  • In order to do our pattern matching, we need to first read the source file we want to search into memory.  In our POC, this is the full genome, but since the file is now only 1.2GB, it will fit easily into memory: either GPU or CPU.  This is key, we only load our source once and retain it in global memory for the threads to use.
  • Our final step is to read the bits from the target file into the same sized container as our search string (for example, an unsigned 64-bit long).  Once we do that, we simply compare the variables holding the bits; and if they match, increase our counter and optionally get the file character position.
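As a rough sketch of the packing step in the list above, here is one way to put a K-Mer into a 64-bit container from left to right. The function names are made up for illustration. The codes for A and T follow the Part 2 examples, while the codes for C, G and N are assumptions and not necessarily the ones used in the POC.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical 3-bit codes: A=010 and T=001 follow the Part 2 examples;
   C, G and N are assumptions made only for this sketch. */
static uint64_t base_code(char base)
{
    switch (base) {
    case 'T': return 1;
    case 'A': return 2;
    case 'C': return 3;
    case 'G': return 4;
    default:  return 5;  /* treat anything else as N */
    }
}

/* Pack up to 21 bases into a 64-bit container, left to right,
   so the first base occupies the three most significant bits. */
uint64_t pack_kmer64(const char *kmer, size_t k)
{
    uint64_t packed = 0;
    for (size_t i = 0; i < k && i < 21; i++)
        packed |= base_code(kmer[i]) << (61 - 3 * i);
    return packed;
}
```

With this layout a 21-Mer uses the top 63 bits and the lowest bit stays zero, so two packed values can be compared with a plain equality test.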

CPU Code

The CPU code works as just described:  it first stores the 3-bit sequences for our pattern in an unsigned 64 bit container, call it uint64SearchString.  Next, we walk the array of bytes and load up our bit sequences for the search count (21 in this case) 3 bits at a time. So we will start at the first byte and load it and the following six bytes (first 7 bytes) into our unsigned 64 container, call it uint64FindMatch variable.  To completely fill the 63 bits we need for 21 characters in 3-bit format, we only need the first seven bits in the eighth byte and so we load those at the end of our uint64FindMatch container or variable.  Next, we simply compare the two variables and if they match, we know our patterns match.   Something simple like:  

if uint64SearchString is equal to uint64FindMatch then add one to Count

Simple and very fast!!

To get the full count, we continue to read through our byte array holding the full genome, loading each 21-Mer sequence into our container uint64FindMatch and positioning the start of our 3-bit sequence one place (one letter) to the right for each iteration.  That is, for the second iteration, we start at the fourth bit in the first byte and read all the ensuing bits up to our 21-letter (63-bit) count from the following bytes.  Since the source bytes are preserved in memory, this is relatively fast.
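The walk described above can be sketched in C as follows. This is a minimal illustration and not the POC code: read_bits and count_kmer are hypothetical names, and the bit-by-bit read favors clarity over speed (a tuned version would load whole bytes and shift).

```c
#include <stdint.h>
#include <stddef.h>

/* Read nbits (1..64) starting at bit_offset, most significant bit first,
   and return them left-aligned in a 64-bit container. */
static uint64_t read_bits(const uint8_t *buf, uint64_t bit_offset, unsigned nbits)
{
    uint64_t out = 0;
    for (unsigned i = 0; i < nbits; i++) {
        uint64_t pos = bit_offset + i;
        uint64_t bit = (buf[pos >> 3] >> (7 - (pos & 7))) & 1u;
        out |= bit << (63 - i);
    }
    return out;
}

/* Count how often a left-aligned k-mer pattern occurs among n_bases
   packed 3-bit letters. */
uint64_t count_kmer(const uint8_t *genome, uint64_t n_bases,
                    uint64_t pattern, unsigned k)
{
    uint64_t count = 0;
    if (k == 0 || k > 21 || n_bases < k)
        return 0;
    for (uint64_t start = 0; start + k <= n_bases; start++) {
        /* Each iteration moves the window one letter (3 bits) to the right. */
        if (read_bits(genome, 3 * start, 3 * k) == pattern)
            count++;
    }
    return count;
}
```

For a small check, the 12-letter string ATTATATTATTA packed with A=010 and T=001 is the five bytes 0x44 0xA2 0x89 0x44 0xA0, and searching it for TTA (0x2500000000000000 when left-aligned) finds three matches.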

I first wrote the code in C# which, like Java, is compiled to bytecode and then JIT-compiled to machine code at run time, and so it's somewhat slower than native machine code.  Using C#, I was able to count our example 21-Mer across the whole genome in 11.96 seconds on a single-proc, four-core, machine. However, only one core was busy and so the CPU utilization showed only 25%. Let's try with four threads, one for each core.

The four threads achieved the same timing; with each thread running on one of the four cores and our system CPU utilization hitting 100%.  

Conclusion:  Using our CPU search pattern, we can get a 21-Mer count across the full genome in ~12 seconds per core using C# code.  Therefore, with four cores and searching for four patterns, we average ~3 seconds per sequence and would expect that to decrease proportionately as we add CPU cores.  But, there's a severe limit to how many CPU cores we can use.  CUDA to the rescue!

Before we move to CUDA and GPU code, let's take a look at C and native code to see if that makes a difference.

I modified the logic slightly and used a 7-Mer string held in a 32-bit unsigned long.  As expected, this resulted in our counts going from a couple of dozen for our 21-Mer to a million+ for the shorter search sequence.

Using a single thread for a random 7-Mer, the program took 8.37 seconds (using one core) to find the ~1.5 million matches.

So now we have our baselines for our CPU matching approach.  Let's look at CUDA.

GPU and CUDA programming

There are many advantages to using the GPU to handle our massively parallel problem:
  1. The GPU can have thousands of cores instead of just a handful on the CPU.  The $200 card I used for these tests has over 1000 cores.
  2. Threads are limited on the CPU since we have thread switch overhead; and because the CPU has to handle things such as keyboard input, network traffic, and a host of other things.  The GPU threads are dedicated and designed for quick, short activities.  In general, we typically don't want to spin up more than four or five threads per CPU core.  So for my 8 core i7 test machine, I could spin up no more than about 40 threads.  However, on my GPU I can spin up thousands of threads to run simultaneously.
  3. We can add multiple cards to a single host system and derive computing performance of an HPC computer on an inexpensive desktop.
  4. I use a dedicated video card for my GPU calculations.  Even when it's maxed out, the calculations have a negligible impact on my host machine since the CPU is not utilized much with all the work done on the dedicated GPU.  In other words, it's like having a separate computer running in the same machine.
OK, so what happened when I ran my CPU program on the GPU?  It took minutes instead of seconds to complete.  The reason for this was that I was only using a single thread since the design was optimized for a multiple-threaded CPU application and not the GPU.

Remember, on the CPU we sent our search pattern to the thread and it used that to load all the sequences from the byte buffer of the full genome.  On the CPU it is much more efficient to use a single thread for a single search pattern and walk the full file in memory.

This design is optimal for the CPU, and our CPU response is not bad: about 8 seconds per core for native C code (on both Linux and Windows).  However, as we saw, we are limited by the number of CPU cores.  A single search on a single thread saturates a single CPU core.

However, what if we use one (or more) search pattern(s) for all of our threads and had each thread work on only one source file sequence?  In other words, say we spin up a million threads and ask them to look for ten different 21-Mer sequences... We can pass those sequences to all million threads and each thread will open a small matching sequence slice from the source file: the full human genome in our case.

A word picture might look like this:
  • Take two 21-Mers and pass them to two threads.
  • The first thread reads the first twenty-one 3-bit "letters" from our byte array (the full human genome) and compares the value to the value of the two search patterns.
  • The second thread moves our start position over three bits (the size of our "letter") and loads the next 21 "letters" or 63 bits into its variable against which it will compare the collection of search patterns.
The CPU is more efficient when it can take a sample pattern and walk the source (full human genome) looking for a match.

Our GPU will reverse this.  It will take a K-Mer sequence (or many) and each thread will find a different but discrete K count of the 3-bit "letters" from the source (full human genome).  The thread(s) will then walk through the list of sequences we want to get a count for, seeing if they match the one source sequence.
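To make the reversed logic concrete, here is a sketch in plain C where an ordinary loop stands in for the GPU thread grid. The names match_position and count_all are hypothetical. In a real CUDA kernel the position would be derived from the block and thread indices, and the counter updates would need an atomic add because many threads can hit the same counter at once.

```c
#include <stdint.h>
#include <stddef.h>

/* Read nbits (1..64) starting at bit_offset, most significant bit first,
   and return them left-aligned in a 64-bit container. */
static uint64_t read_bits(const uint8_t *buf, uint64_t bit_offset, unsigned nbits)
{
    uint64_t out = 0;
    for (unsigned i = 0; i < nbits; i++) {
        uint64_t pos = bit_offset + i;
        uint64_t bit = (buf[pos >> 3] >> (7 - (pos & 7))) & 1u;
        out |= bit << (63 - i);
    }
    return out;
}

/* Work done by one "thread": load the window at one source position once,
   then compare it against every search pattern. */
static void match_position(const uint8_t *genome, uint64_t position, unsigned k,
                           const uint64_t *patterns, size_t n_patterns,
                           uint64_t *counts)
{
    uint64_t window = read_bits(genome, 3 * position, 3 * k);
    for (size_t p = 0; p < n_patterns; p++)
        if (window == patterns[p])
            counts[p]++;
}

/* Serial stand-in for the thread grid: one call per source position. */
void count_all(const uint8_t *genome, uint64_t n_bases, unsigned k,
               const uint64_t *patterns, size_t n_patterns, uint64_t *counts)
{
    if (k == 0 || k > 21 || n_bases < k)
        return;
    for (uint64_t position = 0; position + k <= n_bases; position++)
        match_position(genome, position, k, patterns, n_patterns, counts);
}
```

The point of this layout is that the expensive part, loading the window, happens once per position, while each additional search pattern only adds one more integer compare.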

OK, so more test data to digest.  I ran this on our single-threaded C code both ways on the CPU.  The CPU-optimized-code finished in 8.37 seconds as mentioned above.  But the GPU-optimized-code took four times as long when run on the CPU.

As we observed above, the CPU-optimized-code took minutes to complete on the GPU since it ran on a single thread.  However, when we used our GPU-optimized-code and launched 500 threads over our 1024 cores, we were able to get a 21-Mer count in under .7 seconds. Most of this time was spent loading our sequence from the buffer and converting it to our type.  But once that is done, we can compare hundreds or perhaps thousands of K-Mers using the same source sequence spread across these thousands of threads in milliseconds each.

For example, using a single 21-Mer and our full genome 3-bit file, we can get a count of all matches in .720823 seconds.  When we add one more search sequence to our search, both counts return in .731270 seconds.  A total of four takes .759792 seconds.  Think about this for a moment.  We can get the counts of four random 21-Mers across the full human genome in under .19 seconds each! 

This is with none of the fine-tuned optimizations that the NVIDIA CUDA code profiler suggested (this IS a POC after all and I'm really just looking for relative values).

Plus, this does not use GPU SIMD intrinsics which I believe would dramatically improve performance since they are hardware accelerated and we do our binary compares at the register level.   This is on my "look at next" list :)

SIDEBAR:  Here's how I think that might work.  Since these intrinsics compare the bytes of a 32 bit word, we would put our bit sequences in either 32 bit containers or a multiple of 32.  Empty bits would be zeroed out as would the empty bits from the source.  By only looking for a true/false match, the compares would stop as soon as we hit a non-match (most of the cases).  Since this is hardware accelerated, it should be incredibly fast.  

Another performance improvement should be seen if we expand out and use a larger number of sequences.  That is, what if we pre-populate an array of all the K-Mers we want to search for, holding them in a native container such as an unsigned 64-bit long?  Using CUDA, our only overhead, once we have the source sequence in memory on the thread, would be to do the compare which, as we just illustrated, takes milliseconds.

For my next exercise, I plan to do just that.  That is, I will pre-compute several thousand 21-Mers and run them through this CUDA engine and determine how long it would take to do the counts.  I expect I can determine the counts of thousands of K-Mers in under a minute.  Fun stuff.

** OK some closing thoughts and disclaimers.
  • A question or problem one might see in all of this is to make this actually useful.  In other words, say we wanted to search across a particular chromosome and not the full genome.  What then?  I propose that we have a separate index file that will provide a list of offsets for sections of the whole file we want to delineate.  So if we use the index and know that Chromosome 17 is from Position 1234 and runs to Position 6789, we can then limit our searches and counts by those parameters. Should make things even faster.
  • CUDA programming has a rather steep learning curve and most of the program parameters are gated by the specifications of the video card itself and by the makeup of the data we send to it.  Trial and error here can be your friend.
  • As a reminder, this was a brute-force type of test and I was looking for relative performance improvements more than actual results.  However, I think these ideas can form the genesis of a new toolset that can run efficiently on commodity hardware and compete with most cloud based or HPC type systems.
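The index idea from the closing thoughts could be sketched like this. Everything here is hypothetical: the section_t layout, the function names, and the choice of a half-open range are assumptions, since no index format has been defined yet.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* One entry of a hypothetical index file: a named section and the
   half-open range [start, end) of base positions it covers. */
typedef struct {
    const char *name;
    uint64_t start;
    uint64_t end;
} section_t;

/* Return the entry called name, or NULL if the index does not contain it. */
const section_t *find_section(const section_t *index, size_t n, const char *name)
{
    for (size_t i = 0; i < n; i++)
        if (strcmp(index[i].name, name) == 0)
            return &index[i];
    return NULL;
}

/* Number of k-mer start positions that lie entirely inside the section. */
uint64_t windows_in_section(const section_t *s, unsigned k)
{
    uint64_t length = s->end - s->start;
    return (k == 0 || length < k) ? 0 : length - k + 1;
}
```

A search limited to one chromosome would then start at bit 3 * start of the packed file and test only windows_in_section positions instead of the whole genome.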


Monday, March 7, 2016

Genomic Data: Binary vs ASCII Part 2 - Three Bit Representation

This post will begin to describe the "Proof of Concept" work (or play I guess) I have done to explore two primary ideas:
  1. What impact will a terse binary format of the human genome have on file compression and pattern matching processing speed?
  2. If we use a binary format, what will be the impact of using low level (binary) computing via the CUDA GPU and/or, xPU intrinsics?
DISCLAIMER: This exercise was not intended to be a final solution or to replace existing tools. It was done to evaluate alternatives to some of the existing tools; and to explore how commodity hardware might be used for genomic processing instead of large (and expensive) High-Performance-Computing ("HPC") solutions, or cloud alternatives.

Background

The full reference human genome is currently 3,270,974,121 bytes in length in a fasta file format.  This includes headers, line breaks, etc. but generally represents the human genome with five primary single ASCII letters:  A,C,T,G,N which represent the various nucleotides.

Most current tools that process the genome are written in interpreted computer languages and work with these ASCII representations.  Of course there are exceptions, but in general, these tools will read in the full (or a partial via a stream) set of ASCII letters to begin their work.  This requires enormous amounts of computer memory and processing power. I submit that by changing the way we represent the data, we can reduce the computing resources needed.  Further, by using native programming constructs (C/C++ and Intrinsics) we can deliver a robust set of tools using inexpensive commodity hardware that can rival HPC systems.

The Simple Test

For this POC, I wanted to use a specific problem that might be representative of the broader domains of similar problems.  Specifically, I wanted to write a program that would take a random string of nucleotides, and search the entire human genome to count the number of instances found for that string.  This would be useful in k-mer counting; and pattern matching across the genome.

For example, we might want to count the number of instances, and find the location of:  ACTAAGGA.

Again, the goal is to simply evaluate and compare conducting this exercise on:
  • a machine with limited memory - using memory mapped files on top of SSDs to augment RAM
  • compare the performance of interpreted code with native code
  • compare the performance of the same code run on CPU vs the GPU

The Hypotheses 

  • Processing would be much more efficient if we use a more terse binary format instead of ASCII
  • Comparisons using a binary pattern will be orders of magnitude faster than using the ASCII format - primarily due to the more compressed patterns.
  • CUDA processing using one thread for every possible sequence in the original buffer (e.g. the full human genome) will be much more efficient than using CPU based parallel processing
  • Since CUDA employs a different construct for parallel processing, we should flip our pattern searches around to take advantage of the platform.

    That is, in a CPU based search, we might conduct our search as follows:  a) load the search string into the CPU register and then walk the source file to see where there are matches.  b) we could load a different search string into different threads (limited by the CPU) and use a global buffer to hold our source data (the human genome).

    In a GPU search pattern, we turn this around (which seems counter-intuitive).  That is, we load the source pattern (from the human genome) into the GPU register(s) and we then do a pattern match on our k-mer strings.  More info below.
  • The overarching belief is that most "processing power" for pattern matching is actually consumed by memory transfers and so our optimizations should focus on those to achieve better performance.  That is, everything we can do to reduce memory transfers will improve performance - especially in the CUDA world.

The Work

Let me attempt to break this down into smaller logical sections (and posts) to hopefully make it more understandable.

Let's use a short sample source search string:  ATTATATTATTA and a sample 3-mer of TTA to get our count of.  We can see that the TTA string appears three times in the source string.

ASCII 

First of all, if we keep this in ASCII byte format, our source will be 12 bytes (letters) times 8 bits or 96 bits long.  Our k-mer is three bytes times 8 bits or 24 bits long.  Therefore, at the binary level at which our systems work, we are searching for a 24 bit sequence across the 96 bit source.

To our computers this would look like this:
ASCII A is decimal 65 which in binary looks like:  0100 0001
ASCII T is decimal 84 which in binary looks like:  0101 0100

So our target string of TTA appears to our computers as the following 24 bits: 010101000101010001000001

And our source (searched) string of ASCII ATTATATTATTA appears as the following 96 bits:
010000010101010001010100010000010101010001000001010101000101010001000001010101000101010001000001

2-bit 

Now, let's assume the common 2-bit format for these data and let's assume that the "A" is represented by decimal 1 or binary 01.  The "T" can be decimal 2 or binary 10.

Now our target string of TTA appears to our computer as the following 6 bits: 101001

and our source (searched) string ATTATATTATTA appears as the following 24 bits:
011010011001101001101001
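The two bit strings above can be reproduced with a few lines of C. This is only a sketch: the function names are made up, A=01 and T=10 come from the text, and the codes for C and G are assumptions.

```c
#include <stddef.h>

/* 2-bit codes: A=01 and T=10 as in the text; C=00 and G=11 are assumptions. */
static unsigned code2(char base)
{
    switch (base) {
    case 'A': return 1;
    case 'T': return 2;
    case 'G': return 3;
    default:  return 0;  /* C */
    }
}

/* Write the 2-bit encoding of seq as '0'/'1' characters into out,
   which must have room for 2 * strlen(seq) + 1 characters. */
void to_2bit_string(const char *seq, char *out)
{
    size_t n = 0;
    for (; *seq != '\0'; seq++) {
        unsigned c = code2(*seq);
        out[n++] = (c & 2u) ? '1' : '0';   /* high bit of the code */
        out[n++] = (c & 1u) ? '1' : '0';   /* low bit of the code */
    }
    out[n] = '\0';
}
```

Calling to_2bit_string on TTA yields the 6 characters 101001, and on ATTATATTATTA the 24 characters shown above.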

As you can easily see, there are far fewer bits that need to be moved from disk, to memory, to xPU registers with the more terse 2-bit format.  This will be a huge performance gain.  Plus, we ultimately use the xPU registers to hold our binary patterns and compare those with some memory contents.  If we have smaller registers or larger search patterns, the system would need to "chunk" up the larger patterns by swapping in and out of memory.  These memory swaps are what we are working to avoid!!

For example, if we use the ASCII format of 8-bit base representations, and assume 32-bit registers...we can compare four bases in our registers at a given time.  However, once we cross that barrier, by say looking for 11-mers, we would need to "chunk up" our data and hold some level of matching state across pattern matches.  That is, we look for the first part of the pattern across the first 32 bits and the next part of the pattern over the next and so on.

However, if we use a 2-bit format, our 11-mer can fit into 22 bits (instead of 88) and thus we can more efficiently use the xPU registers for our compares.

KEY POINT:  If we use a 2-bit format for our data instead of 8-bit ASCII, we move only a quarter of the bits, reducing our memory operations by roughly 75%, and we should see performance improve by a comparable margin!

Let's talk 3-bit

One problem with 2-bit is that we need to represent more than four bases.  At the very least, we need the "N" variable and so five items.  We could make our lives easier by using a 4-bit pattern since it would fit nicely in our 8-bit, 32-bit, 64-bit, etc. native containers; but we don't need all four bits.  For our use, we have opted for a 3-bit pattern since it is of optimal size and will accommodate 8 items or entries.  For the POC example, I included the following:  A,T,C,G,N,EOF, and Chromosome_Break; with one item left for whatever.

What makes this 3-bit pattern difficult is that it won't break on the byte (or other native container) boundary, since 3 does not divide evenly into 8.  That is, a 4-mer sequence such as TTAA represented in 8-bit ASCII would fit nicely in 4 bytes; and would also fit nicely in a single byte when represented by the 2-bit format; or in two bytes when represented by the 4-bit format. In other words, we never break a base "letter" over the byte boundary when we use a bit format that divides evenly into 8 (2-bit, 4-bit, or 8-bit).

However, our 3-bit pattern is not as easy to work with.  For example, our search pattern of TTAA in 3-bit might look like:  "A" as decimal 2 in 3-bit would be 010 and T as decimal 1 in 3-bit would be 001.  Our 4-mer would then need 12 bits: 001 001 010 010.

But remember that all data is stored as bytes on the computer.  Therefore, we need to have these 12 bits fit into our 8-bit byte native containers.  Assuming we don't want any extraneous bits in our storage, we would need to break our bits at the third base (assuming we load from left to right):

So our first byte would be: 001 + 001 + 01  and our second byte would be 0010 0000
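Here is a small C sketch of that left-to-right packing into bytes. The names are hypothetical. A=010 and T=001 are the codes from the example, and the codes for C, G and N are assumptions.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical 3-bit codes: A=010 and T=001 as in the example above;
   C, G and N are assumptions made only for this sketch. */
static unsigned code3(char base)
{
    switch (base) {
    case 'T': return 1;
    case 'A': return 2;
    case 'C': return 3;
    case 'G': return 4;
    default:  return 5;  /* treat anything else as N */
    }
}

/* Pack seq into out, three bits per base, most significant bit first.
   out must hold (3 * strlen(seq) + 7) / 8 bytes; returns that byte count. */
size_t pack_3bit(const char *seq, uint8_t *out)
{
    size_t n_bases = strlen(seq);
    size_t n_bytes = (3 * n_bases + 7) / 8;
    memset(out, 0, n_bytes);
    for (size_t i = 0; i < n_bases; i++) {
        unsigned code = code3(seq[i]);
        for (unsigned b = 0; b < 3; b++) {
            size_t pos = 3 * i + b;          /* absolute bit position */
            if (code & (4u >> b))            /* bits of the code, high to low */
                out[pos / 8] |= (uint8_t)(0x80u >> (pos % 8));
        }
    }
    return n_bytes;
}
```

For TTAA this produces the two bytes 0010 0101 and 0010 0000, with the third letter split across the byte boundary exactly as described.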

Since our 3-bits don't break at the byte boundary, we need to store our binary sequence in the smallest native container that can hold our sequence.  

To understand this, you have to forget comparing bytes (as we tend to do today in most genomic programs); and think about comparing bits.

If we use our 3-bit format for pattern matching, we can store up to 2-mers in a byte (two 3-bit items, with two bits left over).  Similarly, a 16-bit container could hold up to 5-mers; a 32-bit container would hold up to 10-mers, a 64-bit container would hold up to 21-mers, and so on.
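The container arithmetic is just integer division, which the following C sketch spells out. Both function names are hypothetical.

```c
/* Largest k such that k bases of bits_per_base bits fit in container_bits. */
unsigned max_kmer(unsigned container_bits, unsigned bits_per_base)
{
    return bits_per_base == 0 ? 0 : container_bits / bits_per_base;
}

/* Smallest native container width (8, 16, 32 or 64 bits) that holds
   k bases, or 0 if even 64 bits is not enough. */
unsigned smallest_container(unsigned k, unsigned bits_per_base)
{
    unsigned needed = k * bits_per_base;
    if (needed <= 8)  return 8;
    if (needed <= 16) return 16;
    if (needed <= 32) return 32;
    if (needed <= 64) return 64;
    return 0;
}
```

At 3 bits per base, 8 / 3 = 2, 16 / 3 = 5, 32 / 3 = 10 and 64 / 3 = 21, so a 21-Mer needs 63 bits and goes in a 64-bit container.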

The key point here is that we should use the smallest container to hold our k-mer represented in the 3-bit format so that we use the low-level hardware as efficiently as possible; and to reduce our expensive memory transfers.

Let's illustrate with a 5-Mer of TTAAT.  

Using the "A" as decimal 2 in 3-bit would be 010 and T as decimal 1 in 3-bit would be 001. Therefore, our 3-bit search pattern for TTAAT would look like this in binary:  001 001 010 010 001.  After we pack these into bytes, which we must do at the lowest level of storage, our bytes would look like this (assuming loading from left to right): 
   0010 0101 
   0010 0010 (the final zero is a remainder and will start the next k-mer)

Now this is where it gets a bit funky so pay attention.  If we try to read our bytes into a 16-bit integer using native code, the endianness will make a big difference since the byte order (not bit-order!!!) can differ from one system to another.  So to read straight into an int16 in one system, the bits would be: 0010 0101 0010 0010 (which is correct for us since we load from left to right); but on a different platform, they would be reversed and look like this: 0010 0010 0010 0101 

Therefore, if we want to be portable from one platform to another, and eliminate the need to swap bytes, we need to read the bits into our container bit by bit at runtime.

In pseudo code, this is what this looks like:

Byte b1 = decimal 37 or 0010 0101
Byte b2 = decimal 34 or 0010 0010

When we read in our bits from our bytes, we want to pack them left to right.  So we read the number of bits from all of our bytes until we get to the total number of bits (K-mer size times bit-size).  So for our example of a 5-mer, we need to read 15 bits from our first two bytes and pack them into our int16 from left to right.

Our int16 would end up looking like this: 0010 0101 0010 0010.  We frankly don't care what the decimal number is (9,506 when the bits are packed this way) since we never use it!  Our container simply holds the bits packed left to right, read from the bytes held on the file system or in memory in the same order.
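The difference between packing with shifts and reading straight into a native integer can be shown in a few lines of C. This is a sketch with hypothetical function names. For a window that starts on a byte boundary, shifting whole bytes gives the same result as reading bit by bit.

```c
#include <stdint.h>
#include <string.h>

/* Portable: build the value with shifts, so the first byte always
   ends up in the high-order bits regardless of the host byte order. */
uint16_t load16_packed(const uint8_t *bytes)
{
    return (uint16_t)(((unsigned)bytes[0] << 8) | bytes[1]);
}

/* Not portable: copy the raw bytes and let the host byte order decide. */
uint16_t load16_native(const uint8_t *bytes)
{
    uint16_t value;
    memcpy(&value, bytes, sizeof value);
    return value;
}
```

With b1 = 37 and b2 = 34, load16_packed returns 0010 0101 0010 0010 on every platform, while load16_native returns either that pattern or the byte-swapped 0010 0010 0010 0101 depending on the host.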

To do our compare now, we simply read our bytes from memory or the file system into our program (or onto our xPU register) in the size of our container we need, and then compare the values. In other words, we load the bits into an int16 and compare with our compare sequence also loaded into an int16 with the bits loaded on both from left to right.

For example, we might read the 3-bit representation of our genome from the bytes in the file system into the smallest container size for our k-mer (the int16 for our 5-mer), mask off the remainder bit (since we don't divide evenly), and compare our numbers.

So we read in our search size of bits (15 in our 5-mer example) from the source string such as the full human genome, and load that into the type we are using; an int16 from our 5-mer.  We mask off the remainder we don't care about; and then using those two int16 numbers, we compare them.
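A minimal sketch of the mask-and-compare step in C, assuming the bits are left-aligned in an unsigned 16-bit container. The names top_mask16 and kmer_equal16 are hypothetical.

```c
#include <stdint.h>
#include <stdbool.h>

/* Mask that keeps the top nbits of a 16-bit container (nbits in 1..16). */
uint16_t top_mask16(unsigned nbits)
{
    return (uint16_t)(0xFFFFu << (16 - nbits));
}

/* True when the first nbits of window and pattern agree; the leftover
   low-order bits belong to the next letter and are ignored. */
bool kmer_equal16(uint16_t window, uint16_t pattern, unsigned nbits)
{
    uint16_t mask = top_mask16(nbits);
    return (window & mask) == (pattern & mask);
}
```

For the 5-mer example, nbits is 15 and the mask is 1111 1111 1111 1110, so a window that differs from the pattern only in its last bit still counts as a match.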

Closing comments and Thoughts:

  1. This approach is platform independent and so the endianness of the bytes makes no difference.
  2. Search is automatically optimized by k-mer size, so searches for small k-mers are extremely fast, even across the full genome.
  3. Instead of using hashes of letters as indexes, we can use the integer values; but I would argue we no longer need indexes since the search is so fast ...milliseconds across the whole genome.
  4. Assuming the CPU, we can load a different k-mer search pattern into memory using the native container (typically an unsigned long or long long) and then simply walk the array of bytes of the source loading them bit by bit into the native containers for comparisons.  Since we are not typing (but simply using the native containers for convenience); our loads and comparisons are very fast.  We simply slide through the array of bytes and look for pattern matches.  
  5. Assuming the GPU, things become more interesting.  We'll talk about how this logic is reversed in the next post; but as a teaser:  each thread holds a portion of the source pattern (the whole human genome takes millions of threads but only milliseconds to match) and then does the compare with the search k-mer.  
  6. The examples of int16 are not actually correct since it has the sign-bit; we should always use an unsigned container since we don't use the sign bit and we need the space.