This post will begin to describe the “Proof of Concept” work (or play I guess) I have done to explore two primary ideas:
- What impact will a terse binary format of the human genome have on file compression and pattern matching processing speed?
- If we use a binary format, what will be the impact of using low level (binary) computing via the CUDA GPU and/or, xPU intrinsics?
DISCLAIMER: This exercise was not intended to be a final solution or to replace existing tools. It was done to evaluate alternatives to some of the existing tools; and to explore how commodity hardware might be used for genomic processing instead of large (and expensive) High-Performance-Computing (“HPC”) solutions, or cloud alternatives.
The full reference human genome is currently 3,270,974,121 bytes in length in a fasta file format. This includes headers, line breaks, etc. but generally represents the human genome with five primary single ASCII letters: A,C,T,G,N which represent the various nucleotides.
Most current tools that process the genome are written in interpreted computer languages and work with these ASCII representations. Of course there are exceptions, but in general, these tools will read in the full (or a partial via a stream) set of ASCII letters to begin their work. This requires enormous amounts of computer memory and processing power. I submit that by changing the way we represent the data, we can reduce the computing resources needed. Further, by using native programming constructs (C/C++ and Intrinsics) we can deliver a robust set of tools using inexpensive commodity hardware that can rival HPC systems.
The Simple Test
For this POC, I wanted to use a specific problem that might be representative of the broader domains of similar problems. Specifically, I wanted to write a program that would take a random string of nucleotides, and search the entire human genome to count the number of instances found for that string. This would be useful in k-mer counting; and pattern matching across the genome.
For example, we might want to count the number of instances, and find the location of: ACTAAGGA.
Again, the goal is to simply “evaluate and compare” while conducting this exercise on:
- a machine with limited memory – using memory mapped files on top of SSDs to augment RAM
- compare the performance of interpreted code with native code
- compare the performance of the same code run on CPU vs the GPU
- Processing would much more efficient if we use a more terse binary format instead of ASCII
- Comparisons using a binary pattern will be orders of magnitude faster than using the ASCII format – primarily due to the more compressed patterns.
- Cuda processing using one thread for every possible sequence in the original buffer (e.g. the full human genome) will be much more efficient than using CPU based parallel processing
- Since CUDA employs a different construct for parallel processing, we should flip our patterns searchers around to take advantage of the platform.That is, in a CPU based search, we might conduct our search as follows: a) load the search string into the CPU register and then walk the source file to see where there are matches. b) we could load a different search string into different threads (limited by the CPU) and use a global buffer to hold our source data (the human genome).In a GPU search pattern, we turn this around (which seems counter-intuitive). That is, we load the source pattern (from the human genome) into the GPU register(s) and we then do a pattern match on our k-mer strings. More info below.
- The overarching belief is that most “processing power” for pattern matching is actually consumed by memory transfers and so our optimizations should focus on those to achieve better performance. That is, everything we can do to reduce memory transfers will improve performance – especially in the CUDA world.
Let me attempt to break this down into smaller logical sections (and posts) to hopefully make it more understandable.
Let’s use a short sample source search string: ATTATATTATTA and a sample 3-mer of TTA to get our count of. We can see that the TTA string appears three times in the source string.
First of all, if we keep this in ASCII byte format, our source will be 12 bytes (letters) times 8 bits or 96 bits long. Our k-mer is three bytes times 8 bits or 24 bits long. Therefore, at the binary level which our systems work, we are searching for a 24 bit sequence across the 96 bit source.
To our computers this would look like this:
ASCII A is decimal 65 which in binary looks like: 0100 0001
ASCII T is decimal 84 which in binary looks like: 0101 0100
So our target string of TTA appears to our computers as the following 24 bits: 010101000101010001000001
And our source (searched) string of ASCII ATTATATTATTA appears as the following 96 bits:
Now, let’s assume the common 2-bit format for these data and let’s assume that the “A” is represented by decimal 1 or binary 01. The “T” can be decimal 2 or binary 10.
Now our target string of TTA appears to our computer as the following 6 bits: 101001
and our source (searched) string ATTATATTATTA appears as the following 24 bits:
As you can easily see, there are far fewer bits that need to be moved from disk, to memory, to xPU registers with the more terse 2-bit format. This will be a huge performance gain. Plus, we ultimately use the xPU registers to hold our binary patterns and compare those with some memory contents. If we have smaller registers or larger search patterns, the system would need to “chunk” up the larger patterns by swapping in and out of memory. These memory swaps are what we are working to avoid!!
For example, if we use the ASCII format of 8-bit base representations, and assume 32-bit registers…we can compare four bases in our registers at a given time. However, once we cross that barrier, by say looking for 11-mers, we would need to “chunk up” or data and hold some level of matching state across pattern matches. That is, we look for the first part of the pattern across the first 32 bits and the next part of the pattern over the next and so on.
However, if we use a 2-bit format, our 11-mer can fit into 22 bits (instead of 88) and thus we can more efficiently use the xPU registers for our compares.
KEY POINT: If we use a 2-bit format for our data, we reduce our memory operations by roughly 75% – and increase our performance by the same factor!
Let’s talk 3-bit
One problem with 2-bit is that we need to represent more than four bases. At the very least, we need the “N” variable and so five items. We could make our lives easier by using a 4-bit pattern since it would fit nicely in our 8-bit, 32-bit, 64-bit, etc. native containers; but we don’t need all four bits. For our use, we have opted for a 3-bit pattern since it is of optimal size and will accommodate 8 items or entries. For the POC example, I included the following: A,T,C,G,N,EOF, and Chromosome_Break; with one item left for whatever.
What make this 3-bit pattern difficult is that it won’t break on the byte (or other native container) boundary since it’s an odd number. That is, a 4-mer sequence such as TTAA represented in 8-bit ASCII would fit nicely in 8 bytes; and would also fit nicely in a single byte when represented by the 2-bit format; or in two bytes when represented by the 4-bit format. In other words, we never break a base “letter” over the byte boundary when we use an even bit format (2-bit, 4-bit, or 8-bit).
However, our 3-bit pattern is not as easy to work with. For example, our search pattern of TTAA in 3-bit might look like: “A” as decimal 2 in 3-bit would be 010 and T as decimal 1 in 3-bit would be 001. Our 4-mer would then need 12 bits: 001 001 010 010.
But remember that all data is stored as bytes on the computer. Therefore, we need to have these 12 bits fit into our 8-bit byte native containers. Assuming we don’t want any extraneous bits in our storage, we would need to break our bits at the third base (assuming we load from left to right):
So our first byte would be: 001 + 001 + 01 and our second byte would be 0010 0000
Since our 3-bits don’t break at the byte boundary, we need to store our binary sequence in the smallest native container that can hold our sequence.
To understand this, you have to forget comparing bytes (as we tend to do today in most genomic programs); and think about comparing bits.
If we use our 3-bit format for pattern matching, we can store up to 2-mers in a byte (two 3-bit items with no remainder). Similarly, a 16-bit container could hold up to 5-mers; a 32-bit container would hold up 11-mers, a 64-bit container would hold up to 21-mers, and so on.
The key point here is that we should use the smallest container to hold our k-mer represented in the 3-bit format so that we use the low-level hardware as efficiently as possible; and to reduce our expensive memory transfers.
Let’s illustrate with a 5-Mer of TTAAT.
Using the “A” as decimal 2 in 3-bit would be 010 and T as decimal 1 in 3-bit would be 001. Therefore, our 3-bit search pattern for TTAAT would look like this in binary:
001 001 010 010 001.
After we pack these into bytes, which we must do at the lowest level of storage, our bytes would look like this (assuming loading from left to right):
0010 0010 (the final zero is a remainder and will start the next k-mer)
Now this is where it gets a bit funky so pay attention. If we try to read our bytes into a 16-bit integer using native code, the endianness will make a big difference since the byte order (not bit-order!!!) can differ from one system to another. So to read straight into an int16 in one system, the bits would be:0010 0101 0010 0010 (which is correct for us since we load from left to right); but on a different platform, they would be reversed and look like this: 0010 0010 0010 0101
Therefore, if we want to be portable from one platform to another, and eliminate the need to swap bytes, we need to read the bits into our container bit by bit at runtime.
In pseudo code, this is what this looks like:
Byte b1 = decimal 37 or 0010 0101
Byte b2 = decimal 34 or 0010 0010
When we read in our bits from our bytes, we want to pack them left to right. So we read the number of bits from all of our bytes until we get to the total number of bits (K-mer size times bit-size). So for our example of a 5-mer, we need to read 15 bits from our first two bytes and pack them into our int16 from left to right.
Our int16 would end up looking like this: 0010 0101 0010 0010. We frankly don’t care what the decimal number is (9,506 on little endian) since we never use it! Our container simply holds the bytes packed left to right that are read from bytes held on the file system or memory in the same order.
To do our compare now, we simply read our bytes from memory or the file system into our program (or onto our xPU register) in the size of our container we need, and then compare the values. In other words, we load the bits into an int16 and compare with our compare sequence also loaded into an int16 with the bits loaded on both from left to right.
For example, we might read the 3-bit representation of our genome from the bytes in the file system into the smallest container size for our k-mer (the int16 for our 5-mer), we mask off the remainder bit (since we don’t divide evenly) and compare our numbers.
So we read in our search size of bits (15 in our 5-mer example) from the source string such as the full human genome, and load that into the type we are using; an int16 from our 5-mer. We mask off the remainder we don’t care about; and then using those two int16 numbers, we compare them.
Closing comments and Thoughts:
- This approach is platform independent and so the endianness of the bytes makes no difference.
- Search is automatically optimized by k-mer size so searches for small k-mers is extremely fast, even across the full genome.
- Instead of using hashes of letters as indexes, we can use the integer values; but I would argue we no longer need indexes since the search is so fast …milliseconds across the whole genome.
- Assuming the CPU, we can load a different k-mer search pattern into memory using the native container (typically an unsigned long or long long) and then simply walk the array of bytes of the source loading them bit by bit into the native containers for comparisons. Since we are not typing (but simply using the native containers for convenience); our loads and comparisons are very fast. We simply slide through the array of bytes and look for pattern matches.
- Assuming the GPU, things become more interesting. We’ll talk about how this logic is reversed in the next post; but as a teaser: each thread holds a portion of the source pattern (the whole human genome takes millions of threads but only milliseconds to match) and then does the compare with the search k-mer.
- The examples of int16 are not actually correct since it has the sign-bit; we should always use an unsigned container since we don’t use the sign bit and we need the space.