Thursday, April 14, 2016

Counts representing Items and the impact on Data Compression and Encryption

All things in computers are identified in counts of digits (called bits) which are either on or off.  In this post, we'll discuss how counts (discussed in the previous post) are used for identification.

Saturday, March 26, 2016

Binary for Dummies: Explaining Endianness and Base Counts in the Simplest Way Possible

I've received feedback that this binary stuff is too complex to understand.  So let me start at the very foundation and build up from there.

Tuesday, March 8, 2016

Genomic Data: Binary vs ASCII - Part Three: CPU vs GPU

Review

In the first post, I described some background information on computer storage, how byte arrangement "endianness" can impact our tools, and ASCII vs binary data.

In the previous post, I discussed the 3-bit format I used for this evaluation.  Let me summarize the approach and then we'll proceed to look at using a CPU vs GPU to do our processing.
  • For this Proof-of-Concept ("POC"), I first converted the full human genome, in fasta format, to a single file of contiguous characters of A,T,G,C, and Ns.  
  • Next, I converted these characters into a 3-bit format with each base represented by a particular 3-bit pattern.
  • The original hg38.fa file is 3,273,481,150 bytes long.  After I removed the line and chromosome breaks (**see more info below), the file in ASCII was 3,209,286,105 bytes long.  However, after moving this to 3-bit format, the file ended up 1,203,482,291 bytes -- a 63% reduction in size.
  • The 3-bit file format has all 3-bit characters in a contiguous pattern packed from left to right; regardless of the byte layout ("endianness") of the host platform.
The goal was to conduct pattern matching; that is, to get a count and optionally the location of certain nucleotide patterns; for example: find the count and location of all the 21-Mers of: ATATATATATATATATGGATA in the human genome.  

For both the CPU and GPU code, we conduct our comparisons in the same way:
  • Take the K-Mer we are looking for an convert it into our 3-bit pattern.
  • Using the sequence of 3-bit representations, we then pack those bits into the smallest native container we can use for comparison.  In this case, a 64 bit unsigned long long.  The final bits look like: 0011 0010 1100 1011 0010 1100 1011 0010 1100 1011 0010 1100 1100 1000 1100 1011 and so this is the pattern we are looking for.
  • In order to do our pattern matching, we need to first read the source file we want to search into memory.  In our POC, this is the full genome, but since the file is now only 1.2GB, it will fit easily into memory: either GPU or CPU.  This is key, we only load our source once and retain it in global memory for the threads to use.
  • Our final step is to read the bits from the target file into the same sized container as as our search string (for example, unsigned 64bit long).  Once we do that, we simply compare the variables holding the bits; and if they match, increase our counter and optionally get the file character position.

CPU Code

The CPU code works as just described:  it first stores the 3-bit sequences for our pattern in an unsigned 64 bit container, call it uint64SearchString.  Next, we walk the array of bytes and load up our bit sequences for the search count (21 in this case) 3 bits at a time. So we will start at the first byte and load it and the following six bytes (first 7 bytes) into our unsigned 64 container, call it uint64FindMatch variable.  To completely fill 63 characters we need in 3-bit format, we only need the first seven bits in the eighth byte and so we load those at the end of our uint64FindMatch container or variable.  Next, we simply compare the two variables and if they match, we know our patterns match.   Something simple like:  

if uint64SearchString is equal to uint64FindMatch then add one to Count

Simple and very fast!!

To get the full count, we continue to read through our byte array holding the full genome where we load our 21-Mer sequences into our container uint64SearchString; by positioning the start of our 3-bit sequence one place to the right for each iteration.  That is, for the second iteration, we start at the fourth bit in the first byte and read all the ensuing bits up to our 21-bit count from the following bytes.  Since the source bytes are preserved in memory, this is relatively fast.

I first wrote the code in C# which, like Java, is interpreted into machine language and so it's somewhat slower than native machine code.  Using C#, I was able to count our example 21-Mer across the whole genome in 11.96 seconds on a single-proc, four-core, machine. However, only one core was busy and so the CPU utilization showed only 25%. Let's try with four threads, one for each core.

The four threads achieved the same timing; with each thread running on one of the four cores and our system CPU utilization hitting 100%.  

Conclusion:  Using our CPU search pattern, we can get a 21-Mer count across the full genome in ~12 seconds per core using C# code.  Therefore, with four cores and searching for four patterns, we average ~3 seconds per sequence and would expect that to decrease proportionately as we add CPU cores.  But, there's a severe limit to how many CPU cores we can use.  CUDA to the rescue!

Before we move to CUDA and GPU code, let's take a look at C and native code to see if that makes a difference.

I modified the logic slightly and used a 7-Mer string held in a 32-bit unsigned long.  As expected, this resulted in our counts going from a couple of dozen for our 21-Mer to a million+ for the shorter search sequence.

Using a single thread for a random 7-Mer, the program took 8.37 seconds (using one core) to find the ~1.5 million matches.

So now we have our baselines for our CPU matching approach.  Let's look at CUDA.

GPU and CUDA programming

There are many advantages to using the GPU to handle our massively parallel problem:
  1. The GPU can have thousands of cores instead of just a handful on the CPU.  The $200 card I used for these tests has over 1000 cores.
  2. Threads are limited on the CPU since we have thread switch overhead; and because the CPU has to handle things such as keyboard input, network traffic, and a host of other things.  The GPU threads are dedicated and designed for quick, short activities.  In general, we typically don't wan to spin up more than four or five threads per CPU core.  So for my 8 core I7 test machine, I could spin up no more than about 40 threads.  However, on my GPU I can spin up thousands of threads to run simultaneously.
  3. We can add multiple cards to a single host system and derive computing performance of an HPC computer on an inexpensive desktop.
  4. I use a dedicated video card for my GPU calculations.  Even when it's maxed out, the calculations have a negligible impact on my host machine since the CPU is not utilized much with all the work done on the dedicated GPU.  In other words, it's like having a separate computer running in the same machine.
OK, so what happened when I ran my CPU program on the GPU?  It took minutes instead of seconds to complete.  The reason for this was that I was only using a single thread since the design was optimized for a multiple-threaded CPU application and not the GPU.

Remember, on the CPU we sent our search pattern to the thread and it used that to load all the sequences from the byte buffer of the full genome.  This is much more efficient by using a single thread for a single search pattern and walking the full file in memory.

While this design is optimal for the CPU and our CPU response is not bad (about 8 seconds per core for native C code (both Linux and Windows).  However, as we saw, we are limited to by the CPU cores.  A single search on a single thread saturates a single CPU core.

However, what if we use one (or more) search pattern(s) for all of our threads and had each thread work on only one source file sequence?  In other words, say we spin up a million threads and ask them to look for ten different 21-Mer sequences... We can pass those sequences to all million threads and each thread will open a small matching sequence slice from the source file: the full human genome in our case.

A word picture might look like this:
  • Take two 21-Mers and pass them to two threads.
  • The first thread reads the first twenty-one 3-bit "letters" from our byte array (the full human genome) and compares the value to the value of the two search patterns.
  • The second thread moves our start position over three bits (the size of our "letter") and loads the next 21 "letter" or 63 bits into its variable against which it will compare the collection of search patterns.
The CPU is more efficient when it can take a sample pattern and walk the source (full human genome) looking for a match.

Our GPU will reverse this.  It will take a K-Mer sequence (or many) and each thread will find a different but discreet K count of the 3-bit "letters" from the source (full human genome).  The thread(s) will then walk through the list of sequences we want to get a count for seeing if they match the one source sequence.

OK, so more test data to digest.  I ran this on our single-threaded C code both ways on the CPU.  The CPU-optimized-code finished in 8.37 seconds as mentioned above.  But the GPU-optimized-code took four times as long when run on the CPU.

As we observed above, the CPU-optimized-code took minutes to complete on the GPU since it ran on a single thread.  However, when we used our GPU-optimized-code and launched 500 threads over our 1024 cores, we were able to get a 21-Mer count in under .7 seconds. Most of this time was spent loading our sequence from the buffer and converting it to our type.  But once that is done, we can compare hundreds or perhaps thousands of K-Mers using the same source sequence spread across these thousands of threads in milliseconds each.

For example, using a single 21-Mer and our full genome 3-bit file, we can get a count of all matches in .720823 seconds.  When we add one more search sequences to our search, both counts return in .731270 seconds.  A total of four takes .759792 seconds.  Think about this for a moment.  We can get the counts of four random 21-Mers across the full human genome in under .19 seconds each! 

This is with none of the fine-tuned optimizations that the Nvidia/Cuda code profiler suggested (this IS a POC after all and I'm really just looking for relative values).

Plus, this does not use GPU SMID intrinsics which I believe would dramatically improve performance since they are hardware accelerated and we do our binary compares at the register level.   This is on my "look at next" list :)

SIDEBAR:  Here's how I think that might work.  Since these intrinsics compare the bytes of a 32 bit word, we would put our bit sequences in either 32 bit containers or a multiple of 32.  Empty bits would be zeroed out as would the empty bits from the source.  By only looking for a true/false match, the compares would stop as soon as we hit a non-match (most of the cases).  Since this is hardware accelerated, it should be incredibly fast.  

Another performance improvement should be seen if we expand out and use a larger number of sequences.  That is, what if pre-populate an array of all the K-Mers we want to search for holding them in the native container such as Unsigned 64bit long?  Using CUDA, our only overhead, once we have the source sequence in memory on the thread, would be to do the compare which, as we just illustrated, takes milliseconds.

For my next exercise, I plan to do just that.  That is, I will pre-compute several thousand 21-Mers and run them through this CUDA engine and determine how long it would take to do the counts.  I expect I can determine the counts of thousands of K-Mers in under a minute.  Fun stuff.

** OK some closing thoughts and disclaimers.
  • A question or problem one might see in all of this is to make this actually useful.  In other words, say we wanted to search across a particular chromosome and not the full genome.  What then?  I propose that we have a separate index file that will provide a list of offsets for sections of the whole file we want to delineate.  So if we use the index and know that Chromosome 17 is from Position 1234 and runs to Position 6789, we can then limit our searches and counts by those parameters. Should make things even faster.
  • CUDA programming has a rather steep learning curve and most of the program parameters are gated by the specifications of the video card itself and by the makeup of the data we send to it.  Trial and error here can be your friend.
  • As a reminder, this was a brute-force type of test and I was looking for relative performance improvements more than actual results.  However, I think these ideas can form the genesis of a new toolset that can run efficiently on commodity hardware and compete with most cloud based or HPC type systems.


Monday, March 7, 2016

Genomic Data: Binary vs ASCII Part 2 - Three Bit Representation

This post will begin to describe the "Proof of Concept" work (or play I guess) I have done to explore two primary ideas:
  1. What impact will a terse binary format of the human genome have on file compression and pattern matching processing speed?
  2. If we use a binary format, what will be the impact of using low level (binary) computing via the CUDA GPU and/or, xPU intrinsics?
DISCLAIMER: This exercise was not intended to be a final solution or to replace existing tools. It was done to evaluate alternatives to some of the existing tools; and to explore how commodity hardware might be used for genomic processing instead of large (and expensive) High-Performance-Computing ("HPC") solutions, or cloud alternatives.

Background

The full reference human genome is currently 3,270,974,121 bytes in length in a fasta file format.  This includes headers, line breaks, etc. but generally represents the human genome with five primary single ASCII letters:  A,C,T,G,N which represent the various nucleotides.

Most current tools that process the genome are written in interpreted computer languages and work with these ASCII representations.  Of course there are exceptions, but in general, these tools will read in the full (or a partial via a stream) set of ASCII letters to begin their work.  This requires enormous amounts of computer memory and processing power. I submit that by changing the way we represent the data, we can reduce the computing resources needed.  Further, by using native programming constructs (C/C++ and Intrinsics) we can deliver a robust set of tools using inexpensive commodity hardware that can rival HPC systems.

The Simple Test

For this POC, I wanted to use a specific problem that might be representative of the broader domains of similar problems.  Specifically, I wanted to write a program that would take a random string of nucleotides, and search the entire human genome to count the number of instances found for that string.  This would be useful in k-mer counting; and pattern matching across the genome.

For example, we might want to count the number of instances, and find the location of:  ACTAAGGA.

Again, the goal is to simply evaluate and compare conducting this exercise on:
  • a machine with limited memory - using memory mapped files on top of SSDs to augment RAM
  • compare the performance of interpreted code with native code
  • compare the performance of the same code run on CPU vs the GPU

The Hypotheses 

  • Processing would much more efficient if we use a more terse binary format instead of ASCII
  • Comparisons using a binary pattern will be orders of magnitude faster than using the ASCII format - primarily due to the more compressed patterns.
  • Cuda processing using one thread for every possible sequence in the original buffer (e.g. the full human genome) will be much more efficient than using CPU based parallel processing
  • Since CUDA employs a different construct for parallel processing, we should flip our patterns searchers around to take advantage of the platform.

    That is, in a CPU based search, we might conduct our search as follows:  a) load the search string into the CPU register and then walk the source file to see where there are matches.  b) we could load a different search string into different threads (limited by the CPU) and use a global buffer to hold our source data (the human genome).

    In a GPU search pattern, we turn this around (which seems counter-intuitive).  That is, we load the source pattern (from the human genome) into the GPU register(s) and we then do a pattern match on our k-mer strings.  More info below.
  • The overarching belief is that most "processing power" for pattern matching is actually consumed by memory transfers and so our optimizations should focus on those to achieve better performance.  That is, everything we can do to reduce memory transfers will improve performance - especially in the CUDA world.

The Work

Let me attempt to break this down into smaller logical sections (and posts) to hopefully make it more understandable.

Let's use a short sample source search string:  ATTATATTATTA and a sample 3-mer of TTA to get our count of.  We can see that the TTA string appears three times in the source string.

ASCII 

First of all, if we keep this in ASCII byte format, our source will be 12 bytes (letters) times 8 bits or 96 bits long.  Our k-mer is three bytes times 8 bits or 24 bits long.  Therefore, at the binary level which our systems work, we are searching for a 24 bit sequence across the 96 bit source.

To our computers this would look like this:
ASCII A is decimal 65 which in binary looks like:  0100 0001
ASCII T is decimal 84 which in binary looks like:  0101 0100

So our target string of TTA appears to our computers as the following 24 bits: 010101000101010001000001

And our source (searched) string of ASCII ATTATATTATTA appears as the following 96 bits:
010000010101010001010100010000010101010001000001010101000101010001000001010101000101010001000001

2-bit 

Now, let's assume the common 2-bit format for these data and let's assume that the "A" is represented by decimal 1 or binary 01.  The "T" can be decimal 2 or binary 10.

Now our target string of TTA appears to our computer as the following 6 bits: 101001

and our source (searched) string ATTATATTATTA appears as the following 24 bits:
011010011001101001101001

As you can easily see, there are far fewer bits that need to be moved from disk, to memory, to xPU registers with the more terse 2-bit format.  This will be a huge performance gain.  Plus, we ultimately use the xPU registers to hold our binary patterns and compare those with some memory contents.  If we have smaller registers or larger search patterns, the system would need to "chunk" up the larger patterns by swapping in and out of memory.  These memory swaps are what we are working to avoid!!

For example, if we use the ASCII format of 8-bit base representations, and assume 32-bit registers...we can compare four bases in our registers at a given time.  However, once we cross that barrier, by say looking for 11-mers, we would need to "chunk up" or data and hold some level of matching state across pattern matches.  That is, we look for the first part of the pattern across the first 32 bits and the next part of the pattern over the next and so on.

However, if we use a 2-bit format, our 11-mer can fit into 22 bits (instead of 88) and thus we can more efficiently use the xPU registers for our compares.

KEY POINT:  If we use a 2-bit format for our data, we reduce our memory operations by roughly 75% - and increase our performance by the same factor!

Let's talk 3-bit

One problem with 2-bit is that we need to represent more than four bases.  At the very least, we need the "N" variable and so five items.  We could make our lives easier by using a 4-bit pattern since it would fit nicely in our 8-bit, 32-bit, 64-bit, etc. native containers; but we don't need all four bits.  For our use, we have opted for a 3-bit pattern since it is of optimal size and will accommodate 8 items or entries.  For the POC example, I included the following:  A,T,C,G,N,EOF, and Chromosome_Break; with one item left for whatever.

What make this 3-bit pattern difficult is that it won't break on the byte (or other native container) boundary since it's an odd number.  That is, a 4-mer sequence such as TTAA represented in 8-bit ASCII would fit nicely in 8 bytes; and would also fit nicely in a single byte when represented by the 2-bit format; or in two bytes when represented by the 4-bit format. In other words, we never break a base "letter" over the byte boundary when we use an even bit format (2-bit, 4-bit, or 8-bit).

However, our 3-bit pattern is not as easy to work with.  For example, our search pattern of TTAA in 3-bit might look like:  "A" as decimal 2 in 3-bit would be 010 and T as decimal 1 in 3-bit would be 001.  Our 4-mer would then need 12 bits: 001 001 010 010.

But remember that all data is stored as bytes on the computer.  Therefore, we need to have these 12 bits fit into our 8-bit byte native containers.  Assuming we don't want any extraneous bits in our storage, we would need to break our bits at the third base (assuming we load from left to right):

So our first byte would be: 001 + 001 + 01  and our second byte would be 0010 0000

Since our 3-bits don't break at the byte boundary, we need to store our binary sequence in the smallest native container that can hold our sequence.  

To understand this, you have to forget comparing bytes (as we tend to do today in most genomic programs); and think about comparing bits.

If we use our 3-bit format for pattern matching, we can store up to 2-mers in a byte (two 3-bit items with no remainder).  Similarly, a 16-bit container could hold up to 5-mers; a 32-bit container would hold up 11-mers, a 64-bit container would hold up to 21-mers, and so on.

The key point here is that we should use the smallest container to hold our k-mer represented in the 3-bit format so that we use the low-level hardware as efficiently as possible; and to reduce our expensive memory transfers.

Let's illustrate with a 5-Mer of TTAAT.  

Using the "A" as decimal 2 in 3-bit would be 010 and T as decimal 1 in 3-bit would be 001. Therefore, our 3-bit search pattern for TTAAT would look like this in binary:  001 001 010 010 001.  After we pack these into bytes, which we must do at the lowest level of storage, our bytes would look like this (assuming loading from left to right): 
   0010 0101 
   0010 0010 (the final zero is a remainder and will start the next k-mer)

Now this is where it gets a bit funky so pay attention.  If we try to read our bytes into a 16-bit integer using native code, the endianness will make a big difference since the byte order (not bit-order!!!) can differ from one system to another.  So to read straight into an int16 in one system, the bits would be: 0010 0101 0010 0010 (which is correct for us since we load from left to right); but on a different platform, they would be reversed and look like this: 0010 0010 0010 0101 

Therefore, if we want to be portable from one platform to another, and eliminate the need to swap bytes, we need to read the bits into our container bit by bit at runtime.

In pseudo code, this is what this looks like:

Byte b1 = decimal 37 or 0010 0101
Byte b2 = decimal 34 or 0010 0010

When we read in our bits from our bytes, we want to pack them left to right.  So we read the number of bits from all of our bytes until we get to the total number of bits (K-mer size times bit-size).  So for our example of a 5-mer, we need to read 15 bits from our first two bytes and pack them into our int16 from left to right.

Our int16 would end up looking like this: 0010 0101 0010 0010.  We frankly don't care what the decimal number is (9,506 on little endian) since we never use it!  Our container simply holds the bytes packed left to right that are read from bytes held on the file system or memory in the same order.

To do our compare now, we simply read our bytes from memory or the file system into our program (or onto our xPU register) in the size of our container we need, and then compare the values. In other words, we load the bits into an int16 and compare with our compare sequence also loaded into an int16 with the bits loaded on both from left to right.

For example, we might read the 3-bit representation of our genome from the bytes in the file system into the smallest container size for our k-mer (the int16 for our 5-mer), we mask off the remainder bit (since we don't divide enevnly) and compare our numbers.

So we read in our search size of bits (15 in our 5-mer example) from the source string such as the full human genome, and load that into the type we are using; an int16 from our 5-mer.  We mask off the remainder we don't care about; and then using those two int16 numbers, we compare them.

Closing comments and Thoughts:

  1. This approach is platform independent and so the endianness of the bytes makes no difference.
  2. Search is automatically optimized by k-mer size so searches for small k-mers is extremely fast, even across the full genome.
  3. Instead of using hashes of letters as indexes, we can use the integer values; but I would argue we no longer need indexes since the search is so fast ...milliseconds across the whole genome.
  4. Assuming the CPU, we can load a different k-mer search pattern into memory using the native container (typically an unsigned long or long long) and then simply walk the array of bytes of the source loading them bit by bit into the native containers for comparisons.  Since we are not typing (but simply using the native containers for convenience); our loads and comparisons are very fast.  We simply slide through the array of bytes and look for pattern matches.  
  5. Assuming the GPU, things become more interesting.  We'll talk about how this logic is reversed in the next post; but as a teaser:  each thread holds a portion of the source pattern (the whole human genome takes millions of threads but only milliseconds to match) and then does the compare with the search k-mer.  
  6. The examples of int16 are not actually correct since it has the sign-bit; we should always use an unsigned container since we don't use the sign bit and we need the space.


Tuesday, February 23, 2016

Genomic Data: Binary vs ASCII - Part One: Introduction to Systems

This and subsequent posts will discuss the advantages of storing and processing Genomic data in a binary format vs ASCII.  We will begin at a very high level and dive deep to using binary processing at the lowest level in the computer system.  Feel free to skip this introduction if you want to dig into the details.

Introduction to Computer Storage

Data on the computer is stored as either "on or off" (or positive or negative in the transistors). These are the only two possible states and are often represented as zero or one and known as binary; with each of these two states known as a "bit".  A bit can be either 0 or 1.

From this native state, we can build all constructs from which programs operate.  Most computers use an 8 bit byte as the fundamental memory-addressable unit. In other words, everything builds on this basic container and all types derive from one or more bytes.

A byte can store up to 256 numbers represented by our binary (2 ^ 8).  For example, let's take a number such as 71; it will have a bit/byte representation as 01000111.

But now you may ask, how can I represent a really big number such as one million? For that number, you would need a container that holds at least 20 bits.  Since computers use an 8-bit byte for their fundamental storage, that would mean we would need to allocate space for three bytes or 24 bits in order to hold a number that large. Therefore, on our computer system, the actual storage would look like this (ignoring the endians for now): 0000 1111 0100 0010 0100 0000  That is, we would read the three bytes and format the data therein into a number that would represent one million.  

The important thing to note here is that our program must know that it needs to read these three bytes in order to return the number that large.  This is key to understand "typing" in computers:

KEY:  A type in a computer is a construct that will hold the bits and bytes in that container.  

NOTE: to be stored in three bytes, our one-million number above would need to a custom container since most computer systems use 16-bit, 32-bit, and 64-bit containers of 2 bytes, 4 bytes, and 8 bytes respectively.

As an illustration, computers have a numeric type of a ULONG which is a 4 byte container with 32 bits. This container will hold number as large as 4,294,967,295 (2 ^ 32) 

SIDEBAR:  Did anyone do the math: 2 raised to the 32nd?  If so you got a different number! Your calc shows 4,294,967,296  Correct.  But remember, we need to show zero and so from zero to 4,294,967,295... which is 4,294,967,296 numbers with a maximum size of 4,294,967,295.

Note that this is an unsigned number.  What the heck is that?  Well, if you want to represent negative numbers, you would need something to represent the sign for the number.  To do that, we need to use a "sign bit".  But now we only have 2 ^ 31 or 2,147,483,648 places since we need a bit for our sign. So our typed 4 byte container for a number known as a LONG (a "signed" value) in most systems will hold data that ranges from -2,147,483,648 (the zero is handled on the positive side) to 2,147,483,647

Now, the thing to note is that when we use this type (container), we need to allocate space for the largest number we might encounter.  The rest of the space is essentially wasted. For example, the container we need for our number of one million, will look like this for a number of one: 0000 0000 0000 0000 0000 0000 0000 0001.  

KEY PRINCIPAL:  Space and computational power are wasted when we use containers that are larger than we need.

I just illustrated the space problem; that is, if the largest number we would need to store is 100, there is no need to use a LONG since all those zero bytes are just wasted space.

OK, space is cheap, right?  Well, all those extra zeros will need to be read from storage and sent over the network and so that consumes those resources.  Just as important, all those unneeded zeros would need to be stored in "hot" memory on the machine; and paged out when memory becomes constrained.  Therefore, this wasted space has a ripple effect for system wide performance.

At the lowest level, work is done on the xPU registers ("x" can be "C" as in CPU; or "G" as in GPU).  These have a finite size and so wasted space increases the memory transfers as data is shifted in and out of the registers and local RAM.  This fact is key to the proposed Genomic architecture that we will discuss in future blog posts.

Now, some things that are interesting and have only a peripheral relationship to our work:

The Endians

At the lowest level, we only have bits with two states: one or zero.  Next, we assemble those into a byte comprised of 8-bits and this is the fundamental way data is accessed.  In words, memory addresses are in bytes on all systems and those bytes are constructed the same way on all computers.  For example, the number 71 will be represented by a byte with the following bit structure:0100 0111 on all computers.

Where things get confusing is when we need to assemble our bytes into types. Specifically, when we start typing our data using more than one byte, we must consider the computer platform's "endianness"; that is, whether it's "big" or "little endian".  (OK, there are mixed endians but I'll leave the research for those to the reader).

To understand the endians, we need to understand what the most significant digit is.  We typically look at a number such as one million as: 1,000,000 and in this case, the most significant digit (the millions) is listed first.  The smallest and thus least significant number is listed last.  So we view our numbers in big-endian format with the most significant digit first.

If we were to fire up Calculator in Windows and put in our decimal value of one million, it will show the following byte format for a 4 byte word:

0000 0000 0000 1111 0100 0010 0100 0000

Now what makes this crazy confusing is that this is shown in big-endian with the most significant bytes first; but on Windows these bytes are stored in little-endian order!  It kinda makes sense for Windows to display the bytes this way since programs display our number in big-endian with the millions over to the left and the single values to the far right.

KEY:  Most most PCs with x86 architecture (Windows and Linux in general) put the least significant byte first in our byte stream and are Little-Endian.  Therefore, the bytes are stored in the opposite order of what we might see in Calculator or otherwise represent them.

Where endianness becomes a problem is when we store multi-byte data in a file that is created on a system with one type of endianness and then attempt to read (type) the data on a system with a different endianness.  For example, if we take a LONG and save it to a file based byte stream on one system; and then attempt to read the same stream of bytes into a LONG on a system with the opposite endianness, the number will be bogus.

Back to our Calculator output for the number one-million: 0000 0000 0000 1111 0100 0010 0100 0000 which is represented in big-endian.  On a big endian system, the bytes will be stored in the same order in files:

0000 0000 
0000 1111 
0100 0010 
0100 0000 

However, on our Little-Endian PC system, the bytes are stored in the opposite order with the least significant byte first:

0100 0000 
0100 0010 
0000 1111 
0000 0000 

To think of it a different way, if we viewed our decimal number of one million in little-endian, it would look like this:  000,000,1

So what's the solution for genomic data?  Create a custom type that is platform agnostic but can still take advantage of native types.

SIDEBAR: The UCSC genomic 2-bit format uses a multi-byte signature to make it easy to determine whether the platform is big or little endian.  From their docs:  "If the signature value is not as given, the reader program should byte-swap the signature and check if the swapped version matches. If so, all multiple-byte entities in the file will have to be byte-swapped. This enables these binary files to be used unchanged on different architectures."

SIDEBAR 2: When work at the lowest level (fastest) on xPU registers, endianness doesn't make a bit of difference.  A 32bit xPU register will hold 32 bits of data and so there is no endianness at that level.  This is key to our genomic tool design (it is platform independent); which is discussed in future posts.

ASCII

We talked about numbers, but what about words and letters?  

Just as we need to type (a verb) our bits into numbers, we also need to type them into letters and eventually words.  How's that work?

By convention, we use a byte of 8-bits to represent each character of the standard ASCII character set.  This is a "single byte" representation.  Therefore, the letter "T" is represented by a byte-sized decimal number of 84, but is represented in binary as: 0101 0100.  In the early days, we could use just 7-bits for all of the characters and control things.  However, quickly we moved to use the full 8-bits of each byte in what was/is known as extended-ascii.

KEY:  Every letter in the ASCII alphabet consumes 8-bits if we use the alpha representation.

There are several implications for us in this.  Here are two key thoughts:

Numbers as Characters

One way to "cheat" typing is to leave everything as it's ASCII representation and to convert it at run time.  Let's see how that might work for our one-million LONG number.  

As a LONG, this value will take four bytes and will thus consume 32 bits of bandwidth.

On the other hand, if we use the ASCII representation of "1000000", that would be 7 bytes or 56 bits (one byte for each letter) or or almost twice as large!  On the other hand, if we sent the number as "1", this would only be one byte and so we'd only use .25% as much bandwidth as we would if sent as a LONG. 

This should illustrate why the data developer should evaluate their data for storage and transmission and when to type it.

Characters as Numbers

Remember our rule:  we should store our types in the smallest container as possible to reduce hardware load and processing time.  Our genome is comprised of ~3.2 billion bases which are typically represented in ASCII format.  For DNA, these are typically "A", "C", "T", and "G".  OK, so we have a byte sized container that will hold 256 values but we only need to store four.  Why not use a smaller container?  That's the premise of the 2-bit format we introduced earlier.  It's also part of the newish CRAM format.

As I mentioned in an earlier post, the next few posts will give specifics on the work I've done to evaluate using a binary format with low level processing to see if we can improve performance; and it was a fun intellectual exercise to learn CUDA programming.

Sunday, February 14, 2016

Memory Mapped Files For Large Datasets

I'm a fan of memory mapped files and think that they can play a role in processing large datasets on computers with limited memory.  The following is a copy of an internal email interchange I had with a couple of other engineers that I think describes the benefits.

Memory mapped files are fast in both .NET and JAVA but super fast using pointers in C or C++.

Here's the email thread:

Abstract: 

Hadoop reads entire files and then parses the data.  Hadoop keys tend to be single fields of data which makes the Map-Reduce process slow and complex for large (wide) data sets such as those in health.  The following will show a solution to both problems using memory-mapped-files which only reads the key fields from the file and never loads the entire file; and uses smaller composite field hashes for the key data to make it much faster to search and filter across large text fields such as gene sequences in wide and deep data sets.
Discussion:

After playing with Hadoop and other big data solutions, I have come to the conclusion that we will need the ability to do additional processing of large data sets AFTER we pull them from Hadoop.  Even if we were to hire a team of map-reduce java developers, we’ll continue to need the ability to handle some level of pattern-matching across our result sets.  Genomic data, primarily represented as large string sequences, makes this even more challenging.

Background: 

After installing Windows 10 on my laptop I put an SSD in the machine and was surprised that Win10 would start in about 5 seconds. 

SSDs are faster than spinning drives since they don’t have to wait for the seek heads to find the data on the platter(s).  But that would not fully explain the quick load times.

It turns out that Windows now supports memory-mapped-file access of the executables.  What this means is that the OS at kernel level, will map a large file into memory and only load what it needs when it needs it.  In the old days, Windows would load the entire file into memory and then map the executable library from the file into memory and execute it (kinda like Hadoop across the HDFS).  This explains the super-fast load times for Windows and has merit for our work too.

One of the problems with Hadoop is that it scales-out and uses a brute-force method to achieve performance:  lots of servers, disks, and memory.  Hadoop was written in JAVA and so it’s slow (folks are moving core functions to c/c++ but that work is still early).  To make matters worse, Hadoop must read/parse lots of HDFS files across the machines from mostly spinning drives; and then write interim results across the network and to other machines and drives.  This is a ton of file I/O.

Another problem I have seen with Hadoop is that it is a key-value system at it’s core.   It was written to have a known key (index) from which to do the initial search (the map phase) of results before being sent to the reducer(s).  Unfortunately, Hadoop does not work very well when we have large keys (such as genomic data); or when we have a variety of columns that we can use for keys.

And since Hadoop reads the full files into memory AND THEN parses them, the larger the files become, the slower the system runs.  This is what happened to Windows: the executables became so large, it took forever to load the OS since the files had to be read from disk into memory and mapped before they could be executed.  Hadoop solves this by scaling-out: adding more machines, drives, and memory (and of course complexity and expense).  Windows solved it by using memory mapped file.

I suggest that our solution would be similar to what Windows has done: use memory-mapped-files for very fast processing.

Memory-Mapped-Files in a Nutshell:

A memory-mapped-file is a file that is mapped into kernel address space as it sits on the drive.  No need to read the entire file into memory (saving RAM).  Since it runs at the kernel mode, it’s super fast and efficient. If we know a value is at a given position, we point our program to that position in the file and the contents from that are read into memory. 

So Memory-Mapped-Files can be a very fast technology to help us with our post-processing of large data sets (see example below).

Key Problem: 

But we still have our “key” problem; from our key-value technology.  Hadoop will open a file and map, into the reducer, all the matching “rows” that have the key we are interested in.  This means every file has to be read in toto, and each “row” parsed for our value.  But what if we have large “key” data such as genomic data; or if we want to use multiple “columns” as our key?  The solution to the problem is to use a Hash algorithm.  This is exactly what relational databases (SQL Server, Oracle, etc.) use to determine if a row is “dirty” or has changed.  This is how files are verified that they have not changed or been tampered with.  For us, this will work very well…

Say we want to “key” on a genomic string, say CGGGTCTGACCTGAGGAGAACTGTGCTCCGCCTTCAG and we then want to link that to another field such as MRN, 123456789.  We could map on either the genomic string or the MRN but we want to use both.  The solution is to concatenate these fields into a single string: CGGGTCTGACCTGAGGAGAACTGTGCTCCGCCTTCAG123456789 and then HASH that value (most modern CPUs support the hashing we use in the CPU so this is super-fast).  The hashed value is a much smaller binary size (typically 4 bytes instead of the ~ 50 bytes originally) so it’s much more efficient to store and search for.  Yes, we might have collisions, but those would be filtered out in the reduce phase.  Now, we can key/search across multiple columns using a terse hash.

Putting it All Together:

So the experiment was this: 
·         take a deep and wide dataset (the  Clinical Event table that has been denormalized) and hash composite columns to create a new Hash-key column (4 bytes wide). 
·         Export this table to a new binary file and memory map the file to see how quickly we can find a set of keys (matching columns). 
·         Compare that to using SQL Server with a composite key to find the same results.

Results and Observations:

The denormalized Clinical Event table with the hash keys ended up being about .5 billion rows with each row about 80 columns wide.  The final data files was ~ .5TB in size.  As you can imagine, it would take the OS a long time to read the file and even longer to then parse out the matching values.

I pointed a custom memory-mapped reader at the file and asked it find all the “rows” with a matching key (representing our genomic string and MRN).  It returned the results in single-digit seconds; typically less than 5 seconds.  This was on a host machine with spinning drives and I would expect this to be much faster when the data is hosted on SSDs.

I compared that to the relational database.  First of all, the relational database had to store the index information which added greatly to the size of the table.  Next, it had to load the large index from the file system before it could execute the query.  The index load took many minutes; however, once the indexes were loaded (into a machine with a ton of memory), the queries returned in a handful of seconds; but took much longer than the MMF solution.

Net/Net:  Memory-mapped-files of large data sets are orders of magnitude faster to process than even the most powerful relational database.  Plus, we can emulate huge systems with tons of RAM on a commodity machine sitting on top of relatively inexpensive SSD drives (Here’s a great read from MSR on this topic and another similar read).

--- I received these questions from one of my peers:     1) Is it possible to have an index around the hashkey, or is this unnecessary due to memory mapped architecture? 2)  As I understand Hashing algorithms do not guarantee unique hash value, so a few non-matching rows may occur in large datasets that happen to have the same hash. Those could easily be managed in a reduced fileset.  Any concerns with this?

To which I answered:

Yes, the “index” around the hashkey would really be a dictionary with the “key” being the hashkey value; and the values being the offsets in the file.  The cool thing about this is that we can create a persisted “index” and say, stick in a buffer at the start of the file; or, we can create any number of these ad-hoc and store them independently.  So we could have a virtual table of diagnosis codes that would simply be a file based dictionary/hashtable with the diagnosis code as key and the “row” offsets pointing to the matching rows.  This would be extremely terse and very fast to build. 

We could also use this as a way to cross-join by using the “row” offsets that match.  For example, say our dictionary/hashtable/index has MRN as the key and the values are the “rows” with those MRN values.  We might also have another dictionary with diagnosis code associated with those rows.  To find a match, we pull back the “rows” (offsets in the MMF) for each set and do an inner-join on those.  Once we have the offsets, we just point them to the MMF and pull the “rows” and the “columns” for those.  Since the MMF is so fast, pulling the data using the offset happens in milliseconds.

They key to this technology is using SSD drives as virtual RAM.  So now we are scaling up; but instead of using expensive and fast RAM, we use relatively inexpensive and a bit slower SSD.    Since an x64 system can theoretically map to 4 PB of RAM, we could put a collection of SSDs on there and let the MMF architecture map that into kernel-addressed memory space.

As for #2, you are absolutely correct that there could be hashing “collisions” so that there might be some non-matching rows.  Therefore, when we build our dictionary/index or do the data-pull, it would be important to check the values that were hashed to insure they are pulling the correct rows.

One thing I didn’t dive into but probably should at this point is describe a “row” and/or “column”.

At the start of the MMF, I hold the “table” metadata.  This is the length of each row (say 100 bytes) and the name and location (offset) in that buffer of each “column”; and I’ve even played with typing the columns.

Therefore, we store binary in bytes but can find and type them easily on the way out.

So let’s say we want to find a “row” with the MRN “column” with a certain value.  We look at our metadata table and see that each “row” is 1000 bytes long; and in those 1000 bytes, we might find MRN in offset 20 that is 10 bytes long and is returned as an INT.  To find all MRNs, we start at the beginning of the data, add the MRN offset and size, and then just loop through the file adding the row length on each iteration.  As you can see, this can really help when we have wide (think de-normalized) data sets.  I’m shocked at how incredibly fast this is.

Let’s say that our “index” or dictionary shows that we have a certain MRN if 123456789 at Row (offset 9999).  We would know that our complete “row” starts at memory location 9999, and all of our “columns” defined in the header are held in the 1000 bytes so we can pull those bytes and then populate a “row” type by reading those bytes into our type.  As a result our “MRN” would be read from 9999 + 20 (the offset) to 9999 + 20 + 10 (the length of the MRN). In code, we could then “type” the binary to our int.

For my test, I used a denormalized version of Clinical Event, where I did the code-value lookups.  Even without an index, I was able to search for a given value across the huge file in a matter of seconds.

Friday, February 12, 2016

Thoughts and Solutions for Processing Large Genomic and Other Datasets

The promises of Personalized Medicine are exciting:  using our genetic makeup and known outcomes, providers will be able to offer a customized treatment plan instead of just using a population health approach.
There are challenges:
  • Datasets are large. The human genome is comprised of over 3 billion "base pairs" of nucleotides.  When represented in ASCII ("A", "T", "C", "G", etc.), this means over 3GB of file space for each full genome map; without the metadata such as Chromosome location, etc.
  • Very little standardization on tools and solutions.  
  • Most tools come from the Open Source Community and as such, offer little to no support.
  • Most of the tools are written in JAVA with the need to "pipe" the output from one program to the other.  This requires enormous amounts of memory.
  • Although in general, the genome is represented by the four primary nucleotides (ACTG); there is a huge randomness in the pattern across the genome.  Therefore, typical compression routines do not lend themselves well to this data domain.
There were two primary objectives that drove me to explore some possible solutions to these challenges:
  1. File sizes are too large and we need to figure out a way to more efficiently process, store, and move these large files.
  2. With the goal of a "computer on everyone's desk" that has driven my thinking over the past thirty-five years, we should be able to deliver an inexpensive genome processing "personal computer" that can at least match the output if so called HPC or High Performance (or Processing) Computers.
Compression
Most file compression routines are based on the fact that particular content has a large number of repeating fragments of data. The more frequently data repeats, the greater the compression.
As mentioned before, most computer compression algorithms do not work with the DNA data domain. However, there are some cool alternatives. I'm particularly fond of the CRAM format that achieves compression via: “...bases are reference compressed (Hsi-Yang Fritz, et al. (2011) Genome Res. 21:734-740) by encoding base differences rather than storing the bases themselves” (ref) Since most of the human genome is common to all of us, this makes enormous sense. CRAM also uses some level of binary formatting which I believe is key to achieving both compression and improve processing speed (more below).
Several tools use a 2-bit, or 4-bit representation of a base instead of the ASCII byte to achieve compression. For example, the sequence ATAT, when stored as ASCII, will be stored in 4 bytes (32 bits), one for each letter. However, for a 4 bit representation, the same sequence could consume just 2 bytes (16 bits calculated as: 4 bits times four characters) and thus achieve 50% compression.  The best compression would come from the 2-bit solution which will achieve a 75% compression ratio.  That is, 2 bits per character time 4 characters is an 8-bit byte.  The prototype I built and will discuss in depth later, uses a 3-bit sequence (62.5% compression).
So what's wrong with the 2-bit sequence; after all, it will achieve superior compression as it stands. (Great discussion here.)
First off, we need our tool to discover and document insertions, deletions ("indels") and just plain “dunno” (typically represented as "N"s).   If we are processing a gene sequence, and have used all of our keys (letters) in a 2-bit format...that means we need to create an external lookup table for those anomalous reads. This external table will reduce the overall compression; and more importantly, get in the way of fast processing – detailed below.
A 4-bit compression is useful since it makes all our bit-shifting break on the byte boundaries, but at a cost of losing significant compression. 3-bit is ideal but a bit more difficult to implement (bad pun). That is, 3-bit patterns give us 8 data points, enough for the four bases (ATCG), the “N” or unknown, and one for end-of-file or end-of-sequence. It even provides a placeholder for a reference lookup. For example, let's say at position 999 we have the three bit pattern of 111, our program could then go to the lookup table and pull in the reference genome using the offset marker denoted by the three bits.
Performance
With regard to #2 above, a “computer on everyone's desk”, I wanted to explore how one might build an inexpensive genomics computer (< $1000) for local processing. In the next post, I'll talk about the development process and results in greater detail; but for purposes of exploration, here's what I set out to do:
  • Determine if I could use a Solid-State-Drive along with Memory-Mapped-Files to eliminate the need to read in the full file contents; and/or to pipe the output to other programs in the processing pipeline.
  • Evaluate the speed tradeoffs of using an interpreted programming language (Java or .NET) vs native code (C/C++)
  • Evaluate the benefits of getting as close to the wire as possible by utilizing binary/bit formats instead of letters such as ATCG.
  • Determine the cost benefit of using massively parallel processing power of the Nvida CUDA graphi cards to do the work.
  • Determine if using native xPU (CPU or GPU) registers to do the bit comparisons would lead to performance improvements. That is, use intrinsics for the xPU to invoke the “Single Instruction/Multiple Data” (“SMID”) parallel processing.
I will detail my exploratory work in more detail in forthcoming blog posts, but for now, let me summarize the current results.
  • I pulled down the full human genome in ASCII (FASTA) format and extracted it to one large file. That is, just a huge file with ATCG and Ns.
  • To benchmark the tools, I decided to keep it simple. I wanted to pass in a random sequence of bases (such as AATAAGC) and count the “K-Mer” repeats. In the example, AATAAGC, this is a seven-mer and to count how many times it appears in the genome.
  • Using our benchmark tool, I then evaluated the performance using the following technical solutions:
    • Using pure ASCII to read the raw ascii before converting to a bit format. Frankly, I was shocked at how fast this turned out to be. However, this was executed on an 8-core computer with 16GB of RAM and the entire genome would fit into RAM. This gave me hope that by using a more terse bit format from the byte format of text, I'd see more improvement.
    • I next converted the ASCII to a three bit format. This required some manual bit loading and shifting; but it also allowed me to compare larger bit sequences at one time and reduce the loops in the code. Using interpreted code (.NET but probably similar to Java), I could get a random k-mer count using a single thread in about 25 seconds after the genome file was full loaded into RAM. Surprisingly, this was similar to what I was seeing using the raw text files under .NET. SIDEBAR: I have read that .NET will use assembly code to do string compares and so I would expect to see pretty good performance if that's true.
    • I then moved the code to native C/C++ and compiled for both Linux and Windows and got almost identical results on both...about 2.5 seconds to get a random k-mer count across the full genome using the 3-bit sequence, once the full raw file was loaded into RAM.
    • I next decided to parallelize the C and .NET code running on the CPU for a single k-mer count. This increased the run time by a factor of ten; no doubt due to thread switches and memory swapping.
    • Finally, I used a single and inexpensive NVIDIA GPU, CUDA enabled, video card to run the same parallelized code on the GPU (no thread switches) and can get a random k-mer count across the full human genome in about .7 seconds after the file is loaded.
    • What I have yet to evaluate, is to use intrinsics and the native xPU registers to do the bit comparisons. However, I'm not sure the juice is worth the squeeze. If I can use a very inexpensive video card to achieve these results, and can run more than one video card per desktop machine, then I think we've gotten close to proving that by using the GPU to process genomic data, we can achieve massive parallel processing power under or on our desk.
What's next?
I plan to release some blog posts soon that will provide details of the above findings and will drop some code samples as well.
Further Reading: