Tuesday, February 23, 2016

Genomic Data: Binary vs ASCII - Part One: Introduction to Systems

This and subsequent posts will discuss the advantages of storing and processing genomic data in a binary format vs ASCII.  We will begin at a very high level and work our way down to binary processing at the lowest level of the computer system.  Feel free to skip this introduction if you want to dig straight into the details.

Introduction to Computer Storage

Data on a computer is stored as either "on" or "off" (or positive or negative in the transistors). These are the only two possible states, represented as zero or one and known as binary; each binary digit is known as a "bit".  A bit can be either 0 or 1.

From this native state, we can build all constructs from which programs operate.  Most computers use an 8-bit byte as the fundamental memory-addressable unit. In other words, everything builds on this basic container and all types derive from one or more bytes.

A byte can represent 256 distinct values (2 ^ 8).  For example, take a number such as 71; its bit representation in a byte is 0100 0111.

But now you may ask, how can I represent a really big number such as one million? For that number, you would need a container that holds at least 20 bits.  Since computers use an 8-bit byte for their fundamental storage, that means we would need to allocate space for three bytes, or 24 bits, in order to hold a number that large. Therefore, on our computer system, the actual storage would look like this (ignoring endianness for now): 0000 1111 0100 0010 0100 0000.  That is, we would read the three bytes and interpret the data therein as the number one million.
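
If you want to see these bit patterns for yourself, here is a minimal C++ sketch (nothing genomic about it yet) that prints them with std::bitset:

#include <bitset>
#include <cstdint>
#include <iostream>

int main() {
    std::uint8_t  seventyOne = 71;       // fits in a single byte: 0100 0111
    std::uint32_t oneMillion = 1000000;  // needs at least 20 bits, so a multi-byte container

    std::cout << std::bitset<8>(seventyOne)  << "\n";  // 01000111
    std::cout << std::bitset<24>(oneMillion) << "\n";  // 000011110100001001000000
}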

The important thing to note here is that our program must know that it needs to read all three bytes in order to return a number that large.  This is key to understanding "typing" in computers:

KEY:  A type in a computer is a construct that defines how many bytes make up a container and how the bits inside that container are to be interpreted.

NOTE: to be stored in three bytes, our one-million number above would need a custom container, since most computer systems use 16-bit, 32-bit, and 64-bit containers of 2 bytes, 4 bytes, and 8 bytes respectively.

As an illustration, many computer systems have a numeric type called a ULONG, which is a 4-byte container with 32 bits. This container will hold numbers as large as 4,294,967,295 (2 ^ 32).

SIDEBAR:  Did anyone do the math: 2 raised to the 32nd?  If so, you got a different number! Your calculator shows 4,294,967,296.  Correct.  But remember, we also need to represent zero, and so the range runs from zero to 4,294,967,295... which is 4,294,967,296 distinct numbers with a maximum value of 4,294,967,295.

Note that this is an unsigned number.  What the heck is that?  Well, if you want to represent negative numbers, you need something to represent the sign of the number.  To do that, we use a "sign bit".  But now we only have 2 ^ 31, or 2,147,483,648, magnitudes available, since we need a bit for the sign. So our typed 4-byte container for a signed number, known in most systems as a LONG, will hold data that ranges from -2,147,483,648 (zero is handled on the positive side) to 2,147,483,647.
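
If you want to check those ranges on your own machine, here is a small C++ sketch using the fixed-width types from <cstdint> (standing in for the Windows ULONG/LONG aliases):

#include <cstdint>
#include <iostream>
#include <limits>

int main() {
    // 32-bit unsigned: 0 to 4,294,967,295
    std::cout << "unsigned max: " << std::numeric_limits<std::uint32_t>::max() << "\n";
    // 32-bit signed: one bit is spent on the sign
    std::cout << "signed range: " << std::numeric_limits<std::int32_t>::min()
              << " to " << std::numeric_limits<std::int32_t>::max() << "\n";
}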

Now, the thing to note is that when we use this type (container), we need to allocate space for the largest number we might encounter.  The rest of the space is essentially wasted. For example, the 4-byte container we chose for our one-million value looks like this when it holds the number one: 0000 0000 0000 0000 0000 0000 0000 0001.

KEY PRINCIPLE:  Space and computational power are wasted when we use containers that are larger than we need.

I just illustrated the space problem; that is, if the largest number we would need to store is 100, there is no need to use a LONG since all those zero bytes are just wasted space.

OK, space is cheap, right?  Well, all those extra zeros need to be read from storage and sent over the network, and so they consume those resources.  Just as important, all those unneeded zeros need to be stored in "hot" memory on the machine and paged out when memory becomes constrained.  Therefore, this wasted space has a ripple effect on system-wide performance.

At the lowest level, work is done on the xPU registers ("x" can be "C" as in CPU; or "G" as in GPU).  These have a finite size and so wasted space increases the memory transfers as data is shifted in and out of the registers and local RAM.  This fact is key to the proposed Genomic architecture that we will discuss in future blog posts.

Now, some things that are interesting and have only a peripheral relationship to our work:

The Endians

At the lowest level, we only have bits with two states: one or zero.  Next, we assemble those into a byte comprised of 8 bits, and this is the fundamental way data is accessed.  In other words, memory addresses refer to bytes on all systems, and those bytes are constructed the same way on all computers.  For example, the number 71 will be represented by a byte with the following bit structure: 0100 0111 on all computers.

Where things get confusing is when we need to assemble our bytes into types. Specifically, when we start typing our data using more than one byte, we must consider the computer platform's "endianness"; that is, whether it is "big endian" or "little endian".  (OK, there are mixed-endian systems, but I'll leave the research on those to the reader.)

To understand the endians, we need to understand what the most significant digit is.  We typically write a number such as one million as 1,000,000, and in this case the most significant digit (the millions) is listed first.  The smallest and thus least significant digit is listed last.  So we view our numbers in big-endian format, with the most significant digit first.

If we were to fire up Calculator in Windows and put in our decimal value of one million, it will show the following byte format for a 4 byte word:

0000 0000 0000 1111 0100 0010 0100 0000

Now what makes this crazy confusing is that Calculator shows the value in big-endian order, with the most significant bytes first; but on Windows these bytes are stored in little-endian order!  It kinda makes sense for Calculator to display the bytes this way, since programs display our numbers in big-endian form with the millions over to the left and the ones digit to the far right.

KEY:  Most PCs with x86 architecture (Windows and Linux in general) put the least significant byte first in the byte stream and are Little-Endian.  Therefore, the bytes are stored in the opposite order from what we see in Calculator or however else we might write them out.

Where endianness becomes a problem is when we store multi-byte data in a file that is created on a system with one type of endianness and then attempt to read (type) the data on a system with a different endianness.  For example, if we take a LONG and save it to a file based byte stream on one system; and then attempt to read the same stream of bytes into a LONG on a system with the opposite endianness, the number will be bogus.

Back to our Calculator output for the number one million: 0000 0000 0000 1111 0100 0010 0100 0000, which is shown in big-endian order.  On a big-endian system, the bytes will be stored in files in the same order:

0000 0000 
0000 1111 
0100 0010 
0100 0000 

However, on our Little-Endian PC system, the bytes are stored in the opposite order with the least significant byte first:

0100 0000 
0100 0010 
0000 1111 
0000 0000 

To think of it a different way, if we viewed our decimal number of one million in little-endian, it would look like this:  000,000,1
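
For the curious, here is a small C++ sketch (my own illustration, not from any genomics tool) that copies one million out of memory byte by byte and reports which endianness the machine uses:

#include <cstdint>
#include <cstdio>
#include <cstring>

int main() {
    std::uint32_t million = 1000000;              // 0x000F4240
    unsigned char bytes[4];
    std::memcpy(bytes, &million, sizeof(bytes));  // copy the raw in-memory byte order

    // A little-endian x86 machine prints: 40 42 0f 00; a big-endian machine prints: 00 0f 42 40
    for (unsigned char b : bytes)
        std::printf("%02x ", b);
    std::printf("\nthis machine is %s-endian\n", bytes[0] == 0x40 ? "little" : "big");
}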

So what's the solution for genomic data?  Create a custom type that is platform agnostic but can still take advantage of native types.

SIDEBAR: The UCSC genomic 2-bit format uses a multi-byte signature to make it easy to determine whether the platform is big or little endian.  From their docs:  "If the signature value is not as given, the reader program should byte-swap the signature and check if the swapped version matches. If so, all multiple-byte entities in the file will have to be byte-swapped. This enables these binary files to be used unchanged on different architectures."
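
Here is a sketch of that signature check in C++; the constant is the signature value given in the UCSC .2bit documentation, and error handling is left out:

#include <cstdint>
#include <cstdio>

// Reverse the byte order of a 32-bit value.
static std::uint32_t byteSwap32(std::uint32_t v) {
    return (v >> 24) | ((v >> 8) & 0x0000FF00u) |
           ((v << 8) & 0x00FF0000u) | (v << 24);
}

// Returns true if every multi-byte field in the file must be byte-swapped.
bool needsSwap(std::uint32_t signatureFromFile) {
    const std::uint32_t kTwoBitSignature = 0x1A412743u;  // per the UCSC .2bit docs
    if (signatureFromFile == kTwoBitSignature) return false;
    if (byteSwap32(signatureFromFile) == kTwoBitSignature) return true;
    return false;  // neither matches: not a .2bit file (handling omitted in this sketch)
}

int main() {
    std::printf("swap needed? %s\n", needsSwap(0x4327411Au) ? "yes" : "no");  // prints "yes"
}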

SIDEBAR 2: When working at the lowest (and fastest) level, on the xPU registers, endianness doesn't make a bit of difference.  A 32-bit xPU register holds 32 bits of data, and so there is no endianness at that level.  This is key to our genomic tool design (it is platform independent), which is discussed in future posts.

ASCII

We talked about numbers, but what about words and letters?  

Just as we need to type (a verb) our bits into numbers, we also need to type them into letters and eventually words.  How's that work?

By convention, we use an 8-bit byte to represent each character of the standard ASCII character set.  This is a "single byte" representation.  Therefore, the letter "T" is represented by the byte-sized decimal number 84, which in binary is 0101 0100.  In the early days, just 7 bits covered all of the characters and control codes.  However, we quickly moved to using the full 8 bits of each byte in what was/is known as extended ASCII.

KEY:  Every letter in the ASCII alphabet consumes 8-bits if we use the alpha representation.
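
As a tiny C++ illustration of that point, the letter "T" and the number 84 are the same byte; only the type we apply decides how it is presented:

#include <bitset>
#include <iostream>

int main() {
    char base = 'T';
    std::cout << base << " = " << int(base)              // T = 84
              << " = " << std::bitset<8>(base) << "\n";  // 01010100
}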

There are several implications for us in this.  Here are two key thoughts:

Numbers as Characters

One way to "cheat" typing is to leave everything in its ASCII representation and convert it at run time.  Let's see how that might work for our one-million LONG.

As a LONG, this value will take four bytes and will thus consume 32 bits of bandwidth.

If instead we use the ASCII representation "1000000", that would be 7 bytes, or 56 bits (one byte for each character): almost twice as large! On the other hand, if we sent the number as "1", it would be only one byte, and so we'd use only 25% as much bandwidth as if it were sent as a LONG.

This should illustrate why the data developer should evaluate their data for storage and transmission and when to type it.

Characters as Numbers

Remember our rule:  we should store our types in the smallest container possible to reduce hardware load and processing time.  Our genome is comprised of ~3.2 billion bases, which are typically represented in ASCII format.  For DNA, these are typically "A", "C", "T", and "G".  So we have a byte-sized container that can hold 256 values, but we only need to store four.  Why not use a smaller container?  That's the premise of the 2-bit format we introduced earlier.  It's also part of the newish CRAM format.
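
To make that idea concrete, here is a minimal C++ sketch of 2-bit packing; the base-to-code mapping is my own illustration, not the official .2bit or CRAM layout:

#include <cstdint>
#include <cstdio>
#include <string>

// Illustrative 2-bit codes: A=00, C=01, G=10, T=11.
static std::uint8_t code2(char base) {
    switch (base) {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        default:  return 3;  // 'T' (anything else is not handled in this sketch)
    }
}

int main() {
    std::string seq = "ACGT";
    std::uint8_t packed = 0;
    for (char b : seq)
        packed = (packed << 2) | code2(b);  // four bases end up in one byte
    std::printf("ACGT packs to 0x%02X (1 byte instead of 4)\n", packed);
}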

As I mentioned in an earlier post, the next few posts will give specifics on the work I've done to evaluate whether a binary format with low-level processing can improve performance; it was also a fun intellectual exercise to learn CUDA programming.

Sunday, February 14, 2016

Memory Mapped Files For Large Datasets

I'm a fan of memory mapped files and think that they can play a role in processing large datasets on computers with limited memory.  The following is a copy of an internal email interchange I had with a couple of other engineers that I think describes the benefits.

Memory mapped files are fast in both .NET and JAVA but super fast using pointers in C or C++.

Here's the email thread:

Abstract: 

Hadoop reads entire files and then parses the data.  Hadoop keys tend to be single fields of data, which makes the Map-Reduce process slow and complex for large (wide) data sets such as those in healthcare.  The following will show a solution to both problems using memory-mapped-files, which read only the key fields from the file and never load the entire file, and which use smaller composite-field hashes for the key data, making it much faster to search and filter across large text fields such as gene sequences in wide and deep data sets.
Discussion:

After playing with Hadoop and other big data solutions, I have come to the conclusion that we will need the ability to do additional processing of large data sets AFTER we pull them from Hadoop.  Even if we were to hire a team of map-reduce java developers, we’ll continue to need the ability to handle some level of pattern-matching across our result sets.  Genomic data, primarily represented as large string sequences, makes this even more challenging.

Background: 

After installing Windows 10 on my laptop I put an SSD in the machine and was surprised that Win10 would start in about 5 seconds. 

SSDs are faster than spinning drives since they don’t have to wait for the seek heads to find the data on the platter(s).  But that would not fully explain the quick load times.

It turns out that Windows now supports memory-mapped-file access for executables.  What this means is that the OS, at kernel level, will map a large file into memory and only load what it needs when it needs it.  In the old days, Windows would load the entire file into memory, map the executable library from the file, and then execute it (kinda like Hadoop across HDFS).  This explains the super-fast load times for Windows and has merit for our work too.

One of the problems with Hadoop is that it scales-out and uses a brute-force method to achieve performance:  lots of servers, disks, and memory.  Hadoop was written in JAVA and so it’s slow (folks are moving core functions to c/c++ but that work is still early).  To make matters worse, Hadoop must read/parse lots of HDFS files across the machines from mostly spinning drives; and then write interim results across the network and to other machines and drives.  This is a ton of file I/O.

Another problem I have seen with Hadoop is that it is a key-value system at its core.   It was written to have a known key (index) from which to do the initial search (the map phase) of results before they are sent to the reducer(s).  Unfortunately, Hadoop does not work very well when we have large keys (such as genomic data), or when we have a variety of columns that we could use for keys.

And since Hadoop reads the full files into memory AND THEN parses them, the larger the files become, the slower the system runs.  This is what happened to Windows: the executables became so large that it took forever to load the OS, since the files had to be read from disk into memory and mapped before they could be executed.  Hadoop solves this by scaling out: adding more machines, drives, and memory (and of course complexity and expense).  Windows solved it by using memory-mapped files.

I suggest that our solution would be similar to what Windows has done: use memory-mapped-files for very fast processing.

Memory-Mapped-Files in a Nutshell:

A memory-mapped-file is a file that is mapped by the kernel into virtual address space as it sits on the drive.  There is no need to read the entire file into memory (saving RAM).  Since the mapping is handled in kernel mode, it’s super fast and efficient. If we know a value is at a given position, we point our program to that position in the file and the contents from there are read into memory.

So Memory-Mapped-Files can be a very fast technology to help us with our post-processing of large data sets (see example below).
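
As a rough illustration of the pattern (a POSIX mmap sketch with a hypothetical file name; Windows would use CreateFileMapping/MapViewOfFile instead):

#include <cstdint>
#include <cstdio>
#include <cstring>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main() {
    const char* path = "clinical_events.bin";  // hypothetical data file
    int fd = open(path, O_RDONLY);
    if (fd < 0) return 1;

    struct stat st;
    fstat(fd, &st);

    // Map the whole file; pages are faulted in only as they are touched.
    void* mapped = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (mapped == MAP_FAILED) return 1;
    const unsigned char* data = static_cast<const unsigned char*>(mapped);

    std::uint32_t value;
    std::memcpy(&value, data + 1024, sizeof(value));  // read 4 bytes at a known offset
    std::printf("value at offset 1024: %u\n", value);

    munmap(mapped, st.st_size);
    close(fd);
}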

Key Problem: 

But we still have our “key” problem from our key-value technology.  Hadoop will open a file and map, into the reducer, all the matching “rows” that have the key we are interested in.  This means every file has to be read in toto and each “row” parsed for our value.  But what if we have large “key” data such as genomic data, or if we want to use multiple “columns” as our key?  The solution to the problem is to use a hash algorithm.  This is exactly what relational databases (SQL Server, Oracle, etc.) use to determine whether a row is “dirty” or has changed, and it is how files are verified to not have changed or been tampered with.  For us, this will work very well…

Say we want to “key” on a genomic string, say CGGGTCTGACCTGAGGAGAACTGTGCTCCGCCTTCAG, and we then want to link that to another field such as MRN, 123456789.  We could map on either the genomic string or the MRN, but we want to use both.  The solution is to concatenate these fields into a single string, CGGGTCTGACCTGAGGAGAACTGTGCTCCGCCTTCAG123456789, and then HASH that value (most modern CPUs support hashing instructions, so this is super fast).  The hashed value is a much smaller binary size (typically 4 bytes instead of the ~50 bytes of the original), so it’s much more efficient to store and search for.  Yes, we might have collisions, but those would be filtered out in the reduce phase.  Now we can key/search across multiple columns using a terse hash.
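
Here is a minimal C++ sketch of that composite-key hash; I use FNV-1a purely for illustration, where a production build might use the CPU's CRC32 instructions instead:

#include <cstdint>
#include <cstdio>
#include <string>

// FNV-1a 32-bit hash: collapses an arbitrary-length key into a 4-byte value.
static std::uint32_t fnv1a(const std::string& s) {
    std::uint32_t h = 2166136261u;
    for (unsigned char c : s) {
        h ^= c;
        h *= 16777619u;
    }
    return h;
}

int main() {
    std::string sequence = "CGGGTCTGACCTGAGGAGAACTGTGCTCCGCCTTCAG";
    std::string mrn      = "123456789";
    std::uint32_t key = fnv1a(sequence + mrn);  // ~50 bytes collapsed into 4
    std::printf("composite hash key: %08x\n", key);
}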

Putting it All Together:

So the experiment was this: 
  • Take a deep and wide dataset (the Clinical Event table that has been denormalized) and hash composite columns to create a new hash-key column (4 bytes wide). 
  • Export this table to a new binary file and memory map the file to see how quickly we can find a set of keys (matching columns). 
  • Compare that to using SQL Server with a composite key to find the same results.

Results and Observations:

The denormalized Clinical Event table with the hash keys ended up being about .5 billion rows, with each row about 80 columns wide.  The final data file was ~.5TB in size.  As you can imagine, it would take the OS a long time to read the file and even longer to then parse out the matching values.

I pointed a custom memory-mapped reader at the file and asked it to find all the “rows” with a matching key (representing our genomic string and MRN).  It returned the results in single-digit seconds, typically less than 5 seconds.  This was on a host machine with spinning drives, and I would expect this to be much faster when the data is hosted on SSDs.

I compared that to the relational database.  First of all, the relational database had to store the index information, which added greatly to the size of the table.  Next, it had to load the large index from the file system before it could execute the query.  The index load took many minutes; once the indexes were loaded (into a machine with a ton of memory), the queries returned in a handful of seconds, but still took much longer than the MMF solution.

Net/Net:  Memory-mapped-files of large data sets are orders of magnitude faster to process than even the most powerful relational database.  Plus, we can emulate huge systems with tons of RAM on a commodity machine sitting on top of relatively inexpensive SSD drives (Here’s a great read from MSR on this topic and another similar read).

--- I received these questions from one of my peers:
1) Is it possible to have an index around the hashkey, or is this unnecessary due to the memory-mapped architecture?
2) As I understand it, hashing algorithms do not guarantee a unique hash value, so a few non-matching rows that happen to have the same hash may occur in large datasets.  Those could easily be managed in a reduced fileset.  Any concerns with this?

To which I answered:

Yes, the “index” around the hashkey would really be a dictionary, with the “key” being the hashkey value and the values being the offsets in the file.  The cool thing about this is that we can create a persisted “index” and, say, stick it in a buffer at the start of the file; or we can create any number of these ad hoc and store them independently.  So we could have a virtual table of diagnosis codes that would simply be a file-based dictionary/hashtable with the diagnosis code as key and the “row” offsets pointing to the matching rows.  This would be extremely terse and very fast to build. 

We could also use this as a way to cross-join by using the “row” offsets that match.  For example, say our dictionary/hashtable/index has MRN as the key and the values are the “rows” with those MRN values.  We might also have another dictionary with diagnosis code associated with those rows.  To find a match, we pull back the “rows” (offsets in the MMF) for each set and do an inner-join on those.  Once we have the offsets, we just point them to the MMF and pull the “rows” and the “columns” for those.  Since the MMF is so fast, pulling the data using the offset happens in milliseconds.
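
A rough C++ sketch of those ad-hoc indexes and the offset "inner-join" (the keys and offsets below are made up):

#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <iterator>
#include <unordered_map>
#include <vector>

// Each "index" maps a hash key to the byte offsets of the matching rows in the MMF.
using OffsetIndex = std::unordered_map<std::uint32_t, std::vector<std::uint64_t>>;

// Inner-join two offset lists: rows that carry both keys.
std::vector<std::uint64_t> innerJoin(std::vector<std::uint64_t> a, std::vector<std::uint64_t> b) {
    std::sort(a.begin(), a.end());
    std::sort(b.begin(), b.end());
    std::vector<std::uint64_t> out;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(), std::back_inserter(out));
    return out;
}

int main() {
    OffsetIndex byMrn, byDiagnosis;              // built while scanning the MMF once
    byMrn[0x1234u]       = {0, 1000, 5000};      // hypothetical keys and row offsets
    byDiagnosis[0xABCDu] = {1000, 3000, 5000};

    for (auto offset : innerJoin(byMrn[0x1234u], byDiagnosis[0xABCDu]))
        std::printf("matching row at offset %llu\n", (unsigned long long)offset);
}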

The key to this technology is using SSD drives as virtual RAM.  So now we are scaling up, but instead of using expensive and fast RAM, we use relatively inexpensive and somewhat slower SSDs.    Since an x64 system can theoretically map to 4 PB of RAM, we could put a collection of SSDs on there and let the MMF architecture map that into kernel-addressed memory space.

As for #2, you are absolutely correct that there could be hashing “collisions”, so there might be some non-matching rows.  Therefore, when we build our dictionary/index or do the data pull, it is important to check the values that were hashed to ensure we are pulling the correct rows.

One thing I didn’t dive into but probably should at this point is describing a “row” and/or “column”.

At the start of the MMF, I hold the “table” metadata.  This is the length of each row (say 1,000 bytes) and the name and location (offset) within that buffer of each “column”; and I’ve even played with typing the columns.

Therefore, we store binary in bytes but can find and type them easily on the way out.

So let’s say we want to find a “row” whose MRN “column” has a certain value.  We look at our metadata table and see that each “row” is 1000 bytes long, and within those 1000 bytes, the MRN sits at offset 20, is 10 bytes long, and is returned as an INT.  To find all MRNs, we start at the beginning of the data, add the MRN offset, read its 10 bytes, and then just loop through the file adding the row length on each iteration.  As you can see, this can really help when we have wide (think de-normalized) data sets.  I’m shocked at how incredibly fast this is.

Let’s say that our “index” or dictionary shows that we have a certain MRN of 123456789 at row offset 9999.  We would know that our complete “row” starts at offset 9999, and all of our “columns” defined in the header are held in those 1000 bytes, so we can pull those bytes and then populate a “row” type by reading them into our type.  As a result, our “MRN” would be read from 9999 + 20 (the offset) to 9999 + 20 + 10 (the length of the MRN). In code, we could then “type” the binary into our INT.
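
In code, that offset arithmetic is little more than the following sketch (the row length, column offset, and size are the hypothetical values from above):

#include <cstddef>
#include <cstdint>
#include <cstdio>

int main() {
    const std::size_t kRowLength = 1000;  // from the metadata header
    const std::size_t kMrnOffset = 20;    // MRN starts 20 bytes into each row
    const std::size_t kMrnSize   = 10;    // and is 10 bytes long

    std::uint64_t rowStart = 9999;        // row offset found via the hash-key index
    std::uint64_t mrnStart = rowStart + kMrnOffset;
    std::printf("MRN bytes live at [%llu, %llu) in the mapped file\n",
                (unsigned long long)mrnStart,
                (unsigned long long)(mrnStart + kMrnSize));

    // Scanning every row instead is just a stride of kRowLength bytes:
    for (std::uint64_t off = 0; off < 5 * kRowLength; off += kRowLength)
        std::printf("row starts at offset %llu\n", (unsigned long long)off);
}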

For my test, I used a denormalized version of Clinical Event, where I did the code-value lookups.  Even without an index, I was able to search for a given value across the huge file in a matter of seconds.

Friday, February 12, 2016

Thoughts and Solutions for Processing Large Genomic and Other Datasets

The promises of Personalized Medicine are exciting:  using our genetic makeup and known outcomes, providers will be able to offer a customized treatment plan instead of just using a population health approach.
There are challenges:
  • Datasets are large. The human genome is comprised of over 3 billion "base pairs" of nucleotides.  When represented in ASCII ("A", "T", "C", "G", etc.), this means over 3GB of file space for each full genome map; without the metadata such as Chromosome location, etc.
  • Very little standardization on tools and solutions.  
  • Most tools come from the Open Source Community and as such, offer little to no support.
  • Most of the tools are written in JAVA with the need to "pipe" the output from one program to the other.  This requires enormous amounts of memory.
  • Although, in general, the genome is represented by the four primary nucleotides (ACTG), there is huge randomness in the pattern across the genome.  Therefore, typical compression routines do not lend themselves well to this data domain.
There were two primary objectives that drove me to explore some possible solutions to these challenges:
  1. File sizes are too large and we need to figure out a way to more efficiently process, store, and move these large files.
  2. With the goal of a "computer on everyone's desk" that has driven my thinking over the past thirty-five years, we should be able to deliver an inexpensive genome-processing "personal computer" that can at least match the output of so-called HPCs, or High Performance (or Processing) Computers.
Compression
Most file compression routines are based on the fact that particular content has a large number of repeating fragments of data. The more frequently data repeats, the greater the compression.
As mentioned before, most computer compression algorithms do not work well with the DNA data domain. However, there are some cool alternatives. I'm particularly fond of the CRAM format, which achieves compression via: “...bases are reference compressed (Hsi-Yang Fritz, et al. (2011) Genome Res. 21:734-740) by encoding base differences rather than storing the bases themselves” (ref). Since most of the human genome is common to all of us, this makes enormous sense. CRAM also uses some level of binary formatting, which I believe is key to achieving both compression and improved processing speed (more below).
Several tools use a 2-bit or 4-bit representation of a base instead of the ASCII byte to achieve compression. For example, the sequence ATAT, when stored as ASCII, occupies 4 bytes (32 bits), one for each letter. With a 4-bit representation, the same sequence consumes just 2 bytes (16 bits: 4 bits times four characters) and thus achieves 50% compression.  The best compression comes from the 2-bit solution, which achieves a 75% compression ratio; that is, 2 bits per character times 4 characters fits in a single 8-bit byte.  The prototype I built, and will discuss in depth later, uses a 3-bit encoding (62.5% compression).
So what's wrong with the 2-bit sequence? After all, it achieves superior compression as it stands. (Great discussion here.)
First off, we need our tool to discover and document insertions, deletions ("indels"), and just plain “dunno” bases (typically represented as "N"s).   If we are processing a gene sequence and have used all of our keys (letters) in a 2-bit format, that means we need to create an external lookup table for those anomalous reads. This external table will reduce the overall compression and, more importantly, get in the way of fast processing – detailed below.
A 4-bit encoding is convenient since it makes all our bit-shifting break on byte boundaries, but at the cost of losing significant compression. 3-bit is ideal but a bit more difficult to implement (bad pun). That is, 3-bit patterns give us 8 data points: enough for the four bases (ATCG), the “N” or unknown, and one for end-of-file or end-of-sequence. It even provides a placeholder for a reference lookup. For example, let's say at position 999 we have the three-bit pattern 111; our program could then go to the lookup table and pull in the reference genome using the offset marker denoted by those three bits.
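A sketch of what such a 3-bit code table might look like in C++ follows; the specific code assignments here are my own and may differ from the prototype's:

#include <cstdint>
#include <cstdio>

// 3 bits give eight codes: four bases, 'N', end-of-sequence, a reference-lookup
// marker, and one code to spare.
enum Code3 : std::uint8_t {
    BASE_A = 0, BASE_C = 1, BASE_G = 2, BASE_T = 3,
    BASE_N = 4, END_OF_SEQUENCE = 5, REFERENCE_LOOKUP = 6, SPARE = 7
};

int main() {
    // Pack the bases A, T, N into the low 9 bits of a word, 3 bits each.
    std::uint32_t packed = (BASE_A << 6) | (BASE_T << 3) | BASE_N;
    std::printf("A,T,N packed: 0x%03X\n", packed);  // 0x01C
}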
Performance
With regard to #2 above, a “computer on everyone's desk”, I wanted to explore how one might build an inexpensive genomics computer (< $1000) for local processing. In the next post, I'll talk about the development process and results in greater detail; but for purposes of exploration, here's what I set out to do:
  • Determine if I could use a Solid-State-Drive along with Memory-Mapped-Files to eliminate the need to read in the full file contents; and/or to pipe the output to other programs in the processing pipeline.
  • Evaluate the speed tradeoffs of using an interpreted programming language (Java or .NET) vs native code (C/C++)
  • Evaluate the benefits of getting as close to the wire as possible by utilizing binary/bit formats instead of letters such as ATCG.
  • Determine the cost/benefit of using the massively parallel processing power of Nvidia CUDA graphics cards to do the work.
  • Determine if using native xPU (CPU or GPU) registers to do the bit comparisons would lead to performance improvements. That is, use intrinsics for the xPU to invoke “Single Instruction/Multiple Data” (“SIMD”) parallel processing.
I will detail my exploratory work in more detail in forthcoming blog posts, but for now, let me summarize the current results.
  • I pulled down the full human genome in ASCII (FASTA) format and extracted it to one large file. That is, just a huge file with ATCG and Ns.
  • To benchmark the tools, I decided to keep it simple: pass in a random sequence of bases (such as AATAAGC) and count the “K-Mer” repeats. In the example, AATAAGC is a seven-mer, and the task is to count how many times it appears in the genome. (A minimal code sketch of this count appears after this list.)
  • Using our benchmark tool, I then evaluated the performance using the following technical solutions:
    • Using pure ASCII: read and scan the raw text with no conversion to a bit format. Frankly, I was shocked at how fast this turned out to be. However, this was executed on an 8-core computer with 16GB of RAM, and the entire genome would fit into RAM. This gave me hope that by using a terser bit format in place of the byte-per-letter text format, I'd see more improvement.
    • I next converted the ASCII to a three-bit format. This required some manual bit loading and shifting, but it also allowed me to compare larger bit sequences at one time and reduce the loops in the code. Using interpreted code (.NET, but probably similar to Java), I could get a random k-mer count using a single thread in about 25 seconds after the genome file was fully loaded into RAM. Surprisingly, this was similar to what I was seeing with the raw text files under .NET. SIDEBAR: I have read that .NET will use assembly code to do string compares, so I would expect to see pretty good performance if that's true.
    • I then moved the code to native C/C++ and compiled for both Linux and Windows and got almost identical results on both...about 2.5 seconds to get a random k-mer count across the full genome using the 3-bit sequence, once the full raw file was loaded into RAM.
    • I next decided to parallelize the C and .NET code running on the CPU for a single k-mer count. This increased the run time by a factor of ten; no doubt due to thread switches and memory swapping.
    • Finally, I used a single, inexpensive, CUDA-enabled NVIDIA video card to run the same parallelized code on the GPU (no thread switches) and can get a random k-mer count across the full human genome in about .7 seconds after the file is loaded.
    • What I have yet to evaluate is using intrinsics and the native xPU registers to do the bit comparisons. However, I'm not sure the juice is worth the squeeze. If I can use a very inexpensive video card to achieve these results, and can run more than one video card per desktop machine, then I think we've come close to proving that by using the GPU to process genomic data, we can achieve massive parallel processing power under or on our desk.
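As promised in the benchmark description above, here is a minimal single-threaded C++ sketch of the k-mer count over the raw ASCII genome; the file name is hypothetical, and this naive scan stands in for the 3-bit and GPU implementations discussed above:

#include <cstddef>
#include <cstdio>
#include <fstream>
#include <iterator>
#include <string>

int main() {
    const std::string kmer = "AATAAGC";                // the seven-mer from the example
    std::ifstream in("genome.txt", std::ios::binary);  // hypothetical one-big-file extract
    std::string genome((std::istreambuf_iterator<char>(in)),
                       std::istreambuf_iterator<char>());

    // Count every (possibly overlapping) occurrence of the k-mer.
    std::size_t count = 0;
    for (std::size_t pos = genome.find(kmer); pos != std::string::npos;
         pos = genome.find(kmer, pos + 1))
        ++count;

    std::printf("%s occurs %zu times\n", kmer.c_str(), count);
}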
What's next?
I plan to release some blog posts soon that will provide details of the above findings and will drop some code samples as well.
Further Reading: