Tuesday, September 27, 2016

Copy-Forward in Clinical Notes - Oh No; or No No

After working in health IT for several years now, I often hear older docs complain about the newer generation of computer physicians.  That is, that these young docs will rarely touch the patient but instead spend their time on the computer ordering tests, poring over results, and treating the tests and not the patient. I'm not about to weigh in on that issue; but another problem with this generation of computer-guided doctors is their tendency to take shortcuts in their clinical documentation.  For example, they might copy forward (aka clone or cloning, carry forward, copy and paste, cookie cutter, cut and paste, and others) content from one note to the next.  This post will talk about an application I wrote to detect this practice and offer some observations about what we found.

I first became aware of this practice and problem from my own clinical notes.  A few years ago, I had a temporary inner ear problem and my primary care physician referred me to an ENT doctor in his office.  A couple of years after that, my audiologist referred me to the same guy for evaluation for hearing aids.  Subsequently, I ordered copies of my medical record and notes and what I found astounded me.

Although I was in my forties on my first visit, the ENT doc declared that I was an eighty-year-old patient suffering from something completely unrelated to my visit.  On my visit with the same doc a couple of years later for something completely different, he just copied the content from my first visit to this one.  In both visits, my age was years off, the notes were identical and completely wrong, and he charged my insurance company hundreds of dollars for a less than five minute visit for each encounter.

Fortunately, CMS has gotten wise to this and has published guidelines on the practice.  I understand that if they find offenses, they can decline payment for the encounter.  Ouch.  The Joint Commission has also weighed in; as have others.

In a recent project with a teaching hospital, I was tasked to work with the HIM department to determine if they had a problem with this and if so, to what extent.  This was such a cool project that I wanted to share for others who may be thinking of heading down this same path.

One thing I must emphasize is that this problem can be twisted a number of ways; and I think it's extremely important to evaluate the scope of the problem by looking at each and every way the problem may present.

For example, in our little project, it was first thought that the core problem might be:  a Resident or Fellow would copy content from notes, written by them, from an earlier exam for the same patient. Computer code to find and identify these offenses was easy to write and performant, since we only had to search for problems across a single patient and encounter.  Indeed, we found a few offenses.

The next iteration of the tool proved more interesting and informative.  During a manual chart review of the notes, an HIM employee found that a provider had copied key sections of clinical notes from one patient to another and the tool needed to discover this too.  This made the tool much more complicated since it now needed to search across all patients for duplication.

So in review of our new requirements, the tool needed to identify the following:
  1. Provider "A" copying sections of notes they had written for a patient previously during an earlier exam; when one would expect the narrative to change over time.
  2. Provider "A" writing a note for a patient who had previously been seen by Provider "B"; where Provider "A" copied key clinically relevant text authored by Provider "B" into the newer note for the patient.
  3. Provider "A" storing a "library" of clinical text locally on their machine that is NOT patient specific; and using that to paste into a variety of patient charts.
  4. And the very worst offense;  Provider "A" copying Provider "B"'s (an Attending perhaps?) notes into a local library and copying and pasting from there into a variety of patient charts.
So next, let me fully describe how I wrote the suite of tools to handle these business requirements, and summarize what we found.

Tool Suite Requirements

Here is a list of requirements we used to develop the tools:
  1. Minor copies are OK and frankly, expected.  Therefore, the tool had a configuration-driven text-length threshold that defined a potentially problematic copy-forward.  Text copied forward under the threshold was ignored.
  2. We needed to exclude terms that were auto-populated in the EMR, since one would expect to find a large number of instances of those across all patients and encounters.
  3. We wanted to exclude sections of the notes that were less clinically problematic for the copy forward offense such as patient history.
  4. We wanted to be able to group our "hits" by provider, patient, encounter, and note type.  
  5. We also wanted to report on (group by) the top hits by two metrics:  the length of the copied text and the frequency of the copies.
  6. Because we wanted to group by maximum length, it was important to take the first hit and grow the search string until the maximum match length could be determined; and to omit the previous hits which would be part of this longer string.

What We Found

  • As expected, we found a large number of cases where a provider would copy key sections of their notes for the same patient and encounter from one exam to the next. When refined, it was clear that many (most?) of these were in areas of the notes where one would expect to see patient improvement over time; but instead, identical observations were made over and over.
  • A lesser problem, but still rather egregious, were examples where a provider would copy text written by another provider during an earlier exam into the notes for the same patient and encounter.  Again, one would expect to see some change in the observations (hopefully for the better) over time; but instead, the exact same observations were documented; calling into question whether the provider actually saw the patient.
  • What I think surprised all of us were the rather rare, but still glaringly problematic, cases where a provider would obviously keep a library of observations and simply copy and paste from it from patient to patient and encounter to encounter.  It should be noted that when we did a "deep dive" into specific offenders, we found that they did this broadly across their notes and with a variety of strings of text.


I used a multi-threaded C# application running on a beefy box with a bunch of RAM, and was able to load a substantial amount of text into memory and do memory compares directly using the CPU.  Still, the problem set is remarkably similar to that of doing pattern matching across the human genome that I discussed previously.  The solution as written used massively parallel code and would benefit from the power of CUDA and GPU processing.

One would expect that, for a production or regularly executed detection program, the provider community would "wise up" and begin to insert small insignificant edits into their text to defeat the detection routine.  However, as with genomic processing that identifies "inserts and deletes" or "indels", a robust application should be able to handle these minor edits with relative ease.  Couple that with some level of natural language processing ("NLP"), and it's conceivable that a powerful forensic tool to identify copy-forward policy infractions could be developed and operated on standard commodity hardware.

Fun stuff!

Friday, April 15, 2016

Genomic Data: Binary Cuda K-Mer Counting on the GPU: Results

To review, I'm enthralled with the power of GPU and CUDA programming and needed a use-case to "play with" the technology.  My use case is pretty simple: I wanted to see how fast I could count all K-Mers in the full human reference genome using an inexpensive GPU card ($200) on commodity hardware (< $1000).

The results are in, I was able to count all 265,631 unique 9-Mers across the full human genome with 3,209,286,109 9-Mer sequences at over 15 per second.

In other words, in one second, I could get the full count (and locations) of fifteen different 9-Mers across the full 3.2 billion possibilities in the full human genome!

Thursday, April 14, 2016

Counts representing Items and the impact on Data Compression and Encryption

All things in computers are identified in counts of digits (called bits) which are either on or off.  In this post, we'll discuss how counts (discussed in the previous post) are used for identification.

Saturday, March 26, 2016

Binary for Dummies: Explaining Endianness and Base Counts in the Simplest Way Possible

I've received feedback that this binary stuff is too complex to understand.  So let me start at the very foundation and build up from there.

Tuesday, March 8, 2016

Genomic Data: Binary vs ASCII - Part Three: CPU vs GPU


In the first post, I described some background information on computer storage, how byte arrangement "endianness" can impact our tools, and ASCII vs binary data.

In the previous post, I discussed the 3-bit format I used for this evaluation.  Let me summarize the approach and then we'll proceed to look at using a CPU vs GPU to do our processing.
  • For this Proof-of-Concept ("POC"), I first converted the full human genome, in fasta format, to a single file of contiguous characters of A,T,G,C, and Ns.  
  • Next, I converted these characters into a 3-bit format with each base represented by a particular 3-bit pattern.
  • The original hg38.fa file is 3,273,481,150 bytes long.  After I removed the line and chromosome breaks (**see more info below), the file in ASCII was 3,209,286,105 bytes long.  However, after moving this to 3-bit format, the file ended up 1,203,482,291 bytes -- a 63% reduction in size.
  • The 3-bit file format has all 3-bit characters in a contiguous pattern packed from left to right; regardless of the byte layout ("endianness") of the host platform.
The goal was to conduct pattern matching; that is, to get a count and optionally the location of certain nucleotide patterns; for example: find the count and location of all the 21-Mers of: ATATATATATATATATGGATA in the human genome.  

For both the CPU and GPU code, we conduct our comparisons in the same way:
  • Take the K-Mer we are looking for and convert it into our 3-bit pattern.
  • Using the sequence of 3-bit representations, we then pack those bits into the smallest native container we can use for comparison.  In this case, a 64 bit unsigned long long.  The final bits look like: 0011 0010 1100 1011 0010 1100 1011 0010 1100 1011 0010 1100 1100 1000 1100 1011 and so this is the pattern we are looking for.
  • In order to do our pattern matching, we need to first read the source file we want to search into memory.  In our POC, this is the full genome, but since the file is now only 1.2GB, it will fit easily into memory: either GPU or CPU.  This is key, we only load our source once and retain it in global memory for the threads to use.
  • Our final step is to read the bits from the target file into the same-sized container as our search string (for example, an unsigned 64-bit long).  Once we do that, we simply compare the variables holding the bits; and if they match, we increase our counter and optionally get the file character position.

CPU Code

The CPU code works as just described:  it first stores the 3-bit sequences for our pattern in an unsigned 64-bit container, call it uint64SearchString.  Next, we walk the array of bytes and load up our bit sequences for the search count (21 in this case) 3 bits at a time. So we will start at the first byte and load it and the following six bytes (first 7 bytes) into our unsigned 64-bit container, call it uint64FindMatch.  To complete the 63 bits we need for our 21 characters in 3-bit format, we only need the first seven bits of the eighth byte, and so we load those at the end of our uint64FindMatch variable.  Next, we simply compare the two variables and if they match, we know our patterns match.   Something simple like:

if uint64SearchString is equal to uint64FindMatch then add one to Count

Simple and very fast!!

To get the full count, we continue to read through our byte array holding the full genome, loading our 21-Mer sequences into our container uint64FindMatch; positioning the start of our 3-bit sequence one letter (three bits) to the right for each iteration.  That is, for the second iteration, we start at the fourth bit in the first byte and read all the ensuing bits up to our 63-bit count from the following bytes.  Since the source bytes are preserved in memory, this is relatively fast.

I first wrote the code in C# which, like Java, is compiled to an intermediate language that is translated to machine code at runtime, so it's somewhat slower than native machine code.  Using C#, I was able to count our example 21-Mer across the whole genome in 11.96 seconds on a single-proc, four-core machine. However, only one core was busy and so the CPU utilization showed only 25%. Let's try with four threads, one for each core.

The four threads achieved the same timing; with each thread running on one of the four cores and our system CPU utilization hitting 100%.  

Conclusion:  Using our CPU search pattern, we can get a 21-Mer count across the full genome in ~12 seconds per core using C# code.  Therefore, with four cores and searching for four patterns, we average ~3 seconds per sequence and would expect that to decrease proportionately as we add CPU cores.  But, there's a severe limit to how many CPU cores we can use.  CUDA to the rescue!

Before we move to CUDA and GPU code, let's take a look at C and native code to see if that makes a difference.

I modified the logic slightly and used a 7-Mer string held in a 32-bit unsigned long.  As expected, this resulted in our counts going from a couple of dozen for our 21-Mer to a million+ for the shorter search sequence.

Using a single thread for a random 7-Mer, the program took 8.37 seconds (using one core) to find the ~1.5 million matches.

So now we have our baselines for our CPU matching approach.  Let's look at CUDA.

GPU and CUDA programming

There are many advantages to using the GPU to handle our massively parallel problem:
  1. The GPU can have thousands of cores instead of just a handful on the CPU.  The $200 card I used for these tests has over 1000 cores.
  2. Threads are limited on the CPU since we have thread-switch overhead; and because the CPU has to handle things such as keyboard input, network traffic, and a host of other things.  The GPU threads are dedicated and designed for quick, short activities.  In general, we typically don't want to spin up more than four or five threads per CPU core.  So for my eight-core i7 test machine, I could spin up no more than about 40 threads.  However, on my GPU I can spin up thousands of threads to run simultaneously.
  3. We can add multiple cards to a single host system and derive computing performance of an HPC computer on an inexpensive desktop.
  4. I use a dedicated video card for my GPU calculations.  Even when it's maxed out, the calculations have a negligible impact on my host machine since the CPU is not utilized much with all the work done on the dedicated GPU.  In other words, it's like having a separate computer running in the same machine.
OK, so what happened when I ran my CPU program on the GPU?  It took minutes instead of seconds to complete.  The reason for this was that I was only using a single thread since the design was optimized for a multiple-threaded CPU application and not the GPU.

Remember, on the CPU we sent our search pattern to the thread, and the thread used that pattern to load all the sequences from the byte buffer of the full genome.  On the CPU, it is much more efficient to use a single thread for a single search pattern and walk the full file in memory.

This design is optimal for the CPU, and our CPU response is not bad (about 8 seconds per core for native C code on both Linux and Windows).  However, as we saw, we are limited by the number of CPU cores.  A single search on a single thread saturates a single CPU core.

However, what if we use one (or more) search pattern(s) for all of our threads and have each thread work on only one source file sequence?  In other words, say we spin up a million threads and ask them to look for ten different 21-Mer sequences... We can pass those sequences to all million threads, and each thread will extract one small matching-sized sequence slice from the source file: the full human genome in our case.

A word picture might look like this:
  • Take two 21-Mers and pass them to two threads.
  • The first thread reads the first twenty-one 3-bit "letters" from our byte array (the full human genome) and compares the value to the value of the two search patterns.
  • The second thread moves our start position over three bits (the size of our "letter") and loads the next 21 "letters" (63 bits) into its variable, against which it will compare the collection of search patterns.
The CPU is more efficient when it can take a sample pattern and walk the source (full human genome) looking for a match.

Our GPU will reverse this.  It will take a K-Mer sequence (or many) and each thread will load a different but discrete K count of the 3-bit "letters" from the source (full human genome).  The thread(s) will then walk through the list of sequences we want to count, checking whether each matches the one source sequence.

OK, so more test data to digest.  I ran this on our single-threaded C code both ways on the CPU.  The CPU-optimized-code finished in 8.37 seconds as mentioned above.  But the GPU-optimized-code took four times as long when run on the CPU.

As we observed above, the CPU-optimized-code took minutes to complete on the GPU since it ran on a single thread.  However, when we used our GPU-optimized-code and launched 500 threads over our 1024 cores, we were able to get a 21-Mer count in under .7 seconds. Most of this time was spent loading our sequence from the buffer and converting it to our type.  But once that is done, we can compare hundreds or perhaps thousands of K-Mers using the same source sequence spread across these thousands of threads in milliseconds each.

For example, using a single 21-Mer and our full genome 3-bit file, we can get a count of all matches in .720823 seconds.  When we add one more search sequence to our search, both counts return in .731270 seconds.  A total of four takes .759792 seconds.  Think about this for a moment.  We can get the counts of four random 21-Mers across the full human genome in under .19 seconds each! 

This is with none of the fine-tuned optimizations that the Nvidia/Cuda code profiler suggested (this IS a POC after all and I'm really just looking for relative values).

Plus, this does not use GPU SIMD intrinsics, which I believe would dramatically improve performance since they are hardware accelerated and would let us do our binary compares at the register level.   This is on my "look at next" list :)

SIDEBAR:  Here's how I think that might work.  Since these intrinsics compare the bytes of a 32 bit word, we would put our bit sequences in either 32 bit containers or a multiple of 32.  Empty bits would be zeroed out as would the empty bits from the source.  By only looking for a true/false match, the compares would stop as soon as we hit a non-match (most of the cases).  Since this is hardware accelerated, it should be incredibly fast.  

Another performance improvement should be seen if we expand out and use a larger number of sequences.  That is, what if we pre-populate an array of all the K-Mers we want to search for, holding them in the native container such as an unsigned 64-bit long?  Using CUDA, our only overhead, once we have the source sequence in memory on the thread, would be to do the compare which, as we just illustrated, takes milliseconds.

For my next exercise, I plan to do just that.  That is, I will pre-compute several thousand 21-Mers and run them through this CUDA engine and determine how long it would take to do the counts.  I expect I can determine the counts of thousands of K-Mers in under a minute.  Fun stuff.

** OK some closing thoughts and disclaimers.
  • A question or problem one might see in all of this is to make this actually useful.  In other words, say we wanted to search across a particular chromosome and not the full genome.  What then?  I propose that we have a separate index file that will provide a list of offsets for sections of the whole file we want to delineate.  So if we use the index and know that Chromosome 17 is from Position 1234 and runs to Position 6789, we can then limit our searches and counts by those parameters. Should make things even faster.
  • CUDA programming has a rather steep learning curve and most of the program parameters are gated by the specifications of the video card itself and by the makeup of the data we send to it.  Trial and error here can be your friend.
  • As a reminder, this was a brute-force type of test and I was looking for relative performance improvements more than actual results.  However, I think these ideas can form the genesis of a new toolset that can run efficiently on commodity hardware and compete with most cloud based or HPC type systems.

Monday, March 7, 2016

Genomic Data: Binary vs ASCII Part 2 - Three Bit Representation

This post will begin to describe the "Proof of Concept" work (or play I guess) I have done to explore two primary ideas:
  1. What impact will a terse binary format of the human genome have on file compression and pattern matching processing speed?
  2. If we use a binary format, what will be the impact of using low level (binary) computing via the CUDA GPU and/or xPU intrinsics?
DISCLAIMER: This exercise was not intended to be a final solution or to replace existing tools. It was done to evaluate alternatives to some of the existing tools; and to explore how commodity hardware might be used for genomic processing instead of large (and expensive) High-Performance-Computing ("HPC") solutions, or cloud alternatives.


The full reference human genome is currently 3,270,974,121 bytes in length in a fasta file format.  This includes headers, line breaks, etc. but generally represents the human genome with five primary single ASCII letters:  A,C,T,G,N which represent the various nucleotides.

Most current tools that process the genome are written in interpreted computer languages and work with these ASCII representations.  Of course there are exceptions, but in general, these tools will read in the full (or a partial via a stream) set of ASCII letters to begin their work.  This requires enormous amounts of computer memory and processing power. I submit that by changing the way we represent the data, we can reduce the computing resources needed.  Further, by using native programming constructs (C/C++ and Intrinsics) we can deliver a robust set of tools using inexpensive commodity hardware that can rival HPC systems.

The Simple Test

For this POC, I wanted to use a specific problem that might be representative of the broader domains of similar problems.  Specifically, I wanted to write a program that would take a random string of nucleotides, and search the entire human genome to count the number of instances found for that string.  This would be useful in k-mer counting; and pattern matching across the genome.

For example, we might want to count the number of instances, and find the location of:  ACTAAGGA.

Again, the goal is to simply evaluate and compare conducting this exercise on:
  • running on a machine with limited memory - using memory-mapped files on top of SSDs to augment RAM
  • comparing the performance of interpreted code with native code
  • comparing the performance of the same code run on the CPU vs the GPU

The Hypotheses 

  • Processing would be much more efficient if we use a more terse binary format instead of ASCII
  • Comparisons using a binary pattern will be orders of magnitude faster than using the ASCII format - primarily due to the more compressed patterns.
  • CUDA processing using one thread for every possible sequence in the original buffer (e.g. the full human genome) will be much more efficient than using CPU based parallel processing
  • Since CUDA employs a different construct for parallel processing, we should flip our patterns searchers around to take advantage of the platform.

    That is, in a CPU based search, we might conduct our search as follows:  a) load the search string into the CPU register and then walk the source file to see where there are matches.  b) we could load a different search string into different threads (limited by the CPU) and use a global buffer to hold our source data (the human genome).

    In a GPU search pattern, we turn this around (which seems counter-intuitive).  That is, we load the source pattern (from the human genome) into the GPU register(s) and we then do a pattern match on our k-mer strings.  More info below.
  • The overarching belief is that most "processing power" for pattern matching is actually consumed by memory transfers and so our optimizations should focus on those to achieve better performance.  That is, everything we can do to reduce memory transfers will improve performance - especially in the CUDA world.

The Work

Let me attempt to break this down into smaller logical sections (and posts) to hopefully make it more understandable.

Let's use a short sample source search string:  ATTATATTATTA and a sample 3-mer, TTA, to count.  We can see that the TTA string appears three times in the source string.


First of all, if we keep this in ASCII byte format, our source will be 12 bytes (letters) times 8 bits or 96 bits long.  Our k-mer is three bytes times 8 bits or 24 bits long.  Therefore, at the binary level which our systems work, we are searching for a 24 bit sequence across the 96 bit source.

To our computers this would look like this:
ASCII A is decimal 65 which in binary looks like:  0100 0001
ASCII T is decimal 84 which in binary looks like:  0101 0100

So our target string of TTA appears to our computers as the following 24 bits: 010101000101010001000001

And our source (searched) string of ASCII ATTATATTATTA appears as the following 96 bits:

01000001 01010100 01010100 01000001 01010100 01000001 01010100 01010100 01000001 01010100 01010100 01000001
Now, let's assume the common 2-bit format for these data and let's assume that the "A" is represented by decimal 1 or binary 01.  The "T" can be decimal 2 or binary 10.

Now our target string of TTA appears to our computer as the following 6 bits: 101001

and our source (searched) string ATTATATTATTA appears as the following 24 bits: 01 10 10 01 10 01 10 10 01 10 10 01

As you can easily see, there are far fewer bits that need to be moved from disk, to memory, to xPU registers with the more terse 2-bit format.  This will be a huge performance gain.  Plus, we ultimately use the xPU registers to hold our binary patterns and compare those with some memory contents.  If we have smaller registers or larger search patterns, the system would need to "chunk" up the larger patterns by swapping in and out of memory.  These memory swaps are what we are working to avoid!!

For example, if we use the ASCII format of 8-bit base representations, and assume 32-bit registers...we can compare four bases in our registers at a given time.  However, once we cross that barrier, by say looking for 11-mers, we would need to "chunk up" our data and hold some level of matching state across pattern matches.  That is, we look for the first part of the pattern across the first 32 bits and the next part of the pattern over the next and so on.

However, if we use a 2-bit format, our 11-mer can fit into 22 bits (instead of 88) and thus we can more efficiently use the xPU registers for our compares.

KEY POINT:  If we use a 2-bit format for our data, we reduce our memory operations by roughly 75% - and increase our performance by the same factor!

Let's talk 3-bit

One problem with 2-bit is that we need to represent more than four bases.  At the very least, we need the "N" variable and so five items.  We could make our lives easier by using a 4-bit pattern since it would fit nicely in our 8-bit, 32-bit, 64-bit, etc. native containers; but we don't need all four bits.  For our use, we have opted for a 3-bit pattern since it is of optimal size and will accommodate 8 items or entries.  For the POC example, I included the following:  A,T,C,G,N,EOF, and Chromosome_Break; with one item left for whatever.

What makes this 3-bit pattern difficult is that it won't break on the byte (or other native container) boundary since it's an odd number.  That is, a 4-mer sequence such as TTAA represented in 8-bit ASCII would fit nicely in 4 bytes; and would also fit nicely in a single byte when represented by the 2-bit format; or in two bytes when represented by the 4-bit format. In other words, we never break a base "letter" over the byte boundary when we use an even bit format (2-bit, 4-bit, or 8-bit).

However, our 3-bit pattern is not as easy to work with.  For example, our search pattern of TTAA in 3-bit might look like:  "A" as decimal 2 in 3-bit would be 010 and T as decimal 1 in 3-bit would be 001.  Our 4-mer would then need 12 bits: 001 001 010 010.

But remember that all data is stored as bytes on the computer.  Therefore, we need to have these 12 bits fit into our 8-bit byte native containers.  Assuming we don't want any extraneous bits in our storage, we would need to break our bits at the third base (assuming we load from left to right):

So our first byte would be: 001 + 001 + 01  and our second byte would be 0010 0000

Since our 3-bits don't break at the byte boundary, we need to store our binary sequence in the smallest native container that can hold our sequence.  

To understand this, you have to forget comparing bytes (as we tend to do today in most genomic programs); and think about comparing bits.

If we use our 3-bit format for pattern matching, we can store up to 2-mers in a byte (two 3-bit items with no remainder).  Similarly, a 16-bit container could hold up to 5-mers; a 32-bit container would hold up to 10-mers, a 64-bit container would hold up to 21-mers, and so on.

The key point here is that we should use the smallest container to hold our k-mer represented in the 3-bit format so that we use the low-level hardware as efficiently as possible; and to reduce our expensive memory transfers.

Let's illustrate with a 5-Mer of TTAAT.  

Using the "A" as decimal 2 in 3-bit would be 010 and T as decimal 1 in 3-bit would be 001. Therefore, our 3-bit search pattern for TTAAT would look like this in binary:  001 001 010 010 001.  After we pack these into bytes, which we must do at the lowest level of storage, our bytes would look like this (assuming loading from left to right): 
   0010 0101 
   0010 0010 (the final zero is a remainder and will start the next k-mer)

Now this is where it gets a bit funky so pay attention.  If we try to read our bytes into a 16-bit integer using native code, the endianness will make a big difference since the byte order (not bit-order!!!) can differ from one system to another.  So to read straight into an int16 in one system, the bits would be: 0010 0101 0010 0010 (which is correct for us since we load from left to right); but on a different platform, they would be reversed and look like this: 0010 0010 0010 0101 

Therefore, if we want to be portable from one platform to another, and eliminate the need to swap bytes, we need to read the bits into our container bit by bit at runtime.

In pseudo code, this is what this looks like:

Byte b1 = decimal 37 or 0010 0101
Byte b2 = decimal 34 or 0010 0010

When we read in our bits from our bytes, we want to pack them left to right.  So we read the number of bits from all of our bytes until we get to the total number of bits (K-mer size times bit-size).  So for our example of a 5-mer, we need to read 15 bits from our first two bytes and pack them into our int16 from left to right.

Our int16 ends up looking like this: 0010 0101 0010 0010.  We frankly don't care what the decimal value is (9,506, as it happens) since we never use it!  Our container simply holds the bits packed left to right, in the same order they were read from the bytes on the file system or in memory.
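
The bit-by-bit load might be sketched like this; it is a minimal, endianness-free illustration (`load_bits` is a hypothetical name, and the byte values 37 and 34 are the two packed bytes from the example above).

```python
# Sketch: read bits from a byte stream one at a time, most significant
# bit first, and pack them left to right into a 16-bit container.
def load_bits(byte_stream, nbits, container_bits=16):
    value = 0
    for i in range(nbits):
        byte = byte_stream[i // 8]
        bit = (byte >> (7 - (i % 8))) & 1    # MSB-first within each byte
        value = (value << 1) | bit
    # Left-align within the container, leaving the remainder bits zero.
    return value << (container_bits - nbits)

packed = load_bits([37, 34], 15)             # 15 bits for a 5-mer
print(format(packed, "016b"))                # 0010010100100010
```

Because we pack bit by bit at runtime, the result is the same on any platform, which is the whole point.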

To do our compare now, we simply read our bytes from memory or the file system (or onto our xPU register) in the size of the container we need, and then compare the values.  In other words, we load the bits into an int16 and compare it with our search pattern, also loaded into an int16, with the bits on both loaded from left to right.

For example, we might read the 3-bit representation of our genome from the bytes on the file system into the smallest container size for our k-mer (the int16 for our 5-mer), mask off the remainder bit (since 15 bits don't divide evenly into 16), and compare the numbers.

So we read our search size of bits (15 in our 5-mer example) from the source string, such as the full human genome, and load that into the type we are using (an int16 for our 5-mer).  We mask off the remainder bit we don't care about, and then compare the two int16 numbers.
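
A sketch of the mask-and-compare step, assuming the 15 significant bits are left-aligned in the 16-bit container as above; the names are illustrative.

```python
# Sketch: compare a packed 15-bit window against a packed search
# pattern, masking off the remainder bit that isn't part of the 5-mer.
K, BITS = 5, 3
USED = K * BITS                              # 15 significant bits
MASK = ((1 << USED) - 1) << (16 - USED)      # 1111 1111 1111 1110

def matches(window_int16, pattern_int16):
    return (window_int16 & MASK) == (pattern_int16 & MASK)

# 9,506 is the packed TTAAT pattern from the example; 9,507 differs
# only in the remainder bit, so it still matches.
print(matches(9507, 9506))                   # True
```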

Closing comments and Thoughts:

  1. This approach is platform independent and so the endianness of the bytes makes no difference.
  2. Search is automatically optimized by k-mer size, so searches for small k-mers are extremely fast, even across the full genome.
  3. Instead of using hashes of letters as indexes, we can use the integer values; but I would argue we no longer need indexes since the search is so fast ...milliseconds across the whole genome.
  4. Assuming the CPU, we can load a different k-mer search pattern into memory using the native container (typically an unsigned long or long long) and then simply walk the array of source bytes, loading them bit by bit into native containers for comparison.  Since we are not really typing the data (we simply use the native containers for convenience), our loads and comparisons are very fast.  We just slide through the array of bytes and look for pattern matches.  
  5. Assuming the GPU, things become more interesting.  We'll talk about how this logic is reversed in the next post; but as a teaser:  each thread holds a portion of the source pattern (the whole human genome takes millions of threads but only milliseconds to match) and then does the compare with the search k-mer.  
  6. The examples using int16 are not strictly correct, since int16 has a sign bit; we should always use an unsigned container (e.g., uint16), since we don't use the sign bit and we need the space.
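
The sliding comparison in item 4 might be sketched as follows.  This version works on bit strings for clarity rather than on native containers, and the C and G codes are assumed (the post only assigns A and T).

```python
# Sketch: slide through a packed 3-bit sequence one base (3 bits) at a
# time and report base positions where the k-mer pattern matches.
# Names are illustrative, not from the actual tool; C and G codes are assumed.
CODES = {"A": 0b010, "T": 0b001, "C": 0b011, "G": 0b100}

def to_bitstring(seq):
    return "".join(format(CODES[b], "03b") for b in seq)

def find_kmer(source, kmer):
    src, pat = to_bitstring(source), to_bitstring(kmer)
    hits = []
    for i in range(0, len(src) - len(pat) + 1, 3):   # slide base by base
        if src[i:i + len(pat)] == pat:
            hits.append(i // 3)                      # base position
    return hits

print(find_kmer("GATTAATCTTAAT", "TTAAT"))           # [2, 8]
```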

Tuesday, February 23, 2016

Genomic Data: Binary vs ASCII - Part One: Introduction to Systems

This and subsequent posts will discuss the advantages of storing and processing Genomic data in a binary format vs ASCII.  We will begin at a very high level and dive deep to using binary processing at the lowest level in the computer system.  Feel free to skip this introduction if you want to dig into the details.

Introduction to Computer Storage

Data on the computer is stored as either "on" or "off" (positive or negative in the transistors). These are the only two possible states; they are represented as zero and one, and this representation is known as binary.  Each of these two-state units is known as a "bit", which can be either 0 or 1.

From this native state, we can build all constructs from which programs operate.  Most computers use an 8 bit byte as the fundamental memory-addressable unit. In other words, everything builds on this basic container and all types derive from one or more bytes.

A byte can store 256 distinct values (2 ^ 8), from 0 to 255.  For example, the number 71 has the bit/byte representation 01000111.

But now you may ask: how can I represent a really big number such as one million? That number needs a container holding at least 20 bits.  Since computers use an 8-bit byte for their fundamental storage, we would need to allocate space for three bytes, or 24 bits, to hold a number that large. Therefore, on our computer system, the actual storage would look like this (ignoring the endians for now): 0000 1111 0100 0010 0100 0000.  That is, we would read the three bytes and interpret the data therein as the number one million.  
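
A quick check of that arithmetic, using Python's built-in integer helpers:

```python
# Sketch: one million needs 20 bits, so three 8-bit bytes hold it.
n = 1_000_000
print(n.bit_length())                         # 20
three_bytes = n.to_bytes(3, byteorder="big")  # most significant byte first
print([format(b, "08b") for b in three_bytes])
# ['00001111', '01000010', '01000000']
```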

The important thing to note here is that our program must know that it needs to read these three bytes in order to return the number that large.  This is key to understand "typing" in computers:

KEY:  A type in a computer is a construct that tells the program how many bytes make up the container and how to interpret the bits held within it.  

NOTE: to be stored in exactly three bytes, our one-million number above would need a custom container, since most computer systems use 16-bit, 32-bit, and 64-bit containers of 2, 4, and 8 bytes respectively.

As an illustration, computers have a numeric type called a ULONG, which is a 4-byte container with 32 bits. This container will hold numbers as large as 4,294,967,295 (2 ^ 32) 

SIDEBAR:  Did anyone do the math: 2 raised to the 32nd?  If so, you got a different number! Your calc shows 4,294,967,296.  Correct.  But remember, we also need to represent zero, so the range runs from zero to 4,294,967,295... which is 4,294,967,296 distinct values with a maximum size of 4,294,967,295.

Note that this is an unsigned number.  What the heck is that?  Well, if you want to represent negative numbers, you need something to represent the sign; to do that, we use a "sign bit".  But now we only have 2 ^ 31, or 2,147,483,648, places, since one bit goes to the sign. So our typed 4-byte container for a signed number, known as a LONG in most systems, will hold data that ranges from -2,147,483,648 (the zero is handled on the positive side) to 2,147,483,647

Now, the thing to note is that when we use this type (container), we need to allocate space for the largest number we might encounter; the rest of the space is essentially wasted. For example, the container we need for our one-million number will look like this when it holds the number one: 0000 0000 0000 0000 0000 0000 0000 0001.  

KEY PRINCIPLE:  Space and computational power are wasted when we use containers that are larger than we need.

I just illustrated the space problem; that is, if the largest number we would need to store is 100, there is no need to use a LONG since all those zero bytes are just wasted space.

OK, space is cheap, right?  Well, all those extra zeros will need to be read from storage and sent over the network and so that consumes those resources.  Just as important, all those unneeded zeros would need to be stored in "hot" memory on the machine; and paged out when memory becomes constrained.  Therefore, this wasted space has a ripple effect for system wide performance.

At the lowest level, work is done on the xPU registers ("x" can be "C" as in CPU; or "G" as in GPU).  These have a finite size and so wasted space increases the memory transfers as data is shifted in and out of the registers and local RAM.  This fact is key to the proposed Genomic architecture that we will discuss in future blog posts.

Now, some things that are interesting and have only a peripheral relationship to our work:

The Endians

At the lowest level, we only have bits with two states: one or zero.  Next, we assemble those into a byte comprised of 8 bits, and this is the fundamental way data is accessed.  In other words, memory addresses are in bytes on all systems, and those bytes are constructed the same way on all computers.  For example, the number 71 will be represented by a byte with the following bit structure: 0100 0111 on all computers.

Where things get confusing is when we need to assemble our bytes into types. Specifically, when we start typing our data using more than one byte, we must consider the computer platform's "endianness"; that is, whether it's "big" or "little endian".  (OK, there are mixed endians but I'll leave the research for those to the reader).

To understand the endians, we need to understand what the most significant digit is.  We typically write a number such as one million as 1,000,000, and in this case the most significant digit (the millions) is listed first.  The smallest, and thus least significant, digit is listed last.  So we view our numbers in big-endian format, with the most significant digit first.

If we were to fire up Calculator in Windows and put in our decimal value of one million, it will show the following byte format for a 4 byte word:

0000 0000 0000 1111 0100 0010 0100 0000

Now what makes this crazy confusing is that this is shown in big-endian with the most significant bytes first; but on Windows these bytes are stored in little-endian order!  It kinda makes sense for Windows to display the bytes this way since programs display our number in big-endian with the millions over to the left and the single values to the far right.

KEY:  Most PCs with x86 architecture (Windows and Linux in general) put the least significant byte first in the byte stream and are Little-Endian.  Therefore, the bytes are stored in the opposite order of what we might see in Calculator or otherwise represent them.

Where endianness becomes a problem is when we store multi-byte data in a file that is created on a system with one type of endianness and then attempt to read (type) the data on a system with a different endianness.  For example, if we take a LONG and save it to a file based byte stream on one system; and then attempt to read the same stream of bytes into a LONG on a system with the opposite endianness, the number will be bogus.

Back to our Calculator output for the number one-million: 0000 0000 0000 1111 0100 0010 0100 0000 which is represented in big-endian.  On a big endian system, the bytes will be stored in the same order in files:

0000 0000 
0000 1111 
0100 0010 
0100 0000 

However, on our Little-Endian PC system, the bytes are stored in the opposite order with the least significant byte first:

0100 0000 
0100 0010 
0000 1111 
0000 0000 

To think of it a different way, if we viewed our decimal number of one million in little-endian, it would look like this:  000,000,1
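
The two byte orders are easy to see with Python's `struct` module, which lets us pick the endianness explicitly:

```python
import struct

# Sketch: the same 4-byte integer written little-endian vs big-endian.
n = 1_000_000
little = struct.pack("<I", n)   # x86 order: least significant byte first
big = struct.pack(">I", n)      # big-endian / network order
print(little.hex())             # 40420f00
print(big.hex())                # 000f4240
print(little == big[::-1])      # True: same bytes, opposite order
```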

So what's the solution for genomic data?  Create a custom type that is platform agnostic but can still take advantage of native types.

SIDEBAR: The UCSC genomic 2-bit format uses a multi-byte signature to make it easy to determine whether the platform is big or little endian.  From their docs:  "If the signature value is not as given, the reader program should byte-swap the signature and check if the swapped version matches. If so, all multiple-byte entities in the file will have to be byte-swapped. This enables these binary files to be used unchanged on different architectures."

SIDEBAR 2: When working at the lowest level (the fastest) on xPU registers, endianness doesn't make a bit of difference.  A 32-bit xPU register holds 32 bits of data, and so there is no endianness at that level.  This is key to our genomic tool design (it is platform independent), which is discussed in future posts.


We talked about numbers, but what about words and letters?  

Just as we need to type (a verb) our bits into numbers, we also need to type them into letters and eventually words.  How's that work?

By convention, we use a byte of 8 bits to represent each character of the standard ASCII character set.  This is a "single byte" representation.  Therefore, the letter "T" is represented by the byte-sized decimal number 84, or in binary: 0101 0100.  In the early days, just 7 bits covered all of the characters and control codes.  However, we quickly moved to using the full 8 bits of each byte in what was/is known as extended ASCII.

KEY:  Every letter in the ASCII alphabet consumes 8-bits if we use the alpha representation.

There are several implications for us in this.  Here are two key thoughts:

Numbers as Characters

One way to "cheat" typing is to leave everything as it's ASCII representation and to convert it at run time.  Let's see how that might work for our one-million LONG number.  

As a LONG, this value will take four bytes and will thus consume 32 bits of bandwidth.

On the other hand, if we use the ASCII representation of "1000000", that is 7 bytes, or 56 bits (one byte for each character), or almost twice as large!  Then again, if we sent the number "1", this would be only one byte, so we'd use just 25% of the bandwidth of a LONG. 
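
A tiny sketch of the size comparison; `ascii_size` is a hypothetical helper, and the 4-byte figure assumes a 32-bit LONG.

```python
# Sketch: byte cost of sending a number as ASCII text vs as a fixed
# 4-byte (32-bit) integer.
def ascii_size(n):
    return len(str(n))          # one byte per ASCII digit

print(ascii_size(1_000_000))    # 7 bytes as text vs 4 as a 32-bit int
print(ascii_size(1))            # 1 byte as text vs 4 as a 32-bit int
```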

This should illustrate why the data developer should evaluate their data for storage and transmission and when to type it.

Characters as Numbers

Remember our rule:  we should store our types in the smallest container possible to reduce hardware load and processing time.  Our genome is comprised of ~3.2 billion bases, which are typically represented in ASCII format.  For DNA, these are "A", "C", "T", and "G".  OK, so we have a byte-sized container that will hold 256 values, but we only need to store four.  Why not use a smaller container?  That's the premise of the 2-bit format we introduced earlier; it's also part of the newish CRAM format.
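
A sketch of the 2-bit idea, showing four bases fitting in a single byte.  The code assignment here (A=0, C=1, G=2, T=3) is illustrative only; the actual UCSC .2bit format defines its own fixed assignment.

```python
# Sketch: pack DNA bases at 2 bits each, four bases per byte.
# (For simplicity, a trailing partial group is right-aligned in its byte.)
TWO_BIT = {"A": 0, "C": 1, "G": 2, "T": 3}   # illustrative codes

def pack_2bit(seq):
    out = bytearray()
    for i in range(0, len(seq), 4):
        byte = 0
        for base in seq[i:i + 4]:
            byte = (byte << 2) | TWO_BIT[base]
        out.append(byte)
    return bytes(out)

packed = pack_2bit("ACGT")
print(len(packed), format(packed[0], "08b"))   # 1 00011011
```

Four ASCII bytes collapse into one packed byte: a 4x reduction in storage and transfer before we even start optimizing the search.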

As I mentioned in an earlier post, the next few posts will give specifics on the work I've done to evaluate using a binary format with low level processing to see if we can improve performance; and it was a fun intellectual exercise to learn CUDA programming.