Sunday, February 20, 2011

Decoding Cerner’s Decompression

In my previous post, I introduced the rationale behind my efforts to determine how to decompress binary images and textual data held by our Cerner HIS.  This <very long> post will discuss the final library and how I produced the working code.

In my preliminary research, I had read and heard that Cerner had attempted to obfuscate or otherwise “hide” the compressed data from their customers.  I also encountered several comments where people had found that the algorithm was Lempel–Ziv–Welch (“LZW”) – a common lossless compression algorithm.  Given the wide range of implementation options available for LZW, I feared the worst.

At first, I approached the problem expecting to find some intended obfuscation; but to my surprise, there was none.  As it turned out, Cerner deploys their compression using the standard LZW algorithm making decompression readily available to their customers.  However, there are a number of deployment options that an LZW implementer can select from and so the fun challenge was to determine what path Cerner had chosen for their particular implementation.  Of course, Cerner could have made all this easy by publishing an API so we can more easily use our data….

To begin with, I used a handful of “BLOBs” from our CE_BLOB table and knowing the decompressed values, I determined what I thought was the final solution.  However, when I pulled a very large file to test, I found that I had more work to do since my initial approach did not work for the larger files (more below…).

So let’s review the basics of LZW – strictly from the decompress perspective:

  1. LZW uses a decompression table that can be created by the consumer (“client”) at run time.  Therefore, there’s no need to transfer or store the table on the client side.
  2. The table can be determined at run time; but there are a number of assumptions that must be understood by the client and the compressor.  We’ll look at some of these shortly.
  3. Any time two or more characters repeat, their decompressed values are read from the table and not from the byte stream.  This proved to be the most challenging aspect: to determine the table entries to verify the assumptions of decompression; but knowing that the table I had created may be (and frequently was) incorrect.
  4. Tables can be cleared at some predetermined time and reinitialized. 
  5. Tables can be customized with some predetermined values.

Frankly, Cerner could have been creative with #4 and/or #5 and thus make it very difficult to reconstruct the table.  But I quickly found that they did neither which made things much easier to figure out.

As it turned out, #3 was the only thing I needed to determine and because I had the encrypted BLOBs from the database, and the original text, it was just a matter of determining the variables that they chose for their implementation.

So what variables?  Let’s review a bit (computer pun) about bits and bytes in the computer world.

At the lowest level, computers store everything as ones and zeros; or to be more precise as positive or negative values (it used to be +5volts or –5volts but that may no longer be correct).

From these “bits” of information, a byte is created and a byte is comprised of 8 bits.

In the early days of computing, we could express just about everything in 127 characters and so 7 bits were adequate.  For example, the number 127 in decimal turns out to be 1111111 in binary (base 2).  Eventually, the computer world decided to use all 8 bits in the byte and so ASCII was “extended” to the full range of a byte or 256 characters.  In other words, the ASCII Lookup Table is set by convention at 256 characters and this is standard across computers.

The thing to remember is that the numbers are index positions in the lookup table.  For ASCII, this table has 256 entries starting with entry 0.  So the character at entry position 0 by convention is NULL.  Similarly, the letter X is at index location 88 decimal and is expressed in binary as 1011000.

In practice, binary data is typically stored in an array of bytes and individual bytes can be represented by a number from 0 to 255 (zero based so that’s 256 characters).  If we have a number that’s larger than 255, it will not fit into a single byte but rolls over into another byte.  So the number 256 is represented as one more than the maximum of one byte of 11111111; so 256 in binary is 1 0000 0000.  [This can be confusing but think of it in decimal:  if you have a value of 999,999 and add just 1 to it, it becomes 1,000,000].

If we don’t need compression, we would just use the standard 8 bits in each byte and we’d be set.  But for LZW (and most other) compression routines, repeating characters are represented with a new entry in the lookup table.  But dang, we just said that the ASCII table already takes up all the places possible so we must use indexes larger than 255 and so we must roll over into another byte if we need numbers greater than 255.

This is the first place that LZW allows us to make a choice from several alternatives:

  1. We can specify a fixed table size that can accommodate the typical source file size.  If we know the size and composition of the source files, this makes the most sense.  But given the nature of health data, this is not a good approach for us.
  2. We can set a fixed table size that allows us to clear the table after it fills up.  This is easy on the programmer but not optimal for good compression.
  3. The final choice is to slowly grow the bit size to expand to meet the size of the index.  For example, we know that indexes as large as 511 would fit in a 9 bit size.  And a 10 bit number size would allow us to store indexes as large as 1023.  This has the advantage of improving compression; but the problem is that we need to make assumptions about our data and to choose the bit size-break (when we move to the next larger bit sized number) and the appropriate time.

Cerner opted for #3 but that left us with the challenge of figuring out where they made these bit size-breaks; that is, when they moved from 9 bit numbers to 10 and so on.

NOTE:  most textbook implementations of LZW show the algorithm as #1 with the maximum number held in 12 bits which is a convenient byte and a half.  Therefore, a table in this convention could not exceed 4095 entries.  This is even more convenient for the programmer since the bits always break on a byte boundary; that is, “take 8 bits from the first byte and add 4 bits from the next one” and then “take the leftover 4 bits and add them to the 8 bits from the next byte”.  This is “Textbook” LZW but not very efficient in the real world.

OK, so let’s focus on the Cerner implementation because I’m sure you’re getting tired.

As we have already observed, they increase the maximum index size at set places in the byte stream (our #3 above).  Since the bit size-breaks are fixed for their implementation and are somewhat arbitrary, the only way I could find to determine this location was to take a variety of compressed BLOBs, run them through the decompression code, and find out where the table values didn’t match; in other words, the decompressed string did not match the pre-compressed string.  Fortunately, these “bit size-breaks” always occurred at the same places in all the files (meaning it was a set place and not based on a variable) and my code reflects those positions.

The concept of these byte splits (or “bit splits”) may be confusing so let me explain with an example. 

If we are in the middle of the byte stream and we have three bytes with these values in order:

0010 1111
1110 0001
0011 1101

Now, if we are using the Textbook LZW split of 12 bits, our two codes would be (note: for the sake of demonstration, we are ignoring the shift of the first byte; in practice, this must be shifted first):

001011111110  all 8 bits from the first byte and 4 bits from the second for our 12

000100111101  which is the four bits left from the second byte and all of the last byte

So in these three bytes, we have two lookup codes: 766  and 317

But these same three bytes would represent different codes if we use a different split size of say 10 bits where our two codes would be:

0010111111 which is 191

1000010011 which is 531

and we’d have the following bits left over: 1101.

When the bit split size is smaller, we have the capacity for more codes (numbers) in the same number of bytes since these numbers are smaller.  The trick is keep the bit size low enough to represent numbers (the Lookup Table index values) that are efficient but high enough to account for highly random character arrangements in the source file and thus the need for more index locations (less compression) and larger numbers.

The key point here is that split size (how many bits represent the lookup codes) is critical for things to work but somewhat arbitrarily chosen by the implementation developer.  Since I had the original text (net yet compressed which should match the decompressed string), I could walk through the code as it created the lookup table byte by byte and compare the output with the original text and once I found a discrepancy, note the location and shift up the bit size for the next sequence of index numbers accordingly.  Of course I could have written a sub-program to do this work for me; but instead I used a specially constructed pre-compressed sequence of characters (text) that made this very efficient (finding the discrepancies) and it proved quicker and easier than writing the subroutine.

For this implementation, they use 9 bit numbers for the first 285 bytes, they then use 10 bit numbers up to 925, and so on (see the code) up to 14 bit numbers beginning at byte 12061.

Since our maximum bit size is 14, our maximum Lookup Table index size is binary 11 1111 1111 1111 which in decimal is 16,383.  Therefore, our Lookup Table cannot hold more than 16,383 index locations.  If we deduct the 256 ASCII characters, our table for new index entries falls to 16,127.  After this, compression would need to be switched off for ensuing bytes.  Cerner apparently handles this by breaking files larger than roughly twice this number into subsequent BLOBs.  In other words, they assume (correctly I imagine) that most files (text or image) will have at least a 2X compression ratio at the bit level.  That is, on average at least half of the byte pairs will be repeating or better. Given the nature of the source data, this seems to be a reasonable approach for both image and text. 

The astute reader will see that there was one more problem to deal with (which is the surprise I alluded to earlier in this post).  For the Textbook LZW sample using 12 bit numbers, and with our 9 and 10 bit numbers; we will always break on a byte boundary.  In other words, we build up our code (the index number) by using 8 bits from one byte and 1 or more bits from the next byte.  In the case of a 9 bit number, our 1 bit will easily break at the 8th bit as we proceed through our incrementing shits (more on this shortly).  Similarly, our 10 and 12 bit words break at the bit boundary (since the bits we take from the second byte divide into our 8 bit byte with no remainder).  But, our 11 bit number will not break at the boundary and so we must take three bytes to make our number.  That is, we take the Leftover, the full middle byte, and the current shift from the last byte.

Another example will illustrate.

If we take a 12 bit number, we start with zero left over, take the full 8 bits from the middle byte (which really is the next byte in sequence), and then 4 bits from the right byte (again the next in sequence).  In the next iteration we have 4 bits left over from the previous iteration and take 8 bits from the middle byte and since we have 12 bits, we don’t need the new right byte.

However, if we have an 11 bit number, we again start with zero left over, take the full 8 bits from the middle byte (next in sequence), and then 3 bits from the right byte (the next in sequence).  We have 5 left over from the previous (left) byte, and take 6 from the middle byte to give us our 11 bits.  But the next iteration is where it gets interesting.  We have 2 bits leftover from our Left Byte (8 – 6), we use the full 8 bits from the middle byte, but we now need 1 bit from the right byte (the next in sequence).  The point is that for the first time, we actually use bits from three bytes and this must be handled in the code.

I found one online post that complained that Cerner had started their decompression with standard LZW; but later in the decompression, they introduced something that was not in the LZW spec.  I suspect that this three-byte-shift threw them off because I have to admit, it took some mental energy to figure out how and when they did this.

Finally, for very large files I had difficulty determining the closing character sequence at the end of the byte stream so I could stop decompression.  I suspect that this is why Cerner includes the decompressed (or pre-compressed) file size in their row data along with the BLOB.  In other words, these three byte shifts make it difficult to find the closing characters and so we need the actual files size to complete decompression correctly.

Therefore, to call the pure managed code version of the decompress, you should pass in this file size along with the BLOB.

As I mentioned before, this was a fun puzzle to solve and I hope this explains what the code is doing and answers any question you might have.  If you see a way to do this more efficiently, please leave a comment and I can modify the code if it makes sense.

The code can be downloaded and viewed from here.

Saturday, February 19, 2011

Decrypting Our Cerner “BLOB” Data

This was a “no brainer”.  I am intrigued with data compression, particularly in a medical environment (and thus why I co-authored two patents on medical image compression); and I love puzzles.  So this (and the next post) will discuss the path I took to figuring out how Cerner compresses BLOBs. If you dunno what a BinaryLargeOBject is, and just want the code, see below but please don’t ask me for support Smile

We are working on several “quality improvement” projects in my day job (isn’t every hospital?) and part of that is to mine (as in extract) the valuable medical insights in the notes --- those semi-structured repositories of information in the medical record that are so difficult to organize.

Most of our notes and reports are contained in our Transcription system and Cerner is updated via an HL7 feed from Transcription.  However, I found that our Pathology reports were transcribed directly into Cerner and those were “compressed” and pretty much non-readable.  Moving forward, we have a way to intercept this interaction; but we had years of notes that we wanted to work with that were stored in the notorious CE_BLOB table (cue the scary music).

Let me give an example of our need that hits very close to home for me.  My wife has Lymphoma and suffered through her first bone marrow biopsy a few months ago.  The doctor who did the procedure was very accomplished (thanks Bones) and successfully aspirated her marrow sample on the first try.  However, this is often not the case when “para-professionals” do the procedure such as untrained PAs or Nurse Practitioners (raise the shields Scotty) and thus a training opportunity is to find a term in the “path reports” as simple as “insufficient sample”.  Correlating this with the person who did the procedure should tell us who needs some additional training (or perhaps reassignment to Alpha Centauri?).  So the goal is to search the notes for simple phrases and link those to the clinician performing the procedure.

This was relatively simple (see previous blog entries) for the other notes; but I ran into the dreaded “Cerner Blob” --- the “Billy The Kid” of Healthcare data….a dreaded reputation that may be deserved – but probably not.

I’m an ex-softie (Microsoft alumnus) and so I wanted to do this in C# and .NET so I needed a way to call the Cerner DLL from .NET.  Once I had the function signature and found the correct library, the rest was pretty straightforward C#.

For those interested, the uar_ocf_uncompress library is in their shrccluar.dll and if you have licensed this library from Cerner, then you are welcome to use it to decompress the BLOBs.  The parameters are the same in the compress and the decompress functions so here are the .NET interop snippets:

image

[DllImport(@"C:\myCernerDirectory\shrccluar.DLL",EntryPoint = "uar_ocf_uncompress")]    
    public extern static int uar_ocf_uncompress( [In, Out, MarshalAs(UnmanagedType.LPArray, SizeParamIndex=1)] Byte[] Buffer, ref int BufSize,
        [In, Out, MarshalAs(UnmanagedType.LPArray, SizeParamIndex=1)] Byte[] Buffer2, ref int BufSize2, ref int ilen);

    [DllImport(@"C:\myCernerDirectory\shrccluar.DLL", EntryPoint = "uar_ocf_compress")]   
    public extern static int uar_ocf_compress([In, Out, MarshalAs(UnmanagedType.LPArray, SizeParamIndex = 1)] Byte[] Buffer, ref int BufSize,
       [In, Out, MarshalAs(UnmanagedType.LPArray, SizeParamIndex = 1)] Byte[] Buffer2, ref int BufSize2, ref int ilen);   

So the next step was to try to figure out how to decompress the BLOBs directly from the Oracle database.  Fortunately, folks have mentioned in various forums that the compression algorithm that Cerner uses is LZW.  In my next Blog post, I’ll talk about how I determined how they implemented the algorithm but the good news is that Cerner didn’t try to hide the data (thank you Cerner).  In other words, they did not try to make this difficult to decompress but instead simply deployed a well known Lossless compression algorithm, LZW, using standard procedures.

OK, back to the code.  I have attached the simple decompression library here that you are welcome to download, compile and use.   This is so simple, that it would be a snap to convert this code to JAVA or you could even convert it using JavaScript in a web browser. Again, complements to Cerner for not trying to do something fancy and for using standard programming constructs.

To test this, I created a simple program that utilized Microsoft LINQ to iterate through a few thousand records from the CE_BLOB table. I first decompressed them using the Cerner API, and then again using the .NET library I created.  After decompression using both APIs, I ran a string comparison.  For these sample records, the final strings all compared successfully. 

Disclaimer:  This code is not supported by me or Cerner and so you must compile and agree to use the code at your own risk.

image

That’s it for now.  In the next post, I’ll discuss the steps to develop the stand-alone solution and talk a bit more about how to use the library.

Again, here’s the C# class.