Deconstructing LZW Decompression

This post and the attached code have been updated as of 3/30/2013.  Both replace a blog post and code I wrote a couple of years ago titled “Decoding Cerner’s Decompression,” which I have left in place for archival purposes.

I wrote that code and post as part of a project to extract compressed data from the reporting copy of our hospital’s EMR database and load it into a SQL database for indexing and searching.

At that time, our needs were met using single rows of data, so the original code worked fine. However, we now need to decompress all of our medical notes, many of which hold scanned images in very large files, so I needed to modify the code to handle the larger blobs of data.
This blog post is an update to the earlier treatise, with minor changes in the code that allow us to decompress files of any size.
I initially suspected that Cerner deployed some fancy compression routine, but it turns out that they use a standard LZW sliding-window compression algorithm, which I will describe in detail below.
Let’s start with a review of the basics of LZW – strictly from the decompression perspective:
1. LZW uses a decompression table that can be created by the consumer (“client”) at run time. Therefore, there’s no need to transfer or store the table on the client side: the table is created as the compressed file is processed.
2. Although the table can be determined at run time, there are a number of variable configuration settings that must be honored by both the compression and decompression routines.  We’ll look at some of these shortly.
3. The compressed byte stream consists of indexes into the lookup table we create on the fly.  By and large, it contains very little actual decompressed data.  That is, when two or more characters repeat, their decompressed values are read from the table and not from the byte stream.
4. The compression engine handles both text and binary data and can effectively compress both.
5. The LZW compression engine is “lossless” meaning you will have bit for bit fidelity when the compressed file is decompressed.  This is a regulatory requirement for health data!
6. LZW is a very efficient algorithm and when written in native code (C/C++), it is blazingly fast.
Although there are simpler implementations of LZW, Cerner chose to deploy the most compression-aggressive form of LZW: a sliding bit-window that squeezes the lookup indexes themselves into as few bits as possible.
So let’s start at the lowest level and work up the stack to explain how all this works.
To understand LZW, it is critical to recognize that our compressed stream is basically a series of lookup indexes to locations in a table that we create on the fly. That table holds the actual decompressed data.  We benefit from compression because we are able to store multiple repeating characters in a single location of the lookup table.
In other words, once a given table entry is populated with a sequence of characters, future encounters of the same sequence of characters only need to pass a single index for that sequence instead of an index for each individual character.
For our decompression engine, we therefore need a programming construct to hold our lookup table, where each entry holds one or more bit sequences.  This is the table we populate on the fly as we decompress (much more on this later).
For our “sliding window” compression, we need another efficient construct to hold the indexes.  This is where things get tricky, so a review of how numbers are stored in computers is called for.
At the lowest level, computers store everything as zeros and ones, or “bits”.  To make our programming lives easier, these can be lumped together into sets such as bytes, which hold 8 bits.  But for our discussion, we are going to call our sets “buckets”, and these can be of different sizes.  So a computer byte, which holds 8 bits, will be an 8-bit bucket in our analogy.
An 8-bit, byte-sized bucket allows us to represent 256 different values (0 through 255 in decimal).  Therefore, we would be limited to only 256 positions in our lookup table, which is not nearly enough.
Sidebar:  How did we determine that 256 is the maximum number of combinations that fit into an 8-bit bucket, or byte?  Each position can hold only one of two possible values: a 0 or a 1.  These occur over our 8 locations, so our 256 is the result of 2 raised to the 8th power.  A 9-bit number can have 2^9 or 512 index locations, and so on.
We have a trade-off here that the programmer needs to consider.  As the table becomes larger (more indexes), it requires more bits (less compression) to hold the index values.  Therefore, the programmer should decide when the benefits of a larger table no longer merit the increased storage for the indexes into the table.  For highly repetitive and smaller files, a smaller table is preferable, and most examples of LZW use tables of 4,096 indexes.  However, for highly random and generally large medical data, a larger table is useful, and Cerner indeed chose to use a table of 8,192 indexes.
Our “sliding window” is implemented by a set of bit-buckets that grow as the size of our index grows.  Since we build our table at runtime, each new sequence simply gets the next index, one higher than the last.  If the index fits into a 9-bit bucket, we put it there; otherwise, it goes in the next size up, the 10-bit bucket.
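To make that concrete, here is a minimal C# sketch (mine, not Cerner’s and not the attached code) of the bucket-growth rule, assuming the 9-bit starting size and the 8,192-entry (13-bit) table discussed above.  Exactly when to bump the width is one of those configuration settings both sides must agree on; some implementations switch one code earlier than others.

    // Illustrative sketch only: grow the bit-bucket size as the lookup index grows.
    const int StartWidth = 9;     // first bucket size in bits
    const int MaxWidth   = 13;    // 2^13 = 8,192 table entries, the size Cerner uses

    int codeWidth = StartWidth;   // current bucket size
    int nextIndex = 0;            // next free slot in the lookup table (more on how it fills shortly)

    // ... every time a new sequence is added to the table ...
    nextIndex++;
    if (nextIndex >= (1 << codeWidth) && codeWidth < MaxWidth)
    {
        codeWidth++;              // every code from here on lives in a bucket one bit larger
    }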
We now have a rough idea of how our sliding-window lookup indexes work, so let’s talk now about the data they point to; that is, our lookup table.
We know our table has indexes which are pointers to our decompressed data elements.  Compression comes from the fact that we store repeating sequences of characters in only one index location in the table.  For example, the phrase “Hello World!  I say again, Hello World!” might have one entry in the table for “Hello World!”  Therefore, as we decompress our data, we would have one entry in our lookup table for “Hello World!” and the compressed stream would contain the associated index location twice.
This is actually how things work in an uncompressed stream too.  A series of bits is sent to a program that looks up the data in the associated lookup table.  There are many such tables in computers; the most common are the 7-bit ASCII and 8-bit extended ASCII tables.  Most LZW implementations “hard code” the first 256 positions of the lookup table with this extended ASCII table data.  If that’s all we needed, we could use an 8-bit bucket, or a byte, and we’d be set.  But to get data compression we must have more index locations.  This is why most LZW compression routines start with 9-bit buckets.
OK, we’ve agreed to populate the first 256 positions of the lookup table with the associated data from the extended ASCII table, which conveniently enough fills them up completely.  For example, position zero has the ASCII character NULL in it and position number 75 has a capital “K” in it, and so on.  Therefore, when we see a value of 75 in our compressed stream, we know that we need to add the letter “K” to our output stream.
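As a tiny C# illustration of that pre-population step (my sketch, not the attached code), with each table entry holding a sequence of bytes:

    using System.Collections.Generic;

    // Pre-populate the first 256 entries with the extended ASCII byte values.
    var table = new List<byte[]>();
    for (int i = 0; i < 256; i++)
    {
        table.Add(new[] { (byte)i });   // entry 0 = NULL, entry 75 = "K", and so on
    }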
Assume we need to decompress the value “KKKK”.  If we put “KKKK” in our lookup table at, let’s say, index position 500, all we need to send is the 9 bits for index 500, which points to “KKKK”.  When we decompress, we look up the value at position 500 and add “KKKK” to our output stream.  Every subsequent time we see a bit sequence that translates to the number 500 (decimal), we know we need to add another “KKKK” to the output stream.  We have gone from the 32 bits needed for the uncompressed “KKKK” (8 bits times 4) to just 9 bits.  Pretty good compression!
Remember, the lookup table is not stored in the compressed stream, just the indexes, so the size of the lookup table is not factored into the compression ratio.
Let’s review where we are now.  We are creating a lookup table on the fly using a bunch of indexes that are passed to us in our compressed stream.  Many of those indexes represent repeating character sequences contributing to our compression.  As our indexes grow larger, we put them into larger and larger bit-buckets.  But remember that we will hit a point where the size of the bucket is too large to be efficient.  The programmer needs to decide at what point to stop growing the bit bucket size.  Further, when the program hits the maximum bucket size, the programmer needs to decide if they want to continue to use the fully populated lookup table for further lookups; or, if they want to empty all the bit buckets and start over.
The magic of LZW is the lookup table created on the fly.  Let’s now turn and examine how the lookup table is dynamically constructed.
As we go through our decompression, we add decompressed values to our lookup table in incrementing index order.  When we encounter a sequence of characters, we first check whether we have already indexed that sequence.  If we have, we know it sits at an index position less than our current counter.  If not, we increment our index counter by one and add the new sequence to our table.
The important point here is that the sliding window of bit sizes works because the compressed stream never hands us a lookup index greater than our current index counter.  (At most, a code can reference the very entry we are about to add, the classic special case every LZW decoder has to handle.)  This is essential to the dynamic lookup table and to the way the sliding window works.
At the end of this article, I’m going to write an extended sidebar as to why all this works for any type of data and not just text; so if you are confused here, read that and jump back up here.
Let’s get specific with a new example string of “ABABAACE”.  We get our compressed stream of bits and, taking the first 9 bits (001000001), we get a value of 65.  We’ll talk about bytes and stuff in a minute.
This value is “A” from our associated table and so we add the value to our output.  Since we already have an index for “A” in our table, we don’t need to add a new lookup index.
Our next 9 bits are (001000010), or 66 decimal, which corresponds to “B” as we expect, and we add that to our output stream, which is now:  “AB”.
So far we don’t have compression since we have sent 18 bits for two 8-bit characters that would have fit into a 16-bit stream.
But now, we also have the first multiple character sequence that we need to index.  After we add the “B” to the output stream, we add a value of “AB” at our next available index location of 258.
Sidebar:  Since we are zero based, our 256 entries stop at index position 255.  Another implementation variable is where to start the first index in our lookup table.  For most textbook implementations of LZW that use a sliding-window, index position 256 is used to tell us to empty our buckets and start a new lookup table; and index position 257 is an End-of-File marker.  So our first index position is 258 and we’ll start adding our entries there.
We process the next bits in the stream and now it gets interesting.  Our next 9 bits are (100000010) or 258.  We go to our lookup table and find an entry at 258 (which we just put there) so we don’t need to add a new one.  We now add the value of “AB” from index 258 to our output stream which now contains:  “ABAB”.
At this point, our compressed stream has sent three 9-bit patterns or 27 bits of data to give us “ABAB”.  If we had used non-compression and just used the 8-bit bytes that represent “ABAB”, we would have needed four 8-bit patterns or 32 bits which is about 19% more space.  We have compression!
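Pulling the walkthrough together, here is a compact C# sketch of a decode loop built on these conventions: 9-bit starting buckets, a 13-bit (8,192-entry) table, 256 as the “start over” code, 257 as end-of-file, and new entries beginning at 258.  This is my own minimal illustration rather than the attached code; the LzwSketch and Decompress names are just placeholders, and the readCode delegate stands in for the bit-bucket reader described later in this post.

    using System;
    using System.Collections.Generic;
    using System.Linq;

    static class LzwSketch
    {
        const int ClearCode  = 256;   // empty the buckets and start a new table
        const int EndCode    = 257;   // end-of-file marker
        const int StartWidth = 9;     // first bucket size
        const int MaxWidth   = 13;    // 2^13 = 8,192 table entries

        // readCode(width) returns the next 'width'-bit code from the stream, or -1 at the end.
        public static byte[] Decompress(Func<int, int> readCode)
        {
            var table = new List<byte[]>();
            void Reset()
            {
                table.Clear();
                for (int i = 0; i < 258; i++)                 // 0-255 = extended ASCII; 256/257 reserved
                    table.Add(i < 256 ? new[] { (byte)i } : Array.Empty<byte>());
            }

            var output = new List<byte>();
            int width = StartWidth;
            byte[] previous = null;
            Reset();

            int code;
            while ((code = readCode(width)) != -1 && code != EndCode)
            {
                if (code == ClearCode) { Reset(); width = StartWidth; previous = null; continue; }

                // Either a sequence already in the table, or the one edge case where the stream
                // references the entry we are just about to add (previous + its own first byte).
                byte[] entry = code < table.Count
                    ? table[code]
                    : previous.Concat(new[] { previous[0] }).ToArray();

                output.AddRange(entry);

                if (previous != null && table.Count < (1 << MaxWidth))
                    table.Add(previous.Concat(new[] { entry[0] }).ToArray());   // new entry at the next free index

                if (table.Count >= (1 << width) && width < MaxWidth)
                    width++;                                                    // grow the bit bucket

                previous = entry;
            }
            return output.ToArray();
        }
    }

One difference from my simplified narrative above: a textbook decoder adds a new entry (the previous sequence plus the first byte of the current one) after every code it reads, so in the “ABAB” example it would also file “BA” away at index 259 even though that entry never gets used.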
Sidebar: In the attached code, I use a sequence lookup table that is a modified linked list.  This eliminates the need to do pattern matching, whether in our own code or through an abstracted hash table, both of which are costly and slow.  I find this approach substantially faster than the “textbook implementations” where the programmer is instructed to use pattern matching and lookups of some kind.
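I won’t reproduce the attached code here, but one common way to get that linked-list effect (not necessarily the exact layout I used) is to store each table entry as nothing more than the index of the entry it extends plus one trailing byte.  Expanding a code is then just a walk back up the chain; there is no string storage and no pattern matching anywhere:

    using System.Collections.Generic;

    // Each entry records only "which shorter sequence I extend" and "which byte I add".
    struct Entry
    {
        public int PrefixIndex;   // index of the sequence this entry extends; -1 for a root byte
        public byte LastByte;     // the single byte this entry appends
    }

    static class PrefixChain
    {
        // Expanding a code is a walk back up the chain, then a reverse of what we collected.
        public static List<byte> Expand(Entry[] table, int code)
        {
            var bytes = new List<byte>();
            for (int i = code; i != -1; i = table[i].PrefixIndex)
            {
                bytes.Add(table[i].LastByte);
            }
            bytes.Reverse();
            return bytes;
        }
    }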
You may now be thinking:  “Wow, this is really simple stuff.”  We use a sliding window of bits to create our lookup table and build that table on the fly with information we retrieve as we go.  What’s the big deal?
There’s one little BITty detail I neglected to mention.  Higher-level programming languages such as C#/.NET and Java do not readily support our use of bits.
In C#, the smallest bit-bucket is 8 bits; that is, a byte.  It’s even worse in JavaScript where the smallest bit-bucket is 32 bits!  Therefore, we have to employ a shifty workaround that’s every bit as complicated as our previous example is simple (Pretty good eh?  Two puns in the same sentence.  No?  OK, I’ll stick with writing code and an occasional blog).
Seriously, the solution to our problem is not really all that hard, but it does take a refresher on bit shifting and masking.  Let me hit the highlights; the reader is encouraged to read up on this stuff.  I also encourage the reader to pull out the first few bytes of their compressed stream and shift the bits on paper or in Notepad so they can see the result.  Let’s do that now for a couple of examples, one basic and easy and the other a bit more complex.
Let’s use our most recent example of “ABABAACE” and the compressed stream we are working with.
We saw previously that our bit stream (not a byte stream yet) holds the following series of 9 bit segments: 001000001 which is a decimal value of 65; 001000010 which is a decimal value of 66; and 100000010 which is a decimal of 258.
Our bit stream before we pull out the 9 bit segments looks like this: 001000001001000010100000010
When we use our C# library to read the stream, it pulls 8-bit segments as bytes, so we now get the following:  00100000 (decimal 32), 10010000 (decimal 144), and 10100000 (decimal 160), all in 8-bit buckets called bytes.  (The last 3 bits of our 27-bit stream would spill into a fourth, padded byte, but the first three bytes are enough for this walkthrough.)  However, we need our bits to live in 9-bit buckets (well, at least until we run out of 9-bit buckets and then increase our bit-bucket size).
For our program, .NET takes the bit stream and drops 8 bits into each byte-sized bucket to make three bytes.  Specifically, we end up with a byte array of three elements: 32, 144, and 160.
Since we need our bits to live in our 9-bit buckets, we have to take our first 9-bit bucket and drop all 8 bits from the first 8-bit bucket into it: 00100000
Next, we take the first bit, a “1”, from our second 8-bit bucket and put that at the end of our bit sequence in the first 9-bit bucket.  Our first 9-bit bucket now has the following bits which is the decimal 65, “A”, we were expecting:  001000001
Our second 8-bit bucket now only has 7 bits remaining which we dump into our second 9-bit bucket: 0010000.  Since our second 9-bit bucket has only 7 bits, we need two more bits to fill our bucket.  We get those by taking the first two bits from the third 8-bit bucket which starts with 10100000 in bits.  Once that’s done, we end up with the second 9-bit bucket holding:  0010000 + 10 or 001000010 which is decimal value of 66 or “B” which is what we expect.
We continue on putting 9 bits into each of the 9 bit buckets from the original 8 bit buckets until we run out of 9 bit buckets.  At that time, we take our pile of 10 bit buckets and start dumping bits in there from the 8 bit buckets until they are all filled.
In essence, we are working around the byte constructs forced on us by the .NET framework by just taking all the bits in order from the 8-bit buckets and moving them into the 9-bit buckets until they are all filled and then the 10-bit buckets and so on.  In effect, we are “shifting” the bits from a smaller sized container into a larger sized container.  Note that in C/C++, you could use a much more efficient memory pointer and just grab the number of bits you need directly from memory.
This does get a bit tricky when we have larger buckets, say 11-bit.  For example, we may have 1 bit left from an 8-bit bucket after the other bits are shifted out.  Then, we add the 8 bits from the next 8-bit bucket.  But we still need two more bits from a third 8-bit bucket.  So in this example and in our program, we need to keep track of the bits from three 8-bit buckets.
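Here is what that juggling looks like in a small C# sketch (again my illustration, not the attached code): an integer accumulator takes in whole bytes and hands back codes of whatever bucket size we ask for, most significant bit first, which matches the worked example above.

    using System;

    // MSB-first bit reader: shift whole 8-bit buckets into an accumulator and peel off
    // 'width' bits at a time, no matter how many source bytes a single code spans.
    sealed class BitReader
    {
        private readonly byte[] _data;
        private int _bytePos;     // next byte to consume
        private int _bitBuffer;   // bits accumulated so far
        private int _bitCount;    // how many of those bits are still unread

        public BitReader(byte[] data) => _data = data;

        // Returns the next 'width'-bit code, or -1 when the stream runs out.
        public int ReadCode(int width)
        {
            while (_bitCount < width)
            {
                if (_bytePos >= _data.Length) return -1;
                _bitBuffer = (_bitBuffer << 8) | _data[_bytePos++];   // dump in 8 more bits
                _bitCount += 8;
            }

            _bitCount -= width;
            return (_bitBuffer >> _bitCount) & ((1 << width) - 1);    // take the top 'width' bits
        }
    }

Feeding it the bytes from our example (32, 144, 160, plus a padded trailing byte) and asking for 9 bits at a time gives back 65, 66, and 258, exactly as worked out by hand above; paired with the decode-loop sketch earlier, the whole thing is just LzwSketch.Decompress(new BitReader(compressed).ReadCode).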
If I were teaching CompSci 101, here I would assign a homework project to write a decompression routine in JavaScript which supports bit shifting on 32-bit buckets instead of our 8-bit buckets.
Well, that’s pretty much it.  We have shown how to take a compressed stream of bits and build our decompression lookup table on the fly.  Using that table, we build our original decompressed stream back into its original state.
A cool next exercise would be to add encryption to the mix.  Do you think it would be more efficient to compress and then encrypt or to encrypt and then compress?  Really?
Extended Sidebar:
One question I frequently see come up about LZW is whether it works for binary data or just text.  At this point, the reader should know the answer.  But in case I’ve lost you, let me provide some hints.
Are we storing compressed text in our example above?
Let me give a couple of examples that may help.
Let’s say I want to use JBJ (my initials) compression.  I can compress the phrase:  “Hello World!  This is Bruce” into 1 bit.  That is, I can take the 27 bytes and corresponding 216 bits and compress them down into 1 bit.  Am I good or what?
But I may want to add my picture into the compressed stream and put that in index position two.
So now I can send my text and my picture in just two bits in my compressed stream.  The only problem is that both sides need to have the same lookup table already defined and in place.    
So does the JBJ compression routine work for text or binary?  Why?
Sidebar to the Sidebar:  This is the foundation for my patents on medical image compression which were designed over a napkin at Cheesecake Factory while conducting a job interview for my co-inventor at Microsoft (he was obviously hired, by the way, and is a very smart and accomplished programmer).
Remember we said that compressed medical images have to be lossless.  But when a given imaging device takes a medical image, much of that image holds data that’s not clinically relevant.  For example, if I take a picture of a hand, the space between the fingers is not clinically relevant and would look the same for every image from the same device.  For these images for a given modality, we store the non-clinically relevant parts of the images in a static lookup table on both ends of the process.  When we retrieve the images, we can take very large blocks of non-clinically relevant data and insert them into the final decompressed image.  Much like my bogus example above, we can get extremely efficient data compression with medical images and maintain bit-for-bit fidelity on the clinically relevant sections.
/End Sidebar to the Sidebar – back to the Sidebar
OK, I hope you see that LZW, like most compression algorithms (including the bogus JBJ), uses a lookup table to find repeating sequences of data.  It doesn’t make any difference what the data is.  As long as the length of the index for our lookup table is smaller than the data it points to, we will have compression.
To put this a different way, we store repeating sequences of bits in our lookup table, and when we process our compressed input stream, we extract those sequences of bits in order and stick them in our output bit stream.  Once that is completed, it’s up to the program that uses the output (file) to decide what translation table to apply.  For example, if the content is indeed text, the program will break the bits into bytes at 8-bit boundaries and use the ASCII table for translation.  However, if it’s an image file, it may apply different rules to organize the bits.
So what would happen if we do not initialize our lookup table with the extended ASCII character set?  Of course we’d need to change our end-of-file marker and some other things, but let’s just focus on the first few bits to design our new LZW implementation.
Let’s say that we know our original stream is always going to be comprised of alpha characters such as A, a, B, b, etc.  It’s a waste of index space to reserve locations for data we’ll never use and so we don’t want to pre-populate our table.  In our case, we start our bit-bucket size at 7 which will hold a maximum of 128 index locations.  This is ample for storing the bit patterns of all the alpha characters.
Let’s keep NULL at index location zero, so our first index location is 1.
If our example stream is our “ABABAACE” example, our first seven bits are: 100 0001 and we put that bit sequence in our index position 1.
Our next 7-bit sequence is: 100 0010 and we put that at position 2.
Our next 7-bit sequence is: 000 0011, which is decimal 3 and corresponds to index position 3, the entry for “AB” that we added on the fly after the first two characters (just as we added “AB” at index 258 in the earlier example).
Now, our compressed bit stream is 21 bits: 100000110000100000011
The key takeaway here is that we simply store a collection of bits in our lookup table.  When we decompress the input stream, our output should be identical at the bit level with the original file.  The program that creates the file needs to understand how to interpret the bits; for example: as 7-bit ASCII in this example, or as 8-bit ASCII in our previous example, or perhaps an integer such as 1,073,411.
So for our compression, we are only concerned with bits and repeating segments of bits; how those are rendered or interpreted falls to the program that reads them later.
In short, our compression works with any kind of data since it maintains bit for bit fidelity.

6 Comments

  1. Well for the Cerner example, the size of the decompressed file is stored in the database and so that makes it easy since the size comes with the raw data.

    If the size is unavailable to you, you can set it to a value that is large enough to hold whatever you are decompressing and then trim it down later. Or, you might take the compressed buffer size, and allocate based on some reasonable multiple of that.

  2. Thanks, I'm coming in a bit second hand to this blob table with not much depth to my Cerner knowledge. I see a length column which appears to be length of the compressed blob and a checksum which I suppose could be the length of the uncompressed blob. I'll check it out and see otherwise just over allocate and take the performance hit. Thanks!

  3. I'm working on an end-to-end document on how to process blobs out of the Oracle database and hope to have it posted up here and on Ucern in the next week or so. In the meantime, feel free to send me an email if you have questions and I'll attempt to assist. My email is in my profile for the blog. Either way, good luck.

  4. Hi Bruce, Your article is interesting. But are you aware of an easy method for extracting the compressed data?
    There are some assumptions on my part – I'm assuming that the compressed data is text of some sort and can be expressed as text once I've decompressed the data. The table I've been asked to decode for an analyst is named CE_BLOB.
    Any help is much appreciated. There isn't much out there for help in this area. Cerner support has offered up some c code, which I have yet to see.
    Additionally, it appears that cerner is tacking on a text string of "ocf_blob" onto the already compressed data. Which I'm a little confused about why they would have done that. Any insight will be much appreciated.
    Thanks in advance.

  5. Hi Kevin,

    Yeah, I have to be a bit careful because Cerner will claim I'm releasing intellectual property. If you post me a message on their internal support site, Ucern; or, send me an email (address in my profile here), I'll send you a document I wrote that explains how to do all that. And I have an end-to-end build solution that I'd be happy to share that includes how to do full text search integrated with Snomed on the documents once you have 'em.

    But to address at a high level the specific question you asked, you need to remove the text string from the end of each row except the last one (I just take 32768 minus the 9 bytes of text, or 32759 bytes from the binary), concatenate them together with the full text in the last row and hand them off to the decompressor. Since this isn't doing any text compares for the ocf_blob string, it's much faster than their UAR or the code you reference. Takes us about 48 hours to extract ~14 million notes, decompress them, and store them in SQL Server's new File Table.

    One gotcha: our early images were all pretty small, less than 1000 bytes. I quickly filled up a 4TB partition after a couple of hours until I realized I had left the cluster size at the default of 64K. In essence, I was taking 64 times as much storage as I needed. So, I dropped the cluster size to 1K and now all documents fit in about 200 GB of storage.

    Good luck.
