Sunday, December 22, 2013

XPS to TIFF

In the last post, I described how to convert a PDF into XPS as an intermediate step to split out and convert the individual pages to TIFF format.  Our goal is to combine the single-page TIFF files back into their original form, where each multi-page document is held in its own single file.  We have to do this because the health providers insist on sending us our health records combined into a single PDF file.

This post is geek-speak.  If you don’t know C# and .NET, forget it.  Just skip to the bottom and pull down the tool if you want.

The first step is to open the XPS file and to print the individual pages as single page TIFF files. 

We create a new XpsDocument object to hold our XPS file.  Next, we get the number of pages and walk the stack of pages one at a time and save each one as a new TIFF file.

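Here is a minimal sketch of that loop (my own reconstruction from the description in this post, not the tool’s exact code).  It assumes references to the WPF/XPS assemblies (ReachFramework, PresentationCore, PresentationFramework, WindowsBase) and should typically be called from an STA thread:

using System.IO;
using System.Windows.Documents;
using System.Windows.Media;
using System.Windows.Media.Imaging;
using System.Windows.Xps.Packaging;

class XpsToTiff
{
    // Render every page of an XPS file to its own single-page TIFF file.
    public static void SavePages(string xpsPath, string outputFolder, double dpi)
    {
        using (XpsDocument xps = new XpsDocument(xpsPath, FileAccess.Read))
        {
            DocumentPaginator paginator = xps.GetFixedDocumentSequence().DocumentPaginator;

            for (int i = 0; i < paginator.PageCount; i++)
            {
                using (DocumentPage page = paginator.GetPage(i))
                {
                    // Render the page visual to a bitmap at the requested DPI (page sizes are in 1/96 inch units).
                    var bitmap = new RenderTargetBitmap(
                        (int)(page.Size.Width * dpi / 96.0),
                        (int)(page.Size.Height * dpi / 96.0),
                        dpi, dpi, PixelFormats.Pbgra32);
                    bitmap.Render(page.Visual);

                    // Save the rendered page as a one-page TIFF.
                    var encoder = new TiffBitmapEncoder();
                    encoder.Frames.Add(BitmapFrame.Create(bitmap));
                    string fileName = Path.Combine(outputFolder, string.Format("page_{0:D4}.tif", i + 1));
                    using (FileStream fs = new FileStream(fileName, FileMode.Create))
                    {
                        encoder.Save(fs);
                    }
                }
            }
        }
    }
}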

Now, the toolset becomes pretty crude.  For my sample PDF file of 877 pages, I should now have 877 TIFF files in a directory, and I can use the Windows File Explorer and an image viewer to sort, delete, and combine the files.  I find that the Windows Photo Viewer application works well.

The tool (download source or executables below) allows me to drag in the pages that should be combined into a single TIFF file.  It then (optionally) deletes the original files.

Therefore, this process may look something like:
  a) Walk the files using Explorer and the Windows Photo Viewer and delete the files we don’t want to keep.
  b) When we find a file we want, walk the individual TIFF files until we see the last page we want to combine into a single saved TIFF file.
  c) Grab all of those individual TIFF files and drop them on the application below.
  d) Make sure the pages are in the correct order; re-sort them if they are not in the order you want them saved.
  e) Save the pages into a new single TIFF file that contains one or more pages.
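Under the hood, the combine step boils down to appending TIFF frames to one encoder.  Here is a minimal sketch (my illustration, not the actual DocDoc code) that merges a set of single-page TIFF files into one multi-page TIFF using WPF’s TiffBitmapEncoder:

using System.IO;
using System.Windows.Media.Imaging;

class TiffCombiner
{
    // Merge single-page TIFF files, in the order given, into one multi-page TIFF.
    public static void Combine(string[] pageFiles, string outputFile)
    {
        var encoder = new TiffBitmapEncoder();

        foreach (string pageFile in pageFiles)
        {
            // Decode each single-page TIFF and append its first frame to the output.
            using (FileStream fs = new FileStream(pageFile, FileMode.Open, FileAccess.Read))
            {
                var decoder = new TiffBitmapDecoder(fs,
                    BitmapCreateOptions.PreservePixelFormat, BitmapCacheOption.OnLoad);
                encoder.Frames.Add(decoder.Frames[0]);
            }
        }

        using (FileStream output = new FileStream(outputFile, FileMode.Create))
        {
            encoder.Save(output);
        }
    }
}

The order of pageFiles becomes the page order of the output file, which is why step d) above matters.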

As I say, this process is a bit rough and the code and tool are not “production” applications; but I post them here as a reference for someone else who may want to use the logic and improve the functionality to meet their needs.

Of course, all of this would be much simpler if the health care institutions would comply with the federal regs (45 C.F.R. § 164.524) where it says: 

(i) The covered entity must provide the individual with access to the protected health information in the form or format requested by the individual, if it is readily producible in such form or format; or, if not, in a readable hard copy form or such other form or format as agreed to by the covered entity and the individual.

Although I have requested my records in a high-quality, multi-page TIFF format (or at least a single TIFF file), this has NEVER happened; they always come in a very low-quality PDF.  Nor has the health care institution ever contacted me to see if we could “agree” on the format as these regulations require.  Hopefully, the magical “blue button” will fix all this; but in the meantime, here’s the full (and still a bit rough) tool set.

The “Document Doctor” or DocDoc – source files here; object (executables) here.

Saturday, December 21, 2013

Converting the PDF to XPS

Last time we removed the password from the PDF supplied by our Provider, and now we have an 877-page PDF file that contains dozens of multi-page documents from my medical record.  Inside this one large PDF, I may have four pages of labs, a two-page EKG, four pages of physician notes, etc.  I want these discrete documents kept together as individual files and not munged together in a giant PDF.  Once they are saved back into individual files containing the relevant pages of a procedure, such as lab results, it will be easier for me and the indexer to find and organize the documents.

One problem with dealing with PDF-formatted files is that it is difficult to convert PDF to other formats, at least using Open Source tools and utilities.  Therefore, we need a way to translate the PDF into a format we can use.  It turns out that on the Windows platform, we have a nice tool available that allows us to process the documents:  the XPS printer driver.  Once we convert the PDF to XPS, we can use code to further process the file into individual documents.

If you haven’t done so, install and configure the XPS printer driver.  Use the free Adobe Reader and open the large PDF.  When you press the print menu item in Adobe, select the XPS printer – it will ask you where you want to create the file.  The point here is that the XPS driver writes to a file and not a printer.

We need to retain the quality as much as possible.  Unfortunately, many providers print to the PDF in very poor quality, so we will attempt to clean things up so that our OCR engine has a better chance of indexing the documents later.

For our current task, I suggest you use the grayscale setting if you are not printing color.


Once you press the “Print” button, a new dialog will pop up to ask you where you want to save the new XPS document.  You will find that it is very similar to the PDF in both size and quality.  The only thing we have gained is that the file is now in a format we can work with.

In the next post, we will detail another tool which will open the new XPS file and print out each page as a separate high-quality TIFF file.  After that, we can use another tool to combine the individual TIFF pages back into a single multiple-page TIFF file which can be indexed and viewed.  Almost there then.

Friday, December 20, 2013

Remove (or Add) Password Protection to PDF

As I mentioned in the previous post, I created several tools to assist with the processing of my Personal Health Record.  The first tool I wrote will unlock a PDF file.  Locked PDF seems to be how most Providers send along copies of the health record.  In other words, when I request my medical records, they come in one large multi-page PDF file that is password protected. To make processing easier, I want to remove the password.
To make this work, I use the PDFsharp library which is an Open Source toolkit that greatly simplifies everything.
I have created two projects:  one to unlock a PDF file and the other to lock the PDF.
The source code is available here.  If you want the executables, they are here.
To illustrate just how great and easy the PDFsharp library is, here are the three lines of code that unlock the PDF document:
PdfDocument document = PdfReader.Open(fileName, pw, PdfDocumentOpenMode.Modify);
document.SecuritySettings.DocumentSecurityLevel = PdfSharp.Pdf.Security.PdfDocumentSecurityLevel.None;
document.Save(fileName);
The “fileName” variable is the path and name to the PDF file and the “pw” variable is the password for the document we want to clear.
That’s it.  Three simple lines of code to clear the password.  The “lock” functions are just as easy.  Mark another win for Open Source software.
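For completeness, here is a minimal sketch of the lock side (my illustration using PDFsharp’s SecuritySettings; the ownerPw variable is my assumption for a variable holding the owner password):

PdfDocument document = PdfReader.Open(fileName, PdfDocumentOpenMode.Modify);
document.SecuritySettings.UserPassword = pw;        // password required to open the document
document.SecuritySettings.OwnerPassword = ownerPw;  // password required to change permissions
document.Save(fileName);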
Once we have the document unlocked, we find that we may have a few hundred pages of medical records.  Next time we’ll talk about how to break those down into manageable and indexable documents so we can use them.

Sunday, April 28, 2013

SQL File Table–Technical Details

In my previous post, I provided a step-by-step on how to set up SQL Server to crawl and index large files of semi-structured data in either text-based formats (Word documents, PDF files, HTML files, etc.); or as scanned documents in formats such as TIFF.

This post will attempt to provide a bit more technical background as to why some of the configuration steps are needed and more technical details in general.

One thing to keep in mind with SQL File Tables (“SFT”) is that they do not support all of the typical ways in which you can access or manipulate files.  I like to think of the feature as a file system emulator that creates and exposes a virtual file system possessing most of the best parts of the real file system; in fact, I will call this the SQL File System in this post.

Let’s use one of my favorite features of the Windows file system, the Encrypting File System (“EFS”), to illustrate how SFT can leverage and extend the underlying file system.  EFS was designed to be used by a single user to encrypt one or more files or folders.  Unfortunately, EFS only allows the user to share the unencrypted file with other users at the file, and not the folder, level.  This is a non-starter for us when we want to store and share millions of files.

As we would expect, SFT technology uses the SQL Server service account to read and write the encrypted folder, and that account is the only one that can decrypt the files.  In other words, other accounts (users) can get to the files from the file system using the local drive letter; however, they will not be able to view the content since they will not be able to decrypt the files.

However, SFT and the SQL File System extend EFS functionality by providing the ability to access the encrypted files, which are decrypted through SQL Server using SQL user permissions.  In other words, access to the encrypted files and folders is handled through SQL and not the file system, and unlike the native EFS in the OS, we can now grant permissions to the decrypted content to a host of users through SQL security.  In my view, this is a huge improvement and benefit!

In a future post, I plan to provide some details on how this might work with a web-based viewer for the files, and how to provide auditing to comply with health privacy regulations; but back now to the goal of this post: to discuss some of the technical details of SFT.

First of all, if you failed to enable FILESTREAM in SQL when you did the install, you can now enable it without uninstalling and reinstalling as was required before.  To do this, open the SQL Server Configuration Manager.  Expand SQL Server Services in the left panel; in the right panel, right-click on the SQL Server service and select Properties.  You will see a FILESTREAM tab; click it and enable the options there.  You next need to enable it at the database instance level.  Open SSMS, right-click on the server instance in the left window, and select Properties.  Go to Advanced and, under Filestream, set the Access Level to “Full Access Enabled”.

When you create your database, you need to enable FileStream support for the db.  There are a few moving parts:  you should create a folder on the file system to hold your stream data, you need to create a file group, you then add a new file to the filegroup, and finally you need to provision a FileStream directory name for this database and the “non-transacted” access to the directory.  This is rather convoluted to do in SSMS but it’s a good practice so you understand all the parts.  However, if you’d rather script it, it might look something like:

CREATE DATABASE [TestFS]
CONTAINMENT = NONE  ON  PRIMARY
( NAME = N'TestFS', FILENAME = N'C:\Program Files\Microsoft SQL Server\MSSQL11.MSSQLSERVER\MSSQL\DATA\TestFS.mdf' , SIZE = 5120KB , MAXSIZE = UNLIMITED, FILEGROWTH = 1024KB ),
FILEGROUP [FileStreamGroup] CONTAINS FILESTREAM  DEFAULT
( NAME = N'TestFSFileGroup', FILENAME = N'C:\FS\TestFSFileGroup' , MAXSIZE = UNLIMITED)
LOG ON  ( NAME = N'TestFS_log', FILENAME = N'C:\Program Files\Microsoft SQL Server\MSSQL11.MSSQLSERVER\MSSQL\DATA\TestFS_log.ldf' , SIZE = 1024KB , MAXSIZE = 2048GB , FILEGROWTH = 10%)
GO

ALTER DATABASE [TestFS] SET FILESTREAM( NON_TRANSACTED_ACCESS = FULL, DIRECTORY_NAME = N'FileStreamDir' )

We have a bunch of directory names floating around and it becomes confusing as to where they are all provisioned and what their purposes are so let me help clarify.  Remember that there are two distinct file systems at work here that need to operate in concert:  the regular OS based file and share system; and the SQL file system.  In some areas, they do NOT overlap so it’s important to understand each:

We create an OS folder called “FS”.  This is recognized by the file system but has no part in our SFT file path structure.  In fact, once you get things configured, you should forget it even exists and leave it and everything in it alone!

When you created your database, you created a new filegroup and added to that a new “file” hosted in the filegroup. 


That “file” is really the folder name nested under the OS folder.  Therefore, since the logical name of our file is “TestFSFileGroup”, our OS folder (the FILENAME in the script above) is:  C:\FS\TestFSFileGroup.  If you decide to compress or encrypt your data, you would configure that at the OS level here.

OK, we are done now with the OS based file system.  You can now forget it exists because everything you do after this, you will do using the SQL File system.

When you installed or enabled FileStream for the SQL Server, you specified the share name.  The default is “mssqlserver” and we’ll use that.  However, for our purposes, we will never use OS tools to work with this share.  If our server name is “MyServer”, we will have a Windows share called “\\MyServer\mssqlserver”, but this is a SQL Server share and our interaction with it is through SQL.

When we created our database above, we created a directory name called “FileStreamDir”.  Now our SQL path for this database is:  \\MyServer\mssqlserver\FileStreamDir

Let’s say we create a new File Table called “MyMedicalRecord” using the following SQL script:

Create Table MyMedicalRecord as FileTable

The path to this table’s data is: \\MyServer\mssqlserver\FileStreamDir\MyMedicalRecord

If you right-click over the table name in SSMS and select the “Explore File Table Directory”, you will see that this is in fact the share or folder that comes up.
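If you would rather not hard-code that UNC path, SQL Server will tell it to you.  Here is a minimal C# sketch (the connection string is a placeholder) that asks for the root path of the FileTable with the FileTableRootPath() function:

using System;
using System.Data.SqlClient;

class GetFileTablePath
{
    static void Main()
    {
        using (var conn = new SqlConnection("Server=MyServer;Database=TestFS;Integrated Security=true"))
        using (var cmd = new SqlCommand("SELECT FileTableRootPath('dbo.MyMedicalRecord')", conn))
        {
            conn.Open();
            // Returns something like \\MyServer\mssqlserver\FileStreamDir\MyMedicalRecord
            string root = (string)cmd.ExecuteScalar();
            Console.WriteLine(root);
        }
    }
}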

Even though this folder shows up in File Explorer, you are NOT going through the OS file system to reach the folder.  You are using the SQL file system instead.  But you can use File Explorer as a tool to manipulate the data in SQL.  When you drag and drop files to this folder, they magically show up as rows in the SQL table.  You can even use File Explorer to add new folders and to arrange the files in the folder.  

For example, if you create a new folder called “NewFolder” using File Explorer to your SQL path, you can use the new path to reach that folder with: \\MyServer\mssqlserver\FileStreamDir\MyMedicalRecord\NewFolder

However, if you go to your OS mounted folder (I’ll let you do that this one time), and look under the FS folder, you will see a folder with a GUID as a name (that relates to the database folder name in SQL), and under that another folder with a GUID (that relates to the table folder name), but you won’t find a sub-folder for our “NewFolder” directory since this is totally managed by the SQL file system.

One final point, which I hope to elaborate on more fully in a future post: you manage access to these files and folders at the SQL level and NOT at the OS level!

So drop a few large files into the folder and run a SELECT * against the table in SSMS.  You can see that it takes a long time to get the data back.  That’s because the query is returning the full blob of data from the file system.  You can speed things up by removing the file_stream column from the query.

This illustrates that returning the data through SQL can take quite a bit of time.  That’s the primary reason that we were seeing the 0x80040e97 and 0x8004fd02 errors in the SQL logs and that we needed to increase the “ft_timeout” values for the engine (see previous post).

The point here, though, is that you can use standard SQL queries to retrieve the varbinary(max) blob through SQL from the file system; or, you can use standard Win32 file access APIs to do the same thing through the SQL File System share.  Very cool.
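To make that concrete, here is a minimal sketch of both routes (the server, database, table, and local file paths are the example values used in this post, so treat them as placeholders): the Win32 route copies a file to the share, and the SQL route lists the rows while skipping the blob column to keep the query fast.

using System;
using System.Data.SqlClient;
using System.IO;

class FileTableAccess
{
    const string SharePath = @"\\MyServer\mssqlserver\FileStreamDir\MyMedicalRecord";

    static void Main()
    {
        // Win32 route: drop a document into the FileTable share; it shows up as a new row.
        File.Copy(@"C:\Scans\example.tif", Path.Combine(SharePath, "example.tif"), true);

        // Folders can be created over the share too, just as with File Explorer.
        Directory.CreateDirectory(Path.Combine(SharePath, "NewFolder"));

        // SQL route: read the rows back without the file_stream blob column.
        using (var conn = new SqlConnection("Server=MyServer;Database=TestFS;Integrated Security=true"))
        using (var cmd = new SqlCommand("SELECT name, file_type, cached_file_size FROM MyMedicalRecord", conn))
        {
            conn.Open();
            using (SqlDataReader reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                {
                    Console.WriteLine("{0} ({1} bytes)", reader["name"], reader["cached_file_size"]);
                }
            }
        }
    }
}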

One final “gotcha” to be careful with.  The standard setup for SQL is to put your databases on a partition that has been formatted with a 64K cluster size.  However, we are storing files on the OS, and this guidance does NOT apply to our file-based storage.  For our medical notes, the vast majority of our files are far smaller than 64K, so each small file will still consume a full 64K cluster – orders of magnitude more than we need.  Therefore, you should determine the average size of the files you store and set your cluster size accordingly.

SQL File Table: Step-by-Step Installation and Configuration for Documents

I’m finalizing a project at work where we will index several million health notes going back about fifteen years.  As an adjunct to that project, I was interested to explore how we might index scanned documents and my previous post describes how to do that using SQL Server and the TIFF IFilter included in recent Microsoft OS releases.

This all worked well and fine but I became intrigued with the new SQL File Table technology since it could address a number of challenges.  Here are some of the reasons I like the technology:

  1. As I have blogged before, it is important to encrypt health data that contains Personal Health Information (“PHI”).  SQL Enterprise allows us to encrypt the database and logs and does a good job; but it’s expensive.  With SQL File Tables, we don’t store the data (and PHI) in the database but as files on the file system.  SQL File Tables will allow us to store the files in a compressed directory; or for health data, in an encrypted directory using EFS. Now, we can encrypt our semi-structured data held in medical notes without incurring the expense of SQL Enterprise.  And… see next
  2. The freeware version of SQL Server, the “Express Edition with Advanced Services” includes File Table support and the data stored on the file system does NOT count toward the 10 GB DB limit.  Cool.
  3. It is super simple to load and read documents from the File Table.  In fact, documents can be dragged and dropped into a folder.  If EFS has been utilized, the contents of the files will be secured – even from highly privileged administrators who might have access to the file system.

There are a couple of drawbacks however.  First of all, there is a bunch of disk I/O and possible contention if these are stored on the same spindles as the rest of the database or logs.  When we add full-text indexing to all this I/O, we must make some configuration tweaks or we won’t be successful.

So the following is a step-by-step with minimal explanation.  In the next post, I’ll dig deeper into some of the technical aspects of these steps.  For the following, I used the Developer Edition of SQL 2012.

  1. Install SQL Server and select Full-text and Semantic Extractions for Search
  2. At the “Database Engine Configuration” screen, click the FileStream tab and select all the checkboxes (note, in the next post, we’ll talk about how to configure this on an existing instance of SQL that did not have this enabled on installation)
  3. You should optimize the file system for the I/O load.  Refer to the MS documents for this; but I suggest you disable 8.3 with: FSUTIL BEHAVIOR SET DISABLE8DOT3 1   and disable last access with: FSUTIL BEHAVIOR SET DISABLELASTACCESS 1  (requires a reboot).
  4. Create a folder for your FileStream and make sure indexing is not enabled for that folder since SQL will handle that for us. 
  5. Create your  new database with a FileStream  filegroup that points to the folder created in step #4 and add a new File to the filegroup; and set the Filestream property to non-transacted-access.  Again, see the next post for details on this and the other steps.
  6. At any point up to now, you will want to configure the filters for the OS. See the previous post for more information.  Note, you should read the Adobe release information if you want to crawl and index .pdf files; especially the section on setting the path.  Similarly, if you want to crawl multi-page TIFF files, you should make the appropriate group policy change to enable that as per the previous post.
  7. There is one key problem that we need to address before we add our File Table and enable Full-Text Search.  As mentioned before, there is quite a bit of file contention with the File Tables, and this is exacerbated when we try to index new documents. In fact, I have found that without these tweaks, the service can’t get past the “Error 0x80040e97” reported in the SQL logs and the full-text engine won’t index all of the documents.  Another error you might see for the same reason is:  “Error 0x8004fd02: The filter daemon MSFTEFD failed to load an IFilter”.  Therefore, we need to increase the timeout by setting the “ft_timeout” value for the service with: Exec sp_fulltext_service 'ft_timeout', 600000;
  8. So assuming you have installed and configured the various IFilters and you have not created your table yet, from a SQL Query window, execute the following:
  9. EXEC sp_fulltext_service @action='load_os_resources', @value=1;
    EXEC sp_fulltext_service 'verify_signature', 0;
    EXEC sp_fulltext_service 'update_languages';

    Exec sp_fulltext_service 'ft_timeout', 600000; -- ten minutes
    Exec sp_fulltext_service 'ism_size', @value=16; -- the max

    EXEC sp_fulltext_service 'restart_all_fdhosts';
    EXEC sp_help_fulltext_system_components 'filter';
    reconfigure with override
  10. You can now create your FileTable.  For example:  Create Table MyNotesTable as FileTable
  11. Now the fun can begin.  Right-click over your new filetable and select the Explore File Table Directory to open a window to the FileTable directory. 
  12. Drag two or three documents into the empty directory.  TIP:  you might want to install Search and index a test folder where you can crawl your test documents with Windows Search.  Since you are using the same filters as the Operating System, you can verify that the words you use to query are found by the wordbreaker. For example, I tried to use a rather dirty PDF of a scanned document and the IFilter was unable to recognize several words from the document and Windows Search let me easily discover the problem.
  13. Go back to your table and Select Top 1000 Rows in SSMS to verify that your documents made it into the table.  Cool eh?
  14. Now, let’s crawl them.  Enable Full Text search on this table.  Use the default PK index for the unique index, the file_stream as your text content, and the file_type as the Type_Column. 
  15. Give the crawler a few minutes to complete the crawl.  In the “Select Top 1000 …” query, add a WHERE clause such as:  WHERE CONTAINS(file_stream, 'mywordtosearchfor').  You should see a list of matching documents that contain the word or phrase (a small C# version of the same query follows this list).
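Here is a minimal C# sketch of that same full-text query (the connection string and database name are placeholders; the table and column names follow the steps above):

using System;
using System.Data.SqlClient;

class SearchNotes
{
    static void Main()
    {
        const string sql = "SELECT name FROM MyNotesTable WHERE CONTAINS(file_stream, 'mywordtosearchfor')";

        using (var conn = new SqlConnection("Server=.;Database=TestFS;Integrated Security=true"))
        using (var cmd = new SqlCommand(sql, conn))
        {
            conn.Open();
            using (SqlDataReader reader = cmd.ExecuteReader())
            {
                // Each row is a document in the FileTable that contains the search term.
                while (reader.Read())
                {
                    Console.WriteLine(reader["name"]);
                }
            }
        }
    }
}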

In review, the new SQL File Table is a great way to load and retrieve documents into SQL Server so that they can leverage the power of SQL’s Full-Text-Search.  With this technology, we can crawl and index dozens of different file types including scanned images and faxes and enjoy the benefits of either file compression or file encryption.

In the next post, I’ll dig a bit deeper into some of the technical aspects of these steps.

Saturday, April 20, 2013

Indexing Scanned Documents

Frequently in Health IT, we battle our own “urban legends” regarding how difficult some solution might be.  In this post, I’ll attempt to dispel one of those: that it’s difficult or impossible to index scanned documents.

First, some technical background.  A scanned document is simply an image of a document that holds text.  From that image we need to identify and isolate the alpha-numeric characters and index those.  Sounds simple, but not so.

To isolate the textual content in the image, the program must first determine the type and size of font used for the characters.  Using those data, it then scans the image piece by piece to find a match.

Now you can imagine some of the challenges of making this work.  If the document is slightly tilted, then it would be difficult to match the fonts.  Or, if the image quality is bad or there’s a smudge on the image right over the text, the engine would have difficulty in reading it.

Fortunately, there are a number of vendors who have written programs to handle this “optical character recognition” or “OCR” so we can extract the text we need.  Unfortunately, most are very expensive.

However, a little-known fact is that most modern Windows systems come with an OCR engine built in: the Windows TIFF IFilter, which recognizes text in TIFF images.

TIFF is an image format with lossless compression that is used in most digital fax systems and across most medical images, because the regulators mandate that medical images must be lossless; that is, have bit-to-bit fidelity before and after compression.

So let’s assume that you have scanned in some TIFF images and you want to offer them in your data mining tool for health data.  The following post will describe how to configure a Windows system and SQL Server to do just that. 

For our example, we’ll use Windows 2008 R2 (the Windows 7 kernel) and enable the TIFF IFilter, which is not enabled by default.  However, this works for Windows 7 too.

On the Server 2008 R2 machine, open the Server Manager and click on the Features node.  Verify that Windows TIFF IFILTER is not yet enabled.  To enable it, click on the right:  Add Features  and scroll to the bottom of the list and enable the Windows TIFF IFILTER.  You will also need the .NET Framework 3.5.1 Features for SQL 2012 so add them too while you’re at it.

If you are using Win7, Click Start, then Control Panel, then Programs, and then Turn Windows features on or off.   Then select the Windows TIFF IFilter checkbox.

For production, you will want a full version of SQL Server since you will probably want to store more than the Express (free) version’s limit of 10 GB of data; but for this exercise, you can use the Advanced Version of SQL Server Express since it comes with the Full Text Search we’ll need.

If you plan to import and index other common document types, you should also install the Office Filter Pack and the latest service pack.  I suggest you install the Adobe 64bit IFilter while you’re at it.

So let’s see what type of filters SQL recognizes.  From a SQL Query Window, execute:

exec sp_help_fulltext_system_components 'filter'

For a clean SQL installation, this will return around 50 rows and types.  Let’s let SQL know about the new filters we just added.  Run the following commands:

EXEC sp_fulltext_service @action='load_os_resources', @value=1;
EXEC sp_fulltext_service 'update_languages';
EXEC sp_fulltext_service 'restart_all_fdhosts';
exec sp_help_fulltext_system_components 'filter'

Assuming you added all the filters we just mentioned, you should get a list of around 166 filters.

There’s one more little trick you need to do before all this works as expected.  By default, the Microsoft IFilter will only make a cursory attempt to OCR a document.  However, if the image quality is poor, or the document has multiple pages, the filter won’t do the job until we force it.

Open the Group Policy Editor by keying in from a command window:  gpedit.msc. We need to find the OCR settings, but they are placed differently if you have Search installed, which is the default in Win7.  For our Server R2 config, we look under Computer Configuration – Administrative Templates and we find OCR.  However, if Search is installed it’s located at Computer Configuration – Administrative Templates – Windows Components – Search – OCR.

In either case, find the Force TIFF IFilter to perform OCR for every page in a TIFF document setting and enable it.


You have now configured your system to enable the SQL Server Full Text engine to crawl and index scanned images.  All that’s left to do is to pull the images into SQL, enable Full Text Search for the database and table and you can then easily find documents with a given term.
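As a quick preview, here is a minimal sketch of pulling a scanned TIFF into a varbinary(max) column so the full-text engine can crawl it.  The table name, column names, connection string, and file path are placeholders I made up; the only requirements are a varbinary(max) column for the content and a type column holding the file extension:

using System.Data;
using System.Data.SqlClient;
using System.IO;

class LoadScannedDocument
{
    static void Main()
    {
        byte[] bytes = File.ReadAllBytes(@"C:\Scans\example.tif");

        using (var conn = new SqlConnection("Server=.;Database=MyHealthDb;Integrated Security=true"))
        using (var cmd = new SqlCommand(
            "INSERT INTO ScannedDocs (FileName, Ext, Content) VALUES (@name, @ext, @content)", conn))
        {
            cmd.Parameters.AddWithValue("@name", "example.tif");
            cmd.Parameters.AddWithValue("@ext", ".tiff");                          // the full-text Type_Column
            cmd.Parameters.Add("@content", SqlDbType.VarBinary, -1).Value = bytes;  // -1 = varbinary(max)
            conn.Open();
            cmd.ExecuteNonQuery();
        }
    }
}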

A future blog post (and ebook) will describe a step-by-step on how to do that and provide sample code.

Saturday, March 30, 2013

Deconstructing LZW Decompression

This post and the attached code have been updated as of 3/30/2013.  Both replace a blog and code I wrote a couple of years ago titled: “Decoding Cerner’s Decompression”.

I wrote that code and blog as part of a project to extract compressed data from the reporting copy of our hospital’s EMR and database to put into a SQL Database for indexing and searching.
At that time, our needs were met using single rows of data and so the original code worked fine. However, we now need to decompress all of our medical notes, many of which hold scanned images in very large files, so I needed to modify the code to handle the larger blobs of data.
This blog post is an update to the earlier treatise, with minor changes in the code that allow us to decompress files of any size.
I initially suspected that Cerner deployed some fancy compression routine; but it turns out that they use a standard LZW sliding window compression algorithm which I will describe in detail below.
Let’s start with a review of the basics of LZW – strictly from the decompression perspective:
1. LZW uses a decompression table that can be created by the consumer (“client”) at run time. Therefore, there’s no need to transfer or store the table on the client side:  the table is created as the compressed file is processed.
2. Although the table can be determined at run time; there are a number of variable configuration settings that must be honored by both the compression and decompression routines.  We’ll look at some of these shortly.
3. The byte stream is comprised of indexes for the lookup table we create on the fly.  By-and-large these contain very little actual decompressed data.  That is, when two or more characters repeat, their decompressed values are read from the table and not from the byte stream.
4. The compression engine handles both text and binary data and can effectively compress both.
5. The LZW compression engine is “lossless” meaning you will have bit for bit fidelity when the compressed file is decompressed.  This is a regulatory requirement for health data!
6. LZW is a very efficient algorithm and when written in native code (C/C++), it is blazingly fast.
Although there are simpler implementations of LZW, Cerner chose to deploy the most compression-aggressive form of LZW: a sliding bit-window that keeps the lookup indexes themselves as small as possible.
So let’s start at the lowest level and work up the stack to explain how all this works.
To understand LZW, it is critical to recognize that our compressed stream is basically a series of lookup indexes to locations in a table that we create on the fly.  That table holds the actual decompressed data.  We benefit from compression because we are able to store multiple repeating characters in a single location of the lookup table.
In other words, once a given table entry is populated with a sequence of characters, future encounters of the same sequence of characters only need to pass a single index for that sequence instead of an index for each individual character.
For our decompression engine, we therefore need a programming construct to hold our lookup table and one or more bit sequences.  This will be the lookup table we populate on the fly as we decompress (much more later).
For our “sliding window” compression, we need another efficient construct to hold the indexes.  This is where things get tricky, so a review of how numbers are stored in computers is called for.
At the lowest level, computers store everything as zeros and ones, or “bits”.  To make our programming life easier, these can be lumped together into sets such as bytes, which hold 8 bits.  But for our discussion, we are going to call our sets “buckets”, and these can be of different sizes.  So a computer byte, which holds 8 bits, will be an 8-bit bucket in our analogy.
An 8-bit, byte-sized bucket can hold only 256 different values (0 through 255 in decimal).  Therefore, we would be limited to only 256 lookup positions in our lookup table --- not nearly enough.
Sidebar:  How did we determine that 256 is the maximum number of combinations that fit into an 8-bit bucket, or byte?  Each position can hold only one of two possible values: a 0 or a 1.  These occur over our 8 locations, so our 256 is the result of 2 raised to the 8th power.  A 9-bit number can have 2^9 or 512 index locations, etc.
We have a trade-off here that the programmer needs to consider.  As the table becomes larger (more indexes), it will require more bits (less compression) to hold the index values.  Therefore, the programmer should decide when the benefits of a larger table no longer merit the increased storage for the indexes into the table.  For highly repetitive and smaller files, a smaller index is preferable, and most examples of LZW use tables of 4,096 indexes.  However, for highly random and generally large medical data, a larger index is useful, and Cerner indeed chose to use a table of 8,192 indexes.
Our “sliding window” is implemented by a set of bit-buckets that grow as the size of our index grows.  Since we grow our table at runtime, we can increase the value of our new lookup index by a value of 1 for each new sequence.  If we have an available 9-bit bucket (or to think of it a different way, if the index fits into a 9-bit bucket), we put it there; otherwise, it goes in the next size larger, the 10-bit bucket.
We now have a rough idea of how our sliding-window lookup indexes work, so let’s talk now about the data they point to; that is, our lookup table.
We know our table has indexes which are pointers to our decompressed data elements.  Compression comes from the fact that we store repeating sequences of characters in only one index location in the table.  For example, the phrase  “Hello World!  I say again, Hello World!”  might have one entry in the table for “Hello World!”.  Therefore, as we decompress our data, we would have one entry in our lookup table for “Hello World!” and the compressed stream would contain the associated index location twice.
This is actually how things work in an uncompressed stream too.  A series of bits are sent to a program that looks up the data in an associated lookup table.  There are many such tables in computers.  The most common are the 7-bit ASCII and 8-bit extended ASCII tables.  Most LZW implementations “hard code” the first 256 positions of the lookup table with this extended ASCII table data.  If that’s all we needed, we could use an 8-bit bucket, or a byte, and we’d be set.  But to get data compression we must have more index locations.  This is why most LZW compression routines start with 9-bit buckets.
OK, we’ve agreed to populate the first 256 positions of our lookup table with the associated data from the extended ASCII table, which conveniently enough fills them up completely.  For example, position zero has the ASCII character NULL in it and position number 75 has capital “K” in it, and so on.  Therefore, when we see a value of 75 in our compressed stream we know that we need to add the letter “K” to our output stream.
Assume we need to decompress the value of “KKKK”.  If we were to put the value of “KKKK” in our lookup table that, let’s say, is index position 500, all we’d need to send is the 9 bits for index 500 that points to “KKKK”.  When we decompress, we look up the value at position 500 and add “KKKK” to our output stream.  Every subsequent time we see a bit sequence that translates to the number 500 (decimal), we know we need to add another “KKKK” to the output stream.  We have gone from the 32 bits needed for the uncompressed “KKKK” (8 bits times 4); to just 9 bits.  Pretty good compression!
Remember, the lookup table is not stored in the compressed stream, just the indexes so the size of the lookup table is not factored in to the compression ratio.
Let’s review where we are now.  We are creating a lookup table on the fly using a bunch of indexes that are passed to us in our compressed stream.  Many of those indexes represent repeating character sequences contributing to our compression.  As our indexes grow larger, we put them into larger and larger bit-buckets.  But remember that we will hit a point where the size of the bucket is too large to be efficient.  The programmer needs to decide at what point to stop growing the bit bucket size.  Further, when the program hits the maximum bucket size, the programmer needs to decide if they want to continue to use the fully populated lookup table for further lookups; or, if they want to empty all the bit buckets and start over.
The magic of LZW is the lookup table created on the fly.  Let’s now turn and examine how the lookup table is dynamically constructed.
As we go through our decompression, we add decompressed values to our lookup table in an incrementing order of indexes.  When we encounter a sequence of characters, we first check to see if we have already indexed those characters.  If we have, we know that they are in an index position that is less than our current position.  If not, we increment our index position by one number and add it to our table.
The important point here is that the sliding-window of bit sizes works because we never have a lookup index from the compressed stream that is greater than our current index counter.  This is essential to the dynamic lookup table and is essential to the way the sliding-window works.
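Here is a minimal sketch of that core decode loop (my own illustration, not Cerner’s code or the attached tool).  It assumes the codes have already been pulled out of the bit stream, pre-loads the 256 extended ASCII entries, and starts adding new entries at index 258 (indexes 256 and 257 are reserved, as explained in the sidebar further down):

using System.Collections.Generic;
using System.Text;

class LzwDecoder
{
    // Decode a stream of LZW codes into the original data, building the lookup table on the fly.
    public static string Decode(IEnumerable<int> codes)
    {
        const int FirstFreeIndex = 258;   // 0-255 = extended ASCII, 256 = reset, 257 = end-of-file

        var table = new Dictionary<int, string>();
        for (int i = 0; i < 256; i++)
        {
            table[i] = ((char)i).ToString();
        }

        var output = new StringBuilder();
        string previous = null;
        int nextIndex = FirstFreeIndex;

        foreach (int code in codes)
        {
            string entry;
            if (table.ContainsKey(code))
            {
                entry = table[code];
            }
            else
            {
                // The code can at most equal the entry we are about to add; in that one
                // special case the entry is the previous sequence plus its own first character.
                entry = previous + previous[0];
            }

            output.Append(entry);

            // Add "previous sequence + first character of current sequence" as the next table entry.
            if (previous != null)
            {
                table[nextIndex++] = previous + entry[0];
            }
            previous = entry;
        }
        return output.ToString();
    }
}

Feeding it the codes 65, 66, and 258 from the “ABABAACE” example that follows produces “ABAB”.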
At the end of this article, I’m going to write an extended sidebar as to why all this works for any type of data and not just text; so if you are confused here, read that and jump back up here.
Let’s get specific with a new example string of “ABABAACE”.  We get our compressed stream of bits and taking the first 9 bits (001000001) we get a value of 65. We’ll talk about bytes and stuff in a minute.
This value is “A” from our associated table and so we add the value to our output.  Since we already have an index for “A” in our table, we don’t need to add a new lookup index.
Our next 9 bits are (001000010) or 66 decimal which corresponds to “B” as we expect and we add that to our output stream which is now:  “AB”.
So far we don’t have compression since we have sent 18 bits for two 8-bit characters that would have fit into a 16-bit stream.
But now, we also have the first multiple character sequence that we need to index.  After we add the “B” to the output stream, we add a value of “AB” at our next available index location of 258.
Sidebar:  Since we are zero based, our 256 entries stop at index position 255.  Another implementation variable is where to start the first index in our lookup table.  For most textbook implementations of LZW that use a sliding-window, index position 256 is used to tell us to empty our buckets and start a new lookup table; and index position 257 is an End-of-File marker.  So our first index position is 258 and we’ll start adding our entries there.
We process the next bits in the stream and now it gets interesting.  Our next 9 bits are (100000010) or 258.  We go to our lookup table and find an entry at 258 (which we just put there) so we don’t need to add a new one.  We now add the value of “AB” from index 258 to our output stream which now contains:  “ABAB”.
At this point, our compressed stream has sent three 9-bit patterns or 27 bits of data to give us “ABAB”.  If we had used non-compression and just used the 8-bit bytes that represent “ABAB”, we would have needed four 8-bit patterns or 32 bits which is about 19% more space.  We have compression!
Sidebar: In the attached code, I use a sequence lookup table that is a modified linked list.  This eliminates the need to do pattern matching whether that be in our code or in an abstracted hash table which are costly and slow.  I find that this approach is substantially faster than the “textbook implementations” where the programmer is instructed to use pattern matching and lookups of some kind.
You may now be thinking:  “Wow, this is really simple stuff.”  We use a sliding window of bits to create our lookup table and build that table on the fly with information we retrieve as go.  What’s the big deal?
There’s one little BITty detail I neglected to mention.  The higher level programming languages such as .NET and JAVA do not readily support our use of bits.
In C#, the smallest bit-bucket is 8 bits; that is, a byte.  It’s even worse in JavaScript where the smallest bit-bucket is 32 bits!  Therefore, we have to employ a shifty workaround that’s every bit as complicated as our previous example is simple (Pretty good eh?  Two puns in the same sentence.  No?  OK, I’ll stick with writing code and an occasional blog).
Seriously, the solution to our problem is not really all that hard, but it does take a refresher on bit shifting and masking.  Let me hit the highlights; the reader is encouraged to read up on this stuff.  I also encourage the reader to pull out the first few bytes of their compressed stream and shift the bits on paper or in Notepad so they can see the result.  Let’s do that now for a couple of examples, one basic and easy and the other a bit more complex.
Let’s use our most recent example of “ABABAACE” and the compressed stream we are working with.
We saw previously that our bit stream (not a byte stream yet) holds the following series of 9 bit segments: 001000001 which is a decimal value of 65; 001000010 which is a decimal value of 66; and 100000010 which is a decimal of 258.
Our bit stream before we pull out the 9 bit segments looks like this: 001000001001000010100000010
When we use our C# library to read the stream, it pulls 8 bit segments as bytes and so we now get the following:  00100000 decimal 32, 10010000 decimal 144, and 10100000 decimal 160; all in 8 bit buckets called bytes.  However, we need our bits to live in 9 bit buckets (well, at least until we run out of 9-bit buckets and then increase our bit bucket size).
For our program, .NET will take the bit stream and drop 8 bits into each byte-sized bucket of 8 bits each to make three bytes.  Specifically, we end up with a byte array of the three elements: 32, 144, and 160.
Since we need our bits to live in our 9-bit buckets, we have to take our first 9-bit bucket and drop all 8 bits from the first 8-bit bucket into it: 00100000
Next, we take the first bit, a “1”, from our second 8-bit bucket and put that at the end of our bit sequence in the first 9-bit bucket.  Our first 9-bit bucket now has the following bits which is the decimal 65, “A”, we were expecting:  001000001
Our second 8-bit bucket now only has 7 bits remaining which we dump into our second 9-bit bucket: 0010000.  Since our second 9-bit bucket has only 7 bits, we need two more bits to fill our bucket.  We get those by taking the first two bits from the third 8-bit bucket which starts with 10100000 in bits.  Once that’s done, we end up with the second 9-bit bucket holding:  0010000 + 10 or 001000010 which is decimal value of 66 or “B” which is what we expect.
We continue on putting 9 bits into each of the 9 bit buckets from the original 8 bit buckets until we run out of 9 bit buckets.  At that time, we take our pile of 10 bit buckets and start dumping bits in there from the 8 bit buckets until they are all filled.
In essence, we are working around the byte constructs forced on us by the .NET framework by just taking all the bits in order from the 8-bit buckets and moving them into the 9-bit buckets until they are all filled and then the 10-bit buckets and so on.  In effect, we are “shifting” the bits from a smaller sized container into a larger sized container.  Note that in C/C++, you could use a much more efficient memory pointer and just grab the number of bits you need directly from memory.
This does get a bit tricky when we have larger buckets, say 11-bit.  For example, we may have 1 bit left from an 8-bit bucket after the other bits are shifted out.  Then, we add the 8 bits from the next 8-bit bucket.  But we still need two more bits from a third 8-bit bucket.  So in this example and in our program, we need to keep track of the bits from three 8-bit buckets.
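Here is a minimal sketch of that bit-to-bucket shuffling (again, my own illustration).  It consumes bits most-significant-first, matching the worked example above, and leaves it to the caller to decide when to grow the code width:

class BitReader
{
    private readonly byte[] data;
    private int bytePos;   // which 8-bit bucket we are currently draining
    private int bitPos;    // how many bits of that bucket have already been used

    public BitReader(byte[] data)
    {
        this.data = data;
    }

    // Returns the next bitCount bits as an integer (MSB first), or -1 when the stream is exhausted.
    public int Read(int bitCount)
    {
        int value = 0;
        for (int i = 0; i < bitCount; i++)
        {
            if (bytePos >= data.Length)
            {
                return -1;
            }

            int bit = (data[bytePos] >> (7 - bitPos)) & 1;  // take the next unused bit of this byte
            value = (value << 1) | bit;

            if (++bitPos == 8)   // this 8-bit bucket is empty; move on to the next byte
            {
                bitPos = 0;
                bytePos++;
            }
        }
        return value;
    }
}

Given the three bytes 32, 144, and 160 from the example, Read(9) returns 65 and then 66.  The caller bumps the width from 9 to 10 bits once the next table index reaches 512, to 11 bits at 1,024, and so on, up to the 13 bits needed for Cerner’s 8,192-entry table.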
If I were teaching CompSci 101, here I would assign a homework project to write a decompression routine in JavaScript which supports bit shifting on 32-bit buckets instead of our 8-bit buckets.
Well, that’s pretty much it.  We have shown how to take a compressed stream of bits and build our decompression lookup table on the fly.  Using that table, we build our original decompressed stream back into its original state.
A cool next exercise would be to add encryption to the mix.  Do you think it would be more efficient to compress and then encrypt or to encrypt and then compress?  Really?
Extended Sidebar:
One question I frequently see come up about LZW is whether it works for binary data or just text?  At this point, the reader should know the answer.  But in case I’ve lost you, let me provide some hints.
Are we storing compressed text in our example above?
Let me give a couple of examples that may help.
Let’s say I want to use JBJ (my initials) compression.  I can compress the phrase:  “Hello World!  This is Bruce” into 1 bit.  That is, I can take the 27 bytes and corresponding 216 bits and compress them down into 1 bit.  Am I good or what?
But I may want to add my picture into the compressed stream and put that in index position two.
So now I can send my text and my picture in just two bits in my compressed stream.  The only problem is that both sides need to have the same lookup table already defined and in place.    
So does the JBJ compression routine work for text or binary?  Why?
Sidebar to the Sidebar:  This is the foundation for my patents on medical image compression which were designed over a napkin at Cheesecake Factory while conducting a job interview for my co-inventor at Microsoft (he was obviously hired, by the way, and is a very smart and accomplished programmer).
Remember we said that compressed medical images have to be lossless.  But when a given imaging device takes a medical image, much of that image holds data that’s not clinically relevant.  For example, if I take a picture of a hand, the space between the fingers is not clinically relevant and would look the same for every image from the same device.  For these images for a given modality, we store the non-clinically relevant parts of the images in a static lookup table on both ends of the process.  When we retrieve the images, we can take very large blocks of non-clinically relevant data and insert them into the final decompressed image.  Much like my bogus example above, we can get extremely efficient data compression with medical images and maintain bit-for-bit fidelity on the clinically relevant sections.
/End Sidebar to the Sidebar – back to the Sidebar
OK, I hope you see that LZW, like most compression algorithms (including the bogus JBJ), uses a lookup table to find repeating sequences of data.  It doesn’t make any difference what the data is.  As long as the index into our lookup table is smaller than the data it points to, we will have compression.
To put this a different way, we store repeating sequences of bits in our lookup table, and when we process our input compressed stream, we extract those sequences of bits in order and stick them in our output bit stream.  Once that is completed, it’s up to the program that uses the output (file) to decide what translation table to apply.  For example, if the content is indeed text, the program will break the bits into bytes at 8-bit sequences and use the ASCII table for translation.  However, if it’s an image file, it may apply different rules to organize the bits.
So what would happen if we do not initialize our lookup table with the extended ASCII character set?  Of course we’d need to change our end-of-file marker and some other things, but let’s just focus on the first few bits to design our new LZW implementation.
Let’s say that we know our original stream is always going to be comprised of alpha characters such as A, a, B, b, etc.  It’s a waste of index space to reserve locations for data we’ll never use and so we don’t want to pre-populate our table.  In our case, we start our bit-bucket size at 7 which will hold a maximum of 128 index locations.  This is ample for storing the bit patterns of all the alpha characters.
Let’s keep NULL at index location zero, so our first index location is 1.
If our example stream is our “ABABAACE” example, our first seven bits are: 100 0001 and we put that bit sequence in our index position 1.
Our next 7-bit sequence is: 100 0010 and we put that at position 2.
Our next 7-bit sequence is: 000 0011, which is decimal 3 and corresponds to index 3 – the entry for the pair “AB” that we added on the fly after processing the first two characters (just as “AB” lived at index 258 in the earlier example).
Now, our compressed bit stream is 21 bits: 100000110000100000011
The key takeaway here is that we simply store a collection of bits in our lookup table.  When we decompress the input stream, our output should be identical at the bit level with the original file.  The program that creates the file needs to understand how to interpret the bits; for example: as 7-bit ASCII in this example, or as 8-bit ASCII in our previous example, or perhaps an integer such as 1,073,411.
So for our compression, we are only concerned with bits and repeating segments of bits; how those are rendered or interpreted falls to the program that reads them later.
In short, our compression works with any kind of data since it maintains bit for bit fidelity.