Saturday, April 20, 2013

Indexing Scanned Documents

Frequently in Health IT, we battle our own “urban legends” regarding how difficult some solution might be.  In this post, I’ll attempt to dispel one of those: that it’s difficult or impossible to index scanned documents.

First, some technical background.  A scanned document is simply an image of a document that holds text.  From that image we need to identify and isolate the alpha-numeric characters and index those.  Sounds simple, but not so.

To isolate the textual content in the image, the program must first determine the type and size of font used for the characters.  Using those data, it then scans the image piece by piece to find a match.

Now you can imagine some of the challenges of making this work.  If the document is slightly tilted, then it would be difficult to match the fonts.  Or, if the image quality is bad or there’s a smudge on the image right over the text, the engine would have difficulty in reading it.

Fortunately, there are a number of vendors who have written programs to handle this “optical character recognition” or “OCR” so we can extract the text we need.  Unfortunately, most are very expensive.

However, a little-known fact is that most modern Windows systems come with an OCR engine built in; specifically, the Windows TIFF IFilter, which performs OCR on TIFF images.

TIFF is an image file format that supports lossless compression.  It is used in most digital fax systems and for many medical images, where lossless storage is often required; that is, bit-for-bit fidelity before and after compression.

So let’s assume that you have scanned in some TIFF images and you want to offer them in your data mining tool for health data.  The rest of this post describes how to configure a Windows system and SQL Server to do just that.

For our example, we’ll use Windows Server 2008 R2 (the Windows 7 kernel) and enable the TIFF IFilter, which is not enabled by default.  However, this works for Windows 7 too.

On the Server 2008 R2 machine, open the Server Manager and click on the Features node.  Verify that Windows TIFF IFilter is not yet enabled.  To enable it, click Add Features on the right, scroll to the bottom of the list and enable the Windows TIFF IFilter.  You will also need the .NET Framework 3.5.1 Features for SQL 2012, so add them too while you’re at it.

If you are using Win7, click Start, then Control Panel, then Programs, and then Turn Windows features on or off.  Then select the Windows TIFF IFilter checkbox.

For production, you will want a full version of SQL Server, since you will probably want to store more than the Express (free) version’s limit of 10 GB per database; but for this exercise, you can use SQL Server Express with Advanced Services, since it comes with the Full Text Search we’ll need.

If you plan to import and index other common document types, you should also install the Microsoft Office Filter Pack and its latest service pack.  I suggest you install the 64-bit Adobe PDF IFilter while you’re at it.

So let’s see what type of filters SQL recognizes.  From a SQL Query Window, execute:

EXEC sp_help_fulltext_system_components 'filter';

For a clean SQL installation, this will return around 50 rows and types.  Let’s let SQL know about the new filters we just added.  Run the following commands:

EXEC sp_fulltext_service @action='load_os_resources', @value=1;
EXEC sp_fulltext_service 'update_languages';
EXEC sp_fulltext_service 'restart_all_fdhosts';
EXEC sp_help_fulltext_system_components 'filter';

Assuming you added all the filters we just mentioned, you should get a list of around 166 filters.
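Rather than scanning the whole list by eye, you can check for the TIFF filter directly.  This is a minimal sketch that assumes the filter registers itself under the .tif and .tiff extensions; if it returns no rows, the filter has not been loaded:

```sql
-- Look up the filter registered for TIFF files, if any.
-- Assumption: the TIFF IFilter is registered for '.tif' and/or '.tiff'.
SELECT document_type, path, manufacturer
FROM sys.fulltext_document_types
WHERE document_type IN (N'.tif', N'.tiff');
```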

There’s one more little trick you need to do before all this works as expected.  By default, the Microsoft IFilter will only make a cursory attempt to OCR a document.  However, if the image quality is poor, or the document has multiple pages, the filter won’t do the job until we force it.

Open the Group Policy Editor by typing gpedit.msc in a command window.  We need to find the OCR settings, but their location depends on whether Windows Search is installed, which is the default in Win7.  For our Server 2008 R2 config, we look under Computer Configuration – Administrative Templates and we find OCR.  However, if Search is installed, it’s located at Computer Configuration – Administrative Templates – Windows Components – Search – OCR.

In either case, find the setting “Force TIFF IFilter to perform OCR for every page in a TIFF document” and enable it.

You have now configured your system to enable the SQL Server Full Text engine to crawl and index scanned images.  All that’s left to do is to pull the images into SQL, enable Full Text Search for the database and table and you can then easily find documents with a given term.
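As a rough sketch of those remaining steps, and not a definitive implementation, the following shows one way they might look.  All of the names here (the dbo.ScannedDocs table, its columns, the ScannedDocsCatalog catalog, the file path and the search term) are made up for illustration.  The important part is the type column: it holds the file extension, which tells the Full Text engine which IFilter to use for each row.

```sql
-- Hypothetical table: one row per scanned document.
CREATE TABLE dbo.ScannedDocs (
    DocId   INT IDENTITY(1,1) NOT NULL,
    FileExt NVARCHAR(8)       NOT NULL,  -- e.g. '.tif'; selects the IFilter
    Content VARBINARY(MAX)    NOT NULL,  -- the raw image bytes
    CONSTRAINT PK_ScannedDocs PRIMARY KEY (DocId)
);
GO

-- Load one TIFF from disk as a single binary value.
-- The path is a placeholder; the SQL Server service account must be able to read it.
INSERT INTO dbo.ScannedDocs (FileExt, Content)
SELECT N'.tif', BulkColumn
FROM OPENROWSET(BULK N'C:\Scans\example.tif', SINGLE_BLOB) AS src;
GO

-- Create a catalog and a full-text index over the image column.
CREATE FULLTEXT CATALOG ScannedDocsCatalog;
GO

CREATE FULLTEXT INDEX ON dbo.ScannedDocs
    (Content TYPE COLUMN FileExt LANGUAGE 1033)  -- 1033 = English
    KEY INDEX PK_ScannedDocs
    ON ScannedDocsCatalog
    WITH CHANGE_TRACKING AUTO;
GO

-- Find documents containing a given term (placeholder term).
SELECT DocId
FROM dbo.ScannedDocs
WHERE CONTAINS(Content, N'diabetes');
```

Keep in mind that the index is populated in the background, and OCR is slow, so the query may return nothing until the crawl has finished.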

A future blog post (and ebook) will describe a step-by-step on how to do that and provide sample code.
