Sunday, April 28, 2013

SQL File Table – Technical Details

In my previous post, I provided a step-by-step guide on how to set up SQL Server to crawl and index large numbers of semi-structured files, whether in text-based formats (Word documents, PDF files, HTML files, etc.) or as scanned documents in formats such as TIFF.

This post provides a bit more background on why some of those configuration steps are needed, along with more technical detail in general.

One thing to keep in mind with SQL File Tables (“SFT”) is that they do not support all of the typical ways in which you can access or manipulate files.  I like to think of the technology as a file system emulator that creates and exposes a virtual file system possessing most of the best parts of the real one; in fact, I will call this the SQL File System in this post.

Let’s use one of my favorite features of the Windows file system, the Encrypting File System (“EFS”), to illustrate how SFT can leverage and extend the underlying file system.  EFS was designed to let a single user encrypt one or more files or folders.  Unfortunately, EFS only allows the user to share the unencrypted file with other users at the file level, not the folder level.  That is a non-starter when we want to store and share millions of files.

As we would expect, SFT uses the SQL Server service account to read from and write to the encrypted folder, and that is the only account that can decrypt the files.  In other words, other accounts (users) can get to the files from the file system using the local drive letter; however, they will not be able to view the content since they cannot decrypt the files.

However, SFT and the SQL File System extend EFS by providing access to the encrypted files, which are decrypted through SQL Server using SQL user permissions.  In other words, access to the encrypted files and folders is handled through SQL and not the file system, and unlike native EFS in the OS, we can now grant permissions to the decrypted content to a host of users through SQL security.  In my view, this is a huge improvement and benefit!
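
To make the point concrete, here is a minimal sketch of what granting access through SQL security might look like; the table and the Windows group are placeholder names of my own, since we haven’t created the real table yet:

-- Hypothetical FileTable and principal; permissions on the table also govern file access through the share
GRANT SELECT ON dbo.MyMedicalRecord TO [MYDOMAIN\ClinicalUsers];
GRANT INSERT ON dbo.MyMedicalRecord TO [MYDOMAIN\ClinicalUsers];

Because non-transacted file access to a FileTable is authorized against the SQL Server permissions on the table, a grant like this covers both T-SQL queries and access through the FileTable share.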

In a future post, I plan to provide some details on how this might work with a web-based viewer for the files, and how to provide auditing to comply with health privacy regulations.  But back now to the goal of this post: to discuss some of the technical details of SFT.

First of all, if you failed to enable FILESTREAM when you installed SQL Server, you can now enable it without uninstalling and reinstalling as was once required.  To do this, open SQL Server Configuration Manager.  Expand SQL Server Services in the left panel; in the right panel, right-click the SQL Server service and select Properties.  Click the FILESTREAM tab and check the boxes to enable FILESTREAM for Transact-SQL and for file I/O access.  Next, you need to enable it at the instance level.  Open SSMS, right-click the server instance in Object Explorer, and select Properties.  Go to Advanced and, under Filestream, set the Access Level to “Full access enabled”.
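
If you prefer to script the instance-level piece, the same setting can be applied with T-SQL (note this only sets the access level; the Windows-level FILESTREAM stack still has to be enabled in Configuration Manager as described above):

EXEC sp_configure 'filestream access level', 2;  -- 2 = Transact-SQL and Win32 file I/O access enabled
RECONFIGURE;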

When you create your database, you need to enable FILESTREAM support for it.  There are a few moving parts: create a folder on the file system to hold your stream data, create a filegroup, add a new file to that filegroup, and finally provision a FILESTREAM directory name for the database along with “non-transacted” access to the directory.  This is rather convoluted to do in SSMS, but it’s good practice so you understand all the parts.  If you’d rather script it, it might look something like:

CREATE DATABASE [TestFS]
CONTAINMENT = NONE  ON  PRIMARY
( NAME = N'TestFS', FILENAME = N'C:\Program Files\Microsoft SQL Server\MSSQL11.MSSQLSERVER\MSSQL\DATA\TestFS.mdf' , SIZE = 5120KB , MAXSIZE = UNLIMITED, FILEGROWTH = 1024KB ),
FILEGROUP [FileStreamGroup] CONTAINS FILESTREAM  DEFAULT
( NAME = N'TestFSFileGroup', FILENAME = N'C:\FS\TestFSFileGroup' , MAXSIZE = UNLIMITED)
LOG ON  ( NAME = N'TestFS_log', FILENAME = N'C:\Program Files\Microsoft SQL Server\MSSQL11.MSSQLSERVER\MSSQL\DATA\TestFS_log.ldf' , SIZE = 1024KB , MAXSIZE = 2048GB , FILEGROWTH = 10%)
GO

ALTER DATABASE [TestFS] SET FILESTREAM( NON_TRANSACTED_ACCESS = FULL, DIRECTORY_NAME = N'FileStreamDir' )

We have a bunch of directory names floating around, and it can be confusing where they are all provisioned and what their purposes are, so let me clarify.  Remember that there are two distinct file systems at work here that need to operate in concert: the regular OS-based file and share system, and the SQL File System.  In some areas they do NOT overlap, so it’s important to understand each:

We create an OS folder called “FS”.  This is recognized by the file system but has no part in our SFT file path structure.  In fact, once you get things configured, you should forget it even exists and leave it and everything in it alone!

When you created your database, you created a new filegroup and added a new “file” to it.


That “file” is really a folder nested under the OS folder.  In our script, the logical name of the file is “TestFSFileGroup” and its OS folder is C:\FS\TestFSFileGroup.  If you decide to compress or encrypt your data, you would configure that at the OS level here.

OK, we are now done with the OS-based file system.  You can forget it exists, because everything you do after this point, you will do through the SQL File System.

When you installed or enabled FILESTREAM for the SQL Server instance, you specified the share name.  The default is “mssqlserver”, and we’ll use that.  However, for our purposes we will never use OS tools to work with this share.  If our server name is “MyServer”, we will have a Windows share called “\\MyServer\mssqlserver”, but this is a SQL Server share and our interaction with it is through SQL.

When we created our database above, we created a directory name called “FileStreamDir”.  Now our SQL path for this database is:  \\MyServer\mssqlserver\FileStreamDir

Let’s say we create a new FileTable called “MyMedicalRecord” using the following SQL script:

CREATE TABLE MyMedicalRecord AS FileTable;

The path to this table’s data is: \\MyServer\mssqlserver\FileStreamDir\MyMedicalRecord

If you right-click the table name in SSMS and select “Explore FileTable Directory”, you will see that this is in fact the share (folder) that comes up.
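
If you’d rather resolve these paths programmatically than build them by hand, SQL will hand them to you.  A small sketch, using our table from above:

-- Root directory of the FileTable in the SQL File System
SELECT FileTableRootPath(N'dbo.MyMedicalRecord') AS TableRoot;

-- Full UNC path for each file row
SELECT name, file_stream.GetFileNamespacePath(1, 0) AS FullPath
FROM dbo.MyMedicalRecord;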

Even though this folder shows up in File Explorer, you are NOT going through the OS file system to reach it; you are using the SQL File System instead.  But you can use File Explorer as a tool to manipulate the data in SQL.  When you drag and drop files into this folder, they magically show up as rows in the SQL table.  You can even use File Explorer to add new folders and to arrange the files within them.

For example, if you create a new folder called “NewFolder” under your SQL path using File Explorer, you can reach that folder with: \\MyServer\mssqlserver\FileStreamDir\MyMedicalRecord\NewFolder
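
On the SQL side, that new folder is nothing special: it is simply another row in the FileTable with its is_directory flag set.  For example:

-- Folders created through File Explorer show up as directory rows
SELECT name, is_directory, file_type
FROM dbo.MyMedicalRecord
WHERE is_directory = 1;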

However, if you go to your OS-mounted folder (I’ll let you do that this one time) and look under the FS folder, you will see a folder with a GUID as its name (which relates to the database directory name in SQL), and under that another GUID-named folder (which relates to the table directory name).  But you won’t find a sub-folder for our “NewFolder” directory, since that is managed entirely by the SQL File System.

One final point here, which I hope to elaborate on more fully in a future post: you manage access to these files and folders at the SQL level and NOT at the OS level!

So drop a few large files into the folder and run SELECT * against the table in SSMS.  You can see that it takes a long time to get the data back.  That’s because the query is returning the full blob of data from the file system.  You can speed things up by omitting the file_stream column.
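
A quick illustration of the difference; the column names here are part of the fixed FileTable schema:

-- Slow: streams the full contents of every document back to the client
SELECT * FROM dbo.MyMedicalRecord;

-- Fast: metadata only, no blobs
SELECT stream_id, name, file_type, cached_file_size
FROM dbo.MyMedicalRecord;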

This illustrates that returning the data through SQL can take quite a bit of time.  That’s the primary reason we were seeing the 0x80040e97 and 0x8004fd02 errors in the SQL logs and needed to increase the “ft_timeout” value for the engine (see the previous post).

The point here, though, is that you can use standard SQL queries to retrieve the varbinary(max) blob from the file system through SQL; or you can use standard Win32 file access APIs to do the same through the SQL File System share.  Very cool.
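
For example, a plain INSERT works just as well as a drag-and-drop; the file name and content below are made up for illustration:

-- Create a file through T-SQL; it appears immediately under the share
INSERT INTO dbo.MyMedicalRecord (name, file_stream)
VALUES (N'hello.txt', CONVERT(varbinary(MAX), 'Hello from T-SQL'));

The new row is instantly visible as hello.txt in the FileTable directory, where any Win32 API (or Notepad) can open it.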

One final “gotcha” to be careful with.  The standard guidance for SQL Server is to put your databases on a partition formatted with a 64K cluster size.  However, we are storing files on the OS, and that guidance does NOT apply to our file-based storage.  For our medical notes, the vast majority of files are far smaller than 64K, so each small file would still consume a full 64K cluster – orders of magnitude more than we need.  Therefore, you should determine the average size of the files you store and set your cluster size accordingly.
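
Since the FileTable already records each file’s size in its cached_file_size column, you can let SQL do the measuring for you.  A sketch against our example table:

-- Size up your stored files before choosing a cluster size
SELECT COUNT(*)              AS file_count,
       AVG(cached_file_size) AS avg_bytes,
       MAX(cached_file_size) AS max_bytes
FROM dbo.MyMedicalRecord
WHERE is_directory = 0;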

SQL File Table: Step-by-Step Installation and Configuration for Documents

I’m finalizing a project at work where we will index several million health notes going back about fifteen years.  As an adjunct to that project, I was interested in exploring how we might index scanned documents, and my previous post describes how to do that using SQL Server and the TIFF IFilter included in recent Microsoft OS releases.

That all worked fine, but I became intrigued with the new SQL File Table technology since it could address a number of challenges.  Here are some of the reasons I like the technology:

  1. As I have blogged before, it is important to encrypt health data that contains Personal Health Information (“PHI”).  SQL Enterprise allows us to encrypt the database and logs and does a good job, but it’s expensive.  With SQL File Tables, we don’t store the data (and PHI) in the database but as files on the file system.  SQL File Tables allow us to store the files in a compressed directory, or, for health data, in an encrypted directory using EFS.  Now we can encrypt our semi-structured data held in medical notes without incurring the expense of SQL Enterprise.  And… see next
  2. The free version of SQL Server, the “Express Edition with Advanced Services”, includes File Table support, and the data stored on the file system does NOT count toward the 10 GB database limit.  Cool.
  3. It is super simple to load and read documents from the File Table.  In fact, documents can be dragged and dropped into a folder.  If EFS has been utilized, the contents of the files will be secured – even from highly privileged administrators who might have access to the file system.

There are a couple of drawbacks, however.  First of all, there is a lot of disk I/O, and possible contention if the files are stored on the same spindles as the rest of the database or logs.  When we add full-text indexing on top of all this I/O, we must make some configuration tweaks or we won’t be successful.

So the following is a step-by-step with minimal explanation.  In the next post, I’ll dig deeper into some of the technical aspects of these steps.  For the following, I used the Developer Edition of SQL 2012.

  1. Install SQL Server and select Full-Text and Semantic Extractions for Search.
  2. At the “Database Engine Configuration” screen, click the FileStream tab and select all the checkboxes (note, in the next post, we’ll talk about how to configure this on an existing instance of SQL that did not have this enabled on installation)
  3. You should optimize the file system for the I/O load.  Refer to the MS documents for this, but I suggest you disable 8.3 names with: FSUTIL BEHAVIOR SET DISABLE8DOT3 1   and disable last-access timestamps with: FSUTIL BEHAVIOR SET DISABLELASTACCESS 1  (requires a reboot).
  4. Create a folder for your FileStream data and make sure Windows indexing is not enabled for that folder, since SQL will handle indexing for us.
  5. Create your new database with a FILESTREAM filegroup that points to the folder created in step #4, add a new file to the filegroup, and set the Filestream property to non-transacted access.  Again, see the next post for details on this and the other steps.
  6. At any point up to now, you will want to configure the IFilters for the OS; see the previous post for more information.  Note: you should read the Adobe release notes if you want to crawl and index .pdf files, especially the section on setting the path.  Similarly, if you want to crawl multi-page TIFF files, you should make the appropriate group policy change to enable that, as described in the previous post.
  7. There is one key problem that we need to address before we add our File Table and enable Full-Text Search.  As mentioned before, there is quite a bit of file contention with File Tables, and this is exacerbated when we try to index new documents.  In fact, I have found that without these tweaks, the service can’t get past the “Error 0x80040e97” reported in the SQL logs, and the full-text engine won’t index all of the documents.  Another error you might see for the same reason is: “Error 0x8004fd02: The filter daemon MSFTEFD failed to load an IFilter.”  Therefore, we need to increase the timeout by setting the “ft_timeout” value for the service with: EXEC sp_fulltext_service 'ft_timeout', 600000;
  8. So, assuming you have installed and configured the various IFilters and you have not created your table yet, execute the following from a SQL query window:
  9. EXEC sp_fulltext_service @action='load_os_resources', @value=1;
    EXEC sp_fulltext_service 'verify_signature', 0;
    EXEC sp_fulltext_service 'update_languages';

    EXEC sp_fulltext_service 'ft_timeout', 600000;  -- ten minutes
    EXEC sp_fulltext_service 'ism_size', @value=16; -- the max

    EXEC sp_fulltext_service 'restart_all_fdhosts';
    EXEC sp_help_fulltext_system_components 'filter';
    RECONFIGURE WITH OVERRIDE;
  10. You can now create your FileTable.  For example:  CREATE TABLE MyNotesTable AS FileTable;
  11. Now the fun can begin.  Right-click your new FileTable and select “Explore FileTable Directory” to open a window on the FileTable directory.
  12. Drag two or three documents into the empty directory.  TIP: you might want to install Windows Search and index a test folder where you can crawl your test documents.  Since you are using the same filters as the operating system, you can verify that the words you plan to query for are found by the wordbreaker.  For example, I tried to use a rather dirty PDF of a scanned document; the IFilter was unable to recognize several words from the document, and Windows Search let me easily discover the problem.
  13. Go back to your table and Select Top 1000 Rows in SSMS to verify that your documents made it into the table.  Cool, eh?
  14. Now, let’s crawl them.  Enable Full-Text Search on this table: use the default PK index as the unique index, file_stream as your text content, and file_type as the Type Column.  (A scripted version of steps 14 and 15 appears after this list.)
  15. Give the crawler a few minutes to complete the crawl.  In the “Select Top 1000 …” query, add a WHERE clause such as:  WHERE CONTAINS(file_stream, 'mywordtosearchfor').  You should see a list of matching documents that contain the word or phrase.
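
If you’d rather script steps 14 and 15 than click through SSMS, a minimal sketch follows.  The catalog name is my own invention, and the KEY INDEX name must be the auto-generated primary key index on your FileTable (EXEC sp_helpindex 'MyNotesTable' will show it):

CREATE FULLTEXT CATALOG NotesCatalog AS DEFAULT;
GO
-- Substitute the actual PK index name from sp_helpindex for the placeholder below
CREATE FULLTEXT INDEX ON dbo.MyNotesTable
    (file_stream TYPE COLUMN file_type LANGUAGE 1033)
    KEY INDEX [PK_MyNotesTable]
    ON NotesCatalog
    WITH CHANGE_TRACKING AUTO;
GO
-- Step 15: once the crawl completes, search for a term
SELECT name
FROM dbo.MyNotesTable
WHERE CONTAINS(file_stream, N'mywordtosearchfor');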

In review, the new SQL File Table is a great way to load documents into SQL Server and retrieve them, so that they can leverage the power of SQL’s Full-Text Search.  With this technology, we can crawl and index dozens of different file types, including scanned images and faxes, and enjoy the benefits of either file compression or file encryption.

In the next post, I’ll dig a bit deeper into some of the technical aspects of these steps.

Saturday, April 20, 2013

Indexing Scanned Documents

Frequently in Health IT, we battle our own “urban legends” regarding how difficult some solution might be.  In this post, I’ll attempt to dispel one of those: that it’s difficult or impossible to index scanned documents.

First, some technical background.  A scanned document is simply an image of a document that contains text.  From that image, we need to identify and isolate the alphanumeric characters and index those.  Sounds simple, but it is not.

To isolate the textual content in the image, the program must first determine the type and size of font used for the characters.  Using those data, it then scans the image piece by piece looking for matches.

Now you can imagine some of the challenges of making this work.  If the document is slightly tilted, then it would be difficult to match the fonts.  Or, if the image quality is bad or there’s a smudge on the image right over the text, the engine would have difficulty in reading it.

Fortunately, there are a number of vendors who have written programs to handle this “optical character recognition” or “OCR” so we can extract the text we need.  Unfortunately, most are very expensive.

However, a little-known fact is that most modern Windows systems come with an OCR engine built in: the Windows TIFF IFilter, which extracts text from scanned images stored in the TIFF format.

TIFF is an image format with lossless compression that is used in most digital fax systems and for most medical images, because regulators mandate that medical images be lossless; that is, have bit-for-bit fidelity before and after compression.

So let’s assume that you have scanned in some TIFF images and you want to offer them in your data-mining tool for health data.  The rest of this post describes how to configure a Windows system and SQL Server to do just that.

For our example, we’ll use Windows Server 2008 R2 (the Windows 7 kernel) and enable the TIFF IFilter, which is not enabled by default.  The same steps work for Windows 7 too.

On the Server 2008 R2 machine, open Server Manager and click on the Features node.  Verify that the Windows TIFF IFilter is not yet enabled.  To enable it, click Add Features on the right, scroll to the bottom of the list, and enable the Windows TIFF IFilter.  You will also need the .NET Framework 3.5.1 Features for SQL 2012, so add them too while you’re at it.

If you are using Win7, click Start, then Control Panel, then Programs, and then Turn Windows features on or off.  Then select the Windows TIFF IFilter checkbox.

For production, you will want a full version of SQL Server, since you will probably want to store more than the Express (free) version’s limit of 10 GB of data; but for this exercise, you can use SQL Server Express with Advanced Services, since it comes with the Full-Text Search we’ll need.

If you plan to import and index other common document types, you should also install the Office Filter Pack and its latest service pack.  I suggest you install the Adobe 64-bit IFilter while you’re at it.

So let’s see what type of filters SQL recognizes.  From a SQL Query Window, execute:

EXEC sp_help_fulltext_system_components 'filter';

For a clean SQL installation, this will return around 50 rows (registered filter types).  Now let’s let SQL know about the new filters we just added.  Run the following commands:

EXEC sp_fulltext_service @action='load_os_resources', @value=1;
EXEC sp_fulltext_service 'update_languages';
EXEC sp_fulltext_service 'restart_all_fdhosts';
EXEC sp_help_fulltext_system_components 'filter';

Assuming you added all the filters we just mentioned, you should get a list of around 166 filters.

There’s one more little trick you need to do before all this works as expected.  By default, the Microsoft IFilter will only make a cursory attempt to OCR a document.  However, if the image quality is poor, or the document has multiple pages, the filter won’t do the job until we force it.

Open the Group Policy Editor by running gpedit.msc from a command window.  We need to find the OCR settings, but they are located differently if you have Windows Search installed, which is the default in Win7.  For our Server 2008 R2 configuration, we look under Computer Configuration – Administrative Templates – OCR.  However, if Search is installed, it’s located at Computer Configuration – Administrative Templates – Windows Components – Search – OCR.

In either case, find the “Force TIFF IFilter to perform OCR for every page in a TIFF document” setting and enable it.


You have now configured your system to enable the SQL Server Full-Text engine to crawl and index scanned images.  All that’s left to do is pull the images into SQL and enable Full-Text Search for the database and table; you can then easily find documents containing a given term.

A future blog post (and ebook) will describe a step-by-step on how to do that and provide sample code.