I’m finalizing a project at work where we will index several million health notes going back about fifteen years. As an adjunct to that project, I was interested to explore how we might index scanned documents and my previous post describes how to do that using SQL Server and the TIFF IFilter included in recent Microsoft OS releases.
This all worked well and fine but I became intrigued with the new SQL File Table technology since it could address a number of challenges. Here are some of the reasons I like the technology:
- As I have blogged before, it is important to encrypt health data that contains Personal Health Information (“PHI”). SQL Enterprise allows us to encrypt the database and logs and does a good job; but it’s expensive. With SQL File Tables, we don’t’ store the data (and PHI) in the database but as files on the file system. SQL File Tables will allow us to store the files in a compressed directory; or for health data, in an encrypted directory using EFS. Now, we can encrypt our semi-structured data held in medical notes without incurring the expense of SQL Enterprise. And… see next
- The freeware version of SQL Server, the “Express Edition with Advanced Services” includes File Table support and the data stored on the file system does NOT count toward the 10 GB DB limit. Cool.
- It is super simple to load and read documents from the File Table. In fact, documents can be dragged and dropped into a folder. If EFS has been utilized, the contents of the files will be secured – even from high privileged administrators who might have access to the file system.
There are a couple of drawbacks however. First of all, there is a bunch of disk I/O and possible contention if these are stored on the same spindles as the rest of the database or logs. When we add full-text indexing to all this I/O, we must make some configuration tweaks or we won’t be successful.
So the following is a step-by-step with minimal explanation. In the next post, I’ll dig deeper into some of the technical aspects of these steps. For the following, I used the Developer Edition of SQL 2012
- Install SQL Server select Full-text and Semantic Extractions for Search
- At the “Database Engine Configuration” screen, click the FileStream tab and select all the checkboxes (note, in the next post, we’ll talk about how to configure this on an existing instance of SQL that did not have this enabled on installation)
- You should optimize the file system for the I/O load. Refer to the MS documents for this; but I suggest you disable 8.3 with: FSUTIL BEHAVIOR SET DISABLE8DOT3 1 and disable last access with: FSUTIL BEHAVIOR SET DISABLELASTACCESS 1 (requires a reboot).
- Create a folder for your FileStream and make sure indexing is not enabled for that folder since SQL will handle that for us.
- Create your new database with a FileStream filegroup that points to the folder created in step #4 and add a new File to the filegroup; and set the Filestream property to non-transacted-access. Again, see the next post for details on this and the other steps.
- At any point up to now, you will want to configure the filters for the OS. See the previous post for more information. Note, you should read the Adobe release information if you want to crawl and index .pdf files; especially the section on setting the path. Similarly, if you want to crawl multi-page TIFF files, you should make the appropriate group policy change to enable that as per the previous post.
- There is one little key problem that we need to address before we add our File Table and enable Full-Text-Search. As mentioned before, there is quite a bit of file contention with the File Tables and this is exacerbated when we try to index new documents. In fact, I have found that without these tweaks, the service can’t get past the “Error 0x80040e97“ reported in the SQL logs and the full text engine won’t index all of the documents. Another error you might see for the same reason is: Error ‘0x8004fd02: The filter daemon MSFTEFD failed to load an IFilter Therefore, we need to increase the timeout by setting the “ft_timeout” value for the service with: Exec sp_fulltext_service ‘ft_timeout’, 600000;
- So assuming you have installed and configured the various IFilters and you have not created your table yet, from a SQL Query window, execute the following:
EXEC sp_fulltext_service @action=’load_os_resources’, @value=1;
EXEC sp_fulltext_service ‘verify_signature’, 0;
EXEC sp_fulltext_service ‘update_languages’;
Exec sp_fulltext_service ‘ft_timeout’, 600000; – ten minutes
Exec sp_fulltext_service ‘ism_size’,@value=16; – the max
EXEC sp_fulltext_service ‘restart_all_fdhosts’;
EXEC sp_help_fulltext_system_components ‘filter’;
reconfigure with override
- You can now create your FileTable. For example: Create Table MyNotesTable as FileTable
- Now the fun can begin. Right-click over your new filetable and select the Explore File Table Directory to open a window to the FileTable directory.
- Drag two or three documents into the empty directory. TIP: you might want to install Search and index a test folder where you can crawl your test documents with Windows Search. Since you are using the same filters as the Operating System, you can verify that the words you use to query are found by the wordbreaker. For example, I tried to use a rather dirty PDF of a scanned document and the IFilter was unable to recognize several words from the document and Windows Search let me easily discover the problem.
- Go back to your table and Select Top 1000 rows in SSMS to verify that your documents made into the table. Cool eh?
- Now, let’s crawl them. Enable Full Text search on this table. Use the default PK index for the unique index, the file_stream as your text content, and the file_type as the Type_Column.
- Give the crawler a few minutes to complete the crawl. In the ‘Select Top 1000 …” set a where clause to something such as: “Where contains (file_stream, ‘mywordtosearchfor’). You should see a list of matching documents that contain the word or phrase.
In review, the new SQL File Table is a great way to load and retrieve documents into SQL Server so that they can leverage the power of SQL’s Full-Text-Search. With this technology, we can crawl and index dozens of different file types including scanned images and faxes and enjoy the benefits of either file compression or file encryption.
In the next post, I’ll dig a bit deeper into some of the technical aspects of these steps.