Sunday, April 28, 2013

SQL File Table: Step-by-Step Installation and Configuration for Documents

I’m finalizing a project at work where we will index several million health notes going back about fifteen years.  As an adjunct to that project, I was interested to explore how we might index scanned documents and my previous post describes how to do that using SQL Server and the TIFF IFilter included in recent Microsoft OS releases.

This all worked well and fine but I became intrigued with the new SQL File Table technology since it could address a number of challenges.  Here are some of the reasons I like the technology:

  1. As I have blogged before, it is important to encrypt health data that contains Personal Health Information (“PHI”).  SQL Enterprise allows us to encrypt the database and logs and does a good job; but it’s expensive.  With SQL File Tables, we don’t’ store the data (and PHI) in the database but as files on the file system.  SQL File Tables will allow us to store the files in a compressed directory; or for health data, in an encrypted directory using EFS. Now, we can encrypt our semi-structured data held in medical notes without incurring the expense of SQL Enterprise.  And… see next
  2. The freeware version of SQL Server, the “Express Edition with Advanced Services” includes File Table support and the data stored on the file system does NOT count toward the 10 GB DB limit.  Cool.
  3. It is super simple to load and read documents from the File Table.  In fact, documents can be dragged and dropped into a folder.  If EFS has been utilized, the contents of the files will be secured – even from high privileged administrators who might have access to the file system.

There are a couple of drawbacks however.  First of all, there is a bunch of disk I/O and possible contention if these are stored on the same spindles as the rest of the database or logs.  When we add full-text indexing to all this I/O, we must make some configuration tweaks or we won’t be successful.

So the following is a step-by-step with minimal explanation.  In the next post, I’ll dig deeper into some of the technical aspects of these steps.  For the following, I used the Developer Edition of SQL 2012

  1. Install SQL Server select Full-text and Semantic Extractions for Search
  2. At the “Database Engine Configuration” screen, click the FileStream tab and select all the checkboxes (note, in the next post, we’ll talk about how to configure this on an existing instance of SQL that did not have this enabled on installation)
    image
  3. You should optimize the file system for the I/O load.  Refer to the MS documents for this; but I suggest you disable 8.3 with: FSUTIL BEHAVIOR SET DISABLE8DOT3 1   and disable last access with: FSUTIL BEHAVIOR SET DISABLELASTACCESS 1  (requires a reboot).
  4. Create a folder for your FileStream and make sure indexing is not enabled for that folder since SQL will handle that for us. 
  5. Create your  new database with a FileStream  filegroup that points to the folder created in step #4 and add a new File to the filegroup; and set the Filestream property to non-transacted-access.  Again, see the next post for details on this and the other steps.
  6. At any point up to now, you will want to configure the filters for the OS. See the previous post for more information.  Note, you should read the Adobe release information if you want to crawl and index .pdf files; especially the section on setting the path.  Similarly, if you want to crawl multi-page TIFF files, you should make the appropriate group policy change to enable that as per the previous post.
  7. There is one little key problem that we need to address before we add our File Table and enable Full-Text-Search.  As mentioned before, there is quite a bit of file contention with the File Tables and this is exacerbated when we try to index new documents. In fact, I have found that without these tweaks, the service can’t get past the “Error 0x80040e97“ reported in the SQL logs and the full text engine won’t index all of the documents.  Another error you might see for the same reason is:  Error '0x8004fd02: The filter daemon MSFTEFD failed to load an IFilter  Therefore, we need to increase the timeout by setting the “ft_timeout” value for the service with: Exec sp_fulltext_service 'ft_timeout', 600000;
  8. So assuming you have installed and configured the various IFilters and you have not created your table yet, from a SQL Query window, execute the following:
  9. EXEC sp_fulltext_service @action='load_os_resources', @value=1;
    EXEC sp_fulltext_service 'verify_signature', 0;
    EXEC sp_fulltext_service 'update_languages';

    Exec sp_fulltext_service 'ft_timeout', 600000; – ten minutes
    Exec sp_fulltext_service 'ism_size',@value=16; – the max

    EXEC sp_fulltext_service 'restart_all_fdhosts';
    EXEC sp_help_fulltext_system_components 'filter';
    reconfigure with override
  10. You can now create your FileTable.  For example:  Create Table MyNotesTable as FileTable
  11. Now the fun can begin.  Right-click over your new filetable and select the Explore File Table Directory to open a window to the FileTable directory. 
  12. Drag two or three documents into the empty directory.  TIP:  you might want to install Search and index a test folder where you can crawl your test documents with Windows Search.  Since you are using the same filters as the Operating System, you can verify that the words you use to query are found by the wordbreaker. For example, I tried to use a rather dirty PDF of a scanned document and the IFilter was unable to recognize several words from the document and Windows Search let me easily discover the problem.
  13. Go back to your table and Select Top 1000 rows in SSMS to verify that your documents made into the table.  Cool eh?
  14. Now, let’s crawl them.  Enable Full Text search on this table.  Use the default PK index for the unique index, the file_stream as your text content, and the file_type as the Type_Column. 
  15. Give the crawler a few minutes to complete the crawl.  In the ‘Select Top 1000 …”  set a where clause to something such as:  “Where contains (file_stream, ‘mywordtosearchfor’).  You should see a list of matching documents that contain the word or phrase.

In review, the new SQL File Table is a great way to load and retrieve documents into SQL Server so that they can leverage the power of SQL’s Full-Text-Search.  With this technology, we can crawl and index dozens of different file types including scanned images and faxes and enjoy the benefits of either file compression or file encryption.

In the next post, I’ll dig a bit deeper into some of the technical aspects of these steps.

4 comments:

  1. Bruce, this is AWESOME STUFF. While following along, I ran into a minor issue in that SQL Server Management Studio currently does not recognize that SQL Server 2012 Express with Advanced Services supports full text indexing, and thus the management UI refused to create the full text index. You just have to do it using Data Definition Language rather than through the UI. Issue is described here: http://stackoverflow.com/questions/10407337/express-with-advanced-services-cant-create-full-text-index

    Also, it should be noted that if you're using a "Home" or "Home Premium" version of Windows 7 (haven't tried this on Windows 8), you can install the TIFF iFilter, but you can't configure it to crawl all pages of a multipage TIFF file because the necessary Group Policy isn't supported by the OS. I imagine it could be configured more directly through a registry entry, but it's probably not worth worrying about, as this OS wouldn't be a suitable platform for a server anyway (noting that I only ran into this on a development box).

    Again, EXCELLENT STUFF. The combination of full text search, compression, encryption, and enhanced security make SQL Server 2012 Express with Advanced Services IDEAL for this type of application. Keep it coming!

    ReplyDelete
  2. Thank you Don for the clarification and additional information and taking the time to write about it.

    ReplyDelete
  3. Your blog is superb, the information contained here is not only readable and informative but seems to me to be unique to the Internet in its level of detail and technical understanding, at least all gathered into one place!

    I'm having trouble getting full-text indexing working on my File Table, though. I've set up a test VM on which I'm running SQL Server 2012 Standard and I've configured everything correctly, as far as I can tell. For my testing I'm using 100 historic company documents (a mixture of .doc and .pdf) which are around five years old because, if I can get this working, these are a tiny sample of the documents that I would want to index. However, whatever I try, my index consistently fails to collect any words at all. My crawl log contains this line for every document in the table:

    2014-06-10 10:43:17.21 spid21s Error '0x8004fd02: The filter daemon MSFTEFD failed to load an IFilter interface for document, so it can't be indexed.' occurred during full-text index population for table or indexed view '[FilestreamTestDB].[dbo].[filetable2]' (table or indexed view ID '18099105', database ID '5'), full-text key value '/387167963850.266385332230334.627976731/'. Attempt will be made to reindex it.

    I came across your blog whilst searching for information about this error, but increasing the timeout for the indexing doesn't seem to have made any difference. In fact, reading a forum thread on the MSDN website regarding somebody else's experience with this same error, it seems as if there may be a problem with Microsoft's iFilter when it comes to reading older .doc files (mine were probably created in Office 2003/2007).

    I'm not asking for personalised support or anything, I just wondered if this was something you'd come across, or if you were aware of a problem with indexing of older documents. I thought setting up full-text indexing was going to be relatively straightforward - and it did appear to be as I was configuring it - but I've now spent many hours trying to get my index to populate and I'm feeling very demoralised!

    ReplyDelete
  4. Look at the next post that talks about things you can do at the OS level to get ready to crawl the documents. I also suggest that you keep SQL out of the picture until you are sure that the OS is configured by using Search with the OS. You will find that Search will provide more and better information in the Event Log if there are problems with your documents or your core configuration. Once you can crawl, index, and search for the documents using Windows Search, then you can light up SQL. Good luck

    ReplyDelete