In my previous post, I provided a step-by-step guide on how to set up SQL Server to crawl and index large files of semi-structured data, whether in text-based formats (Word documents, PDF files, HTML files, etc.) or as scanned documents in formats such as TIFF.
This post will attempt to provide more technical background on why some of those configuration steps are needed, along with more technical detail in general.
One thing to keep in mind with SQL File Tables (“SFT”) is that they do not support all of the typical ways in which you can access or manipulate files. I like to think of the feature as a file system emulator that creates and exposes a virtual file system which possesses most of the best parts of the real file system; in fact, I will call this the SQL File System in this post.
Let’s use one of my favorite features of Windows, the Encrypting File System (“EFS”), to illustrate how SFT can leverage and extend the underlying file system. EFS was designed to be used by a single user to encrypt one or more files or folders. Unfortunately, EFS only allows the user to share the unencrypted file with other users at the file level, not the folder level. This is a non-starter for us, since we want to store and share millions of files.
As we would expect, SFT uses the SQL Server service account to read and write from the encrypted folder, and that account is the only one that can decrypt the files. In other words, other accounts (users) can get to the files through the file system using the local drive letter; however, they will not be able to view the content since they cannot decrypt the files.
However, SFT and the SQL File System extend EFS functionality by providing access to the encrypted files, which are decrypted by SQL Server according to SQL user permissions. In other words, access to the encrypted files and folders is handled through SQL and not the file system, and unlike native EFS in the OS, we can now grant permissions on the decrypted content to a host of users through SQL security. In my view, this is a huge improvement and benefit!
In a future post, I plan to provide some details on how this might work with a web-based viewer for the files, and on providing auditing to comply with health privacy regulations. But back now to the goal of this post: to discuss some of the technical details of SFT.
First of all, if you failed to enable FILESTREAM in SQL Server when you did the install, you can now enable it without uninstalling and reinstalling as was required before. To do this, open SQL Server Configuration Manager, expand SQL Server Services in the left panel, then in the right panel right-click the SQL Server service and select Properties. Click the FILESTREAM tab and enable the feature there. You next need to enable it at the database instance level. Open SSMS, right-click the server instance in the left window, and select Properties. Go to Advanced and, under Filestream, set the Access Level to “Full access enabled”.
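If you would rather script the instance-level step than click through SSMS, the same setting can be changed with sp_configure (the value 2 corresponds to “Full access enabled”, i.e. both T-SQL and Win32 streaming access):

```sql
-- Enable FILESTREAM access at the instance level
-- (0 = disabled, 1 = T-SQL access only, 2 = full access including Win32 streaming)
EXEC sp_configure 'filestream access level', 2;
RECONFIGURE;
```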
When you create your database, you need to enable FILESTREAM support for the db. There are a few moving parts: you should create a folder on the file system to hold your stream data, you need to create a filegroup, you then add a new file to the filegroup, and finally you need to provision a FILESTREAM directory name for the database and enable “non-transacted” access to the directory. This is rather convoluted to do in SSMS, but it’s good practice so you understand all the parts. However, if you’d rather script it, it might look something like:
CREATE DATABASE [TestFS]
CONTAINMENT = NONE
ON PRIMARY
( NAME = N'TestFS', FILENAME = N'C:\Program Files\Microsoft SQL Server\MSSQL11.MSSQLSERVER\MSSQL\DATA\TestFS.mdf' , SIZE = 5120KB , MAXSIZE = UNLIMITED, FILEGROWTH = 1024KB ),
FILEGROUP [FileStreamGroup] CONTAINS FILESTREAM DEFAULT
( NAME = N'TestFSFileGroup', FILENAME = N'C:\FS\TestFSFileGroup' , MAXSIZE = UNLIMITED)
LOG ON
( NAME = N'TestFS_log', FILENAME = N'C:\Program Files\Microsoft SQL Server\MSSQL11.MSSQLSERVER\MSSQL\DATA\TestFS_log.ldf' , SIZE = 1024KB , MAXSIZE = 2048GB , FILEGROWTH = 10%)
GO
ALTER DATABASE [TestFS] SET FILESTREAM( NON_TRANSACTED_ACCESS = FULL, DIRECTORY_NAME = N'FileStreamDir' )
GO
We have a bunch of directory names floating around, and it can become confusing as to where they are all provisioned and what their purposes are, so let me help clarify. Remember that there are two distinct file systems at work here that need to operate in concert: the regular OS-based file and share system, and the SQL File System. In some areas, they do NOT overlap, so it’s important to understand each:
We create an OS folder called “FS”. This is recognized by the file system but has no part in our SFT file path structure. In fact, once you get things configured, you should forget it even exists and leave it and everything in it alone!
When you created your database, you created a new filegroup and added to it a new “file” hosted in the filegroup.
That “file” is really a folder nested under the OS folder. Therefore, if the LogicalName of our file is “TestFSFileGroup”, our OS folder might be: C:\FS\TestFSFileGroup. If you decide to compress or encrypt your data, you would configure that at the OS level here.
OK, we are done now with the OS based file system. You can now forget it exists because everything you do after this, you will do using the SQL File system.
When you installed or enabled FILESTREAM for the SQL Server instance, you completed the share name. The default is “mssqlserver” and we’ll use that. However, for our purposes, we will never use OS tools to work with this share. If our server name is “MyServer”, we will have a Windows share called \\MyServer\mssqlserver, but this is a SQL Server share and our interaction with it is through SQL.
When we created our database above, we created a directory name called “FileStreamDir”. Now our SQL path for this database is: \\MyServer\mssqlserver\FileStreamDir
Let’s say we create a new File Table called “MyMedicalRecord” using the following SQL script:
CREATE TABLE MyMedicalRecord AS FILETABLE;
The path to this table’s data is: \\MyServer\mssqlserver\FileStreamDir\MyMedicalRecord
If you right-click the table name in SSMS and select “Explore FileTable Directory”, you will see that this is in fact the share and folder that comes up.
Even though this folder shows up in File Explorer, you are NOT going through the OS file system to reach the folder. You are using the SQL file system instead. But you can use File Explorer as a tool to manipulate the data in SQL. When you drag and drop files to this folder, they magically show up as rows in the SQL table. You can even use File Explorer to add new folders and to arrange the files in the folder.
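The reverse direction works too. As a sketch, assuming the MyMedicalRecord table above (the file name and content here are made up for illustration), inserting a row through T-SQL makes a file appear in the share:

```sql
-- Inserting a row is the SQL-side equivalent of dropping a file into the share.
-- Only name and file_stream need to be supplied; the other FileTable columns
-- (path_locator, timestamps, etc.) are filled in automatically.
INSERT INTO dbo.MyMedicalRecord (name, file_stream)
VALUES (N'note.txt', CAST('A short progress note.' AS VARBINARY(MAX)));

-- A folder can be created from T-SQL as well, by setting is_directory
INSERT INTO dbo.MyMedicalRecord (name, is_directory, file_stream)
VALUES (N'ScriptFolder', 1, NULL);
```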
For example, if you create a new folder called “NewFolder” under your SQL path using File Explorer, you can reach that folder with the new path: \\MyServer\mssqlserver\FileStreamDir\MyMedicalRecord\NewFolder
However, if you go to your OS mounted folder (I’ll let you do that this one time), and look under the FS folder, you will see a folder with a GUID as a name (that relates to the database folder name in SQL), and under that another folder with a GUID (that relates to the table folder name), but you won’t find a sub-folder for our “NewFolder” directory since this is totally managed by the SQL file system.
One final point here, which I hope to elaborate on more fully in a future post: you manage access to these files and folders at the SQL level and NOT at the OS level!
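As a sketch of what that looks like (the role name here is hypothetical), access is granted with ordinary SQL permissions on the FileTable rather than with NTFS ACLs:

```sql
-- Hypothetical role for illustration; granting SELECT on the FileTable
-- controls who can read the files, with no NTFS permissions involved
CREATE ROLE RecordsReaders;
GRANT SELECT ON dbo.MyMedicalRecord TO RecordsReaders;
```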
So drop a few large files into the folder and run SELECT * against the table in SSMS. You will see that it takes a long time to get the data back. That’s because the query is returning the full blob of data from the file system. You can speed things up by excluding the file_stream column.
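For example, a metadata-only query like the following (these are the standard FileTable columns) comes back quickly because no blobs are streamed:

```sql
-- Metadata only: no file content is streamed from the file system
SELECT name, file_type, cached_file_size, creation_time
FROM dbo.MyMedicalRecord
WHERE is_directory = 0;
```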
This illustrates that returning the data through SQL can take quite a bit of time. That’s the primary reason that we were seeing the 0x80040e97 and 0x8004fd02 errors in the SQL logs and that we needed to increase the “ft_timeout” values for the engine (see previous post).
The point here, though, is that you can use standard SQL queries to retrieve the varbinary(max) blob through SQL from the file system; or you can use standard Win32 file-access APIs to do this through the SQL File System share. Very cool.
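To bridge the two worlds, SQL Server provides helper functions that return the UNC paths the SQL File System exposes; a sketch, assuming the table above:

```sql
-- UNC root of the FileTable's directory in the SQL File System
SELECT FileTableRootPath(N'dbo.MyMedicalRecord') AS table_root;

-- Full UNC path for each file, suitable for Win32 file-access APIs
SELECT name, file_stream.GetFileNamespacePath(1) AS unc_path
FROM dbo.MyMedicalRecord;
```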
One final “gotcha” to be careful with. The standard guidance for SQL Server is to put your databases on a partition formatted with a 64K cluster size. However, we are storing files on the OS, and that guidance does NOT apply to our file-based storage. For our medical notes, the vast majority of the files are far smaller than 64K, so each small file will still consume a full 64K cluster – orders of magnitude more than we need. Therefore, you should determine the average size of the files you store and set your cluster size accordingly.
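For instance, when formatting the volume that will hold the FILESTREAM container (the drive letter and cluster size here are just example values; pick yours based on your average file size), the cluster size is set with the /A switch:

```shell
rem Format the FILESTREAM volume with a 4K cluster size (example values)
format D: /FS:NTFS /A:4096 /Q
```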