Term Completion & Search–Another Great use for Full Text Search in SQL Server

As is every hospital in the US, we are working to move to ICD-10 for our coding and billing.  Our EMR system has been challenged with the conversion and we’ve found a number of anomalies in our validation.

Our vocabularies include: ICD-9, ICD-10, SNOMED, and great commercial vocabulary from IMO.  Both IMO and SNOMED have around a quarter million terms each and so finding and correlating the terms proved challenging.

To address the challenge of identifying and addressing mapping issues, I developed a validation tool to cross map each of the vocabularies.  However, the process needs to begin with finding a term and that’s where using SQL Server’s Full Text Search comes in.

Here’s the rough specs of how I wanted the tool to work.  Let’s use the ICD-10 vocabulary list for our example and let’s search for:

“Atherosclerosis of autologous vein coronary artery bypass graft(s) with angina pectoris”

  1. The user is presented with a text box where they can input a word found in the term they are searching for.  To make this easier with complex medical terms, I offer them an autocomplete list after they key in the third character.  When they key in “Ath”, the system will offer all the words found in all the terms that begin with “ath”.  As they key in each additional letter, the list offered becomes shorter and more specific.  For our example, keying in “ath” finds only six matching words in ICD-10.
  2. The user selects Atherosclerosis presses the ‘Enter’ key to select, and ‘Enter’ again to submit.
  3. The program returns every term with the word Atherosclerosis  in it; a total of 338 terms from ICD-10 and the application shows the first 25.
  4. Back at the word input box, the user can key in a second word found in the 338 terms to narrow the list.  The “magic” of this application is that now after the user keys in the first three letters, the returned list shows only the words that start with the letters keyed in AND that are contained in the 388 matching terms.  Say we want to use “autologous”.  This word appears after we key in the first three letters, “aut” and we select it by hitting the ‘Enter’ key and then submit by hitting the ‘Enter’ key again.
  5. Our application now returns a filtered list of only 58 records.
  6. Since Angina is associated with our patient, we key in “ang” and select “angina” and now when we Enter-Enter, we get a short list of ten terms and it’s easy to see and select our full term.
  7. For our validation tool, I return a nice grid with all the associated cross-vocabularies mapped accordingly when the user selects a given term.

The key here is that in less than 20 key strokes (usually less than 10), a provider can find and select a given complex medical term from across hundreds of thousands of choices.

Before I talk about how I built this, I invite you to try the select functionality using the full ICD-10 list here; and the full SNOMED list here.  Keep in mind that the SNOMED list is about a quarter of a million terms and it does this word and term processing on every word in every term offering millions of combinations.

So how would one build this?  We want to find the terms that hold the words we use to search; and we want to filter the words offered in the autocomplete to only those terms which has the combination of words.

In the first version of the tool, I used a custom word breaker I wrote in C# to break each term.  However, even though the code was simple and I used multiple threads, it still took days to complete.  I used an in-memory dictionary to hold the words already processed, but the process of identifying each unique word was still too slow.

Since I had already done a word break using FTS in SQL Server, I wondered if I could access the word list from there.  Lo and behold, after the reading the documentation, I found that indeed this was possible and very easy to do using two system stored procedures.

The first will find all the words across all the terms; that is, return the full list of words found by the word breaker.  We can use the system sproc: dm_fts_index_keywords to do this.

The second sproc we use will return all the rows or terms that hold a particular word.  We use the dm_fts_index_keywords_by_document for this.

Although we could use these functions at runtime, we need the queries to be very, very fast.  Therefore, we need the speed benefits of indexing our data.  To accomplish this, we create two tables that sit alongside our primary data table that holds the terms:  a word table that holds all the words and a bridge table holding all the mapped words to terms.  We use the two system sprocs to populate these.

At runtime, we use the Words table to return the filtered list of all words found across all terms that being with the letters the user keys in.  Using LINQ, our query looks like this where the variable “a” holds the letters we want to filter by such as “ath”:
var ret = (from r in ctx.Words where r.display_term.StartsWith(a) select r.display_term).Take(20);

We serialize the “ret” and send it back as json to populate the autocomplete.

For our subsequent autocomplete term searches, we use our bridge table where we take all the terms that hold the current list of words, and return all the words found in those terms except those already chosen.

Our term list is populated using standard FTS with the selected words.

Leave a Reply

Your email address will not be published. Required fields are marked *

17 + nineteen =