When I started at Microsoft a couple of decades ago, the “Big Data” Daddies of the day were two systems in the world that held just over 1 TB of data. Today, we have USB thumb drives that hold that much. The point is, what is known today as “Big Data” will not loom large in the near future. However, we need to build systems today to store and retrieve whatever we currently call “Big Data” in efficient and flexible ways. In the medical/health arena, these systems can be life-changing, and that’s exciting!
Health “Big Data” has some unique challenges. Here are just a few:
- Very wide and deep data sets. I exported just one table from our EMR and it holds a half-billion rows with over eighty columns of data and consumes about a terabyte of disk space without indexing.
- Data is extremely dynamic. New medications, tests, procedures are introduced all the time and must be tracked.
- Nomenclatures (aka “ontologies” or “vocabularies”) are just plain hard. An ECG is also known as an EKG, an electrocardiogram, and an electrocardiograph. How do we correlate synonymous terms and concepts for storage and retrieval?
- Meds can be listed as their scientific and/or brand name; they can have a variety of forms and routes of administration and doses. To make matters more complex, a med can be primary or an ingredient of something else, the so-called “Is a” or “has a” problem. This same thing applies all through health data. As another example, is a disease primary or part of something else?
- Much of medical knowledge is held in semi-structured clinical notes and observations, making it very difficult to extract, organize, and store. For example, say we want to find out whether a patient had XYZ disease. If we just use a term search in the notes, we might find in the History section that Mom had XYZ disease (a useful fact in other contexts, but not in our current request). We might also see in the Observation section of the notes something like: “no indication of XYZ disease”. This “negation” would give us just the opposite “hit” of what we are after. However, it is useful, since even this “negation” would tell us that this patient was evaluated for XYZ disease and so we could exclude them. But what if there’s no comment about it? This “dunno” case makes it impossible to classify as “yes” or “no” (see next).
- Even binary states are typically not binary. In all too many cases, “dunno” is the third state. “Dunno” may be as important as Yes/No, True/False, Male/Female, etc. It may be “dunno” because of conflicting indicators, poor or missing measurements, or disagreement between experts.
- Health data is at least four dimensional: most data elements have a time aspect that must be retained. This contrasts with most data systems, which only require the most recent observation to be retained. For example, if we take a wound culture, the results for a given encounter may be (in order): pending, inconclusive, positive, and (after treatment) negative. We must keep each of these results along with the complete patient picture at a point in time to understand the full treatment. In other words, one may ask to see a snapshot in time of the encounter (hospital visit) when the result was “inconclusive” to see what treatments were given at that time. Or, we may look for all patients with positive wound cultures; the final result (negative, since the patient had been treated) is interesting, but on its own it does not tell us that they did in fact have a positive early in the stay.
- Health data screams for tagging (I love the idea of hash-tagging our health data like we do for Tweets and pics on Instagram). When we have cases of “dunno”, we might have SMEs (subject matter experts) weigh in via tagging. This will become extremely helpful as we tie genetic and physical traits (the genotype to phenotype) together. An expert could look at a patient or encounter or whatever and evaluate whether they merit a given tag such as “#penicillinAllergy” or “#PossibleAutism”. These data, though not part of the “permanent health record”, might be retained and greatly assist us in subsequent analytics and help us narrow down patient populations and cohorts.
- Patient-supplied data, especially with the Internet of Things for health, will also need to be stored. But how much weight can/should we give to these data? We might want to know if a BP was taken with an automatic FDA-approved home device or perhaps by Grandpa trying to hear the pulse using an old sphyg and stethoscope.
- Some experts have even proposed that part of our Health Data challenge is the social media content. Heck, Google can forecast a flu epidemic based on searches; and Facebook can provide some great historical facts. We can’t exclude these data, can we?
- Add to all this the privacy concerns; that is, that we can’t show information that can tie the data back to the individual without a definite “need to know”.
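
To make a couple of these challenges concrete, here is a minimal Python sketch of the wound culture example: it keeps every result with its timestamp instead of overwriting, and it answers “did this patient ever have a positive?” with three states rather than two. The class names, function names, and dates are all hypothetical, invented for illustration; this is not how any particular EMR stores its data.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import List, Optional


class Finding(Enum):
    YES = "yes"
    NO = "no"
    DUNNO = "dunno"  # no conclusive evidence either way


@dataclass(frozen=True)
class Observation:
    test: str
    result: str
    observed_at: datetime


def result_as_of(history: List[Observation], when: datetime) -> Optional[Observation]:
    """Return the most recent observation at or before `when`, or None."""
    earlier = [obs for obs in history if obs.observed_at <= when]
    return max(earlier, key=lambda obs: obs.observed_at) if earlier else None


def ever_positive(history: List[Observation]) -> Finding:
    """Three-state answer: YES if any positive, NO if only negatives, else DUNNO."""
    results = {obs.result for obs in history}
    if "positive" in results:
        return Finding.YES
    if "negative" in results:
        return Finding.NO
    return Finding.DUNNO  # only pending/inconclusive results, or none at all


# Hypothetical history for one encounter (dates are made up).
history = [
    Observation("wound culture", "pending", datetime(2015, 3, 1, 8, 0)),
    Observation("wound culture", "inconclusive", datetime(2015, 3, 2, 9, 0)),
    Observation("wound culture", "positive", datetime(2015, 3, 3, 10, 0)),
    Observation("wound culture", "negative", datetime(2015, 3, 10, 10, 0)),
]

latest = result_as_of(history, datetime.max)
snapshot = result_as_of(history, datetime(2015, 3, 2, 12, 0))

print(latest.result)           # negative
print(snapshot.result)         # inconclusive
print(ever_positive(history))  # Finding.YES
```

A system that kept only the latest result would report “negative” here and lose the fact that the patient was treated for a positive culture. Note that the sketch only covers structured results; pulling “no indication of XYZ disease” out of free-text notes is a much harder problem that it does not attempt.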
Big Data solutions such as Hadoop sound great. Schema-on-Read is wonderful but our problems are far more complex than just the data model. We must account for the challenges just listed and more! Hadoop is a start but not the panacea some think it is.
Maybe I have it wrong, but the Hadoop Hammer was developed to hit a much different nail. Key-Value is limited to getting the Key right. That’s easy when you have something well defined and discrete such as Source IP Address and associate that with a small number of values such as “web pages” to give us a report of the sites visited by a user (IP Address). In other words, data that are deep and narrow are prime candidates for Hadoop.
In health, we can’t typically even define the key well enough to send to map-reduce. In our ECG example, how do we key off ECG or EKG in order to find the associated values?
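
One way to picture the problem is a synonym table that maps each local term to a single canonical concept before anything gets grouped by key. The Python sketch below assumes a tiny, hand-built table based on the ECG example; the table, the function names, and the order IDs are hypothetical, and a real system would need a maintained vocabulary far larger than this.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

# Hypothetical synonym table: local term -> canonical concept.
SYNONYMS: Dict[str, str] = {
    "ecg": "electrocardiogram",
    "ekg": "electrocardiogram",
    "electrocardiogram": "electrocardiogram",
    "electrocardiograph": "electrocardiogram",
}


def canonical_key(term: str) -> Optional[str]:
    """Normalize a raw term and look it up; None means we don't know it."""
    return SYNONYMS.get(term.strip().lower())


def group_by_concept(
    records: List[Tuple[str, str]],
) -> Tuple[Dict[str, List[str]], List[Tuple[str, str]]]:
    """Group values under their canonical key; keep unmapped records aside."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    unmapped: List[Tuple[str, str]] = []
    for term, value in records:
        key = canonical_key(term)
        if key is None:
            unmapped.append((term, value))  # needs a human or a better vocabulary
        else:
            grouped[key].append(value)
    return dict(grouped), unmapped


# Hypothetical records as (term as entered, order id).
records = [
    ("ECG", "order-1"),
    ("EKG ", "order-2"),
    ("Electrocardiograph", "order-3"),
    ("echo", "order-4"),
]

grouped, unmapped = group_by_concept(records)
print(grouped)   # {'electrocardiogram': ['order-1', 'order-2', 'order-3']}
print(unmapped)  # [('echo', 'order-4')]
```

The grouping itself is the easy part. The hard part is the table: someone has to build it, keep it current as new terms appear, and decide what to do with everything that lands in the unmapped pile.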
Interoperability in health is a hot topic today. But I’d argue that intra-operability is a big challenge too; that is, “how do we deal with a myriad of IT systems in the same institution?” Since we use a variety of terms in the same hospital to define the same concept, just getting the systems under one roof to agree is hard! And no, I’m not going to weigh in on ICD-9 and the move to ICD-10, but folks can see the challenges there too, as something defined this week will be defined differently next week when we move to ICD-10 in the US.
I’ve been developing database systems for four decades and working as an architect of health data systems for over fifteen years. In future posts, I hope to dig into some of these problems and provide some thoughts on technologies such as: local software, cloud-based technologies, and one custom solution I am excited about… that might help us with our challenges in health.