Sunday, February 14, 2016

Memory Mapped Files For Large Datasets

I'm a fan of memory mapped files and think that they can play a role in processing large datasets on computers with limited memory.  The following is a copy of an internal email interchange I had with a couple of other engineers that I think describes the benefits.

Memory mapped files are fast in both .NET and JAVA but super fast using pointers in C or C++.

Here's the email thread:


Hadoop reads entire files and then parses the data.  Hadoop keys tend to be single fields of data which makes the Map-Reduce process slow and complex for large (wide) data sets such as those in health.  The following will show a solution to both problems using memory-mapped-files which only reads the key fields from the file and never loads the entire file; and uses smaller composite field hashes for the key data to make it much faster to search and filter across large text fields such as gene sequences in wide and deep data sets.

After playing with Hadoop and other big data solutions, I have come to the conclusion that we will need the ability to do additional processing of large data sets AFTER we pull them from Hadoop.  Even if we were to hire a team of map-reduce java developers, we’ll continue to need the ability to handle some level of pattern-matching across our result sets.  Genomic data, primarily represented as large string sequences, makes this even more challenging.


After installing Windows 10 on my laptop I put an SSD in the machine and was surprised that Win10 would start in about 5 seconds. 

SSDs are faster than spinning drives since they don’t have to wait for the seek heads to find the data on the platter(s).  But that would not fully explain the quick load times.

It turns out that Windows now supports memory-mapped-file access of the executables.  What this means is that the OS at kernel level, will map a large file into memory and only load what it needs when it needs it.  In the old days, Windows would load the entire file into memory and then map the executable library from the file into memory and execute it (kinda like Hadoop across the HDFS).  This explains the super-fast load times for Windows and has merit for our work too.

One of the problems with Hadoop is that it scales-out and uses a brute-force method to achieve performance:  lots of servers, disks, and memory.  Hadoop was written in JAVA and so it’s slow (folks are moving core functions to c/c++ but that work is still early).  To make matters worse, Hadoop must read/parse lots of HDFS files across the machines from mostly spinning drives; and then write interim results across the network and to other machines and drives.  This is a ton of file I/O.

Another problem I have seen with Hadoop is that it is a key-value system at it’s core.   It was written to have a known key (index) from which to do the initial search (the map phase) of results before being sent to the reducer(s).  Unfortunately, Hadoop does not work very well when we have large keys (such as genomic data); or when we have a variety of columns that we can use for keys.

And since Hadoop reads the full files into memory AND THEN parses them, the larger the files become, the slower the system runs.  This is what happened to Windows: the executables became so large, it took forever to load the OS since the files had to be read from disk into memory and mapped before they could be executed.  Hadoop solves this by scaling-out: adding more machines, drives, and memory (and of course complexity and expense).  Windows solved it by using memory mapped file.

I suggest that our solution would be similar to what Windows has done: use memory-mapped-files for very fast processing.

Memory-Mapped-Files in a Nutshell:

A memory-mapped-file is a file that is mapped into kernel address space as it sits on the drive.  No need to read the entire file into memory (saving RAM).  Since it runs at the kernel mode, it’s super fast and efficient. If we know a value is at a given position, we point our program to that position in the file and the contents from that are read into memory. 

So Memory-Mapped-Files can be a very fast technology to help us with our post-processing of large data sets (see example below).

Key Problem: 

But we still have our “key” problem; from our key-value technology.  Hadoop will open a file and map, into the reducer, all the matching “rows” that have the key we are interested in.  This means every file has to be read in toto, and each “row” parsed for our value.  But what if we have large “key” data such as genomic data; or if we want to use multiple “columns” as our key?  The solution to the problem is to use a Hash algorithm.  This is exactly what relational databases (SQL Server, Oracle, etc.) use to determine if a row is “dirty” or has changed.  This is how files are verified that they have not changed or been tampered with.  For us, this will work very well…

Say we want to “key” on a genomic string, say CGGGTCTGACCTGAGGAGAACTGTGCTCCGCCTTCAG and we then want to link that to another field such as MRN, 123456789.  We could map on either the genomic string or the MRN but we want to use both.  The solution is to concatenate these fields into a single string: CGGGTCTGACCTGAGGAGAACTGTGCTCCGCCTTCAG123456789 and then HASH that value (most modern CPUs support the hashing we use in the CPU so this is super-fast).  The hashed value is a much smaller binary size (typically 4 bytes instead of the ~ 50 bytes originally) so it’s much more efficient to store and search for.  Yes, we might have collisions, but those would be filtered out in the reduce phase.  Now, we can key/search across multiple columns using a terse hash.

Putting it All Together:

So the experiment was this: 
·         take a deep and wide dataset (the  Clinical Event table that has been denormalized) and hash composite columns to create a new Hash-key column (4 bytes wide). 
·         Export this table to a new binary file and memory map the file to see how quickly we can find a set of keys (matching columns). 
·         Compare that to using SQL Server with a composite key to find the same results.

Results and Observations:

The denormalized Clinical Event table with the hash keys ended up being about .5 billion rows with each row about 80 columns wide.  The final data files was ~ .5TB in size.  As you can imagine, it would take the OS a long time to read the file and even longer to then parse out the matching values.

I pointed a custom memory-mapped reader at the file and asked it find all the “rows” with a matching key (representing our genomic string and MRN).  It returned the results in single-digit seconds; typically less than 5 seconds.  This was on a host machine with spinning drives and I would expect this to be much faster when the data is hosted on SSDs.

I compared that to the relational database.  First of all, the relational database had to store the index information which added greatly to the size of the table.  Next, it had to load the large index from the file system before it could execute the query.  The index load took many minutes; however, once the indexes were loaded (into a machine with a ton of memory), the queries returned in a handful of seconds; but took much longer than the MMF solution.

Net/Net:  Memory-mapped-files of large data sets are orders of magnitude faster to process than even the most powerful relational database.  Plus, we can emulate huge systems with tons of RAM on a commodity machine sitting on top of relatively inexpensive SSD drives (Here’s a great read from MSR on this topic and another similar read).

--- I received these questions from one of my peers:     1) Is it possible to have an index around the hashkey, or is this unnecessary due to memory mapped architecture? 2)  As I understand Hashing algorithms do not guarantee unique hash value, so a few non-matching rows may occur in large datasets that happen to have the same hash. Those could easily be managed in a reduced fileset.  Any concerns with this?

To which I answered:

Yes, the “index” around the hashkey would really be a dictionary with the “key” being the hashkey value; and the values being the offsets in the file.  The cool thing about this is that we can create a persisted “index” and say, stick in a buffer at the start of the file; or, we can create any number of these ad-hoc and store them independently.  So we could have a virtual table of diagnosis codes that would simply be a file based dictionary/hashtable with the diagnosis code as key and the “row” offsets pointing to the matching rows.  This would be extremely terse and very fast to build. 

We could also use this as a way to cross-join by using the “row” offsets that match.  For example, say our dictionary/hashtable/index has MRN as the key and the values are the “rows” with those MRN values.  We might also have another dictionary with diagnosis code associated with those rows.  To find a match, we pull back the “rows” (offsets in the MMF) for each set and do an inner-join on those.  Once we have the offsets, we just point them to the MMF and pull the “rows” and the “columns” for those.  Since the MMF is so fast, pulling the data using the offset happens in milliseconds.

They key to this technology is using SSD drives as virtual RAM.  So now we are scaling up; but instead of using expensive and fast RAM, we use relatively inexpensive and a bit slower SSD.    Since an x64 system can theoretically map to 4 PB of RAM, we could put a collection of SSDs on there and let the MMF architecture map that into kernel-addressed memory space.

As for #2, you are absolutely correct that there could be hashing “collisions” so that there might be some non-matching rows.  Therefore, when we build our dictionary/index or do the data-pull, it would be important to check the values that were hashed to insure they are pulling the correct rows.

One thing I didn’t dive into but probably should at this point is describe a “row” and/or “column”.

At the start of the MMF, I hold the “table” metadata.  This is the length of each row (say 100 bytes) and the name and location (offset) in that buffer of each “column”; and I’ve even played with typing the columns.

Therefore, we store binary in bytes but can find and type them easily on the way out.

So let’s say we want to find a “row” with the MRN “column” with a certain value.  We look at our metadata table and see that each “row” is 1000 bytes long; and in those 1000 bytes, we might find MRN in offset 20 that is 10 bytes long and is returned as an INT.  To find all MRNs, we start at the beginning of the data, add the MRN offset and size, and then just loop through the file adding the row length on each iteration.  As you can see, this can really help when we have wide (think de-normalized) data sets.  I’m shocked at how incredibly fast this is.

Let’s say that our “index” or dictionary shows that we have a certain MRN if 123456789 at Row (offset 9999).  We would know that our complete “row” starts at memory location 9999, and all of our “columns” defined in the header are held in the 1000 bytes so we can pull those bytes and then populate a “row” type by reading those bytes into our type.  As a result our “MRN” would be read from 9999 + 20 (the offset) to 9999 + 20 + 10 (the length of the MRN). In code, we could then “type” the binary to our int.

For my test, I used a denormalized version of Clinical Event, where I did the code-value lookups.  Even without an index, I was able to search for a given value across the huge file in a matter of seconds.

No comments:

Post a Comment