This and subsequent posts will discuss the advantages of storing and processing genomic data in a binary format versus ASCII. We will begin at a very high level and then dive all the way down to binary processing at the lowest level of the computer system. Feel free to skip this introduction if you want to dig straight into the details.
Introduction to Computer Storage
Data on a computer is stored as one of two states, "on or off" (physically, as voltage or charge levels in the hardware). These are the only two possible states; they are represented as zero and one, and this representation is known as binary. Each of these two-state units is known as a "bit": a bit can be either 0 or 1.
From this native state, we can build all the constructs from which programs operate. Most computers use an 8-bit byte as the fundamental memory-addressable unit. In other words, everything builds on this basic container, and all types derive from one or more bytes.
A byte can store 256 distinct values (2 ^ 8), from 0 to 255. For example, let's take a number such as 71; it has the bit representation 0100 0111.
But now you may ask: how can I represent a really big number such as one million? That number needs a container that holds at least 20 bits (2 ^ 20 = 1,048,576). Since computers use an 8-bit byte as their fundamental storage unit, we would need to allocate three bytes, or 24 bits, to hold a number that large. Therefore, on our computer system, the actual storage would look like this (ignoring endianness for now): 0000 1111 0100 0010 0100 0000. That is, we would read the three bytes and interpret the data therein as the number one million.
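The arithmetic above is easy to check; this short sketch (Python, using the built-in int methods) confirms that one million needs 20 bits and fits in three bytes:

```python
n = 1_000_000

# One million needs 20 bits...
print(n.bit_length())               # 20

# ...so three 8-bit bytes are enough; hex 0f4240 is
# 0000 1111 0100 0010 0100 0000 in binary.
print(n.to_bytes(3, "big").hex())   # "0f4240"
```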
The important thing to note here is that our program must know it needs to read all three bytes in order to reconstruct a number that large. This is key to understanding "typing" in computers:
KEY: A type in a computer is a construct that defines how many bytes a value occupies and how the bits in that container are to be interpreted.
NOTE: to be stored in exactly three bytes, our one-million number above would need a custom container, since most computer systems use 16-bit, 32-bit, and 64-bit containers of 2, 4, and 8 bytes respectively.
As an illustration, many systems have a numeric type called a ULONG, which is a 4-byte container with 32 bits. This container will hold numbers as large as 4,294,967,295 (2 ^ 32).
SIDEBAR: Did anyone do the math on 2 raised to the 32nd? If so, you got a different number! Your calculator shows 4,294,967,296. Correct. But remember, we also need to represent zero, so the range runs from 0 to 4,294,967,295: that's 4,294,967,296 distinct values with a maximum of 4,294,967,295.
Note that this is an unsigned number. What the heck is that? Well, if you want to represent negative numbers, you need something to indicate the sign. To do that, we use a "sign bit." But now we only have 2 ^ 31, or 2,147,483,648, values on each side, since one bit is spent on the sign. So our typed 4-byte container for a signed number, known as a LONG on most systems, holds values ranging from -2,147,483,648 to 2,147,483,647 (zero is counted on the positive side).
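A quick way to see these limits in practice is Python's struct module, which packs values into fixed-size containers ("I" is the unsigned 32-bit container, "i" the signed one):

```python
import struct

# Unsigned 32-bit range: 0 .. 4,294,967,295 (all bits set at the maximum).
print(struct.pack("<I", 2**32 - 1).hex())   # "ffffffff"
print(2**32 - 1)                            # 4294967295

# Signed 32-bit range: -2,147,483,648 .. 2,147,483,647.
print(-(2**31), 2**31 - 1)                  # -2147483648 2147483647

# A value just past the signed maximum no longer fits the container:
try:
    struct.pack("<i", 2**31)
except struct.error:
    print("too big for a signed 32-bit container")
```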
Now, the thing to note is that when we use this type (container), we must allocate space for the largest number we might encounter; the rest of the space is essentially wasted. For example, the 4-byte container we chose for our one-million value will look like this when it holds the number one: 0000 0000 0000 0000 0000 0000 0000 0001.
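A two-line sketch makes the padding visible: the same value, one, in a 4-byte container versus a 1-byte one:

```python
# The value one in a 4-byte container: 31 of the 32 bits are zeros.
print((1).to_bytes(4, "big").hex())  # "00000001"

# The same value fits in a single byte.
print((1).to_bytes(1, "big").hex())  # "01"
```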
KEY PRINCIPLE: Space and computational power are wasted when we use containers that are larger than we need.
I just illustrated the space problem: if the largest number we will ever need to store is 100, there is no need for a LONG, since all those zero bytes are just wasted space.
OK, space is cheap, right? Well, all those extra zeros still need to be read from storage and sent over the network, consuming those resources. Just as important, all those unneeded zeros must be held in "hot" memory on the machine and paged out when memory becomes constrained. This wasted space therefore has a ripple effect on system-wide performance.
At the lowest level, work is done on the xPU registers (“x” can be “C” as in CPU; or “G” as in GPU). These have a finite size and so wasted space increases the memory transfers as data is shifted in and out of the registers and local RAM. This fact is key to the proposed Genomic architecture that we will discuss in future blog posts.
Now, some things that are interesting and have only a peripheral relationship to our work:
At the lowest level, we only have bits with two states: one or zero. Next, we assemble those into a byte of 8 bits, and this is the fundamental unit by which data is accessed. In other words, memory addresses refer to bytes on all systems, and those bytes are constructed the same way on all computers. For example, the number 71 is represented by a byte with the bit pattern 0100 0111 on all computers.
Where things get confusing is when we need to assemble our bytes into types. Specifically, when we start typing our data using more than one byte, we must consider the computer platform's "endianness"; that is, whether it is "big-endian" or "little-endian". (OK, there are mixed-endian systems too, but I'll leave researching those to the reader.)
To understand the endians, we first need to understand what the most significant digit is. We typically write a number such as one million as 1,000,000; here the most significant digit (the millions) comes first, and the smallest, least significant digit comes last. So we write our numbers in big-endian fashion, with the most significant digit first.
If we were to fire up Calculator in Windows and put in our decimal value of one million, it will show the following byte format for a 4 byte word:
0000 0000 0000 1111 0100 0010 0100 0000
Now what makes this crazy confusing is that this is shown in big-endian with the most significant bytes first; but on Windows these bytes are stored in little-endian order! It kinda makes sense for Windows to display the bytes this way since programs display our number in big-endian with the millions over to the left and the single values to the far right.
KEY: Most PCs with x86 architecture (Windows and Linux in general) put the least significant byte first in the byte stream and are little-endian. Therefore, the bytes are stored in the opposite order from what Calculator (or any big-endian rendering) shows.
Where endianness becomes a problem is when multi-byte data is written to a file on a system with one endianness and then read (typed) on a system with the other. For example, if we save a LONG to a file-based byte stream on one system and then read the same stream of bytes into a LONG on a system with the opposite endianness, the number will be bogus.
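Here is a minimal sketch of that failure mode using Python's struct module: we pack one million little-endian (as an x86 PC would store it) and then read those same bytes back as if they were big-endian:

```python
import struct

n = 1_000_000
little = struct.pack("<I", n)   # as stored on an x86 PC: 40 42 0f 00
big = struct.pack(">I", n)      # big-endian order:       00 0f 42 40
print(little.hex(), big.hex())

# Reading little-endian bytes as if they were big-endian yields garbage:
(wrong,) = struct.unpack(">I", little)
print(wrong)   # 1078071040, not 1000000
```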
Back to our Calculator output for the number one million: 0000 0000 0000 1111 0100 0010 0100 0000, which is shown in big-endian. On a big-endian system, the bytes will be stored in that same order in the file: 00 0F 42 40.
However, on our little-endian PC system, the bytes are stored in the opposite order, with the least significant byte first: 40 42 0F 00.
To think of it a different way: if we wrote our decimal number one million with the least significant digit first, it would look like this: 000,000,1. (The analogy is loose, since endianness reverses bytes rather than digits, but it captures the idea.)
So what’s the solution for genomic data? Create a custom type that is platform agnostic but can still take advantage of native types.
SIDEBAR: The UCSC genomic 2-bit format uses a multi-byte signature to make it easy to determine whether the platform is big or little endian. From their docs: “If the signature value is not as given, the reader program should byte-swap the signature and check if the swapped version matches. If so, all multiple-byte entities in the file will have to be byte-swapped. This enables these binary files to be used unchanged on different architectures.”
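A minimal sketch of that byte-swap check, using 0x1A412743 (the signature documented for the .2bit format); the function name needs_byte_swap is my own:

```python
import struct

SIGNATURE = 0x1A412743  # magic number from the UCSC 2-bit spec

def needs_byte_swap(first_four_bytes: bytes) -> bool:
    """Return True if the file was written on an opposite-endian machine.

    Sketch of the check the spec describes: interpret the signature in
    our native byte order; if it doesn't match, try the swapped order.
    """
    (value,) = struct.unpack("=I", first_four_bytes)
    if value == SIGNATURE:
        return False
    (swapped,) = struct.unpack("=I", first_four_bytes[::-1])
    if swapped == SIGNATURE:
        return True
    raise ValueError("not a 2-bit file")

# A file written in the opposite byte order reads back byte-swapped:
native = struct.pack("=I", SIGNATURE)
foreign = native[::-1]
print(needs_byte_swap(native))   # False
print(needs_byte_swap(foreign))  # True
```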
SIDEBAR 2: When working at the lowest level (the fastest) on xPU registers, endianness doesn't make a bit of difference. A 32-bit xPU register simply holds 32 bits of data, so there is no endianness at that level. This is key to our genomic tool design (it is platform independent), which is discussed in future posts.
We talked about numbers, but what about words and letters?
Just as we need to type (a verb) our bits into numbers, we also need to type them into letters and eventually words. How’s that work?
By convention, we use an 8-bit byte to represent each character of the standard ASCII character set. This is a "single byte" representation. The letter "T", for example, is ASCII code 84, represented in binary as 0101 0100. In the early days, 7 bits were enough for all of the characters and control codes; however, we quickly moved to using the full 8 bits of each byte in what was/is known as extended ASCII.
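We can confirm that byte pattern directly:

```python
# 'T' is ASCII code 84, which is the bit pattern 0101 0100 in one byte.
print(ord("T"))                  # 84
print(format(ord("T"), "08b"))   # "01010100"
```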
KEY: Every letter in the ASCII alphabet consumes 8 bits if we use the alpha representation.
There are several implications for us in this. Here are two key thoughts:
Numbers as Characters
One way to "cheat" typing is to leave everything in its ASCII representation and convert it at run time. Let's see how that might work for our one-million LONG number.
As a LONG, this value will take four bytes and will thus consume 32 bits of bandwidth.
On the other hand, if we use the ASCII representation "1000000", that is 7 bytes, or 56 bits (one byte for each character): almost twice as large! Then again, if we sent the number as "1", that would be only one byte, so we'd use just 25% of the bandwidth a LONG requires.
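The size comparison is easy to verify, again leaning on struct for the 4-byte LONG:

```python
import struct

n = 1_000_000
as_text = str(n).encode("ascii")     # "1000000" as bytes
as_long = struct.pack("<I", n)       # 32-bit binary container

print(len(as_text), "bytes as ASCII text")  # 7
print(len(as_long), "bytes as a LONG")      # 4
print(len(b"1"), "byte for ASCII '1'")      # 1, i.e. 25% of a LONG
```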
This should illustrate why the data developer should evaluate their data for storage and transmission and decide when to type it.
Characters as Numbers
Remember our rule: we should store our types in the smallest possible container to reduce hardware load and processing time. Our genome comprises ~3.2 billion bases, which are typically represented in ASCII format. For DNA, these are typically "A", "C", "T", and "G". So we have a byte-sized container that can hold 256 values, but we only need to store four. Why not use a smaller container? That's the premise of the 2-bit format we introduced earlier. It's also part of the newish CRAM format.
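To make the premise concrete, here is a hypothetical 2-bit codec sketch: four bases per byte, a 4x saving over one ASCII byte per base. (The base-to-bits mapping below is my own illustration; the real UCSC 2-bit format defines its own mapping and adds headers, masks, and N-run records.)

```python
# Hypothetical 2-bit DNA codec: each base takes 2 bits, so four bases
# pack into one byte.
ENCODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
DECODE = {v: k for k, v in ENCODE.items()}

def pack_bases(seq: str) -> bytes:
    out = bytearray()
    for i in range(0, len(seq), 4):
        group = seq[i:i + 4]
        byte = 0
        for base in group:
            byte = (byte << 2) | ENCODE[base]
        # Left-align a partial final group so it still decodes in order.
        byte <<= 2 * (4 - len(group))
        out.append(byte)
    return bytes(out)

def unpack_bases(packed: bytes, length: int) -> str:
    bases = []
    for byte in packed:
        for shift in (6, 4, 2, 0):
            bases.append(DECODE[(byte >> shift) & 0b11])
    return "".join(bases[:length])   # drop padding from the final byte

seq = "ACTGACTGAC"
packed = pack_bases(seq)
print(len(seq), "bases ->", len(packed), "bytes")   # 10 bases -> 3 bytes
print(unpack_bases(packed, len(seq)) == seq)        # True
```

Note the decoder needs the original sequence length, since the final byte may carry padding; real formats store that length in a header for the same reason.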
As I mentioned in an earlier post, the next few posts will give specifics on the work I've done evaluating a binary format with low-level processing to see if we can improve performance; it was also a fun intellectual exercise to learn CUDA programming.