Counts representing Items and the impact on Data Compression and Encryption

All things in computers are identified in counts of digits (called bits) which are either on or off.  In this post, we’ll discuss how counts (discussed in the previous post) are used for identification.

It’s easier to understand this if we think of a continuous line of bits that are either on or off (1 or 0).

We need to use a lookup table to translate this line of bits into something.  This lookup table is what we use to define:  our data, compression, and encryption.

So if you think about it in the most basic way, we need a table with index keys that translate a count (a number) into something else such as a letter.

The ASCII table is just that.  It uses counts of 8-bit sequences (a byte) to represent each letter.  The Unicode character table is similar, but uses multiple bytes and bits to represent the same characters.

Compression

Computer compression is accomplished when we use a different lookup table for our things that results in smaller storage requirements.

So compression is first achieved by using the smallest number of indexes to look up our discreet things (such as Letters).

Compression can also be accomplished when have repeating sequences of our smaller things and we represent those in our lookup table.

Let’s use an example to make this clear.  Say we have our four base DNA things which are represented by “A”, “C”, “T”, or “G”.

By typical convention, we use the single-byte (8 bit) ASCII character set so each character is represented by lookup value that is 8 bits long.

The KEY POINT is that we take our long string of bits, and knowing that our lookup values are 8-bits long, we break that stream into 8-bit buckets in order to get our lookup index values which we can later represent as our DNA bases.

In the ASCII table:

A is represented by decimal count of 65 or the following 8 bits: 0100 0001
C is represented by decimal count of 67 or the following 8 bits: 0100 0011
T is represented by decimal count of 84 or the following 8 bits: 0101 0100
G is represented by decimal count of 71 or the following 8 bits: 0100 0111

So our bits are stored on the system as: 01000001010000110101010001000111

Because we decided on 8-bit buckets and the ASCII lookup table, we know that in order to read (that is, to translate this bit stream back to our DNA sequence of ACTG), we need to break our stream into 8-bit sequences so we can use those counts as a lookup in our conversion table, ASCII.

But, as you can see, we are using 8-bits which give us 2^8 or 256 index locations.  Why use an index this large when we can get by with only four indexes of 2 bits?

Now let’s say we make our own conversion table that has the 2 bit representation as follows:

A is represented by a count of zero or the following 2 bits:  00
C is represented by a count of one or the following 2 bits:   01
T is represented by a count of two or the following 2 bits:   10
G is represented by a count of three or the following 2 bits: 11

Now, our bits are stored on the system as: 00011011

As you can see, this is much smaller and so we have achieved compression!

The key point is that we must take this stream of bits and KNOW that we have two-bit indexes and that instead of breaking at 8-bits as we did for our ASCII lookup table, we will break at 2-bits for our new custom table.

Of course, when we read our bits, we’ll need to transform them to the correct thing. So in our ASCII program, we’ll know that “A” is represented by 01000001 but in our custom 2-bit program, the same “A” is represented by 00.

Taking this a step further, what if we have a number of repeating sequences of characters? For example, what if we have something like AATAAGAAC?

We want to put the sequence of “AA” into a new index location since we could have one location for those two letters.  However, at 2-bits, all of our index locations are used up.

So, we might now declare that our new lookup table is 3-bits instead of 2-bit and would look like this:

A is represented by a count of zero or the following 3 bits:  000
C is represented by a count of one or the following 3 bits:   001
T is represented by a count of two or the following 3 bits:   010
G is represented by a count of three or the following 3 bits: 011

Using our 2-bit to represent our example sequence of AATAAGAAC would take 18 bits of storage (2 bits times 9 things).

Using our 3-bit to represent our example sequence would take 27 bits of storage (3 bits times 9 things).

However, if we add a frequently repeating sequence such as “AA”to our lookup table at the index position of four represented by the following 3 bits: 100, our final bit stream would have 18 bits total.  Let’s look at the bit stream to see why:

Using our new compressed, repeating segment, lookup table, it appears like this:

A is represented by a count of zero or the following 3 bits:  000
C is represented by a count of one or the following 3 bits:   001
T is represented by a count of two or the following 3 bits:   010
G is represented by a count of three or the following 3 bits: 011

AA is represented by a count of four or the following 3 bits: 100

Using this, we can translate our sequence of AATAAGAAC into our bit stream:

AA: 100
T:    010
AA: 100
G:   011
AA: 100
C:   001

KEY POINT:  Determining optimal data compression is difficult.  Your goal is to use the smallest lookup table to represent the data.  However, you must realize that when you increase the size of the lookup table to accommodate repeating sequences, EVERY element will need to be represented by this larger index size.  For example, in our original 2-bit format where each character is represented by 2 bits, we can store our sequence in 18 bits.  We can also store the same sequence in 18 bits since we have three repeating sequences (the “AA”) represented.

However, if we did NOT have these repeating sequences; or there were fewer than the three we used in our example, the 3-bit format would be larger and we would not have compression but data inflation; that is, we achieve just the opposite of what we intended.

DNA Application:  The DNA applications that use data compression use compression routines that were generally devised for large and regular repeating words or phrases in text.  These do not work as well for the randomly repeating segments in DNA.

Lossless vs Lossy

Medical data must typically be lossless so what’s that?  Well, with the lookup tables we just described, the data in will be exactly what we’ll see in data out since we have a one-to-one lookup.  This is lossless.

Lossy is when we have one lookup value represent more than one value.  The best example of this is color.  We might have a color value of red of a particular hue that is represented on the computer by a large and precise number (count).  But for our purposes, we may be satisfied with a less precise hue.  So we might take all of the numbers that represent red very precisely in the larger-number format; and put them into a lookup table where “red” is represented by a single lookup.  We may have thousands of variations of red represented by one lookup of “red” in our smaller (compressed) table.

The key point here is that you cannot go back to the original since there is no way to know what precise hue of red you have when just one value represents red in our smaller and compressed table.  We have “lost” the ability to go back to the original.

Encryption:

Thinking about compression, encryption should now be easy to understand.  All things in computers start with a bit stream.  The system will break the stream into index positions that will look up things from a table.  All encryption does is redefine the table.

Let’s take our 2-bit example above.

We said that our table would look like this:

A is represented by a count of zero or the following 2 bits:  00
C is represented by a count of one or the following 2 bits:   01
T is represented by a count of two or the following 2 bits:   10
G is represented by a count of three or the following 2 bits: 11

So if we agree on this translation, I can send you 00011011 and you can use the agreed-upon lookup table to get:  ACTG.

Now, wink-wink, let’s say that you and I agree on a different table.  Our table looks like this:

T is represented by a count of zero or the following 2 bits:  00
G is represented by a count of one or the following 2 bits:   01
A is represented by a count of two or the following 2 bits:   10
C is represented by a count of three or the following 2 bits: 11

I will send you 10110001 and you will know, since you have the same table, that I mean: ACTG.

But if someone else uses the original table, they will see:  TGAC.

So with encryption, we simply agree on our new, secret, lookup table and use that for our translation of things.  Simple really.

Leave a Reply

Your email address will not be published. Required fields are marked *

fifteen + 10 =