Saturday, April 28, 2012

Database Design – Introduction

A few years back, I worked for a Value-Added Reseller (a “VAR”) that specialized in computer ID systems such as bar codes and the then-new RF tags.  We attended a show over at the Hanford Nuclear Reservation where one of the contractors expressed an interest in the new tags.  Apparently, they had recently had a public-relations fiasco when several of their drums for radioactive waste had been discovered at the local dump: battered, rusty, and empty.  Of course, the local media and populace were concerned that they had dumped radioactive waste in the local landfill.

After considerable research, it turned out that the contractor’s quality-control measures had rejected the drums upon preliminary inspection.  Foolishly, they sent them to the dump unused but with the radioactive symbol still prominently displayed, causing panic in the neighborhood.  Our prospective role was to help them label each drum as soon as it was received, using either bar codes or RF tags, so they could track the drums quickly and easily through their process and respond quickly with a full history of each one.

In database design, I see DB developers make this same mistake far too frequently; that is, they fail to tag their data or to design their schema in a way that facilitates location and retrieval.

In the next few posts, I’ll provide some guidelines that I’ve learned over the past three decades of building and deploying relational databases.  As a disclaimer, this is not a database doctrinal statement but a set of guidelines to consider.

One reason the waste contractor suffered the public-relations nightmare was that he had no quick way of distinguishing one drum from another.  To learn the history, he had to query the employees until he found one who knew why the drums were there and what their history was.  Often, databases have the same problem: developers have to run multiple queries before they can locate the data they’re after.

Guideline One – Every Table Should Have a Primary Key (and that key should be as small as possible and be efficiently sortable).
  • A Primary Key (“PK”) is like the barcode identifier on the drums: it enables the DB engine to locate a particular record in a given table.
  • Primary Keys are used by indexes to find a particular row; therefore, the key’s values must be unique within the table.
  • They should be easily and efficiently storable and sortable by the DB engine.  This almost always means a single column (even though multiple columns can be joined to form a composite primary key) and almost always means numbers, typically integers. 
  • GUIDs as keys (even the so-called “sequential GUIDs”) are almost never a good idea.
  • If you have a simple lookup table, use the smallest integer type you can; this improves query response time.  For example, the SQL Server engine can store eight times as many pointers in the same space when you use a tinyint instead of a bigint (1 byte vs. 8 bytes).
  • The simplest, and often the best, PK is an “identity” column, which the DB will auto-increment for each new record.
A common trap I’ve seen developers fall into is attempting to use the Primary Key to hold data such as social security numbers, employee IDs, or even names.  The best way to avoid this trap is to think of the Primary Key as an item that is used only by the DB engine and never by the developer.  In other words, the Primary Key should not hold data!
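To make these guidelines concrete, here is a minimal T-SQL sketch.  The table and column names are my own, invented purely for illustration: an auto-incrementing identity column serves as the PK, the social security number is ordinary data kept unique by a separate constraint, and a small lookup table uses a tinyint key.

    -- The surrogate identity column is the PK; it carries no business meaning.
    CREATE TABLE Employee
    (
        EmployeeId int IDENTITY(1,1) NOT NULL,
        Ssn        char(11)          NOT NULL,  -- data, NOT the key
        LastName   nvarchar(50)      NOT NULL,
        FirstName  nvarchar(50)      NOT NULL,
        CONSTRAINT PK_Employee PRIMARY KEY CLUSTERED (EmployeeId),
        CONSTRAINT UQ_Employee_Ssn UNIQUE (Ssn)  -- unique, but the engine never navigates by it
    );

    -- A simple lookup table: a tinyint key (1 byte) is plenty for a handful of rows.
    CREATE TABLE EmployeeStatus
    (
        EmployeeStatusId tinyint      NOT NULL,
        Description      nvarchar(30) NOT NULL,
        CONSTRAINT PK_EmployeeStatus PRIMARY KEY CLUSTERED (EmployeeStatusId)
    );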

Guidelines always have exceptions, and here are a couple:
  • Since a GUID (see my anecdote below) is globally unique, it is sometimes used in DB design to keep rows unique in a table that is distributed.  For example, you may have multiple distributed databases and tables holding the same data and use replication to merge the values back together later.  In this case, a globally unique key ensures that each unique row is replicated correctly to the various copies of the table.  As a designer/developer, you should understand, though, that this design is optimized for replication, not query performance.
  • Composite Primary Keys are fine in cross-reference tables.  For example, you might have a Patient table and an Encounter table, each with integer primary keys.  The cross-reference table, Patient_Encounter, might have only two columns, the PatientId and the EncounterId, combined as a composite Primary Key, which would support the many-to-many relationship between Patient and Encounter (see the sketch just below).
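Here are hedged T-SQL sketches of both exceptions.  The names are my own, and the junction table assumes the Patient and Encounter tables described above already exist with integer primary keys.

    -- Exception 1: a replicated table keyed by a GUID.  This is optimized
    -- for merge replication, not for query performance.
    CREATE TABLE ReplicatedNote
    (
        NoteGuid uniqueidentifier NOT NULL DEFAULT NEWSEQUENTIALID(),
        NoteText nvarchar(max)    NOT NULL,
        CONSTRAINT PK_ReplicatedNote PRIMARY KEY CLUSTERED (NoteGuid)
    );

    -- Exception 2: a cross-reference (junction) table whose composite PK
    -- is exactly the two foreign keys it relates.
    CREATE TABLE Patient_Encounter
    (
        PatientId   int NOT NULL,
        EncounterId int NOT NULL,
        CONSTRAINT PK_Patient_Encounter PRIMARY KEY CLUSTERED (PatientId, EncounterId),
        CONSTRAINT FK_PE_Patient   FOREIGN KEY (PatientId)   REFERENCES Patient (PatientId),
        CONSTRAINT FK_PE_Encounter FOREIGN KEY (EncounterId) REFERENCES Encounter (EncounterId)
    );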
Good database design begins with good primary key design.

An Anecdote:  What’s a GUID?

The formal definition of a GUID is that it’s Microsoft’s implementation of the standards-based UUID, or Universally Unique Identifier.  So why does Microsoft call it a Globally Unique Identifier instead of a Universally Unique Identifier?

Well, in fact, Microsoft started out using UUID and still has a Win32 API called UuidCreate.  But the new Star Trek had come out about this time, and there was a lively internal and external debate about whether these numbers were truly “universally” unique or just “globally” unique.  To appease developers both inside and outside the company (who were TOTAL geeks), they changed the moniker to “Globally” Unique ID.
Now, if we can all just agree that it’s pronounced GWID and not GOO’ ID, I’ll be happy.

Monday, April 2, 2012

Securing the Network – Part 1: Run Separate Networks

A few years ago, I worked as an independent Microsoft Solution Provider.  Microsoft put our company URL on their web site in the Solution Providers listings area, correctly touting us as experts in MS technology.  Unfortunately, the only “customers” who found and used the URL were bright hackers out of China.   I’ve always been impressed with that thinking:  to target the top echelon of technical expertise to see if you can get past their door.  In other words, to hack into a hacker’s system should be the ultimate challenge.

One Chinese fellow would leave polite notes on my desktop or sometimes hidden in a directory to let me know he had found a way in and left a little surprise or two for me somewhere.  We spent several months playing this “cat and mouse” game until either I secured the network well enough to keep him out or he went in search of different prey.

This was about the time I started travelling broadly for Microsoft, helping to build and secure large public networks, and I’d like to share some of the simple best practices that almost any small-to-medium business can deploy – especially hospitals and health organizations.

I’ll start in this post with one of the most powerful, and really quite simple, ways to secure a network:
Deploy two, three, or more discrete networks that are virtually and physically separated from each other.

For example, you might have an application network on 192.168.x.x that your general network users will use to interact with line-of-business applications.

On the 172.16.x.x network, you can host your management network.  This is where your SNMP traffic will travel, where logging traffic is transmitted or consumed, and where the system administrators will do their thing; in general, it carries everything that is neither application nor data traffic.

Finally, on the 10.x.x.x network, you may host your data network.  The data network is generally available only to machine and service accounts and carries database traffic, domain login traffic, and service-to-remote-service communication for things such as HL7 messages.  In other words, this network should almost never be available to human eyes.

Of course, the DMZ would have its own network too…

The key advantages of these separate networks are:
  • Smaller attack surface.  Application users with lower rights can only see items on the application network and are unable even to connect to the data or management networks.  Management users with elevated rights can see the whole network, but they are still isolated from the data and/or the applications.  The data network is generally available only to service accounts and machine accounts and is very isolated.
  • Traffic can be prioritized according to need.  Application users listening to Pandora will not impact the data network, and the transfer of a 500 GB log file across the management network won’t disturb the web-to-database traffic.
  • Configuration errors on one network will typically not impact the other networks.
  • Firewalls, SSL, and other deterrents geared for the application network often slow down the data network.  Isolate the data network and remove all the machinery the application network needs, and management, performance, and security will all be improved.
One closing tip: in the datacenter, it can prove very difficult to determine which network is which.  One way to solve the problem is to use color-coded network cables for Cat 5/6, or to tag the fiber cables with colored tags to differentiate one network from another.

Separate networks would probably prevent a high percentage of security breaches, both inside and outside the perimeter.