So I have been involved in a few endeavors lately that have made me realize something that I often do second nature and take for granted -- good database design. I don't know what I've thought throughout my career thus far, but I guess I just always figured that everyone thought like I did and ensured the delivery of a well-designed, robust, database model as persistence for their software. Apparently I figured wrong.
I am running into situations where a developer had both the complete opportunity and full control to design a robust data model, but didn't. Why? Well, we could get into an interesting discussion on top-down design vs. bottom up design, ORMs, and the like. However, that is a subject for a future post (er, perhaps many future posts). We could also take into consideration performance implications of certain database designs versus others, more on this in a bit, below. My observations have led me to the assertion that, for whatever the reason is, sometimes there are developers who just don't do good database designs. In line-of-business apps, and especially those that are enterprise-class, and of strategic importance, perhaps nothing is more important than a good database design. In this post, I will explain why this is with a simple example:
Imagine a case where you are designing perhaps a supply chain management system and you need to cache supplier part information in your database for performance reasons (as opposed to querying the supplier's remote data repository directly). You start building your app by getting an Excel sheet of suppliers from your vendor management folks who work with the suppliers. It has columns like so:
Supplier Name, Supplier Address, Supplier CSZ, Supplier's Item #, Supplier's Price
It certainly might be tempting to upload / import this data right into a single database table as is and begin working with it. However, there are clearly problems with doing this:
- Redundancy of Data. Note that for every row (sometimes, in database theory called a "tuple") that contains a different supplier item, the supplier's address repeats. This creates duplicates in terms of the name, address and CSZ fields.
- Brittle Data Consistency. As a direct consequence of the repeated address, it is conceivably possible that should the supplier's address change, an update is performed inconsistently (perhaps only to 1 row). This would result in duplicate addresses for the supplier -- we physically have two addresses for the supplier when we know logically we should only have one (let's pretend this isn't a multi-site supplier). In database theory you'll sometimes hear this problem described as "update anomalies".
- Insertion Anomalies. Interestingly, if you think about it you might note that if we make the table field Supplier's Item # and Supplier's Price required (NOT NULL), then we can't record the address of a supplier if that supplier doesn't currently supply us with at least one item. I suppose we could be stuff garbage in the fields, but that just devalues our database and our system. We could make the item number and price fields non-required (NULL), so we could store a row with the address. But, would we remember to delete the "hacked" address row when the supplier finally starts supplying us with items? And, if we do this, hopefully you can see that we can easily run into a problem if data gets into the database with a item number specified but no price, or vice versa.
- Deletion Anomalies. Imagine what happens now when we decide that we need to delete all rows containing supplier item and price information, because the supplier no longer supplies us with items? What happens to the supplier's address information?
It is easy to see why we need good, well-designed database models from this example. I've only scratched the surface but if you're interested pick up a classic text on relational theory. So now what?
The way to arrive at a good design with our example is to decompose (break down, split apart) the Excel sheet into a set of multiple tables, like so:
Table: Suppliers (Columns: SupplierKey, SupplierName, Address, City, State, Zip)
Table: Items (Columns: ItemNumber, SupplierKey, Price)
I've done a few things here. First, I've made a decision to store the data in two tables in my design. The first table is called "Suppliers". I've also added a contrived key to the Suppliers table, called "SupplierKey". A contrived key is a key you make up, or generate, (or have generated for you), when a good natural key is not present. A natural key is a semantically clear identifier for an entity being represented by the table. So, an example might be the state driver's license bureau where information about you is kept in a Driver table using your driver's license number as the natural key. In this example, I've created a contrived key because names, like our supplier name, typically don't make good key fields.
Second, I broke out the City, State, and Zip fields.
Finally, I have an Items table, and have moved the ItemNumber and Price fields to it. ItemNumber in this case is a good natural key. I also added what is known as a foreign key field, named SupplierKey. The SupplierKey foreign key field in table Items is the same SupplierKey field in table Suppliers (same data type). This will be used to relate the two tables together to perform queries.
The resultant design now solves all four problems mentioned above. The process of decomposition in database design is called "normalization" and we say that each table in a database design or model is in a state of one of the "normal forms". The basic idea behind normalization is to identify your entities (typically nouns in your problem domain), and create tables for each. Then, you identify attributes for your entities, and create columns for each. You want to validate that each of your entity attributes are facts about and depend upon the entity, and not other attributes or combinations of other attributes. The process is rounded out by creating links or relationships between entities. A full and appropriate treatment of normalization is outside the scope of what I wanted to do here, not to mention it's been done by others many times over. Aside from grabbing a hold of any of E.F. Codd's work (E.F. Codd is considered the "father of normalization"), perhaps one of the early and great articles on normalization was written by William Kent, entitled "A Simple Guide to Five Normal Forms in Relational Database Theory", and located here. A good summary and jump page to check out as well would be wikipedia, also. Thus, if you've never taken the time to study normalization, I can assure you that now is the time. You will reap great benefit from the effort. Normalization: live it, learn it, love it. 
So, as mentioned earlier, what if you fully understand the process of normalization, the normal forms, etc., and choose not to normalize (or use a relatively denormalized design) because of performance reasons? And just exactly what are the performance concerns? Classic database and relational theory taught that when you write database operations via the database engine's DML (SQL) and had to join tables, that the joining of tables was an expensive operation. It is my belief that we have progressed, especially with innovative database management systems such as Microsoft SQL Server, well past this and it isn't nearly as true today as it once was when initially proclaimed. The benefits of normalization for both transactional and data warehousing scenarios typically outweigh any [perceived] performance gains. I found a nice point on this topic on this IBM treatment of normalization:
In practice, many databases are de-normalized to greater or lesser degree. The reason most often stated has to do with performance -- a de-normalized database may require fewer joins and can, therefore, be faster for retrievals.
While this reasoning may be true, the usual caveats against premature optimization apply here as well as everywhere else. First, you should determine sufficiently that a performance problem exists and that the proposed de-normalization improves it before introducing a conceptually suboptimal design.
Furthermore, a de-normalized schema can be harder to update. The additional integrity checks that are necessary in this case may offset the performance gains for queries obtained through denormalization.
And another nice point, here:
Well normalized data makes programming (relatively) easy, and works very well in multi-platform, enterprise wide environments. Non-normalized data leads to heartbreak.
In summary, I hoped I've inspired some of you to think about your database model design more thoroughly, and to check out the topic of database normalization. Until next time...