Good Database Design, Revisited

By erik at February 23, 2010 14:25
Filed Under: Architecture, Design

Once again, I find myself feeling bad for not writing recently.  I guess the duties of family, job and the holidays got the best of me these last many months.  I hope to write more frequently in the next several weeks and months…

Anyway, I came across an article related to one of my initial posts that I felt compelled to share, with the hope that you will read it here.  In this article entitled Seven Deadly Sins of Database Design, Jason Tiret of Embarcadero Technologies cites seven ‘deadly’ sins that one can commit when designing a physical data model (many of these points also apply to conceptual/logical models, too):

  1. Poor or missing documentation for database(s) in production
  2. Little or no normalization
  3. Not treating the data model like a living, breathing organism
  4. Improper storage of reference data
  5. Not using foreign keys or check constraints
  6. Not using domains and naming standards
  7. Not choosing primary keys properly

I think Jason has nailed many of the additional core issues with planning and designing physical data model implementations, above and beyond my initial post on design and normalization.  Thanks for the great work, Jason!

Finally, Sweet Validation…

By erik at October 13, 2009 00:40
Filed Under: Architecture, Design, Software Development

Hello all!  Sorry I have been away for several months – it has been a busy spring, summer and now fall here in northeast Ohio.  I thought I’d drop a quick post today highlighting a great piece of research that I just finished reading.

Those of you who know me know what a great fan I am of Conway’s Law.  Having worked for and consulted for many organizations over my career thus far, I have found Conway’s Law to be a solid, immutable truth.  For those of you not familiar with Conway’s Law, it essentially states that:

…organizations which design systems ... are constrained to produce designs which are copies of the communication structures of these organizations.

Since software development is about expressing and constructing a solution to solve a problem, the pattern of communication about the problem domain and the vocabulary used to describe it becomes paramount.  In addition, the manner in which this pattern of communication takes place and evolves – ultimately – influences solution design, construction and implementation.  I guess I like to say that good, well-run organizations produce good software.  And, that… …well, you get the picture.  Note that in my experience the word “organization” here applies to both the technology department and the business people in the organization (can’t have one without the other).

Though there has been additional research on Conway’s Law since Melvin Conway first introduced it in 1968, empirical evidence has seemingly been hard to come by.  Thus, often (at least from my perspective), Conway’s Law can end up relegated to the subject of discreet giggling around the water coolers and hallways of dysfunctional organizations. Wink

How fitting is it then that this study, in the midst of the unit test code coverage quality debate, actually ties organizational metrics to the quality of software produced!?  I say, quite fitting.  Take a moment and read it.  I believe it will validate all that you’ve likely come to know and expected about the effects of “the organization” on producing software.

Non-softies take note: You just might even find it humorous that the guinea pig used to exercise the metrics was none other than Vista!

Cheers!

TheInfluenceOfOrgStructureOnSoftwareQuality.pdf (172.27 kb)

Dino Esposito on dotnetrocks.com

By erik at June 30, 2009 00:36
Filed Under: .NET, Architecture

Per my last post on Some Interesting Architecture Reading, I wanted to note that one of the authors of the book Microsoft .NET: Architecting Applications for the Enterprise, Dino Esposito, was recently over on dotnetrocks.com doing a show with Carl Franklin and Richard Campbell.  He mentions that book which I had recently reviewed in my blog, and of course has a nice interview with Carl and Richard on interesting things architectural.

By the way, if you’ve never checked out dotnetrocks.com, it’s a great way to catch up on trends occurring in the .Net stack when you’re driving or have other time when you can listen to an audio podcast.

Some Interesting Architecture Reading

By erik at May 28, 2009 20:47
Filed Under: Architecture, Design, .NET

This past winter I completed a well-done book by Dino Esposito and Andrea Saltarello entitled Microsoft .NET: Architecting Applications for the Enterprise.  I’d like to give my endorsement to this well-written, well-thought-out book.  It encapsulates all of the core concepts needed to correctly build an enterprise-class, durable, yet flexible application.  In many cases, it also gives these concepts detailed treatment, as well.

What I like most about this book is that, although it’s got “.NET” in the title – and it occasionally touts Microsoft-specific technologies, most of book content is dedicated on traditional application architecture and design concepts that are time-tested, proven, and have developed outside of the .NET community.  This includes concepts such as UML, layered architecture and the use of design patterns.  This is a good thing.

Although the book is not a replacement for the core application design and architecture books that I consider mandatory reading from authors such as Booch, GoF, and Fowler (more on this in a future post), it is a great book to dive into that you can consider a comprehensive and detailed treatment of application architecture.  Check it out.

Additionally, while listening to a podcast today, I heard about App Arch Guide 2.0 from the Microsoft Patterns & Practices team.  While I have not read this book, I can say that the interview with one of the authors was compelling enough for me to download it.  More to come on this one.

Tip: Remove Stack Trace Information in ASMX (XML) Web Services

By erik at May 26, 2009 13:36
Filed Under: .NET, Architecture

A non-obvious way to alter the behavior of .NET automatically populating Soap fault elements with stack trace information (specifically, in the faultString element) is to use the customErrors configuration element, which is part of the ASP.NET Settings Schema.

I state this is non-obvious because in the documentation, the following statement is made here under the Remarks section:

The <customErrors> element does not apply to errors that occur in XML Web services.

However, the use of this element’s mode attribute clearly does alter the behavior, when setting the mode attribute to “On”, in the system.web configuration section:

<configuration>
    <system.web>
        <customErrors mode="On" />
    </system.web>
</configuration>

Note that the default setting for the customErrors element’s mode attribute is “RemoteOnly” which will add stack trace information for local requests and remove it for remote requests.  This is usually desirable behavior but depending upon your own needs it is good to know the availability of the mode attribute’s “On” setting to always, universally remove the stack trace information, should you desire to do so.

A Simple Case For Good Database Design

By erik at May 08, 2009 22:52
Filed Under: Architecture, Design

So I have been involved in a few endeavors lately that have made me realize something that I often do second nature and take for granted -- good database design.  I don't know what I've thought throughout my career thus far, but I guess I just always figured that everyone thought like I did and ensured the delivery of a well-designed, robust, database model as persistence for their software.  Apparently I figured wrong.

I am running into situations where a developer had both the complete opportunity and full control to design a robust data model, but didn't.  Why?  Well, we could get into an interesting discussion on top-down design vs. bottom up design, ORMs, and the like.  However, that is a subject for a future post (er, perhaps many future posts).  We could also take into consideration performance implications of certain database designs versus others, more on this in a bit, below.  My observations have led me to the assertion that, for whatever the reason is, sometimes there are developers who just don't do good database designs.  In line-of-business apps, and especially those that are enterprise-class, and of strategic importance, perhaps nothing is more important than a good database design.  In this post, I will explain why this is with a simple example:

Imagine a case where you are designing perhaps a supply chain management system and you need to cache supplier part information in your database for performance reasons (as opposed to querying the supplier's remote data repository directly).  You start building your app by getting an Excel sheet of suppliers from your vendor management folks who work with the suppliers.  It has columns like so:

Supplier Name, Supplier Address, Supplier CSZ, Supplier's Item #, Supplier's Price

It certainly might be tempting to upload / import this data right into a single database table as is and begin working with it.  However, there are clearly problems with doing this:

  1. Redundancy of Data.  Note that for every row (sometimes, in database theory called a "tuple") that contains a different supplier item, the supplier's address repeats.  This creates duplicates in terms of the name, address and CSZ fields.
     
  2. Brittle Data Consistency.  As a direct consequence of the repeated address, it is conceivably possible that should the supplier's address change, an update is performed inconsistently (perhaps only to 1 row).  This would result in duplicate addresses for the supplier -- we physically have two addresses for the supplier when we know logically we should only have one (let's pretend this isn't a multi-site supplier).  In database theory you'll sometimes hear this problem described as "update anomalies".
     
  3. Insertion Anomalies.  Interestingly, if you think about it you might note that if we make the table field Supplier's Item # and Supplier's Price required (NOT NULL), then we can't record the address of a supplier if that supplier doesn't currently supply us with at least one item.  I suppose we could be stuff garbage in the fields, but that just devalues our database and our system.  We could make the item number and price fields non-required (NULL), so we could store a row with the address.  But, would we remember to delete the "hacked" address row when the supplier finally starts supplying us with items?  And, if we do this, hopefully you can see that we can easily run into a problem if data gets into the database with a item number specified but no price, or vice versa.
     
  4. Deletion Anomalies.  Imagine what happens now when we decide that we need to delete all rows containing supplier item and price information, because the supplier no longer supplies us with items?  What happens to the supplier's address information?

It is easy to see why we need good, well-designed database models from this example.  I've only scratched the surface but if you're interested pick up a classic text on relational theory.  So now what?

The way to arrive at a good design with our example is to decompose (break down, split apart) the Excel sheet into a set of multiple tables, like so:

Table: Suppliers (Columns: SupplierKey, SupplierName, Address, City, State, Zip)
Table: Items (Columns: ItemNumber, SupplierKey, Price)

I've done a few things here.  First, I've made a decision to store the data in two tables in my design.  The first table is called "Suppliers".  I've also added a contrived key to the Suppliers table, called "SupplierKey".  A contrived key is a key you make up, or generate, (or have generated for you), when a good natural key is not present.  A natural key is a semantically clear identifier for an entity being represented by the table.  So, an example might be the state driver's license bureau where information about you is kept in a Driver table using your driver's license number as the natural key.  In this example, I've created a contrived key because names, like our supplier name, typically don't make good key fields.

Second, I broke out the City, State, and Zip fields.

Finally, I have an Items table, and have moved the ItemNumber and Price fields to it.  ItemNumber in this case is a good natural key.  I also added what is known as a foreign key field, named SupplierKey.  The SupplierKey foreign key field in table Items is the same SupplierKey field in table Suppliers (same data type).  This will be used to relate the two tables together to perform queries.

The resultant design now solves all four problems mentioned above.  The process of decomposition in database design is called "normalization" and we say that each table in a database design or model is in a state of one of the "normal forms".  The basic idea behind normalization is to identify your entities (typically nouns in your problem domain), and create tables for each.  Then, you identify attributes for your entities, and create columns for each.  You want to validate that each of your entity attributes are facts about and depend upon the entity, and not other attributes or combinations of other attributes.  The process is rounded out by creating links or relationships between entities.  A full and appropriate treatment of normalization is outside the scope of what I wanted to do here, not to mention it's been done by others many times over.  Aside from grabbing a hold of any of E.F. Codd's work (E.F. Codd is considered the "father of normalization"), perhaps one of the early and great articles on normalization was written by William Kent, entitled "A Simple Guide to Five Normal Forms in Relational Database Theory", and located here.  A good summary and jump page to check out as well would be wikipedia, also.  Thus, if you've never taken the time to study normalization, I can assure you that now is the time.  You will reap great benefit from the effort.  Normalization: live it, learn it, love it. Smile

So, as mentioned earlier, what if you fully understand the process of normalization, the normal forms, etc., and choose not to normalize (or use a relatively denormalized design) because of performance reasons?  And just exactly what are the performance concerns?  Classic database and relational theory taught that when you write database operations via the database engine's DML (SQL) and had to join tables, that the joining of tables was an expensive operation.  It is my belief that we have progressed, especially with innovative database management systems such as Microsoft SQL Server, well past this and it isn't nearly as true today as it once was when initially proclaimed.  The benefits of normalization for both transactional and data warehousing scenarios typically outweigh any [perceived] performance gains.  I found a nice point on this topic on this IBM treatment of normalization:

In practice, many databases are de-normalized to greater or lesser degree. The reason most often stated has to do with performance -- a de-normalized database may require fewer joins and can, therefore, be faster for retrievals.

While this reasoning may be true, the usual caveats against premature optimization apply here as well as everywhere else. First, you should determine sufficiently that a performance problem exists and that the proposed de-normalization improves it before introducing a conceptually suboptimal design.

Furthermore, a de-normalized schema can be harder to update. The additional integrity checks that are necessary in this case may offset the performance gains for queries obtained through denormalization.

And another nice point, here:

Well normalized data makes programming (relatively) easy, and works very well in multi-platform, enterprise wide environments. Non-normalized data leads to heartbreak.

In summary, I hoped I've inspired some of you to think about your database model design more thoroughly, and to check out the topic of database normalization.  Until next time...

 

Visitors

Disclaimer

Views expressed are my own and in no way represent those of my employer; this is in no way a work or work-related blog.  All postings are presented "AS IS" and confer no rights.

Copyright

All content copyright © 2009-2011 by Erik Mlincek.  You may not use content without the express written consent of the author, however you may link back to blog topics hosted at mlincek.com without obtaining permission in advance.