Wednesday, December 16, 2009

Dimensional Designs are not Denormalized

I don’t usually bother to take a stand on issues surrounding terminology or theory.  For this post I am going to make an exception.

I often hear people say that a star schema is a denormalized data model.  There is an element of truth hiding in those words, but I think it describes how some people design stars, and not what a star is.  A better description is unnormalized or not normalized. But even these words do not describe all stars.

Regardless of the words you choose, using the vocabulary of normalization is generally a disservice to the novice designer.  And none of these words should ever by used to define the star schema.

Continue reading to learn why.

Normal Forms

The word “normalized” is used in connection with various properties that may be exhibited by a relational set, or table.  (I’m going to gloss over the difference here, and just hit the highlights.)  These properties describe specific characteristics of the structure of the table or of the relationships between its attributes. 

It’s important to understand that we are talking about properties, or characteristics here.  In fact, these properties have more specific names: first normal form, second normal form and so forth.  For operational systems, designers generally strive to achieve third normal form

If you want to learn about normal forms, you can start with Wikipedia.  Better yet, consult An Introduction to Database Systems, an essential volume by Chris Date.  But you don’t need to understand normal forms to finish reading this article.

What you do need to understand is that a table design that satisfies third normal form is devoid of several characteristics that are typically found in the dimension tables of a star schema. 

For this reason, people familiar with the principles of normalization, or who have backgrounds in E/R modeling, often refer to dimension tables as “denormalized.”

Denormalized?

To say that a something is denormalized implies that it was once normalized, and that someone acted to reverse that characteristic.  Some people arrive at their dimensional designs in this way, but not all of us do.

Defining a star schema as denormalized suggests that one of the following happened:
  • The designer first produced a normalized “logical design” to fit the business requirements, then denormalized it to create the “physical design”(presumably for performance reasons)
  • The designer took the normalized design of a source system, and denormalized it for the target system.
Certainly, both of these techniques can be used to produce star schema designs.  However, I’ve never put a design together in either of these ways.  (Well, almost never.) 

You can (and usually should) design a star schema without ever denormalizing anything. 

Start with a business process and observe how it is measured.  Identify the measurements (facts) and their context (dimensions).  Group the dimensions together, add some keys.  The result is a star schema. 

I do this all the time, and have no qualms with signing my name to the following statement:
No tables were denormalized during the design of this schema.
The result is not normalized, to be sure.  But it has not been actively denormalized, either.  In fact, steps are usually taken toward one of the normal forms, not away. 

A better term is unnormalized, or simply not normalized.

A Non-essential Characteristic

Whatever term you use—denormalized, not normalized, unnormalized—it is important to recognize that it describes some stars, but not all.  The non-normal format of a star is not an essential characteristic of its design. 

It is possible for a star schema to be in normal form.  Most stars violate first normal form, but there are some that do not.  Of those, some may also be considered to satisfy second normal form, though this will depend on what you are willing to accept as a candidate key.  And I have seen stars that fit the definition of third normal form as well.

It is clearly incorrect to define a star schema as a group of tables where the dimensions are not normalized.  In New York, most taxis are yellow, but New Yorkers do not define taxis as “yellow cars.”  Nor should a star schema be defined as a denormalized table design.

So what is a star, then?  It is the implementation of a dimensional model in relational tables.  A dimensional model describes a business process in terms of measurements (facts) and their context (dimensions.)  It can also be implemented in a multidimensinal database, where the result is a cube.

An Alternate Vocabulary

We get wrapped around the axle in conversations like this because we are really trying to apply principles from once discipline to activities in another.  To someone with a hammer, everything looks like a nail. But sometimes it’s not the right tool for the job.

The vocabulary of normalization includes terms like candidate key, superkey, non-prime attribute.  These terms are used to define principles of normalization that help us develop operational systems which maintain integrity in the face of insert, update and delete activities.

Dimensional design has its own vocabulary, which is suited to the design of tables for analytic systems.  This vocabulary includes terms like natural key, surrogate key, grain, type 1 attribute, and type 2 attribute.  They are used to form a set of principles that guide the development systems where the primary interaction with is query, and where insert update and delete anomalies can be controlled programmatically. The best thing you can do when designing dimensional data structures is to forget about the vocabulary of normalization.

These two vocabularies are not opposites, they are simply different.  Each is useful in designing a particular kind of database system.  Neither is “right” or “wrong.”  And most importantly, neither is more or less “relational.”

Why This Long Rant?

As I said at the top of this post, I usually don't like to get involved in questions of theory or terminology.  My focus is usually on getting things done.

But I teach dimensional design, and I find that students who cannot separate these two disciplines have the most trouble.

Do everyone a service, and stop using the word "denormalized" to define a star schema.

(End of rant.)

-Chris