Wednesday, December 16, 2009

Dimensional Designs are not Denormalized

I don’t usually bother to take a stand on issues surrounding terminology or theory.  For this post I am going to make an exception.

I often hear people say that a star schema is a denormalized data model.  There is an element of truth hiding in those words, but I think it describes how some people design stars, and not what a star is.  A better description is unnormalized or not normalized. But even these words do not describe all stars.

Regardless of the words you choose, using the vocabulary of normalization is generally a disservice to the novice designer.  And none of these words should ever by used to define the star schema.

Continue reading to learn why.

Wednesday, December 9, 2009

Rule 1: State Your Grain

Make sure you have a statement of grain for each fact table or cube in your dimensional design.

I receive lots of questions from people who are working through an issue with a star or cube. Most of the time, I must counter with a question of my own: “What is the grain?” Without this basic information, it is usually impossible to comment on whatever design issue the person is facing.

Being able to state grain is important, and not just for its value as a conversation starter. Here’s a brief look at what grain is, how to define it, and what happens if you don’t.

What is Grain

A statement of grain identifies what is represented by a single, granular row in a fact table (or an un-aggregated measure in a cube.) Each and every row in the fact table should meet this definition, with no wiggle-room.

When grain is not explicitly defined, or is defined in an ambiguous way, all manner of problems may arise. In addition to hampering your ability to talk about the design with someone else, ill-defined grain can cause to severe technical challenges in the reporting process.  It may even lead to reports that are just plain wrong.

Defining Grain

There are two ways to state the grain of a fact table. The first way is to use business language, referencing a specific, well-understood artifact of the activity described by the star. For example, a star that describes orders may have the following grain: “Order measurements at the order-line level of detail.” That sums things up pretty well. Each row of the fact table corresponds to a single order line.

Sometimes it is not easy to state grain in this manner. When that’s the case, you use the dimensions in a star to indicate what each unique row in the fact table represents. For example, a star that tracks the processing milestones of mortgage applications might have the following grain: “Processing measurements by application and status.” This fact table will have a new row each time an application (one dimension) undergoes a change in status (another dimension.)

When using dimensional terms to define grain, do not simply rattle off all the dimensions present in the star. Instead, list only those that are necessary to define a unique row. In the mortgage status change example, it is not necessary to mention that the star will also contain dimensions for date, customer, mortgage officer, mortgage product, and so forth.

Business definitions of grain are usually used for transaction fact tables.  Dimensional grain definitions are commonly used for snapshots, accumulating snapshots, derived schemas and aggregates. 

Fuzzy Grain

Poorly defined grain can lead to trouble. Ill-defined grain can mask a situation where a fact table is actually being used to track two or more processes. It can also lead to situations in which the fact table contains two or more levels of aggregation. In the former case, single-process reporting will be hampered. BI developers will be bending over backwards to focus on the relevant subset of data.  In the latter case, double-counting, triple-counting or worse is possible.

This is why “sate your grain” is rule #1 in any dimensional design.

-Chris


Image by GravityX9 licensed under Creative Commons 2.0