Wednesday, December 16, 2009

Dimensional Designs are not Denormalized

I don’t usually bother to take a stand on issues surrounding terminology or theory.  For this post I am going to make an exception.

I often hear people say that a star schema is a denormalized data model.  There is an element of truth hiding in those words, but I think it describes how some people design stars, and not what a star is.  A better description is unnormalized or not normalized. But even these words do not describe all stars.

Regardless of the words you choose, using the vocabulary of normalization is generally a disservice to the novice designer.  And none of these words should ever by used to define the star schema.

Continue reading to learn why.

Wednesday, December 9, 2009

Rule 1: State Your Grain

Make sure you have a statement of grain for each fact table or cube in your dimensional design.

I receive lots of questions from people who are working through an issue with a star or cube. Most of the time, I must counter with a question of my own: “What is the grain?” Without this basic information, it is usually impossible to comment on whatever design issue the person is facing.

Being able to state grain is important, and not just for its value as a conversation starter. Here’s a brief look at what grain is, how to define it, and what happens if you don’t.

What is Grain

A statement of grain identifies what is represented by a single, granular row in a fact table (or an un-aggregated measure in a cube.) Each and every row in the fact table should meet this definition, with no wiggle-room.

When grain is not explicitly defined, or is defined in an ambiguous way, all manner of problems may arise. In addition to hampering your ability to talk about the design with someone else, ill-defined grain can cause to severe technical challenges in the reporting process.  It may even lead to reports that are just plain wrong.

Defining Grain

There are two ways to state the grain of a fact table. The first way is to use business language, referencing a specific, well-understood artifact of the activity described by the star. For example, a star that describes orders may have the following grain: “Order measurements at the order-line level of detail.” That sums things up pretty well. Each row of the fact table corresponds to a single order line.

Sometimes it is not easy to state grain in this manner. When that’s the case, you use the dimensions in a star to indicate what each unique row in the fact table represents. For example, a star that tracks the processing milestones of mortgage applications might have the following grain: “Processing measurements by application and status.” This fact table will have a new row each time an application (one dimension) undergoes a change in status (another dimension.)

When using dimensional terms to define grain, do not simply rattle off all the dimensions present in the star. Instead, list only those that are necessary to define a unique row. In the mortgage status change example, it is not necessary to mention that the star will also contain dimensions for date, customer, mortgage officer, mortgage product, and so forth.

Business definitions of grain are usually used for transaction fact tables.  Dimensional grain definitions are commonly used for snapshots, accumulating snapshots, derived schemas and aggregates. 

Fuzzy Grain

Poorly defined grain can lead to trouble. Ill-defined grain can mask a situation where a fact table is actually being used to track two or more processes. It can also lead to situations in which the fact table contains two or more levels of aggregation. In the former case, single-process reporting will be hampered. BI developers will be bending over backwards to focus on the relevant subset of data.  In the latter case, double-counting, triple-counting or worse is possible.

This is why “sate your grain” is rule #1 in any dimensional design.

-Chris


Image by GravityX9 licensed under Creative Commons 2.0

Thursday, November 19, 2009

The Tolkien Effect

Schema designers must be on the lookout for data elements that are known by more than one name.  Equally common is the use of a single name to signify very different things.

It may surprise you to learn that there is an important connection between data warehouse design and The Lord of the Rings. In the books, author JRR Tolkien challenges readers by using many different names, without explanation, for the same character.

The character Aragorn, for example, is also known as Strider, Dunadan, the Heir of Isildur, and several other names and titles.  Each name, it turns out, is associated with a different culture or point of view.

At first, all this can be deeply confusing.  With a little effort and patience, however, things begin to make sense, and the end result can be deeply rewarding.

The Tolkien Effect in Business

The same kind of thing happens when gathering requirements for a dimensional model.  Within a business, it is commonplace to find several names for the same thing.   Different departments, for example, may refer to products in different ways.  Even within a department, there may be multiple names for the same thing.

Depending on your previous experience with the area in question, it may take you some time to realize this is going on. I will never forget the day  I realized that a finance group meant the same thing by Ten-Digit-Department, Level 3 Code and Budget Line. 

It’s crucial to identify these situations, or the capabilities of your model will be crippled.  Data elements of interest in multiple contexts should be given a single, shared definition in your model.  For dimensions in particular, this will be crucial in supporting analysis that crosses subject areas.

These shared dimensions are called conformed dimensions, and they are the key to avoiding stove-pipe subject areas. Even within a subject area, this can be crucial.  The Ten-Digit-Department realization was essential in permitting comparison of budgets to actuals.

The Reverse-Tolkien

The converse is also a commonplace: a single name used to signify very different things.  The best example of this is “Sales.”  A salesperson will often use this word to refer to an order or contract.  In finance, however, the word is reserved for the event that allows the recognition or revenue, which is often fulfillment or shipment of the order. 

Once again, it is crucial that analyst keep an eye out for these situations; failure to produce consistent and well defined definitions for each fact or measurement is also a failure of conformance.  The result will be challenges to the accuracy of the data, distrust of the solution, and a lack of user adoption.

What You Can Do

How then to avoid these problems?  Listen.  Don’t assume you know what people mean. Repeat things back in your own words.  Be sure to write down and review definitions of each data element. 

Look out for The Tolkien Effect.  Pay close attention to people who live and work on the cusp of two subject areas or departments, as they will be keenly aware of these kind of linguistic challenges.  So will the data administrator, if your organization has one.

-Chris

Image Attribution:  lrargerich
CC BY 2.0