Friday, January 7, 2011

Build High Resultion Stars

On TV, a low-resolution photo is never a problem for crime scene investigators. Load it into a computer, zoom in on a reflection, click "enhance", and presto! the criminal is revealed.

In the world of data warehousing, we cannot rely on the CSI effect. If we start with low-resolution data, the detail is lost forever. Eventually, someone will ask a question that requires more detail. And it simply wont be there to answer the question.

A few simple guidelines will help you avoid this unfortunate situation.

A Common Problem

Omission of detail is one of the biggest frustrations caused by legacy designs. It is the number three problem I encounter in design reviews (after failure to use surrogate keys and failure to plan slowly changing dimensions.)

The reason for this is simple: people often design their schema to support current requirements. Unfortunately, this does not work in data warehousing. Analytic requirements constantly change. As a designer, your job is to produce solutions that will answer questions that are not yet known.

This may sound like an impossible task, but it's not. Its one of the strong suits of dimensional design. Three guidelines for high-resolution design will help future-proof your solution.

1. Match Grain to Source Data, Not Requirements

When you're designing a fact table, you need to establish its grain. A statement of grain defines what is represented by a row in a fact table. This is the "resolution" of your measurements. (For a refresher on grain, see the post Rule 1: State Your Grain.)

Don't set grain at the level of detail that meets requirements. Set it at the level of detail at which data is available.

For example, you are designing a fact table that measures sales. Someone asks for daily totals of sales in dollars. Don't assume daily totals of sales is sufficient. Eventually someone will want to go deeper, perhaps looking for products that are bought together. This requires setting your grain at the order line level of detail.

Once you have identified the process a fact table represents, look at the resolution of the source data. That should guide your decisions on grain. Of course, this needs to be kept within reason. If the available level of detail would result in fact tables that are too large or expensive to manage, you might need to do some summarization. But be careful; you'll be working with low-resolution data in the future.

2. Capture All the Facts that Fit

Remember that each fact table describes a process. When you fill in the facts, don't just include ones that people asked for. Include all available facts that describe that process.

Returning to our sales data, this means you should not stop at sales dollars. What else is known about sales? Quantities, tax, and other information is likely to be readily available, and many operational systmes may also offer a rudimentary concept of cost or margin. Include them!

3. Fill Out Dimensions

The last guideline for a high-resolution design involves the dimensions. You can probably guess what it is:

When you design a dimension table, don't just include attributes someone asked for. Find what is available in the source system, and include it all.

A simple "Product Name" and "Product ID" might be all that's needed to support sales by product. But what else do source tables hold that describes your products? Manufactures? Suppliers? Colors? Sizes? Weights? Include all this in your designs.

Remember, dimensions are the source of all context in our reports. The more we fill them out, the more kinds of questions people will be able to ask.

Initial Design is the Right Time

Match your grain, facts and dimensions to the available data, even if that goes beyond the requirements.

Of course, you can recover from errors in any of these areas. This will require modifying your design, ETL process, and so forth. When its decided that additional detail is needed going forward, you can make these changes.

But it is often difficult to add detail to historic data already in the data warehouse. For example, it may require accessing backup data. Worse, it may put ETL developers through the difficult process of performing slow change processing in reverse.

Think high-resolution today, and you can avoid these problems tomorrow.

Image Credit: Alan Cleaver via Creative Commons 2.0