I receive a lot of questions regarding the best way to structure warehouse data to support an analytics program. The answer is simple: follow the same best practices you've already learned.
I'll cover these practices from a dimensional modeling perspective. Keep in mind that they apply in any data warehouse, including those modeled in third normal form.
1. Store Granular Facts
Analytic modelers often choose sources external to the data warehouse, even when the warehouse seems to contain relevant data. The number one reason for this is insufficient detail. The warehouse contains summarized data; the analytic model requires detail.
In this situation, the analytic modeler has no choice but to look elsewhere. Worse, she may be forced to build redundant processes to transform source data and compile history. Luckily, this is not a failure of warehouse design principles; it's a failure to follow standard best practices.
Best practices of dimensional design dictate that we set the grain of base fact tables at the lowest level of detail possible. Need a daily summary of sales? Store the individual order lines. Asked to track the cost of trips? Store detail about each leg.
Dimensional solutions can contain summarized data. This takes the form of cubes, aggregates, or derived schemas. But these summaries should be derived exclusively from detailed data that also lives in the warehouse.
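To make this concrete, here is a minimal sketch in Python (pandas, with hypothetical table and column names of my own choosing) showing a daily summary derived from order-line detail rather than stored in its place:

```python
import pandas as pd

# Hypothetical order-line grain: one row per line on each order.
order_lines = pd.DataFrame({
    "order_date": ["2024-03-01", "2024-03-01", "2024-03-02"],
    "product_id": ["SKU-1001", "SKU-1002", "SKU-1001"],
    "quantity": [2, 1, 5],
    "extended_amount": [19.98, 4.99, 49.95],
})

# The daily summary is derived from the detail; the detail itself stays in the warehouse.
daily_sales = (
    order_lines
    .groupby(["order_date", "product_id"], as_index=False)[["quantity", "extended_amount"]]
    .sum()
)
print(daily_sales)
```

Because the order lines remain available, any other summary (weekly, by region, by customer) can be derived later without going back to the source systems.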
Like all rules, this rule has exceptions. There are times when the cost/benefit calculus is such that it doesn't make sense to house highly granular data indefinitely. But more often than not, summary data is stored simply because basic best practices were not followed.
2. Build “Wide” Dimensions
The more attributes there are in your reference data (aka dimensions), the more useful source material there is for analytic discovery. So build dimensions that are full of attributes, as many as you can find.
If the grain of your fact table gives the analytics team “observations” to work on, the dimensions give them “variables.” And the more variables there are, the better the odds of finding useful associations, correlations, or influences.
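Here is a rough illustration of that idea (again with hypothetical names): each fact row is an observation, and every attribute joined in from a wide dimension becomes another candidate variable.

```python
import pandas as pd

# Hypothetical fact rows: each one is an "observation."
sales = pd.DataFrame({
    "product_key": [1, 2, 1],
    "quantity": [3, 1, 2],
})

# A "wide" product dimension: every attribute is a candidate "variable."
product = pd.DataFrame({
    "product_key": [1, 2],
    "brand": ["Acme", "Zenith"],
    "package_size": ["12 oz", "16 oz"],
    "launch_season": ["spring", "fall"],
})

# Joining facts to the wide dimension yields the analytic data set:
# observations as rows, variables as columns.
analytic_dataset = sales.merge(product, on="product_key", how="left")
print(analytic_dataset)
```

The more columns the dimension carries, the more variables land in that data set without any extra work by the analytics team.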
Luckily, this too is already a best practice. Unfortunately, it is one that is often misunderstood and violated. Misguided modelers frequently strip dimensions down to only the attributes they deem essential, or model only to the specific requirements in front of them.
3. Track Changes to Reference Data
When reference data changes, too many dimensional models default to updating the corresponding dimension rows in place, simply because it is easier.
For example, suppose your company re-brands a product. It's still the same product, but with a new name. You may be tempted to simply update the reference data in your data warehouse. This is easier than tracking changes. It may even seem to make business sense, because 90% of your reports require this-year-versus-last comparison by product name.
Unfortunately, some very important analysis may require understanding how consumer behavior correlates with the product name. You've lost this in your data set. Best practices help avoid these problems.
Dimensional models should track the change history of reference data. In dimensional speak, this means application of type 2 slow changes as a rule. This preserves the historic context of every fact recorded in the fact table.
In addition, every row in a dimension table should track "effective" and "expiration" dates, as well as a flag that identifies which rows are current. This enables the delivery of type 1 behavior (the current value) even as we store type 2 behavior. From an analytic perspective, it also enables useful "what if" analysis.
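The sketch below shows one simplified way such a change might be applied. It is illustrative only, with hypothetical column names, not a full ETL implementation: the current row is expired, and a new version is appended with its own surrogate key, effective date, and current flag.

```python
from datetime import date

import pandas as pd

# Hypothetical product dimension with type 2 housekeeping columns.
product_dim = pd.DataFrame({
    "product_key": [1],                          # surrogate key
    "product_id": ["SKU-1001"],                  # natural key from the source system
    "product_name": ["Orange Fizz"],
    "effective_date": [date(2020, 1, 1)],
    "expiration_date": [date(9999, 12, 31)],
    "is_current": [True],
})

def apply_type2_change(dim, natural_key, new_name, change_date):
    """Expire the current row for this natural key and append a new version."""
    mask = (dim["product_id"] == natural_key) & dim["is_current"]
    old = dim[mask].iloc[0].copy()               # the row being superseded
    dim.loc[mask, "expiration_date"] = change_date
    dim.loc[mask, "is_current"] = False

    new_row = old.copy()
    new_row["product_key"] = dim["product_key"].max() + 1   # next surrogate key
    new_row["product_name"] = new_name
    new_row["effective_date"] = change_date
    new_row["expiration_date"] = date(9999, 12, 31)
    new_row["is_current"] = True
    return pd.concat([dim, new_row.to_frame().T], ignore_index=True)

# The product is re-branded: same product, new name, history preserved.
product_dim = apply_type2_change(product_dim, "SKU-1001", "Citrus Fizz", date(2024, 3, 1))
print(product_dim)
```

Facts recorded before the change continue to point at the old row; facts recorded afterward point at the new one, so historic context is never overwritten.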
As with all rules, again there are exceptions. In some cases, there may be good reason not to respond to changes in reference data by tracking history. But more often than not, type 1 responses are chosen for the wrong reason: because they are easier to implement.
4. Record Identifying Information, Including Alternate Identifiers
Good dimensional models allow us to trace back to the original source data. To do this, include transaction identifiers (real or manufactured) in fact tables, and maintain identifiers from source systems in dimension tables (these are called "natural keys").
Some of this is just plain necessary in order to get a dimensional schema loaded. For example, if we are tracking changes to a product name in a dimension, we may have multiple rows for a given product. The product's identifier is no longer a unique identifier, but we must still have access to it; without it, it would be impossible to load facts into the fact table.
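To illustrate, here is a simplified sketch (hypothetical names and data) of a fact load using the natural key to look up the surrogate key of the current dimension row:

```python
import pandas as pd

# Hypothetical product dimension rows, including type 2 versions.
product_dim = pd.DataFrame({
    "product_key": [7, 8, 9],                       # surrogate keys
    "product_id": ["SKU-1001", "SKU-1001", "SKU-1002"],  # natural keys
    "is_current": [False, True, True],
})

# Incoming transactions identify products only by the source-system identifier.
staged_facts = pd.DataFrame({
    "transaction_id": ["T-501", "T-502"],           # real or manufactured identifiers
    "product_id": ["SKU-1001", "SKU-1002"],
    "quantity": [2, 7],
})

# Because type 2 changes create multiple versions per natural key,
# the lookup is restricted to current rows.
lookup = product_dim.loc[product_dim["is_current"], ["product_id", "product_key"]]
fact_rows = staged_facts.merge(lookup, on="product_id", how="left")
print(fact_rows)   # facts now carry surrogate keys and transaction identifiers
```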
Identifying information is also essential for business analytics. Data from the warehouse is likely to be combined with data that comes from other places. These identifiers are the connectors that allow analytic modelers to do this. Without them, it may become necessary to bypass the warehouse.
Your analytic efforts, however, may require blending new data with your enterprise data. And that new data may not come with handy identifiers. You have a better chance blending it with enterprise data if your warehouse also includes alternate identifiers, which can be used to do matching. Include things like phone numbers, email addresses, geographic coordinates—anything that will give the analytics effort a fighting chance of linking up data sources.
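As a sketch, with made-up data and a deliberately simple normalization step, alternate identifiers such as email addresses can serve as the match keys when new data arrives without your warehouse keys:

```python
import pandas as pd

# Hypothetical customer dimension carrying alternate identifiers.
customer_dim = pd.DataFrame({
    "customer_key": [11, 12],
    "customer_id": ["C-001", "C-002"],              # natural key
    "email": ["Pat@Example.com", "lee@example.com"],
    "phone": ["555-0100", "555-0101"],
})

# External data (say, a purchased survey file) with no warehouse identifiers.
survey = pd.DataFrame({
    "respondent_email": ["pat@example.com", "unknown@nowhere.com"],
    "satisfaction": [4, 2],
})

# A simple normalization step before matching; real-world matching is usually fuzzier.
customer_dim["email_norm"] = customer_dim["email"].str.strip().str.lower()
survey["email_norm"] = survey["respondent_email"].str.strip().str.lower()

blended = survey.merge(
    customer_dim[["customer_key", "email_norm"]], on="email_norm", how="left"
)
print(blended)   # matched rows now carry the warehouse customer_key
```

Without those alternate identifiers in the dimension, this kind of blending has to happen outside the warehouse, if it can happen at all.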
Summary
If you've been following the best practices of dimensional modeling, you've produced an asset that maximizes value for analytic modelers:
- You have granular, detailed event data.
- You have rich, detailed reference data.
- You are tracking and time-stamping changes to reference data.
- You've got transaction identifiers, business keys, and alternate identifiers.
It also goes without saying that conformed dimensions are crucial if you hope to sustain a program of business analytics.
Of course, there are other considerations that may cause an analytic modeler to turn her back on the data warehouse. Latency issues, for example, may steer her toward operational solutions. Accessibility and procedural issues, too, may get in the way of the analytic process.
But from a database design perspective, the message is simple: follow those best practices!
Further Reading
You can also read more in prior posts. For example:
- Rule 1: State Your Grain (December 9, 2009) covers the fundamentals of grain
- Build High Resolution Stars (January 7, 2011) discusses the importance of setting grain at the lowest level possible
- For Slowly Changing Dimensions, Change is Relative (October 9, 2007) covers type 1 vs. type 2 processing and surrogate keys vs. natural keys
- Responding to Star Schema Detractors with Timestamps (March 12, 2008) covers the use of effective and expiration dates with type 2 slow changes
- Do I Really Need Surrogate Keys (May 20, 2009) covers the fundamentals of business keys vs. warehouse keys
- Creating transaction identifiers for fact tables (October 17, 2011) covers real and manufactured transaction identifiers
My book, Star Schema: The Complete Reference, covers the best practices of dimensional design in depth. For example:
- Grain, identifiers, keys and basic slow change techniques are covered in Chapter 3, "Stars and Cubes"
- The place of summary data is covered in Chapter 14, "Derived Schemas" and Chapter 15, "Aggregates"
- Conformance is covered in Chapter 5, "Conformed Dimensions"
- Advanced slow change techniques are explored in Chapter 8, "More Slow Change Techniques"