Wednesday, March 12, 2008

Responding to Star Schema Detractors with Timestamps

Time-stamped dimensions can pack additional analytic punch, compared with standard slow change techniques. There is additional cost in the load process, but it may be worth it, especially if you plan to load historic data.

This is the fourth in a series of posts on Slowly Changing Dimensions. If you are just joining in, you might want to review Part I, Part II and Part III.

Star schema detractors often suggest that that a dimensional model throws away valuable detail. Proponents of dimensional design counter that one of their basic design tenets is to set fact table grain at the lowest level of detail possible. Fair enough. But this response does not address the concern fully. What about the dimensions?

Designers who eschew the dimensional model for their enterprise data warehouse point out that they can tell you what a particular customer looked like on a given day, regardless of whether there were any sales. A dimensional design using standard Type 2 techniques does not offer this capability. It writes a new row to the dimension table each time a characteristic changes. But we can only link these versions of the customer to a point in time if there is a transaction recorded in a fact table.

In the over-hyped "religious wars" of data warehousing, this is one criticism of the star schema that I have always felt had some validity. However, it is an easy one to address.

Time Stamped Dimensions

The solution is so simple that I have encountered it being employed numerous times by groups with no formal training in dimensional design. The solution is to add a pair of dates to the dimension table: effective_date and expiration_date.

When a row is added to the dimension table, it is given an effective_date that indicates when the change became effective. Later, when a slow change occurs, two things occur. The new record is inserted with the new effective_date, similar to a standard Type 2 response. In addition, the previous record is updated: its expiration_date is set to the date before the change occurred.

More ETL Work, but More Analytic Possibilities

This extra step may be a bit of additional work for the ETL developers, but it provides some additional analytic capability. Using these dates, it is possible to:

  1. List a series of changes to something in sequence: order by the effective_date.
  2. Determine the status of something at a point in time: qualify the date in question as greater than or equal to the effective_date and less than or equal to the expiration_date
  3. Quickly find the current value for a given dimension, by qualifying on expiration_date
For all this to work, it is useful to initialize expiration dates of current records to an actual date. This avoids the tortured query predicates that are brought on by the use of NULL values. I prefer to use the highest date value supported by the RDBMS, but that’s a topic for another post.

As described, this is more of a "Date Stamped" dimension. But if the granularity of significant change is finer than the day, it can easily be extended to include a time component.

Because there is a bit of extra ETL processing, it is worth considering whether the additional analytic value is worth it. But there is also another consideration to take into account, and here the ETL developers may actually save some work.

Another Benefit

Having the timestamps on our conformed dimensions will make it easier to load past history when we build new fact tables in the future. This is especially handy if we are building the warehouse one subject area at a time.

For example, our Customer dimension may be implemented first as part of an orders schema. Later, it may come into play again as part of a receivables schema. When it comes time to load the receivables history into a fact table, each transaction will need to be associated with the correct historic version of the customer. Without timestamps, ETL developers will need to use the operational data to identify exactly what the customer looked like at the time of the transaction, then search for a match in the customer dimension.

With timestamps, all the ETL developers will need is the customer_id. We can easily find the correct version of the customer in the dimension table by referring to the timestamps.

For this simple reason alone, I encourage designers to use time-stamps on the dimensions that are important across the enterprise--the conformed dimensions that form the backbone of the data warehouse.