Wednesday, March 12, 2008

Responding to Star Schema Detractors with Timestamps

Time-stamped dimensions can pack additional analytic punch, compared with standard slow change techniques. There is additional cost in the load process, but it may be worth it, especially if you plan to load historic data.

This is the fourth in a series of posts on Slowly Changing Dimensions. If you are just joining in, you might want to review Part I, Part II and Part III.

Star schema detractors often suggest that a dimensional model throws away valuable detail. Proponents of dimensional design counter that one of their basic design tenets is to set fact table grain at the lowest level of detail possible. Fair enough. But this response does not address the concern fully. What about the dimensions?

Designers who eschew the dimensional model for their enterprise data warehouse point out that they can tell you what a particular customer looked like on a given day, regardless of whether there were any sales. A dimensional design using standard Type 2 techniques does not offer this capability. It writes a new row to the dimension table each time a characteristic changes. But we can only link these versions of the customer to a point in time if there is a transaction recorded in a fact table.

In the over-hyped "religious wars" of data warehousing, this is one criticism of the star schema that I have always felt had some validity. However, it is an easy one to address.

Time Stamped Dimensions

The solution is so simple that I have seen it employed numerous times by groups with no formal training in dimensional design. The solution is to add a pair of dates to the dimension table: effective_date and expiration_date.

When a row is added to the dimension table, it is given an effective_date that indicates when the change became effective. Later, when a slow change occurs, two things happen. A new record is inserted with its own effective_date, just as in a standard Type 2 response. In addition, the previous record is updated: its expiration_date is set to the day before the change occurred.
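A minimal sketch of that load logic, using Python with an in-memory SQLite table. The table and column names here are illustrative, not prescriptive:

```python
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer (
        customer_key INTEGER PRIMARY KEY,   -- surrogate key
        customer_id TEXT,                   -- natural key
        customer_name TEXT,
        effective_date TEXT,
        expiration_date TEXT                -- '9999-12-31' marks the current row
    )""")

HIGH_DATE = "9999-12-31"

def apply_type2_change(conn, customer_id, new_name, change_date):
    """Expire the current version, then insert the new version."""
    day_before = (date.fromisoformat(change_date) - timedelta(days=1)).isoformat()
    # Update the previous record: expire it as of the day before the change.
    conn.execute(
        "UPDATE customer SET expiration_date = ? "
        "WHERE customer_id = ? AND expiration_date = ?",
        (day_before, customer_id, HIGH_DATE))
    # Insert the new record, effective on the date of the change.
    conn.execute(
        "INSERT INTO customer (customer_id, customer_name, effective_date, expiration_date) "
        "VALUES (?, ?, ?, ?)",
        (customer_id, new_name, change_date, HIGH_DATE))

# Initial load of one customer, followed by a slow change.
conn.execute(
    "INSERT INTO customer (customer_id, customer_name, effective_date, expiration_date) "
    "VALUES ('C100', 'Acme Ltd.', '2007-01-01', ?)", (HIGH_DATE,))
apply_type2_change(conn, "C100", "Acme Inc.", "2008-03-01")
```

After the change, the table holds two versions of the customer: the old one, expired the day before the change, and the new one, open-ended until the next change.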

More ETL Work, but More Analytic Possibilities

This extra step may be a bit of additional work for the ETL developers, but it provides some additional analytic capability. Using these dates, it is possible to:

  1. List a series of changes to something in sequence: order by the effective_date.
  2. Determine the status of something at a point in time: qualify the date in question as greater than or equal to the effective_date and less than or equal to the expiration_date.
  3. Quickly find the current value for a given dimension: qualify on the expiration_date.
For all this to work, it is useful to initialize the expiration dates of current records to an actual date. This avoids the tortured query predicates brought on by the use of NULL values. I prefer to use the highest date value supported by the RDBMS, but that’s a topic for another post.
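All three queries can be sketched against a small SQLite table. The names and values below are illustrative; note how initializing the current row's expiration_date to a high date keeps every predicate NULL-free:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer (
        customer_key INTEGER PRIMARY KEY,
        customer_id TEXT,
        customer_name TEXT,
        effective_date TEXT,
        expiration_date TEXT)""")
conn.executemany(
    "INSERT INTO customer (customer_id, customer_name, effective_date, expiration_date) "
    "VALUES (?, ?, ?, ?)",
    [("C100", "Acme Ltd.", "2007-01-01", "2008-02-29"),
     ("C100", "Acme Inc.", "2008-03-01", "9999-12-31")])

# 1. Changes in sequence: order by effective_date.
history = conn.execute(
    "SELECT customer_name FROM customer WHERE customer_id = 'C100' "
    "ORDER BY effective_date").fetchall()

# 2. Status at a point in time: the date falls between the two stamps.
as_of = conn.execute(
    "SELECT customer_name FROM customer WHERE customer_id = 'C100' "
    "AND ? BETWEEN effective_date AND expiration_date",
    ("2007-06-15",)).fetchone()

# 3. Current version: expiration_date holds the high date.
current = conn.execute(
    "SELECT customer_name FROM customer WHERE customer_id = 'C100' "
    "AND expiration_date = '9999-12-31'").fetchone()
```

Because ISO-format date strings sort lexicographically, the BETWEEN predicate works even in a database without a native date type.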

As described, this is more of a "Date Stamped" dimension. But if the granularity of significant change is finer than the day, it can easily be extended to include a time component.

Because there is a bit of extra ETL processing, it is worth considering whether the additional analytic value justifies it. But there is another consideration, and here the ETL developers may actually save some work.

Another Benefit

Having the timestamps on our conformed dimensions will make it easier to load past history when we build new fact tables in the future. This is especially handy if we are building the warehouse one subject area at a time.

For example, our Customer dimension may be implemented first as part of an orders schema. Later, it may come into play again as part of a receivables schema. When it comes time to load the receivables history into a fact table, each transaction will need to be associated with the correct historic version of the customer. Without timestamps, ETL developers will need to use the operational data to identify exactly what the customer looked like at the time of the transaction, then search for a match in the customer dimension.

With timestamps, all the ETL developers will need is the customer_id. We can easily find the correct version of the customer in the dimension table by referring to the timestamps.
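That surrogate key lookup can be sketched as follows, assuming the conformed dimension carries the timestamps. The names are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer (
        customer_key INTEGER PRIMARY KEY,
        customer_id TEXT,
        customer_name TEXT,
        effective_date TEXT,
        expiration_date TEXT)""")
conn.executemany(
    "INSERT INTO customer VALUES (?, ?, ?, ?, ?)",
    [(1, "C100", "Acme Ltd.", "2007-01-01", "2008-02-29"),
     (2, "C100", "Acme Inc.", "2008-03-01", "9999-12-31")])

def lookup_customer_key(conn, customer_id, transaction_date):
    """Find the dimension version that was in effect on the transaction date."""
    row = conn.execute(
        "SELECT customer_key FROM customer WHERE customer_id = ? "
        "AND ? BETWEEN effective_date AND expiration_date",
        (customer_id, transaction_date)).fetchone()
    return row[0] if row else None

# Historic receivables transactions need only the natural key and a date;
# each resolves to the correct historic version of the customer.
keys = [lookup_customer_key(conn, "C100", d)
        for d in ("2007-06-15", "2008-06-15")]
```

Without the timestamps, the same lookup would require reconstructing the customer's full attribute profile from operational data and matching it against the dimension, column by column.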

For this simple reason alone, I encourage designers to use timestamps on the dimensions that are important across the enterprise--the conformed dimensions that form the backbone of the data warehouse.

Tuesday, January 8, 2008

Have it Both Ways With Hybrid Slow Changes

This is the third in a series on Slowly Changing Dimensions. It looks at an often misunderstood approach: the hybrid SCD.

When the source data for a dimension table changes, the Slowly Changing Dimension characteristics of a dimensional model dictate how the warehouse schema should respond. In my previous two posts, I examined the three most common techniques. (See Part I & Part II.) When the source value for a dimension attribute value changes:
  • Type 1 response: Update the existing value in the dimension table.
  • Type 2 response: Add a new row to the dimension table, with a new surrogate key.
  • Type 3 response: Maintain a single row in the dimension table, with separate columns for current and previous values of the dimension attribute in question.
Sometimes a single technique does not fit the bill. Requirements will come along that seem to call for both Type 1 and Type 2 treatment. What to do?

“We always want to do it this way. But sometimes we want to do it that way.”

Consider a company building a typical orders star schema for business-to-business sales. Customer companies are represented by a customer dimension table. You are trying to determine what the correct slow change response should be when the name of a customer company changes. In an interview, the first thing you are told is:
“We always want to use the most current name for the company in our analysis…”
This sounds to you like a Type 1 change, which looks reasonable. After all, you might reason, one wouldn’t want this year vs. last year comparisons to be befuddled by different company names. But no sooner have you noted this than the speaker continues:
“...except for when we want to use the name that was in effect at the time of the transaction.”
Now things look more complicated. To provide the company name as it existed at the time of an order would require a Type 2 response. But that would conflict with the first requirement.

What to do? Don’t jump for Type 3. That’s not the solution either. A Type 3 response will not tell you which value was in effect at the time of a given transaction. (It also usually does not capture more than one previous value of the attribute – see the previous post.)

Having it Both Ways

The hybrid approach allows you to meet both requirements. It requires that you model two dimension attributes that will capture the value in question – company name in this case.
  • company_name_current will capture the current name of the company. This attribute will exhibit Type 1 behavior. When a company name changes, it will be updated.
  • company_name_historic will capture the detailed history of company names. This attribute will exhibit Type 2 behavior. When a company name changes, a new row will be created to capture the new value and to associate it with any subsequent transactions.

This means that when a hybrid change occurs, the ETL process must do at least two things:
  1. Update the _current value for all records that share the same natural key.
  2. Insert a new row, initializing both the _current and _historic values to the new value.
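These two steps can be sketched with Python and SQLite. The table and column names are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer (
        customer_key INTEGER PRIMARY KEY,   -- surrogate key
        customer_id TEXT,                   -- natural key
        company_name_current TEXT,          -- Type 1 behavior: overwritten
        company_name_historic TEXT          -- Type 2 behavior: fixed per row
    )""")

def apply_hybrid_change(conn, customer_id, new_name):
    # Step 1 (the Type 1 part): update _current on every row
    # that shares this natural key.
    conn.execute(
        "UPDATE customer SET company_name_current = ? WHERE customer_id = ?",
        (new_name, customer_id))
    # Step 2 (the Type 2 part): insert a new row with both
    # columns initialized to the new value.
    conn.execute(
        "INSERT INTO customer (customer_id, company_name_current, company_name_historic) "
        "VALUES (?, ?, ?)", (customer_id, new_name, new_name))

# Initial load, then the hybrid change.
conn.execute(
    "INSERT INTO customer (customer_id, company_name_current, company_name_historic) "
    "VALUES ('C200', 'Apple Computer Inc.', 'Apple Computer Inc.')")
apply_hybrid_change(conn, "C200", "Apple Inc.")
```
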
Years ago, it was popular to refer to this approach as a “Type 4 Change.” With the second edition of The Data Warehouse Toolkit, Kimball and Ross suggest we call this a hybrid change. And for good reason: it is really a bit of Type 1 and a bit of Type 2.

Hybrid Change in Action

Developers often have trouble with this type of design. To fully understand it, look at what happens when the name of a company changes.

Suppose that one of the customers was Apple Computer. You may recall that in 2007, the company changed its name from Apple Computer Inc. to Apple Inc.

Initially, the dimension table contains a single row for this customer, with "Apple Computer Inc." in both the company_name_current and company_name_historic columns.

When the change occurs, the schema makes its hybrid response.

The existing record's _current value is updated with the new name, "Apple Inc."; its _historic value is not touched. In addition, a new record is inserted, with the new name in both the _current and _historic columns. Any facts subsequently recorded will be connected to this new row.

To analyze all facts (before and after the name change) using the new name, Apple Inc., you use company_name_current to group query results. To study all facts in the context of the historically accurate name, group query results by company_name_historic.
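The contrast can be made concrete with a tiny fact table. The names and figures below are purely illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (
        customer_key INTEGER PRIMARY KEY,
        company_name_current TEXT,
        company_name_historic TEXT);
    CREATE TABLE order_facts (
        customer_key INTEGER,
        order_dollars REAL);
    -- Row 1 received facts before the name change; row 2 after.
    INSERT INTO customer VALUES
        (1, 'Apple Inc.', 'Apple Computer Inc.'),
        (2, 'Apple Inc.', 'Apple Inc.');
    INSERT INTO order_facts VALUES (1, 100.0), (2, 250.0);
""")

# All facts grouped under the current name: a single group.
by_current = conn.execute("""
    SELECT c.company_name_current, SUM(f.order_dollars)
    FROM order_facts f JOIN customer c ON f.customer_key = c.customer_key
    GROUP BY c.company_name_current ORDER BY 1""").fetchall()

# Facts grouped under the historically accurate name: one group per name.
by_historic = conn.execute("""
    SELECT c.company_name_historic, SUM(f.order_dollars)
    FROM order_facts f JOIN customer c ON f.customer_key = c.customer_key
    GROUP BY c.company_name_historic ORDER BY 1""").fetchall()
```

Grouping by _current rolls all activity up under "Apple Inc."; grouping by _historic keeps the pre-change and post-change activity separate.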

Repeatable

This process is repeatable. If Apple once again changed the corporate name, say to “Apple Electronics, Inc.”, the ETL process for the customer table would:
  1. Change company_name_current to “Apple Electronics, Inc.” for both existing rows.
  2. Add a third row (with a new surrogate key) with _current and _historic values recorded as “Apple Electronics, Inc.” Subsequent facts would be connected to this row.
The Importance of Good Documentation

It is important to document the hybrid change carefully. ETL developers must understand how it is to be loaded, and report/query developers must understand how to use it. Otherwise, the hybrid approach can lead to confusion.

And be sure that it is worth the trouble. You might ask why people want the historic name if, 99% of the time, they only need the current name. If it is because they want to be able to reproduce an invoice, perhaps that need would be better served by going back to the source system. Then the schema can sport a simpler Type 1 approach.

But if a real analytic need exists, the hybrid approach is the only way to go.

Copyright (c) 2007 Chris Adamson

Thursday, November 15, 2007

How To Document a Type 3 SCD: A Pair of Type 1 Responses

Many dimensionally-aware database modeling tools allow you to document the Slow Change characteristics of each dimension attribute. But often, your only choices are Type 1 or Type 2. This post describes what to do when your design calls for the less common Type 3 change. As it turns out, a Type 3 SCD response can be understood as a pair of Type 1 responses.

In a previous post, I pointed out that although we think about slow changes with respect to a source system, we usually document them as part of the dimensional schema. For example, if the customer's name or date of birth changes on the source, what do we do in the star schema? A Type 1 response would be to overwrite a corresponding value in the star schema; a Type 2 response would be to insert a new row for the revised version of the customer. We usually document this behavior by flagging each attribute in the star as "Type 1" or "Type 2", even though the "change" happens on the source.

(Not sure what Slowly Changing Dimensions are? See For Slowly Changing Dimensions, Change Is Relative for a definition of the concept and explanations of the Type 1 and Type 2 techniques.)

A Type 3 change calls for yet another kind of response to a change in source data. This response maintains before and after versions of the value in separate dimension attributes. Because the before and after values are stored in the same row, either version can be associated with activity (read: facts) that occurred before or after the change.

Schema designers often throw up their arms in disgust when they discover that their modeling tool does not support the concept of a Type 3 change. But this despair may not be warranted; it is possible to document a Type 3 change as a pair of Type 1 changes.

Before we look at how to do this, here's a quick example of a Type 3 slow change.

Type 3 SCD: An Example

Suppose that the dimensional model contains a dimension table called product, and that one of the dimension attributes is category. The category attribute carries a standard classification mechanism used to categorize products, and appears in a large number of reports. An example of a category is "stationery."

A wide variety of products have this category designation, including several types of envelope, note pad, and paper sheets. All these products have rows in the product dimension table with a category of "stationery." Fact tables linked to the product dimension can be analyzed by category, grouping all these products under the value "stationery."

Now let's assume that the business has expanded its product lines, and the old "stationery" category is a bit too broad. Management plans to redesignate existing stationery products as "personal stationery" or "business stationery". Going forward, they want to use the new values for all analysis, even when studying facts that were recorded prior to the changeover. So far, this sounds like a Type 1 change.

The catch: some planning numbers are tied to the old designations. For at least a while, management will need the ability to use the old value to study facts, whether they were recorded before or after the change. (A type 2 response does not help here; it associates all facts recorded before the change with the old value, and all facts recorded after the change with the new value.) And, now that they've thought about it some more, it might be nice to be able to do this every year.

The solution is to maintain two dimensional attributes for the category. The first is meant to carry the current category designation, and might be called category_current. The second is meant to carry the old category designation, and might be called category_previous.

When the category for a product changes, say from "stationery" to "business stationery", we update category_current to contain the new value "business stationery" and place the old value, "stationery", in category_previous. When management wants to study facts using the new designations, their reports should group facts using the category_current dimension attribute. When they want to study facts using the old categorization scheme, they can use the category_previous attribute.
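This Type 3 response can be sketched in a few lines of Python and SQLite. The table, column, and SKU names are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE product (
        product_key INTEGER PRIMARY KEY,
        sku TEXT,
        category_current TEXT,
        category_previous TEXT)""")
# Before any change, both columns carry the original category.
conn.executemany(
    "INSERT INTO product (sku, category_current, category_previous) VALUES (?, ?, ?)",
    [("ENV-01", "stationery", "stationery"),
     ("PAD-01", "stationery", "stationery")])

def apply_type3_change(conn, sku, new_category):
    # Both columns change in place: the old current value slides into
    # category_previous, and the new value lands in category_current.
    # (Standard SQL semantics: the right-hand expressions see the old row.)
    conn.execute(
        "UPDATE product SET category_previous = category_current, "
        "category_current = ? WHERE sku = ?", (new_category, sku))

apply_type3_change(conn, "ENV-01", "business stationery")
apply_type3_change(conn, "PAD-01", "personal stationery")
```

Note that no new rows are created: each product keeps its surrogate key, which is why either column can be used to group all facts, old or new.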

Documenting Type 3 Change

As discussed in the previous post, we often document slow change response on an attribute-by-attribute basis in the dimensional model, flagging each attribute with its expected response. For our product example, the category attribute calls for a Type 3 response.

Many modeling tools allow us to document the SCD characteristic of a column, offering a pick list that allows us to choose what type of response is expected when the source of an attribute changes. But what if your tool does not support Type 3? Or what if you have a tool that automates the construction of an ETL routine, aggregate table or OLAP cube, and the tool does not offer the Type 3 option?

There may be a solution. Notice that when a Type 3 change occurs, we update both versions of the attribute. In our example, category_current is updated with the new value, and category_previous is updated with the prior value (which is the old value of category_current). We effectively have a pair of Type 1 attributes, and can document each of them as a Type 1 response.

Of course, if you have tools that use this information to do something, such as generate a cube or ETL script, you need to be sure this method will not be disruptive. You will probably need to provide sources for each of the attributes, and it may be challenging to define a source for category_previous, which will probably not be present in the source system after the change has occurred. The solution may be to define its source as the category_current value of the product dimension table itself. This, in turn, may require that you join to a copy of the dimension when extracting source data.

Repeatable Process

To round out this discussion of the Type 3 change, I should point out that this slow change process is repeatable. Next year, if the designations change again, we can move category_current values into category_previous, and place the new values in category_current.

Note that we always have access to two values: the current value and the prior value. Not enough? It's also possible to maintain more than two columns for a Type 3 response. We might choose to maintain columns called category_current, category_last_year, category_2yrs_ago, and so on. Now any or all of the values can be used to study all facts, regardless of when they were recorded. (A scheme like category_2007, category_2006, category_2005 might work as well, but might require you to redesign reports each year to use the correct attributes.)

More Changes on the Way

That's not quite the end of the story of SCD's. In future posts, I will look at the popular "timestamped" and "hybrid" approaches, as well as run down a few advanced concepts and caveats.

Copyright (c) 2007 Chris Adamson