Thursday, November 15, 2007

How To Document a Type 3 SCD: A pair of Type 1 Responses

Many dimensionally-aware database modeling tools allow you to document the Slow Change characteristics of each dimension attribute. But often, your only choices are Type 1 or Type 2. This post describes what to do when your design calls for the less common Type 3 change. As it turns out, a Type 3 SCD response can be understood as a pair of Type 1 responses.

In a previous post, I pointed out that we think about slow changes with respect to a source system, we usually document them as part of the dimensional schema. For example, if the customer's name or date of birth changes on the source, what do we do in the star schema? A Type 1 response would be to overwrite a corresponding value in the star schema; a Type 2 response would be to insert a new row for the revised version of the customer. We usually document this behavior by flagging each attribute in the star as "Type 1" or "Type 2", even though the "change" happens on the source.

(Not sure Slowly Changing Dimensions are? See For Slowly Changing Dimensions, Change Is Relative for an definition of the concept and explanations of the Type 1 and Type 2 techniques.)

A Type 3 change calls for yet another kind of response to a change in source data. This approach response maintains before and after versions of the value in separate dimension attributes. Because the before and after value are stored in the same row, either version can be associated with activity (read: facts) that occurred before or after the change.

Schema designers often throw up their arms in disgust when they discover that their modeling tool does not support the concept of a Type 3 change. But this despair may not be warranted; it is possible to document a Type 3 change as a pair of Type 1 changes.

Before we look at how to do this, here's a quick example of a Type 3 slow change.

Type 3 SCD: An Example

Suppose that the dimensional model contains a dimension table called product, and that one of the dimension attributes is category. The category attribute carries a standard classification mechanism used to categorize products, and appears in a large number of reports. An example of a category is "stationery."

A wide variety of products have this category designation, including several types of envelope, note pad, and paper sheets. All these products have rows in the product dimension table with a category of "stationery." Fact tables linked to the product dimension can be analyzed by category, grouping all these products under the value "stationery."

Now lets assume that the business has expanded its product lines, and the old "stationery" category is a bit too broad. Management plans to redesignate existing stationery products as "Personal Stationery" or "Business Stationery". Going forward, they want to use the new values for all analysis, even when studying facts that were recorded prior to the changeover. So far, sounds like a Type 1 change.

The catch: some planning numbers are tied to the old designations. For at least a while, management will need the ability to use the old value to study facts, whether they were recorded before or after the change. (A type 2 response does not help here; it associates all facts recorded before the change with the old value, and all facts recorded after the change with the new value.) And, now that they've thought about it some more, it might be nice to be able to do this every year.

The solution is to maintain two dimensional attributes for the category. The first is meant to carry the current category designation, and might be called category_current. The second is meant to carry the old category designation, and might be called category_previous.

When the category for a product changes, say from "stationery" to "business stationery", we update the category_current to contain the new value "business stationery" and place the old value, "stationery" in category_previous. When management wants to study facts using the new designations, their reports should group facts using the category_current dimension attribute. When they want to study facts using the old categorization scheme, they can use the category_previous attribute.

Documenting Type 3 Change

As discussed in the previous post, we often document slow change response on an attribute by attribute basis in the dimensional model. For our product example, this might be done as shown in the following illustration.

Many modeling tools allow us to document the SCD characteristic of a column, offering a pick list that allows us to choose what type of response is expected when the source of an attribute changes. But what if your tool does not support Type 3? Or what if you have a tool that automates the construction of an ETL routine, aggregate table or OLAP cube, and the tool does not offer the Type 3 option?

There may be a solution. Notice that when a type 3 change occurs, we update both versions of the attribute. In our example, category_current is updated withe the new name of the product, and category_previous is updated with the prior value (which is the old value of category_current.) We effectively have a pair of Type 1 attributes, and can document them as shown below. I've added a highlight to illustrate the pair of attributes in question.

Of course, if you have tools that use this information to do something, such as generate a cube or ETL script, you need to be sure this method will not be disruptive. You will probably need to provide sources for each of the attributes, and it may be challenging to define a source for category_previous, which will probably not be present in the source after the change has occurred. The solution may be able to define its source as the the category_current value of the product dimension table itself. This, in turn, may require that you join to a copy of the dimension when extracting source data.

Repeatable Process

To round out this discussion of the Type 3 change, I should point out that this slow change process is repeatable. Next year, if the designations change again, we can move category_current values into category_previous, and place the new values in category_current.

Note that we always have access to two values: the current value and the prior value. Not enough? Its also possible to maintain more than two columns for a type 3 response. We might choose to maintain columns called category_current, category_last_year, category_2yrs_ago, etc. Now, any or all of the values can be used to study all facts, regardless of when they were recorded. (A scheme like category_2007, category_2006, category_2005 might work as well, but might require you to redesign reports each year to use the correct attributes.)

More Changes on the Way

That's not quite the end of the story of SCD's. In future posts, I will look at the popular "timestamped" and "hybrid" approaches, as well as run down a few advanced concepts and caveats.

Copyright (c) 2007 Chris Adamson