Tuesday, January 8, 2008

Have it Both Ways With Hybrid Slow Changes

This is the third in a series on Slowly Changing Dimensions. It looks at an often misunderstood approach: the hybrid SCD.

When the source data for a dimension table changes, the Slowly Changing Dimension characteristics of a dimensional model dictate how the warehouse schema should respond. In my previous two posts, I examined the three most common techniques. (See Part I & Part II.) When the source value for a dimension attribute value changes:
  • Type 1 response: Update the existing value in the dimension table
  • Type 2 response: Add a new row to the dimension table, with a new surrogate key
  • Type 3 response: Maintain a single row in the dimension table, with separate columns for current and previous values of the dimension attribute in question.
Sometimes, one single technique does not fit the bill. Requirements will come along that seem to call for both Type 1 and Type 2 treatment. What to do?

“We always want to do it this way. But sometimes we want to do it that way.”

Consider a company building a typical orders star schema for business-to-business sales. Customer companies are represented by a customer dimension table. You are trying to determine what the correct slow change response should be when the name of a customer company changes. In an interview, the first thing you are told is:
“We always want to use the most current name for the company in our analysis…”
This sounds to you like a type 1 change, which looks reasonable. After all, you might reason, one wouldn’t want this year vs. last year comparisons to be befuddled by different company names. But no sooner have you noted this, the speaker continues:
“...except for when we want to use the name that was in effect at the time of the transaction.”
Now things look more complicated. To provide the company name as it existed at the time of an order would require a Type 2 response. But that would conflict with the first requirement.

What to do? Don’t jump for Type 3. That’s not the solution either. A type 3 response will not tell you which value was in effect at the time of a given transaction. (It also usually does not capture more than one previous value of the attribute – see the previous post.)

Having it Both Ways

The hybrid approach allows you to meet both requirements. It requires that you model two dimension attributes that will capture the value in question – company name in this case.
  • company_name_current will capture the current name of the company. This attribute will exhibit Type 1 behavior. When a company name changes, it will be updated.
  • company_name_historic will capture the detailed history of company names. This attribute will exhibit Type 2 behavior. When a company name changes, a new row will be created to capture the new value, and to associate with any subsequent transactions

This means that when a hybrid change occurs, the ETL process must do at least two things:
  1. Update the _current value for all records that share the same natural key
  2. Insert a new row, initializing both _current and _historic values to the new value
Years ago, it was popular to refer to this approach as a “Type 4 Change.” With the second edition of The Data Warehouse Toolkit, Kimball and Ross suggest we call this a hybrid change. And for good reason: it is really a bit of Type 1 and a bit of Type 2.

Hybrid Change in Action

Developers often have trouble with this type of design. To fully understand it, look what happens when the name of a company changes.

Suppose that one of the customers was Apple Computer. You may recall that in 2007, the company changed its name from Apple Computer Inc. to Apple Inc.

Initially, the dimension table contains a row that looks like this:

In 2007, Apple changed its name from “Apple Computer Inc.” to “Apple Inc.” The next illustration shows the schema’s hybrid response in yellow:

The first row shows the existing record. The _current value is updated with the new name, “Apple Inc.” Its _historic value is not touched. The next row shows the new record that has been inserted. It shows the new name for both _current and _historic values. Any facts subsequently recorded will be connected to this new row.

To analyze all facts (before and after the name change) using the new name, Apple Inc., you use company_name_current to group query results. To study all facts in the context of the historically accurate name, group query results by company_name_historic.

Repeatable

This process is repeatable. If Apple once again changed the corporate name, say to “Apple Electronics, Inc.”, the ETL process for the customer table would:
  1. Change company_name_current to “Apple Electronics Inc.” for both existing rows.
  2. Add a third row (with a new surrogate key) with _current and _historic values recorded as “Apple Electronics, Inc.” Subsequent facts would be connected to this row.
The Importance of Good Documentation

It is important to document the hybrid change carefully. ETL developers must understand how it is to be loaded, and report/query developers must understand how to use it. Otherwise, the hybrid approach can lead to confusion.

And be sure that it is worth the trouble. You might ask why people want the historic name, if 99% of the time they only need the current name. If it is because they want to be able to reproduce an invoice, perhaps that would be better served by going back to the source system. Then the schema can sport a more simple Type 1 approach.

But if a real analytic need exists, the hybrid approach is the only way to go.

Copyright (c) 2007 Chris Adamson