Chris Adamson’s Blog: January 2008

Tuesday, January 8, 2008

Have it Both Ways With Hybrid Slow Changes

This is the third in a series on Slowly Changing Dimensions. It looks at an often misunderstood approach: the hybrid SCD.

When the source data for a dimension table changes, the Slowly Changing Dimension characteristics of a dimensional model dictate how the warehouse schema should respond. In my previous two posts, I examined the three most common techniques. (See Part I & Part II.) When the source value for a dimension attribute value changes:

Type 1 response: Update the existing value in the dimension table
Type 2 response: Add a new row to the dimension table, with a new surrogate key
Type 3 response: Maintain a single row in the dimension table, with separate columns for current and previous values of the dimension attribute in question.

Sometimes, one single technique does not fit the bill. Requirements will come along that seem to call for both Type 1 and Type 2 treatment. What to do?

“We always want to do it this way. But sometimes we want to do it that way.”

Consider a company building a typical orders star schema for business-to-business sales. Customer companies are represented by a customer dimension table. You are trying to determine what the correct slow change response should be when the name of a customer company changes. In an interview, the first thing you are told is:

“We always want to use the most current name for the company in our analysis…”

This sounds to you like a type 1 change, which looks reasonable. After all, you might reason, one wouldn’t want this year vs. last year comparisons to be befuddled by different company names. But no sooner have you noted this, the speaker continues:

“...except for when we want to use the name that was in effect at the time of the transaction.”

Now things look more complicated. To provide the company name as it existed at the time of an order would require a Type 2 response. But that would conflict with the first requirement.

What to do? Don’t jump for Type 3. That’s not the solution either. A type 3 response will not tell you which value was in effect at the time of a given transaction. (It also usually does not capture more than one previous value of the attribute – see the previous post.)

Having it Both Ways

The hybrid approach allows you to meet both requirements. It requires that you model two dimension attributes that will capture the value in question – company name in this case.

company_name_current will capture the current name of the company. This attribute will exhibit Type 1 behavior. When a company name changes, it will be updated.
company_name_historic will capture the detailed history of company names. This attribute will exhibit Type 2 behavior. When a company name changes, a new row will be created to capture the new value, and to associate with any subsequent transactions

This means that when a hybrid change occurs, the ETL process must do at least two things:

Update the _current value for all records that share the same natural key
Insert a new row, initializing both _current and _historic values to the new value

Years ago, it was popular to refer to this approach as a “Type 4 Change.” With the second edition of The Data Warehouse Toolkit, Kimball and Ross suggest we call this a hybrid change. And for good reason: it is really a bit of Type 1 and a bit of Type 2.

Hybrid Change in Action

Developers often have trouble with this type of design. To fully understand it, look what happens when the name of a company changes.

Suppose that one of the customers was Apple Computer. You may recall that in 2007, the company changed its name from Apple Computer Inc. to Apple Inc.

Initially, the dimension table contains a row that looks like this:

In 2007, Apple changed its name from “Apple Computer Inc.” to “Apple Inc.” The next illustration shows the schema’s hybrid response in yellow:

The first row shows the existing record. The _current value is updated with the new name, “Apple Inc.” Its _historic value is not touched. The next row shows the new record that has been inserted. It shows the new name for both _current and _historic values. Any facts subsequently recorded will be connected to this new row.

To analyze all facts (before and after the name change) using the new name, Apple Inc., you use company_name_current to group query results. To study all facts in the context of the historically accurate name, group query results by company_name_historic.

Repeatable

This process is repeatable. If Apple once again changed the corporate name, say to “Apple Electronics, Inc.”, the ETL process for the customer table would:

Change company_name_current to “Apple Electronics Inc.” for both existing rows.
Add a third row (with a new surrogate key) with _current and _historic values recorded as “Apple Electronics, Inc.” Subsequent facts would be connected to this row.

The Importance of Good Documentation

It is important to document the hybrid change carefully. ETL developers must understand how it is to be loaded, and report/query developers must understand how to use it. Otherwise, the hybrid approach can lead to confusion.

And be sure that it is worth the trouble. You might ask why people want the historic name, if 99% of the time they only need the current name. If it is because they want to be able to reproduce an invoice, perhaps that would be better served by going back to the source system. Then the schema can sport a more simple Type 1 approach.

But if a real analytic need exists, the hybrid approach is the only way to go.

Copyright (c) 2007 Chris Adamson

Classes

Chris is scheduled to present at the following events. Course enrollment is open to the general public.

All these courses are also available on site (see below).

August 18, 2019
San Diego, CA
Data Modeling in the Age of Big Data
Registration: TDWI San Diego
August 20, 2019
San Diego, CA
Data Architecture: Managing Information in the Age of Big Data
Registration: TDWI San Diego
August 20, 2019
San Diego, CA
Workshop: Building the Business Case for Advanced Analytics
Registration: TDWI San Diego Strategy Summit
Monday October 21, 2019
San Francisco, CA
TDWI Dimensional Data Modeling Primer: From Requirements to Business Analysis
Registration: TDWI Seminars
Tuesday October 22, 2019
San Francisco, CA
Advanced Dimensional Modeling: Techniques for Practitioners
Registration: TDWI Seminars
Wednesday October 23, 2019
San Francisco, CA
Dimensional Models: What’s New in the Big Data Era
Registration: TDWI Seminars
November 12, 2019
Orlando, FL
Data Architecture: Managing Information in the Age of Big Data
Registration: TDWI Orlando
November 12, 2019
Orlando, FL
The Dimensional Model Refactored: New Techniques for the 21st Century
Registration: TDWI Orlando
November 15, 2019
Orlando, FL
Advanced Dimensional Modeling: Complete Tour of Modern Best Practices
Registration: TDWI Orlando

Onsite Education

You can bring Chris to your team for interactive education.

Dimensional Modeling

Chris provides full-day and expanded two-day courses covering the dimensional design concepts from Star Schema: The Complete Reference.
TDWI Courses

Chris teaches select TDWI courses that cover topics like data BI fundamentals, performance management, business analytics, dashboards and scorecards, and more.

All of Chris's education offerings are provided through TDWI.

For information on onsite offerings, contact TDWI Onsite Education. or Oakton Software