Tuesday, October 9, 2007

For Slowly Changing Dimensions, "Change" is Relative

There's a difference between the way we think about Slowly Changing Dimensions and the way we document them. In this post, I'll highlight this difference by examining the two most common Slow Change techniques.

The term "slowly changing dimension" originated with Ralph Kimball, who identified three techniques for dealing with changed data. Commonly abbreviated as SCD's, these techniques are applied in any form of dimensional design, regardless of the data warehouse architecture.

In practice, there is a subtle but importance between the way we think about these changes and the way we describe them in a dimensional design. This sometimes leads to confusion.

Before I explain this important distinction, let me review the difference between surrogate and natural keys, and describe the two most common SCD techniques. (Future posts will look at other slow change techniques.)

Natural Keys and Surrogate Keys

We usually think of dimension tables in a star schema as corresponding to something in a source system. For example, each row in a customer dimension table relates to a single customer in a source system. Each column is loaded from one or more sources, based on a set of rules. The link back to a source system is preserved in the form of a natural keyusually a unique identifier in a source system, such as a customer_id.

But the star schema design does not rely on this natural key, or business key, to uniquely identify rows in dimension tables. Instead, a surrogate key is introduced. This surrogate key gives the dimensional design flexibility to handle changes differently than they are handled in source systems, while preserving the ability to perform joins using a single column.

Type 1 and Type 2 Slow Changes

Slowly changing dimension techniques determine how the dimensional model will respond to changes in the source system. If the customer with id 8472 changes, what do we do with that change? Alert readers may already be concerned about what I mean by "change" here, but let's first recap the two most common techniques.

  • Type 1: Update When the dimensional model responds to a change in source data by updating a column, Kimball calls this a type 1 change. For example, if a customer's date of birth changes, it is probably appropriate to update the corresponding row for that customer in the dimension table.

    Under this scenario, any facts that were already associated with the dimension table row have effectively been revised as well. A report of sales dollars by date of birth, for example, will provide different results immediately before and after the type 1 change is applied. The type 1 change does not preserve history of the attribute value.
  • Type 2: New Row A more common response to a changed data element is to insert a new row into the dimension table. For example, when the address of customer 8472 changes, we create a new row for the customer in the dimension table. This row has a different surrogate key, and the new address. Customer 8472 now has two rows in the dimension, each with its own surrogate key.

    This preserves the history of the attribute, and does not revise any previously stored facts. New facts will be associated with the new version of customer 8472; old facts remain associated with the old version.

For the most part, these two techniques form the basis of a dimensional model's response to change. (Future posts will consider the less common type 3 change, and additional techniques.) While these concepts are fairly easy to understand, it is important to look a bit deeper.

We think about slow changes with respect to the source

Notice the way that the original problem was framed. I asked how the dimensional schema would "respond to changes in the source data." This is how we usually think about the problem, and for good reason. After all, the source data exists before it is loaded into the dimensional schema. If birth_date changes, we overwrite; if address changes, we insert a new record.

Now observe that a change to the source does not always result in a change in the dimensional schema. In the example, a change in address resulted in a new rownot a changed row. No data is changed.

Still, we refer to this process as the occurrence of a type 2 change. Why? Because we think about slow changes with respect to the source data. And there, a change did occur.

We document slow changes with respect to the star

The most common way to document the dimensional schema's response to change is on the dimensional side, on an attribute by attribute basis. For each column in a dimension table, we note how changes in the source data will be handled. Our customer example might be documented as follows:



In the diagram, each non-key attribute is tagged with a 1 or a 2. This indicates whether changes in the source of the attribute should be handled as type 1 or type 2 changes.

Documenting SCD behavior in this way is handy. ETL developers use this information to design a scheme for performing incremental loads. Report developers use this information to understand how facts will be grouped when combined with different dimension attributes.

The only drawback to documenting SCD rules in this way is that it can lead to confusion. By tagging an attribute as a "type 2 SCD" we risk implying that attribute values may change. After all, the "C" in "SCD" stands for "change."

But of course, this attribute does not change. Rather, its classification as a type 2 attribute means "for a given natural key, if the source for this attribute undergoes a change, it will be necessary to insert a new row. "

In future posts, I will look at some common misconceptions about slowly changing dimensions, and discuss additional techniques for handling changes.

Copyright (c) 2007 Chris Adamson