Monday, August 29, 2011

Slowly Changing Facts?

This post looks at the issues faced when facts can change retroactively.
Q: Our source system allows facts to be updated.  We are trying to come up with a way to preserve this history in the data warehouse.  One idea is to use the type 2 technique for facts.  We call it "slowly changing facts."  Is this a good solution?
A: When transactions are changed retroactively, you should update corresponding facts. If you need to track the history of facts, do so in separate audit tables.

To understand why, we will  look at what happens if you record multiple "versions" of a fact in your fact tables. But first, a refresher on the type 2 slow change technique.

Slowly changing dimensions

Dimensional modelers must decide what will happen when the source data for a dimension attribute changes. There are several possible responses to a change; one of them is known as the "Type 2" response.

The type 2 response calls for keeping a sort of "version history" in the dimension table. For example, if a customer's address changes, we insert a second record into the dimension for the same customer.  This way, the old "version" of the customer can be linked to facts that took place prior to the change, and the new version can be linked to new facts.

I covered the basics in more detail in a previous post; you can also refer to the slowly changing dimensions category for more posts on the topic.

Keeping the history of facts

A fact table contains a record of things that happened.  A customer ordered 200 widgets yesterday; another ordered 300 today. Measurement of these events can be aggregated across various dimensions to answer a variety of questions.

But sometimes, what we know about an event changes.  We originally thought 200 widgets were ordered yesterday, but now we know that 250 were ordered.

At first glance,  it may seem like the same "type 2" technique will solve the problem.  When details about a specific transaction are changed, simply insert a new row into the fact table.  The most current version can be flagged for easy access.

Unfortunately, this seriously harms the usability of the fact table. When people study orders, they must now be sure to single out a single version for each transaction to avoid double counting. You can try to resolve this by adding multiple date-stamps to each record, or converting changes into "deltas", but the damage is done. The model is open to serious misuse.

We'll look at the solution in a moment, but first a look at the same problem in a snapshot model.

Snapshots exhibit the same problems

Another kind of fact table, the periodic snapshot, tracks status at predefined intervals.  A financial data mart, for example, might log the balance of each account in the general ledger at the close of business each day.

Here too, our understanding of what happened may change.  At the close of business Monday, a particular account contains US$ 12,000.  On Tuesday, back-dated transactions change that balance to $13,000.

If we try to track both versions of Monday's balance, usability is damaged in the same way.  We must always qualify our analysis for (1) the date we are studying, and (2) the "as of" date of our analysis.  If we don't, grave errors ensue.

Here, usability problems are compounded by density issues.  Tracking the daily balance of each account over 5 years would normally require 1,826 rows per account (365 days x 5 years = 1,825 days, plus 1 leap-day).  If we must also re-record balances for all prior days, the number grows to an astounding 1,668,051.  (This is a triangular number, computed as ( n2+n ) / 2 where n is the number of days.)

If there are 500 accounts in the general ledger, that's the difference between 913,000 rows in a standard periodic snapshot, and 834,025,500 for one that logs history.  As each additional day is added, the difference increases.

The alternative 

There is only on version of the truth, and this is what belongs in the fact table.

When our understanding about an event has changed, the corresponding facts should be updated. This holds for both transaction models and snapshot models.

If you must be able to produce an audit trail, do so in a separate table.  The audit table need only capture the transaction identifier, the date, and the value of each fact. Record rows for the original fact and each revision.

If need be, this can also be organized dimensionally as a fact table. Keep in mind, however, that using this fact table will be subject to the same usability challenges.  And ask the business if all this trouble is really necessary.

For a snapshot model, and audit table may not be necessary; a corresponding transaction table may be able to hold the history. Each transaction that affects status is recorded with 2 dates: date of the transaction, and the date we became aware of the information (often a "posted" date).