Monday, August 29, 2011

Slowly Changing Facts?

This post looks at the issues faced when facts can change retroactively.
Q: Our source system allows facts to be updated.  We are trying to come up with a way to preserve this history in the data warehouse.  One idea is to use the type 2 technique for facts.  We call it "slowly changing facts."  Is this a good solution?
A: When transactions are changed retroactively, you should update corresponding facts. If you need to track the history of facts, do so in separate audit tables.

To understand why, we will  look at what happens if you record multiple "versions" of a fact in your fact tables. But first, a refresher on the type 2 slow change technique.

Slowly changing dimensions

Dimensional modelers must decide what will happen when the source data for a dimension attribute changes. There are several possible responses to a change; one of them is known as the "Type 2" response.

The type 2 response calls for keeping a sort of "version history" in the dimension table. For example, if a customer's address changes, we insert a second record into the dimension for the same customer.  This way, the old "version" of the customer can be linked to facts that took place prior to the change, and the new version can be linked to new facts.

I covered the basics in more detail in a previous post; you can also refer to the slowly changing dimensions category for more posts on the topic.

Keeping the history of facts

A fact table contains a record of things that happened.  A customer ordered 200 widgets yesterday; another ordered 300 today. Measurement of these events can be aggregated across various dimensions to answer a variety of questions.

But sometimes, what we know about an event changes.  We originally thought 200 widgets were ordered yesterday, but now we know that 250 were ordered.

At first glance,  it may seem like the same "type 2" technique will solve the problem.  When details about a specific transaction are changed, simply insert a new row into the fact table.  The most current version can be flagged for easy access.

Unfortunately, this seriously harms the usability of the fact table. When people study orders, they must now be sure to single out a single version for each transaction to avoid double counting. You can try to resolve this by adding multiple date-stamps to each record, or converting changes into "deltas", but the damage is done. The model is open to serious misuse.

We'll look at the solution in a moment, but first a look at the same problem in a snapshot model.

Snapshots exhibit the same problems

Another kind of fact table, the periodic snapshot, tracks status at predefined intervals.  A financial data mart, for example, might log the balance of each account in the general ledger at the close of business each day.

Here too, our understanding of what happened may change.  At the close of business Monday, a particular account contains US$ 12,000.  On Tuesday, back-dated transactions change that balance to $13,000.

If we try to track both versions of Monday's balance, usability is damaged in the same way.  We must always qualify our analysis for (1) the date we are studying, and (2) the "as of" date of our analysis.  If we don't, grave errors ensue.

Here, usability problems are compounded by density issues.  Tracking the daily balance of each account over 5 years would normally require 1,826 rows per account (365 days x 5 years = 1,825 days, plus 1 leap-day).  If we must also re-record balances for all prior days, the number grows to an astounding 1,668,051.  (This is a triangular number, computed as ( n2+n ) / 2 where n is the number of days.)

If there are 500 accounts in the general ledger, that's the difference between 913,000 rows in a standard periodic snapshot, and 834,025,500 for one that logs history.  As each additional day is added, the difference increases.

The alternative 

There is only on version of the truth, and this is what belongs in the fact table.

When our understanding about an event has changed, the corresponding facts should be updated. This holds for both transaction models and snapshot models.

If you must be able to produce an audit trail, do so in a separate table.  The audit table need only capture the transaction identifier, the date, and the value of each fact. Record rows for the original fact and each revision.

If need be, this can also be organized dimensionally as a fact table. Keep in mind, however, that using this fact table will be subject to the same usability challenges.  And ask the business if all this trouble is really necessary.

For a snapshot model, and audit table may not be necessary; a corresponding transaction table may be able to hold the history. Each transaction that affects status is recorded with 2 dates: date of the transaction, and the date we became aware of the information (often a "posted" date).

Monday, August 15, 2011

Multiple Stars and Conformed Dimensions

The concept of conformed dimensions is central to the discipline of dimensional modeling. This post introduces conformed dimensions; future posts will explore them in more detail. 

In your data warehouse, each star corresponds to a business process. Combining facts from different processes can produce powerful compound metrics. The key to making this work is a set of conformed dimensions.

Conformed dimensions are closely associated with Kimball's "Bus Architecture," but are crucial in any scenario that involves dimensional data.

Multiple Stars

In a dimensional design, each star captures collects measurements that describe a discrete business process.

If we have two measurements that describe different processes, we place them into separate fact tables.

For example, a sales data mart may contain multiple stars:
  • Proposal information by Salesperson, Prospect, Product, Proposal and Proposal Date
  • Order information by Salesperson, Customer, Product, Proposal, Contract and Order Date
  • Shipping information by Salesperson, Customer, Product, Proposal, Contract, Shipment, Shipper, and Shipment Date
  • Return information by Salesperson, Customer, Product, Contract, Reason and Return Date
In each of these stars, the fact table will record measurements that describe the processes of issuing proposals, taking orders, shipping product and handling returns.

By recording measurements of each process in a different star, we are able to capture information at the most detailed level possible.  We can study each of these processes, complete with attendant details, by accessing the appropriate star.

Cross-process Metrics

Some of the most powerful measurements actually combine information from multiple processes. These metrics require combining facts from different stars.

In the sales data mart, the ratio of proposals to orders is the "close rate," a powerful indicator that sales managers and executives look at on a regular basis.

Similarly, the ratio of shipments to returns is the "return rate," an essential quality control metric.

Drilling Across

When we compare facts from different stars, we don't simply join the fact tables.  To do so might cause double-counting of some facts.

Instead we follow a process that Kimball calls drilling across.  The drill-across process can be broken down into two phases.  In the first phase, each star is queried, and results are aggregated to a common level of detail.  In the second phase, these result sets are merged based on their common dimensions.

For example, to compute the return rate by product for August of 2011, we do the following:
  1. a. Figure out quantity shipped by product for August 2011
    b. Figure out quantity returned by product for August 2011
  2. Merge these amounts based on the common product names and compute the ratio
We may perform this drill across operation at query time (many BI tools can do this automatically), or we may do it at ETL time, storing the results in a separate star or cube (sometimes called a second-line data mart.) 

Conformed Dimensions

The key to making all this work is the organization of the dimensions.  As you saw in the example above, we used the dimension values to link our results together: product names were used to merge together shipment and return quantities and compute their ratio.  This would not have worked if the product dimensions for Shipments and Returns had been different.

This is the basic idea behind conformed dimensions.  We say that two dimensions conform if they have the same structure and content.  Both our stars, had a product dimension table with a product name attribute, and the product names were specified the same way in each.  Sharing a single physical table is one way to ensure conformance, but it is not required.

Two dimensions can also conform if one has a subset of the other's attributes.  As long as the common attributes have the same structure and content, they are said to conform.

Planning Conformance

By planning a set of conformed dimensions, we ensure that fact tables can be used to compare processes.  This is important within a single data mart, such as the one above, and it is also important when looking across multiple data marts.

Conformed dimensions are the organizing principle in Kimball's architecture.  Conformed dimensions are planned up-front, as a part of a project that establishes a data architecture based on dimensional design. Implementation proceeds once this conformance bus has been planned.

The concept is also important in other architectures. For example, the sales data mart discussed above might be part of Corporate Information Factory architecture.  Within this data mart, conformance guarantees we can compare shipments to returns, proposals to orders, and so forth.

More to come

In the coming weeks, I will post more about conformed dimensions.  We will look at "levels" of conformance, how to document conformed dimensions, and how different tools work with conformed dimensions. If you have my book, I also encourage you to read Chapters 4 and 5, which look at these concepts in detail.

Photo by Agnes Periapse, licensed under Creative Commons 2.0