Sunday, February 19, 2012

Accumulating snapshots and Type 2 changes

How does one maintain an accumulating snapshot when the dimension that defines its grain can undergo type 2 changes?

A reader recently asked this question via the comments for a recent post.

The short answer:  an accumulating snapshot has only 1 row for each thing being tracked.  If that "thing" can undergo type 2 changes, then the accumulating snapshot should link to only one version -- the current version.

State transitions

Many businesses track items that go through state transitions.  A trouble ticket, for example, is opened, assigned to a support rep, and eventually closed.  A mortgage application is submitted, reviewed, processed, underwritten, and settled upon.

These activities may be represented by a series of fact tables associated with the various process milestones.  Or, they may be captured by a single fact table that records a row for each status change.

When it comes time to study the average time spent at the various stages, however, these options may not be ideal.  In the former case, it will be necessary to drill across several fact tables.  In the latter case it will be necessary to perform correlated subqueries.

The accumulating snapshot model addresses these concerns.

The grain of an accumulating snapshot

An accumulating snapshot receives one row for each item being tracked.  This single row is updated as the item reaches various milestones.  A series of facts are incremented for each day spent in a given stage.  When a milestone is achieved, the appropriate date key is populated.

(For a full refresher on the accumulating snapshot, see the links at the end of this post.)

Unlike other fact tables, then, the accumulating snapshot always reflects the current state of affairs.  In the case of a mortgage processing fact table, for example, each row shows a single application, along with how many days it has spent at each processing stage.

This makes it very easy to study the average time between various stages -- just take the average of the facts in question.  No drilling across, no correlated subqueries.

Changes in the defining dimension

Back to the original question, then. What happens if the dimension that defines the grain of the accumulating snapshot can undergo type 2 slow changes?

The goal of the accumulating snapshot is to summarize each real world item in a single row.  Even if the dimension contains multiple rows for a particular item, the fact table must contain only one. If it contained more than one, its usability would harmed.

As I have written before, this means the defining dimension may have more rows than the accumulating snapshot.  This is OK.  Its just another way in which the accumulating snapshot differs from other types of fact tables.

Our mortgage application, then, must have one and only one row in the fact table -- even if type 2 changes mean there are more than one rows for it in the dimension.

Linking to the current row

The question, then, becomes which row should the fact table link to?

Because the accumulating snapshot is current-valued, it makes the most sense to have it link to the row in the dimension table that represents the current state of the item being tracked.  In our example, this would be the most recent version of the mortgage application.

If you increment the facts in your accumulating snapshot daily (something I suggest in my book), you should also take this opportunity to adjust the key values for any items that have undergone a type 2 change.  If you adjust the facts only when a milestone is met (something you may do if the volume is very high), you may choose to update the key value at this time instead.

Learn more

This and other topics surrounding the accumulating snapshot are discussed in depth in Star Schema: The Complete Reference, in Chapter 11: "Transactions, Snapshots and Accumulating snapshots."

You can help support this blog by using the links in the sidebar to purchase a copy.

You can also read more in these posts:
  • When do you need an accumulating snapshot (January 26, 2012) a recent post that triggered the question, wherein a reader asked how to choose between building an accumulating snapshot and tracking something as a series of type 2 changes. In some cases, the answer is to do both.
  • Q&A: Accumulating snapshots (October 1, 2010) Explores the cardinality relationship between accumulating snapshot and dimension table 

Image Credit:  Creativity103 via Creative Commons

Thursday, January 26, 2012

When do you need an accumulating snapshot?

A reader wonders how to decide between two options:  designing an accumulating snapshot vs. tracking status changes within a dimension.

I have new project to track the status of order transition. I'm unable to reach a conclusion as to implement as an accumulating snapshot or a type 2 slowly changing dimension. ETL  integrates a number of individual systems as the order transits each stage. What is the best way to design it?
Kumar
Milton Keynes, UK

Many businesses have one or more central dimensions that undergo a state transition as they touch multiple processes. Is it enough to track the changes to the dimension?  Or is an accumulating snapshot needed?

I'll walk you through some decision criteria.  But first a little refresher on the two design options the reader is considering.

Type 2 changes and timestamps


Type 2 changes track the history of something represented by a dimension.  Each time there is a change to this item, a new row is inserted in to the dimension table.

This allows any row in a fact table to be associated with a historically accurate version of the dimension as of the relevant point in time.

In the reader's case, status of the order might be an attribute of an Order dimension.  Modeled as a type 2 change, the dimension holds a status history for each order.  Adding effective and expiration dates to each row, you know exactly when each state transition occrred.

Accumulating snapshots

An accumulating snapshot is a type of fact table that records a single row for something the enterprise tracks closely, such as a trouble ticket or mortgage application--or, in the reader's case, an order.

This fact table contains multiple references to the date dimension -- one for each of the major milestones that the item in question can reach.  In the case of the order, this might be the date of order, the date of credit approval, the date of picking, the date of shipment and the date of delivery.

Unlike other kinds of fact tables, the accumulating snapshot is intended to be updated.  These dates are adjusted each time one of the milestones is reached.

There may also be facts that track the number of days (or minutes) spent between each milestone.  These "lags" are a convenience -- they can be computed from the dates.  (Building them into the fact table makes analysis much easier, but does require that the ETL process revisit rows on a regular basis, rather than when status changes.)

Avoiding correlated subqueries

If a type 2 slowly changing dimension with timestamps tracks the history, why consider an accumulating snapshot?

The analytic value of the accumulating snapshot is that it allows us to study the time spent at various stages. In the reader's case, it can make it simple to study the average time an order spends in the "picking" stage, for example.

We can do this with a type 2 slowly changing dimension as well, but it will be more difficult to study the average time between stages. For the order in question, days spent in the picking stage requires knowing the date of credit approval and the date picked.  These will be in two different rows of the dimension.  Now imagine doing this for all orders placed in January 2012.  This will require a correlated subquery.

The accumulating snapshot pre-correlates these events and places them in a single row.  This makes the queries much easier to write, and they are likely to run faster as well.  The cost, of course, is the increased data integration burden of building the additional fact table.

Avoiding drilling across

When each of the discrete milestones is captured by a different fact table, lag may be computed without correlated subqueries.  In this case, it will involve drilling across.

For example, separate fact tables track orders, credit approvals, picking and shipping.  Each references the order dimension.  Days spent in the picking stage can be studied by drilling across credit approvals and picking, with results linked by the common order dimension. 1

Here, the pressure for an accumulating snapshot is reduced.  It may still be warranted, depending on your reporting tools, developer skills an user base.

Summary and final advice

In the end, your decision should boil down to the following:
  1. An accumulating snapshot should only be considered if you are studying the time spent between major milestones
  2. If it helps avoid correlated subqueries, it may be a strong option
  3. If it avoids drill-across queries, it may be a useful option
Making the choice will impact several groups -- ETL developers, report developers, and potentially users.  Make sure this is a shared decision.

Also keep in mind the following:
  • If you build an accumulating snapshot, you will probably also want to track status in the dimension as a type 2 change.
  • Accumulating snapshots work best where the milestones are generally linear and predicable.  If they are not, the design and maintenance will be significantly more complex.
Last but not least:
  • The accumulating snapshot should be derived from one or more base fact tables that capture the individual activities.  
When in doubt, build the base transaction-grained fact tables first.  You can always add an accumulating snapshot later. 
Learn more

This is a popular topic for this blog.  Here are some places where you can read more:
  • Q&A: Accumulating snapshots (October 1, 2010) Explores the cardinality relationship between accumulating snapshot and dimension table
And of course, these concepts are covered extensively in my books.  In the latest one, Star Schema: The Complete Reference, the following chapters may be of interest:
  • Chapter 8, "More Slow Change Techniques" discusses time stamped tracking of slow changes
  • Chapter 11, "Transactions, Snapshots and Accumulating Snapshots" explores the accumulating snapshot in detail.
  • Chapter 14, "Derived Schemas", discusses derivation of an accumulating snapshot from transaction-grained stars.
  • Chapter 4, "A Fact Table for Each Process", includes a detailed discussion of dril-across analysis.

Image Credit:  Creativity103 via Creative Commons

1 Note that this scenario applies to the reader, but does not always apply.  Trouble tickets, for example, may be tracked in a single fact table that receives a new row for each status change.  In this case, there is no drill-across option.