Monday, December 13, 2010

Deeper into the Accumulating Snapshot

The most popular posts on this blog discuss accumulating snapshot designs. In this post, I'll share some new insights on the technique, which come from my book Star Schema: The Complete Reference.

In particular, I want to debunk some common misconceptions about accumulating snapshots.

The Accumulating Snapshot

One of the three major kinds of fact table, the accumulating snapshot is the least understood and the most underutilized. 

I provided a basic description of the technique several years ago in the first post to this blog.  The basic concept is simple: each row in this kind of fact table tracks a single item or entity as it moves through a business process over time.  The idea is to facilitate reports that look at how long things spend in the various processing stages.

For example, the post mentioned above describes an accumulating snapshot that tracks the processing of mortgage applications.  Each row in the fact table tracks a single application, and is updated each time it completes a processing stage.  It contains foreign key references to multiple dates:
  • Submitted Date
  • Approval Date
  • Processed Date
  • Underwritten Date
  • Settlement Date
Corresponding to each of these milestones is a fact that indicates how long the application spent in that stage.  In the previous post, I called these facts "lags," but I'll suggest a different way of looking at them in a moment.
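
To make the grain concrete, here is a minimal sketch of one such row, written as a Python dataclass.  The names are my own illustrations, not a prescription from the design:

from dataclasses import dataclass
from datetime import date
from typing import Optional

# One row per mortgage application.  In a real star the dates would be
# surrogate keys into a day dimension; plain dates keep the sketch short.
@dataclass
class MortgageApplicationFact:
    application_key: int                      # FK to the application dimension
    submitted_date: Optional[date] = None
    approval_date: Optional[date] = None
    processed_date: Optional[date] = None
    underwritten_date: Optional[date] = None
    settlement_date: Optional[date] = None
    approval_lag: int = 0       # days spent awaiting approval
    processing_lag: int = 0     # days spent in processing
    underwriting_lag: int = 0   # days spent in underwriting
    settlement_lag: int = 0     # days spent awaiting settlement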

These facts make it easy to study the average time items spend in each stage. Without an accumulating snapshot, this might require correlated subqueries or the comparison of detail rows from multiple fact tables.
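
For instance, once the lag is stored as a fact, the average time per stage reduces to a plain aggregate.  A small, self-contained sketch using Python's sqlite3 module, with invented names and data:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE mortgage_application_facts (
    application_key INTEGER,
    approval_lag    INTEGER,   -- days awaiting approval
    processing_lag  INTEGER    -- days in processing
);
INSERT INTO mortgage_application_facts VALUES (1, 4, 10);
INSERT INTO mortgage_application_facts VALUES (2, 6, 12);
INSERT INTO mortgage_application_facts VALUES (3, 2, 8);
""")

# Average time spent in each stage: a plain aggregate query, with no
# correlated subqueries and no comparison across fact tables.
avg_row = conn.execute("""
    SELECT AVG(approval_lag), AVG(processing_lag)
    FROM mortgage_application_facts
""").fetchone()
print(avg_row)  # (4.0, 10.0)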

Unique Characteristics

Unlike other kinds of fact tables, the accumulating snapshot is designed to have its rows updated.  When an application reaches a new status, for example, the appropriate date is added.  I've touched on this topic in a prior Q and A post.
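
A minimal sketch of such an update, again using sqlite3 and invented names:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE mortgage_application_facts (
    application_key INTEGER PRIMARY KEY,
    submitted_date  TEXT,
    approval_date   TEXT
)
""")
conn.execute(
    "INSERT INTO mortgage_application_facts VALUES (42, '2010-11-01', NULL)"
)

# When application 42 reaches the approved milestone, its existing row is
# updated in place rather than inserted as a new row.
conn.execute(
    "UPDATE mortgage_application_facts SET approval_date = ? "
    "WHERE application_key = ?",
    ("2010-12-13", 42),
)
print(conn.execute(
    "SELECT approval_date FROM mortgage_application_facts"
).fetchone())  # ('2010-12-13',)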

Another characteristic that differs from other kinds of fact tables is the size of the accumulating snapshot.  It usually has fewer rows than the dimension table representing the item being processed.  That's because the item itself is likely to undergo type 2 slow changes.  I've noted this in a previous post as well.
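
A toy illustration of this cardinality, with invented rows: the type 2 dimension carries one row per version of an application, while the snapshot carries just one row per application:

# Type 2 changes produce one dimension row per version of an application,
# while the accumulating snapshot keeps exactly one row per application.
application_dim = [
    {"application_key": 101, "application_id": "A-1", "status": "Submitted"},
    {"application_key": 102, "application_id": "A-1", "status": "Approved"},
    {"application_key": 103, "application_id": "A-1", "status": "Processed"},
]
snapshot_facts = [
    # One row for application A-1, pointing at its current version (103)
    {"application_key": 103, "days_processing": 10},
]
print(len(application_dim), ">", len(snapshot_facts))  # 3 > 1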

Lastly, the accumulating snapshot is essentially current-valued.  It is used to study the net effects of a process up to and including the current point in time.

Non Linear Processes

In the past, I've written that the accumulating snapshot is only effective for processes that follow a consistent, predictable series of steps.  (See my previous books in the sidebar to this blog.)

But the technique can easily be adapted to accommodate processes that are not linear.

For example, suppose a mortgage application can bounce back to the processing stage if the underwriter discovers some missing documentation. Conventional wisdom holds that this will "break" the accumulating snapshot, since it violates the standard sequence of status milestones.

This non-linear behavior is easily accounted for. First, we will rename our "lag" facts to represent the number of days spent at each stage:
  • Days awaiting approval
  • Days processing
  • Days underwriting
  • Days awaiting settlement
The ETL process will be developed to check each open application on a daily basis, and to increment the appropriate fact for its current processing stage.

This means it does not matter if an application reaches the underwriting stage and then gets sent back to the processing stage.  If this occurs, the ETL process will simply resume incrementing the "days processing" fact.
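
Here is a minimal sketch of that daily pass in Python; the stage names, fact names, and the function itself are assumptions for illustration:

# Maps an application's current stage to the fact that accumulates time
# there.  Stage names and fact names are invented for illustration.
STAGE_TO_FACT = {
    "awaiting approval":   "days_awaiting_approval",
    "processing":          "days_processing",
    "underwriting":        "days_underwriting",
    "awaiting settlement": "days_awaiting_settlement",
}

def daily_increment(open_applications):
    """Nightly pass: add one day to the fact for each open application's
    current stage.  An application bounced from underwriting back to
    processing simply resumes accumulating days_processing."""
    for app in open_applications:
        app[STAGE_TO_FACT[app["stage"]]] += 1

# An application that was sent back from underwriting to processing:
app = {"stage": "processing", "days_awaiting_approval": 4,
       "days_processing": 9, "days_underwriting": 3,
       "days_awaiting_settlement": 0}
daily_increment([app])
print(app["days_processing"])  # 10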

One thing that will need to be worked out is which dates to use for each milestone. For the application in question, once the processor finishes with it for the second time, what is the processed date?  The date it first achieved this status, or the second?  Business rules must drive these decisions.
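
A small sketch of how such a rule might be parameterized; the function and the "first" versus "latest" rule names are my own assumptions:

from datetime import date

def record_milestone(row, column, achieved_on, rule="first"):
    """Apply a business rule for a milestone date: 'first' keeps the date
    the status was originally achieved; 'latest' overwrites it whenever
    the status is reached again."""
    if rule == "latest" or row.get(column) is None:
        row[column] = achieved_on

row = {"processed_date": date(2010, 11, 20)}
# The application returns to processing and completes it a second time.
record_milestone(row, "processed_date", date(2010, 12, 10), rule="first")
print(row["processed_date"])  # 2010-11-20, the first completion stands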

Long Duration Processes

We are also accustomed to thinking of accumulating snapshots as appropriate only for short-duration processes.

However, my customers and students have found success in modeling activities of significant duration. Examples include the tenure of faculty members, the life cycle of a service contract, the duration of a mortgage or insurance policy, and the incarceration of a prisoner.

Some of these processes do not have fixed durations, and may prove to span years or even decades. As long as it is feasible for the ETL process to continually update records for active entities in the process, the approach is effective.

Going Even Deeper

Several other topics are worth considering:
  • You may have an operational system that logs very detailed status changes. It is possible to map a set of hundreds of statuses into a few key milestones for an accumulating snapshot. 
  • If the item or entity being processed undergoes slow changes, update its surrogate key in the corresponding fact table row.  This ensures the fact table always points to the most recent version of the item.
  • Some milestones make useful dimension attributes as well. Incorporate the dates into the dimension table as appropriate.  For example, "date of approval" may be a useful characteristic in the mortgage application dimension.
  • For milestones that have not been achieved, don't include a NULL-valued foreign key in the fact table.  Instead, establish a special row in your day dimension for events that have yet to occur.  Set its attributes to "N/A" and give it the highest possible date supported by your DBMS.  This will simplify joins and query predicates, as illustrated in the sketch following this list.
  • If at all possible, source the accumulating snapshot from other fact tables that log status changes as transactions.  This will make the ETL process much simpler. 
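
To illustrate the special day dimension row mentioned in the list above, here is a small sketch using sqlite3; the key value 99999999 and all names are invented:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE day_dim (
    day_key   INTEGER PRIMARY KEY,
    full_date TEXT,
    year_name TEXT
);
-- Special row for milestones that have yet to occur: attributes read
-- "N/A" and the date is the highest the DBMS supports.
INSERT INTO day_dim VALUES (99999999, '9999-12-31', 'N/A');
INSERT INTO day_dim VALUES (20101213, '2010-12-13', '2010');

CREATE TABLE mortgage_application_facts (
    application_key    INTEGER,
    settlement_day_key INTEGER   -- never NULL; 99999999 until settled
);
INSERT INTO mortgage_application_facts VALUES (1, 20101213);
INSERT INTO mortgage_application_facts VALUES (2, 99999999);
""")

# Applications not yet settled as of a given date: an ordinary inner join
# and range predicate, with no outer joins or IS NULL tests required.
rows = conn.execute("""
    SELECT f.application_key
    FROM mortgage_application_facts f
    JOIN day_dim d ON d.day_key = f.settlement_day_key
    WHERE d.full_date > '2010-12-13'
""").fetchall()
print(rows)  # [(2,)]
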
These topics are all discussed in the book.

You will find the same kind of deep dive into every topic it covers.  You can help support this blog by using the links in the sidebar to order a copy.

Thanks for your support, and feel free to send in your questions!

Image Credit:  Creativity103 via Creative Commons