Thursday, January 26, 2012

When do you need an accumulating snapshot?

A reader wonders how to decide between two options:  designing an accumulating snapshot vs. tracking status changes within a dimension.

I have a new project to track the status of order transitions. I'm unable to reach a conclusion as to whether to implement it as an accumulating snapshot or a type 2 slowly changing dimension. ETL integrates a number of individual systems as the order transits each stage. What is the best way to design it?
Kumar
Milton Keynes, UK

Many businesses have one or more central dimensions that undergo a state transition as they touch multiple processes. Is it enough to track the changes to the dimension?  Or is an accumulating snapshot needed?

I'll walk you through some decision criteria.  But first a little refresher on the two design options the reader is considering.

Type 2 changes and timestamps


Type 2 changes track the history of something represented by a dimension.  Each time there is a change to this item, a new row is inserted into the dimension table.

This allows any row in a fact table to be associated with a historically accurate version of the dimension as of the relevant point in time.

In the reader's case, status of the order might be an attribute of an Order dimension.  Modeled as a type 2 change, the dimension holds a status history for each order.  By adding effective and expiration dates to each row, you know exactly when each state transition occurred.
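To make this concrete, here is a minimal sketch of such a dimension, using SQLite via Python.  The table and column names, and the '9999-12-31' convention for the current row, are illustrative assumptions, not prescriptions:

```python
import sqlite3

# Sketch of a type 2 Order dimension with status timestamps.
con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE order_dim (
        order_key       INTEGER PRIMARY KEY,  -- surrogate key
        order_id        TEXT,                 -- natural key
        order_status    TEXT,
        effective_date  TEXT,
        expiration_date TEXT                  -- '9999-12-31' marks the current row
    )""")

# Each status change inserts a new row and expires the previous one.
rows = [
    (1, "ORD-100", "Ordered",         "2012-01-02", "2012-01-03"),
    (2, "ORD-100", "Credit Approved", "2012-01-03", "2012-01-05"),
    (3, "ORD-100", "Picked",          "2012-01-05", "9999-12-31"),
]
con.executemany("INSERT INTO order_dim VALUES (?,?,?,?,?)", rows)

# Find the historically accurate version of the order as of 2012-01-04.
status, = con.execute("""
    SELECT order_status FROM order_dim
    WHERE order_id = 'ORD-100'
      AND effective_date <= '2012-01-04'
      AND expiration_date >  '2012-01-04'
""").fetchone()
print(status)  # Credit Approved
```

The effective/expiration date pair is what lets any fact be matched to the dimension row that was in effect at the time.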

Accumulating snapshots

An accumulating snapshot is a type of fact table that records a single row for something the enterprise tracks closely, such as a trouble ticket or mortgage application--or, in the reader's case, an order.

This fact table contains multiple references to the date dimension -- one for each of the major milestones that the item in question can reach.  In the case of the order, this might be the date of order, the date of credit approval, the date of picking, the date of shipment and the date of delivery.

Unlike other kinds of fact tables, the accumulating snapshot is intended to be updated.  These dates are adjusted each time one of the milestones is reached.

There may also be facts that track the number of days (or minutes) spent between each milestone.  These "lags" are a convenience -- they can be computed from the dates.  (Building them into the fact table makes analysis much easier, but does require that the ETL process revisit rows on a regular basis, rather than only when status changes.)
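A rough sketch of how the ETL process might maintain such a row, again using SQLite via Python.  Table and column names are illustrative assumptions:

```python
import sqlite3

# Sketch of an accumulating snapshot for orders: one row per order,
# with a date column per milestone and a lag fact.
con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE order_snapshot_facts (
        order_id             TEXT PRIMARY KEY,
        order_date           TEXT,
        credit_approval_date TEXT,
        pick_date            TEXT,
        ship_date            TEXT,
        delivery_date        TEXT,
        days_to_pick         INTEGER  -- lag: credit approval -> picked
    )""")

# The row is inserted when the order is placed...
con.execute("""INSERT INTO order_snapshot_facts (order_id, order_date)
               VALUES ('ORD-100', '2012-01-02')""")

# ...then UPDATEd as each milestone is reached.
con.execute("""UPDATE order_snapshot_facts
               SET credit_approval_date = '2012-01-03'
               WHERE order_id = 'ORD-100'""")
con.execute("""
    UPDATE order_snapshot_facts
    SET pick_date = '2012-01-05',
        days_to_pick = CAST(julianday('2012-01-05')
                            - julianday(credit_approval_date) AS INTEGER)
    WHERE order_id = 'ORD-100'
""")

lag, = con.execute("""SELECT days_to_pick FROM order_snapshot_facts
                      WHERE order_id = 'ORD-100'""").fetchone()
print(lag)  # 2
```

Note the UPDATE statements: this is the one style of fact table that is routinely revised in place rather than only appended to.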

Avoiding correlated subqueries

If a type 2 slowly changing dimension with timestamps tracks the history, why consider an accumulating snapshot?

The analytic value of the accumulating snapshot is that it allows us to study the time spent at various stages. In the reader's case, it can make it simple to study the average time an order spends in the "picking" stage, for example.

We can do this with a type 2 slowly changing dimension as well, but it will be more difficult to study the average time between stages. For the order in question, days spent in the picking stage requires knowing the date of credit approval and the date picked.  These will be in two different rows of the dimension.  Now imagine doing this for all orders placed in January 2012.  This will require a correlated subquery.

The accumulating snapshot pre-correlates these events and places them in a single row.  This makes the queries much easier to write, and they are likely to run faster as well.  The cost, of course, is the increased data integration burden of building the additional fact table.
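For instance, against a hypothetical snapshot table (names are assumptions), the average time in the picking stage reduces to a simple aggregate over single rows:

```python
import sqlite3

# With milestones pre-correlated in one row, no correlated subquery
# is needed to study time between stages.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE order_snapshot_facts
               (order_id TEXT, order_date TEXT,
                credit_approval_date TEXT, pick_date TEXT)""")
con.executemany("INSERT INTO order_snapshot_facts VALUES (?,?,?,?)", [
    ("ORD-100", "2012-01-02", "2012-01-03", "2012-01-05"),  # 2 days in picking
    ("ORD-101", "2012-01-10", "2012-01-10", "2012-01-14"),  # 4 days in picking
])

# Average days from credit approval to picked, for January 2012 orders.
avg_days, = con.execute("""
    SELECT AVG(julianday(pick_date) - julianday(credit_approval_date))
    FROM order_snapshot_facts
    WHERE order_date BETWEEN '2012-01-01' AND '2012-01-31'
""").fetchone()
print(avg_days)  # 3.0
```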

Avoiding drilling across

When each of the discrete milestones is captured by a different fact table, lag may be computed without correlated subqueries.  In this case, it will involve drilling across.

For example, separate fact tables track orders, credit approvals, picking and shipping.  Each references the order dimension.  Days spent in the picking stage can be studied by drilling across credit approvals and picking, with results linked by the common order dimension. 1

Here, the pressure for an accumulating snapshot is reduced.  It may still be warranted, depending on your reporting tools, developer skills, and user base.

Summary and final advice

In the end, your decision should boil down to the following:
  1. An accumulating snapshot should only be considered if you are studying the time spent between major milestones
  2. If it helps avoid correlated subqueries, it may be a strong option
  3. If it avoids drill-across queries, it may be a useful option
Making the choice will impact several groups -- ETL developers, report developers, and potentially users.  Make sure this is a shared decision.

Also keep in mind the following:
  • If you build an accumulating snapshot, you will probably also want to track status in the dimension as a type 2 change.
  • Accumulating snapshots work best where the milestones are generally linear and predictable.  If they are not, the design and maintenance will be significantly more complex.
Last but not least:
  • The accumulating snapshot should be derived from one or more base fact tables that capture the individual activities.  
When in doubt, build the base transaction-grained fact tables first.  You can always add an accumulating snapshot later. 
Learn more

This is a popular topic for this blog.  Here are some places where you can read more:
  • Q&A: Accumulating snapshots (October 1, 2010) -- explores the cardinality relationship between an accumulating snapshot and a dimension table
And of course, these concepts are covered extensively in my books.  In the latest one, Star Schema: The Complete Reference, the following chapters may be of interest:
  • Chapter 8, "More Slow Change Techniques" discusses time stamped tracking of slow changes
  • Chapter 11, "Transactions, Snapshots and Accumulating Snapshots" explores the accumulating snapshot in detail.
  • Chapter 14, "Derived Schemas", discusses derivation of an accumulating snapshot from transaction-grained stars.
  • Chapter 4, "A Fact Table for Each Process", includes a detailed discussion of drill-across analysis.

Image Credit:  Creativity103 via Creative Commons

1 Note that this scenario applies to the reader, but does not always apply.  Trouble tickets, for example, may be tracked in a single fact table that receives a new row for each status change.  In this case, there is no drill-across option.

Friday, December 23, 2011

Three ways to drill across

Previous posts have explored the importance of conformed dimensions and drilling across.  This post looks at the process of drilling across in more detail. 

There are three primary methods for drilling across.  These methods can be leveraged manually, automated with BI software, or performed during the data integration process.

Querying multiple stars

Conformed dimensions ensure compatibility of information in various stars and data marts; drilling across is the process of bringing it together.  (For a refresher, see this post.)
 
For example, suppose we wish to compute "Return Rate" by product.  Return rate is the ratio of returns to shipments.

Information about shipments and returns is captured in two granular star schemas:
  • shipment_facts captures shipment metrics by Salesperson, Customer, Product, Proposal, Contract, Shipment, Shipper, and Shipment Date
  • return_facts captures return metrics by Salesperson, Customer, Product, Contract, Reason and Return Date
Reporting on return rate will require comparing facts from each of these stars. 

Drilling across

Recall that fetching facts from more than one fact table requires careful construction of queries. It is not appropriate to join two fact tables together, nor to link them via shared dimensions. Doing so will double-count facts, triple-count them, or worse.
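A tiny demonstration of the problem, using SQLite via Python, with the shipment and return stars reduced to illustrative two-column tables:

```python
import sqlite3

# Joining two fact tables through a shared dimension value fans out
# the rows and inflates the facts.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE shipment_facts (product TEXT, quantity_shipped INT)")
con.execute("CREATE TABLE return_facts   (product TEXT, quantity_returned INT)")
con.executemany("INSERT INTO shipment_facts VALUES (?,?)", [("P1", 6), ("P1", 4)])
con.executemany("INSERT INTO return_facts VALUES (?,?)",   [("P1", 2), ("P1", 1)])

# Correct total, queried from one fact table alone:
shipped, = con.execute(
    "SELECT SUM(quantity_shipped) FROM shipment_facts").fetchone()  # 10

# Joined on the common product, each shipment row matches each return
# row (2 x 2 = 4 result rows), so both totals are overstated:
bad_shipped, bad_returned = con.execute("""
    SELECT SUM(s.quantity_shipped), SUM(r.quantity_returned)
    FROM shipment_facts s JOIN return_facts r ON s.product = r.product
""").fetchone()
print(shipped, bad_shipped, bad_returned)  # 10 20 6
```

Ten units were shipped and three returned, yet the joined query reports twenty and six.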

Instead, the process must be completed in two phases.
  • Phase 1:  Fetch facts from each fact table separately, aggregating them to a common level of detail
  • Phase 2:  Merge these intermediate result sets together based on their common dimensions
In practice, there are several ways that this task can be performed.

Method 1: Issue two queries, then merge the results

The first method for drilling across completes phase 1 on the database, and phase 2 in the application (or on the application server).
  1. Construct two separate queries: the sum of quantity shipped by product, and the sum of quantity returned by product.
  2. Take the two result sets as returned by the DBMS, and merge them based on the common products. Compute the ratio at this point.
While it may seem odd that phase 2 is not performed on the DBMS, note that if the result sets are already sorted, this step is trivial.
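Here is a minimal sketch of method 1 in Python with SQLite, with phase 2 reduced to a dictionary merge.  Table and column names are assumptions:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE shipment_facts (product TEXT, quantity_shipped INT)")
con.execute("CREATE TABLE return_facts   (product TEXT, quantity_returned INT)")
con.executemany("INSERT INTO shipment_facts VALUES (?,?)", [("P1", 10), ("P2", 5)])
con.executemany("INSERT INTO return_facts VALUES (?,?)",   [("P1", 2)])

# Phase 1: aggregate each star separately on the database.
shipped = dict(con.execute(
    "SELECT product, SUM(quantity_shipped) FROM shipment_facts GROUP BY product"))
returned = dict(con.execute(
    "SELECT product, SUM(quantity_returned) FROM return_facts GROUP BY product"))

# Phase 2: merge on the common dimension and compute the ratio
# in the application.  Products with no returns get a rate of zero.
return_rate = {p: returned.get(p, 0) / shipped[p] for p in sorted(shipped)}
print(return_rate)  # {'P1': 0.2, 'P2': 0.0}
```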

Method 2:  Build temp tables, then join them

The second method performs both phases on the DBMS, making use of temporary tables.
  1. Construct two SQL statements that create temporary tables: the sum of quantity shipped by product, and the sum of quantity returned by product
  2. When these are completed, issue a query that performs a full outer join of these tables on the product names and computes the ratio.
Be sure that the temporary tables are cleaned up.
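A sketch of method 2 on SQLite, with illustrative names.  (A full outer join is the ideal phase 2 operation; since SQLite only added FULL OUTER JOIN in version 3.39, it is emulated here with LEFT JOINs.)

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE shipment_facts (product TEXT, quantity_shipped INT)")
con.execute("CREATE TABLE return_facts   (product TEXT, quantity_returned INT)")
con.executemany("INSERT INTO shipment_facts VALUES (?,?)", [("P1", 10), ("P2", 5)])
con.executemany("INSERT INTO return_facts VALUES (?,?)",   [("P1", 2)])

# Phase 1: each aggregation lands in its own temporary table.
con.execute("""CREATE TEMP TABLE shp AS
               SELECT product, SUM(quantity_shipped) AS quantity_shipped
               FROM shipment_facts GROUP BY product""")
con.execute("""CREATE TEMP TABLE rtn AS
               SELECT product, SUM(quantity_returned) AS quantity_returned
               FROM return_facts GROUP BY product""")

# Phase 2: join the temp tables and compute the ratio on the DBMS.
rows = con.execute("""
    SELECT shp.product,
           COALESCE(rtn.quantity_returned, 0) * 1.0 / shp.quantity_shipped
    FROM shp LEFT JOIN rtn ON shp.product = rtn.product
    UNION
    SELECT rtn.product, NULL
    FROM rtn LEFT JOIN shp ON rtn.product = shp.product
    WHERE shp.product IS NULL
    ORDER BY 1
""").fetchall()

# Clean up the temporary tables when done.
con.execute("DROP TABLE shp")
con.execute("DROP TABLE rtn")
print(rows)  # [('P1', 0.2), ('P2', 0.0)]
```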

Method 3: Join subqueries

Like the previous method, this method performs all the work on the DBMS.  In this case, however, a single query does all the work.

Queries for each fact table are written, then joined together in the FROM clause of a master query.  For example:


SELECT
  COALESCE(shp.product, rtn.product) AS Product,
  rtn.quantity_returned / shp.quantity_shipped AS ReturnRate
FROM
  ( SELECT product, SUM(quantity_shipped) AS quantity_shipped
    FROM shipment_facts, product
    WHERE .....
    GROUP BY product
  ) shp
FULL OUTER JOIN
  ( SELECT product, SUM(quantity_returned) AS quantity_returned
    FROM return_facts, product
    WHERE....
    GROUP BY product
  ) rtn
ON
  shp.product = rtn.product

   
The two subqueries in the FROM clause represent phase 1.  Phase 2 is represented by the main SELECT query that joins them and computes the ratio.

Applying these techniques

These techniques may be applied in a variety of ways:
  1. Report developers may write their own queries using one or more of these methods
  2. You may have BI software that can automate drilling across using one or more of these methods
  3. Drilling across may be performed at ETL time using one of these methods (or an incremental variant)
In the last case, the ETL process builds a new star (or cube) that contains the result of drilling across. This is called a derived schema, or second-line data mart.

Learn More

Many pages are devoted to this topic in my books. In the latest one, Star Schema: The Complete Reference, the following chapters may be of interest:
  • Chapter 5, "Conformed Dimensions" discusses these techniques in greater detail.  
  • Chapter 14, "Derived Schemas" looks at special considerations when creating derived stars that pre-compute drill-across comparisons.
  • Chapter 16, "Design and Business Intelligence", discusses how to work with SQL-generating BI software.
More to come on this topic in the future.  If you have questions, send them in.

-Chris



Tuesday, November 15, 2011

Conformed Dimensions

This second post on conformed dimensions explores different ways in which dimensions can conform.

There are several flavors of conformed dimensions. Dimensions may be identical, or may share a subset of attributes that conform.

Conformance basics

Conformed dimensions are central to the discipline of dimensional modeling.  The basics of conformance were introduced in a post from earlier this year.  In a nutshell:


  • Measurements of discrete business processes are captured in individual star schemas (e.g. proposals and orders)
  • Some powerful business metrics combine information from multiple processes.  (e.g. fill rate: the ratio of orders to proposals)
  • We construct these metrics through a process called drilling across
  • Drilling across requires dimensions with the same structure and content (e.g. proposals and orders have a common customer dimension)

For a refresher on these concepts, see "Multiple stars and conformed dimensions" (8/15/2011). 

Physically shared dimensions not required 

When two fact tables share the same dimension table, their conformance is a given. Since the shared dimensions are the same table, we know they will support drilling across.

For example, stars for proposals and orders may share the customer dimension table.  This makes it possible to query proposals by customer and orders by customer, and then merge the results together.

But this process of drilling across does not require shared dimension tables. It works equally well if proposals and orders are in separate data marts in separate databases.

As long as the stars each include dimensions that share the same structure (e.g. a column called customer_name) and content (i.e. the customer values are the same), it will be possible to merge information from the stars. 

Levels of conformance 

It is easy to take this a step further.  We can also observe that there is compatibility between dimensions that are not identical.

If a subset of attributes from two dimensions shares the same structure and content, those attributes form a sort of “lowest common denominator” across which we can compare data from the stars.

For example, suppose we establish budgets at the monthly level, and track spending at the daily level.  Clearly, days roll up to months.  If designed correctly, it should be possible to compare data from budget and spending stars by month.

The picture below illustrates the conformance of a MONTH and DAY table graphically.  The ring highlights the shared attributes; any of these can be used as the basis for comparing facts in associated fact tables.

In this case, the two conformed dimensions participate in a natural hierarchy.  Months summarize days. The month table is referred to as a “conformed roll-up” of day.

To successfully drill across, the content of the shared attributes must also be the same.  Instances of month names, for example, must be identical in each table -- "January" and "January" conform; "January" and "JAN." do not.

To guarantee conformance of content, the source of the rollup should be the base dimension table. This also simplifies the ETL process, since it need not reach back to the source a second time. 
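A minimal sketch of deriving the roll-up from the base table, using SQLite via Python with illustrative column names:

```python
import sqlite3

# Derive the MONTH roll-up from the base DAY dimension, so the shared
# attributes are guaranteed to conform in both structure and content.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE day_dim (day_key INT, full_date TEXT, month_name TEXT, year INT)")
con.executemany("INSERT INTO day_dim VALUES (?,?,?,?)", [
    (1, "2012-01-01", "January",  2012),
    (2, "2012-01-02", "January",  2012),
    (3, "2012-02-01", "February", 2012),
])

# Sourcing the roll-up from day_dim (rather than reaching back to the
# source system) means "January" here is always the same "January" there.
con.execute("""
    CREATE TABLE month_dim AS
    SELECT DISTINCT month_name, year FROM day_dim
""")
months = con.execute(
    "SELECT month_name, year FROM month_dim ORDER BY month_name").fetchall()
print(months)  # [('February', 2012), ('January', 2012)]
```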

Other kinds of conformance 

Identical tables and conformed roll-ups are the most common kinds of conformed dimensions.  Other kinds are less common. 

Degenerate dimensions (dimensions that appear in a fact table) may also conform. This is particularly useful with transaction identifiers. 

Overlapping dimensions may share a subset of attributes, but not participate in a hierarchy. This is most common with geographical data. 

More Info 

I also write about conformed dimensions extensively in Star Schema: The Complete Reference.  
If you enjoy this blog, you can help support it by picking up a copy!

Photo by Agnes Periapse, licensed under Creative Commons 2.0