Friday, December 23, 2011

Three ways to drill across

Previous posts have explored the importance of conformed dimensions and drilling across.  This post looks at the process of drilling across in more detail. 

There are three primary methods for drilling across.  These methods can be leveraged manually, automated with BI software, or performed during the data integration process.

Querying multiple stars

Conformed dimensions ensure compatibility of information in various stars and data marts; drilling across is the process of bringing it together.  (For a refresher, see this post.)
 
For example, suppose we wish to compute "Return Rate" by product.  Return rate is the ratio of shipments to returns.

Information about shipments and returns is captured in two granular star schemas:
  • shipment_facts captures shipment metrics by Salesperson, Customer, Product, Proposal, Contract, Shipment, Shipper, and Shipment Date
  • return_facts captures return metrics by Salesperson, Customer, Product, Contract, Reason and Return Date
Reporting on return rate will require comparing facts from each of these stars. 

Drilling across

Recall that fetching facts from more than one fact table requires careful construction of queries. It is not appropriate to join two fact tables together, nor to link them via shared dimensions. Doing so will double-count facts, triple-count them, or worse.

Instead, the process must be completed in two phases.
  • Phase 1:  Fetch facts from each fact table separately, aggregating them to a common level of detail
  • Phase 2:  Merge these intermediate result sets together based on their common dimensions
In practice, there are several ways that this task can be performed.

Method 1: Issue two queries, then merge the results

The first method for drilling across completes phase 1 on the database, and phase 2 in the application (or on the application server).
  1. Construct two separate queries: the sum of quantity shipped by product, and the sum of quantity returned by product.
  2. Take the two result sets as returned by the DBMS, and merge them based on the common products. Compute the ratio at this point.
While it may seem odd that phase 2 not be performed on the DBMS, note that if the data sets are already sorted, this step is trivial.

Method 2:  Build temp tables, then join them

The second method performs both phases on the DBMS, making use of temporary tables.
  1. Construct two SQL statements that create temporary tables: the sum of quantity shipped by product, and the sum of quantity returned by product
  2. When these are completed, issue a query that performs a full outer join of these tables on the product names and computes the ratio.
Be sure that the temporary tables are cleaned up.

Method 3: Join subqueries

Like the previous method, this method performs all the work on the DBMS.  In this case, however, a single query does all the work.

Queries for each fact table are written, then joined together in the FROM clause of a master query.  For example:


SELECT
  COALESCE (shp.product, rtn.product) as Product,
  quantity_returned / quantity_shipped as ReturnRate
FROM
  ( SELECT product, sum(quantity_shipped)as quantity_shipped
    FROM shipment_facts, product
    WHERE .....
  ) shp
FULL OUTER JOIN
  ( SELECT product, sum(quantity_returned) as quantity_returned
    FROM return_facts, product
    WHERE....
  ) rtn
 ON
    shp.product = rtn.product

   
The two subqueries in the FROM clause represent phase 1.  Phase 2 is represented by the main SELECT query that joins them and computes the ratio.

Applying these techniques

These techniques may applied in a variety of ways:
  1. Report developers may write their own queries using one of more of these methods
  2. You may have BI software that can automate drilling across using one or more of these methods
  3. Drilling across may be performed at ETL time using one of these methods (or an incremental variant)
In the latter case, the ETL process builds a new star (or cube) that contains the result of drilling across. This is called a derived schema, or second line data mart.

Learn More

For more information, see the following resources:


Many pages are devoted to this topic in my books. In the latest one, Star Schema: The Complete Reference, the following chapters may be of interest:
  • Chapter 5, "Conformed Dimensions" discusses these techniques in greater detail.  
  • Chapter 14, "Derived Schemas" looks at special considerations when creating derived stars that pre-compute drill-across comparisons.
  • Chapter 16, "Design and Business Intelligence", discusses how to work with SQL-generating BI software.
More to come on this topic in the future.  If you have questions, send them in.

-Chris