Wednesday, May 26, 2010

Avoid Using Dates as Keys

This post explores the impact of using a date as the primary key for a dimension table. 

Q: I'd like to hear your thoughts on using dates as keys.  Is it really necessary to have a surrogate key for a date dimension?

Hartford, CT

A:  It is not absolutely necessary.  But you should use a surrogate key anyway.  Here's why.

Suppose you have a DAY dimension which contains one row for each calendar date.  In addition to the full date, it contains columns that represent the year, quarter, month, day of week, holiday status, and so forth.  It also contains columns that represent frequently used combinations of these various attributes.

Users (or developers) can use the columns to do things like filter queries, aggregate facts, sort results, and drive subtotaling.  Database administrators can index these columns to optimize these activities.

You might be tempted to use the full date as the primary key of the table.  This is urge is understandable -- why bother creating an additional column called DAY_KEY.  Seems like extra work for nothing, right?

Consistency and Performance Issues

Remember that the primary key of a dimension table will also appear in fact tables.  If a date is used, users or developers will eventually try using it as a "shortcut" in queries, avoiding a join to the DAY table.

This leads to inconsistencies.  Developers may mean well, thinking they are "saving a join."  But they may apply different rules to compute things like fiscal periods.

Use of a day table guarantees consistent computation of all the characteristics of each day.  All the columns are computed exactly once, during the ETL process.  Everyone who uses them gets consistent results.

Moreover, avoiding the DAY table may generate performance issues.  If a report developer wants to filter for a particular day of the week, they may apply SQL date comparison functions to the day column in the fact table.  The database will probably have to make this comparison for each row in the fact table.

A day of the week in the DAY table is far more efficient, because it is pre-computed and indexed.

You may think you can prevent these mishaps.  But you cannot.  Sooner or later, they will occur. Dates in the fact table will beckon.  A user, developer, or consultant will try to take a shortcut.  Perhaps under your watch, or perhaps after you have gone.  

Guarantee consistency and performance by using a surrogate key.

If you are curious, read on for some other, less important reasons.

Wednesday, May 19, 2010

Kimball's Approach is Top-Down

Ralph Kimball's approach to data warehousing is frequently mis-characterized as being "bottom-up." This post aims to clear up that misconception.

Bus Architecture

Kimball's bus architecture (or dimensional data warehouse architecture) is an enterprise architecture.  At its core, a set of conformed dimensions ensure a consistent representation of standard terms and data elements across multiple subject areas.  The conformed dimensions describe important things like products, customers, locations, or anything of significance to the business.


The subject areas  are called data marts.  They represent things like manufacturing, sales, invoicing, receivables and so forth.  Data marts don't need to be implemented all at once.  They can be implemented one at a time, as part of an incremental program. Data marts also don't need to be stored in a single database (although they may.)  When they are stored in different databases, the conformance bus ensures consistency and compatibility.

Top-Down

Kimball advocates planning a set of conformed dimensions as an up-front (i.e. strategic) activity.  The conformance bus then serves as the blueprint for a set of integrated data marts, which can be built on whatever schedule makes the most sense.

Kimball and Ross put it this way:
During the limited-duration architecture phase, the team designs a master suite of standardized dimensions and facts that have uniform interpretation across the enterprise...We then tackle the implementation of separate data marts in which each iteration closely adheres to the architecture.

- From The Data Warehouse Toolkit, Second Edition
by Ralph Kimball and Margy Ross (Wiley, 2002)
Because it begins with an enterprise-level framework, then delivers departmental functionality, this is a top-down approach. 

Bottom-Up

A bottom-up approach is one that moves in the opposite direction, beginning with a departmental focus and later evolving into one that has an enterprise focus. This occurs when organizations build stand-alone data marts, then later decide to integrate them.  

Stand-alone data marts are designed and built for departmental use, without an enterprise context. They are cheaper in the short-run, offering a fast path to quick results.  Stand-alone data marts also arrive due to mergers and acquisitions, or through packaged software. 

When there is more than one stand-alone data mart, however, they are likely to exhibit incompatibilities and inconsistencies.  They are sometimes labeled "stovepipes." Faced with these inconsistent data marts, some organizations resolve to retrofit them into a conformance framework.  This can be a difficult and expensive process, requiring extensive rework.

When stand-alone data marts are successfully brought into conformance, a bottom-up path has been followed--one that starts with a departmental solution and moves to enterprise capability.   Bottom-up development is cheaper in the short term but more expensive in the long term. 

While the end result may be compatible with Kimball's vision, clearly the route is not.  If this is news to you, you might want to check out his book. (The link appears beneath the quotation above.)  You can also consult posts on data warehouse architectures and common misconceptions.

-- Chris

Image:  PCI Slot by  Ryan_Franklin_az
Licensed under Creative Commons 2.0

Tuesday, April 27, 2010

Basics: What is a Dimensional Model

In a recent post, I asserted that a star schema is the implementation of a dimensional model in a relational database.  Many readers may have missed this point, because it was embedded in a conversation about the term normalization

Here, then, is a clutter-free discussion the dimensional model.

Activities and Conditions

A dimensional model is a structured framework for measurement.   

Usually, a dimensional model describes a process or activity.  For example, a retailer might design a dimensional model of sales transactions.

A dimensional model may also describe conditions, as measured at predefined intervals. A water utility might design a dimensional model of reservoir levels, measured daily.

Facts and Dimensions

A dimensional model describes measurement of a process (or conditions) through facts and dimensions.

Facts are the measurements.  For the retailer's sales activities, facts that measure each sales transaction include the quantity sold and the price paid.  For the utility's reservoir conditions, a single fact is measured: gallons on hand.

Facts are not useful without context.  For example, "one hundred units sold," is a measurement of quantity sold, but it has no context.  One hundred of what?  When were they sold?  Where?

Dimensions are used to describe the context of facts.  For the retailer, dimension values provide context for each measurement.  These include the date and time of each purchase, the product sold, and the store in which it was purchased. 

For the utility's model of reservoir status, dimensions include the date the measurement was taken, the reservoir or facility where the measurement was taken, and the inspector who measured it.

If you can define a process, facts and dimensions, you have the core of a dimensional model.  Of course, there is a bit more to it than that (the concept of grain is also essential), but that's the basic idea.

Uses

The dimensional model is most famous for its role as the basis for a star schema.  Before we get to that, though, its important to recognize that it has other uses.

First, a dimensional model is an excellent way to describe requirements for an analytic system.  A model of process measurement is far more efficient and flexible than a list of of specific business questions.  A single dimensional model may be able to answer thousands of questions, including some that have not yet been thought of.

A dimensional model is also an excellent tool for planning your data warehouse strategy, and for managing the scope of implementation projects.  This is particularly true when the model is translated to a database design and linked to sources of data.  (I've written about this before.)
 
Stars, Snowflakes, Cubes

A dimensional model can serve as the basis for a database design.  This, of course, is what it is famous for.  (If you can even say it is famous, that is.)
  • When a dimensional model is implemented in a relational database, it is called a star schema (or sometimes a snowflake schema.)  
  • When implemented in a multidimensional database, it is called a cube.  (I touched on this previously, as well.)
If you work with any of these things, you are working with a dimensional model.

- Chris

Image: Slide together Polyhedra by fdecomite
Licensed under Creative Commons 2.0