Tuesday, April 10, 2007

Three Myths About Star Schema

Think Bill Inmon is anti-star schema? Think again. This post debunks three common myths about star schema design.

As I wrote in a previous post, there are three styles of data warehousing that incorporate star schema design: Kimball’s Dimensional Data Warehouse, Inmon’s Corporate Information Factory, and what I call the “stand-alone” data mart.

These architectures make use of the star schema in different ways. And for proponents of each approach, terms like “enterprise data warehouse” and “data mart” have very different connotations. This confusing situation has given rise to some misconceptions about the star schema.

In this post, I will set the record straight on three common myths.

Myth #1: Bill Inmon is anti-Star Schema

Perhaps the most common misconception about the star schema is that Bill Inmon is opposed to it. In fact, Inmon advocates the use of the star schema to build data marts as part of his Corporate Information Factory architecture. Where he does not want dimensional design used is for the granular repository of enterprise data from which data marts are drawn.

So where did this myth come from? In part, it stems from a common oversimplification of data warehousing theory, which distills a large set of technical issues into the catch-phrase “Kimball vs. Inmon.” Because the star schema has become so closely associated with Kimball, "Kimball vs. Inmon" is often misconstrued as a debate over star schema.

Lest you have any doubts that Inmon advocates the use of star schema, all you need to do is peruse some of his work. Or, search out the first edition of Kimball’s seminal work, The Data Warehouse Toolkit. The forward to the first edition was penned by none other than…Bill Inmon.

Myth #2: The Star Schema is for Aggregated Data

I frequently encounter people who believe that the star schema is used only for aggregated data. While it can be used for aggregated data, it does not have to be used that way. In fact, in a Kimball-style Dimensional Data Warehouse, one guiding principle dictates that fact tables should store the most granular data possible. That is quite the opposite of aggregated data.

So why the confusion? In other architectures, this principle may not be followed. A star schema data mart in an Inmon-style data warehouse may aggregate information that is extracted from a repository of atomic data. A stand-alone data mart may aggregate data drawn from source systems. And even in a Kimball-style architecture, star schemas may be supplemented by aggregate schemas designed to improve performance. (In fact, I’ve written an entire book on this topic.)

But none of these uses of the star preclude its use for the storage of granular, atomic data. Don't fall into the trap of assuming a particular application of the technique fully describes its capabilities or limitations.

Myth #3: Star Schema Data Marts Are Stovepipes

I hear this one in many flavors: using a star schema design leads to “stovepipes” or to “silos” or to “islands of information.” Another variation holds that star schema is for "bottom-up" development. The notion is that if you have multiple star schema data marts, they will not work together.

Whatever words are used, the characterization is false. This myth has probably arisen from a particular application of star schema design: the stand-alone data mart. When a data mart is built in the absence of an enterprise context--a standalone data mart--it serves an immediate purpose, but may later found to be incompatible with analytic requirements in related subject areas.

This sort of "failure" is often observed in stand alone data marts. But it hardly makes sense to blame such failures on the star schema design. You can build a car without a steering wheel, but that does not mean that cars can only drive in straight lines.

In fact, star schema can be very effectively used to avoid stovepipes. The principle of dimensional conformance can be applied, whether designing stars as part of a Dimensional Data Warehouse (Kimball) or a Corporate Information Factory (Inmon). This design technique ensures that common dimensions are used consistently across subject-areas, allowing data from multiple stars to be easily combined into a single report. In technical-speak, it allows applications to “drill-across” schemas or subject areas.

The result is a set of data marts that serve departmental needs, but can also be combined to answer questions that span subject areas. They may be built incrementally, but they work together. Even builders of stand-alone data marts can take measures to ensure that the solution will conform with future data marts. (I'll tackle that topic in a future post.)

So remember: Inmon is not anti-star schema. The star schema is not limited to aggregate data. Star schema data marts do not have to be stovepipes.

Copyright (c) 2007 Chris Adamson