Tuesday, April 10, 2007

Three Myths About Star Schema

Think Bill Inmon is anti-star schema? Think again. This post debunks three common myths about star schema design.

As I wrote in a previous post, there are three styles of data warehousing that incorporate star schema design: Kimball’s Dimensional Data Warehouse, Inmon’s Corporate Information Factory, and what I call the “stand-alone” data mart.

These architectures make use of the star schema in different ways. And for proponents of each approach, terms like “enterprise data warehouse” and “data mart” have very different connotations. This confusing situation has given rise to some misconceptions about the star schema.

In this post, I will set the record straight on three common myths.

Myth #1: Bill Inmon is anti-Star Schema

Perhaps the most common misconception about the star schema is that Bill Inmon is opposed to it. In fact, Inmon advocates the use of the star schema to build data marts as part of his Corporate Information Factory architecture. Where he does not want dimensional design used is for the granular repository of enterprise data from which data marts are drawn.

So where did this myth come from? In part, it stems from a common oversimplification of data warehousing theory, which distills a large set of technical issues into the catch-phrase “Kimball vs. Inmon.” Because the star schema has become so closely associated with Kimball, "Kimball vs. Inmon" is often misconstrued as a debate over star schema.

Lest you have any doubts that Inmon advocates the use of the star schema, all you need to do is peruse some of his work. Or, search out the first edition of Kimball’s seminal work, The Data Warehouse Toolkit. The foreword to the first edition was penned by none other than…Bill Inmon.

Myth #2: The Star Schema is for Aggregated Data

I frequently encounter people who believe that the star schema is used only for aggregated data. While it can be used for aggregated data, it does not have to be used that way. In fact, in a Kimball-style Dimensional Data Warehouse, one guiding principle dictates that fact tables should store the most granular data possible. That is quite the opposite of aggregated data.

So why the confusion? In other architectures, this principle may not be followed. A star schema data mart in an Inmon-style data warehouse may aggregate information that is extracted from a repository of atomic data. A stand-alone data mart may aggregate data drawn from source systems. And even in a Kimball-style architecture, star schemas may be supplemented by aggregate schemas designed to improve performance. (In fact, I’ve written an entire book on this topic.)

But none of these uses of the star preclude its use for the storage of granular, atomic data. Don't fall into the trap of assuming a particular application of the technique fully describes its capabilities or limitations.
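
To picture the difference, here is a rough sketch. The table and column names are my own, invented purely for illustration; they do not come from any of the architectures above. The base star stores one row per order line, while a separate aggregate table, derived from it purely for performance, summarizes to the monthly level:

    -- Granular base star: one row per order line. All names are hypothetical.
    CREATE TABLE order_facts (
        day_key       INTEGER NOT NULL,
        product_key   INTEGER NOT NULL,
        customer_key  INTEGER NOT NULL,
        order_line_id VARCHAR(20),       -- degenerate dimension
        quantity      INTEGER,
        order_amount  DECIMAL(12,2)
    );

    -- Optional aggregate, derived from the base star purely for performance.
    -- Assumes a standard date dimension called day_dim with a calendar_month
    -- column. (CREATE TABLE ... AS syntax varies by platform.)
    CREATE TABLE order_facts_month_agg AS
    SELECT d.calendar_month,
           f.product_key,
           SUM(f.quantity)     AS quantity,
           SUM(f.order_amount) AS order_amount
      FROM order_facts f
      JOIN day_dim d ON d.day_key = f.day_key
     GROUP BY d.calendar_month, f.product_key;

The aggregate supplements the base star; it does not replace it.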

Myth #3: Star Schema Data Marts Are Stovepipes

I hear this one in many flavors: using a star schema design leads to “stovepipes” or to “silos” or to “islands of information.” Another variation holds that star schema is for "bottom-up" development. The notion is that if you have multiple star schema data marts, they will not work together.

Whatever words are used, the characterization is false. This myth has probably arisen from a particular application of star schema design: the stand-alone data mart. When a data mart is built in the absence of an enterprise context, as a stand-alone data mart, it serves an immediate purpose, but may later be found to be incompatible with analytic requirements in related subject areas.

This sort of "failure" is often observed in stand-alone data marts. But it hardly makes sense to blame such failures on the star schema design. You can build a car without a steering wheel, but that does not mean that cars can only drive in straight lines.

In fact, the star schema can be used very effectively to avoid stovepipes. The principle of dimensional conformance can be applied whether you are designing stars as part of a Dimensional Data Warehouse (Kimball) or a Corporate Information Factory (Inmon). This design technique ensures that common dimensions are used consistently across subject areas, allowing data from multiple stars to be easily combined into a single report. In technical-speak, it allows applications to “drill across” schemas or subject areas.
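
To make drilling across concrete, here is a minimal sketch. All of the table and column names (sales_facts, returns_facts, day_dim, product_dim, and so on) are hypothetical; the pattern, not the names, is what matters. Each star is summarized separately to the same dimensional grain, and the two result sets are then merged on the conformed attributes:

    -- Drill-across sketch: summarize each star to a common grain, then merge.
    -- All table and column names here are hypothetical.
    SELECT s.calendar_month, s.product_name, s.total_sales, r.total_returns
      FROM ( SELECT d.calendar_month, p.product_name,
                    SUM(f.sale_amount) AS total_sales
               FROM sales_facts f
               JOIN day_dim d     ON d.day_key = f.day_key
               JOIN product_dim p ON p.product_key = f.product_key
              GROUP BY d.calendar_month, p.product_name ) s
      JOIN ( SELECT d.calendar_month, p.product_name,
                    SUM(f.return_amount) AS total_returns
               FROM returns_facts f
               JOIN day_dim d     ON d.day_key = f.day_key
               JOIN product_dim p ON p.product_key = f.product_key
              GROUP BY d.calendar_month, p.product_name ) r
        ON  r.calendar_month = s.calendar_month
        AND r.product_name   = s.product_name;

In practice the merge is often a full outer join, or is handled by a reporting tool, so that a month with returns but no sales still appears. The key point is that the two fact tables are never joined to each other directly; the conformed dimensions provide the common row headers.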

The result is a set of data marts that serve departmental needs, but can also be combined to answer questions that span subject areas. They may be built incrementally, but they work together. Even builders of stand-alone data marts can take measures to ensure that the solution will conform with future data marts. (I'll tackle that topic in a future post.)

So remember: Inmon is not anti-star schema. The star schema is not limited to aggregate data. Star schema data marts do not have to be stovepipes.

Copyright (c) 2007 Chris Adamson

Monday, March 26, 2007

Three Data Warehouse Architectures that Use Star Schema

Whether you follow Kimball or Inmon, or you just build stand-alone data marts, this blog offers practical advice for your star schemas.

Most data warehouses have a data architecture that fits into one of three categories. These styles of data warehousing often use the same terms to mean very different things. The term “data mart,” for example, has very different connotations for proponents of each of the different architectures. As a result, there is a lot of misinformation about the relative merits of each approach.

But dig a little deeper, and it turns out all three architectures share something in common. Each architecture has a place for the star schema – a set of tables designed according to the principles of dimensional modeling.
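
For readers new to the technique, here is a minimal sketch of what such a set of tables looks like. The subject area and every name in it are hypothetical, chosen only to show the shape of a star:

    -- A minimal star: one fact table surrounded by dimension tables.
    -- All names are hypothetical.
    CREATE TABLE day_dim (
        day_key        INTEGER PRIMARY KEY,   -- surrogate key
        full_date      DATE,
        calendar_month VARCHAR(20),
        calendar_year  INTEGER
    );

    CREATE TABLE product_dim (
        product_key    INTEGER PRIMARY KEY,   -- surrogate key
        product_name   VARCHAR(100),
        category       VARCHAR(50)
    );

    CREATE TABLE sales_facts (
        day_key        INTEGER NOT NULL REFERENCES day_dim (day_key),
        product_key    INTEGER NOT NULL REFERENCES product_dim (product_key),
        quantity_sold  INTEGER,
        sale_amount    DECIMAL(12,2)          -- additive fact
    );

A real design would have more dimensions and more facts, but the pattern is the same: a central fact table of measurements, surrounded by wide dimension tables that give those measurements context.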

Three Data Architectures

At the risk of fanning the flames of a debate that is often characterized as “religious,” here is a very brief summary of each of the three styles of data warehouse design.

Corporate Information Factory This approach is advocated by Bill Inmon, Claudia Imhoff, and many others. It involves a central repository of detailed, or “atomic,” data, organized in third normal form. Separate data marts reorganize portions of this data for departmental purposes. These data marts may be built using star schema design.

Dimensional Data Warehouse This approach is advocated by Ralph Kimball, Margy Ross, and many others. Atomic data from across the enterprise is stored in star schema format. The star schemas are designed with a set of “conformed dimensions” that allow them to work together. Data marts are not separate entities, but subject areas within this set of star schemas.

Stand-Alone Data Marts Although this approach has no major proponents, it is very popular. Stand-alone data marts focus on a particular business process or department, and are not designed in an enterprise context. They may be built using star schema designs, or may follow other approaches.


This information is summarized in the table below, which also includes some alternate terms used to describe each approach.

Keep in mind that this is a very high level summary, and it is here only to illustrate how the star schema factors in. I encourage readers to seek out more information on each of these approaches.

Something in Common: The Star Schema

Despite the furor over which of these architectures makes the most sense, it should be obvious that they have something in common: each has a place for the star schema. And in all cases, the principles of star schema design are the same.

Regardless of the role of the star schema in your architecture, it must be designed carefully. A poor design will lead to issues, whether you are building a corporate information factory or a dimensional data warehouse. A good design, coupled with proper expectations for your data warehouse architecture, will yield success.

Pragmatic, Not Dogmatic

In my first book on data warehouse design, there is a chapter on data warehouse architecture which advocates the dimensional data warehouse approach. But I have designed star schemas that are used successfully in all three architectures, and the principles of dimensional modeling are the same.

Sometimes people are surprised that I am willing to be involved in projects that follow all three approaches, or that I don’t automatically move to throw away imperfect solutions. But the investment of resources in a data warehouse is a business decision. Design principles may factor in, but so do a variety of other considerations, many of which I am not qualified to weigh in on.

Instead, I try to be pragmatic. Yes, I have my preferred approach. But what makes the most sense for the business? Is there already a significant investment in a different architecture? Is there simply not funding for the ideal solution?

The information and resources in this blog are meant to help you, regardless of your architecture. If you use the star schema for anything, you will find help here. From time to time, I will look at concerns specific to one style or another, but this will always be clearly spelled out. Most of the time, we will look at principles that have applicability across the board.


Copyright © 2007 Chris Adamson

Tuesday, March 6, 2007

The Accumulating Snapshot

Although my latest book is about aggregate tables, it covers design basics as well as advanced topics. This excerpt describes how to use an accumulating snapshot design to analyze business processes involving multiple steps.

Excerpted from Mastering Data Warehouse Aggregates, by Chris Adamson.
Used by permission.

Many business processes can be understood in terms of a series of stages or steps that something must undergo. In made-to-order manufacturing, an individual item is ordered, manufactured, quality assured, packaged, and shipped. In banking, a mortgage application is submitted, reviewed by a loan officer, evaluated by an underwriter, and eventually consummated at a closing.

When the individual items that pass through the process are readily identifiable (an individual item being manufactured; a specific mortgage application), and the processing steps or milestones are predictable (the manufacturing process stages, the mortgage approval phases), the status of each item can be tracked by an accumulating snapshot model. Such a model provides a way to understand the status of an individual item at the detail level, the workload at each processing step, and the efficiency with which items move from one stage to the next.

Consider the process by which a mortgage application moves from initial submission to closing. The process begins when an applicant completes a mortgage application with a mortgage officer. A processor then takes over, ensuring that the documentation required to evaluate the loan application is gathered. Next, an underwriter reviews the application and supporting documentation, evaluates potential risk, and approves or rejects the application. The amount approved by the underwriter may be less than the amount of the original application. Finally, the mortgage becomes legally binding at the time of closing.

Figure 1, below, shows an accumulating snapshot for the mortgage application process. The grain of this fact table is the application: each application is represented by a single row in the fact table. The major milestones are represented by multiple foreign key references to the Day dimension: the date of submission, the date approved by the mortgage officer, the date all supporting documentation was complete, the date approved by an underwriter, and the date of closing.



Figure 1: An accumulating snapshot tracking mortgage applications.
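
The figure is easier to discuss with a concrete table definition in hand, so here is one possible SQL rendering of the fact table it depicts. The dollar amount columns and application_submission_date_key are named later in this excerpt; every other column name is my own guess, not necessarily the one used in the book:

    -- A sketch of the accumulating snapshot fact table from Figure 1.
    -- Column names not given in the text are illustrative assumptions.
    CREATE TABLE mortgage_application_facts (
        -- five roles of the Day dimension, one per milestone
        application_submission_date_key  INTEGER NOT NULL,
        officer_approval_date_key        INTEGER NOT NULL,
        documentation_complete_date_key  INTEGER NOT NULL,
        underwriting_approval_date_key   INTEGER NOT NULL,
        closing_date_key                 INTEGER NOT NULL,
        -- three roles of the Employee dimension
        mortgage_officer_key             INTEGER NOT NULL,
        processor_key                    INTEGER NOT NULL,
        underwriter_key                  INTEGER NOT NULL,
        -- the application itself; one row per application
        application_key                  INTEGER NOT NULL PRIMARY KEY,
        -- dollar amount facts
        application_amount               DECIMAL(12,2),
        officer_approved_amount          DECIMAL(12,2),
        underwritten_amount              DECIMAL(12,2),
        closing_amount                   DECIMAL(12,2),
        -- representative lag and milestone-count columns (discussed later in
        -- the excerpt); the full design would have one of each per stage
        underwriting_lag                 INTEGER,
        submission_count                 SMALLINT DEFAULT 0,
        underwriting_count               SMALLINT DEFAULT 0,
        closing_count                    SMALLINT DEFAULT 0
    );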

Note that it is not necessary to build five Day dimension tables. A single Day table can serve in all these roles; it will simply be aliased at query time or through a series of views. Similarly, the three employees involved in the process are represented by three foreign key references to the Employee dimension table: one for the mortgage officer, one for the processor who assembles the supporting materials, and one for the underwriter.
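
Here is what that aliasing looks like in SQL, using the hypothetical fact table sketched above (day_dim, its columns, and closing_date_key are likewise assumed names):

    -- Role-playing through a view over the single Day dimension table...
    CREATE VIEW closing_day AS
    SELECT day_key        AS closing_day_key,
           full_date      AS closing_date,
           calendar_month AS closing_month,
           calendar_year  AS closing_year
      FROM day_dim;

    -- ...or through aliases assigned at query time, one alias per role.
    SELECT sub.full_date AS submitted_on,
           clo.full_date AS closed_on
      FROM mortgage_application_facts f
      JOIN day_dim sub ON sub.day_key = f.application_submission_date_key
      JOIN day_dim clo ON clo.day_key = f.closing_date_key;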

The facts include several dollar amounts. The amount of the original application is the application_amount. The amounts approved by the mortgage officer and by the underwriter are recorded as officer_approved_amount and underwritten_amount, respectively. The amount actually loaned at closing is the closing_amount.

When the application is first submitted, the only date that is known is the application submission date. The appropriate day_key value is assigned to the application_submission_date_key. In order to accommodate the dates that are as yet unknown, the Day dimension must contain a row for "unknown" or "n/a." The key value for this row will be assigned to the remaining dates. The application_amount is set to the requested mortgage amount from the application. The remaining dollar amounts are initialized to zero.
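
Continuing the sketch, the row created at submission time might look like this. The key values are made up, and I am assuming that key 0 identifies the "unknown" row in both the Day and Employee dimensions:

    -- Row created when the application is first submitted.
    -- Key values are invented; 0 is assumed to identify the "unknown" row.
    INSERT INTO mortgage_application_facts (
        application_submission_date_key,
        officer_approval_date_key,
        documentation_complete_date_key,
        underwriting_approval_date_key,
        closing_date_key,
        mortgage_officer_key, processor_key, underwriter_key,
        application_key,
        application_amount, officer_approved_amount,
        underwritten_amount, closing_amount,
        submission_count )
    VALUES (
        4721,              -- day_key of the actual submission date
        0, 0, 0, 0,        -- remaining milestone dates point to the "unknown" row
        38, 0, 0,          -- mortgage officer is known; processor and underwriter are not yet
        90210,             -- surrogate key of the application
        250000.00,         -- requested amount from the application
        0, 0, 0,           -- remaining dollar amounts initialized to zero
        1 );               -- the submission milestone has been reached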

As the application works its way through the process, the fact table row will be updated. The various date keys are revised, and the additional facts filled in, as each milestone is reached.
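
For example, when the underwriter approves the application, the update might look like this (again with made-up key values and the assumed column names from the earlier sketch):

    -- Update applied when the underwriting milestone is reached.
    UPDATE mortgage_application_facts
       SET underwriting_approval_date_key = 4862,      -- day_key of the approval date
           underwriter_key                = 215,       -- the underwriter who handled it
           underwritten_amount            = 235000.00, -- amount approved by the underwriter
           underwriting_count             = 1          -- milestone-count column, discussed below
     WHERE application_key = 90210;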

The fact table also contains a series of lag columns. These represent the number of days the application spends at each stage. Each is a simple calculation based on milestone dates, populated once the application completes the corresponding stage. They allow measurement of processing time.
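
Continuing the example, a lag might be populated by the same update process and then analyzed with a simple query. The lag and count column names, and day_dim, are assumptions carried over from the earlier sketches:

    -- The lag value is computed by the load process from two milestone dates.
    UPDATE mortgage_application_facts
       SET underwriting_lag = 9       -- days from documentation complete to underwriting approval
     WHERE application_key = 90210;

    -- Average time spent in underwriting, by month of underwriting approval.
    SELECT d.calendar_month,
           AVG(f.underwriting_lag) AS avg_days_in_underwriting
      FROM mortgage_application_facts f
      JOIN day_dim d ON d.day_key = f.underwriting_approval_date_key
     WHERE f.underwriting_count = 1   -- only applications that have completed underwriting
     GROUP BY d.calendar_month;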

The fact table also contains a series of quantities, each corresponding to one of the milestones. The value of each count is zero or one, depending on whether the application has reached the corresponding milestone. These facilitate workload reporting; when aggregated, they count the applications that have reached each milestone. To make it easier to identify the current status of an application, a status attribute may be added to the application dimension or as a degenerate dimension in the fact table.
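
With the count columns in place, workload reporting is largely a matter of summing them. The following sketch, again with assumed column names, shows, for each month of submission, how many of those applications have gone on to reach underwriting and closing:

    -- Workload and pipeline reporting by month of submission.
    SELECT d.calendar_month          AS month_submitted,
           SUM(f.submission_count)   AS applications_submitted,
           SUM(f.underwriting_count) AS reached_underwriting,
           SUM(f.closing_count)      AS reached_closing
      FROM mortgage_application_facts f
      JOIN day_dim d ON d.day_key = f.application_submission_date_key
     GROUP BY d.calendar_month;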

Using this schema, it is possible to study the lifecycle of a particular application, the workload at each processing step, and the efficiency of the overall process itself.


Excerpted from Mastering Data Warehouse Aggregates, by Chris Adamson.
Copyright (c) 2006 by Wiley Publishing, Inc
Used by permission.