Monday, March 26, 2007

Three Data Warehouse Architectures that Use Star Schema

Whether you follow Kimball or Inmon, or you just build stand-alone data marts, this Blog offers practical advice for your star schemas.

Most data warehouses have a data architecture that fits into one of three categories. These styles of data warehousing often use the same terms to mean very different things. The term “data mart,” for example, has very different connotations for proponents of each of the different architectures. As a result, there is a lot of misinformation about the relative merits of each approach.

But dig a little deeper, and it turns out all three architectures share something in common. Each architecture has a place for the star schema – a set of tables designed according to the principles of dimensional modeling.

Three Data Architectures

At the risk of fanning the flames of a debate that is often characterized as “religious,” here is a very brief summary of each of the three styles of data warehouse design.

Corporate Information Factory This approach is advocated by Bill Inmon, Claudia Imhoff, and many others. It involves a central repository of detailed or “atomic,” data, organized in third normal form. Separate departmental data marts reorganize portions of this data for departmental purposes. These data marts may be built using star schema design.

Dimensional Data Warehouse This approach is advocated by Ralph Kimball, Margy Ross and many others. Atomic data from across the enterprise is stored in star-schema format. The star schemas are designed with a set of “conformed dimensions” allowing them to work together. Data marts are not separate entities, but subject areas within this set of star schemas.

Stand-Alone Data Marts Although there are no major proponents of this approach, it is very popular. Stand-alone data marts focus on a particular business process or department, and are not designed in an enterprise context. Stand alone data marts may be built using star schema designs, or follow other approaches.

This information is summarized in the table below, which also includes some alternate terms used to describe each approach.

Keep in mind that this is a very high level summary, and it is here only to illustrate how the star schema factors in. I encourage readers to seek out more information on each of these approaches.

Something in Common: The Star Schema

Despite the furor over which of these architectures makes the most sense, it should be obvious that they have something in common: each has a place for the star schema. And in all cases, the principles of star schema design are the same.

Regardless of the role of the star schema in your architecture, it must be designed carefully. A poor design will lead to issues, whether you are building a corporate information factory or a dimensional data warehouse. A good design, coupled with proper expectations for your data warehouse architecture, will yield success.

Pragmatic, Not Dogmatic

In my first book on data warehouse design, there is a chapter on data warehouse architecture which advocates the dimensional data warehouse approach. But I have designed star schemas that are used successfully in all three architectures, and the principles of dimensional modeling are the same.

Sometimes people are surprised that I am willing to be involved in projects that follow all three approaches, or that I don’t automatically move to throw away imperfect solutions. But the investment of resources in a data warehouse is a business decision. Design principles may factor in, but so do a variety of other factors – factors over which I am not qualified to weigh in.

Instead, I try to be pragmatic. Yes, I have my preferred approach. But what makes the most sense for the business? Is there already a significant investment in a different architecture? Is there simply not funding for the ideal solution?

The information and resources in this blog are meant to help you, regardless of your architecture. If you use the star schema for anything, you will find help here. From time to time, I will look at concerns specific to one style or another, but this will always be clearly spelled out. Most of the time, we will look at principles that have applicability across the board.

Copyright © 2007 Chris Adamson

Tuesday, March 6, 2007

The Accumulating Snapshot

Although my latest book is about aggregate tables, it covers design basics as well as advanced topics. This excerpt describes how to use an accumulating snapshot design to analyze business processes involving multiple steps.

Excerpted from Mastering Data Warehouse Aggregates, by Chris Adamson.
Used by permission.

Many business processes can be understood in terms of a series of stages or steps that something must undergo. In made-to-order manufacturing, an individual item is ordered, manufactured, quality assured, packaged, and shipped. In banking, a mortgage application is submitted, reviewed by a loan officer, evaluated by an underwriter, and eventually consummated at a closing.

When the individual items that pass through the process are readily identifiable (an individual item being manufactured; a specific mortgage application), and the processing steps or milestones are predicable (the manufacturing process stages, the mortgage approval phases), its status can be tracked by an accumulating snapshot model. Such a model provides a way to understand the status of an individual item at the detail level, the workload at each processing step, and the efficiency with which items move from one stage to the next.

Consider the process by which a mortgage application moves from initial submission to closing. The process begins when an applicant completes a mortgage application with a mortgage officer. A processor then takes over, ensuring that documentation required to evaluate the loan application is gathered. Next, the application is reviewed by an underwriter, who reviews the application and supporting documentation, evaluates potential risk, and approves or rejects the application. The amount approved for underwriting may be less than the amount of the original application. Finally, the mortgage becomes legally binding at the time of closing.

Figure 1, below, shows an accumulating snapshot for the mortgage application process. The grain of this fact table is an application. Each application will be represented by a single row in the fact table. The major milestones are represented by multiple foreign key references to the Day dimension—the date of submission, the date approved by mortgage officer, the date all supporting documentation was complete, the date approved by an underwriter, and the date of closing.

Figure 1: An accumulating snapshot tracking mortgage applications.

Note that it is not necessary to build five Day dimension tables. A single Day table can serve in all these roles; it will simply be aliased at query time or through a series of views. Similarly, the three employees involved in the process are represented by three foreign key references to the Employee dimension table—one for the mortgage officer, one for the processor that assembles the supporting materials, and one for the underwriter.

The facts include several dollar amounts. The amount of the original application is the application_amount. The amounts approved and rejected by the mortgage officer and underwriter are recorded as officer_approved_amount and underwritten_amount. The amount actually loaned at closing is the closing_amount.

When the application is first submitted, the only date that is known is the application submission date. The appropriate day_key value is assigned to the application_submission_date_key. In order to accommodate the dates that are as yet unknown, the Day dimension must contain a row for "unknown" or "n/a." The key value for this row will be assigned to the remaining dates. The application_amount is set to the requested mortgage amount from the application. The remaining dollar amounts are initialized to zero.

As the application works its way through the process, the fact table row will be updated. The various date keys are revised as each milestone is reached. The additional facts are filled in as the appropriate milestones are reached.

The fact table also contains a series of lag columns. These represent the number of days the application spends at each stage. Each is a simple calculation based on milestone dates, populated once it completes the appropriate stage. They allow measurement of processing time.

The fact table also contains a series of quantities, each corresponding to one of the milestones. The value of each count is zero or one, depending on whether the application has reached the corresponding milestone. These facilitate workload reporting, and will take on other values when aggregated. To make it easier to identify the current status of an application, a status attribute may be added to the application dimension or as a degenerate dimension in the fact table.

Using this schema, it is possible to study the lifecycle of a particular application, the workload at each processing step, and the efficiency of the overall process itself.

Excerpted from Mastering Data Warehouse Aggregates, by Chris Adamson.
Copyright (c) 2006 by Wiley Publishing, Inc
Used by permission.