Thursday, October 2, 2008

Coming to TDWI New Orleans? Vote First!

If you'll be joining me at the TDWI conference in New Orleans during election week, make sure you vote ahead of time. Here's how.

This election day, I'll be teaching Advanced Dimensional Modeling at The Data Warehousing Institute (TDWI) World Conference in New Orleans. I was a bit reluctant to accept the invitation at first, out of fear that it would prevent my vote from being counted. But after doing a little research, I accepted with confidence.

You can attend with confidence as well. All you need is an absentee ballot.
Absentee voting procedures vary by locality. To find out how it works where you live, you can go to GoVoteAbsentee.org.

Plug in your zip code, and they'll provide you with information on how to apply for an absentee ballot. They'll even give you a link to download the ballot application. If you prefer, you can contact your county or city government for information.

The process is simple, and you can be sure your vote will be counted. Some localities even permit residents to cast absentee ballots in person. In Fairfax County, Virginia, where I live, you can personally feed your ballot into the optical scanner that tallies the votes.

And as an added bonus, you won't have to stand in line on election day!

I have already cast my absentee ballot. If you will be traveling to the TDWI conference in New Orleans, I urge you to do the same. Go to GoVoteAbsentee.org now, and find out how.

- Chris

Tuesday, June 10, 2008

Available Now: Designing a Performance Layer

My article "Designing A Performance Layer: Aggregates and Derived Tables" appears in the current issue of The Business Intelligence Journal (Vol 13, No. 1), available to members of The Data Warehouse Institute.

This article shows you how to create a performance layer for your dimensional data warehouse -- a set of tables and cubes designed specifically to speed up queries and reports.

The article covers two types of performance design: aggregate tables and derived tables.
  • Aggregate Tables store partially summarized copies of existing data. These tables (or cubes) improve performance for queries that do not require granular detail. The article fully explains aggregates, including summarization of metrics, hierarchies, and implementation with or without aggregate navigators. (A sketch of a simple aggregate table follows this list.)
  • Derived Tables store restructured copies of existing data. They do not necessarily summarize information, but they may. Derived tables deliver performance benefits by streamlining the query and reporting processes. In the article, I show how to supplement an existing schema with merge, pivot, and set operation fact tables (or cubes).
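For a flavor of the aggregate technique, here is a minimal sketch. It assumes a daily, product-grained sales fact table; all table and column names are hypothetical, not drawn from the article:

    -- Build an aggregate that summarizes daily sales to month and product.
    -- Queries that do not need daily detail can be directed here, either
    -- manually or by an aggregate navigator.
    CREATE TABLE sales_agg_month_product AS
    SELECT d.month_key,
           f.product_key,
           SUM(f.sale_dollars)  AS sale_dollars,
           SUM(f.quantity_sold) AS quantity_sold
    FROM   sales_facts f
    JOIN   day_dim d ON d.day_key = f.day_key
    GROUP BY d.month_key, f.product_key;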
The full article appears in The Business Intelligence Journal, which is available in print and online to TDWI members.

Wednesday, March 12, 2008

Responding to Star Schema Detractors with Timestamps

Time-stamped dimensions can pack additional analytic punch, compared with standard slow change techniques. There is additional cost in the load process, but it may be worth it, especially if you plan to load historic data.

This is the fourth in a series of posts on Slowly Changing Dimensions. If you are just joining in, you might want to review Part I, Part II and Part III.

Star schema detractors often suggest that a dimensional model throws away valuable detail. Proponents of dimensional design counter that one of their basic design tenets is to set fact table grain at the lowest level of detail possible. Fair enough. But this response does not address the concern fully. What about the dimensions?

Designers who eschew the dimensional model for their enterprise data warehouse point out that they can tell you what a particular customer looked like on a given day, regardless of whether there were any sales. A dimensional design using standard Type 2 techniques does not offer this capability. It writes a new row to the dimension table each time a characteristic changes. But we can only link these versions of the customer to a point in time if there is a transaction recorded in a fact table.

In the over-hyped "religious wars" of data warehousing, this is one criticism of the star schema that I have always felt had some validity. However, it is an easy one to address.

Time Stamped Dimensions

The solution is so simple that I have seen it employed numerous times by groups with no formal training in dimensional design: add a pair of dates to the dimension table, effective_date and expiration_date.

When a row is added to the dimension table, it is given an effective_date that indicates when it took effect. Later, when a slow change occurs, two things happen. The new record is inserted with a new effective_date, similar to a standard Type 2 response. In addition, the previous record is updated: its expiration_date is set to the day before the change occurred.
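In SQL terms, the two steps might look like this sketch. The table, columns, key values, and dates are all illustrative, and 9999-12-31 stands in for a "not yet expired" date (more on that below):

    -- Step 1: expire the current version of customer 4477,
    -- setting its expiration_date to the day before the change
    UPDATE customer_dim
    SET    expiration_date = DATE '2008-03-11'
    WHERE  customer_id = 4477
    AND    expiration_date = DATE '9999-12-31';

    -- Step 2: insert the new version with a fresh surrogate key,
    -- effective on the day of the change
    INSERT INTO customer_dim
        (customer_key, customer_id, customer_name, effective_date, expiration_date)
    VALUES
        (1002, 4477, 'New Name, Inc.', DATE '2008-03-12', DATE '9999-12-31');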

More ETL Work, but More Analytic Possibilities

This extra step means more work for the ETL developers, but it provides some additional analytic capability. Using these dates, it is possible to:

  1. List a series of changes to something in sequence: order by the effective_date.
  2. Determine the status of something at a point in time: qualify the date in question as greater than or equal to the effective_date and less than or equal to the expiration_date (see the example queries below).
  3. Quickly find the current row for a given dimension member by qualifying on the expiration_date.
For all this to work, it is useful to initialize the expiration_date of each current record to an actual date rather than NULL. This avoids the tortured query predicates that NULL values bring on. I prefer to use the highest date value supported by the RDBMS, but that’s a topic for another post.
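For example, capabilities 2 and 3 reduce to simple range predicates. Here is a minimal sketch, with illustrative table and column names, using 9999-12-31 to stand in for the highest date the RDBMS supports:

    -- Capability 2: what did customer 4477 look like on June 30, 2007?
    SELECT *
    FROM   customer_dim
    WHERE  customer_id = 4477
    AND    DATE '2007-06-30' BETWEEN effective_date AND expiration_date;

    -- Capability 3: the current version is the row that has not yet expired
    SELECT *
    FROM   customer_dim
    WHERE  customer_id = 4477
    AND    expiration_date = DATE '9999-12-31';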

As described, this is more of a "Date Stamped" dimension. But if the granularity of significant change is finer than the day, it can easily be extended to include a time component.

Because there is a bit of extra ETL processing, it is worth considering whether the additional analytic value justifies the cost. But there is another consideration to take into account, and here the ETL developers may actually save some work.

Another Benefit

Having the timestamps on our conformed dimensions will make it easier to load past history when we build new fact tables in the future. This is especially handy if we are building the warehouse one subject area at a time.

For example, our Customer dimension may be implemented first as part of an orders schema. Later, it may come into play again as part of a receivables schema. When it comes time to load the receivables history into a fact table, each transaction will need to be associated with the correct historic version of the customer. Without timestamps, ETL developers will need to use the operational data to identify exactly what the customer looked like at the time of the transaction, then search for a match in the customer dimension.

With timestamps, all the ETL developers will need is the customer_id. We can easily find the correct version of the customer in the dimension table by referring to the timestamps.
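A sketch of that lookup follows. The staging table and column names are illustrative; the key idea is the join on the natural key plus a date-range predicate:

    -- Assign the historically correct surrogate key to each receivables
    -- transaction by matching on customer_id and the timestamp range
    SELECT r.transaction_id,
           r.transaction_date,
           c.customer_key
    FROM   receivables_staging r
    JOIN   customer_dim c
      ON   c.customer_id = r.customer_id
     AND   r.transaction_date BETWEEN c.effective_date AND c.expiration_date;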

For this simple reason alone, I encourage designers to use timestamps on the dimensions that are important across the enterprise -- the conformed dimensions that form the backbone of the data warehouse.

Tuesday, January 8, 2008

Have it Both Ways With Hybrid Slow Changes

This is the third in a series on Slowly Changing Dimensions. It looks at an often misunderstood approach: the hybrid SCD.

When the source data for a dimension table changes, the Slowly Changing Dimension characteristics of a dimensional model dictate how the warehouse schema should respond. In my previous two posts, I examined the three most common techniques. (See Part I & Part II.) When the source value for a dimension attribute value changes:
  • Type 1 response: Update the existing value in the dimension table.
  • Type 2 response: Add a new row to the dimension table, with a new surrogate key.
  • Type 3 response: Maintain a single row in the dimension table, with separate columns for the current and previous values of the dimension attribute in question.
Sometimes a single technique does not fit the bill. Requirements will come along that seem to call for both Type 1 and Type 2 treatment. What to do?

“We always want to do it this way. But sometimes we want to do it that way.”

Consider a company building a typical orders star schema for business-to-business sales. Customer companies are represented by a customer dimension table. You are trying to determine what the correct slow change response should be when the name of a customer company changes. In an interview, the first thing you are told is:
“We always want to use the most current name for the company in our analysis…”
This sounds to you like a Type 1 change, which looks reasonable. After all, you might reason, one wouldn't want this year vs. last year comparisons to be befuddled by different company names. But no sooner have you noted this than the speaker continues:
“...except for when we want to use the name that was in effect at the time of the transaction.”
Now things look more complicated. To provide the company name as it existed at the time of an order would require a Type 2 response. But that would conflict with the first requirement.

What to do? Don't jump to Type 3. That's not the solution either. A Type 3 response will not tell you which value was in effect at the time of a given transaction. (It also usually does not capture more than one previous value of the attribute; see the previous post.)

Having it Both Ways

The hybrid approach allows you to meet both requirements. It requires that you model two dimension attributes that will capture the value in question – company name in this case.
  • company_name_current will capture the current name of the company. This attribute will exhibit Type 1 behavior. When a company name changes, it will be updated.
  • company_name_historic will capture the detailed history of company names. This attribute will exhibit Type 2 behavior. When a company name changes, a new row will be created to capture the new value and to associate it with any subsequent transactions.

This means that when a hybrid change occurs, the ETL process must do at least two things (a SQL sketch follows this list):
  1. Update the _current value for all records that share the same natural key.
  2. Insert a new row, initializing both the _current and _historic values to the new value.
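In SQL terms, the hybrid response might be sketched as follows, with illustrative table, column, and key values:

    -- Step 1 (Type 1 behavior): refresh the current name on every
    -- version of this customer, identified by the natural key
    UPDATE customer_dim
    SET    company_name_current = 'New Name, Inc.'
    WHERE  customer_id = 4477;

    -- Step 2 (Type 2 behavior): insert a new version with a new
    -- surrogate key; both columns carry the new name
    INSERT INTO customer_dim
        (customer_key, customer_id, company_name_current, company_name_historic)
    VALUES
        (1002, 4477, 'New Name, Inc.', 'New Name, Inc.');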
Years ago, it was popular to refer to this approach as a “Type 4 Change.” With the second edition of The Data Warehouse Toolkit, Kimball and Ross suggest we call this a hybrid change. And for good reason: it is really a bit of Type 1 and a bit of Type 2.

Hybrid Change in Action

Developers often have trouble with this type of design. To fully understand it, look at what happens when the name of a company changes.

Suppose that one of the customers was Apple Computer. You may recall that in 2007, the company changed its name from Apple Computer Inc. to Apple Inc.

Initially, the dimension table contains a row that looks like this (the surrogate key and natural key values are illustrative):

    customer_key  customer_id  company_name_current  company_name_historic
    ------------  -----------  --------------------  ---------------------
    1001          4477         Apple Computer Inc.   Apple Computer Inc.

When the 2007 name change is processed, the schema's hybrid response looks like this:

    customer_key  customer_id  company_name_current  company_name_historic
    ------------  -----------  --------------------  ---------------------
    1001          4477         Apple Inc.            Apple Computer Inc.
    1002          4477         Apple Inc.            Apple Inc.

The first row shows the existing record. The _current value is updated with the new name, “Apple Inc.” Its _historic value is not touched. The next row shows the new record that has been inserted. It shows the new name for both _current and _historic values. Any facts subsequently recorded will be connected to this new row.

To analyze all facts (before and after the name change) using the new name, Apple Inc., you use company_name_current to group query results. To study all facts in the context of the historically accurate name, group query results by company_name_historic.
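For example, with an illustrative order_facts table:

    -- All facts under each company's current name (Type 1 view of history)
    SELECT c.company_name_current, SUM(f.order_dollars) AS order_dollars
    FROM   order_facts f
    JOIN   customer_dim c ON c.customer_key = f.customer_key
    GROUP BY c.company_name_current;

    -- The same facts under the names in effect when they occurred (Type 2 view)
    SELECT c.company_name_historic, SUM(f.order_dollars) AS order_dollars
    FROM   order_facts f
    JOIN   customer_dim c ON c.customer_key = f.customer_key
    GROUP BY c.company_name_historic;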

Repeatable

This process is repeatable. If Apple once again changed the corporate name, say to “Apple Electronics, Inc.”, the ETL process for the customer table would:
  1. Change company_name_current to “Apple Electronics, Inc.” for both existing rows.
  2. Add a third row (with a new surrogate key) with _current and _historic values recorded as “Apple Electronics, Inc.” Subsequent facts would be connected to this row.
The Importance of Good Documentation

It is important to document the hybrid change carefully. ETL developers must understand how it is to be loaded, and report/query developers must understand how to use it. Otherwise, the hybrid approach can lead to confusion.

And be sure that it is worth the trouble. You might ask why people want the historic name if, 99% of the time, they only need the current name. If it is because they want to be able to reproduce an invoice, perhaps that need would be better served by going back to the source system. Then the schema can sport a simpler Type 1 approach.

But if a real analytic need exists, the hybrid approach is the only way to go.

Copyright (c) 2007 Chris Adamson