Friday, August 20, 2010

Surrogate Keys in a CIF

In this Q&A, two more questions about surrogate keys.  Here's the first:
Q:   In a Corporate Information Factory architecture, if the EDW contains a change history for every table, do data marts need surrogate keys?
- Los Angeles, CA
A:  Yes. 

This is similar to a question from last year, in which change history was tracked in an operational system. This time, we're looking at an EDW tracking changes.

First, a little background on the CIF for the uninitiated.

In a Corporate Information Factory (CIF) the Enterprise Data Warehouse (EDW) is an integrated repository of atomic data.  The tables in this repository are normalized.  Data marts take information from this repository and structure it in a dimensional format for departmental access.

Dimension tables are not normalized.  In a CIF, a given dimension may draw data from many tables in the EDW.

For example, a Company dimension might pull data in from a dozen or more normalized tables that include the information on the company, its locations, accounts, lines of business, market data and contact information. 

If the dimension relied exclusively on the unique identifiers of these tables to identify rows, it would require a multi-part key that would have numerous parts.

And if, as you say, your EDW tracks change history to each table, this multi-part key may be compounded by version numbers or date stamps from each source table. This is impractical.  (It also makes it difficult to support type 1 responses in the data marts.)

A related question sometimes comes up in my classes:

Q:  Can I maintain surrogate keys in the EDW?

A:  No.  

Most dimensions do not correspond to a single entity in the EDW, which is normalized, so there is nowhere to store and maintain surrogate keys.  Slow change processing and surrogate key management apply to the dimensional representation of data.

You can maintain a central repository of conformed dimensions, and you can even co-locate them with your EDW.  In fact, this is a good idea.  These master dimension tables can be replicated to your data marts, in whole or in part.

This ensures compatibility across subject areas.  It is a common misconception that conformed dimensions are only important in Kimball's bus architecture.  Even in a CIF, you may need to combine information from multiple data marts.

- Chris

Image:  "Keys" by Takacsi75, licensed under Creative Commons 2.0.