Tuesday, September 8, 2009

More on Surrogate Keys

This post addresses a common follow-up question regarding surrogate keys:
"My source system already tracks changes. In this case, do dimension tables really need their own unique identifiers?"
Yes. There may be exceptions, but they are far less common than you may think, and they limit your future options.

In a previous post, I laid out the case for surrogate keys.  In short:
  1. By not reusing unique identifiers from a source system, dimensions are free to respond to changed data in whatever manner makes most sense from an analytic perspective.
  2. Surrogate keys are the best option, though a multi-part key is also possible.
It is tempting to assume that if the source system logs all changes, there is no need to maintain a separate unique identifier for dimensions. I would seem that they can maintain all history, while reusing the natural keys from the source system. Usually, this is not the case. 

A Dimension is Not Equivalent to an Entity

Unfortunately, a dimension in a data warehouse rarely corresponds to a single entity in an operational system. Operational systems are more highly normalized; most dimensions correspond to more than one table in a source system. A customer dimension, for example, may draw information from multiple source tables: customer, household, address, phone_number, demographics and so forth.

A change log for the central entity corresponding to a dimension may not reflect changes to some of the related tables, especially when they represent dependent entities or are linked through intersect tables. If this is the case, it will be impossible for the dimension to undergo a type 2 response when ancillary information changes.

If every source table that feeds a dimension is logged, it may be possible to design a multi-part unique identifier that is completely recycled. (As observed previously, compound keys in dimensions are feasible but not optimal.) This approach can lock you into type 2 processing for defining elements of the related tables, even when a type 1 response is desired.

Surrogate Keys Will Limit Future Rework

Even in the rare case when a source system handles all changes exactly the way we want, reuse of the natural keys limits future options.

If the source system is modified, the desired behavior may not be maintained. It will either be necessary to engage in significant rework of the dimensional model (and all dependent reports), or simply live with the unsatisfactory change handling of the source system.

Mergers and acquisitions may also bring additional source systems that correspond to an existing dimension. These systems arrive complete with entirely different identifiers. Incorporating them as sources for an existing dimension will always necessitate some rework of the dimension. But if surrogate keys are in place, fact tables will be unaffected and existing reports will still work. If, on the other hand, natural keys are the basis of unique identifiers, rework will cascade to related fact tables and also to existing queries and reports.