Wednesday, May 20, 2009

Do I really need Surrogate Keys?

Here is the #1 most frequently question that people ask of me:
Q: Do I really need surrogate keys?
A: You absolutely must have a unique identifier for each dimension table, one that does not come from a source system. A surrogate key is the best way to handle this, but there are other possibilities.

The case for the surrogate key is entirely pragmatic. Read on for a full explanation.

Dimensions Need their Own Unique Identifier

It is crucial that the dimensional schema be able to handle changes to source data in whatever manner makes most sense from an analytic perspective. This may be different from how the change is handled in the source.

For this reason, every dimension table needs its own unique identifier -- not one that comes from a source system. (For more on handling changes, start with this post.)

A surrogate key makes the best unique identifier. It is simply an integer value, holding no meaning for end users. A single, compact column, it keeps fact table rows small and makes SQL easy to read and write. It is simple to manage during the ETL process.

Compound Keys Work, But Why Take that Route?

An alternative is to supplement a unique identifier from the source system (also known as a natural key) with a sequence number. This results in a compound key, or multi-part key.

This kind of compound key also allows the dimension to handle changes differently than the source does. But there are several disadvantages:
  • Fact table rows become larger, as they must include the multi-part key for each dimension
  • The ETL process must manage a sequence for each natural key value, rather than a single sequence for a surrogate key
  • SQL becomes more difficult to read, write or debug
  • Multi-part dimension keys can disrupt star join optimizers
So: a compound key takes more space, is not any easier, and may disrupt performance. Why bother?

(By the way, many source system identifiers are already compound keys. Adding a sequence number will make them even larger!)

Sometimes it is suggested that the natrual key be supplemented with a date, rather than a sequence. This may simplify ETL slightly, but the rest of the drawbacks remain. Plus, dates are even bigger than a sequence number. Worse, the date will appear in the fact table. That is sure to lead to trouble!

(This is not to say that datestamps on dimension rows are a bad thing. To the contrary, they can be quite useful. They just don't work well as part of a compound key.)

That's the basic answer. I'll respond to some common follow up questions over the coming weeks.

- Chris