Chris Adamson’s Blog: May 2009

Wednesday, May 20, 2009

Do I really need Surrogate Keys?

Here is the question people ask me most frequently:

Q: Do I really need surrogate keys?

A: You absolutely must have a unique identifier for each dimension table, one that does not come from a source system. A surrogate key is the best way to handle this, but there are other possibilities.

The case for the surrogate key is entirely pragmatic. Read on for a full explanation.

Dimensions Need their Own Unique Identifier

It is crucial that the dimensional schema be able to handle changes to source data in whatever manner makes most sense from an analytic perspective. This may be different from how the change is handled in the source.

For this reason, every dimension table needs its own unique identifier -- not one that comes from a source system. (For more on handling changes, start with this post.)

A surrogate key makes the best unique identifier. It is simply an integer value, holding no meaning for end users. A single, compact column, it keeps fact table rows small and makes SQL easy to read and write. It is simple to manage during the ETL process.
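To make the ETL side concrete, here is a minimal sketch in Python of how a load process might hand out surrogate keys from a single sequence. The class name, the `load` method, and the sample customer values are all hypothetical, invented for illustration. The sketch also assumes that every attribute change should create a new dimension row, which is only one of several ways a change might be handled.

```python
from itertools import count


class CustomerDimension:
    """Toy in-memory dimension table keyed by a surrogate integer (illustrative only)."""

    def __init__(self):
        self._next_key = count(1)  # one sequence for the whole dimension
        self.rows = {}             # surrogate key -> dimension row
        self.current = {}          # natural key -> surrogate key of its latest row

    def load(self, natural_key, attributes):
        """Return the surrogate key for a source record.

        A new row (and a new key) is created when the natural key is new
        or when its attributes differ from the latest row we hold.
        """
        key = self.current.get(natural_key)
        if key is not None and self.rows[key]["attributes"] == attributes:
            return key  # nothing changed, reuse the existing row
        new_key = next(self._next_key)
        self.rows[new_key] = {
            "natural_key": natural_key,
            "attributes": dict(attributes),
        }
        self.current[natural_key] = new_key
        return new_key


dim = CustomerDimension()
first = dim.load("C-100", {"state": "NY"})   # new customer, new row
same = dim.load("C-100", {"state": "NY"})    # unchanged, same key returned
moved = dim.load("C-100", {"state": "CA"})   # changed in source, new row and key
other = dim.load("C-200", {"state": "TX"})   # different customer, next key
```

The source system still sees one customer, C-100, but the dimension now holds two rows for it, each with its own integer key. The fact table only ever needs to carry that one integer.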

Compound Keys Work, But Why Take that Route?

An alternative is to supplement a unique identifier from the source system (also known as a natural key) with a sequence number. This results in a compound key, or multi-part key.

This kind of compound key also allows the dimension to handle changes differently than the source does. But there are several disadvantages:

Fact table rows become larger, as they must include the multi-part key for each dimension
The ETL process must manage a sequence for each natural key value, rather than a single sequence for a surrogate key
SQL becomes more difficult to read, write or debug
Multi-part dimension keys can disrupt star join optimizers

So: a compound key takes more space, is not any easier, and may disrupt performance. Why bother?
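The difference in the fact table and in the SQL can be seen in a small side-by-side sketch. It uses Python's built-in `sqlite3` module with an in-memory database. All table names, column names, and values are made up for illustration, and this toy example says nothing about star join optimizers or real-world performance.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Design A: single-column surrogate key
    CREATE TABLE customer_a (
        customer_key INTEGER PRIMARY KEY,
        customer_id  TEXT,
        state        TEXT
    );
    CREATE TABLE sales_a (customer_key INTEGER, amount REAL);

    -- Design B: natural key supplemented with a sequence number
    CREATE TABLE customer_b (
        customer_id TEXT,
        version_seq INTEGER,
        state       TEXT,
        PRIMARY KEY (customer_id, version_seq)
    );
    CREATE TABLE sales_b (customer_id TEXT, version_seq INTEGER, amount REAL);

    INSERT INTO customer_a VALUES
        (1, 'C-100', 'NY'), (2, 'C-100', 'CA'), (3, 'C-200', 'TX');
    INSERT INTO sales_a VALUES (1, 10.0), (2, 25.0), (3, 40.0);

    INSERT INTO customer_b VALUES
        ('C-100', 1, 'NY'), ('C-100', 2, 'CA'), ('C-200', 1, 'TX');
    INSERT INTO sales_b VALUES
        ('C-100', 1, 10.0), ('C-100', 2, 25.0), ('C-200', 1, 40.0);
""")

# One join condition per dimension.
SURROGATE_QUERY = """
    SELECT c.state, SUM(f.amount)
    FROM sales_a f
    JOIN customer_a c ON f.customer_key = c.customer_key
    GROUP BY c.state
    ORDER BY c.state
"""

# Every part of the compound key must appear in the fact table and in the join.
COMPOUND_QUERY = """
    SELECT c.state, SUM(f.amount)
    FROM sales_b f
    JOIN customer_b c ON f.customer_id = c.customer_id
                     AND f.version_seq = c.version_seq
    GROUP BY c.state
    ORDER BY c.state
"""

surrogate_result = conn.execute(SURROGATE_QUERY).fetchall()
compound_result = conn.execute(COMPOUND_QUERY).fetchall()
```

Both queries return the same totals by state. The compound design simply pays for it with an extra column in every fact row and an extra condition in every join, and that cost repeats for each dimension the fact table references.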

(By the way, many source system identifiers are already compound keys. Adding a sequence number will make them even larger!)

Sometimes it is suggested that the natural key be supplemented with a date, rather than a sequence. This may simplify ETL slightly, but the rest of the drawbacks remain. Plus, dates are even bigger than a sequence number. Worse, the date will appear in the fact table. That is sure to lead to trouble!

(This is not to say that datestamps on dimension rows are a bad thing. To the contrary, they can be quite useful. They just don't work well as part of a compound key.)
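As one illustration of a datestamp that lives on the dimension row as an ordinary attribute rather than in the key, here is a small Python sketch. The row layout, the `effective_date` column, and the `key_as_of` helper are hypothetical, and looking up the row in effect on a given date is just one assumed use of such a stamp.

```python
from datetime import date

# Hypothetical dimension rows: the date is a plain attribute, not part of the key.
customer_rows = [
    {"customer_key": 1, "customer_id": "C-100", "state": "NY",
     "effective_date": date(2009, 1, 1)},
    {"customer_key": 2, "customer_id": "C-100", "state": "CA",
     "effective_date": date(2009, 4, 15)},
]


def key_as_of(rows, customer_id, as_of):
    """Return the surrogate key of the row in effect on `as_of`, or None."""
    candidates = [
        r for r in rows
        if r["customer_id"] == customer_id and r["effective_date"] <= as_of
    ]
    if not candidates:
        return None  # no version of this customer existed yet
    latest = max(candidates, key=lambda r: r["effective_date"])
    return latest["customer_key"]


march_key = key_as_of(customer_rows, "C-100", date(2009, 3, 1))
may_key = key_as_of(customer_rows, "C-100", date(2009, 5, 20))
```

The fact table still stores only the integer `customer_key`. The date is consulted when a row needs to be located and never travels into the fact table as part of a key.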

That's the basic answer. I'll respond to some common follow-up questions over the coming weeks.

- Chris

