Chris Adamson’s Blog: Creating transaction identifiers for fact tables

Monday, October 17, 2011

Creating transaction identifiers for fact tables

Posted by Chris Adamson

Sometimes, the source data for a fact table does not include a transaction identifier. When this happens, we often crate our own. A reader asks if this goes against best practices:

Q: You said that we should not create surrogate keys in fact tables. But we have a source system that does not provide us with identifiers for phone calls. When we get the log files, we create surrogate keys for each call before loading them into the star.

A: This is a common practice, and it is just fine.

You are not creating what dimensional modelers call surrogate keys. You are creating transaction identifiers. These will be helpful as part of the ETL and QA processes.

Surrogate keys vs. transaction identifiers

Surrogate keys should not be confused with transaction identifiers.

As I wrote in the post referenced in the question, a surrogate key is an attribute that is created to uniquely identify rows in dimension tables. It does not come from a source system; it is created expressly for the dimensional schema.

A transaction identifier is a dimension attribute in a fact table that defines its grain, or at least helps to do so. (If you need a refresher on grain, see this post.)

For example, PAYROLL_FACTS may have the grain "one row per pay stub." Each pay stub has an identifying number in the source system, which is carried into the star as a degenerate dimension. This attribute is a transaction identifier.¹

When a transaction identifier defines grain, the DBA may define it as "the primary key of the fact table." Likewise, an ETL developer or QA analyst may use it as a sort of key to compare the star to operational data or staging data.

From the dimensional perspective, however, these transaction identifiers are dimensions. They can be used to filter queries, group or sort the results, and so forth. They will not appear in other tables as foreign keys that reference the fact table.

Creating transaction identifiers

When operational systems do not have good transaction identifiers, we often "cook one up" during the ETL process. They are not surrogate keys in the dimensional modeling sense, though the term might sound appropriate. They are transaction identifiers.

For example, suppose you are designing a star that records phone calls made from your call center. You want the grain of the fact table to be "one row per phone call." The data will come from your call switch. As you've mentioned, it creates a log file, but does not provide an identifier for each call.

In your design, you may define your grain: "one row per date, time, originating number and number dialed." This works just fine, but may be difficult for ETL developers to manage. Instead, they assign a unique identifier to each call in the log. You can use this identifier to define the grain of the fact table.

Carried into the fact table, this attribute aids in the data integration and quality assurance process. It can be used to quickly crosscheck the star against the log file.

However, for analytic purposes, it is not used as a key. It is a dimension attribute -- a "cooked up" transaction identifier.

Other concerns

When you are forced to create your own transaction identifiers, you will have other things to worry about.

First and foremost, you must be sure there is a mechanism to prevent loading of duplicate facts. You will need to understand how the source system generates its log files, to be sure that you extract data that is complete, accurate, and not subject to revision.

Also, note that the artificial identifiers may convey a false sense that you can reach all the way back to the source to find the original transaction. While it may be possible, your identifier will not be the mechanism.

Thanks for the question, and keep them coming.

-Chris

More Info

For more information, see these posts:

Avoid surrogate keys for fact tables (September 21, 2011)
Degenerate dimensions, ETL and BI (October 15, 2010)
Rule 1: State your grain (December 9, 2009)

Also check out my book, Star Schema: The Complete Reference. I discuss surrogate keys in Chapter 1, Analytic Databases and Dimensional Design. Degenerate dimensions and transaction identifiers are covered in Chapter 3, Stars and Cubes.

¹Not all degenerate dimensions are transaction identifiers. For an example that is not a transaction identifier, see my post of October 15, 2010.

Image licensed via Creative Commons 2.0
from Patrick Hoesley

Classes

Chris is scheduled to present at the following events. Course enrollment is open to the general public.

All these courses are also available on site (see below).

August 18, 2019
San Diego, CA
Data Modeling in the Age of Big Data
Registration: TDWI San Diego
August 20, 2019
San Diego, CA
Data Architecture: Managing Information in the Age of Big Data
Registration: TDWI San Diego
August 20, 2019
San Diego, CA
Workshop: Building the Business Case for Advanced Analytics
Registration: TDWI San Diego Strategy Summit
Monday October 21, 2019
San Francisco, CA
TDWI Dimensional Data Modeling Primer: From Requirements to Business Analysis
Registration: TDWI Seminars
Tuesday October 22, 2019
San Francisco, CA
Advanced Dimensional Modeling: Techniques for Practitioners
Registration: TDWI Seminars
Wednesday October 23, 2019
San Francisco, CA
Dimensional Models: What’s New in the Big Data Era
Registration: TDWI Seminars
November 12, 2019
Orlando, FL
Data Architecture: Managing Information in the Age of Big Data
Registration: TDWI Orlando
November 12, 2019
Orlando, FL
The Dimensional Model Refactored: New Techniques for the 21st Century
Registration: TDWI Orlando
November 15, 2019
Orlando, FL
Advanced Dimensional Modeling: Complete Tour of Modern Best Practices
Registration: TDWI Orlando

Onsite Education

You can bring Chris to your team for interactive education.

Dimensional Modeling

Chris provides full-day and expanded two-day courses covering the dimensional design concepts from Star Schema: The Complete Reference.
TDWI Courses

Chris teaches select TDWI courses that cover topics like data BI fundamentals, performance management, business analytics, dashboards and scorecards, and more.

All of Chris's education offerings are provided through TDWI.

For information on onsite offerings, contact TDWI Onsite Education. or Oakton Software