Q: You said that we should not create surrogate keys in fact tables. But we have a source system that does not provide us with identifiers for phone calls. When we get the log files, we create surrogate keys for each call before loading them into the star.
A: This is a common practice, and it is just fine.
You are not creating what dimensional modelers call surrogate keys. You are creating transaction identifiers. These will be helpful as part of the ETL and QA processes.
Surrogate keys vs. transaction identifiers
Surrogate keys should not be confused with transaction identifiers.
As I wrote in the post referenced in the question, a surrogate key is an attribute that is created to uniquely identify rows in dimension tables. It does not come from a source system; it is created expressly for the dimensional schema.
For example, PAYROLL_FACTS may have the grain "one row per pay stub." Each pay stub has an identifying number in the source system, which is carried into the star as a degenerate dimension. This attribute is a transaction identifier.1
When a transaction identifier defines grain, the DBA may define it as "the primary key of the fact table." Likewise, an ETL developer or QA analyst may use it as a sort of key to compare the star to operational data or staging data.
From the dimensional perspective, however, these transaction identifiers are dimensions. They can be used to filter queries, group or sort the results, and so forth. They will not appear in other tables as foreign keys that reference the fact table.
Creating transaction identifiers
When operational systems do not have good transaction identifiers, we often "cook one up" during the ETL process. They are not surrogate keys in the dimensional modeling sense, though the term might sound appropriate. They are transaction identifiers.
For example, suppose you are designing a star that records phone calls made from your call center. You want the grain of the fact table to be "one row per phone call." The data will come from your call switch. As you've mentioned, it creates a log file, but does not provide an identifier for each call.
In your design, you may define your grain: "one row per date, time, originating number and number dialed." This works just fine, but may be difficult for ETL developers to manage. Instead, they assign a unique identifier to each call in the log. You can use this identifier to define the grain of the fact table.
Carried into the fact table, this attribute aids in the data integration and quality assurance process. It can be used to quickly crosscheck the star against the log file.
However, for analytic purposes, it is not used as a key. It is a dimension attribute -- a "cooked up" transaction identifier.
When you are forced to create your own transaction identifiers, you will have other things to worry about.
First and foremost, you must be sure there is a mechanism to prevent loading of duplicate facts. You will need to understand how the source system generates its log files, to be sure that you extract data that is complete, accurate, and not subject to revision.
Also, note that the artificial identifiers may convey a false sense that you can reach all the way back to the source to find the original transaction. While it may be possible, your identifier will not be the mechanism.
Thanks for the question, and keep them coming.
For more information, see these posts:
- Avoid surrogate keys for fact tables (September 21, 2011)
- Degenerate dimensions, ETL and BI (October 15, 2010)
- Rule 1: State your grain (December 9, 2009)
Also check out my book, Star Schema: The Complete Reference. I discuss surrogate keys in Chapter 1, Analytic Databases and Dimensional Design. Degenerate dimensions and transaction identifiers are covered in Chapter 3, Stars and Cubes.
1Not all degenerate dimensions are transaction identifiers. For an example that is not a transaction identifier, see my post of October 15, 2010.