Friday, October 22, 2010

Q&A: Star Schema and Referential Integrity

Two questions came in this week regarding referential integrity:
Q: Regarding foreign keys (fact to dimension) do you think it is necessary to implement foreign key constraints in a data warehouse?

Q: What are the pros and cons of having referential integrity. Why have key constraints? I mean it looks like they add maintenance burden.
A: 

If you are following the best practices of star schema design, which include the use of surrogate keys, it is not necessary to enforce referential integrity at the DBMS.

Instead, the ETL process will be responsible for maintaining the integrity of relationships between rows of the tables.

Referential integrity is still a top priority.  But responsibility has been moved to the ETL process.

ETL Must Maintain RI

In fact, for a star schema, the ETL process must maintain referential integrity (RI).  There is no way around this, because the ETL process is the source of all key values that will be inserted into the star.

The ETL process must assign surrogate key values in the dimension tables, and transform natural key values to surrogate key values when loading facts.  Any data that would violate referential integrity must be be caught by these processes.  If not, the ETL process is not dong its job.

Turning Off DBMS Enforcement

If surrogate keys are being properly managed by the ETL process, referential integrity checks at the DBMS level become redundant.  In fact, they may even slow the load down.  For this reason, many database administrators turn off referential integrity constraints while the load process is running.

This is possible for a star schema because a single known process updates the data on a predictable schedule.  Contrast that to an OLTP system, where a variety of application modules may alter data on an unpredictable/chaotic schedule. In such an environment, DBMS-level enforcement makes the most sense.

Turning it On Anyway

This does not necessarily mean you should keep referential integrity turned off.  Some ETL developers like to have it turned back on after the load is executed, as a sort of safety net, to catch any errors.  Also, a database administrator may want to turn it on because other features of the DBMS require it.

No Surrogate Keys?

If your design does not include surrogate keys, you will probably want to have the DBMS enforce referential integrity, since your ETL process is not really managing it.

But in this case, referential integrity may be the least of your problems. You are beholden to the operational system's decisions regarding change history, and you may need to use multi-part compound keys.  See last year's posts Do I really need surrogate keys?  and More On Surrogate Keys for more on why you should be using them.


Thanks for the questions, and keep them coming!

Chris