Friday, October 22, 2010

Q&A: Star Schema and Referential Integrity

Two questions came in this week regarding referential integrity:
Q: Regarding foreign keys (fact to dimension) do you think it is necessary to implement foreign key constraints in a data warehouse?

Q: What are the pros and cons of having referential integrity. Why have key constraints? I mean it looks like they add maintenance burden.

If you are following the best practices of star schema design, which include the use of surrogate keys, it is not necessary to enforce referential integrity at the DBMS.

Instead, the ETL process will be responsible for maintaining the integrity of relationships between rows of the tables.

Referential integrity is still a top priority.  But responsibility has been moved to the ETL process.

ETL Must Maintain RI

In fact, for a star schema, the ETL process must maintain referential integrity (RI).  There is no way around this, because the ETL process is the source of all key values that will be inserted into the star.

The ETL process must assign surrogate key values in the dimension tables, and transform natural key values to surrogate key values when loading facts.  Any data that would violate referential integrity must be be caught by these processes.  If not, the ETL process is not dong its job.

Turning Off DBMS Enforcement

If surrogate keys are being properly managed by the ETL process, referential integrity checks at the DBMS level become redundant.  In fact, they may even slow the load down.  For this reason, many database administrators turn off referential integrity constraints while the load process is running.

This is possible for a star schema because a single known process updates the data on a predictable schedule.  Contrast that to an OLTP system, where a variety of application modules may alter data on an unpredictable/chaotic schedule. In such an environment, DBMS-level enforcement makes the most sense.

Turning it On Anyway

This does not necessarily mean you should keep referential integrity turned off.  Some ETL developers like to have it turned back on after the load is executed, as a sort of safety net, to catch any errors.  Also, a database administrator may want to turn it on because other features of the DBMS require it.

No Surrogate Keys?

If your design does not include surrogate keys, you will probably want to have the DBMS enforce referential integrity, since your ETL process is not really managing it.

But in this case, referential integrity may be the least of your problems. You are beholden to the operational system's decisions regarding change history, and you may need to use multi-part compound keys.  See last year's posts Do I really need surrogate keys?  and More On Surrogate Keys for more on why you should be using them.

Thanks for the questions, and keep them coming!


Friday, October 15, 2010

Q&A: Degenerate Dimensions, ETL and BI

A question from a reader about including dimensions in the fact table:

Q:  Question concerning an argument I am having with a colleague. We have a  transaction fact table that will have an attribute called "Reason Overpaid". This attribute can only contain one of 10 values. Is it better to create an "Reason Overpaid" dimension and put a FK in the fact table referencing to the dimension, or just have the "Reason Overpaid" description in the fact table.
A:  This is one argument I will not be able to settle.  Either approach is fine.

Stored in the fact table, this attribute would be known as a degenerate dimension.  It is perfectly acceptable there, but you may decide to move it to a separate table for other reasons. 

Degenerate Dimension

A degenerate dimension is nothing more than a dimension attribute stored in the fact table. This technique is commonly employed when there is an attribute left over that doesn't really fit into any of the other dimension tables.

Your "Reason Overpaid" attribute can be stored in the fact table as a degenerate dimensionYou can still use it in the exact same way as any other dimension attribute -- as a way to filter queries, group results, break subtotals, and so forth.

Keeping it in the fact table avoids unnecessary complexity -- a new table and key attribute to manage and load, a new foreign key lookup when processing facts, and most importantly an extra join to include in queries.

That said, a dimension table for the attribute may make sense in some situations.

Junk Dimension

If there are more than one degenerate dimensions, consider moving them all to a separate dimension table. This is called a junk dimension.  The attributes are not directly related to one another and there is no natural key.   It is populated with the table that contains the Cartesian product of all possible values. 

ETL Consistency Concerns

If your "Reason Overpaid" will also appear in other fact tables, worries about ETL consistency may arise.

Degenerate dimensions are still OK in this situation, but now two or more fact tables will contain the attribute, and it will be necessary to be sure it is loaded consistently.  Creating a separate dimension table allows the values to be created exactly once, avoiding any problems that might be created by inconsistent ETL processing.

While I would not go to a separate table for this reason, I do understand why many designers opt to do so.  The next situation is a different story.

BI Tool Capabilities

If your "Reason Overpaid" will also appear in other fact tables, the capabilities of your BI software may come into play.

The scenario is this:  you are configuring your BI tool to auto-generate SQL queries for users.  You'd like to have an item they can request called "Reason Overpaid", but the tool does not understand that it can appear in two places in the database schema.

Creating a dimension table for the attribute solves this problem. Both fact tables can link to the same dimension table. The tool can now have a definitive place to go for "Reason Overpaid", and may even be able to use it as the basis for comparing data in two fact tables. This is a strong reason to go with a separate table.

Luckily, many BI tools can be configured to acknowledge that a dimension may appear in more than one place, in which case this is not an issue.  And if you are building cubes for the purposes of BI reporting, you can trust your developers to choose the right attribute.

If you're interested in reading more about how BI tools may influence your dimensional design, be sure to check Chapter 16, "Design and Business Intelligence" in my latest book, Star Schema The Complete Reference.

Send in your questions

Have a question of your own about dimensional modeling?  Send it to the blog address in the sidebar.


Friday, October 1, 2010

Q&A: Accumulating Snapshots

A reader asks about 1:1 relationships in a star schema.
I am designing a superannuation data warehouse that will have a claims accumulating snapshot fact table. This table will contain  number of dates that will be filled in when a claim is being processed. From this  we will be able to calculate time spans between the various claims processing stages.

It is intended that the fact table will contain a claim_key that will reference a claims dimension (1:1 relationship ).  I have been told by a few people that a 1:1 relationship is not a recommended design model.
- A reader in Australia

A:  There is absolutely nothing wrong with having a 1:1 relationship.  (That's true for both ER models and dimensional models.) 

Accumulating snapshots often cause the kind of disagreement you mentioned.  They are less common.  People mistakenly generalize characteristics of more common designs, assuming they apply to all stars. 

This is natural, but it leads to three incorrect assumptions:
  1. The fact table is on the "many" end of a 1:Many relationship with each dimension
  2. The fact table will have the most rows
  3. Fact table rows are not updated once they are loaded
These assumptions are all false when it comes to an accumulating snapshot.  (In fact, even a transaction fact table may violate the first two assumptions.) 

You've asked about the first one, but let me debunk all three:

  1. By definition the accumulating snapshot will have a 1:1 relationship with one of the dimension tables.  Why? Because its purpose is to track the progression of some discrete item through a business process.  Its grain is almost always stated in terms of a dimension -- e.g. "One Row per Claim."

  2. If the dimension can undergo type 2 slow changes, the accumulating snapshot will likely have LESS ROWS than the dimension.  This is desirable, and I have written about it before.

  3. The purpose of the accumulating snapshot is to study lags between key events, while avoiding correlated sub queries.  Unlike other fact tables, this requires the row to be updated as more information becomes known about the item being tracked.

I've written about these topics a few times in the past, including the post mentioned above. I also explore these topics in much greater detail in Chapter 11 of Star Schema: The Complete Reference.

In the book, I also show that an accumulating snapshot be used even when there is not a fixed, linear series of business steps being tracked.

Anyone who is not familiar with the accumulating snapshot is encouraged to learn more.  It is an essential design technique for many common business requirements.

Thanks for the question, and keep them coming...

Image Credit:  Creativity103 via Creative Commons