Friday, October 15, 2010

Q&A: Degenerate Dimensions, ETL and BI

A question from a reader about including dimensions in the fact table:

Q:  Question concerning an argument I am having with a colleague. We have a  transaction fact table that will have an attribute called "Reason Overpaid". This attribute can only contain one of 10 values. Is it better to create an "Reason Overpaid" dimension and put a FK in the fact table referencing to the dimension, or just have the "Reason Overpaid" description in the fact table.
A:  This is one argument I will not be able to settle.  Either approach is fine.

Stored in the fact table, this attribute would be known as a degenerate dimension.  It is perfectly acceptable there, but you may decide to move it to a separate table for other reasons. 

Degenerate Dimension

A degenerate dimension is nothing more than a dimension attribute stored in the fact table. This technique is commonly employed when there is an attribute left over that doesn't really fit into any of the other dimension tables.

Your "Reason Overpaid" attribute can be stored in the fact table as a degenerate dimensionYou can still use it in the exact same way as any other dimension attribute -- as a way to filter queries, group results, break subtotals, and so forth.

Keeping it in the fact table avoids unnecessary complexity -- a new table and key attribute to manage and load, a new foreign key lookup when processing facts, and most importantly an extra join to include in queries.

That said, a dimension table for the attribute may make sense in some situations.

Junk Dimension

If there are more than one degenerate dimensions, consider moving them all to a separate dimension table. This is called a junk dimension.  The attributes are not directly related to one another and there is no natural key.   It is populated with the table that contains the Cartesian product of all possible values. 

ETL Consistency Concerns

If your "Reason Overpaid" will also appear in other fact tables, worries about ETL consistency may arise.

Degenerate dimensions are still OK in this situation, but now two or more fact tables will contain the attribute, and it will be necessary to be sure it is loaded consistently.  Creating a separate dimension table allows the values to be created exactly once, avoiding any problems that might be created by inconsistent ETL processing.

While I would not go to a separate table for this reason, I do understand why many designers opt to do so.  The next situation is a different story.

BI Tool Capabilities

If your "Reason Overpaid" will also appear in other fact tables, the capabilities of your BI software may come into play.

The scenario is this:  you are configuring your BI tool to auto-generate SQL queries for users.  You'd like to have an item they can request called "Reason Overpaid", but the tool does not understand that it can appear in two places in the database schema.

Creating a dimension table for the attribute solves this problem. Both fact tables can link to the same dimension table. The tool can now have a definitive place to go for "Reason Overpaid", and may even be able to use it as the basis for comparing data in two fact tables. This is a strong reason to go with a separate table.

Luckily, many BI tools can be configured to acknowledge that a dimension may appear in more than one place, in which case this is not an issue.  And if you are building cubes for the purposes of BI reporting, you can trust your developers to choose the right attribute.

If you're interested in reading more about how BI tools may influence your dimensional design, be sure to check Chapter 16, "Design and Business Intelligence" in my latest book, Star Schema The Complete Reference.

Send in your questions

Have a question of your own about dimensional modeling?  Send it to the blog address in the sidebar.

Chris