Showing posts with label Degenerate Dimensions. Show all posts
Showing posts with label Degenerate Dimensions. Show all posts

Tuesday, April 17, 2012

Q&A: Degenerate Dimensions

A reader asks about degenerate dimensions:
Q: I understand that when a particular dimension value is unique to each fact row (i.e., an invoice number perhaps) it should probably exist as a degenerate dimension in the fact table.  Otherwise, that dimension table will contain as many rows as the fact table. 

However, what about a situation in which a dimension table contains only one column aside from the surrogate key (i.e., a status of some sort) but is not unique to each fact row.  In that case, should it be moved into the fact table as a degenerate dimension?

- Brian

Both cases are good applications for a degenerate dimension: a transaction identifier or a low-cardinality attribute that would otherwise need to be a very boring single-column table.

In all cases, the degenerate dimension is entirely optional.

Degenerate dimensions

A degenerate dimension is nothing more than a dimension attribute stored in a fact table, rather than a dimension table.

In all other respects, the degenerate dimension behaves exactly like any other dimension attribute.  It can be used to provide context for facts, qualify queries, sort results, drive master-detail relationships, and so forth.

The usual reason to create a degenerate dimension is simple:  A dimension with one attribute really doesn't need to be in a separate table with a surrogate key.

Two important things to note here:
  1. Degenerate dimensions are optional
  2. There is nothing wrong with a 1:1 relationship between dimension table and fact table
Lets take a closer look at two common uses of degenerate dimensions.

Transaction identifiers as degenerate dimensions

For a transaction fact table, we usually set the grain at the lowest level of detail possible. Often that is an individual transaction such as an order line, invoice line or shipping line.  (For a refresher on grain, see this post from 2009.)

Typically, these things have some kind of unique identifier which is carried into the dimensional model.  Examples include an order line number, invoice line ID, transaction ID, etc.  When these identifiers do not exist, we often create them, as I have discussed in a previous post.

Many designers opt to place these transaction identifiers directly into the fact table as a degenerate dimension.  However, this is entirely optional.  There is nothing wrong with a dimension table being in a 1:1 relationship with the fact table, as I have discussed previously. 

Some designers forgo the degenerate dimension for transaction identifiers because that there will be other schemas where the attribute in question is also referenced.  In many cases, our business intelligence software may function better if the attribute in question is located in exactly one place -- a shared dimension table. 

As an aside, note that in these other schemas, the cardinality may be different.  For example, the same invoice line may be referenced in multiple shipments or invoices.

Other degenerate dimensions

Not all degenerate dimensions are transaction identifiers.  They may simply be dimension attributes with no other suitable home.  The status code mentioned by the reader may be an example.  Another example appeared in a previous Q&A post: Reason Overpaid.

Once again, the use of the degenerate technique is entirely optional. Many designers are comfortable placing the attribute in the fact tables, while others opt for dimension tables with one attribute.

Notice that a dimension table with one attribute does not require a surrogate key for slow change purposes. However, without the surrogate key, the attribute value would be stored in the fact table anyway, as a foreign key reference to a single column table!

Placing the attribute in a separate table may help with our BI tools, for the same reason given above. Database administrators may also prefer the option because it makes it easier to configure a star-join optimizer.  Adding a surrogate key may reduce the size of the foreign key that will be stored in fact table rows.

Learn More

For more information on this topic, see the following posts:
  • Accumulating Snapshots (October 1, 2010)  Discusses another situation where dimension table may have the same number of rows as the fact table, or perhaps even less.
  • Rule 1: State Your Grain (December 9, 2009) Defines the concept of grain, and discusses the best practice of setting it at the lowest level of detail possible.
Of course, all these topics are also covered in Star Schema: The Complete Reference

You can help support this blog by using the links here to order a copy from Amazon.com.

Image licensed via Creative Commons 2.0
from Patrick Hoesley

Monday, October 17, 2011

Creating transaction identifiers for fact tables

Sometimes, the source data for a fact table does not include a transaction identifier. When this happens, we often crate our own. A reader asks if this goes against best practices:

Q: You said that we should not create surrogate keys in fact tables. But we have a source system that does not provide us with identifiers for phone calls. When we get the log files, we create surrogate keys for each call before loading them into the star.

A: This is a common practice, and it is just fine.

You are not creating what dimensional modelers call surrogate keys. You are creating transaction identifiers.  These will be helpful as part of the ETL and QA processes.

Surrogate keys vs. transaction identifiers

Surrogate keys should not be confused with transaction identifiers.

As I wrote in the post referenced in the question, a surrogate key is an attribute that is created to uniquely identify rows in dimension tables. It does not come from a source system; it is created expressly for the dimensional schema.

A transaction identifier is a dimension attribute in a fact table that defines its grain, or at least helps to do so. (If you need a refresher on grain, see this post.)

For example, PAYROLL_FACTS may have the grain "one row per pay stub."  Each pay stub has an identifying number in the source system, which is carried into the star as a degenerate dimension. This attribute is a transaction identifier.1

When a transaction identifier defines grain, the DBA may define it as "the primary key of the fact table." Likewise, an ETL developer or QA analyst may use it as a sort of key to compare the star to operational data or staging data. 

From the dimensional perspective, however, these transaction identifiers are dimensions.  They can be used to filter queries, group or sort the results, and so forth. They will not appear in other tables as foreign keys that reference the fact table.

Creating transaction identifiers

When operational systems do not have good transaction identifiers, we often "cook one up" during the ETL process. They are not surrogate keys in the dimensional modeling sense, though the term might sound appropriate. They are transaction identifiers.

For example, suppose you are designing a star that records phone calls made from your call center. You want the grain of the fact table to be  "one row per phone call."  The data will come from your call switch.  As you've mentioned, it creates a log file, but does not provide an identifier for each call.

In your design, you may define your grain: "one row per date, time, originating number and number dialed." This works just fine, but may be difficult for ETL developers to manage.  Instead, they assign a unique identifier to each call in the log. You can use this identifier to define the grain of the fact table.

Carried into the fact table, this attribute aids in the data integration and quality assurance process. It can be used to quickly crosscheck the star against the log file.

However, for analytic purposes, it is not used as a key.  It is a dimension attribute -- a "cooked up" transaction identifier.

Other concerns

When you are forced to create your own transaction identifiers, you will have other things to worry about.

First and foremost, you must be sure there is a mechanism to prevent loading of duplicate facts. You will need to understand how the source system generates its log files, to be sure that you extract data that is complete, accurate, and not subject to revision.

Also, note that the artificial identifiers may convey a false sense that you can reach all the way back to the source to find the original transaction. While it may be possible, your identifier will not be the mechanism.

Thanks for the question, and keep them coming.

-Chris

More Info

For more information, see these posts:

Also check out my book, Star Schema: The Complete Reference.  I discuss surrogate keys in Chapter 1, Analytic Databases and Dimensional Design.  Degenerate dimensions and transaction identifiers are covered in Chapter 3, Stars and Cubes.

1Not all degenerate dimensions are transaction identifiers.  For an example that is not a transaction identifier, see my post of October 15, 2010.

Image licensed via Creative Commons 2.0
 from Patrick Hoesley

Friday, October 15, 2010

Q&A: Degenerate Dimensions, ETL and BI

A question from a reader about including dimensions in the fact table:

Q:  Question concerning an argument I am having with a colleague. We have a  transaction fact table that will have an attribute called "Reason Overpaid". This attribute can only contain one of 10 values. Is it better to create an "Reason Overpaid" dimension and put a FK in the fact table referencing to the dimension, or just have the "Reason Overpaid" description in the fact table.
A:  This is one argument I will not be able to settle.  Either approach is fine.

Stored in the fact table, this attribute would be known as a degenerate dimension.  It is perfectly acceptable there, but you may decide to move it to a separate table for other reasons. 

Degenerate Dimension

A degenerate dimension is nothing more than a dimension attribute stored in the fact table. This technique is commonly employed when there is an attribute left over that doesn't really fit into any of the other dimension tables.

Your "Reason Overpaid" attribute can be stored in the fact table as a degenerate dimensionYou can still use it in the exact same way as any other dimension attribute -- as a way to filter queries, group results, break subtotals, and so forth.

Keeping it in the fact table avoids unnecessary complexity -- a new table and key attribute to manage and load, a new foreign key lookup when processing facts, and most importantly an extra join to include in queries.

That said, a dimension table for the attribute may make sense in some situations.

Junk Dimension

If there are more than one degenerate dimensions, consider moving them all to a separate dimension table. This is called a junk dimension.  The attributes are not directly related to one another and there is no natural key.   It is populated with the table that contains the Cartesian product of all possible values. 

ETL Consistency Concerns

If your "Reason Overpaid" will also appear in other fact tables, worries about ETL consistency may arise.

Degenerate dimensions are still OK in this situation, but now two or more fact tables will contain the attribute, and it will be necessary to be sure it is loaded consistently.  Creating a separate dimension table allows the values to be created exactly once, avoiding any problems that might be created by inconsistent ETL processing.

While I would not go to a separate table for this reason, I do understand why many designers opt to do so.  The next situation is a different story.

BI Tool Capabilities

If your "Reason Overpaid" will also appear in other fact tables, the capabilities of your BI software may come into play.

The scenario is this:  you are configuring your BI tool to auto-generate SQL queries for users.  You'd like to have an item they can request called "Reason Overpaid", but the tool does not understand that it can appear in two places in the database schema.

Creating a dimension table for the attribute solves this problem. Both fact tables can link to the same dimension table. The tool can now have a definitive place to go for "Reason Overpaid", and may even be able to use it as the basis for comparing data in two fact tables. This is a strong reason to go with a separate table.

Luckily, many BI tools can be configured to acknowledge that a dimension may appear in more than one place, in which case this is not an issue.  And if you are building cubes for the purposes of BI reporting, you can trust your developers to choose the right attribute.

If you're interested in reading more about how BI tools may influence your dimensional design, be sure to check Chapter 16, "Design and Business Intelligence" in my latest book, Star Schema The Complete Reference.

Send in your questions

Have a question of your own about dimensional modeling?  Send it to the blog address in the sidebar.

Chris