Showing posts with label Dates. Show all posts
Showing posts with label Dates. Show all posts

Thursday, March 28, 2013

Where To Put Dates

A reader is trying to decide if certain dates should be modeled as dimensions of a fact table or as attributes of a dimension table.

I have two attributes that I'm really not sure where is the best place to place:'Account Open Date' and 'Account Close Date'. In my model, I have [Dim Accounts] as a dimension and [F transact] as a fact table containing accounts transactions. An account can have many transactions, so the dates have different cardinality than the transactions.
  • I thought to put the dates in the Accounts dimension, but this led to problems: difficulties in calculations related to those dates--like if I want to get the transactions of the accounts that opened in the 4th quarter of 2012, or to get the difference between the date of last transaction and the account opening date, and so on.  In other words I can't benefit from the Date dimension and the hierarchies it contains.
  • So I though about placing those dates in the fact table, but what made me hesitate is that the granularity of those dates is higher than the fact table, so there will be a lot of redundancy.
- Ahmad
Bethlehem, Palestine

This is a common dilemma. Many of our most important dimensions come with a number of possible dates that describe them.

Ahmad is thinking about this problem in the right way: how will my choice affect my ability to study the facts?

It turns out that (1) this is not an either/or question, and (2) granularity is not an issue.

Dates that Describe Important Dimensions

Image licensed via Creative Commons 2.0

The dates are clearly useful dimension attributes.  I suggest that you keep them in the dimension in one of two ways, which I will discuss in a moment.

First, though, lets look at what happens if the dates are only represented as foreign keys in the fact table:

If the dates are not stored in the dimension, the open and close date are only associated with the Account dimension through the fact table.  The fact table only has records when transactions occur. So it becomes harder to find a list of open accounts, or to find the set of accounts that were active as of a particular date.

An additional factless fact table may help here, but it is far more simple to store the dates in the dimension.

Date as Attribute vs Outrigger

If plan to represent the dates in your dimension table, you have two choices. You can model the dates themselves as attributes, or you can model a pair of day keys in your account dimension. Either approach is acceptable.

The first option does not expose the richness of your full day dimension for analytic usage, but it may be simpler to use for many business questions. Other questions (like your quarterly example) will require a bit more technical knowledge, but most BI tools help with this.

The second option transforms your star into a (partial) snowflake. The day dimension becomes known as an "outrigger" when it connects to your account dimension. This allows you to explicitly leverage all the attributes of your Day dimension. The cost is some extra joins, which may be confusing and may also disrupt star-join optimization.

Making the correct choice here involves balancing several perspectives:

  • The business view and usability
  • The capabilities of your BI software front end
  • The capabilities of your DBMS back end software

Day Keys in the Fact Table

Having said all that, it is also useful to represent at least one of these dates in the fact table. The account open date may be a good dimensional perspective for the analysis of facts.

As you observed, this date has different cardinality than the transactions. The account open date for an account remains constant, even if it has dozens of transactions in your fact table. But the fact that it has low cardinality should not stop you from choosing it as a major dimension as your star!

Your account transaction fact table may have a pair of day keys -- one for the date the account was opened, and one for the date of the transaction.

If you choose to do this, the account dimension itself should include the open date. The outrigger solution is not necessary since your fact table has full access to the Day dimension.

Note that I do not recommend this for your account closed date, because that date changes. Storing it as a key for every transaction against an account would require a lot of updates to fact table rows once the account becomes closed.

More Information

I've touched on this topic in the past.  In particular, see this post:


Although I edited it out of Ahmad's question, he also cited an issue surrounding the use of NULL for accounts that do not have a closed date. On that topic, see this recent post:



Support this Blog

I maintain this blog in my spare time.  If you find it helpful, you can help support it by picking up a copy of my book:  Star Schema: The Complete Reference.

Use the links on this blog to get a copy of this or any of the other recommended books, and you will be helping to keep this effort going.

Thursday, May 12, 2011

Dates In Dimension Tables

A reader asks about tracking dates in the fact table vs. dimension tables.
Q: It seems to me that it is sometimes more effective to store date keys in the dimension table -- particularly if that date could be used to describe the rows in many different fact tables.  The alternative  would be to duplicate the date key in multiple fact tables.

For example, the effective date of an insurance policy could be used against many fact tables.  Should we store the effective date key in each of the fact tables?  Or store it in a single row in the Policy?

The differences in design would be a star schema (fact table) versus a snowflake (stored in the dimension).

- Robert
Montreal
A:  Many models feature at least one major dimension with a date that  might be used to qualify facts in more than one star. This happens in other industries, too.  Examples include dimension tables that represent orders, contracts, tests and so forth.

Avoid placing date_key columns in dimension tables.  It increases SQL complexity, reduces schema understandability, and may impact performance.

Some designers may allow limited use of snowflaking, but not in this situation.

Date in Dimension Table, Day Key in Fact Table

When you have an date that (1) clearly describes something represented by a dimension, and (2) will also be used to group facts, do two things:

  1. Place good old-fashioned dates in the dimension table 
  2. In any star where you may want filter/group/aggregate facts by some aspect of the date in question, add date keys to the fact table
For example, in Robert's case there might be a Policy dimension table with an attribute called policy_effective_date.  This is a date, not a key.

There is also a star that tracks claims. To support analysis of claim information using the policy effective date, the fact table will contain a day_key_policy_effective.  The fact table may contain other day_keys as well (such as day_key_claim, which supports analysis of claim data by the date of the claim.)

Judicious Use of Snowflaking?

There are times when some designers might choose a snowflake path -- placing dimension keys into dimension tables. This happens when there is a set of attributes that appear in multiple dimension tables.

This occurs with geographical data.  The characteristics of a location might appear in a customer dimension table, a warehouse dimension table and a department dimension table.

In this case, the concern is that the ETL routines may not consistently process locations across multiple dimension tables. If their attributes or slow change characteristics are not managed identically, inconsistencies may develop across these tables.

Designers reason that creating one location table solves this problem.  Each dimension that contains location data will contain a location_key. This ensures a single consistent representation of locations.

I discuss this technique in Chapter 7 of Star Schema The Complete Reference, and point out that it opens the door for a different kind of ETL challenge.  If location data is moved to an outrigger, each time there is Type 2 slow change to a location, all related dimension table rows must also undergo Type 2 changes.

My preference is to allow the attributes to repeat in multiple tables, and to increase the QA focus on related ETL programs.

Thanks to Robert for the question.  If you have one of your own, send it to the address in the sidebar.

-Chris
Image credit: Gravityx9 licensed under Creative Commons 2.0