Monday, February 16, 2009

Q&A: Chosing the right time dimension for a Snapshot

In this Q&A, a reader asks how to designate a time dimension for a fact table that is a snapshot.
Q: We are designing a new star. The fact table is a monthly snapshot. It will capture several metrics each month for each open account. My question is how to handle the time dimension. We cannot agree whether it makes more sense to have a month dimension, or to use a month end date from a day dimension. What makes the most sense?

- Name withheld by request

First, I want to commend you for stating the grain of your fact table, and stating it clearly. Grain confusion can cause many problems, and makes a question like yours even more difficult to answer. Snapshots always have a time component to their grain; in your case it is monthly. Your fact table will have one row per account per month.

Now to your question: I strongly prefer the use of a month dimension for a monthly snapshot. This approach is non-ambiguous. It is self-evident what each row represents; the chance for any confusion is minimal.

When a monthly snapshot is designed with a day dimension, the designers are generally clear on what each row represents. However, report developers or other users may misinterpret the data now or in the future.

Designers who use a day dimension with a monthly snapshot generally choose the last day of the period represented, as you suggested. The problem with this is that there are many elements of a typical day dimension that simply do not apply to a monthly summary. Tor example, a snapshot representing March 2009 would have the day 3/31/09 associated with it. The day dimension has a lot of information about March 31, 2009 that does not apply to the snapshot, such as what day of the week it was, whether it was a holiday, and so forth. It would be all too easy to misinterpret this information as being applicable to the facts.

Even if everyone is clear on this at the start, a year or two down the road there are likely to be new members of your development team who may misinterpret the data. The presence of a date can also lead to confusion about what a row represents. If your monthly summary
contains an average daily balance, for example, a user may confuse this for a balance as of the day in question.

Confusion will also arise over what date to use for the current period. During March 2009, you may be updating the snapshot rows for each account on a daily basis. Using a day dimension, ETL developers may assign the particular date in March, which now serves double duty -- designating the snapshot period and the date loaded.

Sometimes, there is a strong desire among technical team members to use a day dimension with a monthly snapshot. If you cannot convince them otherwise, then you can avoid some confusion by creating a view for use with the monthly snapshot that hides the day-level attributes. You may be able to do the same kind of “hiding” of non-applicable dimension attributes within your BI tool. Use the month-end date for each fact table row. If the ETL team wants to track the day each row was loaded, they can do this with a second relationship to the day dimension. (This trick works nicely even if you do go with a month dimension.)

Regardless of your approach, I also advise you to name the facts carefully. If you want to include a balance as of the end of the period, name it period_end_balance. An average daily balance for the period in question should be called average_daily_balance. A column name like balance is simply too vague, and bound to be misused.

Thanks for your question!

- Chris

If you have a question, click my profile picture at the top of the page for my address. I try to respond to all messages, and may even post an answer here.