Saturday, April 18, 2009

Dealing with Period-to-Date Measurements (MTD, YTD and so forth)

A few weeks ago, I touched on some problems you might encounter when trying to store period-to-date measurements in a fact table. Today, I want to address the issue more directly.

Period-to-date measurements can summarize just about any fact over a period of time. A measurement of month-to-date sales , for example, aggregates sales from the beginning of the month to the date in question. Month-to-date, quarter-to-date and year-to-date facts are often used to do things like compute commissions, volume discounts, etc.

Period-to-date Facts and Additivity

Most facts in a star schema can be summed up across any and all dimensions. Sales dollars, for example, can be summed across days, customers, stores, or whatever dimension makes sense to the person doing the analysis. We call this kind of fact fully additive.

The basic issue with storing period-to-date facts is this: they are never fully additive. It does not make any sense to sum up period-to-date measurements that were taken at different times. Month-to-date sales for a customer on January 14, 15 and 16, for example, cannot be meaningfully added together.

Period-to-date measurements are semi-additive; the can be summed up across some dimensions, but not time.

It might make sense to aggregate together several period-to-date measurements if they were taken at the same time. The year-to-date sales of all customers who made a purchase on a particular day, for example, might factor into some useful analysis.

Storing a period-to-date fact with transactions

Since it makes no sense to sum up period-to-date measurements across different points in time, they have very limited use in a transaction-grained fact table.

For example, a transaction-grained fact table that stores a row for each order-line of each order might contain dimensions that capture the order_line number, date and time of the order, product sold, and so forth. You might store a period-to-date measurement in such a fact table, but there there will be little or no opportunity to sum this fact across different order lines. Only orders with the same date/time can be meaningfully summed. And if there are multiple order-lines for a given order, even this may not make sense, since the period-to-date measurement would be stored redundantly.

Stored in this kind of fact table, the period-to-date measurement has very limited use. It can really only be used to study the most granular data in the fact table. For example, it might be used to compute an applicable discount, or a commission percentage that escalates with sales volume.

But is this needed in the fact table? Such metrics are usually highly operational. They may influence other metrics, which have more analytic value and happen to be additive. The commission paid or discount given, for example, may be computed using period-to-date information, but itself is fully additive. This will be useful to store in a fact table.

This does not mean that period-to-date measurements have no place in a transaction-grained fact table. If they are central to the business, or to analysis of the process, then place them there. They may, however, cause excessive growth in the row size, since it is possible to enumerate multiple period-to-date versions of any given fact.

Storing a period-to-date fact with period aggregates or snapshots

The natural home for a period-to-date measurement is a fact table with a periodic-snapshot grain, or a fact table that aggregates across the time dimension.

An aggregate that summarizes the time dimension sums up data from a transaction-based fact table over a particular period. For each period summarized, the resulting rows all summarize the same period. This means that each can contain period-to-date measurements that might be meaningfully summed, at least within the period. (Note I said "might." There is a caveat, which I will get to in a second.)

An aggregate that records monthly sales totals, for example, is potentially a good place to include a quarter-to-date or year-to-date measurement. Each row in the table will summarize the same period, so the period-to-date measurements may be meaningfully summarized.

The same goes for a periodic snapshot fact table. Like a periodic aggregate, each row in a snapshot summarizes a particular period. The main difference is that a snapshot will contain some form of unique status measurement that cannot be found in the transactions, such as a balance or level. Once again, a semi-additive period-to-date fact can be stored here.
Snapshots record the status of things at fixed time intervals. A monthly snapshot of bank accounts, for example, records the day-end balance of each account. This is a logical place to store month-to-date or other such

Caution: Exactly what is being measured period-to-date?

Even within an aggregate or snapshot period, period-to-date facts may not be additive. It is important to ask yourself: exactly what is being measured period-to-date? If it does not coincide with the grain of the table, it may be non-additive.

For example, a monthly aggregate stores sales by product and customer. If we want to store year-to-date customer sales in this table, the same fact will be repeated multiple times if a single customer purchased multiple products. It is not additive across products; it is repeated if the customer bought more than one product.

On the other hand, if the period-to-date measurement captures sales by customer and product, it coincides with the grain of the table. It can be meaningfully aggregated across any dimension but time.

Reporting

The old stand-by is to compute a period-to-date measurement in a query or report. It is calculated from transactions at run-time. This allows access to any conceivable period-to-date metric, and may also be used as a behavioral qualification rather than a fact.

Such queries may be particularly nasty or time consuming, but they can be run during batch windows and cached to compensate.

My next post will be another Q&A. Please send in your questions!

Chris