Monday, April 27, 2009

More on Distinct Counts

Shortly before my recent post on unique counts, a reader sent in a related question that takes a slightly different perspective:
As fact tables could easily have a hierarchy of keys, the counts using the lowest grain can be counted via a count(*), and anything above that needs a count(distinct column_a).

Do you see many applications accommodating count(*)'s…by that I mean, making separate higher grain fact tables so ad-hoc users do not have to use a count distinct?

P. Petrini
As the reader points out, unique counts (or distinct counts) can be problematic for end users to put together. This may be particularly if you are using a SQL-generating Business Intelligence (BI) tool.

For example, suppose an orders fact table has one row per order line. Its major dimensions are customer, product, salesperson and day. To count the number of distinct products ordered on a day, you would need to place count(distinct product_name) in your SQL.

That is something a developer can do in her sleep, but may be a bit much for a power-user. And it may be difficult to get a BI tool to generate this kind of SQL.

The reader correctly points out that this would not be necessary if the grain of the fact table precisely matches whatever it is that needs to be counted. All that would be necessary would be count(*) or something along those lines. So a summary table might help....

The Issue

The problem with this is that the summary table will simplify one and only one kind of count.

There will probably be a variety of different, levels or time periods across which you want distinct counts, and each would require its own aggregate.

This could quickly become a lot of work -- simply to make it easier to formulate certain kinds of queries. And while it may make a particular query easier to express, it introduces new complexity for the user, who must now choose precisely the right star for each query.

For example, a summary fact table that has one row for each product for each day makes it easier to count the number of products that sold on a day. But this aggregate must omit other dimensions like salesperson and customer to be useful in this regard. If you also want to count other distinct things by day, each will require its own aggregate. Also want distinct counts by month? More aggregates.

Not very efficient, unfortunately. The SQL for counting distinct things is simplified, to be sure. But now the ETL process is doing a lot more work, and users are faced with choosing the right aggregate for each query they build.

Alternatives

My suggestion is that, rather than build fact tables that eliminate the need for count(distinct), build and cache reports that do the work. People interested in these counts can access the reports, instead of writing SQL. If these counts are common, you might also find it easier to use an OLAP tool.

Not perfect, I know, but nothing ever is.

This is not to say that there is anything wrong with creating a series of summary tables for this purpose, or perhaps storing some pre-computed counts as discussed earlier.

As people who attend my classes know, my motto is "be pragmatic, not dogmatic." As long as everyone is aware of the pros and cons, and understands how it affects each aspect of the data warehouse implementation, then if there is a consensus that adding summary tables is the best way to make life easier, go ahead and do it.

Chris

Many thanks to P. Petrini, for consenting to have his question appear here.

If you have a question, you can email it to me. Click my profile picture at the top of this page for the address. I try to answer all questions, and might even print yours here

Saturday, April 18, 2009

Dealing with Period-to-Date Measurements (MTD, YTD and so forth)

A few weeks ago, I touched on some problems you might encounter when trying to store period-to-date measurements in a fact table. Today, I want to address the issue more directly.

Period-to-date measurements can summarize just about any fact over a period of time. A measurement of month-to-date sales , for example, aggregates sales from the beginning of the month to the date in question. Month-to-date, quarter-to-date and year-to-date facts are often used to do things like compute commissions, volume discounts, etc.

Period-to-date Facts and Additivity

Most facts in a star schema can be summed up across any and all dimensions. Sales dollars, for example, can be summed across days, customers, stores, or whatever dimension makes sense to the person doing the analysis. We call this kind of fact fully additive.

The basic issue with storing period-to-date facts is this: they are never fully additive. It does not make any sense to sum up period-to-date measurements that were taken at different times. Month-to-date sales for a customer on January 14, 15 and 16, for example, cannot be meaningfully added together.

Period-to-date measurements are semi-additive; the can be summed up across some dimensions, but not time.

It might make sense to aggregate together several period-to-date measurements if they were taken at the same time. The year-to-date sales of all customers who made a purchase on a particular day, for example, might factor into some useful analysis.

Storing a period-to-date fact with transactions

Since it makes no sense to sum up period-to-date measurements across different points in time, they have very limited use in a transaction-grained fact table.

For example, a transaction-grained fact table that stores a row for each order-line of each order might contain dimensions that capture the order_line number, date and time of the order, product sold, and so forth. You might store a period-to-date measurement in such a fact table, but there there will be little or no opportunity to sum this fact across different order lines. Only orders with the same date/time can be meaningfully summed. And if there are multiple order-lines for a given order, even this may not make sense, since the period-to-date measurement would be stored redundantly.

Stored in this kind of fact table, the period-to-date measurement has very limited use. It can really only be used to study the most granular data in the fact table. For example, it might be used to compute an applicable discount, or a commission percentage that escalates with sales volume.

But is this needed in the fact table? Such metrics are usually highly operational. They may influence other metrics, which have more analytic value and happen to be additive. The commission paid or discount given, for example, may be computed using period-to-date information, but itself is fully additive. This will be useful to store in a fact table.

This does not mean that period-to-date measurements have no place in a transaction-grained fact table. If they are central to the business, or to analysis of the process, then place them there. They may, however, cause excessive growth in the row size, since it is possible to enumerate multiple period-to-date versions of any given fact.

Storing a period-to-date fact with period aggregates or snapshots

The natural home for a period-to-date measurement is a fact table with a periodic-snapshot grain, or a fact table that aggregates across the time dimension.

An aggregate that summarizes the time dimension sums up data from a transaction-based fact table over a particular period. For each period summarized, the resulting rows all summarize the same period. This means that each can contain period-to-date measurements that might be meaningfully summed, at least within the period. (Note I said "might." There is a caveat, which I will get to in a second.)

An aggregate that records monthly sales totals, for example, is potentially a good place to include a quarter-to-date or year-to-date measurement. Each row in the table will summarize the same period, so the period-to-date measurements may be meaningfully summarized.

The same goes for a periodic snapshot fact table. Like a periodic aggregate, each row in a snapshot summarizes a particular period. The main difference is that a snapshot will contain some form of unique status measurement that cannot be found in the transactions, such as a balance or level. Once again, a semi-additive period-to-date fact can be stored here.
Snapshots record the status of things at fixed time intervals. A monthly snapshot of bank accounts, for example, records the day-end balance of each account. This is a logical place to store month-to-date or other such

Caution: Exactly what is being measured period-to-date?

Even within an aggregate or snapshot period, period-to-date facts may not be additive. It is important to ask yourself: exactly what is being measured period-to-date? If it does not coincide with the grain of the table, it may be non-additive.

For example, a monthly aggregate stores sales by product and customer. If we want to store year-to-date customer sales in this table, the same fact will be repeated multiple times if a single customer purchased multiple products. It is not additive across products; it is repeated if the customer bought more than one product.

On the other hand, if the period-to-date measurement captures sales by customer and product, it coincides with the grain of the table. It can be meaningfully aggregated across any dimension but time.

Reporting

The old stand-by is to compute a period-to-date measurement in a query or report. It is calculated from transactions at run-time. This allows access to any conceivable period-to-date metric, and may also be used as a behavioral qualification rather than a fact.

Such queries may be particularly nasty or time consuming, but they can be run during batch windows and cached to compensate.

My next post will be another Q&A. Please send in your questions!

Chris