Monday, April 27, 2009

More on Distinct Counts

Shortly before my recent post on unique counts, a reader sent in a related question that takes a slightly different perspective:
As fact tables could easily have a hierarchy of keys, the counts using the lowest grain can be counted via a count(*), and anything above that needs a count(distinct column_a).

Do you see many applications accommodating count(*)'s…by that I mean, making separate higher grain fact tables so ad-hoc users do not have to use a count distinct?

P. Petrini
As the reader points out, unique counts (or distinct counts) can be problematic for end users to put together. This may be particularly if you are using a SQL-generating Business Intelligence (BI) tool.

For example, suppose an orders fact table has one row per order line. Its major dimensions are customer, product, salesperson and day. To count the number of distinct products ordered on a day, you would need to place count(distinct product_name) in your SQL.

That is something a developer can do in her sleep, but may be a bit much for a power-user. And it may be difficult to get a BI tool to generate this kind of SQL.

The reader correctly points out that this would not be necessary if the grain of the fact table precisely matches whatever it is that needs to be counted. All that would be necessary would be count(*) or something along those lines. So a summary table might help....

The Issue

The problem with this is that the summary table will simplify one and only one kind of count.

There will probably be a variety of different, levels or time periods across which you want distinct counts, and each would require its own aggregate.

This could quickly become a lot of work -- simply to make it easier to formulate certain kinds of queries. And while it may make a particular query easier to express, it introduces new complexity for the user, who must now choose precisely the right star for each query.

For example, a summary fact table that has one row for each product for each day makes it easier to count the number of products that sold on a day. But this aggregate must omit other dimensions like salesperson and customer to be useful in this regard. If you also want to count other distinct things by day, each will require its own aggregate. Also want distinct counts by month? More aggregates.

Not very efficient, unfortunately. The SQL for counting distinct things is simplified, to be sure. But now the ETL process is doing a lot more work, and users are faced with choosing the right aggregate for each query they build.


My suggestion is that, rather than build fact tables that eliminate the need for count(distinct), build and cache reports that do the work. People interested in these counts can access the reports, instead of writing SQL. If these counts are common, you might also find it easier to use an OLAP tool.

Not perfect, I know, but nothing ever is.

This is not to say that there is anything wrong with creating a series of summary tables for this purpose, or perhaps storing some pre-computed counts as discussed earlier.

As people who attend my classes know, my motto is "be pragmatic, not dogmatic." As long as everyone is aware of the pros and cons, and understands how it affects each aspect of the data warehouse implementation, then if there is a consensus that adding summary tables is the best way to make life easier, go ahead and do it.


Many thanks to P. Petrini, for consenting to have his question appear here.

If you have a question, you can email it to me. Click my profile picture at the top of this page for the address. I try to answer all questions, and might even print yours here