Thursday, March 19, 2009

Q&A: Is a Unique Count a Fact or Dimension?

A reader recently sent in a question about unique counts. Are they facts, or dimensions?

Here is a paraphrased version of their questions. The business example has been changed.

Q: We report on the number of unique customers that visit a location. We do this at daily, monthly and annual levels, and also look at the same thing across all locations. We originally modeled the customer counts as facts. The problem is that if we compute unique customer counts by day, they do not correctly roll up to month. Does it make more sense to treat this as a dimension?

A: First, let me say that customer counts are certainly facts. They are important metrics that are colored by the dimensions we use to define them.

The difficulty with counts of distinct things, as you have discovered, is that they are non-additive facts. As soon as you use a set of dimensions to create a distinct count -- such as day, location and customer -- you have an aggregation that cannot be further summed. If you need to go to the monthly level, or across all locations, you need to go back to the granular data.

You have a few options.

Option 1: Compute in reports

The first is to compute unique counts in reports. This can be time consuming, and a major hassle if there are several counts that must be computed frequently. You might be able to schedule the reports to run in a batch window. This deals with the processing issues, but not the report or query complexity.

In my view, however, it is the best option. There are other possibilities, but each has significant drawbacks.

Option 2: Create snapshots or aggregates

A second option is to create some snapshot fact tables that compute the various distinct counts at various periodic intervals. These will contain non-additive facts, and your developers will have to understand that they should not be aggregated. Depending on their design, these tables may more closely resemble aggregate tables.

This approach may require quite a few fact tables, since each of your unique counts will have a different grain -- daily by location, monthly by location, annually by location, daily across all locations, monthly across all locations, etc.

To make life a little easier for report developers, designers sometimes replicate the monthly and annual totals in the daily level snapshot. Developers find this appealing, because all the information to produce subtotals over time can be fetched from a single row, rather than three queries. The flipside is that the monthly and annual data is stored redundantly in a lot of places, and must be maintained consistently and used properly.

Option 3: Behavioral Dimensions

If the current period is the primary focus, a third option is to create behavioral dimensions. For example, in the store dimension, you can add attributes that represent counts of unique customers for the current month and year. I mentioned behavioral dimensions in a previous Q&A post, and the concept is the same here.

Option 4: Creating Special Facts that can be Summed

Lastly, some designers like to create special facts in the fact table that can be used to sum unique counts. If you have a fact table that gets a row each time a customer visits a location, you may be able to tweak it a bit. Add an additional fact called first_visit_of_month that only contains a 1 the first time the customer visits any location during the month. First_visit_of_year will only contain a 1 the first time the customer visits during the year. These facts can be summed to produce unique counts for various time frames, but you need to be careful not to aggregate them across the periods they represent.

Another drawback is that, once again, numerous special facts will be required. To compute unique counts by location, you will need facts like first_visit_this_location_this_month. This fact should not be aggregated across months or locations. That's a lot of rules, and they are likely to be broken.

As you can see, no option looks perfect. Don’t let that lead you to push back on requirements, however. Distinct counts are often very important metrics, particularly when studying customer service or processing efficiency.

Thanks for the question!

- Chris

If you have a question, you can email it to me. Click my profile picture at the top of this page for the address. I try to answer all questions, and might even print yours here.