Friday, March 27, 2009

Is a Snapshot an Aggregate?

Students in my classes often ask me if a snapshot fact table is an aggregate. My short answer is that it doesn't matter. But here's the long answer.

This is really a matter of semantics more than anything else. What it boils down to is how you define an aggregate table. Is it simply a table that summarizes data from other tables? Then yes, most of the time a snapshot will fit this definition -- but not always. However, I like to reserve the word "aggregate" for a very specific kind of summary. If you do as well, then the answer is no. Never.

Before getting into that, though, here is some background.

Snapshot Fact Tables

A snapshot, for the uninitiated, is a fact table that periodically samples a measurement. A snapshot always includes at least one measurement of status, such as a balance or level. A transaction fact table, in contrast, captures measurements as they occur.

The snapshot is useful for describing the effect of transactions, rather than transactions themselves. The best example is your bank account. Its all well and good if you have a record of each deposit, check, interest payment, fee, direct debit, and so forth. But what is your balance? Aggregating all your transactions is not the most expedient way to figure that out.

Sometimes Snapshots Summarize Transactions, but not Always

People sometimes think of snapshots as aggregates because they can be constructed from a table that contains transactions. That is certainly true in the case of account balances. If you have the transactions, you can compute the snapshots.

In this sense, you might be justified in thinking of a snapshot as an aggregate. But I don't. I'll explain why in just a second.

First, though, it is important to understand that not all snapshots are constructed from transactions. Sometimes, the number of transactions is too numerous to keep around. But in other cases, there simply aren't any. For example, we can't make use of transactions to describe the ambient temperature on the factory floor, or the level of pressure inside an industrial device. In these cases, the snapshot is clearly not a summary.

Snapshots Contain "new" Facts

So how about those other situations -- the ones where there are transactions that can be used to create a snapshot? Isn't a snapshot an aggregate in this case?

It really depends on how you define an aggregate.

The word aggregate is normally reserved for a summary table that does not transform the structure of its corresponding base table. For example, suppose account transactions are recorded in a fact table, with a fact called transaction_amount. A monthly aggregate would have exactly the same fact. Still called transaction_amount, is the same as the fact in the base table. The only difference is in level of detail: it represents a month's worth of transactions, rather than individual transactions. It is the same measurement.

Account balance (or other status measurements like levels) is a not the same as what the base fact table records. It is not an aggregation of transaction that occurred during the snapshot period; it is something more. It describes the net effect of the account transactions. It is a different measurement.

Conclusion

I use the term derived table to describe any table in a dimensional design uses another as its source. I save the word aggregate for non-transformative summaries.

To sum things up, then:

Yes, a snapshot sometimes summarizes a transaction fact table.

But if you accept these definitions for aggregate tables and derived tables, the answer is different:

A snapshot is is sometimes a derived table, but never an aggregate.

This is just a matter of semantics, and not very important to me. Call it whatever you want, and I will be fine with it. The important thing is that you have a dimensional design that enables useful analytics.

- Chris

Tuesday, March 24, 2009

Feed Updated, Now Contains Full Posts

If you follow this blog in a newsreader, you will be happy to know that the RSS feed now contains full posts, rather than the first couple of paragraphs. That should make your reading experience a lot easier.

The change seems to have been glitch-free, but if there are any issues please let me know.

Chris

Thursday, March 19, 2009

Q&A: Is a Unique Count a Fact or Dimension?

A reader recently sent in a question about unique counts. Are they facts, or dimensions?

Here is a paraphrased version of their questions. The business example has been changed.

Q: We report on the number of unique customers that visit a location. We do this at daily, monthly and annual levels, and also look at the same thing across all locations. We originally modeled the customer counts as facts. The problem is that if we compute unique customer counts by day, they do not correctly roll up to month. Does it make more sense to treat this as a dimension?

A: First, let me say that customer counts are certainly facts. They are important metrics that are colored by the dimensions we use to define them.

The difficulty with counts of distinct things, as you have discovered, is that they are non-additive facts. As soon as you use a set of dimensions to create a distinct count -- such as day, location and customer -- you have an aggregation that cannot be further summed. If you need to go to the monthly level, or across all locations, you need to go back to the granular data.

You have a few options.

Option 1: Compute in reports

The first is to compute unique counts in reports. This can be time consuming, and a major hassle if there are several counts that must be computed frequently. You might be able to schedule the reports to run in a batch window. This deals with the processing issues, but not the report or query complexity.

In my view, however, it is the best option. There are other possibilities, but each has significant drawbacks.

Option 2: Create snapshots or aggregates

A second option is to create some snapshot fact tables that compute the various distinct counts at various periodic intervals. These will contain non-additive facts, and your developers will have to understand that they should not be aggregated. Depending on their design, these tables may more closely resemble aggregate tables.

This approach may require quite a few fact tables, since each of your unique counts will have a different grain -- daily by location, monthly by location, annually by location, daily across all locations, monthly across all locations, etc.

To make life a little easier for report developers, designers sometimes replicate the monthly and annual totals in the daily level snapshot. Developers find this appealing, because all the information to produce subtotals over time can be fetched from a single row, rather than three queries. The flipside is that the monthly and annual data is stored redundantly in a lot of places, and must be maintained consistently and used properly.

Option 3: Behavioral Dimensions

If the current period is the primary focus, a third option is to create behavioral dimensions. For example, in the store dimension, you can add attributes that represent counts of unique customers for the current month and year. I mentioned behavioral dimensions in a previous Q&A post, and the concept is the same here.

Option 4: Creating Special Facts that can be Summed

Lastly, some designers like to create special facts in the fact table that can be used to sum unique counts. If you have a fact table that gets a row each time a customer visits a location, you may be able to tweak it a bit. Add an additional fact called first_visit_of_month that only contains a 1 the first time the customer visits any location during the month. First_visit_of_year will only contain a 1 the first time the customer visits during the year. These facts can be summed to produce unique counts for various time frames, but you need to be careful not to aggregate them across the periods they represent.

Another drawback is that, once again, numerous special facts will be required. To compute unique counts by location, you will need facts like first_visit_this_location_this_month. This fact should not be aggregated across months or locations. That's a lot of rules, and they are likely to be broken.

As you can see, no option looks perfect. Don’t let that lead you to push back on requirements, however. Distinct counts are often very important metrics, particularly when studying customer service or processing efficiency.

Thanks for the question!

- Chris

If you have a question, you can email it to me. Click my profile picture at the top of this page for the address. I try to answer all questions, and might even print yours here.

Privacy Policy

Recent changes to the Google AdSense program prompted me to define a privacy policy for this blog. Here it is:


Oakton Software does not gather any information about your visits to this blog, or the pages you view.

I wish the policy could be that simple, but you should also know the following:

Blogger:

The blog is hosted by Google's Blogger.com.

If you have a Google, Gmail or Blogger account, Google may log information about your visits to this blog. The Blogger privacy policy can be found by visiting the Google Privacy Center and clicking on the link for Blogger.

Newsletter:

If you subscribe to the newsletter, the information you provide is used to send occasional email updates. These always include an opt-out link.

The mail list is maintained and managed by an external service provider. They follow a confirmed opt-in policy to ensure compliance with anti-spam legislation.

Ads By Google:

UPDATE 12/16/2010: I have removed ads from this blog. While they helped support it, I felt they were detracting from its value. If you'd like to help support StarSchemaCentral, please use the links in the sidebars to buy books from Amazon.com.