Showing posts with label Behavioral Dimensions. Show all posts
Showing posts with label Behavioral Dimensions. Show all posts

Tuesday, March 16, 2010

Storing Nonadditive Facts [Q&A]

It can be challenging to decide what to do with a nonadditive fact. A reader asks if (and when) it might make sense to store a percentage in a fact table:
Chris,

What are your thoughts on storing percentages in fact tables?  Many of the calculations seem too complex to do in the BI tool. For instance, we calculate life-to-date percentages of KPIs on products.

Thanks,
Adrienne
Burlington, NC
Percentages are (usually) non-additive facts.  What this means is that you cannot add them together.  For example, if a computer is sold at 5% margin, and then another is sold at 5% margin, it is a mistake to say that total margin for the two sales is 10%. 

In introductory texts or courses, you are taught to decompose them into fully additive components.  The idea is that you can sum all the components, then compute the ratio at the last moment. The margin example above is a good application. Storing cost and sales price allows margin to be computed at any level of aggregation.

However, this only really works when all the components for a particular ratio are stored in the same row, and used within the same scope of aggregation.   As you have discovered, there are some non-additive facts that do not work this way.  A ratio that compares something to a larger pool or category is a good example.  Two or more aggregations are needed to compute this kind of metric.

It also sounds as if your problem is compounded by the need to use some kind of running-totals in the computation, which represent the lifetime-to-date sums.  These are also problematic because they are also non-additive, or possibly semi-additive.

Non-additive measurements that are computed from different aggregations are often stored in fact tables for exactly the reason you give: they are difficult to automate in a BI tool. They are also precomputed in situations where they would otherwise harm performance.

If you decide to pre-compute and store a non-additive fact, there are usually two questions that follow:  Where to store it, and how to use it.  The issues here are very similar to those surrounding period-to-date measurements, as I discussed in a post last year. 

Where to Store a Nonadditive Fact

Figuring out the answer to this question is closely linked to how it will be used.  If it is needed at a transaction-level, it must go in a transaction-grained fact table.

However, if it also need to be known at a daily or monthly level, that solution is not enough.  Since it is non-additive, it must be computed separately for these aggregations.  A periodic snapshot makes sense here.  Consider forgoing the transaction-level version, unless it is needed as well.

Also note that "time" is not the only dimension across which we might need a summarized non-additive fact.  For example, you might also want to see it across product categories, customer categories, and so forth.  Each would require its own snapshot-style fact table for discrete storage of the pre-computed fact.

In fact, this is the biggest problem with storing non-additive facts:  each potential aggregation must be stored separately.

Nonadditive Facts and BI Software

Because they are nonadditive,  (and may involve constituents that are semi-additive), these kind of facts must be used carefully.  For example, they should never be summed, subtotaled, etc.  Similarly, if we need to show them at different levels of detail, we need to fetch each aggregation separately.

This is relatively straight forward if the schema is well thought out, well documented, and fully understood by report developers.

However, it is very hard to make this kind of fact available in a "Self-Service" mode for end users, since most BI tools will easily break both rules.  (BI tools can usually handle simple non-additive facts, like the margin example at the beginning, but not facts that are computed from multiple aggregations.)

Alternatives may include:
  • Hiding the non-additive fact from users while exposing it to developers
  • Classifying it as a dimension so that it is not aggregated,  or 
  • Blocking certain features of the tool.
The fact that a non-additive fact may be stored in multiple locations (corresponding to different levels of aggregation) may also befuddle your BI software.
  • It may be possible to make use of  "aggregate navigation" features to teach it to fetch a fact from more than one possible location, but you will need to make sure that the tool always gets it right.  In addition, using such a feature may prevent you from "turning off" the tool's ability to aggregate a fact.  
  • Separate semantic layers for developers vs. users may be the solution to these problems.
Again, these considerations are less important in cases where trained developers do all the report building.  It is only when leveraging SQL-generating BI tools that you need to worry.

Another Option

In some cases, non-additive facts can be linked to a specific member of a dimension, such as an individual customer or  product.  In this case, consider storing it in the dimension table, as a behavioral dimension.  This is most effective when the measurement in question can be refreshed on a pre-defined schedule.

Thanks to Adrienne for her question, and for consenting to have it reprinted here!

- Chris

Do you have a question about dimensional design or data warehousing?  Send it in.  I answer all my email, though sometimes it takes a while.

Image is by Patrick Hosely licensed under Creative Commons 2.0

Thursday, March 19, 2009

Q&A: Is a Unique Count a Fact or Dimension?

A reader recently sent in a question about unique counts. Are they facts, or dimensions?

Here is a paraphrased version of their questions. The business example has been changed.

Q: We report on the number of unique customers that visit a location. We do this at daily, monthly and annual levels, and also look at the same thing across all locations. We originally modeled the customer counts as facts. The problem is that if we compute unique customer counts by day, they do not correctly roll up to month. Does it make more sense to treat this as a dimension?

A: First, let me say that customer counts are certainly facts. They are important metrics that are colored by the dimensions we use to define them.

The difficulty with counts of distinct things, as you have discovered, is that they are non-additive facts. As soon as you use a set of dimensions to create a distinct count -- such as day, location and customer -- you have an aggregation that cannot be further summed. If you need to go to the monthly level, or across all locations, you need to go back to the granular data.

You have a few options.

Option 1: Compute in reports

The first is to compute unique counts in reports. This can be time consuming, and a major hassle if there are several counts that must be computed frequently. You might be able to schedule the reports to run in a batch window. This deals with the processing issues, but not the report or query complexity.

In my view, however, it is the best option. There are other possibilities, but each has significant drawbacks.

Option 2: Create snapshots or aggregates

A second option is to create some snapshot fact tables that compute the various distinct counts at various periodic intervals. These will contain non-additive facts, and your developers will have to understand that they should not be aggregated. Depending on their design, these tables may more closely resemble aggregate tables.

This approach may require quite a few fact tables, since each of your unique counts will have a different grain -- daily by location, monthly by location, annually by location, daily across all locations, monthly across all locations, etc.

To make life a little easier for report developers, designers sometimes replicate the monthly and annual totals in the daily level snapshot. Developers find this appealing, because all the information to produce subtotals over time can be fetched from a single row, rather than three queries. The flipside is that the monthly and annual data is stored redundantly in a lot of places, and must be maintained consistently and used properly.

Option 3: Behavioral Dimensions

If the current period is the primary focus, a third option is to create behavioral dimensions. For example, in the store dimension, you can add attributes that represent counts of unique customers for the current month and year. I mentioned behavioral dimensions in a previous Q&A post, and the concept is the same here.

Option 4: Creating Special Facts that can be Summed

Lastly, some designers like to create special facts in the fact table that can be used to sum unique counts. If you have a fact table that gets a row each time a customer visits a location, you may be able to tweak it a bit. Add an additional fact called first_visit_of_month that only contains a 1 the first time the customer visits any location during the month. First_visit_of_year will only contain a 1 the first time the customer visits during the year. These facts can be summed to produce unique counts for various time frames, but you need to be careful not to aggregate them across the periods they represent.

Another drawback is that, once again, numerous special facts will be required. To compute unique counts by location, you will need facts like first_visit_this_location_this_month. This fact should not be aggregated across months or locations. That's a lot of rules, and they are likely to be broken.

As you can see, no option looks perfect. Don’t let that lead you to push back on requirements, however. Distinct counts are often very important metrics, particularly when studying customer service or processing efficiency.

Thanks for the question!

- Chris

If you have a question, you can email it to me. Click my profile picture at the top of this page for the address. I try to answer all questions, and might even print yours here.

Friday, January 23, 2009

Q&A: Customers with No Sales in Fact Table

In this post, Chris answer a reader's question about reporting on customers with no sales in the fact table.

New Q&A Feature

I get a lot of questions about dimensional modeling from readers of the blog, my books, and people who attend my classes. I try to answer all of them. I thought it would be interesting to post some of them here, for all to read.

This week, I'm answering a question about reporting on customers with no sales.

In general what is the recommended approach for reporting customers that have no sales in the fact table for MTD or YTD? Or, for that matter, any dimension member where no fact exists for some time period?

- J. Dolan

Usually, something like “customers with no sales for a period” is computed as part of the reporting process. Developers often bristle at this kind of report, because it requires using a correlated subquery. This makes it difficult to do with a SQL-generating BI tool, and can be slow. And if you want to use the list of customers with no sales as a filter for another query, things can get really hairy.

An alternative is to pre-compute this information, storing the results in a table that summarizes reporting periods. This is likely to be a factless fact table, with foreign keys to dimension tables representing the period summarized and the dimension in question--customers in your case. If you require monthly and yearly slices, you will want to multiple summaries. This approach may simplify reporting, but adds complexity to the ETL process.

If current period is the only period of concern, as in your question, then it is also possible to pre-compute the information and store the result directly to the customer dimension table, as a flag. The flag would specify whether the customer has placed an order during the current period. In your case, it would be two flags -- one for month to date and one for year to date.

This kind of attribute is called a behavioral dimension because it takes behavioral activity recorded in the fact table and transforms it into a dimension. Behavioral dimensions are very powerful, because they can be used as simple filters for other queries.

Again, this concept trades ETL complexity for reporting flexibility and performance. But isn't that what data warehousing is all about?

Thanks to J. Dolan for agreeing to have his question posted. If you have a question, send it in. I try to answer all my email, though it sometimes takes a while!

Chris