Wednesday, October 10, 2012

Handling Rankings

Dimensional modelers often struggle with issues surrounding rankings. Learn when to store them, where to store them, and where not to store them, by reading this post.
We need to calculate Top N rankings across several dimensions (account, channel, hour, day, week, month). There are massive data sets involved. This must be a common problem. Is there an approach you can point us towards?
- Jeremy
Rankings are time and context dependent. When you need to track them, they pose challenges that are similar to non-additive facts such as period-to-date measurements or distinct counts.

The solutions, it turns out, are also similar.

Rankings and their Context

When thinking about how to handle rankings, it is useful to break them down. A ranking is actually made up of several parts: the dimension being ranked, the dimensional context for which the ranking will be computed, and the fact that is the source of the ranking.

For example, look at this simple statement:

"Show me the top 10 customers by week and product."
  • The dimension being ranked here is Customer.  
  • The dimensional context of each ranking is a product and week. For example, if we have 20 products, then each week we will need 20 top ten lists -- one for each product.
  • The fact driving the ranking is not explicit in this question. It is probably spending, though it may be based on margin or transaction counts.  
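The breakdown above can be sketched in code. Here is a minimal Python sketch of "top 10 customers by week and product," using hypothetical transaction data and assuming spending is the fact driving the ranking:

```python
from collections import defaultdict

# Hypothetical transaction rows: (week, product, customer, spend).
transactions = [
    ("2012-W40", "Widget", "Acme",    500.0),
    ("2012-W40", "Widget", "Globex",  300.0),
    ("2012-W40", "Gadget", "Acme",    120.0),
    ("2012-W41", "Widget", "Initech", 450.0),
]

def top_n_customers(rows, n=10):
    """Return {(week, product): [(customer, total), ...]} ranked by spend."""
    # Total spend per customer within each (week, product) context.
    totals = defaultdict(float)
    for week, product, customer, amount in rows:
        totals[(week, product, customer)] += amount

    # Group customers by their dimensional context.
    by_context = defaultdict(list)
    for (week, product, customer), total in totals.items():
        by_context[(week, product)].append((customer, total))

    # One Top N list per context: one per product, per week.
    return {
        context: sorted(members, key=lambda m: m[1], reverse=True)[:n]
        for context, members in by_context.items()
    }

rankings = top_n_customers(transactions, n=10)
```

Note that each (week, product) pair gets its own list -- exactly the "20 top ten lists per week for 20 products" described above.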
That may sound simple, but things are usually a bit more complicated.

The reader, for example, does not just need a single ranking. He actually needs several -- daily, weekly, monthly. Presumably, these may be mixed and matched with the other dimensions he mentioned -- account and channel. In addition, there may be different rankings based on different facts.

Compute vs. Store

It may have occurred to you at this point that there are a multitude of possible rankings, even for a relatively simple schema.

In addition, once we compute a ranking, there is very little we can do with it. For example, we cannot sum it up over time periods or other dimensions. We may be able to compare it to another ranking (for a different period or dimension, say), or treat it as reference data describing the thing being ranked. But that's about it.

For these reasons, many organizations choose to compute rankings within reports, or at query time.  This provides maximum flexibility, and also sidesteps the ugly problem of having to store information that cannot be aggregated.
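The flexibility of query-time computation is easy to see in a sketch. A single generic routine can rank any dimension, within any combination of dimensions, by any fact -- no ranking ever needs to be stored. (In SQL, window functions such as RANK() OVER (PARTITION BY ...) serve the same purpose.) The column names here are hypothetical:

```python
from collections import defaultdict

def rank_at_query_time(rows, context_keys, ranked_key, fact_key, n=10):
    """Compute a Top N ranking on the fly for any dimensional context."""
    # Total the driving fact per ranked member, within each context.
    totals = defaultdict(float)
    for row in rows:
        context = tuple(row[k] for k in context_keys)
        totals[(context, row[ranked_key])] += row[fact_key]

    grouped = defaultdict(list)
    for (context, member), total in totals.items():
        grouped[context].append((member, total))

    return {
        ctx: sorted(members, key=lambda m: m[1], reverse=True)[:n]
        for ctx, members in grouped.items()
    }

# Mix and match dimensions at will: here, accounts by month and channel.
rows = [
    {"month": "2012-10", "channel": "web", "account": "A1", "spend": 100.0},
    {"month": "2012-10", "channel": "web", "account": "A2", "spend": 250.0},
]
top_accounts = rank_at_query_time(rows, ["month", "channel"], "account", "spend")
```

The same call with different context_keys would produce daily or weekly rankings, or rankings by channel alone -- which is precisely why query-time computation sidesteps the combinatorial explosion of stored rankings.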

However, the reader indicates that data volumes are large. Let's assume they are so large that it is not feasible to keep the granular data around to compute rankings. Let's also assume that the rankings he needs are consistent over time. In his case, it may be necessary to compute all needed rankings and then store them.

Rankings fit with Snapshots or Aggregates

A ranking is only useful within the original context of its calculation. We won't really be able to aggregate it. In this respect, it is similar to a distinct count or a period-to-date measurement -- topics I have addressed in the past. (Some period-to-date metrics can be aggregated within a single period, but otherwise the comparison holds.)

The recommended solution is similar as well. Establish the granularity of the various rankings, then keep track of them in fact tables of appropriate grain. These will either be periodic snapshots, or aggregate tables that summarize appropriate time periods. The rank itself will be stored in the fact table, even though it is not additive.
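To make the grain concrete, here is a sketch that flattens precomputed weekly rankings into rows for an aggregate fact table. The column names are hypothetical; the point is that the non-additive rank sits alongside an additive fact at a well-defined grain:

```python
def build_weekly_rank_snapshot(rankings):
    """Flatten {(week, product): [(customer, total), ...]} into fact rows.

    Grain: one row per week, per product, per ranked customer.
    """
    fact_rows = []
    for (week, product), ranked in rankings.items():
        for rank, (customer, total) in enumerate(ranked, start=1):
            fact_rows.append({
                "week_key":     week,
                "product_key":  product,
                "customer_key": customer,
                "spend":        total,  # additive fact
                "spend_rank":   rank,   # non-additive: valid only at this grain
            })
    return fact_rows

# Example input: one week's Top N lists, already computed.
rankings = {
    ("2012-W40", "Widget"): [("Acme", 500.0), ("Globex", 300.0)],
}
fact_rows = build_weekly_rank_snapshot(rankings)
```

A daily or monthly ranking would get its own fact table at the corresponding grain, rather than being rolled up from this one.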

Many developers balk at this solution, observing that it increases the number of fact tables that must be maintained. But note that these fact tables do not introduce any new dimension tables, and they can usually be sourced from transaction-grained stars. And the alternatives have even less appeal.

Rankings don't fit with Transactions

Most of the time, it is not useful to store rankings with transactions. Unless all the things being ranked happen to have transactions that occur at the same time, it will be impossible to compare the ranking with anything -- including other things being ranked!

Resist abstracted models

It is tempting to come up with designs that abstract the concept of a ranking. For example, one can imagine setting up a dimension that describes a ranking, the dimension being ranked, and the period. A fact table can be set up that carries the key for this metadata, plus 10 or 20 keys that will be used to link to top 1, top 2, etc.

Abstracted designs are appealing because they may appear to limit the number of stars to be maintained and don't require schema changes when new rankings are added. But note that the same amount of work must be done to load each ranking.

More importantly, these abstracted designs severely hamper usability. With such a design, it is much harder to merge information about a ranking with other data. A report that contains product-level detail, for example, cannot easily pull in information indicating that the product appeared in a Top 10 ranking -- doing so would require decoding the metadata and then checking each of the various keys.
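A small sketch shows the awkwardness. With hypothetical positional columns, even the simple question "did this member appear in the Top 10?" forces a scan of every key:

```python
# Abstracted design: one row per ranking, members stored in positional keys.
abstract_row = {
    "ranking_meta_key": 17,  # must be decoded to learn what was ranked, and when
    "rank_1_key": "Acme",
    "rank_2_key": "Globex",
    # ... rank_3_key through rank_10_key would follow
}

def appears_in_top_10(row, member_key):
    """Check every positional key -- there is no single column to join or filter on."""
    return any(row.get(f"rank_{i}_key") == member_key for i in range(1, 11))
```

Contrast this with the snapshot design above, where rank is an ordinary fact-table column that joins and filters like any other.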

Further Reading

As mentioned, some of the challenges related to rankings are similar to those involving distinct counts or period-to-date metrics.  I've covered these topics before:


I also touch on the topic in Chapter 8 of  Star Schema: The Complete Reference, where I address the proper home of a period-to-date measurement.  Use the link to purchase a copy, and you will be helping to support this blog.


