Wednesday, October 10, 2012

Handling Rankings

Dimensional modelers often struggle with issues surrounding rankings. Learn when to store them, where to store them, and where not to store them, by reading this post.
We need to calculate Top N rankings across several dimensions (account, channel, hour, day, week, month.) There are massive data sets involved. This must be a common problem. Is there an approach you can point us towards?
- Jeremy
Rankings are time and context dependent. When you need to track them, they pose challenges that are similar to non-additive facts such as period-to-date measurements or distinct counts.

The solutions, it turns out, are also similar.

Rankings and their Context

When thinking about how to handle rankings, it is useful to break them down. A ranking is actually made up of several parts. There is the dimension being ranked, the dimensions for which the ranking will be computed,  and the fact that is the source of the ranking.

For example, look at this simple statement:

"Show me the top 10 customers by week and product."
  • The dimension being ranked here is Customer.  
  • The dimensional context of each ranking is a product and week. For example, if we have 20 products, then each week we will need 20 top ten lists -- one for each product.
  • The fact driving the ranking is not explicit in this question. It is probably spending, though it may be based on margin or transaction counts.  
That may sound simple, but things are usually a bit more complicated.

The reader, for example, does not just need a single ranking. He actually needs several -- daily, weekly, monthly. Presumably, these may be mixed and matched with the other dimensions he mentioned -- account and channel. In addition, there may be different rankings based on different facts.

Compute vs. Store

It may have occurred to you at this point that there are a multitude of possible rankings, even for a relatively simple schema.

In addition, once we compute a ranking, there is very little we can do with it. For example, we cannot sum it up over time periods, or other dimensions.  We may be able to compare it to another ranking (a different period or dimension, for example), or treat it as reference data describing the thing being ranked. But that's about it.

For these reasons, many organizations choose to compute rankings within reports, or at query time.  This provides maximum flexibility, and also sidesteps the ugly problem of having to store information that cannot be aggregated.

However, the reader indicates that data volumes are large. Let's assume they are so large that it is not feasible to keep the granular data around to compute rankings. Let's also assume that the rankings he needs are consistent over time. In his case, it may be necessary to compute all needed rankings and then store them.

Rankings fit with Snapshots or Aggregates

A ranking is only useful within the original context of its calculation. We won't really be able to aggregate it.  In this respect, its is similar to a distinct count or a period-to-date measurement -- topics I have addressed in the past.  (Some period-to-date metrics can be aggregated within a single period, but otherwise the comparison holds.)

The recommended solution is similar as well. Establish the granularity of the various rankings, then keep track of them fact tables of appropriate grain. These will either be periodic snapshots, or aggregate tables that summarize appropriate time periods. The rank itself will be stored in the fact table, even though it is not additive.

Many developers balk at this solution, observing that it increases the number of fact tables that must be maintained. But note that these fact tables do not introduce any new dimension tables, and they can usually be sourced from transaction-grained stars. And the alternatives have even less appeal.

Rankings don't fit with Transactions

Most of the time, it is not useful to store rankings with transactions. Unless all the things being ranked happen to have transactions that occur at the same time, it will be impossible to compare the ranking with anything -- including other things being ranked!

Resist abstracted models

It is tempting to come up with designs that abstract the concept of a ranking. For example, one can imagine setting up a dimension that describes a ranking, the dimension being ranked, and the period. A fact table can be set up that carries the key for this metadata, plus 10 or 20 keys that will be used to link to top 1, top 2, etc.

Abstracted designs are appealing because they may appear to limit the number of stars to be maintained and don't require schema changes when new rankings are added. But note that the same amount of work must be done to load each ranking.

More importantly, these abstracted designs severely hamper usability. With such a design, it is much harder to merge information about a ranking with other data. A report that contains product level detail, for example, cannot easily pull in info indicating the product appeared in a Top 10 ranking -- this would require decoding metadata and then checking each of the various keys.

Further Reading

As mentioned, some of the challenges related to rankings are similar to those involving distinct counts or period-to-date metrics.  I've covered these topics before:


I also touch on the topic in Chapter 8 of  Star Schema: The Complete Reference, where I address the proper home of a period-to-date measurement.  Use the link to purchase a copy, and you will be helping to support this blog.



Image licensed via Creative Commons 2.0



Friday, September 21, 2012

Business Analytics at Work, Almost

Back in February, business analytics briefly broke into the mainstream consciousness thanks to an article in the New York Times Magazine called How Companies Learn Your Secrets by Charles Duhigg.

When I teach, I frequently use an example from this article. I get lots of email asking me for more information, so I figured I might as well post it here.

Breaking Habits

Duhigg's purpose is to show that human behavior is driven by habit. The key to changing behavior is understanding the underlying habits. 

He tells several stories where businesses use this information to drive positive results.

One of his stories happens to touch on predictive analytics.  

The setup is simple: back in 2002, marketers at Target had realized that shopping habits are disrupted with the arrival of a first-born. Bring expecting parents into stores, and you have an opportunity to cultivate new patterns. The result: new lifetime customers for all manners of product--not just baby items.

The key, then, was to identify people who might be pregnant. With the help of a team of statisticians, Target figured out that they could do just that.

Enter Analytics

Duhigg describes how Target tapped into their vast data reserves, and eventually came up with a market basket profile that was a good predictor of pregnancy. This profile could then be applied to their customer base to identify thousands of people who were likely to be expecting.

They were surprisingly effective. Duhigg tells the story of a gentleman berating a store manager for marketing baby products to his teenage daughter:

“She’s still in high school, and you’re sending her coupons for baby clothes and cribs? Are you trying to encourage her to get pregnant?”

The manager didn’t have any idea what the man was talking about. He looked at the mailer. Sure enough, it was addressed to the man’s daughter and contained advertisements for maternity clothing, nursery furniture and pictures of smiling infants. The manager apologized and then called a few days later to apologize again.

On the phone, though, the father was somewhat abashed. “I had a talk with my daughter,” he said. “It turns out there’s been some activities in my house I haven’t been completely aware of. She’s due in August. I owe you an apology.”
The analytics, it seems, did their job. But did they really?

Analytics and Business Strategy

This anecdote went viral when the Times published Duhigg's article. Ethical issues notwithstanding, it offered an updated version of the apocryphal  "beer and diapers" tale from the 1990's.

But there's more here than market basket analysis. This story illustrates a key aspect of predictive analytics: its important to engage with the owner of the business problem. In this case, it seems, that did not happen. Instead, someone made a tactical decision about how to act on analytics.

In this case,

  • A business analyst had already identified something important: pregnancy. 
  • An analytic modeler took things from there, and developed a way to identify pregnant shoppers. 
  • The analyst chose to mail to people identified by the analytic model.

This last step is where the fumble took place. Every analytic initiative must be connected to the owner of the business problem. This person may not be actively engaged in the analytics, but it is her job to decide what to do with the insights. Deciding how to act on analytics is a key part of the process. It involves communication, and requires the attention of executives who are responsible for business objectives.

The Book

Duhigg has now published a book on the same topic. You might want to check it out.  Just remember, this is not a book about analytics -- it's about habits.


The Power of Habit: Why We Do What We Do in Life and Business by Charles Duhigg.  (Random House, 2012)




Monday, July 16, 2012

Q&A: Does Master-Detail Require Multiple Stars in a Travel Data Mart

A reader asks whether master-detail relationships call for multiple fact tables.

Q: Is there some general principle to use as a guide when deciding to use 1 star or 2 for master-detail situations?

We have a master-detail situation involving trips, and multiple trip expenses per trip. Many of our queries are at the trip level - eg. how many trips are for clients in their 30's going to Arizona in July. Many of the queries are at the detail level -- how much is spent on means by the client's age range?

Both levels of query can be answered with one star, with the fact at the detail level (trip expense line). For trip-level questions - you can count distinct occurrences of the natural key in the trip dimension. But it seems that grinding through all those trip expenses when I just wanted to count trips is inefficient.

So do we have 2 fact tables, one at the trip level, one at the trip expense level? Do the dimensions surround both of them?  Is there a rule?


- Ellen, Ottawa
A: As the reader observes, this is a common situation: when there is a master-detail relationship, is there a need for multiple fact tables?

This case calls for at least two fact tables.
Read on for the details.

Photo by o5com, licensed by Creative Commons

Multiple Fact Tables

After learning the basics of dimensional modeling, the first real world challenge we face is understanding when and how to design multiple fact tables.  Until we learn to think in dimensional terms, the choice can be difficult.

I suggest starting with some basic guidelines.  You probably need different fact tables if:
  1. You have measurements with different periodicity
  2. You have measurements with different levels of detail
The first guideline suggests that if facts do not describe the same event, they probably belong in different fact tables. For example, orders and shipments do not always happen at the same time.  Order dollars and shipment dollars belong in separate fact tables.

The second guideline pertains to facts that do describe the same events.  Information about an order and information about an order line are ostensibly available at the same time, but they have different levels of detail. If there are facts at both of these levels, there will need to be multiple fact tables.

So how would you apply these guidelines?

Think Dimensionally

It is easiest to work these questions out by thinking dimensionally. Forget about master-detail or parent-child for a few minutes, and consider what is being measured. 
  • What are the business questions being asked, and what business metrics do they contain?  
  • Are these metrics available simultaneously?
  • At what level of detail are these metrics available?
The reader cited business questions in her example.  These questions reveal at least two metrics:  number of trips, and expense dollars.

These metrics may be available on different schedules -- for example, a trip may commence before we have the details of all the expenses incurred.  This would argue for multiple fact tables.

But let's suppose that the single source of data is expense reports. We do not have any information about the trip until it is over, at which point we have all the expense details.

In this case, lets think about the level of detail of these metrics. Number of trips and expense dollars seem to share a lot of common detail -- the traveller, the destination, the trip start and end date, and so forth.  

But expense dollars have some additional detail -- the date of the expense item, and an expense category. Since these facts have different levels of detail, it makes sense to include them in separate fact tables.

Factless Fact Table

The trip-level star may contain a factless fact table.  It contains one row per trip, with no explicit facts.  We can determine the number of trips that answer a business question simply by counting rows.

Many teams would prefer to add a fact called number_of_trips and always populate it with the value 1.  This is useful, because it makes the business metric explicit in your schema design.  (I've written about this technique before.)

The trip level star may also contain some summary level facts that describe expenses -- say the total trip cost, the total transportation cost, total meals cost, etc.  More detail on these metrics (such as the vendor, payment method, etc.) can found in the expense item star.

Deeper into Trip Analysis

Digging deeper, the reader may discover she needs more than two fact tables.  

Trips often include more than one destination.  When it is necessary to study the various segments (or "legs") of a trip, we can define a third fact table that contains one row per segment.

In this case, the reader would have three fact tables:
  1. trip_facts  Grain: one row per trip.  Metrics: Number of trips (always one)
  2. trip_segment_facts  Grain: one row per destination city.  Metrics: number of segments (always one)
  3. trip_expense_facts  Grain: one row per expense item.  Metrics: expense_dollars
Each of these stars will share some common dimensions -- the trip origination date, the trip end date, the traveller, the trip identifier, etc.

Trip_segment_facts will also include destination-specific dimensions: the destination city of the segment, the arrival and departure dates for the segment, etc.

Trip_expense_facts will include dimensions that describe the individual expense items: the date paid, the payment method, the payee, the expense category and subcategory, and so forth.

A conformance matrix will help you keep track of the dimensionality of each star.

Learn More

Thanks to Ellen for the thoughtful question.  If you have a question of your own, please send it in.

Pick up a copy of Star Schema: The Complete Reference and you will be helping support this blog. 

  • Chapter 4 is dedicated to designing multiple fact tables.  
  • Chapter 5 looks closely at conformed dimensions. 
  • Chapter 12 covers faceless fact tables.
You can learn more about these topics by referring to these posts:
  • Factless Fact Tables (September 15, 2011) describes faceless fact tables that describe events, and the use of a fact with the constant value of 1.
  • The Conformance Matrix (June 5, 2012) describes how to represent the dimensionality of your stars in a matrix format.

The photo in this post is by o5com, licensed by Creative Commons