Thursday, November 14, 2013

Facebook's Ken Rudin on Analytics

If you are interested in how business analytics impact your BI program, carve out forty-five minutes to watch Ken Rudin's recent TDWI keynote: "Big Data, Bigger Impact." The video is embedded below.

Rudin is the director of analytics at Facebook. In his presentation, he discusses several topics that are of interest to readers of this blog. Among them:
  • Big data technology should be used to extend your traditional BI solution, not replace it. Facebook has realized this, and is working to bring in relational technology to answer traditional business questions.
  • Successful analytics programs bring together centrally managed core data metrics with a variety of data that is not centrally managed. Rudin shares different ways he has been able to make this happen.
  • A similar balance can be attained with your organizational structure. Use of "embedded analysts" provides the business benefits of decentralization, while maintaining the efficiencies and scale advantages of a centralized program.
These are just a few of the points made during his talk. If you don't have the time to watch it now, bookmark this page for later.

You'll also want to check out Wayne Eckerson's latest book, Secrets of Analytical Leaders. (Details below.)

Big Data, Bigger Impact
Ken Rudin
TDWI World Conference, Chicago 5/6/2013




Recommended Reading

Wayne Eckerson's excellent book, Secrets of Analytical Leaders, features more insights from Ken Rudin and others.

I highly recommend this book if you are interested in analytics.

Get it from Amazon.com in paperback or Kindle editions.


Wednesday, September 25, 2013

Optimizing warehouse data for business analytics

Business analytics often integrate information from your data warehouse with other sources of data. This post looks at the best practices of warehouse design that make this possible.

I receive a lot of questions regarding the best way to structure warehouse data to support an analytics program. The answer is simple: follow the same best practices you've already learned.

I'll cover these practices from a dimensional modeling perspective. Keep in mind that they apply in any data warehouse, including those modeled in third normal form.


1. Store Granular Facts

Analytic modelers often choose sources external to the data warehouse, even when the warehouse seems to contain relevant data. The number one reason for this is insufficient detail. The warehouse contains summarized data; the analytic model requires detail.

In this situation, the analytic modeler has no choice but to look elsewhere. Worse, she may be forced to build redundant processes to transform source data and compile history. Luckily, this is not a failure of warehouse design principles; it's a failure to follow standard best practices.

Best practices of dimensional design dictate that we set the grain of base fact tables at the lowest level of detail possible. Need a daily summary of sales? Store the individual order lines. Asked to track the cost of trips? Store detail about each leg.

Dimensional solutions can contain summarized data. This takes the form of cubes, aggregates, or derived schemas. But these summaries should be derived exclusively from detailed data that also lives in the warehouse.

Like all rules, this rule has exceptions. There are times when the cost/benefit calculus is such that it doesn't make sense to house highly granular data indefinitely. But more often than not, summary data is stored simply because basic best practices were not followed.
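
To make the principle concrete, here is a minimal sketch in Python using SQLite. The table and column names are hypothetical, not drawn from any particular warehouse; the point is that the fact table stores individual order lines, and the daily summary is derived from them rather than stored as the base grain.

```python
import sqlite3

# Hypothetical order-line fact table, set at the lowest grain available:
# one row per line of each order.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE order_line_facts (
        order_id     INTEGER,
        order_line   INTEGER,
        day_key      TEXT,
        product_key  INTEGER,
        quantity     INTEGER,
        sale_amount  REAL
    )
""")
conn.executemany(
    "INSERT INTO order_line_facts VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1001, 1, "2013-09-25", 7, 2, 19.98),
        (1001, 2, "2013-09-25", 9, 1, 5.49),
        (1002, 1, "2013-09-25", 7, 1, 9.99),
    ],
)

# The daily summary is derived from the detail on demand
# (or materialized as an aggregate), never stored as the base grain.
daily = conn.execute("""
    SELECT day_key, SUM(sale_amount) AS daily_sales
    FROM order_line_facts
    GROUP BY day_key
""").fetchall()
print(daily)
```

Because the order lines are retained, the same table can answer tomorrow's question about individual products or orders, not just today's question about daily totals.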

2. Build “Wide” Dimensions

The more attributes there are in your reference data (aka dimensions), the more useful source material there is for analytic discovery. So build dimensions that are full of attributes -- as many as you can find.

If the grain of your fact table gives the analytics team “observations” to work on, the dimensions give them “variables.”  And the more variables there are, the better the odds of finding useful associations, correlations, or influences.

Luckily, this too is already a best practice. Unfortunately, it is one that is often misunderstood and violated. Misguided modelers frequently break things down into the essential pieces only, or model just to specific requirements.  
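
As an illustration, here is a hypothetical "wide" product dimension sketched in Python. Every attribute name here is invented. The point is that each extra attribute is another candidate variable for analytic discovery, while a requirements-only design throws those variables away.

```python
# A deliberately "wide" product dimension row (all attributes hypothetical).
# Each attribute is a potential variable for the analytics team,
# even if no current report asks for it.
wide_product = {
    "product_key": 7,
    "sku": "TNT-0042",
    "product_name": "Trail Mix, 12 oz",
    "brand": "TrailNut",
    "category": "Snacks",
    "package_type": "Resealable bag",
    "package_size_oz": 12,
    "organic_flag": "Organic",
    "introduction_date": "2012-03-01",
    "supplier_region": "Northwest",
}

# A design modeled only to specific requirements keeps the essentials
# and discards the rest -- and with them, the candidate variables.
minimal_product = {k: wide_product[k] for k in ("product_key", "product_name")}

print(len(wide_product), len(minimal_product))
```

Does package type influence repeat purchases? With the wide design that question can be explored; with the minimal design it cannot.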

3. Track Changes to Reference Data (and Use Effective Dating)

When reference data changes, too many dimensional models respond by simply updating the corresponding dimension rows, because it is easier.

For example, suppose your company re-brands a product. It's still the same product, but with a new name. You may be tempted to simply update the reference data in your data warehouse. This is easier than tracking changes.  It may even seem to make business sense, because 90% of your reports require this-year-versus-last comparison by product name.

Unfortunately, some very important analysis may require understanding how consumer behavior correlates with the product name. You've lost this in your data set. Best practices help avoid these problems.

Dimensional models should track the change history of reference data. In dimensional speak, this means applying type 2 slow changes as a rule. This preserves the historic context of every fact recorded in the fact table.

In addition, every row in a dimension table should carry "effective" and "expiration" dates, as well as a flag that identifies the current row. This enables the delivery of type 1 behavior (the current value) even as we store type 2 history. From an analytic perspective, it also enables useful "what if" analysis.

As with all rules, again there are exceptions. In some cases, there may be good reason not to respond to changes in reference data by tracking history. But more often than not, type 1 responses are chosen for the wrong reason: because they are easier to implement.
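
Here is a minimal Python sketch of the type 2 response described above, using invented product data. On a change, it expires the current row, stamps effective and expiration dates, and inserts a new version flagged as current -- so history is preserved rather than overwritten.

```python
from datetime import date

FUTURE = date(9999, 12, 31)  # conventional "not yet expired" date

def apply_type2_change(dim, product_id, new_attributes, change_date):
    """Type 2 slow change: expire the current row, insert a new version."""
    current = next(
        r for r in dim if r["product_id"] == product_id and r["current_flag"]
    )
    current["expiration_date"] = change_date
    current["current_flag"] = False

    new_row = dict(current, **new_attributes)
    new_row["product_key"] = max(r["product_key"] for r in dim) + 1  # new surrogate key
    new_row["effective_date"] = change_date
    new_row["expiration_date"] = FUTURE
    new_row["current_flag"] = True
    dim.append(new_row)

# Hypothetical dimension with surrogate key, natural key, and effective dating.
product_dim = [{
    "product_key": 1, "product_id": "A100",
    "product_name": "Acme Cola",
    "effective_date": date(2012, 1, 1),
    "expiration_date": FUTURE, "current_flag": True,
}]

# The product is re-branded: both names survive, each with its date range.
apply_type2_change(product_dim, "A100",
                   {"product_name": "Acme Cola Classic"}, date(2013, 6, 1))

print([r["product_name"] for r in product_dim])
```

Facts recorded before the re-branding remain linked to the old row (and old name), while the current flag still supports type 1 style "current value" reporting.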

4. Record Identifying Information, Including Alternate Identifiers

Good dimensional models allow us to trace back to the original source data. To do this, include transaction identifiers (real or manufactured) in fact tables, and maintain identifiers from source systems in dimension tables (these are called "natural keys").

Some of this is just plain necessary in order to get a dimensional schema loaded. For example, if we are tracking changes to a product name in a dimension, we may have multiple rows for a given product. The product's identifier no longer uniquely identifies a dimension row, but we must still have access to it. Without it, it would be impossible to load facts into the fact table.

Identifying information is also essential for business analytics. Data from the warehouse is likely to be combined with data that comes from other places. These identifiers are the connectors that allow analytic modelers to do this.  Without them, it may become necessary to bypass the warehouse.

Your analytic efforts, however, may require blending new data with your enterprise data. And that new data may not come with handy identifiers. You have a better chance blending it with enterprise data if your warehouse also includes alternate identifiers, which can be used to do matching. Include things like phone numbers, email addresses, geographic coordinates—anything that will give the analytics effort a fighting chance of linking up data sources.
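
Here is a small Python sketch of this kind of matching, with invented customer and feed data. The external feed knows nothing about warehouse surrogate keys, so an alternate identifier -- email address -- is used to link the two data sets.

```python
# Hypothetical warehouse dimension rows, carrying surrogate key,
# natural key, and an alternate identifier (email).
customer_dim = [
    {"customer_key": 1, "customer_id": "C-100",
     "name": "Pat Smith", "email": "pat@example.com"},
    {"customer_key": 2, "customer_id": "C-101",
     "name": "Lee Jones", "email": "lee@example.com"},
]

# External data with no warehouse keys at all -- only an email address.
external_feed = [
    {"email": "pat@example.com", "sentiment_score": 0.9},
    {"email": "unknown@example.com", "sentiment_score": 0.2},
]

# Match on the alternate identifier to attach the warehouse key,
# so the blended record can join back to warehouse facts.
by_email = {row["email"]: row for row in customer_dim}
matched = [
    dict(ext, customer_key=by_email[ext["email"]]["customer_key"])
    for ext in external_feed
    if ext["email"] in by_email
]
print(matched)
```

Without the email column in the dimension, the feed could not be blended with warehouse data at all, and the analytics effort would have to bypass the warehouse entirely.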

Summary

If you've been following the best practices of dimensional modeling, you've produced an asset that maximizes value for analytic modelers:

  • You have granular, detailed event data.
  • You have rich, detailed reference data.
  • You are tracking and time-stamping changes to reference data. 
  • You've got transaction identifiers, business keys, and alternate identifiers.  

It also goes without saying that conformed dimensions are crucial if you hope to sustain a program of business analytics.

Of course, there are other considerations that may cause an analytic modeler to turn her back on the data warehouse. Latency issues, for example, may steer her to operational solutions. Accessibility and procedural issues, too, may get in the way of the analytic process.

But from a database design perspective, the message is simple: follow those best practices!

Further Reading

You can also read more in my book, Star Schema: The Complete Reference. If you use the links on this page to pick up a copy on Amazon, you will be helping support this blog. It covers the best practices of dimensional design in depth. For example:

  • Grain, identifiers, keys and basic slow change techniques are covered in Chapter 3, "Stars and Cubes"
  • The place of summary data is covered in Chapter 14, "Derived Schemas" and Chapter 15, "Aggregates"
  • Conformance is covered in Chapter 5, "Conformed Dimensions"
  • Advanced slow change techniques are explored in Chapter 8, "More Slow Change Techniques"

Wednesday, July 17, 2013

Business Analytics and Dimensional Data

Readers of this blog frequently ask about the relationship of business analytics to the dimensional data that is recorded in data marts and the data warehouse.

Business analytics operate on data that often does not come from the data warehouse. The value of business analytics, however, is measured by its impact on business metrics that are tracked in the data warehouse. 

Business analytics may also help adjust our notion of which metrics matter the most.

The Data Warehouse and Dimensional Data

The dimensional model is the focal point of business information in the data warehouse. It describes how we track business activities and measure business performance. It may also be the foundation for a performance management program that links metrics to business goals.

Dimensional data is the definitive record of what matters to the business about activities and status. Clearly defined performance indicators (facts) are recorded consistently and cross referenced with standardized and conformed reference data (dimensions).

In this post, when I talk about "the data warehouse," I will have this dimensional data in mind.

Business Analytics

Business analytics seek to provide new insight into business activities. Analytics do not always operate on business metrics, and they don't rely exclusively on information from the data warehouse. Dimensional information may be an input, but other sources of data are also drawn upon.

The outputs of business analytics, however, aim directly at the metrics tracked by our dimensional models. Insights from analytics are used by people to move key metrics in the desired directions. These results are called impacts.

Business analytics may also help in another way. Sometimes, they help us determine which metrics are actually the most important.

A great illustration of these dynamics can be found in the business of Major League Baseball. (If you don't follow baseball, don't worry. You don't have to understand baseball to follow this example.)

Metrics in Baseball

Major league baseball has long been in the business of measurement. Followers of the game are familiar with the "box score" that summarizes each game, "standings" that illustrate the relative performance of teams, and numerous statistics that describe the performance of each player.

These metrics have precise definitions and have been recorded consistently for almost 150 years.[1] Like the metrics in your data warehouse, they are tracked systematically. Professional baseball teams can also set goals for these metrics and compare them to results, much like a scorecard in your performance management program.

How does one improve these results? If you run a baseball team, part of the answer lies in how you choose players. In the book Moneyball,[2] Michael Lewis describes how the Oakland Athletics used a set of techniques known as sabermetrics[3] to make smarter choices about which players to add to their roster.

These analytics allowed the A's to make smarter choices with measurable impact--improving performance and reducing costs. Analytics also motivated the A's to change the emphasis given to various metrics.

Business Analytics and the Oakland Athletics

The traditional approach to selecting players was based on long-held conventional wisdom about what makes a valuable player. For example, offensive value was generally held to derive from the ability to make contact with the baseball and from a player's speed. These skills are at least partially evident in some of the standard baseball metrics -- things like the batting average, stolen bases, runs batted in, and sacrifices.

The Oakland A's looked to data to refine their notion of what a valuable player looks like. How do the things players do actually contribute to a win or loss? To do this, the A's went beyond the box scores and statistics -- beyond the data warehouse, so to speak.

By studying every action that is a part of the game -- what players are on base, what kind of pitches are thrown, where the ball lands when it is hit, etc -- the A's realized they could be smarter about assessing how a player adds value. These business analytics led to several useful conclusions:
  • Batting averages don't tell the whole story about a player's ability to get on base; for example, they exclude walks.
  • Stolen bases don't always contribute to scoring; much depends on who comes to bat next.
  • Runs batted in tell as much about who hits before a player as they do about the player himself.
  • Sacrifices, where an out is recorded but a runner advances, were found to contribute less to the outcome of a game than conventional wisdom held.
You may or may not understand these conclusions, but here is the important thing: the analytics suggested that the A's could better assess a player's impact on winning games by turning away from conventional wisdom. Contact and speed are not the best predictors of winning games. "Patience at the plate" leads to better outcomes.
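
The walks example can be made concrete with the standard formulas. Batting average is hits divided by at-bats, which ignores walks entirely; on-base percentage also credits walks and hit-by-pitch. The player statistics below are invented to show how two identical batting averages can hide very different on-base skills.

```python
def batting_average(hits, at_bats):
    # Batting average ignores walks and hit-by-pitch entirely.
    return hits / at_bats

def on_base_percentage(hits, walks, hbp, at_bats, sac_flies):
    # OBP = (H + BB + HBP) / (AB + BB + HBP + SF):
    # every time on base counts, however the batter got there.
    return (hits + walks + hbp) / (at_bats + walks + hbp + sac_flies)

# Two hypothetical players with the same batting average:
# a patient hitter who draws many walks, and a free swinger.
patient = {"hits": 140, "walks": 90, "hbp": 5, "at_bats": 500, "sac_flies": 5}
swinger = {"hits": 140, "walks": 20, "hbp": 2, "at_bats": 500, "sac_flies": 5}

print(batting_average(140, 500))        # identical for both players
print(on_base_percentage(**patient))    # noticeably higher
print(on_base_percentage(**swinger))
```

Batting average rates both players at .280, but the patient hitter reaches base far more often -- exactly the kind of value the traditional metric hides.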

Impact for the A's

By using these insights to make choices, the A's were able to select less expensive players who could make a more significant contribution to team results. These choices resulted in measurable improvement in many of the standard metrics of baseball--the win/loss ratio in particular. These insights also enabled them to deliver improved financial results.

Analytics also helped the A's in another way: they refined exactly which metrics they should be tracking. For example, in assessing offensive value, on-base percentage should be emphasized over batting average. They also created some of their own metrics to track their performance over time.

The Impact of Analytics

Business analytics tell us what to look for, what works, or what might happen. Examples are signs of impending churn, what makes a web site "sticky", patterns that might indicate fraud, and so forth.

These insights, in turn, are applied in making business decisions. These choices provide valuable impact that can be measured by tracking traditional business metrics. Examples include increased retention rates, reduced costs associated with fraud, and so forth.

These impacts are the desired outcome of the analytic program. If the analytics don't have a demonstrable impact on metrics, they are not providing value.

Business analytics can also help us revise our notion of what to track in our data warehouses, or which metrics to pay closest attention to. Number of calls to the support center, for example, may be less of an indicator of customer satisfaction than the average time to resolve an issue.

Conclusion

As you expand the scope of your BI program to include analytics, remember that your desired outcome is a positive impact on results. Move the needle on business metrics, and the analytics have done their job.

Thanks to my colleague Mark Peco, for suggesting that I use Moneyball as a way to explain analytics without revealing the proprietary insights attained by my customers. 

Notes

[1] The box score and many of these statistics were established in the mid-1800s by a sportswriter named Henry Chadwick.

[2] Moneyball by Michael Lewis (Norton, 2011).

[3] The Oakland A's are a high-profile example of the use of sabermetrics, but did not originate the concept. See Wikipedia for more information.
