
Monday, June 5, 2017

In Praise of the Whiteboard

Modeling tools are great, but early stage modeling activities are best conducted on a whiteboard. It is better for collaborative development, and handles rapid change more effectively. And there are inexpensive solutions if one is not readily available.

When I lead seminars that cover modeling techniques, this question always comes up: “What’s the best modeling tool?” 

My answer: Start modeling activities on a whiteboard. Break out the software later, once the model has stabilized.

Using a whiteboard, you will get to a better solution, faster. 

I find this to be true across the spectrum of BI project types and modeling techniques. Examples include:
  • Dimensional models (OLAP projects)
  • Strategy maps (Performance Management projects)
  • Influence Diagrams (Business Analytics projects)
  • Causal loop diagrams (Business Analytics projects)
There are two reasons a whiteboard works best for this kind of work: it supports collaboration, and it is better suited to the rapid changes common in early stage modeling. In short, a whiteboard is inherently agile.

Collaboration

The best models are produced by small groups, not individuals. Collaboration generates useful and creative ideas which reach beyond what a seasoned modeler can do alone. 

Each of the techniques listed above requires collaboration between business and technical personnel. And within either of these realms, a diversity of perspectives always produces better results. Brainstorming is the name of the game.

Use of a modeling tool quashes the creativity and spontaneity of brainstorming sessions. You may have experienced this yourself.

Imagine five people in a room, one person’s laptop connected to a projector. Four people call out ideas, but the facilitator with the laptop can only respond to one at a time. The session becomes frustrating to all participants, no matter how good the facilitator is. The team loses ideas, and participants lose enthusiasm.

Now imagine the same five people in front of a whiteboard, each holding a pen. Everyone is able to get their ideas onto the board. While this may seem like anarchy, it helps ensure that no ideas are lost and it keeps everyone engaged. The result is always a better model, developed faster.

Rapid change

The other reason to start on a whiteboard is practical: it is easy to erase, change, and redraw. And you will be doing a lot of these things if you are collaborating.

Imagine a group is sketching out a model, and decides to make a major change. Perhaps one fact table is to be split into two. Or a single input parameter is to be decomposed into four. If a modeling tool is in use, making the change will require deletion of elements, addition of new elements, and perhaps a few dialog boxes, check boxes, and warning messages. The tool gets in the way of the creative process.

Now imagine a whiteboard is in use. A couple of boxes are drawn, some lines erased, some new lines added. The free flow of ideas continues, uninterrupted. Once again, this is what you want.

Unimpeded collaboration produces better results, faster.

No whiteboard? No problem.

Don’t let the lack of a whiteboard in your cube or project room stop you.

There are many brands of inexpensive whiteboard sheets that cling to the wall. These have the additional benefit of being easy to relocate if you are forced to move to another room. One example is available on Amazon.com.

There are also several kinds of whiteboard-style notebooks. These are useful if you are working alone or in a group of two. They provide the same benefit of being able to collaborate and quickly make changes, but in a smaller format. The one that I carry is called Wipebook.


I learned about these and similar solutions from clients and students in my seminars.

…and then the tool

All this is not to say that modeling tools are bad. To the contrary, they are essential. Once the ideas have been firmed up, a modeling tool is the next step.

Modeling tools allow you to get ideas into a form that can be reviewed and revised. They support division of labor for doing required “grunt work” — such as filling in business definitions, technical characteristics, and other metadata. And they produce useful documentation for developers, maintainers, and consumers of your solutions.

But when you’re getting started, use a whiteboard!

Sunday, March 19, 2017

Data Alone Does Not Change People’s Minds

On NPR’s Hidden Brain podcast, cognitive neuroscientist Tali Sharot discusses the role of data in changing people’s behavior.

From Data to Action

The goal of analytics is to have a positive impact on the performance of your organization. To have an impact, you usually need to convince people to change their behavior.

This is required whether you want to convince a CEO to adopt a new strategy, a manager to allocate resources differently, or a knowledge worker to change their processes.

That’s why data visualization and data storytelling have become key skill sets for modern analytics professionals.


Data is Not Enough

How do you convince people to change their behaviors? Many analysts fall into the trap of letting the data speak for itself.

On a recent episode of NPR’s Hidden Brain podcast, cognitive neuroscientist Tali Sharot explains that data alone won’t do the job. (The podcast is embedded above.)

Most people are familiar with the concept of confirmation bias, where we tend to accept data that supports our existing opinions. Sharot suggests there are ways to override this kind of bias.

Some key takeaways:
  • People evaluate new information based on what they already believe
  • Strongly held false beliefs are difficult to change with data
  • Fear tends to lead to inaction, rather than action
  • Positive feedback or hope is a powerful motivator if you want to change people’s actions
This is a fascinating listen for anyone interested in telling stories with data. Not only does it offer suggestions on how to change people’s behavior, it also illustrates the power of tracking results and making them available to people.

I’ve pre-ordered Sharot’s upcoming book, The Influential Mind. You should too!

Recommended Podcast Apps

I have received a lot of positive feedback from people who enjoy listening to the podcasts I mention on this blog. Several people have asked me how to listen to podcasts.

You can, of course, simply click on the play button in the posts. But you can also subscribe to podcasts using a smartphone app. This lets you listen on the go, and also notifies you when new episodes are available.

Here are two apps I recommend if you use an iOS device:
  • Castro Podcast Player is perfect if you are new to podcasts, or if you subscribe to a handful of podcasts.
  • Overcast: Podcast Player is good for people who subscribe to a large number of podcasts. It is more complex, but allows you to set up multiple playlists and priorities.

Wednesday, January 25, 2017

Avoid the Unintended Consequences of Analytic Models

Cathy O’Neil’s Weapons of Math Destruction is a must-read for analytics professionals and data scientists.

In a world where it is acceptable for people to say, “I’m not good at math,” it’s tempting to lean on analytic models as the arbiters of truth.

But like anything else, analytic models can be done poorly. And sometimes, you must look outside your organization to spot the damages.

The Nature of Analytic Insights

Traditional OLAP focuses on the objective aspects of business information. “The person who placed this $40 order is 39 years old and lives in Helena, Montana.” No argument there.

But analytics go beyond simple descriptive assertions. Analytic insights are derived from mathematical models that make inferences or predictions, often using statistics and data mining.1

This brings in the messy world of probability. The result is a different kind of insight: “This person is likely to default on their payment.” How likely? What degree of certainty is needed to turn them away?

When you make a decision based on analytics, you are playing the odds at best. But what if the underlying model is flawed?

Several things can go wrong with the model itself:
  • It is a poor fit to the business situation
  • It is based on inappropriate proxy metrics
  • It uses training data that reinforces past errors or injustices
  • It is so complex that it is not understood by those who use it to make decisions.
And here is the worst news: whether or not you manage to avoid these pitfalls, a model can seem to be “working” for one area of your business, while causing damage elsewhere.

The first step in learning to avoid these problems is knowing what to look for.

Shining a Light on Hidden Damages

In Weapons of Math Destruction, Cathy O’Neil teaches you to identify a class of models that do serious harm. This harm might otherwise go unnoticed, since the negative impacts are often felt outside the organization.2 She calls these models “weapons of math destruction.”

O’Neil defines a WMD as a model with three characteristics:
  • Opacity – the workings of the model are not accessible to those it impacts
  • Scale – the model has the potential to impact large numbers of people
  • Damage – the model is used to make decisions that may negatively impact individuals
The book explores models that have all three of these characteristics. It exposes their hidden effects on familiar areas of everyday life – choosing a college, getting a job, or securing a loan. It also explores their effects on parts of our culture that might not be familiar to the reader, such as sentencing in the criminal justice system.

Misaligned Incentives

O’Neil’s book is not a blanket indictment of analytics. She points out that analytic models can have wide ranging benefits. This occurs when everyone’s best interests line up.

For example, as Amazon’s recommendation engine improves, both Amazon and their customers benefit. In this case, the internal incentive to improve lines up with the external benefits.

WMDs occur when these interests conflict. O’Neil finds this to be the case for models that screen job applications. If these models reduce the number of résumés that HR staff must consider, they are deemed “good enough” to use. They may also exclude valid candidates from consideration, but there is not an internal incentive to improve them. The fact that they harm outside parties may even go unnoticed.

Untangling Impact from Intent

WMDs can seem insidious, but they are often born of good intentions. O’Neil shows that it is important to distinguish between the business objective and the model itself. It’s possible to have the best of intentions, but produce a model that generates untold damage.

The hand-screening of job applications, for example, has been shown to be inherently biased. Who would argue against “solving” this problem by replacing the manual screening with an objective model?

This may be a noble intention, but O’Neil shows that it fails miserably when the model internalizes the very same biases. Couple that with misaligned incentives for improvement, and the WMD fuels a vicious cycle that can have precisely the opposite of the intended effect.

Learning to Spot Analytic Pitfalls

The first step to avoiding analytics gone awry is to learn what to look for.

“Data scientists all too often lose sight of the folks at the receiving end of the transaction,” O’Neil writes in the introduction. This book is the vaccine that helps prevent that mistake.

If you work in the field of analytics, Weapons of Math Destruction is an essential read.


Notes:

1. OLAP and Analytics are two of the key service areas of a modern BI program. To learn more about what distinguishes them, see The Three Pillars of Modern BI (Feb 9, 2015).

2. But not always. For example, some of the models explored in the book have negative impacts on employees.


Thursday, November 17, 2016

Probability and Analytics: Reactions to 2016 Election Forecasts

Reactions to the 2016 election forecasts suggest we don’t do a good job communicating probability and risk.

In a September 2016 post, I suggested readers check out the discussions of analytic models at FiveThirtyEight. One of the links led to their forecast model for the 2016 presidential election.1

In the past week, I have received quite a bit of email suggesting I should take down the post, given that the model “failed.” For example, one emailer wrote:
How can you continue to promote Nate Silver? The election result proved the analytics wrong.
These reactions expose a real issue with analytics: most people do not understand how to interpret probability.

An analytic failure?

On November 7th, the final prediction of the FiveThirtyEight “Polls Only Model” gave Hillary Clinton a 71% chance of winning. As things turned out, she lost.

Those emailing me were not alone in believing the model failed. The day after the election, there were many stories suggesting FiveThirtyEight and the other aggregators were wrong.2

But were they?


Nate Silver discusses the FiveThirtyEight Model
(If the video above does not play, you can access it here.)

Understanding probability

The FiveThirtyEight model gave Clinton a 71% chance of winning the election. That’s about a 7 in 10 chance. To understand how to interpret this probability, try the following thought experiment:

Suppose you are at Dulles airport, and are about to board a plane. While you are waiting, you are notified that there is a 7 in 10 chance your flight will land safely.  Would you get on the plane? 

I know I wouldn’t.

When the probability of something happening is 70%, the probability of it not happening is 30%. In the case of the airline flight, that’s not an acceptable risk!

Now suppose the flight lands safely.  Was the prediction right?

Maybe, but maybe not.  The plane landed safely, but were the odds with the passengers?  Was there actually a greater danger that was narrowly avoided? Was there no danger at all?

When a single event is assigned a probability, it’s hard to assess whether the assigned probability was “correct.”

Suppose every flight departing Dulles was given a 7 in 10 chance of landing safely, rather than just one. The next day, we check the results and find that all flights landed safely. Was the prediction correct?

In this case, we are able to say that the model was clearly wrong. About 1,800 flights depart Dulles airport each day. The model predicted that thirty percent, or about 540 flights, would not land safely. It clearly missed the mark, and by a wide margin.

Probabilistic predictions are easier to evaluate when they apply to a large number of events.
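
To make this concrete, here is a minimal simulation sketch in Python, using the illustrative numbers from this post. A single safe landing tells us almost nothing about a 7 in 10 forecast; across 1,800 flights, the forecast makes a claim we can actually check.

    import random

    random.seed(42)

    def safe_landings(n_flights, p_safe):
        """Simulate n_flights, each landing safely with probability p_safe."""
        return sum(random.random() < p_safe for _ in range(n_flights))

    # One event: a safe landing neither confirms nor refutes the 70% figure.
    print("Single flight landed safely:", safe_landings(1, 0.7) == 1)

    # Many events: a correct 7-in-10 model implies roughly 540 of 1,800
    # flights would not land safely. If all 1,800 land safely, the model
    # has clearly missed the mark.
    n = 1800
    safe = safe_landings(n, 0.7)
    print("Safe landings under the model:", safe, "of", n)
    print("Unsafe landings the model expects:", int(n * 0.3))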

Explaining probability

In the days and weeks leading up to the election, the FiveThirtyEight staff spent a good deal of time trying to put the uncertainty of their forecast in context.  As the election drew closer, these became daily warnings:
  • November 6:  A post outlined just how close the race was, and how a standard polling miss of 3% could swing the election.
  • November 7:  An update called a Clinton win “probable but far from certain.”
  • November 8: The final model discussion outlined all the reasons a Clinton win was not a certainty, and explored scenarios that would lead to a loss.
Despite all this, many people were unable to interpret the probabilistic model, and the associated uncertainty.

Avoiding unrealistic expectations

If a research scientist at Yale and the MIT Technology Review misunderstood a probabilistic forecast, how well are people in your business doing?
  • Are people in your business making decisions based on probabilistic models? 
  • Are they factoring an appropriate risk level into their actions?
  • Are you doing enough to help them understand the strength of model predictions?
It's important that decision makers comprehend the predictive strength of the models they use. And it’s everyone’s responsibility to make sure they understand.

We have a long, long way to go.



Notes:

1. See the post Read (or Listen to) Discussions of Analytic Models.  The model discussion I linked to is: A User’s Guide To FiveThirtyEight’s 2016 General Election Forecast

2. “Aggregators” is a term used by the mainstream press to describe data scientists who build models based on polling data. Here are a few stories that suggested these models were wrong: The Wrap, Vanity Fair, The New Yorker, Quanta Magazine, and MIT Technology Review.


Wednesday, September 28, 2016

Read (or Listen to) Discussions of Analytic Models

Organizations often feel their analytics are proprietary, and therefore decline to discuss how their models work. One shining exception is Nate Silver’s FiveThirtyEight.com. The site makes a point of exposing how their models are built. They also discuss their models as part of their elections podcast.

Data Storytelling

As students in my courses know, FiveThirtyEight.com is a data-driven journalism blog founded by Nate Silver. FiveThirtyEight covers sports, politics, science, and popular culture.

If you are interested in visualization, analytics, or telling stories with data, you will enjoy the site.

Stories on FiveThirtyEight are always shaped by data. And if they develop a model of any kind, that model is openly explained. You may have to comb through footnotes, but it’s always there.

One of the most detailed discussions on the site right now describes their 2016 election forecast model. (With apologies to readers outside the US, this is a very US-centric topic.)

Podcasts

FiveThirtyEight also offers several podcasts, where you can listen to analyst discussions which are driven by data.

Until recently, these conversations rarely delved into the technical realm. On the elections podcast, if Nate Silver or Harry Enten mentioned “long tails,” “blended averages,” or “p-values,” the other hosts jokingly steered the conversation back to analysis.

That practice was put to an end a few weeks ago with the establishment of “Model Talk” episodes. Every second Friday the model itself is discussed in greater detail. For example, in the 8/26 episode, Silver describes the predictive value of state polls over national polls, and why it is important to build a model where state by state probabilities interact.


Recommended Reading

I also highly recommend Silver’s book, The Signal and the Noise: Why So Many Predictions Fail—but Some Don’t. If you are interested in analytics, it is a fascinating read.


Wednesday, December 23, 2015

What Hollywood Can Teach Analytics Professionals: How to Tell Stories

You might not realize it, but you probably have something in common with the creators of the TV show South Park. 


Analytics yield insights that can have powerful business impact. These insights come from statistics and data mining—processes that are inaccessible to most people. If you want your business to learn and remember, you have to tell a story.

All too often, the communication of an analytic finding reads like a police report: procedural, laden with jargon, and stripped of meaningful business context.

That’s not interesting. People won’t learn from it, and they certainly won’t change their behavior.

How then to get your point across? You need to learn how to tell stories. Data stories.



Trey Parker and Matt Stone know a thing or two about telling a story. They are the creators of South Park, a wildly successful television show which has been on the air for 19 years. Like you, their success depends on telling interesting stories.

In the video clip above, Parker and Stone are speaking to a group of students at NYU on storytelling strategies. Trey tells the students:

We can take these beats, which are basically the beats of your outline, and if the words “and then” belong between those beats, you’re f***ed. Basically. You’ve got something pretty boring.

What should happen between every beat that you’ve written down is either the word “therefore” or “but.”

Data storytellers make this mistake all the time. “We did this… then we tried that… the algorithm showed this… the correlation coefficient is that… our conclusion is…”

This kind of forensic storytelling is boring. It won’t be remembered, and the value of the insight will be lost. Save the procedural detail for an appendix somewhere. People learn from good stories, not lab reports.

As Matt says later in the clip, you need causality to have an interesting story:

But. Because. Therefore.  That gives you the causation between each beat.  And that…that’s a story.

Be sure to watch the entire clip and, if you are so inclined, take some time off for an episode or two of South Park. It just might make you a better data scientist!



The embedded video is from the NY Times ArtsBeat blog post, Hello! Matt Stone and Trey Parker Crash a Class at NYU (September 8, 2011). Hat tip to Tony Zhou and his Video Essay on F for Fake at the marvelous blog Every Frame a Painting.


Friday, March 20, 2015

BI and the Path to Business Value

Managing BI services requires a consistent information architecture, even if different teams are responsible for data marts, performance management, and analytics.

Business Value From BI


Business Intelligence is the use of information to improve business performance.[1] To improve business performance, we must do three things:

  • Track business performance
  • Analyze business performance
  • Impact business performance

Each step on the path to business value is supported by two kinds of BI services, as shown in the illustration.

  • Tracking performance requires understanding what is currently happening (Performance Management) and what has happened in the past (OLAP).
  • Analyzing performance requires the ability to get to detail (OLAP) and to develop insight into cause and effect (Business Analytics).
  • Impacting performance requires targeting a business metric (Performance Management) and taking a prescribed course of action (Business Analytics).





Each of these steps leverages a pair of BI services, and each service shares a common interest in business information.[2] Managing BI services therefore requires a consistent information architecture. This is true even when separate teams manage each area.

Tracking Performance

Understanding performance often starts with summarized data on dashboards and scorecards (Performance Management). The need to investigate potential problems requires detailed data and history (OLAP and Reporting).

As Wayne Eckerson demonstrated in Performance Dashboards, both these areas provide stronger business value when they are integrated. For example, a dashboard is more useful when someone can click a metric and bring up supporting detail in an OLAP cube.

To successfully link Performance Management and OLAP, the two domains must share common definitions for business metrics (facts) and associated reference data (dimensions). Metrics must be calculated in the same way, linked to reference data at different levels of detail, and synchronized (if managed separately).
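
As a minimal sketch of what a shared definition looks like (Python with pandas, and hypothetical column names): the metric is defined once, then applied at both the scorecard grain and the OLAP grain, so the two services cannot disagree.

    import pandas as pd

    # Hypothetical order-line data used by both service areas.
    orders = pd.DataFrame({
        "region":     ["East", "West", "East"],
        "quantity":   [2, 1, 5],
        "unit_price": [10.0, 25.0, 10.0],
    })

    # One shared definition of the "net sales" metric.
    orders["net_sales"] = orders["quantity"] * orders["unit_price"]

    # Performance Management: a single number for the scorecard.
    print("Company net sales:", orders["net_sales"].sum())

    # OLAP: the same definition, applied at a finer level of detail.
    print(orders.groupby("region")["net_sales"].sum())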

Analyzing Performance

Analyzing performance is the process of breaking down what has occurred in an attempt to understand it better. Slicing and dicing an OLAP cube is a form of analysis, providing insight through detail. Analytic models provide a deeper level of analysis, providing insight into cause and effect, and extending this to the future through prediction.

OLAP is largely focused on exploring various aggregations of business metrics, while analytics is largely focused on the underlying detail that surrounds them. Our OLAP solutions provide historic detail to Business Analytics in the form of data from the data warehouse.[3]

The exchange flows in the opposite direction as well. Business analytics develop insights that suggest other things that should be tracked by OLAP services. For example, a particular set of behaviors may define a high value customer. This assessment is developed using Business Analytics, and applied to the customers in the OLAP data mart. For a fun example from the world of sports, check out the book Moneyball by Michael Lewis.[4]

Improving Performance

All of this is somewhat academic if people in the business do not use all this information to make decisions. Business impact occurs at the intersection of Performance Management (which tells us what is important and how we are doing) and Analytics (which suggests the best course of action).

Every analytic model targets a business metric or key performance indicator (KPI) from the world of performance management. That same KPI, in turn, can be used to measure return on investment of the analytic model.

For example, a direct sales manager of enterprise software wants to reduce cost of sales. An analytic model is developed that assesses the likelihood of a prospect to buy enterprise software.

The manager begins using the prospect assessment model to prioritize the work of the sales team. Less likely prospects are reassigned to a telesales force. Over the next two quarters, cost of sales starts falling. The same KPI that the analytic model targeted is used to measure its return on investment.

Information as an Asset

It is common to manage each of the pillars of Modern BI as a separate program. The path to business value, however, requires that these programs share a consistent view of business information. BI programs that are not centralized must carefully coordinate around a common information architecture.

Further Reading


  1. For more on this definition of business intelligence, see Business Intelligence in the Modern Era (9/18/2014)
  2. The three service areas are explained in The Three Pillars of Modern BI (2/9/2015).
  3. Sometimes analytic modelers bypass the data warehouse, but there are steps you can take to make this important repository more useful. For tips on how to make your data warehouse more useful to analytic modelers, see Optimizing Warehouse Data for Business Analytics (9/25/13). Note that even with a well designed data warehouse, analytic models often augment this enterprise data with additional data sources.
  4. The Oakland A's used analytics to re-evaluate the basic metrics used to assess the value of a baseball player. See Business Analytics and Dimensional Data (7/17/13).
Interested in learning more about modern BI programs?

Check out my new course, Business Information and Modern BI: Evolving Beyond the Dimensional Data Mart.  Offered at TDWI conferences and onsite.  See the sidebar for upcoming dates.

Thursday, September 18, 2014

Business Intelligence in the Modern Era

This post offers an updated definition for BI, and suggests that you don't have to think about it as a box on an org chart.

BI has changed a lot in the last two decades. Technologies and best practices have evolved, and we've found more ways in which a BI program can deliver value. Some of these innovations have occurred outside of IT or the BI Competency Centers that many businesses have established. At the same time, many organizations are moving to make business units autonomous.

These changes lead many people to ask: what exactly is BI? Is it a box on the org chart? Does it include analytics that were never done by IT? How do data governance and master data management fit in?

Business Intelligence Defined

I define BI as follows:
Business Intelligence:
The use of information to improve business performance
- Chris Adamson

The first thing to note about this definition is that it does not address any specific technologies or methods. These aspects change over time, and they certainly influence what we may be able to achieve.  But the objective is always to provide business value.

Secondly, note that this definition is not beholden to the boundaries of a departmental structure. Regardless of who develops, supports or uses solutions, it's all considered BI.

Let's take a quick look at both these aspects.

BI Services and Activities

The reason we commit resources to BI programs is simple: we intend to use information to deliver some kind of business value.  The definition has been crafted to cover any activities that support this objective.  It can be used to describe a variety of activities that provide business value, both old and new.

Among the older activities it covers:
  • Traditional reporting, OLAP and ad hoc functions
  • Dashboards and scorecards
  • Traditional data warehouses and/or data marts
  • Data integration services
At the same time, some newer uses of information are covered:
  • Business analytics and predictive analytics
  • Master data management
  • Data governance
  • Virtualization and federation services
The definition also covers activities that some people think of as on "the other side of the fence" from BI:
  • Transaction processing
That's intentional; transaction processing manufactures much of the "raw material" that BI programs attempt to leverage.  When we plan an operational solution, we should be thinking about these downstream uses.

BI and the Org Chart

While you may have a group responsible for BI program management, it is important to understand that the scope of BI reaches well beyond this group. The delivery of business benefit from information impacts the entire organization.

Some of the functional areas that participate in BI are:

  • Business units: All of the value from BI happens within business areas that use information. This is where decisions are made and impacts are realized. For many businesses, responsibility for development of BI solutions also lies in business areas. This is particularly the case for analytics, but also increasingly for the traditional forms of BI.
  • BI Competency Centers: Whether part of IT or external to it, many organizations have established a centralized resource for planning and overseeing the development of traditional forms of BI, such as data marts, dashboards or scorecards. In some cases, these centers have become focused on providing advisory services to business units that create and manage their own solutions.
  • Analytic Competency Centers: Business analytics often begins within business areas such as marketing or risk management. Analytic competency centers are developed to help other areas of the business leverage information in a similar manner. Whether part of the BI competency center or distinct from it, this is also a core BI function.
  • IT: At a minimum, IT has some responsibility for the technical infrastructure on top of which information systems are built -- networks, computers and the services that keep them up and running. IT may also have responsibility for some of the business applications and data management solutions.
Regardless of how your organization structure divvies up these responsibilities, BI is the sum total of these activities, and not the domain of a particular group or department. A business strategy to create value through information cuts across many departments.  It cannot be planned or executed in isolation.

The Future of BI

We're not far from an age where BI is not a separate part of our information architecture.  We're not there yet, but several trends have us on this path:

  • Focus on the future value and re-use of data managed by operational applications
  • Commitment to data governance
  • Maturation of master data management solutions
  • Technological advances in data management and information access

When we finally arrive at a unified information architecture, the definition of BI will still hold. We will be closer to delivering on its promise than ever before.

And, without a doubt, we will have come up with ways of using information to deliver value that have not even been thought of today.

Thursday, November 14, 2013

Facebook's Ken Rudin on Analytics

If you are interested in how business analytics impact your BI program, carve out forty-five minutes of time to watch Ken Rudin's recent TDWI keynote: "Big Data, Bigger Impact." The video is embedded below.

Rudin is the director of analytics at Facebook. In his presentation, he discusses several topics that are of interest to readers of this blog. Among them:
  • Big data technology should be used to extend your traditional BI solution, not replace it. Facebook has realized this, and is working to bring in relational technology to answer traditional business questions.
  • Successful analytics programs bring together centrally managed core data metrics with a variety of data that is not centrally managed. Rudin shares different ways he has been able to make this happen.
  • A similar balance can be attained with your organizational structure. Use of "embedded analysts" provides the business benefits of decentralization, while maintaining the efficiencies and scale advantages of a centralized program.
These are just a few of the points made during his talk. If you don't have the time to watch it now, bookmark this page for later.

You'll also want to check out Wayne Eckerson's latest book, Secrets of Analytical Leaders. (Details below.)

Big Data, Bigger Impact
Ken Rudin
TDWI World Conference, Chicago 5/6/2013




Recommended Reading

Wayne Eckerson's excellent book, Secrets of Analytical Leaders, features more insights from Ken Rudin and others.

I highly recommend this book if you are interested in analytics.

Get it from Amazon.com in paperback or Kindle editions.


Wednesday, September 25, 2013

Optimizing warehouse data for business analytics

Business analytics often integrate information from your data warehouse with other sources of data. This post looks at the best practices of warehouse design that make this possible.

I receive a lot of questions regarding the best way to structure warehouse data to support an analytics program. The answer is simple: follow the same best practices you've already learned.

I'll cover these practices from a dimensional modeling perspective. Keep in mind that they apply in any data warehouse, including those modeled in third normal form.


1. Store Granular Facts

Analytic modelers often choose sources external to the data warehouse, even when the warehouse seems to contain relevant data. The number one reason for this is insufficient detail. The warehouse contains summarized data; the analytic model requires detail.

In this situation, the analytic modeler has no choice but to look elsewhere. Worse, she may be forced to build redundant processes to transform source data and compile history. Luckily, this is not a failure of warehouse design principles; it's a failure to follow standard best practices.

Best practices of dimensional design dictate that we set the grain of base fact tables at the lowest level of detail possible. Need a daily summary of sales? Store the individual order lines. Asked to track the cost of trips? Store detail about each leg.

Dimensional solutions can contain summarized data. This takes the form of cubes, aggregates, or derived schemas. But these summaries should be derived exclusively from detailed data that also lives in the warehouse.

Like all rules, this rule has exceptions. There are times when the cost/benefit calculus is such that it doesn't make sense to house highly granular data indefinitely. But more often than not, summary data is stored simply because basic best practices were not followed.
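
To illustrate the practice, here is a small sketch in Python with pandas, using hypothetical data: the base fact table stores individual order lines, and the daily summary is derived from that detail rather than stored in its place.

    import pandas as pd

    # Base fact table at the lowest grain: one row per order line.
    order_lines = pd.DataFrame({
        "order_date": ["2013-09-25", "2013-09-25", "2013-09-26"],
        "product":    ["widget", "gadget", "widget"],
        "quantity":   [3, 1, 2],
        "amount":     [30.0, 45.0, 20.0],
    })

    # The daily summary is derived from the granular facts. Dashboards get
    # their rollup; analytic modelers still have the underlying detail.
    daily_sales = order_lines.groupby("order_date", as_index=False)[
        ["quantity", "amount"]].sum()
    print(daily_sales)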

2. Build “Wide” Dimensions

The more attributes there are in your reference data (aka dimensions), the more useful source material there is for analytic discovery. So build dimensions that are full of attributes, as many as you can find.

If the grain of your fact table gives the analytics team “observations” to work on, the dimensions give them “variables.”  And the more variables there are, the better the odds of finding useful associations, correlations, or influences.

Luckily, this too is already a best practice. Unfortunately, it is one that is often misunderstood and violated. Misguided modelers frequently break things down into the essential pieces only, or model just to specific requirements.  

3. Track Changes to Reference Data (and Use Effective Dating)

When reference data changes, too many dimensional models default to updating corresponding dimensions, because it is easier.

For example, suppose your company re-brands a product. It's still the same product, but with a new name. You may be tempted to simply update the reference data in your data warehouse. This is easier than tracking changes.  It may even seem to make business sense, because 90% of your reports require this-year-versus-last comparison by product name.

Unfortunately, some very important analysis may require understanding how consumer behavior correlates with the product name. You've lost this in your data set. Best practices help avoid these problems.

Dimensional models should track the change history of reference data. In dimensional speak, this means application of type 2 slow changes as a rule. This preserves the historic context of every fact recorded in the fact table.

In addition, every row in a dimension table should track "effective" and "expiration" dates, as well as a flag marking rows that are current. This enables the delivery of type 1 behavior (the current value) even as we store type 2 behavior. From an analytic perspective, it also enables useful "what if" analysis.

As with all rules, again there are exceptions. In some cases, there may be good reason not to respond to changes in reference data by tracking history. But more often than not, type 1 responses are chosen for the wrong reason: because they are easier to implement.
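
Here is a minimal sketch of a type 2 response with effective dating, in Python, using the hypothetical re-branding example above: the current row is expired, a new version is inserted, and history is preserved.

    from datetime import date

    # A dimension row carries a surrogate key, the natural key, effective
    # and expiration dates, and a current flag (illustrative structure).
    product_dim = [
        {"product_key": 1, "product_id": "SKU-100", "name": "Old Name",
         "effective": date(2012, 1, 1), "expiration": date(9999, 12, 31),
         "is_current": True},
    ]

    def apply_type2_change(dim, product_id, new_name, change_date):
        """Expire the current row for this natural key; insert a new version."""
        for row in dim:
            if row["product_id"] == product_id and row["is_current"]:
                row["expiration"] = change_date
                row["is_current"] = False
        dim.append({
            "product_key": max(r["product_key"] for r in dim) + 1,
            "product_id": product_id,  # same natural key, new version
            "name": new_name,
            "effective": change_date,
            "expiration": date(9999, 12, 31),
            "is_current": True,
        })

    apply_type2_change(product_dim, "SKU-100", "New Name", date(2013, 6, 1))
    # Facts recorded before 2013-06-01 still join to the "Old Name" row.
    for row in product_dim:
        print(row)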

4. Record Identifying Information, Including Alternate Identifiers

Good dimensional models allow us to trace back to the original source data. To do this, include transaction identifiers (real or manufactured) in fact tables, and maintain identifiers from source systems in dimension tables (these are called "natural keys").

Some of this is just plain necessary in order to get a dimensional schema loaded. For example, if we are tracking changes to a product name in a dimension, we may have multiple rows for a given product. The product's identifier is not a unique identifier, but we must have access to it. If we don't, it would become impossible to load a fact into the fact table.

Identifying information is also essential for business analytics. Data from the warehouse is likely to be combined with data that comes from other places. These identifiers are the connectors that allow analytic modelers to do this.  Without them, it may become necessary to bypass the warehouse.

Your analytic efforts, however, may require blending new data with your enterprise data. And that new data may not come with handy identifiers. You have a better chance blending it with enterprise data if your warehouse also includes alternate identifiers, which can be used to do matching. Include things like phone numbers, email addresses, geographic coordinates—anything that will give the analytics effort a fighting chance of linking up data sources.
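
Here is a short sketch of the matching step (Python with pandas, hypothetical data): the external data arrives without warehouse keys, but a normalized email address gives the analyst something to join on.

    import pandas as pd

    # Warehouse customer dimension with an alternate identifier.
    customers = pd.DataFrame({
        "customer_id": ["C-001", "C-002"],
        "email":       ["pat@example.com", "lee@example.com"],
    })

    # External data with no warehouse key, only an email address.
    external = pd.DataFrame({
        "email": ["PAT@EXAMPLE.COM", "lee@example.com"],
        "score": [0.82, 0.35],
    })

    # Normalize the alternate identifier on both sides before matching.
    customers["email_norm"] = customers["email"].str.strip().str.lower()
    external["email_norm"] = external["email"].str.strip().str.lower()

    matched = customers.merge(external, on="email_norm", how="left")
    print(matched[["customer_id", "email_norm", "score"]])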

Summary

If you've been following the best practices of dimensional modeling, you've produced an asset that maximizes value for analytic modelers:

  • You have granular, detailed event data.
  • You have rich, detailed reference data.
  • You are tracking and time-stamping changes to reference data. 
  • You've got transaction identifiers, business keys, and alternate identifiers.  

It also goes without saying that conformed dimensions are crucial if you hope to sustain a program of business analytics.

Of course, there are other considerations that may cause an analytic modeler to turn her back on the data warehouse. Latency issues, for example, may steer them to operational solutions. Accessibility and procedural issues, too, may get in the way of the analytic process.

But from a database design perspective, the message is simple: follow those best practices!

Further Reading

You can also read more in my book, Star Schema: The Complete Reference.  If you use the links on this page to pick up a copy on Amazon, you will be helping support this blog.  

It covers the best practices of dimensional design in depth. For example:

  • Grain, identifiers, keys and basic slow change techniques are covered in Chapter 3, "Stars and Cubes"
  • The place of summary data is covered in Chapter 14, "Derived Schemas" and Chapter 15, "Aggregates"
  • Conformance is covered in Chapter 5, "Conformed Dimensions"
  • Advanced slow change techniques are explored in Chapter 8, "More Slow Change Techniques"

Wednesday, July 17, 2013

Business Analytics and Dimensional Data

Readers of this blog frequently ask about the relationship of business analytics to the dimensional data that is recorded in data marts and the data warehouse.

Business analytics operate on data that often does not come from the data warehouse. The value of business analytics, however, is measured by its impact on business metrics that are tracked in the data warehouse. 

Business analytics may also help adjust our notion of which metrics matter the most.

The Data Warehouse and Dimensional Data

The dimensional model is the focal point of business information in the data warehouse. It describes how we track business activities and measure business performance. It may also be the foundation for a performance management program that links metrics to business goals.

Dimensional data is the definitive record of what matters to the business about activities and status. Clearly defined performance indicators (facts) are recorded consistently and cross referenced with standardized and conformed reference data (dimensions).

In this post, when I talk about "the data warehouse," I will have this dimensional data in mind.

Business Analytics

Business analytics seek to provide new insight into business activities. Analytics do not always operate on business metrics, and they don't rely exclusively on information from the data warehouse. Dimensional information may be an input, but other sources of data are also drawn upon.

The outputs of business analytics, however, aim directly at the metrics tracked by our dimensional models. Insights from analytics are used by people to move key metrics in the desired directions. These results are called impacts.

Business analytics may also help in another way. Sometimes, they help us determine which metrics are actually the most important.

A great illustration of these dynamics can be found in the business of Major League Baseball. (If you don't follow baseball, don't worry. You don't have to understand baseball to follow this example.)

Metrics in Baseball

Major league baseball has long been in the business of measurement. Followers of the game are familiar with the "box score" that summarizes each game, "standings" that illustrate the relative performance of teams, and numerous statistics that describe the performance of each player.

These metrics have precise definitions and have been recorded consistently for almost 150 years.1 Like the metrics in your data warehouse, they are tracked systematically. Professional baseball teams can also set goals for these metrics and compare them to results, much like a scorecard in your performance management program.

How does one improve these results? If you run a baseball team, part of the answer lies in how you choose players. In the book Moneyball,2 Michael Lewis describes how the Oakland Athletics used a set of techniques known as sabermetrics3 to make smarter choices about which players to add to their roster.

These analytics allowed the A's to make smarter choices with measurable impact--improving performance and reducing costs. Analytics also motivated the A's to change the emphasis given to various metrics.

Business Analytics and the Oakland Athletics

The traditional approach to selecting players was focused on long-held conventional wisdom about what makes a valuable player. For example, offensive value was generally held to derive from the ability to contact the baseball, and with a player's speed. These skills are at least partially evident in some of the standard baseball metrics -- things like the batting average, stolen bases, runs batted in and sacrifices.

The Oakland A's looked to data to refine their notion of what a valuable player looks like. How do the things players do actually contribute to a win or loss? To do this, the A's went beyond the box scores and statistics -- beyond the data warehouse, so to speak.

By studying every action that is a part of the game -- what players are on base, what kind of pitches are thrown, where the ball lands when it is hit, etc -- the A's realized they could be smarter about assessing how a player adds value. These business analytics led to several useful conclusions:
  • Batting averages don't tell the whole story about a player's ability to get on base; for example, they exclude walks.
  • Stolen bases don't always contribute to scoring; much depends on who comes to bat next.
  • Runs batted in tell as much about who hits before a player as they do about the player himself.
  • Sacrifices, where an out is recorded but a runner advances, were found to contribute less to the outcome of a game than conventional wisdom held.
You may or may not understand these conclusions, but here is the important thing: the analytics suggested that the A's could better assess a player's impact on winning games by turning away from conventional wisdom. Contact and speed are not the best predictors of winning games. "Patience at the plate" leads to better outcomes.
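
The batting average point is easy to see in the standard formulas. Here is a sketch in Python; these are the conventional definitions, not anything proprietary to the A's.

    def batting_average(hits, at_bats):
        """Hits per at-bat. Walks are not at-bats, so they simply vanish."""
        return hits / at_bats

    def on_base_percentage(hits, walks, hbp, at_bats, sac_flies):
        """Times on base per plate appearance. Walks count here."""
        return (hits + walks + hbp) / (at_bats + walks + hbp + sac_flies)

    # Two hypothetical players with identical .300 batting averages,
    # but very different patience at the plate.
    print(batting_average(150, 500))                 # 0.300 for both
    print(on_base_percentage(150, 20, 2, 500, 3))    # rarely walks: ~.328
    print(on_base_percentage(150, 80, 2, 500, 3))    # walks often: ~.397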

Impact for the A's

By using these insights to make choices, the A's were able to select less expensive players who could make a more significant contribution to team results. These choices resulted in measurable improvement in many of the standard metrics of baseball--the win/loss ratio in particular. These insights also enabled them to deliver improved financial results.

Analytics also helped the A's in another way: they refined exactly which metrics they should be tracking. For example, in assessing offensive value, on-base percentage should be emphasized over batting average. They also created some of their own metrics to track their performance over time.

The Impact of Analytics

Business analytics tell us what to look for, what works, or what might happen. Examples are signs of impending churn, what makes a web site "sticky", patterns that might indicate fraud, and so forth.

These insights, in turn, are applied in making business decisions. These choices provide valuable impact that can be measured by tracking traditional business metrics. Examples include increased retention rates, reduced costs associated with fraud, and so forth.

These impacts are the desired outcome of the analytic program. If the analytics don't have a demonstrable impact on metrics, they are not providing value.

Business analytics can also help us revise our notion of what to track in our data warehouses, or which metrics to pay closest attention to. Number of calls to the support center, for example, may be less of an indicator of customer satisfaction than the average time to resolve an issue.

Conclusion

As you expand the scope of your BI program to include analytics, remember that your desired outcome is a positive impact on results. Move the needle on business metrics, and the analytics have done their job.

Thanks to my colleague Mark Peco, for suggesting that I use Moneyball as a way to explain analytics without revealing the proprietary insights attained by my customers. 

Notes

[1] The box score and many of these statistics were established in the mid 1800's by a sports writer named Henry Chadwick.

[2] Moneyball by Michael Lewis (Norton, 2011).

[3] The Oakland A's are a high-profile example of the use of sabermetrics, but did not originate the concept. See wikipedia for more information.






Wednesday, June 5, 2013

In the Era of Big Data, The Dimensional Model is Essential

Don't let the hype around big data lead you to believe your BI program is obsolete. 

I receive a lot of questions about "big data."  Here is one:
We have been doing data warehousing using Kimball method and dimensional modeling for several years and are very successful (thanks for your 3 books, btw). However, these days we hear a lot about Big Data Analytics, and people say that Big Data is the future trend of BI, and that it will replace data warehousing, etc.

Personally I don't believe that Big Data is going to replace Data Warehousing but I guess that it may still bring certain value to BI.  I'm wondering if you could share some thoughts.

"Big data" is the never-ending quest to expand the ways in which our BI programs deliver business value.

As we expand the scope of what we deliver to the business, we must be able to tie our discoveries back to business metrics and measure the impact of our decisions. The dimensional model is the glue that allows us to achieve this.

Unless you plan to stop measuring your business, the dimensional model will remain essential to your BI program. The data warehouse remains relevant as a means to instantiate the information that supports this model. Reports of its death have been greatly exaggerated.

Big Data

"Big Data" is usually defined as a set of data management challenges known as "the three V's" -- volume, velocity and variety. These challenges are not new. Doug Laney first wrote about the three V's in 2001 -- twelve years ago.1 And even before that, we were dealing with these problems.

Photo from NASA in public domain.
Consider the first edition of The Data Warehouse Toolkit, published by Ralph Kimball in 1996.2 For many readers, his "grocery store" example provided their first exposure to the star schema. This schema captured aggregated data! The 21 GB fact table was a daily summary of sales, not a detailed record of point-of-sale transactions. Such a data set was presumably too large at the time.

That's volume, the first V, circa 1996.

In the same era, we were also dealing with velocity and variety. Many organizations were moving from monthly, weekly or daily batch loads to real-time or near-real time loads. Some were also working to establish linkages between dimensional data and information stored in document repositories.

New business questions

As technology evolves, we are able to address an ever expanding set of business questions.

Today, it is not unreasonable to expect the grocery store's data warehouse to have a record for every product that moves across the checkout scanner, measured in terabytes rather than gigabytes. With this level of detail, market basket analysis is possible, along with longitudinal study of customer behavior.
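
With that level of detail, even a naive market basket pass becomes possible. Here is a minimal sketch in Python, with hypothetical transactions, that counts which products sell together:

    from collections import Counter
    from itertools import combinations

    # Hypothetical point-of-sale baskets: one list of products per checkout.
    baskets = [
        ["bread", "milk", "eggs"],
        ["bread", "milk"],
        ["milk", "cereal"],
        ["bread", "eggs"],
    ]

    # Count co-occurring product pairs across all baskets.
    pair_counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(set(basket)), 2):
            pair_counts[pair] += 1

    for pair, count in pair_counts.most_common(3):
        print(pair, count)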

But of course, the grocery store is now looking beyond sales to new analytic possibilities. These include tracking the movement of product through the supply and distribution process, capturing interaction behavior of on-line shoppers, and studying consumer sentiment.

We still measure our businesses

What does this mean for the dimensional model? As I've posted before, a dimensional model represents how we measure the business. That's not something we're going to stop doing. Traditional business questions remain relevant, and the information that supports them is the core of our BI solution.

At the same time, we need to be able to link this information to other types of data. For a variety of reasons (V-V-V), some of this information may not be stored in a relational format, and some may not be a part of the data warehouse.

Making sense of all this data requires placing it in the context of our business objectives and activities.

To do this, we must continue to understand and capture business metrics, record transaction identifiers, integrate around conformed dimensions, and maintain associated business keys. These are long established best practices of dimensional modeling.

By applying these dimensional techniques, we can (1) link insights from our analytics to business objectives and (2) measure the impact of resultant business decisions. If we don't do this, our big data analytics become a modern-day equivalent of the stove-pipe data mart.

The data warehouse

The function of the data warehouse is to instantiate the data that supports measurement of the business. The dimensional model can be used toward this aim (think: star schema, cube).

The dimensional model also has other functions. It is used to express information requirements, to guide program scope, and to communicate with the business. Technology may eventually get us to a point where we can jettison the data warehouse on an enterprise scale,3 but these other functions will remain essential. In fact, their importance becomes elevated.

In any architecture that moves away from physically integrated data, we need a framework that allows us to bring that data together with semantic consistency. This is one of the key functions of the dimensional model.

The dimensional model is the glue that is used to assemble business information from distributed data.

Organizations that leverage a bus architecture already understand this. They routinely bring together information from separate physical data marts, a process supported by the dimensional principle of conformance. Wholesale elimination of the data warehouse takes things one step further.

Notes
  1. Doug Laney's first published treatment of "The Three V's" can be found on his blog.
  2. Now out of print, this discussion appeared in Chapter 2, "The Grocery Store."  Insight into the big data challenges of 1996 can be found in Chapter 17, "The Future."
  3. I think we are a long time away from being able to do this on an enterprise scale. When we do get there, it will be as much due to master data management as it is due to big data or virtualization technologies. I'll discuss virtualization in some future posts.
More reading

Previous posts have dealt with this topic.
  • In Big Data and Dimensional Modeling (4/20/2012) you can see me discuss the impact of new technologies on the data warehouse and the importance of the dimensional model.

Tuesday, April 30, 2013

The Role of the Dimensional Model in Your BI Program

The dimensional model delivers value long before a database is designed or built, and even when no data is ever stored dimensionally. While it is best known as a basis for database design, its other roles may have more important impacts on your BI program.

The dimensional model plays four key roles in Business Intelligence:
  1. The dimensional model is the ideal way to define requirements, because it describes how the business is measured
  2. The dimensional model is ideal for managing scope because it communicates to business people (functionality) and technical people (complexity) 
  3. The dimensional model is ideal as a basis for data mart design because it provides ease of use and high performance
  4. The dimensional model is ideal as a semantic layer because it communicates in business terms
Information Requirements

The dimensional model is best understood as an information model, rather than a data model. It describes business activities the same way people do: as a system of measurement. This makes it the ideal form to express information needs, regardless of how information will be stored.

Image by Gravityx9, licensed under Creative Commons 2.0
A dimensional model defines business metrics or performance indicators in detail, and captures the attendant dimensional context. (For a refresher, see the post What is a Dimensional Model from 4/27/2010.) Metrics are grouped based on shared granularity, cross referenced to shared reference data, and traced to data sources.

This representation is valuable because business questions are constantly changing. If you simply state them, you produce a model with limited shelf life. If you model answers to the question of today, you've provided perishable goods.

A dimensional model establishes information requirements that endure, even as questions change. It provides a strong foundation for multiple facets of BI:
  • Performance management, including dashboards and scorecards
  • Analytic processing, including OLAP and ad hoc analysis
  • Reporting, including both enterprise and operational reports
  • Advanced analytics, including business analytics, data mining and predictive analytics
All these disciplines center on business metrics. It should be no surprise that when Howard Dresner coined the term Business Intelligence, his definition referenced "facts and fact based systems." It's all about measurement.

Program Roadmap and Project Scope

A dimensional model can be used to describe scope because it communicates to two important audiences.
  • Business people: functionality   The dimensional model describes the measurement of a business process, reflecting how the process is evaluated by participants and observers. It communicates business capability.
  • Technical personnel: level of effort A dimensional model has technical implications: it determines the data sources that must be integrated, how information must be consolidated and cleansed, and how queries or reports can be built. In this respect, it communicates level of effort. 
These dual perspectives make the dimensional design an ideal centerpiece for managing the roadmap for your BI program. Fully documented and mapped to data sources, a dimensional model can be divided into projects and prioritized. It is a blueprint that can be understood by all interested parties. A simple conformance matrix communicates both intended functionality and technical level of effort for each project.
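
To illustrate, here is a toy conformance matrix rendered in Python. The processes and dimensions are hypothetical; an X means the process will use the conformed dimension:

```python
# A toy conformance matrix: business processes vs. conformed dimensions.
# Process and dimension names are hypothetical.
dimensions = ["day", "product", "customer", "salesperson"]
processes = {
    "orders":    {"day", "product", "customer", "salesperson"},
    "shipments": {"day", "product", "customer"},
    "returns":   {"day", "product", "customer"},
}

print(f"{'process':<12}" + "".join(f"{d:>12}" for d in dimensions))
for name, dims in processes.items():
    print(f"{name:<12}" + "".join(f"{'X' if d in dims else '-':>12}" for d in dimensions))
```

Business readers see capability in each row; technical readers see which dimensions must be conformed across projects, and therefore level of effort.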

At the project level, a dimensional design can be used as the basis for progress reporting. It can also serve as an unambiguous arbiter of change requests. Changes that add data sources or impact grain, for example, are considered out of scope. This is particularly useful for organizations that employ iterative methodologies, but its simplicity makes it easy to reconcile with any development methodology.

Database Design

The dimensional model is best known as the basis for database design. The term "star schema" is far more widely recognized than "dimensional model" (a fact that influenced the name of my most recent book).

In fact, the dimensional model is the de facto standard for data mart design, and many organizations use it to shape the entire data warehouse. It has an important place in W.H. Inmon's Corporate Information Factory, Ralph Kimball's dimensional bus architecture, and even in one-off data marts that lack an enterprise focus.

Implemented in a relational database, the dimensional model becomes known as a star schema or snowflake. Implemented in a multidimensional database, it is known as a cube. These implementations offer numerous benefits. They are:
  • Easily understandable by business people
  • Extraordinarily flexible from a reporting and analysis perspective
  • Adaptable to change
  • Capable of very high performance
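
If you have never seen one, here is a minimal star schema, sketched with Python's built-in sqlite3 module so the example is self-contained. All table and column names are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# A minimal star: one fact table surrounded by dimension tables.
# Table and column names are hypothetical.
conn.executescript("""
CREATE TABLE day      (day_key INTEGER PRIMARY KEY, full_date TEXT, month TEXT, year INTEGER);
CREATE TABLE product  (product_key INTEGER PRIMARY KEY, sku TEXT, name TEXT, brand TEXT);
CREATE TABLE customer (customer_key INTEGER PRIMARY KEY, name TEXT, region TEXT);

-- Grain: one row per product, per customer, per day
CREATE TABLE order_facts (
    day_key       INTEGER REFERENCES day (day_key),
    product_key   INTEGER REFERENCES product (product_key),
    customer_key  INTEGER REFERENCES customer (customer_key),
    quantity      INTEGER,
    order_dollars REAL
);

INSERT INTO day VALUES (1, '2013-04-30', '2013-04', 2013);
INSERT INTO product VALUES (1, 'SKU-1', 'Widget', 'Acme');
INSERT INTO customer VALUES (1, 'Smith', 'East');
INSERT INTO order_facts VALUES (1, 1, 1, 2, 59.90);
""")

# The canonical dimensional query: group by dimension attributes,
# aggregate the facts.
print(conn.execute("""
SELECT d.month, p.brand, SUM(f.order_dollars)
FROM order_facts f
JOIN day d     ON d.day_key = f.day_key
JOIN product p ON p.product_key = f.product_key
GROUP BY d.month, p.brand;
""").fetchall())  # [('2013-04', 'Acme', 59.9)]
```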
Presentation and the Semantic Layer

A dimensional representation is the ideal way to present information to business people, regardless of how it is actually stored. It reflects how people think about the business, so it is used to organize the catalog of items they can call on for analysis.

Many business intelligence tools are architected around this concept, allowing a semantic layer to sit between the user and database tables. The elements with which people can frame questions are categorized as facts and dimensions. One need not know what physical data structures lie beneath.

Even the earliest incarnations of the semantic layer leveraged this notion. Many organizations used these tools to impose a dimensional view directly on top of operational data. Today, semantic layers are commonly linked to dimensional data marts.
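
In spirit, a semantic layer is a mapping from business terms to physical SQL. The sketch below is drastically simplified and entirely hypothetical (real tools handle joins, security, aggregate navigation, and much more), but it shows the idea:

```python
# A toy semantic layer: business terms mapped to physical SQL fragments.
# Names and mappings are hypothetical and drastically simplified.
semantic_layer = {
    "facts":      {"Order Dollars": "SUM(f.order_dollars)"},
    "dimensions": {"Month": "d.month", "Brand": "p.brand"},
    "from":       ("order_facts f "
                   "JOIN day d ON d.day_key = f.day_key "
                   "JOIN product p ON p.product_key = f.product_key"),
}

def build_query(fact, dims):
    """Generate SQL from business terms; the user never sees the tables."""
    cols = [semantic_layer["dimensions"][d] for d in dims]
    select = ", ".join(cols + [semantic_layer["facts"][fact]])
    return f"SELECT {select} FROM {semantic_layer['from']} GROUP BY {', '.join(cols)};"

print(build_query("Order Dollars", ["Month", "Brand"]))
```

Whether the mapping points at a dimensional data mart or directly at operational tables, the user's view is dimensional.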

A dimensional representation of business activity is the starting point for a variety of BI activities:
  • Building enterprise reports
  • Defining performance dashboards
  • Performing ad hoc analysis
  • Preparing data for an analytic model
The concept of dimensional presentation is receiving renewed attention as federation technologies promise virtual solutions rather than physical ones.

Further information

I briefly covered these four roles in an interview last year, and many of these themes have been discussed in previous posts. Here, I wanted to bring them together in a single article. In the coming months, I will refer back to these concepts as I address common questions about big data, agile BI, and federation.

In the meantime, please help support this blog by picking up a copy of my latest book.

Thursday, December 20, 2012

Dimensional Models: No E-R Fallback Required

People often suggest to me that a dimensional model cannot handle certain situations. These situations, the assertion holds, require falling back to entity-relationship modeling techniques.

When we work through their examples together, though, I've always found a dimensional solution.

Here's a list of things people sometimes do not realize can be handled by standard dimensional techniques.

At the end of the post is a large collection of links to previous posts that treat these topics in more detail.

Note: This article should not be interpreted as a criticism of entity-relationship modeling. Rather, the point is to cover commonly overlooked capabilities of dimensional modeling.


Predictive Analytics: Storage of Granular Detail

I often hear people say: if you want to do predictive analytics, it's important to capture all the relationships among data attributes. This is incorrect.

The job of predictive analytics is to tell us what is important--not the reverse!  Neither data mining nor predictive analytics requires information to be stored in an entity-relationship format.

As people who do this work will tell you, they don't care how the data is supplied. Nor do they care what modelers believe are the important relationships. Many tools will completely de-normalize the data before doing anything else.

In order to support predictive analytics, standard best practices of dimensional modeling apply:
  • Fact tables record business activities and conditions at the lowest level of detail possible in granular, first line stars.
  • Transaction identifiers are carried in the model if available.
  • Surrounding dimensions are enriched with detail to the fullest extent possible.
  • In the dimension tables, changes are tracked and time-stamped (“type 2”).
If we do these things with our dimensional data, we have not lost anything that is required to do predictive analytics.
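
To illustrate, here is a sketch of a granular star with a type 2 customer dimension, flattened the way an analytic tool might consume it. All names and values are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# A granular fact table with a type 2 customer dimension: each change
# produces a new, time-stamped dimension row.  Names are hypothetical.
conn.executescript("""
CREATE TABLE customer (
    customer_key INTEGER PRIMARY KEY,  -- surrogate key
    customer_id  TEXT,                 -- natural key
    region       TEXT,
    effective    TEXT,
    expiration   TEXT
);
CREATE TABLE order_facts (
    transaction_id TEXT,               -- carried when available
    customer_key   INTEGER REFERENCES customer (customer_key),
    order_dollars  REAL
);

INSERT INTO customer VALUES (1, 'C-100', 'East', '2012-01-01', '2012-06-30');
INSERT INTO customer VALUES (2, 'C-100', 'West', '2012-07-01', '9999-12-31');
INSERT INTO order_facts VALUES ('T-1', 1, 100.0);
INSERT INTO order_facts VALUES ('T-2', 2, 250.0);
""")

# De-normalize into one wide row per transaction.  Type 2 tracking
# preserves each customer's attributes *as of the transaction*.
print(conn.execute("""
SELECT f.transaction_id, c.customer_id, c.region, f.order_dollars
FROM order_facts f JOIN customer c ON c.customer_key = f.customer_key;
""").fetchall())
# [('T-1', 'C-100', 'East', 100.0), ('T-2', 'C-100', 'West', 250.0)]
```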

Many-to-many Relationships: Multi-Valued Dimension Bridge

Usually, each fact in a star schema can be linked to a single member of a dimension. This is often mistaken for a rule, which leads to the incorrect conclusion that dimensional models cannot accommodate a fact that may link to multiple members of a dimension.

(This is often generalized as "dimensional models cannot handle many-to-many relationships.")

In a dimensional model, this situation is referred to as a multi-valued dimension. We use a bridge table to link the facts to the dimension in question. This special table sets up a many-to-many relationship between the facts and dimensions, allowing any number of dimension members to be associated with any number of facts.

Here are some examples:
  • In sales, a bridge table may be used to associate a sale, as recorded in a fact table, with multiple sales people. 
  • In insurance, a single claim (again, tracked in a fact table) can be associated with multiple parties. 
  • In healthcare, a single encounter may be associated with multiple providers or tests. 
  • In government, a single audit or inspection may be associated with multiple findings. 
  • In law enforcement, a single arrest may be associated with multiple charges.
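
Here is a minimal sketch of the first example: a bridge associating one sale with two salespeople. The optional allocation factor lets us split the fact without double-counting. All names and values are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Bridge between a fact and a multi-valued salesperson dimension.
# Each fact row points at a *group*; the bridge enumerates its members.
# All names and values are hypothetical.
conn.executescript("""
CREATE TABLE salesperson (salesperson_key INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE salesperson_group_bridge (
    group_key       INTEGER,
    salesperson_key INTEGER REFERENCES salesperson (salesperson_key),
    allocation      REAL      -- optional: each member's share of the fact
);
CREATE TABLE sales_facts (group_key INTEGER, sale_dollars REAL);

INSERT INTO salesperson VALUES (1, 'Ann'), (2, 'Bob');
INSERT INTO salesperson_group_bridge VALUES (10, 1, 0.5), (10, 2, 0.5);
INSERT INTO sales_facts VALUES (10, 1000.0);
""")

# Allocated dollars per salesperson.  Without the allocation factor,
# the same join intentionally double-counts for "impact" analysis.
print(conn.execute("""
SELECT s.name, SUM(f.sale_dollars * b.allocation)
FROM sales_facts f
JOIN salesperson_group_bridge b ON b.group_key = f.group_key
JOIN salesperson s ON s.salesperson_key = b.salesperson_key
GROUP BY s.name;
""").fetchall())  # [('Ann', 500.0), ('Bob', 500.0)]
```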
Repeating Attributes: Multi-Valued Attribute Bridge

It is sometimes suggested that dimensional models cannot gracefully accommodate repeating attributes. The dimensional solution is again a bridge table. This time, it is placed between the dimension table and an outrigger that contains the attribute in question.

Examples include:
  • Companies that have multiple Standard Industry Classification codes 
  • People with multiple phone numbers or addresses 
  • Accounts with multiple account holders 
  • Patients with multiple diagnoses
  • Documents with multiple keywords.
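
A minimal sketch of the first example, with a company dimension bridged to an outrigger of SIC codes (all names and values hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Bridge between a dimension and an outrigger of repeating attributes.
# All names and values are hypothetical.
conn.executescript("""
CREATE TABLE company (company_key INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE sic_code (sic_key INTEGER PRIMARY KEY, sic TEXT, description TEXT);
CREATE TABLE company_sic_bridge (
    company_key INTEGER REFERENCES company (company_key),
    sic_key     INTEGER REFERENCES sic_code (sic_key)
);

INSERT INTO company VALUES (1, 'Acme Corp');
INSERT INTO sic_code VALUES (100, '3469', 'Metal Stampings'),
                            (101, '3499', 'Fabricated Metal Products');
INSERT INTO company_sic_bridge VALUES (1, 100), (1, 101);
""")

# One company, many industry classifications: no repeating columns,
# no fallback to an ER design.
print(conn.execute("""
SELECT c.name, s.sic, s.description
FROM company c
JOIN company_sic_bridge b ON b.company_key = c.company_key
JOIN sic_code s ON s.sic_key = b.sic_key;
""").fetchall())
```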
Recursive Hierarchies: Hierarchy Bridge

Ragged hierarchies, unbalanced hierarchies, and recursive hierarchies are often cited as the downfall of the dimensional model. In fact, a solution exists, and it is extraordinarily powerful. The hierarchy bridge table allows facts to be aggregated either by rolling up or rolling down through the hierarchy, regardless of the number of levels.

Examples include:
  • Parts that are made up of other parts
  • Departments that fall within other departments
  • Geographies that fall within other geographies
  • Companies that own other companies
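
Here is a sketch of the departments example. The bridge holds one row for every ancestor/descendant pairing, including each department paired with itself, so a single join rolls up any subtree regardless of depth. Names and values are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# A hierarchy bridge: one row per ancestor/descendant pair, including
# self-pairs.  Names and values are hypothetical.
conn.executescript("""
CREATE TABLE department (department_key INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE department_hierarchy_bridge (
    ancestor_key   INTEGER,
    descendant_key INTEGER,
    levels_removed INTEGER
);
CREATE TABLE expense_facts (department_key INTEGER, expense_dollars REAL);

-- Operations (1) contains Logistics (2), which contains Shipping (3)
INSERT INTO department VALUES (1, 'Operations'), (2, 'Logistics'), (3, 'Shipping');
INSERT INTO department_hierarchy_bridge VALUES
    (1, 1, 0), (1, 2, 1), (1, 3, 2),
    (2, 2, 0), (2, 3, 1),
    (3, 3, 0);
INSERT INTO expense_facts VALUES (1, 100.0), (2, 200.0), (3, 300.0);
""")

# Roll down from Operations: its own expenses plus everything beneath
# it, no matter how many levels the tree has.
print(conn.execute("""
SELECT SUM(f.expense_dollars)
FROM department a
JOIN department_hierarchy_bridge b ON b.ancestor_key = a.department_key
JOIN expense_facts f ON f.department_key = b.descendant_key
WHERE a.name = 'Operations';
""").fetchall())  # [(600.0,)]
```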
Relationships Between Dimensions: Factless Fact Tables

A star schema does not include relationships between dimension tables. This has led to the misconception that you can't track these relationships.

In fact, any important relationship between dimension tables can and should be captured. It is done using factless fact tables. (Dimensions are never directly linked because of the implications this would have on slow change processing.)

Examples include:
  • Employees filling a job in a department
  • Marketing promotions in effect in a geographical market
  • Students registered for courses
  • The primary care physician assigned to an insured party
  • Brokers assigned to clients
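
A sketch of the student registration example: the fact table contains nothing but foreign keys, and each row records that a relationship exists. All names and values are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# A factless fact table: foreign keys only, no measures.  Each row
# records a condition or event.  Names and values are hypothetical.
conn.executescript("""
CREATE TABLE student (student_key INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE course  (course_key INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE registration_facts (
    student_key INTEGER REFERENCES student (student_key),
    course_key  INTEGER REFERENCES course (course_key),
    term_key    INTEGER   -- would reference a term dimension
);

INSERT INTO student VALUES (1, 'Pat'), (2, 'Sam');
INSERT INTO course VALUES (10, 'Statistics'), (11, 'Databases');
INSERT INTO registration_facts VALUES (1, 10, 1), (1, 11, 1), (2, 10, 1);
""")

# "Measurement" is simply counting rows: registrations per course.
print(conn.execute("""
SELECT c.title, COUNT(*) AS registrations
FROM registration_facts f
JOIN course c ON c.course_key = f.course_key
GROUP BY c.title;
""").fetchall())  # e.g. [('Databases', 1), ('Statistics', 2)]
```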
Subtyping: Core and Custom

Another situation where modelers often believe they must "fall back" on ER techniques is when the attributes of a dimension table vary by type. This variation is often misconstrued as calling for the ER construct known as subtyping. Similar variation may also be found in the associated facts.

In the dimensional model, heterogeneous attributes are handled via the core and custom technique.

A core dimension captures common attributes; type-specific replicas capture the core attributes along with those specific to each subtype. Dimensions can then be joined to the fact table according to the analytic requirement. If there is variation in the facts, the same can be done with fact tables.

Examples include:
  • Products with attributes that vary by type 
  • Customers that have different characteristics depending on whether they are businesses or individuals
  • In insurance, policies that have different characteristics depending on whether they are group or individual policies
  • In healthcare, medical procedures or tests that have different characteristics and result metrics depending on the test
  • In retail, stores that have varying characteristics depending on the type (e.g. owned, franchise, pocket)
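
Here is a sketch of the pattern for products, where apparel items carry extra attributes. The core and custom tables share the same surrogate keys, so either can join to the fact table. All names and values are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Core and custom: the core dimension holds common attributes; the
# type-specific replica repeats them and adds its own, sharing the
# same surrogate keys.  All names and values are hypothetical.
conn.executescript("""
CREATE TABLE product (                  -- core: every product
    product_key INTEGER PRIMARY KEY,
    sku TEXT, name TEXT, product_type TEXT
);
CREATE TABLE product_apparel (          -- custom: apparel only
    product_key INTEGER PRIMARY KEY,    -- same surrogate keys as core
    sku TEXT, name TEXT, product_type TEXT,
    size TEXT, color TEXT, fabric TEXT  -- type-specific attributes
);
CREATE TABLE sales_facts (product_key INTEGER, sale_dollars REAL);

INSERT INTO product VALUES (1, 'SKU-9', 'Tee Shirt', 'apparel');
INSERT INTO product_apparel VALUES (1, 'SKU-9', 'Tee Shirt', 'apparel', 'L', 'Blue', 'Cotton');
INSERT INTO sales_facts VALUES (1, 25.0);
""")

# Enterprise-wide analysis joins the core; apparel-specific analysis
# swaps in the custom replica.  No schema change required.
print(conn.execute("""
SELECT p.product_type, SUM(f.sale_dollars) FROM sales_facts f
JOIN product p ON p.product_key = f.product_key GROUP BY p.product_type;
""").fetchall())  # [('apparel', 25.0)]
print(conn.execute("""
SELECT p.color, SUM(f.sale_dollars) FROM sales_facts f
JOIN product_apparel p ON p.product_key = f.product_key GROUP BY p.color;
""").fetchall())  # [('Blue', 25.0)]
```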
Non-Additive Metrics: Snapshots or Summary Tables

In a classic star schema design, facts are recorded at a granular level and “rolled up” across various dimensions at query time. Detractors often assume this means that non-additive facts have no place in a dimensional model.

In fact, dimensional modelers have several tools for handling non-additive facts.

Those that can be broken down into additive components are captured at the component level. This is common for many key business metrics such as:
  • Margin rate: stored as margin amount and cost amount in a single fact table. The ratio is computed after queries aggregate the detail.
  • Yield or conversion percentage: stored as quote count and order count in two separate fact tables, with the ratio computed after aggregation at query time.
Other non-additive facts cannot be broken down into fully additive components. These are usually captured at the appropriate level of detail, and stored in snapshot tables or summary tables. Common examples include:
  • Period-to-date amounts, stored in a snapshot table or summary table
  • Distinct counts, stored in a summary table
  • Non-numeric grades, stored in a transaction-grained fact table
While these non-additive facts are less flexible than additive facts in terms of how they can be used, this is not a consequence of their dimensional representation.
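
To see why the components are stored rather than the ratio, consider the margin rate example. The sketch below (hypothetical names and values) computes the rate correctly by aggregating components first; averaging pre-computed rates would give the wrong answer:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Store the additive components of the non-additive margin rate.
# Names and values are hypothetical.
conn.executescript("""
CREATE TABLE sales_facts (product_key INTEGER, cost_dollars REAL, margin_dollars REAL);
INSERT INTO sales_facts VALUES (1,  80.0, 20.0);  -- $100 sale, 20% margin
INSERT INTO sales_facts VALUES (1, 140.0, 60.0);  -- $200 sale, 30% margin
""")

# Correct: aggregate the components, then compute the ratio.
# 80 / (80 + 220) = 0.2667, i.e. a 26.7% margin rate overall.
print(conn.execute("""
SELECT SUM(margin_dollars) / (SUM(margin_dollars) + SUM(cost_dollars))
FROM sales_facts;
""").fetchone())

# Wrong: averaging the stored rates would give (20% + 30%) / 2 = 25%.
```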

Conclusion

Every technique mentioned here is part of the dimensional modeling canon. None are stopgaps or workarounds. While some may prove problematic for certain BI tools, these problems are not unique to the dimensional world.

In the end, the dimensional model can represent the same real-world complexities that entity-relationship models can. No ER fallback required.

- Chris


Learn More

All of these topics have been covered previously on this blog. Look for earlier posts on:
  • Establishing granular, detailed star schemas
  • Multi-valued dimensions
  • Multi-valued attributes
  • Recursive hierarchies
  • Factless fact tables
  • Non-additive facts
There's not much on core and custom here at the moment, but my book covers it in depth (details below).

Also consider checking out my book, Star Schema: The Complete Reference.  It covers all these topics in much more detail:
  • Granular and detailed stars are covered in Chapter 3, "Stars and Cubes"
  • Multi-valued Dimensions and Multi-Valued attributes are covered in Chapter 9, "Multi-valued Dimensions and Bridges"
  • Hierarchy Bridges are covered in Chapter 10, "Recursive Hierarchies and Bridge Tables"
  • Factless Fact Tables are covered in Chapter 12, "Factless Fact Tables"
  • Core and Custom schemas are covered in Chapter 13, "Type-specific Stars"
  • Non Additive Facts are covered in Chapter 3, "Stars and Cubes," Chapter 11, "Transactions, Snapshots and Accumulating Snapshots" and Chapter 14, "Derived Schemas"

Use the links on this blog to order a copy from Amazon. There is no additional cost to you, and you will be helping to support this blog.