Wednesday, June 5, 2013

In the Era of Big Data, The Dimensional Model is Essential

Don't let the hype around big data lead you to believe your BI program is obsolete. 

I receive a lot of questions about "big data."  Here is one:
We have been doing data warehousing using Kimball method and dimensional modeling for several years and are very successful (thanks for your 3 books, btw). However, these days we hear a lot about Big Data Analytics, and people say that Big Data is the future trend of BI, and that it will replace data warehousing, etc.

Personally I don't believe that Big Data is going to replace Data Warehousing but I guess that it may still bring certain value to BI.  I'm wondering if you could share some thoughts.

"Big data" is the never-ending quest to expand the ways in which our BI programs deliver business value.

As we expand the scope of what we deliver to the business, we must be able to tie our discoveries back to business metrics and measure the impact of our decisions. The dimensional model is the glue that allows us to achieve this.

Unless you plan to stop measuring your business, the dimensional model will remain essential to your BI program. The data warehouse remains relevant as a means to instantiate the information that supports this model. Reports of its death have been greatly exaggerated.

Big Data

"Big Data" is usually defined as a set of data management challenges known as "the three V's" -- volume, velocity and variety. These challenges are not new. Doug Laney first wrote about the three V's in 2001 -- twelve years ago.1 And even before that, we were dealing with these problems.

Consider the first edition of The Data Warehouse Toolkit, published by Ralph Kimball in 1996.2 For many readers, his "grocery store" example provided their first exposure to the star schema. This schema captured aggregated data! The 21 GB fact table was a daily summary of sales, not a detailed record of point-of-sale transactions. A transaction-level data set was presumably too large to manage at the time.

That's volume, the first V, circa 1996.

In the same era, we were also dealing with velocity and variety. Many organizations were moving from monthly, weekly or daily batch loads to real-time or near-real-time loads. Some were also working to establish linkages between dimensional data and information stored in document repositories.

New business questions

As technology evolves, we are able to address an ever-expanding set of business questions.

Today, it is not unreasonable to expect the grocery store's data warehouse to have a record for every product that moves across the checkout scanner, in a fact table measured in terabytes rather than gigabytes. With this level of detail, market basket analysis is possible, along with longitudinal study of customer behavior.

But of course, the grocery store is now looking beyond sales to new analytic possibilities. These include tracking the movement of product through the supply and distribution process, capturing interaction behavior of on-line shoppers, and studying consumer sentiment.

We still measure our businesses

What does this mean for the dimensional model? As I've posted before, a dimensional model represents how we measure the business. That's not something we're going to stop doing. Traditional business questions remain relevant, and the information that supports them is the core of our BI solution.

At the same time, we need to be able to link this information to other types of data. For a variety of reasons (V-V-V), some of this information may not be stored in a relational format, and some may not be a part of the data warehouse.

Making sense of all this data requires placing it in the context of our business objectives and activities.

To do this, we must continue to understand and capture business metrics, record transaction identifiers, integrate around conformed dimensions, and maintain associated business keys. These are long-established best practices of dimensional modeling.

By applying these dimensional techniques, we can (1) link insights from our analytics to business objectives and (2) measure the impact of resultant business decisions. If we don't do this, our big data analytics become a modern-day equivalent of the stove-pipe data mart.
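As a minimal sketch of the linkage described above, consider how a preserved business key lets an analytic result produced outside the warehouse be tied back to a business metric. Everything here is hypothetical and for illustration only: the product dimension, the sales facts, the SKU values, and the sentiment scores (imagined as the output of a text-analytics job run against non-relational data).

```python
from collections import defaultdict

# Hypothetical product dimension: surrogate key -> attributes.
# The "sku" attribute is the business key carried over from the source system.
product_dim = {
    1: {"sku": "A-100", "brand": "Acme"},
    2: {"sku": "B-200", "brand": "Bolt"},
}

# Hypothetical sales facts, which reference the dimension by surrogate key.
sales_facts = [
    {"product_key": 1, "sales_dollars": 120.0},
    {"product_key": 1, "sales_dollars": 80.0},
    {"product_key": 2, "sales_dollars": 50.0},
]

# Hypothetical sentiment scores produced outside the warehouse,
# tagged with the same business key. "Z-999" has no sales on record.
sentiment_by_sku = {"A-100": 0.62, "B-200": -0.15, "Z-999": 0.90}


def sales_with_sentiment(product_dim, sales_facts, sentiment_by_sku):
    """Total sales per SKU, then attach the sentiment score for that SKU."""
    sales_by_sku = defaultdict(float)
    for fact in sales_facts:
        # Resolve the surrogate key to the business key via the dimension.
        sku = product_dim[fact["product_key"]]["sku"]
        sales_by_sku[sku] += fact["sales_dollars"]
    return {
        sku: {"sales_dollars": total, "sentiment": sentiment_by_sku.get(sku)}
        for sku, total in sales_by_sku.items()
    }


linked = sales_with_sentiment(product_dim, sales_facts, sentiment_by_sku)
for sku, row in sorted(linked.items()):
    print(sku, row)
```

Without the business key in the dimension, the sentiment scores would have nothing to join to, and the analysis would stand apart from the sales it is meant to explain.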

The data warehouse

The function of the data warehouse is to instantiate the data that supports measurement of the business. The dimensional model can be used toward this aim (think: star schema, cube).

The dimensional model also has other functions. It is used to express information requirements, to guide program scope, and to communicate with the business. Technology may eventually get us to a point where we can jettison the data warehouse on an enterprise scale,3 but these other functions will remain essential. In fact, their importance becomes elevated.

In any architecture that moves away from physically integrated data, we need a framework that allows us to bring that data together with semantic consistency. This is one of the key functions of the dimensional model.

The dimensional model is the glue that is used to assemble business information from distributed data.

Organizations that leverage a bus architecture already understand this. They routinely bring together information from separate physical data marts, a process supported by the dimensional principle of conformance. Wholesale elimination of the data warehouse takes things one step further.
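To make the idea concrete, here is a rough sketch of how two separate data marts can be combined through a conformed dimension attribute. It assumes each mart can be summarized independently by the same attribute (here, a hypothetical "brand" from a conformed product dimension) and that the summaries are then merged on that attribute. The mart contents, the measure names, and the helper functions are invented for illustration.

```python
from collections import defaultdict


def summarize(fact_rows, dim_attr, measure):
    """Aggregate one mart's measure by a conformed dimension attribute."""
    totals = defaultdict(float)
    for row in fact_rows:
        totals[row[dim_attr]] += row[measure]
    return dict(totals)


def drill_across(left, right, dim_attr, left_name, right_name):
    """Merge two summaries on the shared attribute (a full outer merge)."""
    keys = sorted(set(left) | set(right))
    return [
        {dim_attr: k, left_name: left.get(k), right_name: right.get(k)}
        for k in keys
    ]


# Hypothetical rows from two physically separate marts. Each row is shown
# already joined to the conformed product dimension, so "brand" has the
# same meaning and the same values in both places.
sales_mart = [
    {"brand": "Acme", "sales_dollars": 120.0},
    {"brand": "Acme", "sales_dollars": 80.0},
    {"brand": "Bolt", "sales_dollars": 50.0},
]
shipments_mart = [
    {"brand": "Acme", "units_shipped": 10},
    {"brand": "Acme", "units_shipped": 20},
    {"brand": "Crest", "units_shipped": 12},
]

sales = summarize(sales_mart, "brand", "sales_dollars")
shipments = summarize(shipments_mart, "brand", "units_shipped")
report = drill_across(sales, shipments, "brand", "sales_dollars", "units_shipped")

for line in report:
    print(line)
```

The merge only works because both marts agree on what "brand" means and how its values are spelled. That agreement is the conformance, and nothing in the sketch requires the two marts to live in the same database.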

Notes
  1. Doug Laney's first published treatment of "The Three V's" can be found on his blog.
  2. This discussion appeared in Chapter 2, "The Grocery Store," of the first edition, which is now out of print.  Insight into the big data challenges of 1996 can be found in Chapter 17, "The Future."
  3. I think we are a long time away from being able to do this on an enterprise scale. When we do get there, it will be as much due to master data management as it is due to big data or virtualization technologies. I'll discuss virtualization in some future posts.

More reading

Previous posts have dealt with this topic.
  • In Big Data and Dimensional Modeling (4/20/2012) you can see me discuss the impact of new technologies on the data warehouse and the importance of the dimensional model.