Thursday, November 14, 2013

Facebook's Ken Rudin on Analytics

If you are interested in how business analytics impact your BI program, carve out forty-five minutes to watch Ken Rudin's recent TDWI keynote: "Big Data, Bigger Impact." The video is embedded below.

Rudin is the director of analytics at Facebook. In his presentation, he discusses several topics that are of interest to readers of this blog. Among them:
  • Big data technology should be used to extend your traditional BI solution, not replace it. Facebook has realized this, and is working to bring in relational technology to answer traditional business questions.
  • Successful analytics programs bring together centrally managed core data metrics with a variety of data that is not centrally managed. Rudin shares different ways he has been able to make this happen.
  • A similar balance can be attained with your organizational structure. Use of "embedded analysts" provides the business benefits of decentralization, while maintaining the efficiencies and scale advantages of a centralized program.
These are just a few of the points made during his talk. If you don't have the time to watch it now, bookmark this page for later.

You'll also want to check out Wayne Eckerson's latest book, Secrets of Analytical Leaders. (Details below.)

Big Data, Bigger Impact
Ken Rudin
TDWI World Conference, Chicago 5/6/2013




Recommended Reading

Wayne Eckerson's excellent book, Secrets of Analytical Leaders, features more insights from Ken Rudin and others.

I highly recommend this book if you are interested in analytics.

Get it from Amazon.com in paperback or Kindle editions.


Wednesday, September 25, 2013

Optimizing warehouse data for business analytics

Business analytics often integrate information from your data warehouse with other sources of data. This post looks at the best practices of warehouse design that make this possible.

I receive a lot of questions regarding the best way to structure warehouse data to support an analytics program. The answer is simple: follow the same best practices you've already learned.

I'll cover these practices from a dimensional modeling perspective. Keep in mind that they apply in any data warehouse, including those modeled in third normal form.


1. Store Granular Facts

Analytic modelers often choose sources external to the data warehouse, even when the warehouse seems to contain relevant data. The number one reason for this is insufficient detail. The warehouse contains summarized data; the analytic model requires detail.

In this situation, the analytic modeler has no choice but to look elsewhere. Worse, she may be forced to build redundant processes to transform source data and compile history. Luckily, this is not a failure of warehouse design principles; it's a failure to follow standard best practices.

Best practices of dimensional design dictate that we set the grain of base fact tables at the lowest level of detail possible. Need a daily summary of sales? Store the individual order lines. Asked to track the cost of trips? Store detail about each leg.

Dimensional solutions can contain summarized data. This takes the form of cubes, aggregates, or derived schemas. But these summaries should be derived exclusively from detailed data that also lives in the warehouse.
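
For example, a daily summary can always be derived from granular order lines after the fact. Here is a minimal sketch; the table and column names are hypothetical:

   -- Derive a daily sales summary from granular order lines.
   SELECT
     day.full_date,
     product.product_name,
     SUM( order_facts.order_dollars ) AS order_dollars
   FROM order_facts
     JOIN day     ON order_facts.day_key     = day.day_key
     JOIN product ON order_facts.product_key = product.product_key
   GROUP BY
     day.full_date,
     product.product_name

The reverse derivation -- reconstructing order lines from a daily summary -- is impossible, which is exactly the bind that drives analytic modelers away from the warehouse.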

Like all rules, this rule has exceptions. There are times when the cost/benefit calculus is such that it doesn't make sense to house highly granular data indefinitely. But more often than not, summary data is stored simply because basic best practices were not followed.

2. Build “Wide” Dimensions

The more attributes there are in your reference data (aka dimensions), the more useful source material there is for analytic discovery. So build dimensions that are full of attributes, as many as you can find.

If the grain of your fact table gives the analytics team “observations” to work on, the dimensions give them “variables.”  And the more variables there are, the better the odds of finding useful associations, correlations, or influences.

Luckily, this too is already a best practice. Unfortunately, it is one that is often misunderstood and violated. Misguided modelers frequently break things down into the essential pieces only, or model just to specific requirements.  
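
As a sketch of what "wide" means, a customer dimension might carry every descriptor the source systems can provide, not just the columns a known report requires. All names here are hypothetical:

   -- A "wide" customer dimension: every attribute is a
   -- potential variable for the analytics team.
   CREATE TABLE customer (
     customer_key    INTEGER NOT NULL,   -- surrogate key
     customer_id     VARCHAR(20),        -- natural key from the source
     customer_name   VARCHAR(100),
     birth_year      INTEGER,
     marital_status  VARCHAR(20),
     household_size  INTEGER,
     city            VARCHAR(60),
     state           VARCHAR(30),
     postal_code     VARCHAR(10),
     income_band     VARCHAR(20),
     loyalty_tier    VARCHAR(20)
   )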

3. Track Changes to Reference Data (and Use Effective Dating)

When reference data changes, too many dimensional models default to updating corresponding dimensions, because it is easier.

For example, suppose your company re-brands a product. It's still the same product, but with a new name. You may be tempted to simply update the reference data in your data warehouse. This is easier than tracking changes.  It may even seem to make business sense, because 90% of your reports require this-year-versus-last comparison by product name.

Unfortunately, some very important analysis may require understanding how consumer behavior correlates with the product name. You've lost this in your data set. Best practices help avoid these problems.

Dimensional models should track the change history of reference data. In dimensional speak, this means application of type 2 slow changes as a rule. This preserves the historic context of every fact recorded in the fact table.

In addition, every row in a dimension table should track "effective" and "expiration" dates, as well as a flag identifying rows that are current. This enables the delivery of type 1 behavior (the current value) even as we store type 2 behavior. From an analytic perspective, it also enables useful "what if" analysis.

As with all rules, again there are exceptions. In some cases, there may be good reason not to respond to changes in reference data by tracking history. But more often than not, type 1 responses are chosen for the wrong reason: because they are easier to implement.
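
As a sketch of how this pays off, suppose a product dimension carries type 2 history plus a current_flag, and facts reference the row that was current when they occurred (all names hypothetical). A self-join on the natural key delivers type 1 behavior from type 2 data:

   -- Report all sales history under each product's current name,
   -- even though the dimension preserves historic names.
   SELECT
     current_rows.product_name,
     SUM( sales_facts.sale_dollars ) AS sale_dollars
   FROM sales_facts
     JOIN product
       ON sales_facts.product_key = product.product_key
     JOIN product AS current_rows
       ON  product.product_id = current_rows.product_id  -- natural key
       AND current_rows.current_flag = 'Current'
   GROUP BY
     current_rows.product_name

Drop the self-join and group by product.product_name instead, and the same schema answers the historically accurate version of the question.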

4. Record Identifying Information, Including Alternate Identifiers

Good dimensional models allow us to trace back to the original source data. To do this, include transaction identifiers (real or manufactured) in fact tables, and maintain identifiers from source systems in dimension tables (these are called "natural keys").

Some of this is just plain necessary in order to get a dimensional schema loaded. For example, if we are tracking changes to a product name in a dimension, we may have multiple rows for a given product. The product's identifier is not a unique identifier, but we must have access to it. If we don't, it would become impossible to load a fact into the fact table.

Identifying information is also essential for business analytics. Data from the warehouse is likely to be combined with data that comes from other places. These identifiers are the connectors that allow analytic modelers to do this.  Without them, it may become necessary to bypass the warehouse.

Your analytic efforts, however, may require blending new data with your enterprise data. And that new data may not come with handy identifiers. You have a better chance blending it with enterprise data if your warehouse also includes alternate identifiers, which can be used to do matching. Include things like phone numbers, email addresses, geographic coordinates—anything that will give the analytics effort a fighting chance of linking up data sources.
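
As a sketch, blending externally sourced data with the warehouse may reduce to a simple match on one of these alternate identifiers. The tables and columns here are hypothetical:

   -- Link external sentiment data to warehouse customers by
   -- matching on an alternate identifier (email address).
   SELECT
     customer.customer_key,
     customer.customer_name,
     external_sentiment.sentiment_score
   FROM external_sentiment
     JOIN customer
       ON customer.email_address = external_sentiment.email_address

Without the email_address column in the dimension, this blend would force the analytics team to go around the warehouse.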

Summary

If you've been following the best practices of dimensional modeling, you've produced an asset that maximizes value for analytic modelers:

  • You have granular, detailed event data.
  • You have rich, detailed reference data.
  • You are tracking and time-stamping changes to reference data. 
  • You've got transaction identifiers, business keys, and alternate identifiers.  

It also goes without saying that conformed dimensions are crucial if you hope to sustain a program of business analytics.

Of course, there are other considerations that may cause an analytic modeler to turn her back on the data warehouse. Latency issues, for example, may steer them to operational solutions. Accessibility and procedural issues, too, may get in the way of the analytic process.

But from a database design perspective, the message is simple: follow those best practices!

Further Reading

You can also read more in my book, Star Schema: The Complete Reference.  If you use the links on this page to pick up a copy on Amazon, you will be helping support this blog.  

 It covers the best practices of dimensional design in depth. For example:

  • Grain, identifiers, keys and basic slow change techniques are covered in Chapter 3, "Stars and Cubes"
  • The place of summary data is covered in Chapter 14, "Derived Schemas" and Chapter 15, "Aggregates"
  • Conformance is covered in Chapter 5, "Conformed Dimensions"
  • Advanced slow change techniques are explored in Chapter 8, "More Slow Change Techniques"

Wednesday, July 17, 2013

Business Analytics and Dimensional Data

Readers of this blog frequently ask about the relationship of business analytics to the dimensional data that is recorded in data marts and the data warehouse.

Business analytics operate on data that often does not come from the data warehouse. The value of business analytics, however, is measured by its impact on business metrics that are tracked in the data warehouse. 

Business analytics may also help adjust our notion of which metrics matter the most.

The Data Warehouse and Dimensional Data

The dimensional model is the focal point of business information in the data warehouse. It describes how we track business activities and measure business performance. It may also be the foundation for a performance management program that links metrics to business goals.

Dimensional data is the definitive record of what matters to the business about activities and status. Clearly defined performance indicators (facts) are recorded consistently and cross referenced with standardized and conformed reference data (dimensions).

In this post, when I talk about "the data warehouse," I will have this dimensional data in mind.

Business Analytics

Business analytics seek to provide new insight into business activities. Analytics do not always operate on business metrics, and they don't rely exclusively on information from the data warehouse. Dimensional information may be an input, but other sources of data are also drawn upon.

The outputs of business analytics, however, aim directly at the metrics tracked by our dimensional models. Insights from analytics are used by people to move key metrics in the desired directions. These results are called impacts.

Business analytics may also help in another way. Sometimes, they help us determine which metrics are actually the most important.

A great illustration of these dynamics can be found in the business of Major League Baseball. (If you don't follow baseball, don't worry. You don't have to understand baseball to follow this example.)

Metrics in Baseball

Major league baseball has long been in the business of measurement. Followers of the game are familiar with the "box score" that summarizes each game, "standings" that illustrate the relative performance of teams, and numerous statistics that describe the performance of each player.

These metrics have precise definitions and have been recorded consistently for almost 150 years.[1] Like the metrics in your data warehouse, they are tracked systematically. Professional baseball teams can also set goals for these metrics and compare them to results, much like a scorecard in your performance management program.

How does one improve these results? If you run a baseball team, part of the answer lies in how you choose players. In the book Moneyball,[2] Michael Lewis describes how the Oakland Athletics used a set of techniques known as sabermetrics[3] to make smarter choices about which players to add to their roster.

These analytics allowed the A's to make smarter choices with measurable impact--improving performance and reducing costs. Analytics also motivated the A's to change the emphasis given to various metrics.

Business Analytics and the Oakland Athletics

The traditional approach to selecting players was focused on long-held conventional wisdom about what makes a valuable player. For example, offensive value was generally held to derive from a player's ability to make contact with the baseball and from his speed. These skills are at least partially evident in some of the standard baseball metrics -- things like batting average, stolen bases, runs batted in, and sacrifices.

The Oakland A's looked to data to refine their notion of what a valuable player looks like. How do the things players do actually contribute to a win or loss? To do this, the A's went beyond the box scores and statistics -- beyond the data warehouse, so to speak.

By studying every action that is a part of the game -- what players are on base, what kind of pitches are thrown, where the ball lands when it is hit, etc. -- the A's realized they could be smarter about assessing how a player adds value. These business analytics led to several useful conclusions:
  • Batting averages don't tell the whole story about a player's ability to get on base; for example, they exclude walks.
  • Stolen bases don't always contribute to scoring; much depends on who comes to bat next.
  • Runs batted in tell as much about who hits before a player as they do about the player himself.
  • Sacrifices, where an out is recorded but a runner advances, were found to contribute less to the outcome of a game than conventional wisdom held.
You may or may not understand these conclusions, but here is the important thing: the analytics suggested that the A's could better assess a player's impact on winning games by turning away from conventional wisdom. Contact and speed are not the best predictors of winning games. "Patience at the plate" leads to better outcomes.

Impact for the A's

By using these insights to make choices, the A's were able to select less expensive players who could make a more significant contribution to team results. These choices resulted in measurable improvement in many of the standard metrics of baseball--the win/loss ratio in particular. These insights also enabled them to deliver improved financial results.

Analytics also helped the A's in another way: they refined exactly which metrics they should be tracking. For example, in assessing offensive value, on-base percentage should be emphasized over batting average. They also created some of their own metrics to track their performance over time.

The Impact of Analytics

Business analytics tell us what to look for, what works, or what might happen. Examples are signs of impending churn, what makes a web site "sticky", patterns that might indicate fraud, and so forth.

These insights, in turn, are applied in making business decisions. These choices provide valuable impact that can be measured by tracking traditional business metrics. Examples include increased retention rates, reduced costs associated with fraud, and so forth.

These impacts are the desired outcome of the analytic program. If the analytics don't have a demonstrable impact on metrics, they are not providing value.

Business analytics can also help us revise our notion of what to track in our data warehouses, or which metrics to pay closest attention to. Number of calls to the support center, for example, may be less of an indicator of customer satisfaction than the average time to resolve an issue.

Conclusion

As you expand the scope of your BI program to include analytics, remember that your desired outcome is a positive impact on results. Move the needle on business metrics, and the analytics have done their job.

Thanks to my colleague Mark Peco, for suggesting that I use Moneyball as a way to explain analytics without revealing the proprietary insights attained by my customers. 

Notes

[1] The box score and many of these statistics were established in the mid-1800s by a sportswriter named Henry Chadwick.

[2] Moneyball by Michael Lewis (Norton, 2011).

[3] The Oakland A's are a high-profile example of the use of sabermetrics, but did not originate the concept. See wikipedia for more information.






Wednesday, June 5, 2013

In the Era of Big Data, The Dimensional Model is Essential

Don't let the hype around big data lead you to believe your BI program is obsolete. 

I receive a lot of questions about "big data."  Here is one:
We have been doing data warehousing using the Kimball method and dimensional modeling for several years and are very successful (thanks for your 3 books, btw). However, these days we hear a lot about Big Data Analytics, and people say that Big Data is the future trend of BI, and that it will replace data warehousing, etc.

Personally I don't believe that Big Data is going to replace Data Warehousing but I guess that it may still bring certain value to BI.  I'm wondering if you could share some thoughts.

"Big data" is the never-ending quest to expand the ways in which our BI programs deliver business value.

As we expand the scope of what we deliver to the business, we must be able to tie our discoveries back to business metrics and measure the impact of our decisions. The dimensional model is the glue that allows us to achieve this.

Unless you plan to stop measuring your business, the dimensional model will remain essential to your BI program. The data warehouse remains relevant as a means to instantiate the information that supports this model. Reports of its death have been greatly exaggerated.

Big Data

"Big Data" is usually defined as a set of data management challenges known as "the three V's" -- volume, velocity and variety. These challenges are not new. Doug Laney first wrote about the three V's in 2001 -- twelve years ago.1And even before that, we were dealing with these problems.

Consider the first edition of The Data Warehouse Toolkit, published by Ralph Kimball in 1996.[2] For many readers, his "grocery store" example provided their first exposure to the star schema. This schema captured aggregated data! The 21 GB fact table was a daily summary of sales, not a detailed record of point-of-sale transactions. Such a data set was presumably too large at the time.

That's volume, the first V, circa 1996.

In the same era, we were also dealing with velocity and variety. Many organizations were moving from monthly, weekly or daily batch loads to real-time or near-real time loads. Some were also working to establish linkages between dimensional data and information stored in document repositories.

New business questions

As technology evolves, we are able to address an ever expanding set of business questions.

Today, it is not unreasonable to expect the grocery store's data warehouse to have a record for every product that moves across the checkout scanner, measured in terabytes rather than gigabytes. With this level of detail, market basket analysis is possible, along with longitudinal study of customer behavior.

But of course, the grocery store is now looking beyond sales to new analytic possibilities. These include tracking the movement of product through the supply and distribution process, capturing interaction behavior of on-line shoppers, and studying consumer sentiment.

We still measure our businesses

What does this mean for the dimensional model? As I've posted before, a dimensional model represents how we measure the business. That's not something we're going to stop doing. Traditional business questions remain relevant, and the information that supports them is the core of our BI solution.

At the same time, we need to be able to link this information to other types of data. For a variety of reasons (V-V-V), some of this information may not be stored in a relational format, and some may not be a part of the data warehouse.

Making sense of all this data requires placing it in the context of our business objectives and activities.

To do this, we must continue to understand and capture business metrics, record transaction identifiers, integrate around conformed dimensions, and maintain associated business keys. These are long established best practices of dimensional modeling.

By applying these dimensional techniques, we can (1) link insights from our analytics to business objectives and (2) measure the impact of resultant business decisions. If we don't do this, our big data analytics become a modern-day equivalent of the stove-pipe data mart.

The data warehouse

The function of the data warehouse is to instantiate the data that supports measurement of the business. The dimensional model can be used toward this aim (think: star schema, cube.)

The dimensional model also has other functions. It is used to express information requirements, to guide program scope, and to communicate with the business. Technology may eventually get us to a point where we can jettison the data warehouse on an enterprise scale,[3] but these other functions will remain essential. In fact, their importance becomes elevated.

In any architecture that moves away from physically integrated data, we need a framework that allows us to bring that data together with semantic consistency. This is one of the key functions of the dimensional model.

The dimensional model is the glue that is used to assemble business information from distributed data.

Organizations that leverage a bus architecture already understand this. They routinely bring together information from separate physical data marts, a process supported by the dimensional principle of conformance. Wholesale elimination of the data warehouse takes things one step further.

Notes
  1. Doug Laney's first published treatment of "The Three V's" can be found on his blog.
  2. The first edition is now out of print; this discussion appeared in Chapter 2, "The Grocery Store." Insight into the big data challenges of 1996 can be found in Chapter 17, "The Future."
  3. I think we are a long time away from being able to do this on an enterprise scale. When we do get there, it will be as much due to master data management as it is due to big data or virtualization technologies. I'll discuss virtualization in some future posts.
More reading

Previous posts have dealt with this topic.
  • In Big Data and Dimensional Modeling (4/20/2012) you can see me discuss the impact of new technologies on the data warehouse and the importance of the dimensional model.

Tuesday, April 30, 2013

The Role of the Dimensional Model in Your BI Program

The dimensional model delivers value long before a database is designed or built, and even when no data is ever stored dimensionally. While it is best known as a basis for database design, its other roles may have more important impacts on your BI program.

The dimensional model plays four key roles in Business Intelligence:
  1. The dimensional model is the ideal way to define requirements, because it describes how the business is measured
  2. The dimensional model is ideal for managing scope because it communicates to business people (functionality) and technical people (complexity) 
  3. The dimensional model is ideal as a basis for data mart design because it provides ease of use and high performance
  4. The dimensional model is ideal as a semantic layer because it communicates in business terms
Information Requirements

The dimensional model is best understood as an information model, rather than a data model. It describes business activities the same way people do: as a system of measurement. This makes it the ideal form to express information needs, regardless of how information will be stored.

A dimensional model defines business metrics or performance indicators in detail, and captures the attendant dimensional context. (For a refresher, see the post What is a Dimensional Model from 4/27/2010.) Metrics are grouped based on shared granularity, cross referenced to shared reference data, and traced to data sources.

This representation is valuable because business questions are constantly changing. If you simply state them, you produce a model with limited shelf life. If you model answers to the question of today, you've provided perishable goods.

A dimensional model establishes information requirements that endure, even as questions change. It provides a strong foundation for multiple facets of BI:
  • Performance management, including dashboards and scorecards
  • Analytic processing, including OLAP and ad hoc analysis
  • Reporting, including both enterprise and operational reports
  • Advanced analytics, including business analytics, data mining and predictive analytics
All these disciplines center on business metrics. It should be no surprise that when Howard Dresner coined the term Business Intelligence, his definition referenced "facts and fact based systems." It's all about measurement.

Program Roadmap and Project Scope

A dimensional model can be used to describe scope because it communicates to two important audiences.
  • Business people: functionality. The dimensional model describes the measurement of a business process, reflecting how the process is evaluated by participants and observers. It communicates business capability.
  • Technical personnel: level of effort. A dimensional model has technical implications: it determines the data sources that must be integrated, how information must be integrated and cleansed, and how queries or reports can be built. In this respect, it communicates level of effort.
These dual perspectives make the dimensional design an ideal centerpiece for managing the roadmap for your BI program. Fully documented and mapped to data sources, a dimensional model can be divided into projects and prioritized. It is a blueprint that can be understood by all interested parties. A simple conformance matrix communicates both intended functionality and technical level of effort for each project.
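
As a sketch, a conformance matrix is simply a grid of business processes (fact tables) plotted against shared dimensions. The processes and dimensions below are hypothetical:

                    Day   Customer   Product   Salesrep
   Orders            X       X          X         X
   Shipments         X       X          X
   Returns           X       X          X
   Support calls     X       X

Each row can be scoped as a project, and each X that lands on an already-built dimension represents reuse rather than new effort.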

At the project level, a dimensional design can be used as the basis for progress reporting. It can also serve as an unambiguous arbiter of change requests. Changes that add data sources or impact grain, for example, are considered out of scope. This is particularly useful for organizations that employ iterative methodologies, but its simplicity makes it easy to reconcile with any development methodology.

Database Design

The dimensional model is best known as the basis for database design. The term "star schema" is far more widely recognized than "dimensional model" (a fact that influenced the name of my most recent book).

In fact, the dimensional model is the de facto standard for data mart design, and many organizations use it to shape the entire data warehouse. It has an important place in W.H. Inmon's Corporate Information Factory, in Ralph Kimball's dimensional bus architecture, and even in one-off data marts that lack an enterprise focus.

Implemented in a relational database, the dimensional model becomes known as a star schema or snowflake. Implemented in a multidimensional database, it is known as a cube. These implementations offer numerous benefits. They are:
  • Easily understandable by business people
  • Extraordinarily flexible from a reporting and analysis perspective
  • Adaptable to change
  • Capable of very high performance
Presentation and the Semantic Layer

A dimensional representation is the ideal way to present information to business people, regardless of how it is actually stored. It reflects how people think about the business, so it is used to organize the catalog of items they can call on for analysis.

Many business intelligence tools are architected around this concept, allowing a semantic layer to sit between the user and database tables. The elements with which people can frame questions are categorized as facts and dimensions. One need not know what physical data structures lie beneath.

Even the earliest incarnations of the semantic layer leveraged this notion. Many organizations used these tools to impose a dimensional view directly on top of operational data. Today, semantic layers are commonly linked to dimensional data marts.

A dimensional representation of business activity is the starting point for a variety of BI activities:
  • Building enterprise reports
  • Defining performance dashboards
  • Performing ad hoc analysis
  • Preparing data for an analytic model
The concept of dimensional presentation is receiving renewed attention as federated approaches promise the construction of virtual solutions rather than physical ones.

Further information

Although I've touched on these topics before, I wanted to bring them together in a single article. In the coming months, I will refer back to these concepts as I address common questions about big data, agile BI and federation.

In the meantime, please help support this blog by picking up a copy of my latest book.

Thursday, March 28, 2013

Where To Put Dates

A reader is trying to decide if certain dates should be modeled as dimensions of a fact table or as attributes of a dimension table.

I have two attributes that I'm really not sure where best to place: 'Account Open Date' and 'Account Close Date'. In my model, I have [Dim Accounts] as a dimension and [F transact] as a fact table containing accounts transactions. An account can have many transactions, so the dates have different cardinality than the transactions.
  • I thought to put the dates in the Accounts dimension, but this led to problems: difficulties in calculations related to those dates--like if I want to get the transactions of the accounts that opened in the 4th quarter of 2012, or to get the difference between the date of last transaction and the account opening date, and so on.  In other words I can't benefit from the Date dimension and the hierarchies it contains.
  • So I thought about placing those dates in the fact table, but what made me hesitate is that the granularity of those dates is higher than the fact table, so there will be a lot of redundancy.
- Ahmad
Bethlehem, Palestine

This is a common dilemma. Many of our most important dimensions come with a number of possible dates that describe them.

Ahmad is thinking about this problem in the right way: how will my choice affect my ability to study the facts?

It turns out that (1) this is not an either/or question, and (2) granularity is not an issue.

Dates that Describe Important Dimensions


The dates are clearly useful dimension attributes.  I suggest that you keep them in the dimension in one of two ways, which I will discuss in a moment.

First, though, let's look at what happens if the dates are only represented as foreign keys in the fact table:

If the dates are not stored in the dimension, the open and close date are only associated with the Account dimension through the fact table.  The fact table only has records when transactions occur. So it becomes harder to find a list of open accounts, or to find the set of accounts that were active as of a particular date.

An additional factless fact table may help here, but it is far simpler to store the dates in the dimension.
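
With the dates in the dimension, a list of accounts open on a given day is a single-table query. A sketch with hypothetical names, using 12/31/9999 as the placeholder for accounts that have not closed:

   -- Accounts that were open on December 1, 2012.
   SELECT
     account_id,
     account_name
   FROM accounts
   WHERE open_date  <= '12/1/2012'
     AND close_date >  '12/1/2012'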

Date as Attribute vs Outrigger

If you plan to represent the dates in your dimension table, you have two choices. You can model the dates themselves as attributes, or you can model a pair of day keys in your account dimension. Either approach is acceptable.

The first option does not expose the richness of your full day dimension for analytic usage, but it may be simpler to use for many business questions. Other questions (like your quarterly example) will require a bit more technical knowledge, but most BI tools help with this.

The second option transforms your star into a (partial) snowflake. The day dimension becomes known as an "outrigger" when it connects to your account dimension. This allows you to explicitly leverage all the attributes of your Day dimension. The cost is some extra joins, which may be confusing and may also disrupt star-join optimization.
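
A sketch of the outrigger in action, with hypothetical names: the account dimension carries an open_day_key, and the day outrigger contributes its full calendar hierarchy:

   -- Count accounts opened in each fiscal quarter, reaching the
   -- day dimension through the account dimension's open_day_key.
   SELECT
     day.fiscal_year,
     day.fiscal_quarter,
     COUNT(*) AS accounts_opened
   FROM accounts
     JOIN day ON accounts.open_day_key = day.day_key
   GROUP BY
     day.fiscal_year,
     day.fiscal_quarter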

Making the correct choice here involves balancing several perspectives:

  • The business view and usability
  • The capabilities of your BI software front end
  • The capabilities of your DBMS back end software

Day Keys in the Fact Table

Having said all that, it is also useful to represent at least one of these dates in the fact table. The account open date may be a good dimensional perspective for the analysis of facts.

As you observed, this date has different cardinality than the transactions. The account open date for an account remains constant, even if it has dozens of transactions in your fact table. But the fact that it has low cardinality should not stop you from choosing it as a major dimension of your star!

Your account transaction fact table may have a pair of day keys -- one for the date the account was opened, and one for the date of the transaction.

If you choose to do this, the account dimension itself should include the open date. The outrigger solution is not necessary since your fact table has full access to the Day dimension.
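
This arrangement makes the quarterly question from Ahmad's example straightforward. A sketch with hypothetical names, where the fact table's two day keys play separate roles:

   -- Total transaction amounts for accounts opened in Q4 of 2012.
   SELECT
     SUM( transaction_facts.transaction_amount ) AS transaction_amount
   FROM transaction_facts
     JOIN day AS open_day
       ON transaction_facts.account_open_day_key = open_day.day_key
   WHERE open_day.calendar_year    = 2012
     AND open_day.calendar_quarter = 'Q4'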

Note that I do not recommend this for your account closed date, because that date changes. Storing it as a key for every transaction against an account would require a lot of updates to fact table rows once the account becomes closed.

More Information

I've touched on this topic in the past.


Although I edited it out of Ahmad's question, he also cited an issue surrounding the use of NULL for accounts that do not have a closed date. On that topic, see the recent post Avoid NULL in Dimensions (1/7/2013).



Support this Blog

I maintain this blog in my spare time.  If you find it helpful, you can help support it by picking up a copy of my book:  Star Schema: The Complete Reference.

Use the links on this blog to get a copy of this or any of the other recommended books, and you will be helping to keep this effort going.

Wednesday, March 6, 2013

Learn Dimensional Modeling With Chris in 2013

Several courses in dimensional modeling are already scheduled for 2013, and Chris is also slated to teach several TDWI courses at conferences and other events. For full details, check the sidebar of this blog. (If using a reader, you will need to click through.)

These don't work for you?  Check back from time to time.  As new cities and dates are added, they will appear on the sidebar.

Wednesday, February 13, 2013

Optional Relationships Without NULL

Optional relationships are important in dimensional models. This post shows you how to support them without resorting to NULL keys in the fact table.


Last month, we looked at the impact of allowing dimension attributes to contain NULL. In this post, we'll look at the impact of allowing foreign keys in fact tables to contain NULL.

Once again, NULL will prove problematic. What should be simple queries will require an alternate join syntax, multiple comparisons and nested parentheses.

The preferred solution is to establish special-case rows in dimensions. These rows can be referenced by fact table rows that do not have corresponding dimension detail.

Recap on NULL

NULL is a special SQL keyword used to denote the absence of data.

Last month, I explained why we avoid allowing dimension attributes to contain NULL. NULL fails standard comparisons, necessitating query predicates containing numerous tests which are carefully balanced within sets of parentheses.

For the full story, and the preferred solution, see last month's post: Avoid NULL in Dimensions (1/7/2013).

But that was NULL dimension attributes. What about NULL foreign keys?

Optional Relationships and NULL

Sometimes, the relationship between a fact table and a dimension is optional. This means some rows in the fact table cannot be associated with the dimension.

In an ER model, the traditional solution is to store NULL foreign keys for such rows. Let's take a look at what would happen if we did that in a dimensional model.

You may have noticed that in some stores, the cashier asks you if a salesperson helped you. If so, they record that info. So some sales have a salesperson, some do not.

With an optional relationship to Salesrep, your star schema might look like this:



The dotted line represents an optional relationship. (In other notations, optionality is represented by including circles at the ends of relationship lines.)  For fact table rows with no salesperson, salesrep_key contains NULL.

Usability Harmed by NULL Foreign Keys

When a foreign key can contain NULL, we once again face difficulties when answering some simple business questions. As before, NULL complicates queries because it requires a comparison syntax that is different from the syntax for standard values. This time, we'll also be facing different join syntax.

For example, using the sales star, you might like to see all sales where a manager was not involved.  Assuming the Salesrep table has a column called salesrep_type, you would be forgiven for adding this to your query:

     WHERE salesrep.salesrep_type != 'Manager'

This predicate is not sufficient to find all sales without managerial involvement.

Assuming a standard join is linking sales_facts to salesrep, rows with no salesrep will not appear in the query results. This happens because, for any fact without a salesrep_key, the join to salesrep will fail. An outer join must be used to help facts with no salesreps survive the join.

Even when an outer join is employed, the above constraint remains insufficient. That's because a side effect of the outer join is to create NULL salesreps in the data set.

In addition to an outer join, we must supplement the constraint above:

   WHERE
     ( salesrep.salesrep_type != 'Manager' OR
       salesrep.salesrep_type IS NULL
     ) AND...

NULL keys force us to choose the correct join type, perform multiple comparisons against the same dimension attribute, and carefully balance parentheses.

A dimensional model is meant to be understandable and usable from a business perspective. NULL keys do not fit the bill.

Use a Special Case Row


When there is an optional relationship between fact and dimension, best practices call for a special row in the dimension. This row is referenced by facts that would otherwise require a NULL foreign key.

For example, we add a "not applicable" row to our salesrep table, like so:


salesrep_key   row_type      salesrep_type   salesrep_name
0              No Salesrep   n/a             n/a
100            Salesrep      Associate       Paul Cook
101            Salesrep      Associate       Steve Jones
201            Salesrep      Manager         Glen Matlock


Now we don't need outer joins, and we don't need to bend over backwards to perform simple comparisons.
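
The query for sales without managerial involvement collapses back to an ordinary inner join and a single comparison. A sketch, with a hypothetical fact column:

   -- Facts with no salesperson carry salesrep_key 0, join to the
   -- "No Salesrep" row, and survive the comparison.
   SELECT
     SUM( sales_facts.sale_dollars ) AS sale_dollars
   FROM sales_facts
     JOIN salesrep
       ON sales_facts.salesrep_key = salesrep.salesrep_key
   WHERE salesrep.salesrep_type != 'Manager'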

Further Reading

The technique described in this post can be extended to handle other situations (invalid data, future events, or reference data that becomes available after facts).

Read more about these possibilities in Chapter 6 of my book, Star Schema: The Complete Reference.

Also check out the previous post, Avoid NULL in Dimensions (1/7/2013).

Edited 2/13/13 5:30pm to correct mismatched table headings. Thanks for the emails.

Monday, January 7, 2013

Avoid NULL in Dimensions

Best practices in dimensional modeling forbid the use of NULL.  In this post, you'll learn why.

In any relational database, allowing NULL severely hampers our ability to define query predicates or filters, to use the common term.

For a dimensional modeler, this is anathema.  Data marts are meant to be understood and used by business people. Instead, we take a different approach to handling absent data.

About NULL

Any popular RDBMS product will allow you to store a row of data in which one or more columns do not contain a value. At first glance, this may seem perfectly reasonable. After all, there are times when something is missing, unknown, or simply does not apply.

In order to handle the potential absence of data, vendors support the concept of NULL. This special SQL keyword is used to denote the absence of data.

The theorists tell us that NULL introduces a variety of problems:

  • It requires N-valued logic. 
  • It mixes data with metadata. 
  • It introduces a raft of complications when writing queries.
All true, but it is this last point that is the most important for us.

The Problem with NULL

Suppose we have a dimension table that represents an insurance policy. It includes columns that capture the policy's effective_date and cancellation_date. Active policies do not have a cancellation date, so it might seem handy that we don't have to store one for each policy.

Some of the data in the table might look like this:


policy_key   policy_id   policy_type   effective_date   cancellation_date
12882        A111000     Home          1/1/10           12/31/12
12910        B222000     Automotive    3/12/11          1/31/13
13011        C333000     Automotive    6/7/12           [NULL]
13255        D444000     Home          7/1/12           12/5/12


(The string "[NULL]" is not stored in the table -- that's just how we indicate it in written examples.)

Now lets say you want to qualify a query so that it only includes auto policies that were active on December 1, 2012.

In your query, you might try to include something like this:


WHERE
  effective_date <= '12/1/2012'
  AND cancellation_date >= '12/1/2012'
  AND policy_type = 'Automotive'
  AND ...


Unfortunately, this is not going to pick up Policy C333000, although it was active on the day in question.

Why? The cancellation date for policy C333000 is NULL.  NULL is not greater than the date in question, it is not less than the date in question, nor is it equal to the date in question. All these comparisons will fail.

To correctly answer the business question, it will be necessary to check for NULL cancellation dates:

WHERE
  ( effective_date <= '12/1/2012'
    AND cancellation_date >= '12/1/2012' )
  OR
  ( effective_date <= '12/1/2012'
    AND cancellation_date IS NULL )


And remember, we only wanted automotive policies.  So we'll need to wrap parentheses around the above fragment, then add the additional constraint. Now we have:

WHERE
  (
    ( effective_date <= '12/1/2012'
      AND cancellation_date >= '12/1/2012' )
    OR
    ( effective_date <= '12/1/2012'
      AND cancellation_date IS NULL )
  )
  AND policy_type = 'Automotive'
  AND...



Not many business people will be able to define this query, even if they are using a SQL-generating BI tool. Many developers may make a mistake as well.

An Alternative to NULL

In lieu of allowing NULL, dimensional designers choose (and document) special values that will be used when there is no data. For strings, this is often "Not Applicable" or "N/A". For a date, it is typically 12/31/9999.

This solution makes our queries easier to write. A 12/31/9999 cancellation date for policy C333000, for example, will allow us to use the original WHERE clause from the example.
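
The substitution is made once, during the load, rather than in every query. A sketch of the idea with a hypothetical staging table (surrogate key assignment omitted for brevity):

   -- Replace absent cancellation dates with the documented
   -- special value as policy rows are loaded.
   INSERT INTO policy (policy_id, policy_type,
                       effective_date, cancellation_date)
   SELECT
     policy_id,
     policy_type,
     effective_date,
     COALESCE( cancellation_date, '12/31/9999' )
   FROM stage_policy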

Note that the solution is imperfect. Using simple date arithmetic, for example, Policy C333000 now appears to be almost 8,000 years old!

Note, too, that we have not addressed the theoretical objections to NULL. We are still mixing data with metadata; we've just chosen a more workable placeholder for the absence of data.

Learn More

For more on NULL and the dimensional model, see Chapter 6 of my book Star Schema: The Complete Reference.

Use the links on this page to pick up a copy, and you will also be helping to support this blog!

If you want to learn more about relational theory in general, including NULL, I recommend Chris Date's An Introduction to Database Systems (8th Edition).