Monday, January 7, 2013

Avoid NULL in Dimensions

Best practices in dimensional modeling forbid the use of NULL.  In this post, you'll learn why.

In any relational database, allowing NULL severely hampers our ability to define query predicates or filters, to use the common term.

For a dimensional modeler, this is anathema.  Data marts are meant to be understood and used by business people. Instead, we take a different approach to handling absent data.

About NULL

Any popular RDBMS product will allow you to store a row of data in which one or more columns do not contain a value. At first glance, this may seem perfectly reasonable. After all, there are times when  something is missing, unknown, or simply does not apply.

Image from Wikimedia
Licensed under CC 3.0
In order to handle the potential absence of data, vendors support the concept of NULL. This special SQL keyword is used to denote the absence of data.

The theorists tell us that NULL introduces a variety of problems:

  • It requires N-valued logic. 
  • It mixes data with metadata. 
  • It introduces a raft of complications when writing queries.
All true, but it is this last point that is the most important for us.

The Problem with NULL

Suppose we have a dimension table that represents an insurance policy. It includes columns that capture the policy's effective_date and the cancellation_date.  Active policies do not have a cancellation date, so it might come in handy that we don't have to store one for each policy.

Some of the data in the table might look like this:


(The string "[NULL]" is not stored in the table -- that's just how we indicate it in written examples.)

Now lets say you want to qualify a query so that it only includes auto policies that were active on December 1, 2012.

In your query, you might try to include something like this:

  effective_date <= 12/1/2012
  AND cancellation_date >= 12/1/2012
  AND policy_type = “Automotive”
  AND ...

Unfortunately, this is not going pick up Policy C333000, although it was active on the day in question.

Why? The cancellation date for policy C333000 is NULL.  NULL is not greater than the date in question, it is not less than the date in question, nor is it equal to the date in question. All these comparisons will fail.

To correctly answer the business question, it will be necessary to check for NULL cancellation dates:

  ( effective_date     <= 12/1/2012
    AND cancellation_date >= 12/1/2012 )
  ( effective_date <= 12/1/2012
    AND cancellation_date IS NULL ) 

And remember, we only wanted automotive policies.  So we'll need to wrap parentheses around the above fragment, then add the additional constraint. Now we have:

    ( effective_date     <= 12/1/2012
      AND cancellation_date >= 12/1/2012 )
    ( effective_date <= 12/1/2012
      AND cancellation_date IS NULL )
  AND policy_type = "Automotive"

Not many business people will be able to define this query, even if they are using a SQL-generating BI tool. Many developers may make a mistake as well.

Alternative to the Null

In lieu of allowing NULL, dimensional designers choose (and document) special values that will be used when there is no data. For strings, this is often "Not Applicable" or "N/A". For a date, it is typically 12/31/9999.

This solution makes our queries easier to write. A 12/31/9999 cancellation date for policy C333000, for example, will allow us to use the original WHERE clause from the example.

Note that the solution is imperfect. Using simple date arithmetic, for example, Policy C333000 now appears to be almost 8,000 years old!

Note, too, that we have not addressed the theoretical objections to NULL. We are still mixing data with metadata; we've just chosen a more workable placeholder for the absence of data.

Learn More

For more on NULL and the dimensional model, see Chapter 6 of my book Star Schema: The Complete Reference.

Use the links on this page to pick up a copy, and you will also be helping to support this blog!

If you want to learn more about relational theory in general, including the NULL, I recommend Chris Date's An Introduction to Database Systems (8th Edition).