Showing posts with label NULL. Show all posts
Showing posts with label NULL. Show all posts

Thursday, March 28, 2013

Where To Put Dates

A reader is trying to decide if certain dates should be modeled as dimensions of a fact table or as attributes of a dimension table.

I have two attributes that I'm really not sure where is the best place to place:'Account Open Date' and 'Account Close Date'. In my model, I have [Dim Accounts] as a dimension and [F transact] as a fact table containing accounts transactions. An account can have many transactions, so the dates have different cardinality than the transactions.
  • I thought to put the dates in the Accounts dimension, but this led to problems: difficulties in calculations related to those dates--like if I want to get the transactions of the accounts that opened in the 4th quarter of 2012, or to get the difference between the date of last transaction and the account opening date, and so on.  In other words I can't benefit from the Date dimension and the hierarchies it contains.
  • So I though about placing those dates in the fact table, but what made me hesitate is that the granularity of those dates is higher than the fact table, so there will be a lot of redundancy.
- Ahmad
Bethlehem, Palestine

This is a common dilemma. Many of our most important dimensions come with a number of possible dates that describe them.

Ahmad is thinking about this problem in the right way: how will my choice affect my ability to study the facts?

It turns out that (1) this is not an either/or question, and (2) granularity is not an issue.

Dates that Describe Important Dimensions

Image licensed via Creative Commons 2.0

The dates are clearly useful dimension attributes.  I suggest that you keep them in the dimension in one of two ways, which I will discuss in a moment.

First, though, lets look at what happens if the dates are only represented as foreign keys in the fact table:

If the dates are not stored in the dimension, the open and close date are only associated with the Account dimension through the fact table.  The fact table only has records when transactions occur. So it becomes harder to find a list of open accounts, or to find the set of accounts that were active as of a particular date.

An additional factless fact table may help here, but it is far more simple to store the dates in the dimension.

Date as Attribute vs Outrigger

If plan to represent the dates in your dimension table, you have two choices. You can model the dates themselves as attributes, or you can model a pair of day keys in your account dimension. Either approach is acceptable.

The first option does not expose the richness of your full day dimension for analytic usage, but it may be simpler to use for many business questions. Other questions (like your quarterly example) will require a bit more technical knowledge, but most BI tools help with this.

The second option transforms your star into a (partial) snowflake. The day dimension becomes known as an "outrigger" when it connects to your account dimension. This allows you to explicitly leverage all the attributes of your Day dimension. The cost is some extra joins, which may be confusing and may also disrupt star-join optimization.

Making the correct choice here involves balancing several perspectives:

  • The business view and usability
  • The capabilities of your BI software front end
  • The capabilities of your DBMS back end software

Day Keys in the Fact Table

Having said all that, it is also useful to represent at least one of these dates in the fact table. The account open date may be a good dimensional perspective for the analysis of facts.

As you observed, this date has different cardinality than the transactions. The account open date for an account remains constant, even if it has dozens of transactions in your fact table. But the fact that it has low cardinality should not stop you from choosing it as a major dimension as your star!

Your account transaction fact table may have a pair of day keys -- one for the date the account was opened, and one for the date of the transaction.

If you choose to do this, the account dimension itself should include the open date. The outrigger solution is not necessary since your fact table has full access to the Day dimension.

Note that I do not recommend this for your account closed date, because that date changes. Storing it as a key for every transaction against an account would require a lot of updates to fact table rows once the account becomes closed.

More Information

I've touched on this topic in the past.  In particular, see this post:


Although I edited it out of Ahmad's question, he also cited an issue surrounding the use of NULL for accounts that do not have a closed date. On that topic, see this recent post:



Support this Blog

I maintain this blog in my spare time.  If you find it helpful, you can help support it by picking up a copy of my book:  Star Schema: The Complete Reference.

Use the links on this blog to get a copy of this or any of the other recommended books, and you will be helping to keep this effort going.

Wednesday, February 13, 2013

Optional Relationships Without NULL

Optional relationships are important in dimensional models. This post shows you how to support them without resorting to NULL keys in the fact table.


Last month, we looked at the impact of allowing dimension attributes to contain NULL. In this post, we'll look at the impact of allowing foreign keys in fact tables to contain NULL.

Once again, NULL will prove problematic. What should be simple queries will require an alternate join syntax, multiple comparisons and nested parentheses.

The preferred solution is to establish special-case rows in dimensions. These rows can be referenced by fact table rows that do not have corresponding dimension detail.

Recap on NULL

NULL is a special SQL keyword used to denote the absence of data.

Last month, I explained why we avoid allowing dimension attributes to contain NULL. NULL fails standard comparisons, necessitating query predicates containing numerous tests which are carefully balanced within sets of parentheses.

For the full story, and the preferred solution, see last month's post: Avoid NULL in Dimensions (1/7/2013).

But that was NULL dimension attributes. What about NULL foreign keys?

Optional Relationships and NULL

Sometimes, the relationship between a fact table and a dimension is optional. This means some rows in the fact table cannot be associated with the dimension.

In an ER model, the traditional solution is to store NULL foreign keys for such rows. Let's take a look at what would happen if we did that in a dimensional model.

You may have noticed that in some stores, the cashier asks you if a salesperson helped you. If so, they record that info. So some sales have a salesperson, some do not.

With an optional relationship to Salesrep, you star schema might look like this:



The dotted line represents an optional relationship. (In other notations, optionality is represented by including circles at the ends of relationship lines.)  For fact table rows with no salesperson, salesrep_key contains NULL.

Usability Harmed by NULL Foreign Keys

When a foreign key can contain NULL, we once again face difficulties when answering some simple business questions. As before, NULL complicates queries because it requires a comparison syntax that is different from the syntax for standard values. This time, we'll also be facing different join syntax.

For example, using the sales star, you might like to see all sales where a manager was not involved.  Assuming the Salesrep table has a column called salesrep_type, you would be forgiven for adding this to your query:

     WHERE salesrep.salesrep_type != "Manager"

This predicate is not sufficient to find all sales without managerial involvement.

Assuming a standard join is linking sales_facts to salerep, rows with no salesrep will not appear in the query results. This happens because, for any fact without a salesrep_key, the join to salesrep will fail. An outer join must be used to help facts with no salesreps survive the join.

Even when an outer join is employed, the above constraint remains insufficient. That's because a side effect of the outer join is to create NULL salesreps in the data set.

In addition to an outer join, we must supplement the constraint above:

   WHERE
     ( salesrep.salesrep_type != "Manager" OR
       salesrep.salesrep_type IS NULL
     ) AND...

NULL keys force us to choose the correct join type, perform multiple comparisons against the same dimension attribute, and carefully balance parentheses.

A dimensional model is meant to be understandable and usable from a business perspective. NULL keys do not fit the bill.

Use a Special Case Row


When there is an optional relationship between fact and dimension, best practices call for a special row in the dimension. This row is referenced by facts that would otherwise require a NULL foreign key.

For example, we add a "not applicable" row to our salesrep table as so:


salerep_keyrow_typesalerep_typesalesrep_name
0No Salesrepn/an/a
100SalesrepAssociatePaul Cook
101SalesrepAssociateSteve Jones
201SalesrepManagerGlen Matlock


Now we don't need outer joins, and we don't need to bend over backwards to perform simple comparisons.

Further Reading

The technique described in this post can be extended to handle other situations (invalid data, future events, or reference data that becomes available after facts).

Read more about these possibilities in Chapter 6 of my book, my book, Star Schema: The Complete Reference.

Also check out the previous post,  Avoid NULL in Dimensions (1/7/2013)

Edited 2/13/13 5:30pm to correct mismatched table headings. Thanks for the emails.

Monday, January 7, 2013

Avoid NULL in Dimensions

Best practices in dimensional modeling forbid the use of NULL.  In this post, you'll learn why.

In any relational database, allowing NULL severely hampers our ability to define query predicates or filters, to use the common term.

For a dimensional modeler, this is anathema.  Data marts are meant to be understood and used by business people. Instead, we take a different approach to handling absent data.

About NULL

Any popular RDBMS product will allow you to store a row of data in which one or more columns do not contain a value. At first glance, this may seem perfectly reasonable. After all, there are times when  something is missing, unknown, or simply does not apply.

Image from Wikimedia
Licensed under CC 3.0
In order to handle the potential absence of data, vendors support the concept of NULL. This special SQL keyword is used to denote the absence of data.

The theorists tell us that NULL introduces a variety of problems:

  • It requires N-valued logic. 
  • It mixes data with metadata. 
  • It introduces a raft of complications when writing queries.
All true, but it is this last point that is the most important for us.

The Problem with NULL

Suppose we have a dimension table that represents an insurance policy. It includes columns that capture the policy's effective_date and the cancellation_date.  Active policies do not have a cancellation date, so it might come in handy that we don't have to store one for each policy.

Some of the data in the table might look like this:


policy_keypolicy_idpolicy_typeeffective_datecancellation_date
12882A111000Home1/1/1012/31/12
12910B222000Automotive3/12/111/31/13
13011C333000Automotive6/7/12[NULL]
13255D444000Home7/1/1212/5/12


(The string "[NULL]" is not stored in the table -- that's just how we indicate it in written examples.)

Now lets say you want to qualify a query so that it only includes auto policies that were active on December 1, 2012.

In your query, you might try to include something like this:


WHERE
  effective_date <= 12/1/2012
  AND cancellation_date >= 12/1/2012
  AND policy_type = “Automotive”
  AND ...


Unfortunately, this is not going pick up Policy C333000, although it was active on the day in question.

Why? The cancellation date for policy C333000 is NULL.  NULL is not greater than the date in question, it is not less than the date in question, nor is it equal to the date in question. All these comparisons will fail.

To correctly answer the business question, it will be necessary to check for NULL cancellation dates:

WHERE
  ( effective_date     <= 12/1/2012
    AND cancellation_date >= 12/1/2012 )
  OR
  ( effective_date <= 12/1/2012
    AND cancellation_date IS NULL ) 


And remember, we only wanted automotive policies.  So we'll need to wrap parentheses around the above fragment, then add the additional constraint. Now we have:

WHERE
  (
    ( effective_date     <= 12/1/2012
      AND cancellation_date >= 12/1/2012 )
    OR
    ( effective_date <= 12/1/2012
      AND cancellation_date IS NULL )
  )
  AND policy_type = "Automotive"
  AND...



Not many business people will be able to define this query, even if they are using a SQL-generating BI tool. Many developers may make a mistake as well.

Alternative to the Null

In lieu of allowing NULL, dimensional designers choose (and document) special values that will be used when there is no data. For strings, this is often "Not Applicable" or "N/A". For a date, it is typically 12/31/9999.

This solution makes our queries easier to write. A 12/31/9999 cancellation date for policy C333000, for example, will allow us to use the original WHERE clause from the example.

Note that the solution is imperfect. Using simple date arithmetic, for example, Policy C333000 now appears to be almost 8,000 years old!

Note, too, that we have not addressed the theoretical objections to NULL. We are still mixing data with metadata; we've just chosen a more workable placeholder for the absence of data.

Learn More

For more on NULL and the dimensional model, see Chapter 6 of my book Star Schema: The Complete Reference.

Use the links on this page to pick up a copy, and you will also be helping to support this blog!

If you want to learn more about relational theory in general, including the NULL, I recommend Chris Date's An Introduction to Database Systems (8th Edition).