Showing posts with label Advanced. Show all posts
Showing posts with label Advanced. Show all posts

Monday, January 7, 2013

Avoid NULL in Dimensions

Best practices in dimensional modeling forbid the use of NULL.  In this post, you'll learn why.

In any relational database, allowing NULL severely hampers our ability to define query predicates or filters, to use the common term.

For a dimensional modeler, this is anathema.  Data marts are meant to be understood and used by business people. Instead, we take a different approach to handling absent data.

About NULL

Any popular RDBMS product will allow you to store a row of data in which one or more columns do not contain a value. At first glance, this may seem perfectly reasonable. After all, there are times when  something is missing, unknown, or simply does not apply.

Image from Wikimedia
Licensed under CC 3.0
In order to handle the potential absence of data, vendors support the concept of NULL. This special SQL keyword is used to denote the absence of data.

The theorists tell us that NULL introduces a variety of problems:

  • It requires N-valued logic. 
  • It mixes data with metadata. 
  • It introduces a raft of complications when writing queries.
All true, but it is this last point that is the most important for us.

The Problem with NULL

Suppose we have a dimension table that represents an insurance policy. It includes columns that capture the policy's effective_date and the cancellation_date.  Active policies do not have a cancellation date, so it might come in handy that we don't have to store one for each policy.

Some of the data in the table might look like this:


policy_keypolicy_idpolicy_typeeffective_datecancellation_date
12882A111000Home1/1/1012/31/12
12910B222000Automotive3/12/111/31/13
13011C333000Automotive6/7/12[NULL]
13255D444000Home7/1/1212/5/12


(The string "[NULL]" is not stored in the table -- that's just how we indicate it in written examples.)

Now lets say you want to qualify a query so that it only includes auto policies that were active on December 1, 2012.

In your query, you might try to include something like this:


WHERE
  effective_date <= 12/1/2012
  AND cancellation_date >= 12/1/2012
  AND policy_type = “Automotive”
  AND ...


Unfortunately, this is not going pick up Policy C333000, although it was active on the day in question.

Why? The cancellation date for policy C333000 is NULL.  NULL is not greater than the date in question, it is not less than the date in question, nor is it equal to the date in question. All these comparisons will fail.

To correctly answer the business question, it will be necessary to check for NULL cancellation dates:

WHERE
  ( effective_date     <= 12/1/2012
    AND cancellation_date >= 12/1/2012 )
  OR
  ( effective_date <= 12/1/2012
    AND cancellation_date IS NULL ) 


And remember, we only wanted automotive policies.  So we'll need to wrap parentheses around the above fragment, then add the additional constraint. Now we have:

WHERE
  (
    ( effective_date     <= 12/1/2012
      AND cancellation_date >= 12/1/2012 )
    OR
    ( effective_date <= 12/1/2012
      AND cancellation_date IS NULL )
  )
  AND policy_type = "Automotive"
  AND...



Not many business people will be able to define this query, even if they are using a SQL-generating BI tool. Many developers may make a mistake as well.

Alternative to the Null

In lieu of allowing NULL, dimensional designers choose (and document) special values that will be used when there is no data. For strings, this is often "Not Applicable" or "N/A". For a date, it is typically 12/31/9999.

This solution makes our queries easier to write. A 12/31/9999 cancellation date for policy C333000, for example, will allow us to use the original WHERE clause from the example.

Note that the solution is imperfect. Using simple date arithmetic, for example, Policy C333000 now appears to be almost 8,000 years old!

Note, too, that we have not addressed the theoretical objections to NULL. We are still mixing data with metadata; we've just chosen a more workable placeholder for the absence of data.

Learn More

For more on NULL and the dimensional model, see Chapter 6 of my book Star Schema: The Complete Reference.

Use the links on this page to pick up a copy, and you will also be helping to support this blog!

If you want to learn more about relational theory in general, including the NULL, I recommend Chris Date's An Introduction to Database Systems (8th Edition).



Wednesday, May 2, 2012

Recursive Hierarchies And Bridge Tables

This post introduces the hierarchy bridge table, which can be used to navigate recursive hierarchies.

High Trestle Bridge at Night
Photo (c) Carol Bodensteiner
Bridge tables are a popular topic among readers of this blog. Perhaps that's because they can be a bit confusing, yet also enable powerful analysis.

Previously, I've focused on the two simplest uses for bridge tables:  the attribute bridge and the dimension bridge.  See links at the end of this post for more on these.

In this post, I'll tackle a third and final use: how a bridge can aid in the navigation of a recursive hierarchy.

Recursive relationships

In the real world, recursive relationships are common. Employees report to other employees, parts are made of other parts, and so forth.

In an E-R model, a recursive relationship might look like this:

Companies may own other companies.  That dotted relationship line is sometimes called a "pigs ear."  It links each company to the company that owns it.

You can read this relationship in two directions:  each company may have a parent company; each company may be the parent of one or more companies.

Looking up and looking down

From a business perspective, this information may be very useful.  For example, suppose you work for a business whose sells things to companies. You would like to be able to answer questions like these:
  1. What are January order dollars from to Company XYZ and its subsidiaries?
  2. Show me January orders from Company ABC and any companies above it.
The first question "looks down" from company XYZ.  The second question "looks up" from company ABC.

In both cases, we want to use the hierarchy to aggregate details about orders.

The hierarchy bridge

In a dimensional model, the idea is to support the study of facts (such as "order dollars") in the context of this hierarchy. This can be done using a bridge table, as shown here:

The table in the center is a hierarchy bridge.  This kind of table can be a bit confusing.  First, consider its content.

Each row contains a pair of company keys.  Here are the rules for populating the bridge table:
  1. For each company, the bridge contains rows associating it with each of its subordinates.
  2. Each company also has 1 row associating it with itself.
With respect to item 1, note each company is linked to all subordinates, not just the direct ones. (That's why the keys have the prefixes "superior" and "subordinate" rather than "parent" and "child".)

Using the bridge

In the picture above, notice that the fact table contains a company_key.  This represents the company from which each order is taken.  

Normally, we would join it to the company table to study "orders by company."

In this case, however, the bridge is inserted between the fact table and the company table. The bridge acts like a multiplexer--linking each company (on the right) to transactions with any and all of its subordinates (on the left).

This allows us to select a single company on the right (say Company XYZ) and aggregate all orders at or beneath that company.  That's called "looking down." It lets us answer question 1.

We can also use this hierarchy to "look up," as in question 2.  We simply reverse the joins--link company to the subordinate key, and the superior key to the fact table.  Then we can select a single company (say Company ABC) and look upward.

Powerful but dangerous

Of course, the danger of the bridge table should be readily apparent.  As you can see from the diagram above, if we forget to select a single parent company, we can easily double count the orders, triple count them, or worse.

For this reason, writing queries that involve the hierarchy must be governed carefully.  Developers or users with little experience can study the facts without using the bridge.  This will be safe, but they will not be able to leverage the hierarchy.

Other kinds of bridge tables

The hierarchy bridge described in this post is one of three kinds of bridge tables.  I've written previously about the other two kinds:
  • Bridge To Multi-Valued Dimensions (2/9/2011) shows how a bridge can link a single fact to multiple rows in a dimension.  Examples include orders with multiple salespeople, claims with multiple parties, etc.
More on the hierarchy bridge

When it comes to recursive hierarchies, this post only scratches the surface.  If you have questions, send them in.

In the mean time, there's a whole chapter on this topic in my book, Star Schema: The Complete Reference.  That would be Chapter 10:  "Recursive Hierarchies and Bridges." It's a long chapter--the longest one in the book.

And don't forget, when you use the links on this page to buy it from Amazon, you are helping to support this blog.


Copyright (c) Carol Bodensteiner
Used by permission, with many thanks!

Tuesday, October 4, 2011

Are bridge tables really fact tables?

A reader observes that bridge tables seem problematic, and wonders if they should just replaced by factless fact tables.

Q:  I am wondering if all bridge tables are in fact replacements for factless fact tables. The problem with the bridge table as you mention it is that...[you] need to do an expensive join and issues with Cartesian joins/ double counting etc. So the question is whether a bridge table is practical option as compared to a separate fact.
Ashish
Bangalore, India

A:  The bridge table looks a lot like a fact table, but it is used very differently.

While we avoid joining a fact table to another fact table, we seek to join bridge tables to fact tables. This can have implications for BI tools that generate SQL.

Similarities

A bridge table appears similar to a fact table because it contains multiple foreign keys. This is most evident when you look at an attribute bridge table, which links a dimension to an outrigger. It consists solely of foreign key references to the dimension and the outrigger.

In this respect, the bridge table is very similar to a factless fact table.  Indeed, one might make the argument that a bridge relates a set of dimensions in much the same way that a factless fact table describes conditions.

But there is a very important difference: we never join a fact table to another fact table. Bridges, on the other hand, are intended to be joined with fact tables.

We do not join fact tables to fact tables

You should never join two or more fact tables--either directly or indirectly via shared dimensions. Fact values will repeat if multiple rows in either fact table share the same dimensionality. We receive a Cartesian product of all related facts.  The result is double-counting, or worse. 

Instead of joining fact tables, we use a technique called drilling across. Facts are collected from each table and aggregated to common level of detail, then merged into a single result set. I wrote about this process earlier this year.1

Many BI tools that generate SQL are able to identify fact tables and automatically invoke drill across logic when required.

We do join bridge tables to fact tables

Bridge tables represent groups. We create them so that a single fact can be associated with a group of values (such as multiple salespeople) rather than a single value.

When we use a bridge table, we link it to other fact tables.2  By doing this, a single fact in the fact table associates with multiple rows in the bridge table.   

With a bridge, we are exploiting the very Cartesian product that we normally seek to avoid.

We are intentionally repeating a single fact for multiple group members.  To avoid incorrect results, it behooves us to group results by member, or to constrain for a single group member.

Even if you were to replace a bridge with a factless fact table, this is the behavior you would desire. Rather than drill across, you would link it to other fact tables, in order to associate the bridged values with various facts. Cartesian products and the danger of double counting would remain.

Bridge tables and BI software

A bridge table is not really a fact table. It is not the locus of process measurement.  It describes neither activities nor conditions. It is merely a construct that allows us to deal with repeating values. It is meant to be linked with fact tables, and used with care.

Because a bridge is composed of foreign keys, however, some BI tools may identify it as a fact table. If your tool does this, you will need to prevent it from invoking drill-across logic for queries that involve a bridge.

Your tool may have a facility for this. If it does not, you can hide the bridge by joining it to dimensions within a view.

More info

Thanks to Ashish for the comments. If you have a question about bridge tables, send it to the address on the sidebar.

You can learn more in my book, Star Schema: The Complete Reference.  Two full chapters are dedicated to bridge tables, including 30 diagrams.

See also:
1Factless fact tables which describe conditions are not joined to other fact tables either. When they are compared to other fact tables, we typically use set operators or subqueries.

2This may happen directly, in the case of a dimension bridge, or indirectly, in the case of an attribute bridge.


Image by ahisgett licensed under Creative Commons 2.0

Wednesday, September 21, 2011

Avoid Surrogate Keys for Fact Tables


I am often asked for feedback on designs that include a "surrogate key for the fact table."  There are two common reasons for which they are proposed; both have better alternatives. 

Surrogate keys are for dimension tables

A surrogate key is an attribute that is created to uniquely identify rows in a dimension table. It does not come from a source system; it is created expressly for the dimensional schema.

Surrogate keys for dimension tables serve two important purposes:
  1. They make it easier to track history.  They allow the dimension to capture changes to something, even if the source does not.  Absent a surrogate key, this would be difficult; the primary key of the dimension would be concatenation of natural keys and type 2 attributes.
  2. They make it easy to join to the dimension.  The dimensions' surrogate keys appear in fact tables as  foreign keys.  They allow the fact to be joined to the appropriate dimension values, without having to use a multi-part key.
A fact table does not require a manufactured data element for either of these purposes.

Tracking change history of facts?  Use a log.

When the value of a fact can change, a surrogate key for the fact table might be proposed. This would theoretically allow the fact table to record the change history of facts, in the same way that a dimension table does.

Unfortunately, storing the history of facts in this manner destroys the usability of the star. The additive properties of the facts are lost, and it becomes much harder to use.

If the facts can change, the fact table should be updated.

To track the history of facts, use an audit table as described in a previous post.  This table can log historic values, or can store the changes as "deltas" in a dimensional format.

Joining to other fact tables?  Drill across.

The other reason surrogate keys are proposed for fact tables is that it will make them "easier to join." 

Joining fact tables to dimensions is easy; the fact table already contains foreign keys that reference the surrogate keys in dimension tables.  But what about joining to other fact tables?

Because they contain facts meant to be aggregated, fact tables should never be joined to one another. Otherwise, double-counting will ensue.

Facts from multiple stars should be combined by drilling across, as described in a previous post.  Don't try to merge them by joining fact tables.

If you're not after facts with this proposed fact-to-fact join, you must be after dimension values.  If this is the case, carry forward any/all useful dimensions to other stars in the value chain.

More info

To learn more about surrogate keys, check out these posts:
For more on tracking the change history of facts, check out this post:
To learn about drilling across, read this post:
If you find this blog helpful, please consider picking up a copy of my book, Star Schema: The Complete Reference.


Wednesday, February 9, 2011

Bridge to Multi-Valued Dimensions

Here's what you can do when a dimension table and fact are not in a one-to-many relationship.

A recent post described how a bridge can be used to resolve repeating attributes in a dimension. Today's post looks at a second use:  facts that must link to multiple rows in a dimension.

Facts with Multi-Valued Dimensions

In most star schemas, each fact table row references exactly one row in each associated dimension. This relationship between dimension table and fact table is variously called "one-to-many", "master-detail" or "parent-child". For example, in a star schema that measures the order-taking process, each row in the fact table will link to exactly one row in the product table, one row in the salesperson table, etc.

But what happens when more than one salesperson can be involved in an order? Suppose there could be two, three, or any number of salespeople? The one-to-many relationship breaks down.

Avoiding the Bridge

As with the repeating attribute from last time, we might try to accommodate this multi-valued dimension by simplifying the relationship. Perhaps we can isolate some "roles" that can be used to recover a clean one-to-many relationship. If there is always a lead salesperson and a secondary salesperson, we can simply place 2 salesperson keys in the fact table.

This has some disadvantages, though, and they are the same as those from last time.

First, it becomes difficult to produce a report showing all sales that a particular person was involved in. The query needs to select sales where the primary salesperson is the person in question, OR where the secondary salesperson is the person in question. Doing this for all salespeople in a department becomes tricky, especially if we are using a SQL-generating BI tool.

The second problem with simplifying the relationships is that there may be a need for any number of dimension values. Suppose 3 sales people collaborate on an order? Do we provide 3 salesperson keys in the fact table? This compounds the first problem, and it will only be a matter of time before we come across a case where we need 4.

A Dimension Bridge

The solution is to build a table that bridges the fact table and dimensions.

When a fact references a group of dimension values, a group will be created. The fact table will contain the group key. This key will join to the bridge table.  The bridge will have a row for each group member, linking to the dimension. An example appears here:


This simple solution allows individual rows in the fact table to link to any number of salespeople.  Notice that it requires a group to be established even for orders that have only one salesperson.

Using the Bridge

With a dimension bridge in place, it is easy filter facts for single dimension member, or group them by dimension members.  The sum of sales, for example, can be grouped by salesperson: the fact table joins to the bridge, the bridge joins to the dimension, and a simple group by query is issued.

Dimension bridges are used in many other situations. Multiple citations during an inspection, multiple parties in a dispute or claim, and so forth.

Dangers

As with the attribute bridge, the dimension bridge brings with it the possibility of misuse.

Without careful attention, facts may be double-counted, triple-counted or worse.  For example, if two people collaborate on a sale, a report of sale by salesperson will show the sale twice.  This kind of report is useful, but creating a grand total would be a mistake. Similar issues crop up if both people are in the same department, and someone tries to produce a report showing sales by department.

The solution is to make sure the bridge is only accessible to trained developers and analysts.  A simplified solution can be made available to ad hoc users. In this case, that might mean adding a primary salesperson key to the fact table. This allows ad hoc users to safely join the fact table and dimension table, but their analytic capability will be limited.

More On Dimension Bridges

This introduction has only scratched the surface of dimension bridges. As with the previous post, there are many more refinements and complications that can be introduced. These include the impact on ETL, the addition of allocation factors, coping with software that does not permit a many-to-many relationship and more.

These topics will be explored in future posts. You can also read all about bridge tables in Chatpers 9 and 10 of Star Schema: The Complete Reference.

Have a question about bridge tables? Send it to the address in the sidebar, and it may be addressed in a Q&A post.

Image by Branwebs via Creative Commons

Friday, January 28, 2011

Resolve Repeating Attributes with a Bridge Table

One of the most underutilized and misunderstood techniques in dimensional design is the use of bridge tables. These special tables can be used in several situations. This post focuses on one use case: the repeating attribute.

When a dimension attribute appears multiple times in the same dimension table, we call it a "repeating attribute." In the world of entity-relationship modeling, these are avoided, but in the world of dimensional modeling they are acceptable. For example, a customer dimension table may contain three phone number attributes -- work, mobile and home.

In some cases, though, the repeating attribute is not effective. For example, you are designing a dimension table that represents companies. One of the attributes you want to include is the company's industry -- manufacturing, services, banking, etc. The problem is that some companies participate in more than one industry. What do we do for Company A, it it participates in Manufacturing and Services?

Repeating the Attribute Doesn't Cut It

You might try a design that repeats the industry, just like the multiple phone numbers. But this doesn't work very well in the case of our company. For one thing, each of the three phone numbers had a clear and distinct role: work, mobile, home. We can't do the same thing with a company's industries.

We might arbitrarily name some distinct industry columns -- industry_1, and industry_2 for example. But this introduces two more problems. The first is obvious: what if it turns out there is a company that participates in 3 industries? Exactly how many company columns are we going to add to this table?

The other problem has to do with usability. If you want to create a report that shows sales for companies in a particular industry, you have to qualify your query multiple times, and remember to use "or". Plus you have to take care balancing parentheses if there are other predicates:


WHERE
( industry_1 = "Services"
OR
industry_2 = "Services"
) AND

year = 2011
The usability challenge is increased if you want to group results by industry; you may need to resort to procedural logic.

Solution: Bridge

A real-world bridge connects to places together; it is not a destination itself. Similarly, the bridge table's job is to connect two tables together. It contains no attributes that will appear in query results; its function is to establish a link.

In the case of a multi-valued dimension attribute, a bridge is used to link the main dimension with an outrigger that houses the repeating attribute. For the company/industry dilemma, the link looks like this:


This construction makes it possible to associate any number of industries with a company. It is now simple to qualify queries for a specific industry. Since there is one column called "industry" is also simple to group results by industry.

Take Care

The flexibility and power of the bridge comes at a cost: it is dangerous. When not used properly, facts may be double-counted, triple counted or worse.

Suppose you want to report on sales by industry. You connect industry to company using the bridge, and you then connect company to a fact table containing sales dollars. You aggregate sales and group by industry. Remember that Company A appears in the manufacturing and services industries. Its sales will be repeated for each industry.

This kind of report is acceptable in many businesses. Kimball calls it an "impact report". But we must take special care with these reports -- labeling them as such, and taking special care not to create "grand totals".

Other Repeating Attributes

Repeating attributes occur all over the place. Multiple keywords for a document. Multiple diagnoses for a patient. Multiple responsible parties for an account. Multiple skills for an employee. I am sure you can imagine others.

In all cases, a bridge to an outrigger gives you increased analytic flexibility, though you will have to carefully monitor for misuse.

More Uses, More Issues

There is a lot more to say about bridge tables. That's probably why I've never written about them in the 4-year history of this blog. There are to other ways to employ a bridge.

A bridge can be used when there are potentially many dimension rows associated with a single fact-- such as multiple salespeople associated with a sales fact table.

Bridges are also very powerful in helping traverse recursive hierarchies -- companies that own other companies, parts that break down into other parts, etc.

Though powerful and flexible, bridge tables introduce new complications for the data warehouse team. For example, the many-to-many relationships they introduce (see the picture above) are problematic for some modeling tools, and even some RDBMS's. They also impact how we configure our BI and reporting software. And we must give careful thought to the impact of bridges on slowly changing dimensions.

Future posts will look at some of these topics. I also devote two full chapters to bridge tables in Star Schema: The Complete Reference.

Image: Water Lillies and Japanese Bridge from the Public Domain.

Monday, December 13, 2010

Deeper into the Accumulating Snapshot

The most popular posts on this blog discuss accumulating snapshot designs. In this post, I'll share some new insights on the technique, which come from my book Star Schema The Complete Reference.

In particular, I want to debunk some common misconceptions about accumulating snapshots.

The Accumulating Snapshot

One of the three major kinds of fact table, the accumulating snapshot is the least understood and the most underutilized. 

I provided a basic description of the technique several years ago in the first post to this blog.  The basic concept is simple: in this kind of design, each row in the fact table tracks the changes to something over time.  The idea is to facilitate reports that look at how long things spend in various processing stages.

For example, the post mentioned above describes an accumulating snapshot that tracks the processing of mortgage applications.  Each row in the fact table tracks a single application, and is updated each time it completes a processing stage.  It contains foreign key references to multiple dates:
  • Submitted Date
  • Approval Date
  • Processed Date
  • Underwritten Date
  • Settlement Date
Corresponding to each of these milestones is a fact that indicates how long it spent in the stage.  In the previous post, I've called these facts "lags," but I'll suggest a different way of looking at them in a moment.

These facts make it it easy to study the average time items spend in each stage. Without an accumulating snapshot, this might require correlated sub-queries,  or the comparison of detail rows from multiple fact tables.

Unique Characteristics

Unlike other kinds of fact table, the intent is to update rows in an accumulating snapshot.  When an application reaches a new status, for example, the appropriate date is added.  I've touched on this topic in a prior Q and A post.

Another characteristic that differs from other kinds of fact tables is the size of the accumulating snapshot.  It usually has less rows than the dimension table representing the item being processed.  That's because the item itself is likely to undergo type 2 slow changes.  I've noted this in a previous post as well.

Lastly, the accumulating snapshot is essentially current valued.  It is used to study the net effects of a process up to and including the current point in time. 

Non Linear Processes

In the past, I've written that the accumulating snapshot is only effective for processes that follow a consistent, predictable series of steps.  (See my previous books in the sidebar to this blog.)

But the technique can easily be adapted to accommodate processes that are not linear.

For example, suppose our mortgage application may bounce back to the processing stage if the underwriter discovers some missing documentation. Conventional wisdom holds that this will "break" the accumulating snapshot, since it violates the standard sequence of status milestones.

This non-linear behavior is easily accounted for. First, we will rename our "lag" facts to represent the number of days spend at each stage:
  • Days awaiting approval
  • Days processing
  • Days underwriting
  • Days awaiting settlement
The ETL process will be developed to check each open application on a daily basis, and increment the appropriate fact for its processing stage.  

This means it does not matter if an application reaches the underwriting stage and then gets sent back to the processing stage.  If this occurs, the ETL process will simple resume incrementing the "days processing" fact.

One thing that will need to be worked out is what dates to use for each milestone. For the application in question, once the processor finishes with it for the second time, what is the processed date?  The first time it achieved this status, or the second?  Business rules must drive these decisions.

Long Duration Processes

We are also accustomed to thinking of accumulating snapshots as appropriate only for short-duration processes.

However, my customers and students have found success in modeling activities of significant duration. Examples include the tenure of faculty members, the life cycle of a service contract, the duration of a mortgage or insurance policy, and the incarceration of a prisoner.

Some of these processes do not have fixed durations, and may prove to span years or even decades. As long as it is feasible for the ETL process to continually update records for active entities in the process, the approach is effective.

Going Even Deeper

Several other topics are worth considering:
  • You may have an operational system that logs very detailed status changes. It is possible to map a set of hundreds of statuses into a few key milestones for an accumulating snapshot. 
  • If the item or entity being processed undergoes slow changes, update its surrogate key in its corresponding fact table row.  This will be sure the fact table always points to the most recent version of the item.
  • Some milestones make useful dimension attributes as well. Incorporate the dates into the dimension table as appropriate.  For example, "date of approval" may be a useful characteristic in the mortgage application dimension.
  • For milestones that have not been achieved, don't include a NULL-valued foreign key in the fact table.  Instead, establish a special row your day dimension for events that have yet to occur.  Set its attributes to "N/A" and give it the highest possible date supported by your DBMS.  This will simplify joins and query predicates.
  • If at all possible, source the accumulating snapshot from other fact tables that log status changes as transactions.  This will make the ETL process much simpler. 
These topics are all discussed in the book.

You will find this same kind of deep dive in every topic it covers.  You can help support this blog by using the links in the sidebar to order a copy.

Thanks for your support, and feel free to send in your questions!

Image Credit:  Creativity103 via Creative Commons