Showing posts with label Snowflakes. Show all posts
Showing posts with label Snowflakes. Show all posts

Thursday, May 12, 2011

Dates In Dimension Tables

A reader asks about tracking dates in the fact table vs. dimension tables.
Q: It seems to me that it is sometimes more effective to store date keys in the dimension table -- particularly if that date could be used to describe the rows in many different fact tables.  The alternative  would be to duplicate the date key in multiple fact tables.

For example, the effective date of an insurance policy could be used against many fact tables.  Should we store the effective date key in each of the fact tables?  Or store it in a single row in the Policy?

The differences in design would be a star schema (fact table) versus a snowflake (stored in the dimension).

- Robert
Montreal
A:  Many models feature at least one major dimension with a date that  might be used to qualify facts in more than one star. This happens in other industries, too.  Examples include dimension tables that represent orders, contracts, tests and so forth.

Avoid placing date_key columns in dimension tables.  It increases SQL complexity, reduces schema understandability, and may impact performance.

Some designers may allow limited use of snowflaking, but not in this situation.

Date in Dimension Table, Day Key in Fact Table

When you have an date that (1) clearly describes something represented by a dimension, and (2) will also be used to group facts, do two things:

  1. Place good old-fashioned dates in the dimension table 
  2. In any star where you may want filter/group/aggregate facts by some aspect of the date in question, add date keys to the fact table
For example, in Robert's case there might be a Policy dimension table with an attribute called policy_effective_date.  This is a date, not a key.

There is also a star that tracks claims. To support analysis of claim information using the policy effective date, the fact table will contain a day_key_policy_effective.  The fact table may contain other day_keys as well (such as day_key_claim, which supports analysis of claim data by the date of the claim.)

Judicious Use of Snowflaking?

There are times when some designers might choose a snowflake path -- placing dimension keys into dimension tables. This happens when there is a set of attributes that appear in multiple dimension tables.

This occurs with geographical data.  The characteristics of a location might appear in a customer dimension table, a warehouse dimension table and a department dimension table.

In this case, the concern is that the ETL routines may not consistently process locations across multiple dimension tables. If their attributes or slow change characteristics are not managed identically, inconsistencies may develop across these tables.

Designers reason that creating one location table solves this problem.  Each dimension that contains location data will contain a location_key. This ensures a single consistent representation of locations.

I discuss this technique in Chapter 7 of Star Schema The Complete Reference, and point out that it opens the door for a different kind of ETL challenge.  If location data is moved to an outrigger, each time there is Type 2 slow change to a location, all related dimension table rows must also undergo Type 2 changes.

My preference is to allow the attributes to repeat in multiple tables, and to increase the QA focus on related ETL programs.

Thanks to Robert for the question.  If you have one of your own, send it to the address in the sidebar.

-Chris
Image credit: Gravityx9 licensed under Creative Commons 2.0

Monday, March 7, 2011

Q&A: Bridges are Part of Dimensional Modeling

Two recent posts on bridge tables generated a lot of questions.  Here is the most common:
When you use a bridge table, isn't that a snowflake? I thought this was not allowed in dimensional modeling.
The use of a bridge is a limited form of snowflaking.  It is also an important feature of the dimensional model.

Bridges are widely accepted, even among those who would otherwise avoid snowflake designs.

A Limited form of Snowflaking

In a Q&A from last year, I suggested that a snowflake occurs when dimensions are joined to something other than a fact table. By this definition, the examples from the previous two posts are certainly snowflakes.

In the first example, a bridge was used to associate a dimension with a repeating attribute (a company with multiple industries.)

In the second example, a bridge was used to associate a fact with multiple dimension rows (a sale with multiple salespeople.)

In both cases, a standard one-to-many relationship between fact and dimension does not exist.1 The bridge table solves this problem.

This usage is widely accepted. In fact, the bridge table is a central feature of dimensional modeling. I devote two full chapters to bridged designs in Star Schema: The Complete Reference.

Other Relationships are Not Modeled

When a bridge table is employed, most dimensional modelers will still refer to the schema as "a star." That's because the bridge is only used in the situations described above -- when the standard master-detail relationship between dimension table and fact table breaks down.

Other kinds of relationships between data elements are not resolved via multiple tables.  In our examples:
  • Repeating attributes are permitted (work phone, home phone, mobile phone) 
  • Dimensions contain partially dependent attributes (brand information present in a product table)
  • Master-detail relationships are not instantiated into separate tables
  • Values associated with codes are not placed in look-up tables
  • Dimensions are not linked to one another
An all-out snowflake design would employ principles of normalization to eliminate these features.

Other Times to Snowflake?

There may be some other situations where limited snowflaking is acceptable.  One is where a set of attributes (e.g. geographic) would otherwise appear in multiple dimension tables. Another is where software tools (the RDBMS or BI software) work better in a snowflake environment. 

These uses are a bit more controversial, and I wrote about them in a previous post.

When it comes to the bridge table, however, there is no controversy.  The bridge is an accepted technique in dimensional modeling, and is often necessary for specific dimensional designs.

- Chris

1 Bridges may also be used to navigate recursive relationships, as I describe in the book.

Image credit: Gravityx9 licensed under Creative Commons 2.0

Wednesday, February 9, 2011

Bridge to Multi-Valued Dimensions

Here's what you can do when a dimension table and fact are not in a one-to-many relationship.

A recent post described how a bridge can be used to resolve repeating attributes in a dimension. Today's post looks at a second use:  facts that must link to multiple rows in a dimension.

Facts with Multi-Valued Dimensions

In most star schemas, each fact table row references exactly one row in each associated dimension. This relationship between dimension table and fact table is variously called "one-to-many", "master-detail" or "parent-child". For example, in a star schema that measures the order-taking process, each row in the fact table will link to exactly one row in the product table, one row in the salesperson table, etc.

But what happens when more than one salesperson can be involved in an order? Suppose there could be two, three, or any number of salespeople? The one-to-many relationship breaks down.

Avoiding the Bridge

As with the repeating attribute from last time, we might try to accommodate this multi-valued dimension by simplifying the relationship. Perhaps we can isolate some "roles" that can be used to recover a clean one-to-many relationship. If there is always a lead salesperson and a secondary salesperson, we can simply place 2 salesperson keys in the fact table.

This has some disadvantages, though, and they are the same as those from last time.

First, it becomes difficult to produce a report showing all sales that a particular person was involved in. The query needs to select sales where the primary salesperson is the person in question, OR where the secondary salesperson is the person in question. Doing this for all salespeople in a department becomes tricky, especially if we are using a SQL-generating BI tool.

The second problem with simplifying the relationships is that there may be a need for any number of dimension values. Suppose 3 sales people collaborate on an order? Do we provide 3 salesperson keys in the fact table? This compounds the first problem, and it will only be a matter of time before we come across a case where we need 4.

A Dimension Bridge

The solution is to build a table that bridges the fact table and dimensions.

When a fact references a group of dimension values, a group will be created. The fact table will contain the group key. This key will join to the bridge table.  The bridge will have a row for each group member, linking to the dimension. An example appears here:


This simple solution allows individual rows in the fact table to link to any number of salespeople.  Notice that it requires a group to be established even for orders that have only one salesperson.

Using the Bridge

With a dimension bridge in place, it is easy filter facts for single dimension member, or group them by dimension members.  The sum of sales, for example, can be grouped by salesperson: the fact table joins to the bridge, the bridge joins to the dimension, and a simple group by query is issued.

Dimension bridges are used in many other situations. Multiple citations during an inspection, multiple parties in a dispute or claim, and so forth.

Dangers

As with the attribute bridge, the dimension bridge brings with it the possibility of misuse.

Without careful attention, facts may be double-counted, triple-counted or worse.  For example, if two people collaborate on a sale, a report of sale by salesperson will show the sale twice.  This kind of report is useful, but creating a grand total would be a mistake. Similar issues crop up if both people are in the same department, and someone tries to produce a report showing sales by department.

The solution is to make sure the bridge is only accessible to trained developers and analysts.  A simplified solution can be made available to ad hoc users. In this case, that might mean adding a primary salesperson key to the fact table. This allows ad hoc users to safely join the fact table and dimension table, but their analytic capability will be limited.

More On Dimension Bridges

This introduction has only scratched the surface of dimension bridges. As with the previous post, there are many more refinements and complications that can be introduced. These include the impact on ETL, the addition of allocation factors, coping with software that does not permit a many-to-many relationship and more.

These topics will be explored in future posts. You can also read all about bridge tables in Chatpers 9 and 10 of Star Schema: The Complete Reference.

Have a question about bridge tables? Send it to the address in the sidebar, and it may be addressed in a Q&A post.

Image by Branwebs via Creative Commons

Friday, January 28, 2011

Resolve Repeating Attributes with a Bridge Table

One of the most underutilized and misunderstood techniques in dimensional design is the use of bridge tables. These special tables can be used in several situations. This post focuses on one use case: the repeating attribute.

When a dimension attribute appears multiple times in the same dimension table, we call it a "repeating attribute." In the world of entity-relationship modeling, these are avoided, but in the world of dimensional modeling they are acceptable. For example, a customer dimension table may contain three phone number attributes -- work, mobile and home.

In some cases, though, the repeating attribute is not effective. For example, you are designing a dimension table that represents companies. One of the attributes you want to include is the company's industry -- manufacturing, services, banking, etc. The problem is that some companies participate in more than one industry. What do we do for Company A, it it participates in Manufacturing and Services?

Repeating the Attribute Doesn't Cut It

You might try a design that repeats the industry, just like the multiple phone numbers. But this doesn't work very well in the case of our company. For one thing, each of the three phone numbers had a clear and distinct role: work, mobile, home. We can't do the same thing with a company's industries.

We might arbitrarily name some distinct industry columns -- industry_1, and industry_2 for example. But this introduces two more problems. The first is obvious: what if it turns out there is a company that participates in 3 industries? Exactly how many company columns are we going to add to this table?

The other problem has to do with usability. If you want to create a report that shows sales for companies in a particular industry, you have to qualify your query multiple times, and remember to use "or". Plus you have to take care balancing parentheses if there are other predicates:


WHERE
( industry_1 = "Services"
OR
industry_2 = "Services"
) AND

year = 2011
The usability challenge is increased if you want to group results by industry; you may need to resort to procedural logic.

Solution: Bridge

A real-world bridge connects to places together; it is not a destination itself. Similarly, the bridge table's job is to connect two tables together. It contains no attributes that will appear in query results; its function is to establish a link.

In the case of a multi-valued dimension attribute, a bridge is used to link the main dimension with an outrigger that houses the repeating attribute. For the company/industry dilemma, the link looks like this:


This construction makes it possible to associate any number of industries with a company. It is now simple to qualify queries for a specific industry. Since there is one column called "industry" is also simple to group results by industry.

Take Care

The flexibility and power of the bridge comes at a cost: it is dangerous. When not used properly, facts may be double-counted, triple counted or worse.

Suppose you want to report on sales by industry. You connect industry to company using the bridge, and you then connect company to a fact table containing sales dollars. You aggregate sales and group by industry. Remember that Company A appears in the manufacturing and services industries. Its sales will be repeated for each industry.

This kind of report is acceptable in many businesses. Kimball calls it an "impact report". But we must take special care with these reports -- labeling them as such, and taking special care not to create "grand totals".

Other Repeating Attributes

Repeating attributes occur all over the place. Multiple keywords for a document. Multiple diagnoses for a patient. Multiple responsible parties for an account. Multiple skills for an employee. I am sure you can imagine others.

In all cases, a bridge to an outrigger gives you increased analytic flexibility, though you will have to carefully monitor for misuse.

More Uses, More Issues

There is a lot more to say about bridge tables. That's probably why I've never written about them in the 4-year history of this blog. There are to other ways to employ a bridge.

A bridge can be used when there are potentially many dimension rows associated with a single fact-- such as multiple salespeople associated with a sales fact table.

Bridges are also very powerful in helping traverse recursive hierarchies -- companies that own other companies, parts that break down into other parts, etc.

Though powerful and flexible, bridge tables introduce new complications for the data warehouse team. For example, the many-to-many relationships they introduce (see the picture above) are problematic for some modeling tools, and even some RDBMS's. They also impact how we configure our BI and reporting software. And we must give careful thought to the impact of bridges on slowly changing dimensions.

Future posts will look at some of these topics. I also devote two full chapters to bridge tables in Star Schema: The Complete Reference.

Image: Water Lillies and Japanese Bridge from the Public Domain.

Friday, November 5, 2010

Q&A: Star vs. Snowflake

A question from Portugal gives me an excuse to talk about snowflakes, why you might want to avoid them, and why you might want to use them.

Q: In your perspective, when does a star schema start to be a snowflake? If you have a snowflaked dimension, do you consider the model a star-schema? Or if you have for example an outrigger, is the model not a star-schema anymore?


I think the question when does a star become a snowflake is really one of semantics, but I'll give an answer.

Readers who just want to know when to use snowflake designs can skip ahead a couple of paragraphs.

Is it a Star or Snowflake?

As I said, this is really a semantic issue, but here is my answer. When dimension tables are linked to other dimension tables, or to anything that is not a fact table, the design is a snowflake.

The presence of an outrigger indicates a snowflake design, and so does a 3NF dimension containing 45 tables (and yes, I have seen that!).

Is a snowflake a star? I don't have a good answer for this.  When I discuss design options with colleagues, we think of it as an either/or choice. But when a design is complete, we refer to each fact table and its dimensions as a "star," even if there are outriggers.

One thing I am sure of is that both types of design are dimensional designs.  (So is a cube, by the way.)

Thanks to Pedro for the question.  Now I'm going to talk a bit more about snowflakes, for those who are interested.

When in doubt, don't snowflake

Best practices dictate that snowflakes should be avoided.  The reasons for this are wholly pragmatic.
  • They increase ETL complexity
  • The increase query complexity, harming "usability"
  • They reduce the understandability of the model
We could get into a discussion on this, but instead I'd like to look at some situations where these guidelines might not apply.

When snowflaking is acceptable

There are some cases the guideline against snowflaking may be relaxed.
  • Judicious use of outriggers may be acceptable in limited cases where there are relationships between dimensions that must be browsable.  But consider a factless fact table.
  • Outriggers are necessary when there are repeating attributes such as a product with multiple features or a patient with multiple diagnoses.
  • Outriggers are helpful in situations where there is a recursive hierarchy such as departments that contain other departments, or regions that contain other regions.
  • Snowflaking may be justified when your software products require it.  (e.g. your DBMS or reporting tool.)
In the latter case, you are making a strategic decision rather than a design decision, and I recommend involving several points of view in the decision making process--not just designers.

Read more, and help support this blog:

I devote many pages to snowflake designs in Star Schema The Complete Reference.
  • Chapter 7 discusses snowflakes and hierarchies, and has an in-depth discussion of some of the issues touched on in this post.  
  • Chapter 9 discusses the use of outriggers and  bridge tables to capture repeating dimensions or attributes. 
  • Chapter 10 shows how you can use a bridged design to support a recursive relationship to very powerful effect.
  • Chapter 16 looks at some of the implications of the star vs. snowflake decision on your BI software 
If you use the link above (or the links in the sidebar) to order my book, a portion of the proceeds helps to support this blog.   You can also find the table of contents and other info right here.


Image credit: Gravityx9 licensed under Creative Commons 2.0