Friday, January 28, 2011

Resolve Repeating Attributes with a Bridge Table

One of the most underutilized and misunderstood techniques in dimensional design is the use of bridge tables. These special tables can be used in several situations. This post focuses on one use case: the repeating attribute.

When a dimension attribute appears multiple times in the same dimension table, we call it a "repeating attribute." In the world of entity-relationship modeling, these are avoided, but in the world of dimensional modeling they are acceptable. For example, a customer dimension table may contain three phone number attributes -- work, mobile and home.

In some cases, though, the repeating attribute is not effective. For example, you are designing a dimension table that represents companies. One of the attributes you want to include is the company's industry -- manufacturing, services, banking, etc. The problem is that some companies participate in more than one industry. What do we do for Company A, it it participates in Manufacturing and Services?

Repeating the Attribute Doesn't Cut It

You might try a design that repeats the industry, just like the multiple phone numbers. But this doesn't work very well in the case of our company. For one thing, each of the three phone numbers had a clear and distinct role: work, mobile, home. We can't do the same thing with a company's industries.

We might arbitrarily name some distinct industry columns -- industry_1, and industry_2 for example. But this introduces two more problems. The first is obvious: what if it turns out there is a company that participates in 3 industries? Exactly how many company columns are we going to add to this table?

The other problem has to do with usability. If you want to create a report that shows sales for companies in a particular industry, you have to qualify your query multiple times, and remember to use "or". Plus you have to take care balancing parentheses if there are other predicates:

( industry_1 = "Services"
industry_2 = "Services"

year = 2011
The usability challenge is increased if you want to group results by industry; you may need to resort to procedural logic.

Solution: Bridge

A real-world bridge connects to places together; it is not a destination itself. Similarly, the bridge table's job is to connect two tables together. It contains no attributes that will appear in query results; its function is to establish a link.

In the case of a multi-valued dimension attribute, a bridge is used to link the main dimension with an outrigger that houses the repeating attribute. For the company/industry dilemma, the link looks like this:

This construction makes it possible to associate any number of industries with a company. It is now simple to qualify queries for a specific industry. Since there is one column called "industry" is also simple to group results by industry.

Take Care

The flexibility and power of the bridge comes at a cost: it is dangerous. When not used properly, facts may be double-counted, triple counted or worse.

Suppose you want to report on sales by industry. You connect industry to company using the bridge, and you then connect company to a fact table containing sales dollars. You aggregate sales and group by industry. Remember that Company A appears in the manufacturing and services industries. Its sales will be repeated for each industry.

This kind of report is acceptable in many businesses. Kimball calls it an "impact report". But we must take special care with these reports -- labeling them as such, and taking special care not to create "grand totals".

Other Repeating Attributes

Repeating attributes occur all over the place. Multiple keywords for a document. Multiple diagnoses for a patient. Multiple responsible parties for an account. Multiple skills for an employee. I am sure you can imagine others.

In all cases, a bridge to an outrigger gives you increased analytic flexibility, though you will have to carefully monitor for misuse.

More Uses, More Issues

There is a lot more to say about bridge tables. That's probably why I've never written about them in the 4-year history of this blog. There are to other ways to employ a bridge.

A bridge can be used when there are potentially many dimension rows associated with a single fact-- such as multiple salespeople associated with a sales fact table.

Bridges are also very powerful in helping traverse recursive hierarchies -- companies that own other companies, parts that break down into other parts, etc.

Though powerful and flexible, bridge tables introduce new complications for the data warehouse team. For example, the many-to-many relationships they introduce (see the picture above) are problematic for some modeling tools, and even some RDBMS's. They also impact how we configure our BI and reporting software. And we must give careful thought to the impact of bridges on slowly changing dimensions.

Future posts will look at some of these topics. I also devote two full chapters to bridge tables in Star Schema: The Complete Reference.

Image: Water Lillies and Japanese Bridge from the Public Domain.