Friday, January 28, 2011

Resolve Repeating Attributes with a Bridge Table

One of the most underutilized and misunderstood techniques in dimensional design is the use of bridge tables. These special tables can be used in several situations. This post focuses on one use case: the repeating attribute.

When a dimension attribute appears multiple times in the same dimension table, we call it a "repeating attribute." In the world of entity-relationship modeling, these are avoided, but in the world of dimensional modeling they are acceptable. For example, a customer dimension table may contain three phone number attributes -- work, mobile and home.

In some cases, though, the repeating attribute is not effective. For example, you are designing a dimension table that represents companies. One of the attributes you want to include is the company's industry -- manufacturing, services, banking, etc. The problem is that some companies participate in more than one industry. What do we do for Company A, it it participates in Manufacturing and Services?

Repeating the Attribute Doesn't Cut It

You might try a design that repeats the industry, just like the multiple phone numbers. But this doesn't work very well in the case of our company. For one thing, each of the three phone numbers had a clear and distinct role: work, mobile, home. We can't do the same thing with a company's industries.

We might arbitrarily name some distinct industry columns -- industry_1, and industry_2 for example. But this introduces two more problems. The first is obvious: what if it turns out there is a company that participates in 3 industries? Exactly how many company columns are we going to add to this table?

The other problem has to do with usability. If you want to create a report that shows sales for companies in a particular industry, you have to qualify your query multiple times, and remember to use "or". Plus you have to take care balancing parentheses if there are other predicates:

( industry_1 = "Services"
industry_2 = "Services"

year = 2011
The usability challenge is increased if you want to group results by industry; you may need to resort to procedural logic.

Solution: Bridge

A real-world bridge connects to places together; it is not a destination itself. Similarly, the bridge table's job is to connect two tables together. It contains no attributes that will appear in query results; its function is to establish a link.

In the case of a multi-valued dimension attribute, a bridge is used to link the main dimension with an outrigger that houses the repeating attribute. For the company/industry dilemma, the link looks like this:

This construction makes it possible to associate any number of industries with a company. It is now simple to qualify queries for a specific industry. Since there is one column called "industry" is also simple to group results by industry.

Take Care

The flexibility and power of the bridge comes at a cost: it is dangerous. When not used properly, facts may be double-counted, triple counted or worse.

Suppose you want to report on sales by industry. You connect industry to company using the bridge, and you then connect company to a fact table containing sales dollars. You aggregate sales and group by industry. Remember that Company A appears in the manufacturing and services industries. Its sales will be repeated for each industry.

This kind of report is acceptable in many businesses. Kimball calls it an "impact report". But we must take special care with these reports -- labeling them as such, and taking special care not to create "grand totals".

Other Repeating Attributes

Repeating attributes occur all over the place. Multiple keywords for a document. Multiple diagnoses for a patient. Multiple responsible parties for an account. Multiple skills for an employee. I am sure you can imagine others.

In all cases, a bridge to an outrigger gives you increased analytic flexibility, though you will have to carefully monitor for misuse.

More Uses, More Issues

There is a lot more to say about bridge tables. That's probably why I've never written about them in the 4-year history of this blog. There are to other ways to employ a bridge.

A bridge can be used when there are potentially many dimension rows associated with a single fact-- such as multiple salespeople associated with a sales fact table.

Bridges are also very powerful in helping traverse recursive hierarchies -- companies that own other companies, parts that break down into other parts, etc.

Though powerful and flexible, bridge tables introduce new complications for the data warehouse team. For example, the many-to-many relationships they introduce (see the picture above) are problematic for some modeling tools, and even some RDBMS's. They also impact how we configure our BI and reporting software. And we must give careful thought to the impact of bridges on slowly changing dimensions.

Future posts will look at some of these topics. I also devote two full chapters to bridge tables in Star Schema: The Complete Reference.

Image: Water Lillies and Japanese Bridge from the Public Domain.

Search Inside Star Schema TCR now allows you to search inside my latest book. (This wasn't the case for the first couple of months the book was available.)

I've been told by students in my classes that they have used this feature to read about and solve a design problem, before they even purchased the book!

Give it a try. Plug in a design topic and see what comes up. Page around, and you will also see some of the detailed illustrations.

If you like to look at a book before you buy it, this feature is great.

(You may have to be logged into an Amazon account to use the search feature.)

For more info on the book, you can also visit this post, as well as the FAQ.

Mobile Theme Now Available

Star Schema Central is now easier to read on the go. Simply visit it from your mobile device, and content will automatically formatted for maximum readability.
If you want to see the standard display, scroll down and choose "view web version."

Friday, January 7, 2011

Build High Resultion Stars

On TV, a low-resolution photo is never a problem for crime scene investigators. Load it into a computer, zoom in on a reflection, click "enhance", and presto! the criminal is revealed.

In the world of data warehousing, we cannot rely on the CSI effect. If we start with low-resolution data, the detail is lost forever. Eventually, someone will ask a question that requires more detail. And it simply wont be there to answer the question.

A few simple guidelines will help you avoid this unfortunate situation.

A Common Problem

Omission of detail is one of the biggest frustrations caused by legacy designs. It is the number three problem I encounter in design reviews (after failure to use surrogate keys and failure to plan slowly changing dimensions.)

The reason for this is simple: people often design their schema to support current requirements. Unfortunately, this does not work in data warehousing. Analytic requirements constantly change. As a designer, your job is to produce solutions that will answer questions that are not yet known.

This may sound like an impossible task, but it's not. Its one of the strong suits of dimensional design. Three guidelines for high-resolution design will help future-proof your solution.

1. Match Grain to Source Data, Not Requirements

When you're designing a fact table, you need to establish its grain. A statement of grain defines what is represented by a row in a fact table. This is the "resolution" of your measurements. (For a refresher on grain, see the post Rule 1: State Your Grain.)

Don't set grain at the level of detail that meets requirements. Set it at the level of detail at which data is available.

For example, you are designing a fact table that measures sales. Someone asks for daily totals of sales in dollars. Don't assume daily totals of sales is sufficient. Eventually someone will want to go deeper, perhaps looking for products that are bought together. This requires setting your grain at the order line level of detail.

Once you have identified the process a fact table represents, look at the resolution of the source data. That should guide your decisions on grain. Of course, this needs to be kept within reason. If the available level of detail would result in fact tables that are too large or expensive to manage, you might need to do some summarization. But be careful; you'll be working with low-resolution data in the future.

2. Capture All the Facts that Fit

Remember that each fact table describes a process. When you fill in the facts, don't just include ones that people asked for. Include all available facts that describe that process.

Returning to our sales data, this means you should not stop at sales dollars. What else is known about sales? Quantities, tax, and other information is likely to be readily available, and many operational systmes may also offer a rudimentary concept of cost or margin. Include them!

3. Fill Out Dimensions

The last guideline for a high-resolution design involves the dimensions. You can probably guess what it is:

When you design a dimension table, don't just include attributes someone asked for. Find what is available in the source system, and include it all.

A simple "Product Name" and "Product ID" might be all that's needed to support sales by product. But what else do source tables hold that describes your products? Manufactures? Suppliers? Colors? Sizes? Weights? Include all this in your designs.

Remember, dimensions are the source of all context in our reports. The more we fill them out, the more kinds of questions people will be able to ask.

Initial Design is the Right Time

Match your grain, facts and dimensions to the available data, even if that goes beyond the requirements.

Of course, you can recover from errors in any of these areas. This will require modifying your design, ETL process, and so forth. When its decided that additional detail is needed going forward, you can make these changes.

But it is often difficult to add detail to historic data already in the data warehouse. For example, it may require accessing backup data. Worse, it may put ETL developers through the difficult process of performing slow change processing in reverse.

Think high-resolution today, and you can avoid these problems tomorrow.

Image Credit: Alan Cleaver via Creative Commons 2.0