Showing posts with label Bridges. Show all posts
Showing posts with label Bridges. Show all posts

Thursday, December 20, 2012

Dimensional Models: No E-R Fallback Required

People often suggest to me that a dimensional model cannot handle certain situations. These situations, the assertion holds, require falling back to entity-relationship modeling techniques.

Image by Patrick Hosley,
Licensed by Creative Commons 2.0
When we look together at their examples, though,  I've always found a dimensional solution.

Here's a list of things people sometimes do not realize can be handled by standard dimensional techniques.

At the end of the post is a large collection of links to previous posts treat these topcis in more detail.

Note: This article should not be interpreted as a criticism of entity-relationship modeling. Rather, the point is to cover commonly overlooked capabilities of dimensional modeling.


Predictive Analytics: Storage of Granular Detail

I often hear people say: if you want to do predictive analytics, its important to capture all the relationships among data attributes. This is incorrect.

The job of predictive analytics is to tell us what is important--not the reverse!  Neither data mining nor predictive analytics requires information to be stored in an entity-relationship format.

As people who do this work will tell you, they don't care how the data is supplied. Nor do they care what modelers believe are the important relationships. Many tools will completely de-normalize the data before doing anything else.

In order to support predictive analytics, standard best practices of dimensional modeling apply:
  • Fact tables record business activities and conditions at the lowest level of detail possible in granular, first line stars.
  • Transaction identifiers are carried in the model if available.
  • Surrounding dimensions are enriched with detail to the fullest extent possible
  • In the dimension tables, changes are tracked and time-stamped (“type 2”).
If we do these things with our dimensional data, we have not lost anything that is that is required to do predictive analytics.

Many-to-many Relationships: Multi-Valued Dimension Bridge

Usually, each fact in a star schema can be linked to a single member of a dimension. This is often mistaken for a rule, which leads to the incorrect conclusion that dimensional models cannot accommodate a fact that may link to multiple members of a dimension.

(This is often generalized as “dimensional cannot handle many-to-many relationships.”)

In a dimensional model, this situation is referred to as a multi-valued dimension. We use a bridge table to link the facts to the dimension in question. This special table sets up a many-to-many relationship between the facts and dimensions, allowing any number of dimension members to be associated with any number of facts.

Here are some examples:
  • In sales, a bridge table may be used to associate a sale, as recorded in a fact table, with multiple sales people. 
  • In insurance, a single claim (again, tracked in a fact table) can be associated with multiple parties. 
  • In healthcare, a single encounter may be associated with multiple providers or tests. 
  • In government, a single audit or inspection may be associated with multiple findings. 
  • In law enforcement, a single arrest may be associated with multiple charges.
Repeating Attributes: Multi-Values Attribute Bridge

It is sometimes suggested that dimensional models cannot gracefully accommodate repeating attributes. The dimensional solution is again a bridge table. This time, it is placed between the dimension table and an outrigger that contains the attribute in question.

Examples include:
  • Companies that have multiple Standard Industry Classification codes 
  • People with multiple phone numbers or addresses 
  • Accounts with multiple account holders 
  • Patients with multiple diagnoses
  • Documents with multiple keywords.
Recursive Hierarchies: Hierarchy Bridge

Ragged hierarchies, unbalanced hierarchies, or recursive hierarchies are often cited as the downfall of the dimensional model. In fact, a solution exists, and it is extraordinarily powerful. The hierarchy bridge table allows facts to be aggregated either by rolling up or rolling down through the hierarchy, regardless of number of levels.

Examples include:
  • Parts that are made up of other parts
  • Departments that fall within other departments
  • Geographies that fall within other geographies
  • Companies that own other companies
Relationships Between Dimensions: Factless Fact Tables

A star schema does not include relationships between dimension tables. This has led to the misconception that you can't track these relationships.

In fact, any important relationship between dimension tables can and should be captured. It is done using factless fact tables. (Dimensions are never directly linked because of the implications this would have on slow change processing.)

Examples include:
  • Employees filling a job in a department
  • Marketing promotions in effect in a geographical market
  • Students registered for courses
  • The primary care physician assigned to an insured party
  • Brokers assigned to clients
Subtyping: Core and Custom

Another situation where modelers often believe they must “fall back” on ER techniques is when the attributes of a dimension table vary by type. This variation is often misconstrued as calling for the ER construct known as subtyping.  Similar variation might also be found with associated facts

In the dimensional model, heterogeneous attributes are handled via the core and custom technique.

A core dimension captures common attributes, and type specific replicas capture the core attributes along with those specific to the subtype. Dimensions can then be joined to the fact table according to the analytic requirement. If there is variation in the facts, the same can be done with fact tables.

Examples include:
  • Products with attributes that vary by type 
  • Customers that have different characteristics depending on whether they are businesses or individuals
  • In insurance, policies that have different characteristics depending on whether they are group or individual polices
  • In healthcare, medical procedures or tests that have different characteristics and result metrics depending on the test
  • In retail, stores that have varying characteristics depending on the type (e.g. owned, franchise, pocket)
Non Additive Metrics: Snapshots or Summary Tables

In a classic star schema design, facts are recorded at a granular level and “rolled up” across various dimensions at query time. Detractors often assume this means that non-additive facts have no place in a dimensional model.

In fact, dimensional modelers have several tools for handling non-additive facts.

Those that can be broken down into additive components are captured at the component level. This is common for many key business metrics such as:
  • Margin rate: stored as margin amount and cost amount in a single fact table. The ratio is computed after queries aggregate the detail.
  • Yield or conversion percentage: stored as quote count and order count in two separate fact tables, with ratio converted after aggregation at query time.
Other non-additive facts cannot be broken down into fully additive components. These are usually captured at the appropriate level of detail, and stored in snapshot tables or summary tables. Common examples include:
  • Period-to-date amounts, stored in a snapshot table or summary table
  • Distinct counts, stored in a summary table
  • Non-numeric grades, stored in a transaction-grained fact table
While these non-additive facts are flexible than additive facts in terms of how they can be used, this is not a result of dimensional representation.

Conclusion

Every technique mentioned here is part of dimensional modeling cannon. None are stopgaps or workarounds. While some may prove problematic for some of our BI software, these problems are not unique to the dimensional world.

In the end, the dimensional model can represent the same real-world complexities that entity-relationship models can. No ER fallback required.

- Chris


Learn More

All of these topics have been covered previously on this blog.  Here are some links to get you started.

Establishing granular, detailed star schemas:
Multi-valued Dimensions:
Multi-Valued Attributes:
Recursive Hierarchies:
Factless Fact Tables:
Core and Custom:
  • There's not much on this topic on this blog right now, but see my book for more info (details below.)
Non-additive Facts:

Also consider checking out my book, Star Schema: The Complete Reference.  It covers all these topics in much more detail:
  • Granular and detailed stars are covered in Chapter 3, "Stars and Cubes"
  • Multi-valued Dimensions and Multi-Valued attributes are covered in Chapter 9, "Multi-valued Dimensions and Bridges"
  • Hierarchy Bridges are covered in Chapter 10, "Recursive Hierarchies and Bridge Tables"
  • Factless Fact Tables are covered in Chapter 12, "Factless Fact Tables"
  • Core and Custom schemas are covered in Chapter 13, "Type-specific Stars"
  • Non Additive Facts are covered in Chapter 3, "Stars and Cubes," Chapter 11, "Transactions, Snapshots and Accumulating Snapshots" and Chapter 14, "Derived Schemas"

Use the links on this blog to order a copy from Amazon.  There is no additional cost to you, and you will be helping support this  blog.






Monday, November 5, 2012

Name Value Pairs and Dimensional Models

A reader asks how name/value pairs are best translated into a dimension design.
I was wondering if you have ever had to setup a data warehouse with a source system that has the ability to have dynamic "virtual" columns using a set of name/value pair tables...If you have done this or have any cool ideas on how to tackle this setup in a dimension that would be greatly appreciated.
-Joshua
Belmont, CA

This is a common scenario. The preferred solution is to identify the most commonly used "virtual columns" and model them as standard attributes. Bridged solutions are also possible, but are much less flexible from a usability standpoint.

Name Value Pairs

Name/value pairs are often used in OLTP solutions to offer a flexible way to describe an entity.

Rather than store each characteristic of an entity as an attribute, a special table lets you identify unlimited attributes for something. Its main columns capture the name of the attribute and its value. (Hence the moniker "name/value pair".)

For example, suppose you are Apple Inc. Your OLTP system has a product table, but each product has a wide variety of attributes. Rather than model each attribute, you might capture them in a name/value pair table.

The rows in the table for the newly released iPad Mini might look like this:

As you can see, the 18 rows above describe one iPad Mini model. 

Elsewhere in the data model, and associative table (or "intersect table") cross references these 18 pairs with a single row in the product table for this item.

Apple has 23 other iPad Mini models, each with different of values for the capacity attribute, color attribute, carrier attribute, etc. 

In addition, this same table can be used to describe the characteristics of products with different named attributes - a phone, a computer, a notebook, and so forth. New attributes do not require changing the database design. Instead, all that is necessary is to insert the appropriate rows.

In a dimensional model, there are several ways to handle name/value sources. Each involves tradeoffs between flexibility and usability.

Flattened Solution is Most Usable

The preferred method is to model the "names" as explicit dimension attributes.1

In the case of our product, for example, the resultant dimensional model might nook like this:

This approach produces the most understandable and usable dimensional model. Each important characteristic is present in the dimension, named according to business terms, and ready to provide context for facts.

Do not be discouraged if every instance of the source entity (product in this case) does not share the same set of attributes in the name/value pair table. Look for 100 or so attributes that are most commonly used. In many cases, this may satisfy most of your analytic requirements.

Core and Custom Dimensions

If there are simply too many names to flatten into a single dimension table, the next step is to consider "core and custom" models.

If you separate the entity by "types", can you find 100 named attributes for each type?  

If so:
  • Build a single "core" dimension that contains an identifier and any attributes common across all types, along with a surrogate key.
  • For each type, build an additional dimension that replicates the common attributes, plus contains the 100 or so attribute names specific to the type
  • The core and custom versions should use the same surrogate key domain for each member of the dimension, so that they can be swapped in and out of queries as required.
For example, Apple might do the following:
  • Build a single product table that has the part number, category, product name and a key. This table will have a row for all products, including computers, tablets, phones and music players. This is the "core" product table. 
  • Build additional "custom" dimensions would be built for each category of product. The table pictured in the previous section might be the custom dimension for tablets.
This solution retains the understandability of the previous solution - each attribute is clearly and explicitly named. However, we must now take care to join the appropriate version of the dimension to fact tables - a technical consideration based on the business question being asked.

Bridged Solution Mirrors OLTP Design

The last option is to employ a bridged solution. This technique actually mimics the original OLTP design. An outrigger contains the name/value pairs, and a bridge associates it with the dimension.

For example, Apple's product dimension might be bridged as follows:

This solution has the benefit of providing access to the full range of named attributes. It is also flexible; new named attributes do not require any change to the schema design.2

The disadvantage here is that this solution is the most difficult to use. Facts can be easily double counted, triple counted, etc. It will be necessary to take great care in constructing and qualifying queries, and it may also be necessary to lock down "grand total" functionality in end-user reporting tools.

Both Ways

Remember that this is not an either/or solution. If a bridged outrigger is necessary, consider also capturing the 100 most commonly used attributes in the dimension itself.

This allows you to create a safe "sandbox" for less technically sophisticated users. They are able to access much of the analytic value of your solution, without having to expose them to the complexity and possible dangers of the bridge table.

More Reading
  • Bridges: Much has been written in this blog about bridge tables. The most relevant entry if you want to learn more is Resolve Repeating Attributes With A Bridge Table (1/28/11).  There are also two full chapters in my latest book (see below.)
  • Core and custom dimensions have not been previously discussed here, but you can learn more about them in Star Schema: The Complete Reference.  It dedicates an entire chapter to "Type-Specific Stars."
Help support this blog

This blog is ad-free, and no one pays me to write it.  You can help out:  
  • Use any of the links on these pages to pick up my latest book, Star Schema: The Complete Reference.  
  • Or, if you already have it, use the links to check out any of the other recommended books.

When you do, a small portion of the purchase goes to this blog (you do not pay any extra).


Notes:

1I'll call these "names" rather than "pairs", since its the name that will translate into a column. For example, "color" is a name, it may participate in several name/value pairs -- one for each possible color.

2The bridge table in this diagram contains product and attribute keys. If there will be many combinations, dimensional modelers often replace the product key with a group key, and store the group key in the dimension table. See Resolve Repeating Attributes With A Bridge Table (1/28/11). 


Friday, May 18, 2012

Using A Hierarchy Bridge WIthout A Many-to-many Relationship

In this post, I describe what you can do to eliminate the many-to-many relationship when using a hierarchy bridge.

The Hierarchy Bridge

A recent post on Recursive Hierarchies and Bridge Tables described how to support a recursive hierarchy in a dimensional model.

The business benefits of this kind of solution can be extremely valuable.

In the example, a special bridge table was used to navigate the relationships between companies and their subsidiaries. When connected to a fact table, this allowed us to roll up transactions via the recursive hierarchy.

Here is what the solution looked like.  Click the picture for a larger version.  For full details, refer to the original post.



Many-to-many is legal, but you may need to resolve it anyway.

You be wondering about the many-to-many relationship -- the one between the fact table and the bridge table.

This kind of relationship is perfectly valid in a relational database.  Unfortunately, some of our software products may not accept this kind of relationship, making it necessary to eliminate it.

I've written about this before in the context of a multi-valued dimension.  The solution in that case was to resolve the many-to-many relationship by creating a single-column table that consisted solely of a surrogate key.

With a hierarchy bridge, we can resolve the many-to-many without adding one of those silly single-column tables.

Additional table not required

In the recursive hierarchy above, we already have what we need: the company table itself.

Notice that in the picture above, the company table represents the "superior" to which we are aggregating transactions. The bridge links it to facts that describe the activities of subordinates.

To resolve the many-to-many, we will insert a second alias for the company table between the fact table and the bridge.  This alias represents the subordinate.

Now the picture looks like this:




An added benefit of this approach is that we can include information about both the superior and the subordinates in our query results.

Learn more

There are several posts describing bridge tables in this blog.  This post has mentioned two of them:
  • Use the category keyword bridges to find more posts.
And of course, you can help support this blog by picking up a copy of Star Schema: The Complete Reference.  It devotes two chapters and 30 illustrations to the topic of bridge tables.


Bridge photo by Branwebs via 
Creative Commons

Wednesday, May 2, 2012

Recursive Hierarchies And Bridge Tables

This post introduces the hierarchy bridge table, which can be used to navigate recursive hierarchies.

High Trestle Bridge at Night
Photo (c) Carol Bodensteiner
Bridge tables are a popular topic among readers of this blog. Perhaps that's because they can be a bit confusing, yet also enable powerful analysis.

Previously, I've focused on the two simplest uses for bridge tables:  the attribute bridge and the dimension bridge.  See links at the end of this post for more on these.

In this post, I'll tackle a third and final use: how a bridge can aid in the navigation of a recursive hierarchy.

Recursive relationships

In the real world, recursive relationships are common. Employees report to other employees, parts are made of other parts, and so forth.

In an E-R model, a recursive relationship might look like this:

Companies may own other companies.  That dotted relationship line is sometimes called a "pigs ear."  It links each company to the company that owns it.

You can read this relationship in two directions:  each company may have a parent company; each company may be the parent of one or more companies.

Looking up and looking down

From a business perspective, this information may be very useful.  For example, suppose you work for a business whose sells things to companies. You would like to be able to answer questions like these:
  1. What are January order dollars from to Company XYZ and its subsidiaries?
  2. Show me January orders from Company ABC and any companies above it.
The first question "looks down" from company XYZ.  The second question "looks up" from company ABC.

In both cases, we want to use the hierarchy to aggregate details about orders.

The hierarchy bridge

In a dimensional model, the idea is to support the study of facts (such as "order dollars") in the context of this hierarchy. This can be done using a bridge table, as shown here:

The table in the center is a hierarchy bridge.  This kind of table can be a bit confusing.  First, consider its content.

Each row contains a pair of company keys.  Here are the rules for populating the bridge table:
  1. For each company, the bridge contains rows associating it with each of its subordinates.
  2. Each company also has 1 row associating it with itself.
With respect to item 1, note each company is linked to all subordinates, not just the direct ones. (That's why the keys have the prefixes "superior" and "subordinate" rather than "parent" and "child".)

Using the bridge

In the picture above, notice that the fact table contains a company_key.  This represents the company from which each order is taken.  

Normally, we would join it to the company table to study "orders by company."

In this case, however, the bridge is inserted between the fact table and the company table. The bridge acts like a multiplexer--linking each company (on the right) to transactions with any and all of its subordinates (on the left).

This allows us to select a single company on the right (say Company XYZ) and aggregate all orders at or beneath that company.  That's called "looking down." It lets us answer question 1.

We can also use this hierarchy to "look up," as in question 2.  We simply reverse the joins--link company to the subordinate key, and the superior key to the fact table.  Then we can select a single company (say Company ABC) and look upward.

Powerful but dangerous

Of course, the danger of the bridge table should be readily apparent.  As you can see from the diagram above, if we forget to select a single parent company, we can easily double count the orders, triple count them, or worse.

For this reason, writing queries that involve the hierarchy must be governed carefully.  Developers or users with little experience can study the facts without using the bridge.  This will be safe, but they will not be able to leverage the hierarchy.

Other kinds of bridge tables

The hierarchy bridge described in this post is one of three kinds of bridge tables.  I've written previously about the other two kinds:
  • Bridge To Multi-Valued Dimensions (2/9/2011) shows how a bridge can link a single fact to multiple rows in a dimension.  Examples include orders with multiple salespeople, claims with multiple parties, etc.
More on the hierarchy bridge

When it comes to recursive hierarchies, this post only scratches the surface.  If you have questions, send them in.

In the mean time, there's a whole chapter on this topic in my book, Star Schema: The Complete Reference.  That would be Chapter 10:  "Recursive Hierarchies and Bridges." It's a long chapter--the longest one in the book.

And don't forget, when you use the links on this page to buy it from Amazon, you are helping to support this blog.


Copyright (c) Carol Bodensteiner
Used by permission, with many thanks!

Thursday, April 26, 2012

Q&A: Human resources data marts


A reader asks if Human Resources data marts are inherently complex. I run down a list of dimensional techniques he should expect to find:
Q: I will be working on a data mart design project to design star schemas for human resources data. I heard that HR data is more complex than sales or marketing and special techniques need to be applied.

I looked at the star schemas of pre-built analytical applications developed by some vendors for our packaged HR solution. I felt that they are quite complex and just wondering star design for HR data should be so complex.

If possible, can you please discuss this topic in a detailed manner by considering any one of popular HRMS system data and the most common data/reporting requirements along with the design discussion to achieve the star for those reports using the given HRMS data?

- Venkat, UK

A: Human Resources applications do indeed tend to use advanced techniques in dimensional design.

Below, I run down a list of topics you will probably need to brush up on. In reality, every subject area requires complete mastery of dimensional modeling, not just the basics.

Note that the complexity you are seeing in packaged solutions may stem from the subject area. Vendors often produce abstracted models to facilitate customization.

Techniques used in HR data marts

No doubt you are accustomed to the transaction-grained stars you encountered in sales.  You will find them in HR as well, but you will also encounter these:
  • Snapshot stars sample one or more metrics at pre-defined intervals.

    In an HR data mart, these may be used to track various kinds of accruals, balances in benefit programs, etc.

  • Accumulating snapshot stars track dimension members through a business process and allow analysis of the elapsed time between milestones.

    These may be used to track the filling of a position, "on-boarding" processes, disciplinary procedures, or applications to benefit programs.

  • Factless fact tables track business processes where the primary metric is the occurrence of an event. They contain no facts, but are used to count rows.

    These are likely to be used for tracking attendance or absence, participation in training courses, etc.

  • Coverage stars are factless fact tables that model conditions.  These are usually in place to support comparison to activities represented in other stars, but may also be leveraged to capture key relationships among dimensions.

    These are likely to be used for linking employees to positions, departments and managers. 
Your dimensions will also require reaching beyond the basics:
  • Transaction dimensions capture the effective and expiration date/time for each row in a dimension table.  These are advisable in almost any situation.

    In HR they may be used to  track changes in an employee dimension.

  • Bridge tables for Multi-valued attributes allow you to associate a repeating attribute with a dimension.

    In HR, these are likely to be used to associate an employee with skills, languages, and other important characteristics.

  • Hierarchy bridge tables allow you to aggregate facts through a recursive hierarchy.

    In HR, these are used to navigate reporting structures (employees report to employees, who in turn report to other employees, and so forth) as well as organizational structures.
I would also expect to encounter some complexity in slow-change processing rules.  Human Resource systems carefully audit certain kinds of data changes, tracking the reason for each change. As a result, you may have attributes in your dimension schema that may exhibit either type 1 or type 2 behavior, depending on the reason for the change.

Every schema goes beyond the basics

This list could go on, but I think you get the idea.

The only way to design a data mart that meets business needs is to have a well rounded understanding of the techniques of dimensional modeling.

You cannot get very far with nothing more than a grasp of the basics. This holds true in any subjet area -- even sales and marketing.  You need the complete toolbox to build a powerful business solution.

Packaged data marts

The complexity that concerns the reader may actually stem from another cause:  he is looking at packaged data mart solutions.

Packaged applications often introduce complexity for an entirely different reason: to support extensibility or customization. For example, facts may be stored row-wise rather than column-wise, and dimensions may contain generic attribute names.

Learn more

This blog contains posts on most of the topics listed above.  Click each header for a link to a related article.  Some have been discussed in multiple posts, but I have included only one link for each.  So also do some exploration.

In addition, please check out my book Star Schema: The Complete Reference.  When you purchase it from Amazon using the links on this page, you help support this blog.
  • Snapshots and accumulating snapshots are covered in Chapter 11, "Transactions, Snapshots and Accumulating Snapshots

  • Factless fact tables and coverage stars are covered in Chapter 12, "Factless Fact Tables"

  • Transaction dimensions are covered in Chapter 8, "More Slow Change Techniques"

  • Attribute bridges are covered in Chapter 9, "Multi-valued Dimensions and Bridges"

  • Hierarchy bridges are covered in Chapter 10, "Recursive Hierarchies and Bridges"
Thanks for the question!

- Chris

Send in your own questions to the address in the sidebar. 


Do you have another technique that was useful in an HR data mart? Use the comments.

Image credit: Gravityx9 licensed under Creative Commons 2.0

Tuesday, October 4, 2011

Are bridge tables really fact tables?

A reader observes that bridge tables seem problematic, and wonders if they should just replaced by factless fact tables.

Q:  I am wondering if all bridge tables are in fact replacements for factless fact tables. The problem with the bridge table as you mention it is that...[you] need to do an expensive join and issues with Cartesian joins/ double counting etc. So the question is whether a bridge table is practical option as compared to a separate fact.
Ashish
Bangalore, India

A:  The bridge table looks a lot like a fact table, but it is used very differently.

While we avoid joining a fact table to another fact table, we seek to join bridge tables to fact tables. This can have implications for BI tools that generate SQL.

Similarities

A bridge table appears similar to a fact table because it contains multiple foreign keys. This is most evident when you look at an attribute bridge table, which links a dimension to an outrigger. It consists solely of foreign key references to the dimension and the outrigger.

In this respect, the bridge table is very similar to a factless fact table.  Indeed, one might make the argument that a bridge relates a set of dimensions in much the same way that a factless fact table describes conditions.

But there is a very important difference: we never join a fact table to another fact table. Bridges, on the other hand, are intended to be joined with fact tables.

We do not join fact tables to fact tables

You should never join two or more fact tables--either directly or indirectly via shared dimensions. Fact values will repeat if multiple rows in either fact table share the same dimensionality. We receive a Cartesian product of all related facts.  The result is double-counting, or worse. 

Instead of joining fact tables, we use a technique called drilling across. Facts are collected from each table and aggregated to common level of detail, then merged into a single result set. I wrote about this process earlier this year.1

Many BI tools that generate SQL are able to identify fact tables and automatically invoke drill across logic when required.

We do join bridge tables to fact tables

Bridge tables represent groups. We create them so that a single fact can be associated with a group of values (such as multiple salespeople) rather than a single value.

When we use a bridge table, we link it to other fact tables.2  By doing this, a single fact in the fact table associates with multiple rows in the bridge table.   

With a bridge, we are exploiting the very Cartesian product that we normally seek to avoid.

We are intentionally repeating a single fact for multiple group members.  To avoid incorrect results, it behooves us to group results by member, or to constrain for a single group member.

Even if you were to replace a bridge with a factless fact table, this is the behavior you would desire. Rather than drill across, you would link it to other fact tables, in order to associate the bridged values with various facts. Cartesian products and the danger of double counting would remain.

Bridge tables and BI software

A bridge table is not really a fact table. It is not the locus of process measurement.  It describes neither activities nor conditions. It is merely a construct that allows us to deal with repeating values. It is meant to be linked with fact tables, and used with care.

Because a bridge is composed of foreign keys, however, some BI tools may identify it as a fact table. If your tool does this, you will need to prevent it from invoking drill-across logic for queries that involve a bridge.

Your tool may have a facility for this. If it does not, you can hide the bridge by joining it to dimensions within a view.

More info

Thanks to Ashish for the comments. If you have a question about bridge tables, send it to the address on the sidebar.

You can learn more in my book, Star Schema: The Complete Reference.  Two full chapters are dedicated to bridge tables, including 30 diagrams.

See also:
1Factless fact tables which describe conditions are not joined to other fact tables either. When they are compared to other fact tables, we typically use set operators or subqueries.

2This may happen directly, in the case of a dimension bridge, or indirectly, in the case of an attribute bridge.


Image by ahisgett licensed under Creative Commons 2.0

Tuesday, June 21, 2011

Allocation Factors Are Not Magic Bullets

Reader interest in bridge tables continues.  Today, I'll address a topic that was raised in the comments: the use of allocation factors.

Allocation factors are often seen as the cure-all remedy to avoid double counting.  Unfortunately, they are not always realistic. And when they are possible, there is often a better solution.

The back-story: double counting

A dimension bridge allows a single fact row to be linked to two or more rows in a dimension.  In a previous post, I showed how this technique allowed two or more salespeople to be linked to a row in an orders fact table.

As I pointed out, the danger is that queries may accidentally count the order dollars repeatedly -- once for each associated salesperson.  This is the danger of double-counting.

The potential solution: allocation

An allocation factor can sometimes be used to avoid this danger. To the bridge, we add a column that indicates the "share" of the facts associated with each group member.  In queries, before we aggregate facts, we apply this allocation factor.

For example, we can add an allocation factor to the salesperson group bridge: 

If two members collaborate on a sale, and split the credit 75/20, there will be two rows in the bridge for their group.  One would contain the group key, the first salesperson key, and a .75 allocation factor.  The other would contain the group key, the second salesperson key, and a .4 allocation factor.  (For sample data, see Figure 9-5 my book.)

Now, when we connect the bridge to the fact table, we can apply the allocation factor:

select sum(order_facts.order_dollars * salesperson_bridge.allocation)

Seems to be the perfect solution, right?  Well, not always....

There must be a business rule

An allocation factor only works where one exists. Do not come up with your own rules -- e.g. "split transactions evenly among group members." This may not reflect how the business studies their key metrics.

Allocation factors are often not part of the business. If an insurance claim is tracked in a fact table, it may not make sense to allocate it against multiple parties. If a fact table represents a insepction event, it may not make sense to allocate it against multiple infractions.

There must be a definitive and consistent business rule that can be used. If there is no business rule, then obviously allocation will not work. Note, too, that they are even less likely to be found in situations that call for an attribute bridge.

There should be a source of record

So you have a business rule?  You are not out of the woods yet.

Next, you need to make sure you have a source of record for the allocation factor.

For example, many systems that support the sale of securities calculate, and split, broker commissions at the time of sale. These systems provide a nice, clean allocation factor.

If you cannot find a source of record, you should think long and hard about using the "business rule." In all likelyhood, it will be subject to exceptions, and may even change over time. 

For example, your business may tell you transactions can be allocated evenly among group participants. You may choose to "hard-code" this rule into your ETL process. But one day, you will encounter a situation where the business wants to "override" this rule for a particular transaction. Or, a new rule is instanitated that one particular salesperson always gets at least 40% of their sales, and so on...

Each time the business changes its mind, ETL rework will be necessary. If you don't have a system of record, think long and hard before hard-coding a business rule.

Consider altering fact table grain

Got a business rule AND a source of record?  Fantastic!  You are very lucky, and you may not even need an allocation factor at all!

When you have a clear source for an allocation factor, consider skipping the bridge altogether. Instead, restate the grain of your fact table as allocated transactions.

In the orders example, we can change the grain of the fact table from "one row per order line" to "order lines allocated to salespeople."

Now, if two salespeople collaborate on a sale, there will be two rows in the fact table.  They will be fully addtitive, with no risk of double counting the facts. No bridge needed!

Note that this works best of the orders are allocated in the source system.  Otherwise, you will wind up with splits that are not even (33/33/34) or fractional quantities that do not total to the same amount as the operational system.

More to Come

My inbox is full of additional questions and comments about bridge tables.  I promise to post some more about bridged designs in the months to come.

In the meantime, I encourage you to check out Star Schema The Complete Reference, which contains two full chapters on bridge tables.  Chapters 9-10 offer a far deeper treatment than I can provide on this blog, with 30 figures, sample instance charts, sample queries and results, etc.

Bridge photo by Branwebs via Creative Commons

Monday, April 25, 2011

Bridge Tables and Many-to-many Relationships

Recent posts about bridge tables generated a lot of interest. Today, I will look at the many-to-many relationships in these solutions. 

These relationships are intended, necessary, and valid parts of database design. Many software tools do not understand this kind of relationship, so I'll also show you how you can eliminate it, if need be.

Many-To-Many

Previously, I described two uses for bridge tables. Both examples involved many-to-many relationships between tables. Many readers may have thought this was an error. In fact, it was intended.

One case involved a situation where facts must relate to multiple dimension members. Specifically, it was possible for two or more salespeople might collaborate on orders captured in a fact table. The bridged solution looked like this:

Notice that there is a many-to-many relationship between the bridge table and the fact table. This can be understood as follows:
  • A given fact may reference multiple rows in the bridge.  This happens when two salespeople collaborate on an order. 
  • A given row in the bridge may reference multiple facts.  This happens when the same group collaborates on more than one order line.
The bridge table allows us to link the repeating dimension to the facts, but as noted in the post, care must be taken not to double count.

A Legal Relationship

In the world of entity-relationship modeling, we are taught that many-to-many relationships may be present in a logical model, but not  in a physical design. But we are not discussing entity-relationship modeling.

In a dimensional model, this kind of relationship is allowed. It is also 100% compatible with the relational model of data.  Despite what you may think, it is also compatible with your relational database management system.

The figure above can be implemented as a physical design, and it will work. Unfortunately, some of our software tools do not understand this. 

Tool Problems

Many of the software products we use in data warehousing were originally developed for transaction processing systems. Because of this lineage, the many-to-many relationship can be a problem.

For example, suppose your DBMS works best if joins between tables are supported by declaration of primary key/foreign key relationships. This is a problem for the model pictured above. Order_facts and salesperson_bridge can be joined using sales_group_key.  But it is not a primary key in either table.

Luckily, this join will work even if it is not supported by any key constraints. Unfortunately, other tools may not be as forgiving.

Some data modeling tools do not permit a many-to-many relationship in a physical model. This presumably reflects their entity-relationship lineage. And if you use a SQL-generating BI tool to create reports, you may find it identifies bridge table as fact tables.

Eliminating The Many-To-Many

If your software products cannot live with a many-to-many relationship, there is a simple solution. You can resolve in the same way an entity-relationship modeler would: create an intersect table.

In the example above, a new table is added that contains one row for each group. An association or intersect table, will link the group to the salespeople. The result looks like this:
For those of us who find humor in data modeling, this is a source of much amusement. The sales_group table consists of a single column. And since the sole column is a surrogate key, the table's entire contents are likely to be a sequence of integers!

Still: problem solved. Only take this additional step if you really must.

More on Bridges

Bridges are powerful but dangerous . To make proper use of them, there is more to understand.  Additional topics include he impact of slowly changing dimensions, the use of allocation factors, how to prevent end users from double counting....the list goes on.

I will post more about bridges, but I encourage you to also read about them in my book, Star Schema The Complete Reference.  The book goes into much more depth on bridge tables than I can do in a blog post - or even a series of posts.  Two full chapters are dedicated to bridge tables.  Check it out!

- Chris


Bridge photo by Branwebs via Creative Commons

Monday, March 7, 2011

Q&A: Bridges are Part of Dimensional Modeling

Two recent posts on bridge tables generated a lot of questions.  Here is the most common:
When you use a bridge table, isn't that a snowflake? I thought this was not allowed in dimensional modeling.
The use of a bridge is a limited form of snowflaking.  It is also an important feature of the dimensional model.

Bridges are widely accepted, even among those who would otherwise avoid snowflake designs.

A Limited form of Snowflaking

In a Q&A from last year, I suggested that a snowflake occurs when dimensions are joined to something other than a fact table. By this definition, the examples from the previous two posts are certainly snowflakes.

In the first example, a bridge was used to associate a dimension with a repeating attribute (a company with multiple industries.)

In the second example, a bridge was used to associate a fact with multiple dimension rows (a sale with multiple salespeople.)

In both cases, a standard one-to-many relationship between fact and dimension does not exist.1 The bridge table solves this problem.

This usage is widely accepted. In fact, the bridge table is a central feature of dimensional modeling. I devote two full chapters to bridged designs in Star Schema: The Complete Reference.

Other Relationships are Not Modeled

When a bridge table is employed, most dimensional modelers will still refer to the schema as "a star." That's because the bridge is only used in the situations described above -- when the standard master-detail relationship between dimension table and fact table breaks down.

Other kinds of relationships between data elements are not resolved via multiple tables.  In our examples:
  • Repeating attributes are permitted (work phone, home phone, mobile phone) 
  • Dimensions contain partially dependent attributes (brand information present in a product table)
  • Master-detail relationships are not instantiated into separate tables
  • Values associated with codes are not placed in look-up tables
  • Dimensions are not linked to one another
An all-out snowflake design would employ principles of normalization to eliminate these features.

Other Times to Snowflake?

There may be some other situations where limited snowflaking is acceptable.  One is where a set of attributes (e.g. geographic) would otherwise appear in multiple dimension tables. Another is where software tools (the RDBMS or BI software) work better in a snowflake environment. 

These uses are a bit more controversial, and I wrote about them in a previous post.

When it comes to the bridge table, however, there is no controversy.  The bridge is an accepted technique in dimensional modeling, and is often necessary for specific dimensional designs.

- Chris

1 Bridges may also be used to navigate recursive relationships, as I describe in the book.

Image credit: Gravityx9 licensed under Creative Commons 2.0

Wednesday, February 9, 2011

Bridge to Multi-Valued Dimensions

Here's what you can do when a dimension table and fact are not in a one-to-many relationship.

A recent post described how a bridge can be used to resolve repeating attributes in a dimension. Today's post looks at a second use:  facts that must link to multiple rows in a dimension.

Facts with Multi-Valued Dimensions

In most star schemas, each fact table row references exactly one row in each associated dimension. This relationship between dimension table and fact table is variously called "one-to-many", "master-detail" or "parent-child". For example, in a star schema that measures the order-taking process, each row in the fact table will link to exactly one row in the product table, one row in the salesperson table, etc.

But what happens when more than one salesperson can be involved in an order? Suppose there could be two, three, or any number of salespeople? The one-to-many relationship breaks down.

Avoiding the Bridge

As with the repeating attribute from last time, we might try to accommodate this multi-valued dimension by simplifying the relationship. Perhaps we can isolate some "roles" that can be used to recover a clean one-to-many relationship. If there is always a lead salesperson and a secondary salesperson, we can simply place 2 salesperson keys in the fact table.

This has some disadvantages, though, and they are the same as those from last time.

First, it becomes difficult to produce a report showing all sales that a particular person was involved in. The query needs to select sales where the primary salesperson is the person in question, OR where the secondary salesperson is the person in question. Doing this for all salespeople in a department becomes tricky, especially if we are using a SQL-generating BI tool.

The second problem with simplifying the relationships is that there may be a need for any number of dimension values. Suppose 3 sales people collaborate on an order? Do we provide 3 salesperson keys in the fact table? This compounds the first problem, and it will only be a matter of time before we come across a case where we need 4.

A Dimension Bridge

The solution is to build a table that bridges the fact table and dimensions.

When a fact references a group of dimension values, a group will be created. The fact table will contain the group key. This key will join to the bridge table.  The bridge will have a row for each group member, linking to the dimension. An example appears here:


This simple solution allows individual rows in the fact table to link to any number of salespeople.  Notice that it requires a group to be established even for orders that have only one salesperson.

Using the Bridge

With a dimension bridge in place, it is easy filter facts for single dimension member, or group them by dimension members.  The sum of sales, for example, can be grouped by salesperson: the fact table joins to the bridge, the bridge joins to the dimension, and a simple group by query is issued.

Dimension bridges are used in many other situations. Multiple citations during an inspection, multiple parties in a dispute or claim, and so forth.

Dangers

As with the attribute bridge, the dimension bridge brings with it the possibility of misuse.

Without careful attention, facts may be double-counted, triple-counted or worse.  For example, if two people collaborate on a sale, a report of sale by salesperson will show the sale twice.  This kind of report is useful, but creating a grand total would be a mistake. Similar issues crop up if both people are in the same department, and someone tries to produce a report showing sales by department.

The solution is to make sure the bridge is only accessible to trained developers and analysts.  A simplified solution can be made available to ad hoc users. In this case, that might mean adding a primary salesperson key to the fact table. This allows ad hoc users to safely join the fact table and dimension table, but their analytic capability will be limited.

More On Dimension Bridges

This introduction has only scratched the surface of dimension bridges. As with the previous post, there are many more refinements and complications that can be introduced. These include the impact on ETL, the addition of allocation factors, coping with software that does not permit a many-to-many relationship and more.

These topics will be explored in future posts. You can also read all about bridge tables in Chatpers 9 and 10 of Star Schema: The Complete Reference.

Have a question about bridge tables? Send it to the address in the sidebar, and it may be addressed in a Q&A post.

Image by Branwebs via Creative Commons

Friday, January 28, 2011

Resolve Repeating Attributes with a Bridge Table

One of the most underutilized and misunderstood techniques in dimensional design is the use of bridge tables. These special tables can be used in several situations. This post focuses on one use case: the repeating attribute.

When a dimension attribute appears multiple times in the same dimension table, we call it a "repeating attribute." In the world of entity-relationship modeling, these are avoided, but in the world of dimensional modeling they are acceptable. For example, a customer dimension table may contain three phone number attributes -- work, mobile and home.

In some cases, though, the repeating attribute is not effective. For example, you are designing a dimension table that represents companies. One of the attributes you want to include is the company's industry -- manufacturing, services, banking, etc. The problem is that some companies participate in more than one industry. What do we do for Company A, it it participates in Manufacturing and Services?

Repeating the Attribute Doesn't Cut It

You might try a design that repeats the industry, just like the multiple phone numbers. But this doesn't work very well in the case of our company. For one thing, each of the three phone numbers had a clear and distinct role: work, mobile, home. We can't do the same thing with a company's industries.

We might arbitrarily name some distinct industry columns -- industry_1, and industry_2 for example. But this introduces two more problems. The first is obvious: what if it turns out there is a company that participates in 3 industries? Exactly how many company columns are we going to add to this table?

The other problem has to do with usability. If you want to create a report that shows sales for companies in a particular industry, you have to qualify your query multiple times, and remember to use "or". Plus you have to take care balancing parentheses if there are other predicates:


WHERE
( industry_1 = "Services"
OR
industry_2 = "Services"
) AND

year = 2011
The usability challenge is increased if you want to group results by industry; you may need to resort to procedural logic.

Solution: Bridge

A real-world bridge connects to places together; it is not a destination itself. Similarly, the bridge table's job is to connect two tables together. It contains no attributes that will appear in query results; its function is to establish a link.

In the case of a multi-valued dimension attribute, a bridge is used to link the main dimension with an outrigger that houses the repeating attribute. For the company/industry dilemma, the link looks like this:


This construction makes it possible to associate any number of industries with a company. It is now simple to qualify queries for a specific industry. Since there is one column called "industry" is also simple to group results by industry.

Take Care

The flexibility and power of the bridge comes at a cost: it is dangerous. When not used properly, facts may be double-counted, triple counted or worse.

Suppose you want to report on sales by industry. You connect industry to company using the bridge, and you then connect company to a fact table containing sales dollars. You aggregate sales and group by industry. Remember that Company A appears in the manufacturing and services industries. Its sales will be repeated for each industry.

This kind of report is acceptable in many businesses. Kimball calls it an "impact report". But we must take special care with these reports -- labeling them as such, and taking special care not to create "grand totals".

Other Repeating Attributes

Repeating attributes occur all over the place. Multiple keywords for a document. Multiple diagnoses for a patient. Multiple responsible parties for an account. Multiple skills for an employee. I am sure you can imagine others.

In all cases, a bridge to an outrigger gives you increased analytic flexibility, though you will have to carefully monitor for misuse.

More Uses, More Issues

There is a lot more to say about bridge tables. That's probably why I've never written about them in the 4-year history of this blog. There are to other ways to employ a bridge.

A bridge can be used when there are potentially many dimension rows associated with a single fact-- such as multiple salespeople associated with a sales fact table.

Bridges are also very powerful in helping traverse recursive hierarchies -- companies that own other companies, parts that break down into other parts, etc.

Though powerful and flexible, bridge tables introduce new complications for the data warehouse team. For example, the many-to-many relationships they introduce (see the picture above) are problematic for some modeling tools, and even some RDBMS's. They also impact how we configure our BI and reporting software. And we must give careful thought to the impact of bridges on slowly changing dimensions.

Future posts will look at some of these topics. I also devote two full chapters to bridge tables in Star Schema: The Complete Reference.

Image: Water Lillies and Japanese Bridge from the Public Domain.