Tuesday, November 15, 2011

Conformed Dimensions

This second post on conformed dimensions explores different ways in which dimensions can conform. 
There are several flavors of conformed dimensions. Dimensions may be identical, or may share a subset of attributes that conform.
Conformance basics

Conformed dimensions are central to the discipline of dimensional modeling.  The basics of conformance were introduced in a post from earlier this year.  In a nutshell:


  • Measurements of discrete business processes are captured in individual star schemas (e.g. proposals and orders)
  • Some powerful business metrics combine information from multiple processes.  (e.g. fill rate: the ratio of orders to proposals)
  • We construct these metrics through a process called drilling across
  • Drilling across requires dimensions with the same structure and content (e.g. proposals and orders have a common customer dimension)

For a refresher on these concepts, see "Multiple stars and conformed dimensions" (8/15/2011). 

Physically shared dimensions not required 

When two fact tables share the same dimension table, their conformance is a given. Since the shared dimensions are the same table, we know they will support drilling across.

For example, stars for proposals and orders may share the customer dimension table.  This makes it possible to query orders by customer and products by customer, and then merge the results together.

But this process of drilling across does not require shared dimension tables. It works equally well if proposals and orders are in separate data marts in separate databases.

As long as the stars each include dimensions that share the same structure (e.g. a column called customer_name) and content (i.e. the customer values are the same), it will be possible to merge information from the stars. 

Levels of conformance 

It is easy to take this a step further.  We can also observe that there is compatibility between dimensions that are not identical.

If a subset of attributes from two dimension share the same structure and content, they form a sort of “lowest common denominator” across which we can compare data from the stars.

For example, suppose we establish budgets at the monthly level, and track spending at the daily level.  Clearly, days roll up to months.  If designed correctly, it should be possible to compare data from budget and spending stars by month.

The picture below illustrates the conformance of a MONTH and DAY table graphically.  The ring highlights the shared attributes; any of these can be used as the basis for comparing facts in associated fact tables.

In this case, the two conformed dimensions participate in a natural hierarchy.  Months summarize days. The month table is referred to as a “conformed roll-up” of day.

To successfully drill across, the content of the shared attributes must also be the same.  Instances of month names, for example, must be identical in each table -- "January" and "January" conform; "January" and "JAN." do not.

To guarantee conformance of content, the source of the rollup should be the base dimension table. This also simplifies the ETL process, since it need not reach back to the source a second time. 

Other kinds of conformance 

Identical tables and conformed roll-ups are the most common kinds of conformed dimensions.  Other kinds are less common. 

Degenerate dimensions (dimensions that appear in a fact table) may also conform. This is particularly useful with transaction identifiers. 

Overlapping dimensions may share a subset of attributes, but not participate in a hierarchy. This is most common with geographical data. 

More Info 

For more information, see the following posts: 
I also write about conformed dimensions extensively in Star Schema, The Complete Reference.  
If you enjoy this blog, you can help support it by picking up a copy!

Photo by Agnes Periapse, licensed under Creative Commons 2.0

Monday, October 17, 2011

Creating transaction identifiers for fact tables

Sometimes, the source data for a fact table does not include a transaction identifier. When this happens, we often crate our own. A reader asks if this goes against best practices:

Q: You said that we should not create surrogate keys in fact tables. But we have a source system that does not provide us with identifiers for phone calls. When we get the log files, we create surrogate keys for each call before loading them into the star.

A: This is a common practice, and it is just fine.

You are not creating what dimensional modelers call surrogate keys. You are creating transaction identifiers.  These will be helpful as part of the ETL and QA processes.

Surrogate keys vs. transaction identifiers

Surrogate keys should not be confused with transaction identifiers.

As I wrote in the post referenced in the question, a surrogate key is an attribute that is created to uniquely identify rows in dimension tables. It does not come from a source system; it is created expressly for the dimensional schema.

A transaction identifier is a dimension attribute in a fact table that defines its grain, or at least helps to do so. (If you need a refresher on grain, see this post.)

For example, PAYROLL_FACTS may have the grain "one row per pay stub."  Each pay stub has an identifying number in the source system, which is carried into the star as a degenerate dimension. This attribute is a transaction identifier.1

When a transaction identifier defines grain, the DBA may define it as "the primary key of the fact table." Likewise, an ETL developer or QA analyst may use it as a sort of key to compare the star to operational data or staging data. 

From the dimensional perspective, however, these transaction identifiers are dimensions.  They can be used to filter queries, group or sort the results, and so forth. They will not appear in other tables as foreign keys that reference the fact table.

Creating transaction identifiers

When operational systems do not have good transaction identifiers, we often "cook one up" during the ETL process. They are not surrogate keys in the dimensional modeling sense, though the term might sound appropriate. They are transaction identifiers.

For example, suppose you are designing a star that records phone calls made from your call center. You want the grain of the fact table to be  "one row per phone call."  The data will come from your call switch.  As you've mentioned, it creates a log file, but does not provide an identifier for each call.

In your design, you may define your grain: "one row per date, time, originating number and number dialed." This works just fine, but may be difficult for ETL developers to manage.  Instead, they assign a unique identifier to each call in the log. You can use this identifier to define the grain of the fact table.

Carried into the fact table, this attribute aids in the data integration and quality assurance process. It can be used to quickly crosscheck the star against the log file.

However, for analytic purposes, it is not used as a key.  It is a dimension attribute -- a "cooked up" transaction identifier.

Other concerns

When you are forced to create your own transaction identifiers, you will have other things to worry about.

First and foremost, you must be sure there is a mechanism to prevent loading of duplicate facts. You will need to understand how the source system generates its log files, to be sure that you extract data that is complete, accurate, and not subject to revision.

Also, note that the artificial identifiers may convey a false sense that you can reach all the way back to the source to find the original transaction. While it may be possible, your identifier will not be the mechanism.

Thanks for the question, and keep them coming.

-Chris

More Info

For more information, see these posts:

Also check out my book, Star Schema: The Complete Reference.  I discuss surrogate keys in Chapter 1, Analytic Databases and Dimensional Design.  Degenerate dimensions and transaction identifiers are covered in Chapter 3, Stars and Cubes.

1Not all degenerate dimensions are transaction identifiers.  For an example that is not a transaction identifier, see my post of October 15, 2010.

Image licensed via Creative Commons 2.0
 from Patrick Hoesley

Tuesday, October 4, 2011

Are bridge tables really fact tables?

A reader observes that bridge tables seem problematic, and wonders if they should just replaced by factless fact tables.

Q:  I am wondering if all bridge tables are in fact replacements for factless fact tables. The problem with the bridge table as you mention it is that...[you] need to do an expensive join and issues with Cartesian joins/ double counting etc. So the question is whether a bridge table is practical option as compared to a separate fact.
Ashish
Bangalore, India

A:  The bridge table looks a lot like a fact table, but it is used very differently.

While we avoid joining a fact table to another fact table, we seek to join bridge tables to fact tables. This can have implications for BI tools that generate SQL.

Similarities

A bridge table appears similar to a fact table because it contains multiple foreign keys. This is most evident when you look at an attribute bridge table, which links a dimension to an outrigger. It consists solely of foreign key references to the dimension and the outrigger.

In this respect, the bridge table is very similar to a factless fact table.  Indeed, one might make the argument that a bridge relates a set of dimensions in much the same way that a factless fact table describes conditions.

But there is a very important difference: we never join a fact table to another fact table. Bridges, on the other hand, are intended to be joined with fact tables.

We do not join fact tables to fact tables

You should never join two or more fact tables--either directly or indirectly via shared dimensions. Fact values will repeat if multiple rows in either fact table share the same dimensionality. We receive a Cartesian product of all related facts.  The result is double-counting, or worse. 

Instead of joining fact tables, we use a technique called drilling across. Facts are collected from each table and aggregated to common level of detail, then merged into a single result set. I wrote about this process earlier this year.1

Many BI tools that generate SQL are able to identify fact tables and automatically invoke drill across logic when required.

We do join bridge tables to fact tables

Bridge tables represent groups. We create them so that a single fact can be associated with a group of values (such as multiple salespeople) rather than a single value.

When we use a bridge table, we link it to other fact tables.2  By doing this, a single fact in the fact table associates with multiple rows in the bridge table.   

With a bridge, we are exploiting the very Cartesian product that we normally seek to avoid.

We are intentionally repeating a single fact for multiple group members.  To avoid incorrect results, it behooves us to group results by member, or to constrain for a single group member.

Even if you were to replace a bridge with a factless fact table, this is the behavior you would desire. Rather than drill across, you would link it to other fact tables, in order to associate the bridged values with various facts. Cartesian products and the danger of double counting would remain.

Bridge tables and BI software

A bridge table is not really a fact table. It is not the locus of process measurement.  It describes neither activities nor conditions. It is merely a construct that allows us to deal with repeating values. It is meant to be linked with fact tables, and used with care.

Because a bridge is composed of foreign keys, however, some BI tools may identify it as a fact table. If your tool does this, you will need to prevent it from invoking drill-across logic for queries that involve a bridge.

Your tool may have a facility for this. If it does not, you can hide the bridge by joining it to dimensions within a view.

More info

Thanks to Ashish for the comments. If you have a question about bridge tables, send it to the address on the sidebar.

You can learn more in my book, Star Schema: The Complete Reference.  Two full chapters are dedicated to bridge tables, including 30 diagrams.

See also:
1Factless fact tables which describe conditions are not joined to other fact tables either. When they are compared to other fact tables, we typically use set operators or subqueries.

2This may happen directly, in the case of a dimension bridge, or indirectly, in the case of an attribute bridge.


Image by ahisgett licensed under Creative Commons 2.0