Friday, December 23, 2011

Three ways to drill across

Previous posts have explored the importance of conformed dimensions and drilling across.  This post looks at the process of drilling across in more detail. 

There are three primary methods for drilling across.  These methods can be applied manually, automated with BI software, or performed during the data integration process.

Querying multiple stars

Conformed dimensions ensure compatibility of information in various stars and data marts; drilling across is the process of bringing it together.  (For a refresher, see this post.)
 
For example, suppose we wish to compute "Return Rate" by product.  Return rate is the ratio of returns to shipments.

Information about shipments and returns is captured in two granular star schemas:
  • shipment_facts captures shipment metrics by Salesperson, Customer, Product, Proposal, Contract, Shipment, Shipper, and Shipment Date
  • return_facts captures return metrics by Salesperson, Customer, Product, Contract, Reason and Return Date
Reporting on return rate will require comparing facts from each of these stars. 

Drilling across

Recall that fetching facts from more than one fact table requires careful construction of queries. It is not appropriate to join two fact tables together, nor to link them via shared dimensions. Doing so will double-count facts, triple-count them, or worse.

Instead, the process must be completed in two phases.
  • Phase 1:  Fetch facts from each fact table separately, aggregating them to a common level of detail
  • Phase 2:  Merge these intermediate result sets together based on their common dimensions
In practice, there are several ways that this task can be performed.

Method 1: Issue two queries, then merge the results

The first method for drilling across completes phase 1 on the database, and phase 2 in the application (or on the application server).
  1. Construct two separate queries: the sum of quantity shipped by product, and the sum of quantity returned by product.
  2. Take the two result sets as returned by the DBMS, and merge them based on the common products. Compute the ratio at this point.
While it may seem odd that phase 2 is not performed on the DBMS, note that if the result sets are already sorted, this step is trivial.
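
For example, the two phase 1 queries might look like the following sketch.  (Join conditions and key names are assumptions; the ORDER BY clauses deliver the result sets pre-sorted, so the application can merge them in a single pass.)

-- Phase 1, query 1: shipments by product (a sketch; key names assumed)
SELECT p.product,
       SUM(f.quantity_shipped) AS quantity_shipped
FROM shipment_facts f
JOIN product p ON p.product_key = f.product_key
GROUP BY p.product
ORDER BY p.product;

-- Phase 1, query 2: returns by product
SELECT p.product,
       SUM(f.quantity_returned) AS quantity_returned
FROM return_facts f
JOIN product p ON p.product_key = f.product_key
GROUP BY p.product
ORDER BY p.product;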

Method 2:  Build temp tables, then join them

The second method performs both phases on the DBMS, making use of temporary tables.
  1. Construct two SQL statements that create temporary tables: the sum of quantity shipped by product, and the sum of quantity returned by product
  2. When these are completed, issue a query that performs a full outer join of these tables on the product names and computes the ratio.
Be sure that the temporary tables are cleaned up.
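
A sketch of the SQL involved follows.  Temporary table syntax varies by DBMS, and the key names are assumptions.

-- Phase 1: build one temporary table per fact table
CREATE TEMPORARY TABLE tmp_shipments AS
SELECT p.product, SUM(f.quantity_shipped) AS quantity_shipped
FROM shipment_facts f
JOIN product p ON p.product_key = f.product_key
GROUP BY p.product;

CREATE TEMPORARY TABLE tmp_returns AS
SELECT p.product, SUM(f.quantity_returned) AS quantity_returned
FROM return_facts f
JOIN product p ON p.product_key = f.product_key
GROUP BY p.product;

-- Phase 2: full outer join the temporary tables and compute the ratio
-- (cast as needed to avoid integer division)
SELECT COALESCE(s.product, r.product) AS product,
       r.quantity_returned / s.quantity_shipped AS return_rate
FROM tmp_shipments s
FULL OUTER JOIN tmp_returns r ON s.product = r.product;

-- Clean up
DROP TABLE tmp_shipments;
DROP TABLE tmp_returns;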

Method 3: Join subqueries

Like the previous method, this method performs all the work on the DBMS.  In this case, however, a single query does all the work.

Queries for each fact table are written, then joined together in the FROM clause of a master query.  For example:


SELECT
  COALESCE (shp.product, rtn.product) as Product,
  quantity_returned / quantity_shipped as ReturnRate
FROM
  ( SELECT product, sum(quantity_shipped) as quantity_shipped
    FROM shipment_facts, product
    WHERE .....
  ) shp
FULL OUTER JOIN
  ( SELECT product, sum(quantity_returned) as quantity_returned
    FROM return_facts, product
    WHERE....
  ) rtn
 ON
    shp.product = rtn.product

   
The two subqueries in the FROM clause represent phase 1.  Phase 2 is represented by the main SELECT query that joins them and computes the ratio.

Applying these techniques

These techniques may be applied in a variety of ways:
  1. Report developers may write their own queries using one or more of these methods
  2. You may have BI software that can automate drilling across using one or more of these methods
  3. Drilling across may be performed at ETL time using one of these methods (or an incremental variant)
In the latter case, the ETL process builds a new star (or cube) that contains the result of drilling across. This is called a derived schema, or second line data mart.

Learn More

For more information, see the following resources:


Many pages are devoted to this topic in my books. In the latest one, Star Schema: The Complete Reference, the following chapters may be of interest:
  • Chapter 5, "Conformed Dimensions" discusses these techniques in greater detail.  
  • Chapter 14, "Derived Schemas" looks at special considerations when creating derived stars that pre-compute drill-across comparisons.
  • Chapter 16, "Design and Business Intelligence", discusses how to work with SQL-generating BI software.
More to come on this topic in the future.  If you have questions, send them in.

-Chris



Tuesday, November 15, 2011

Conformed Dimensions

This second post on conformed dimensions explores different ways in which dimensions can conform. 
There are several flavors of conformed dimensions. Dimensions may be identical, or may share a subset of attributes that conform.

Conformance basics

Conformed dimensions are central to the discipline of dimensional modeling.  The basics of conformance were introduced in a post from earlier this year.  In a nutshell:


  • Measurements of discrete business processes are captured in individual star schemas (e.g. proposals and orders)
  • Some powerful business metrics combine information from multiple processes.  (e.g. fill rate: the ratio of orders to proposals)
  • We construct these metrics through a process called drilling across
  • Drilling across requires dimensions with the same structure and content (e.g. proposals and orders have a common customer dimension)

For a refresher on these concepts, see "Multiple stars and conformed dimensions" (8/15/2011). 

Physically shared dimensions not required 

When two fact tables share the same dimension table, their conformance is a given. Since the shared dimensions are the same table, we know they will support drilling across.

For example, stars for proposals and orders may share the customer dimension table.  This makes it possible to query proposals by customer and orders by customer, and then merge the results together.

But this process of drilling across does not require shared dimension tables. It works equally well if proposals and orders are in separate data marts in separate databases.

As long as the stars each include dimensions that share the same structure (e.g. a column called customer_name) and content (i.e. the customer values are the same), it will be possible to merge information from the stars. 

Levels of conformance 

It is easy to take this a step further.  We can also observe that there is compatibility between dimensions that are not identical.

If a subset of attributes from two dimensions share the same structure and content, they form a sort of “lowest common denominator” across which we can compare data from the stars.

For example, suppose we establish budgets at the monthly level, and track spending at the daily level.  Clearly, days roll up to months.  If designed correctly, it should be possible to compare data from budget and spending stars by month.

The picture below illustrates the conformance of a MONTH and DAY table graphically.  The ring highlights the shared attributes; any of these can be used as the basis for comparing facts in associated fact tables.

In this case, the two conformed dimensions participate in a natural hierarchy.  Months summarize days. The month table is referred to as a “conformed roll-up” of day.

To successfully drill across, the content of the shared attributes must also be the same.  Instances of month names, for example, must be identical in each table -- "January" and "January" conform; "January" and "JAN." do not.

To guarantee conformance of content, the source of the rollup should be the base dimension table. This also simplifies the ETL process, since it need not reach back to the source a second time. 
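
For example, the roll-up might be derived with a statement like this (a sketch; the attribute names are assumptions, and surrogate key assignment is omitted for brevity):

-- Derive the MONTH roll-up directly from the base DAY dimension.
-- Sourcing from DAY guarantees the shared columns have identical content.
CREATE TABLE month AS
SELECT DISTINCT month_name, quarter, year
FROM day;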

Other kinds of conformance 

Identical tables and conformed roll-ups are the most common kinds of conformed dimensions.  Other kinds are less common. 

Degenerate dimensions (dimensions that appear in a fact table) may also conform. This is particularly useful with transaction identifiers. 

Overlapping dimensions may share a subset of attributes, but not participate in a hierarchy. This is most common with geographical data. 

More Info 

For more information, see the earlier post "Multiple stars and conformed dimensions" (8/15/2011).
I also write about conformed dimensions extensively in Star Schema, The Complete Reference.
If you enjoy this blog, you can help support it by picking up a copy!

Photo by Agnes Periapse, licensed under Creative Commons 2.0

Monday, October 17, 2011

Creating transaction identifiers for fact tables

Sometimes, the source data for a fact table does not include a transaction identifier. When this happens, we often create our own. A reader asks if this goes against best practices:

Q: You said that we should not create surrogate keys in fact tables. But we have a source system that does not provide us with identifiers for phone calls. When we get the log files, we create surrogate keys for each call before loading them into the star.

A: This is a common practice, and it is just fine.

You are not creating what dimensional modelers call surrogate keys. You are creating transaction identifiers.  These will be helpful as part of the ETL and QA processes.

Surrogate keys vs. transaction identifiers

Surrogate keys should not be confused with transaction identifiers.

As I wrote in the post referenced in the question, a surrogate key is an attribute that is created to uniquely identify rows in dimension tables. It does not come from a source system; it is created expressly for the dimensional schema.

A transaction identifier is a dimension attribute in a fact table that defines its grain, or at least helps to do so. (If you need a refresher on grain, see this post.)

For example, PAYROLL_FACTS may have the grain "one row per pay stub."  Each pay stub has an identifying number in the source system, which is carried into the star as a degenerate dimension. This attribute is a transaction identifier.1

When a transaction identifier defines grain, the DBA may define it as "the primary key of the fact table." Likewise, an ETL developer or QA analyst may use it as a sort of key to compare the star to operational data or staging data. 

From the dimensional perspective, however, these transaction identifiers are dimensions.  They can be used to filter queries, group or sort the results, and so forth. They will not appear in other tables as foreign keys that reference the fact table.

Creating transaction identifiers

When operational systems do not have good transaction identifiers, we often "cook one up" during the ETL process. They are not surrogate keys in the dimensional modeling sense, though the term might sound appropriate. They are transaction identifiers.

For example, suppose you are designing a star that records phone calls made from your call center. You want the grain of the fact table to be "one row per phone call."  The data will come from your call switch.  As you've mentioned, it creates a log file, but does not provide an identifier for each call.

In your design, you may define your grain: "one row per date, time, originating number and number dialed." This works just fine, but may be difficult for ETL developers to manage.  Instead, they assign a unique identifier to each call in the log. You can use this identifier to define the grain of the fact table.
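
For example, the ETL process might assign the identifier with a window function over the staged log (a sketch; the staging table and its columns are hypothetical).  In practice, a persistent sequence or an offset from the prior load would be added to keep identifiers unique across loads.

-- Assign a call identifier to each staged log row
SELECT ROW_NUMBER() OVER (
           ORDER BY call_date, call_time, origin_number, dialed_number
       ) AS call_id,
       call_date,
       call_time,
       origin_number,
       dialed_number
FROM call_log_stage;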

Carried into the fact table, this attribute aids in the data integration and quality assurance process. It can be used to quickly crosscheck the star against the log file.

However, for analytic purposes, it is not used as a key.  It is a dimension attribute -- a "cooked up" transaction identifier.

Other concerns

When you are forced to create your own transaction identifiers, you will have other things to worry about.

First and foremost, you must be sure there is a mechanism to prevent loading of duplicate facts. You will need to understand how the source system generates its log files, to be sure that you extract data that is complete, accurate, and not subject to revision.

Also, note that the artificial identifiers may convey a false sense that you can reach all the way back to the source to find the original transaction. While it may be possible, your identifier will not be the mechanism.

Thanks for the question, and keep them coming.

-Chris

More Info

Check out my book, Star Schema: The Complete Reference.  I discuss surrogate keys in Chapter 1, "Analytic Databases and Dimensional Design."  Degenerate dimensions and transaction identifiers are covered in Chapter 3, "Stars and Cubes."

1Not all degenerate dimensions are transaction identifiers.  For an example that is not a transaction identifier, see my post of October 15, 2010.

Image licensed via Creative Commons 2.0 from Patrick Hoesley

Tuesday, October 4, 2011

Are bridge tables really fact tables?

A reader observes that bridge tables seem problematic, and wonders if they should just be replaced by factless fact tables.

Q:  I am wondering if all bridge tables are in fact replacements for factless fact tables. The problem with the bridge table as you mention it is that...[you] need to do an expensive join, and there are issues with Cartesian joins/double counting etc. So the question is whether a bridge table is a practical option as compared to a separate fact.
Ashish
Bangalore, India

A:  The bridge table looks a lot like a fact table, but it is used very differently.

While we avoid joining a fact table to another fact table, we seek to join bridge tables to fact tables. This can have implications for BI tools that generate SQL.

Similarities

A bridge table appears similar to a fact table because it contains multiple foreign keys. This is most evident when you look at an attribute bridge table, which links a dimension to an outrigger. It consists solely of foreign key references to the dimension and the outrigger.

In this respect, the bridge table is very similar to a factless fact table.  Indeed, one might make the argument that a bridge relates a set of dimensions in much the same way that a factless fact table describes conditions.

But there is a very important difference: we never join a fact table to another fact table. Bridges, on the other hand, are intended to be joined with fact tables.

We do not join fact tables to fact tables

You should never join two or more fact tables--either directly or indirectly via shared dimensions. Fact values will repeat if multiple rows in either fact table share the same dimensionality. We receive a Cartesian product of all related facts.  The result is double-counting, or worse. 

Instead of joining fact tables, we use a technique called drilling across. Facts are collected from each table and aggregated to a common level of detail, then merged into a single result set. I wrote about this process earlier this year.1

Many BI tools that generate SQL are able to identify fact tables and automatically invoke drill across logic when required.

We do join bridge tables to fact tables

Bridge tables represent groups. We create them so that a single fact can be associated with a group of values (such as multiple salespeople) rather than a single value.

When we use a bridge table, we link it to other fact tables.2  By doing this, a single fact in the fact table associates with multiple rows in the bridge table.   

With a bridge, we are exploiting the very Cartesian product that we normally seek to avoid.

We are intentionally repeating a single fact for multiple group members.  To avoid incorrect results, it behooves us to group results by member, or to constrain for a single group member.
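
For example, a report of order dollars by salesperson might look like this (a sketch, assuming the key names from the earlier bridge example).  Each order's dollars repeat once per collaborating salesperson; grouping by salesperson makes that repetition the intended behavior, giving each member full credit.

-- Order dollars by salesperson via the bridge
SELECT s.salesperson_name,
       SUM(f.order_dollars) AS order_dollars
FROM order_facts f
JOIN salesperson_bridge b ON b.sales_group_key = f.sales_group_key
JOIN salesperson s ON s.salesperson_key = b.salesperson_key
GROUP BY s.salesperson_name;

Without the GROUP BY (or a constraint on a single salesperson), the same repetition would silently inflate a grand total.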

Even if you were to replace a bridge with a factless fact table, this is the behavior you would desire. Rather than drill across, you would link it to other fact tables, in order to associate the bridged values with various facts. Cartesian products and the danger of double counting would remain.

Bridge tables and BI software

A bridge table is not really a fact table. It is not the locus of process measurement.  It describes neither activities nor conditions. It is merely a construct that allows us to deal with repeating values. It is meant to be linked with fact tables, and used with care.

Because a bridge is composed of foreign keys, however, some BI tools may identify it as a fact table. If your tool does this, you will need to prevent it from invoking drill-across logic for queries that involve a bridge.

Your tool may have a facility for this. If it does not, you can hide the bridge by joining it to dimensions within a view.

More info

Thanks to Ashish for the comments. If you have a question about bridge tables, send it to the address on the sidebar.

You can learn more in my book, Star Schema: The Complete Reference.  Two full chapters are dedicated to bridge tables, including 30 diagrams.

1Factless fact tables which describe conditions are not joined to other fact tables either. When they are compared to other fact tables, we typically use set operators or subqueries.

2This may happen directly, in the case of a dimension bridge, or indirectly, in the case of an attribute bridge.


Image by ahisgett licensed under Creative Commons 2.0

Wednesday, September 21, 2011

Avoid Surrogate Keys for Fact Tables


I am often asked for feedback on designs that include a "surrogate key for the fact table."  There are two common reasons for which they are proposed; both have better alternatives. 

Surrogate keys are for dimension tables

A surrogate key is an attribute that is created to uniquely identify rows in a dimension table. It does not come from a source system; it is created expressly for the dimensional schema.

Surrogate keys for dimension tables serve two important purposes:
  1. They make it easier to track history.  They allow the dimension to capture changes, even if the source system does not track them.  Absent a surrogate key, this would be difficult; the primary key of the dimension would be a concatenation of natural keys and type 2 attributes.
  2. They make it easy to join to the dimension.  The dimensions' surrogate keys appear in fact tables as foreign keys.  They allow the fact to be joined to the appropriate dimension values, without having to use a multi-part key.
A fact table does not require a manufactured data element for either of these purposes.

Tracking change history of facts?  Use a log.

When the value of a fact can change, a surrogate key for the fact table might be proposed. This would theoretically allow the fact table to record the change history of facts, in the same way that a dimension table does.

Unfortunately, storing the history of facts in this manner destroys the usability of the star. The additive properties of the facts are lost, and it becomes much harder to use.

If the facts can change, the fact table should be updated.

To track the history of facts, use an audit table as described in a previous post.  This table can log historic values, or can store the changes as "deltas" in a dimensional format.

Joining to other fact tables?  Drill across.

The other reason surrogate keys are proposed for fact tables is that it will make them "easier to join." 

Joining fact tables to dimensions is easy; the fact table already contains foreign keys that reference the surrogate keys in dimension tables.  But what about joining to other fact tables?

Because they contain facts meant to be aggregated, fact tables should never be joined to one another. Otherwise, double-counting will ensue.

Facts from multiple stars should be combined by drilling across, as described in a previous post.  Don't try to merge them by joining fact tables.

If you're not after facts with this proposed fact-to-fact join, you must be after dimension values.  If this is the case, carry forward any/all useful dimensions to other stars in the value chain.

More info

To learn more, see the earlier posts on surrogate keys, on tracking the change history of facts, and on drilling across.
If you find this blog helpful, please consider picking up a copy of my book, Star Schema: The Complete Reference.


Thursday, September 15, 2011

Factless Fact Tables

This post introduces the concept of the factless fact table and the situations where you would model one.

When a fact table does not contain any facts, it is called a factless fact table. There are two types of factless fact tables: those that describe events, and those that describe conditions. Both may play important roles in your dimensional models. 

Factless fact tables for events

Sometimes there seem to be no facts associated with an important business process. Events or activities occur that you wish to track, but you find no measurements. In situations like this, build a standard transaction-grained fact table that contains no facts.

For example, you may be tracking contact events with customers. How often and why we have contact with each customer may be an important factor in focusing retention efforts, targeting marketing campaigns, and so forth.

The factless fact table shown here captures a row each time contact with a customer occurs.  Dimensions represent the date and time of each contact, the customer contacted, and the type of communication (e.g. inbound phone call, outbound phone call, robo-call, email campaign, etc.).

While there are no facts, this kind of star schema is indeed measuring something: the occurrence of events. In this case, the event is a contact. Since the events correspond to the grain of the fact table, a fact is not required; users can simply count rows:

SELECT
    customer_name,
    count(contact_type_key) as "CONTACT COUNT"
  FROM
    contact_facts...

  CUSTOMER_   CONTACT
  NAME        COUNT
  =========== =======
  BURNS, K          8
  HANLEY, S        11
  ROGERS, S         4
  SCANLON, C        8
  SMITH, B          8
  SMITH, M.E.      12


This fragment of SQL reveals that there really is a fact here: the number of events.  It's not necessary to store it, because it coincides with the grain of the fact table.

To acknowledge this, you can optionally add a fact called "contact_count"; it will always contain the constant value 1.

Factless fact tables for conditions

Factless fact tables are also used to model conditions or other important relationships among dimensions. In these cases, there are no clear transactions or events, but the factless fact table will enable useful analysis.

For example, an investment bank assigns a broker to each customer. Since customers may be inactive for periods of time, this relationship may not be visible in transaction-grained fact tables. A factless fact table tracks this important relationship:


Each row in this factless fact table represents a bounded time period during which a broker was assigned to a particular customer.  This kind of factless fact table is used to track conditions, coverage or eligibility.  In Kimball terminology, it is called a "coverage table."

Comparing conditions with events yields interesting business scenarios.  This factless fact table can be compared to one that tracks investment transactions to find brokers who are not interacting with their customers, brokers who conducted transactions with accounts that belong to a different broker, etc.
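
For example, finding brokers who are not interacting with their customers might use a subquery like this (a sketch; the table and column names are assumptions, with plain dates shown instead of day keys for readability):

-- Broker/customer assignments with no transactions during the assignment period
SELECT a.broker_key,
       a.customer_key
FROM broker_assignment_facts a
WHERE NOT EXISTS (
    SELECT 1
    FROM investment_transaction_facts t
    WHERE t.customer_key = a.customer_key
      AND t.transaction_date BETWEEN a.assignment_start_date
                                 AND a.assignment_end_date
);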

More on factless fact tables

If you've got questions about factless fact tables, send them to the address in the sidebar of this blog.

You can also read more about factless fact tables in Star Schema: The Complete Reference.  Chapter 12 provides detailed coverage of factless fact table design.

Monday, August 29, 2011

Slowly Changing Facts?

This post looks at the issues faced when facts can change retroactively.
Q: Our source system allows facts to be updated.  We are trying to come up with a way to preserve this history in the data warehouse.  One idea is to use the type 2 technique for facts.  We call it "slowly changing facts."  Is this a good solution?
A: When transactions are changed retroactively, you should update corresponding facts. If you need to track the history of facts, do so in separate audit tables.

To understand why, we will look at what happens if you record multiple "versions" of a fact in your fact tables. But first, a refresher on the type 2 slow change technique.

Slowly changing dimensions

Dimensional modelers must decide what will happen when the source data for a dimension attribute changes. There are several possible responses to a change; one of them is known as the "Type 2" response.

The type 2 response calls for keeping a sort of "version history" in the dimension table. For example, if a customer's address changes, we insert a second record into the dimension for the same customer.  This way, the old "version" of the customer can be linked to facts that took place prior to the change, and the new version can be linked to new facts.

I covered the basics in more detail in a previous post; you can also refer to the slowly changing dimensions category for more posts on the topic.

Keeping the history of facts

A fact table contains a record of things that happened.  A customer ordered 200 widgets yesterday; another ordered 300 today. Measurement of these events can be aggregated across various dimensions to answer a variety of questions.

But sometimes, what we know about an event changes.  We originally thought 200 widgets were ordered yesterday, but now we know that 250 were ordered.

At first glance, it may seem like the same "type 2" technique will solve the problem.  When details about a specific transaction are changed, simply insert a new row into the fact table.  The most current version can be flagged for easy access.

Unfortunately, this seriously harms the usability of the fact table. When people study orders, they must now take care to isolate a single version of each transaction to avoid double counting. You can try to resolve this by adding multiple date-stamps to each record, or converting changes into "deltas", but the damage is done. The model is open to serious misuse.

We'll look at the solution in a moment, but first a look at the same problem in a snapshot model.

Snapshots exhibit the same problems

Another kind of fact table, the periodic snapshot, tracks status at predefined intervals.  A financial data mart, for example, might log the balance of each account in the general ledger at the close of business each day.

Here too, our understanding of what happened may change.  At the close of business Monday, a particular account contains US$ 12,000.  On Tuesday, back-dated transactions change that balance to $13,000.

If we try to track both versions of Monday's balance, usability is damaged in the same way.  We must always qualify our analysis for (1) the date we are studying, and (2) the "as of" date of our analysis.  If we don't, grave errors ensue.

Here, usability problems are compounded by density issues.  Tracking the daily balance of each account over 5 years would normally require 1,826 rows per account (365 days x 5 years = 1,825 days, plus 1 leap day).  If we must also re-record balances for all prior days, the number grows to an astounding 1,668,051.  (This is a triangular number, computed as (n² + n) / 2, where n is the number of days.)

If there are 500 accounts in the general ledger, that's the difference between 913,000 rows in a standard periodic snapshot, and 834,025,500 for one that logs history.  As each additional day is added, the difference increases.

The alternative 

There is only one version of the truth, and this is what belongs in the fact table.

When our understanding about an event has changed, the corresponding facts should be updated. This holds for both transaction models and snapshot models.

If you must be able to produce an audit trail, do so in a separate table.  The audit table need only capture the transaction identifier, the date, and the value of each fact. Record rows for the original fact and each revision.
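
A minimal sketch of such an audit table follows, assuming an orders star with an order line identifier and a single fact; the table and column names are illustrative.

-- One row per version of each fact: the original value and each revision
CREATE TABLE order_facts_audit (
    order_line_id     VARCHAR(20) NOT NULL,  -- transaction identifier
    revision_date     DATE        NOT NULL,  -- when this value was recorded
    quantity_ordered  INTEGER     NOT NULL,  -- value of the fact as of that date
    PRIMARY KEY (order_line_id, revision_date)
);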


If need be, this can also be organized dimensionally as a fact table. Keep in mind, however, that queries against this fact table will be subject to the same usability challenges.  And ask the business if all this trouble is really necessary.

For a snapshot model, an audit table may not be necessary; a corresponding transaction table may be able to hold the history. Each transaction that affects status is recorded with two dates: the date of the transaction, and the date we became aware of the information (often a "posted" date).

Monday, August 15, 2011

Multiple Stars and Conformed Dimensions

The concept of conformed dimensions is central to the discipline of dimensional modeling. This post introduces conformed dimensions; future posts will explore them in more detail. 

In your data warehouse, each star corresponds to a business process. Combining facts from different processes can produce powerful compound metrics. The key to making this work is a set of conformed dimensions.

Conformed dimensions are closely associated with Kimball's "Bus Architecture," but are crucial in any scenario that involves dimensional data.

Multiple Stars

In a dimensional design, each star collects measurements that describe a discrete business process.

If we have two measurements that describe different processes, we place them into separate fact tables.

For example, a sales data mart may contain multiple stars:
  • Proposal information by Salesperson, Prospect, Product, Proposal and Proposal Date
  • Order information by Salesperson, Customer, Product, Proposal, Contract and Order Date
  • Shipping information by Salesperson, Customer, Product, Proposal, Contract, Shipment, Shipper, and Shipment Date
  • Return information by Salesperson, Customer, Product, Contract, Reason and Return Date
In each of these stars, the fact table will record measurements that describe the processes of issuing proposals, taking orders, shipping product and handling returns.

By recording measurements of each process in a different star, we are able to capture information at the most detailed level possible.  We can study each of these processes, complete with attendant details, by accessing the appropriate star.

Cross-process Metrics

Some of the most powerful measurements actually combine information from multiple processes. These metrics require combining facts from different stars.

In the sales data mart, the ratio of orders to proposals is the "close rate," a powerful indicator that sales managers and executives look at on a regular basis.

Similarly, the ratio of returns to shipments is the "return rate," an essential quality control metric.

Drilling Across

When we compare facts from different stars, we don't simply join the fact tables.  To do so might cause double-counting of some facts.

Instead we follow a process that Kimball calls drilling across.  The drill-across process can be broken down into two phases.  In the first phase, each star is queried, and results are aggregated to a common level of detail.  In the second phase, these result sets are merged based on their common dimensions.

For example, to compute the return rate by product for August of 2011, we do the following:
  1. a. Figure out quantity shipped by product for August 2011
    b. Figure out quantity returned by product for August 2011
  2. Merge these amounts based on the common product names and compute the ratio
We may perform this drill across operation at query time (many BI tools can do this automatically), or we may do it at ETL time, storing the results in a separate star or cube (sometimes called a second-line data mart.) 

Conformed Dimensions

The key to making all this work is the organization of the dimensions.  As you saw in the example above, we used the dimension values to link our results together: product names were used to merge together shipment and return quantities and compute their ratio.  This would not have worked if the product dimensions for Shipments and Returns had been different.

This is the basic idea behind conformed dimensions.  We say that two dimensions conform if they have the same structure and content.  Both of our stars had a product dimension table with a product name attribute, and the product names were specified the same way in each.  Sharing a single physical table is one way to ensure conformance, but it is not required.

Two dimensions can also conform if one has a subset of the other's attributes.  As long as the common attributes have the same structure and content, they are said to conform.

Planning Conformance

By planning a set of conformed dimensions, we ensure that fact tables can be used to compare processes.  This is important within a single data mart, such as the one above, and it is also important when looking across multiple data marts.

Conformed dimensions are the organizing principle in Kimball's architecture.  Conformed dimensions are planned up-front, as a part of a project that establishes a data architecture based on dimensional design. Implementation proceeds once this conformance bus has been planned.

The concept is also important in other architectures. For example, the sales data mart discussed above might be part of a Corporate Information Factory architecture.  Within this data mart, conformance guarantees we can compare shipments to returns, proposals to orders, and so forth.

More to come

In the coming weeks, I will post more about conformed dimensions.  We will look at "levels" of conformance, how to document conformed dimensions, and how different tools work with conformed dimensions. If you have my book, I also encourage you to read Chapters 4 and 5, which look at these concepts in detail.

Photo by Agnes Periapse, licensed under Creative Commons 2.0

Tuesday, July 5, 2011

Dimensional Modelers Do Not Focus on Logical vs. Physical

The separation of logical and physical models is useful in entity-relationship modeling. Not so in the world of dimensional models, where final table structures are strikingly similar to business models.

Instead, a dimensional model is best described in three levels of increasing detail.

Entity-Relationship Modeling and Dimensional Modeling

It is well known that students of dimensional modeling have an easier time if they have not had significant exposure to entity-relationship modeling (or ER modeling).

Entity-relationship modelers are tempted to normalize data, design relationships between dimension tables, and so forth. These steps make perfect sense in the world of OLTP systems design, but not for analytic solutions.

Similarly, when scholars of entity-relationship modeling turn their focus to the dimensional world, another kind of mistake can occur.

Given that it is "all modeling," one may be tempted to derive a generic framework from ER modeling, and then apply it to dimensional modeling. This mistake is often made by software vendors, and sometimes by academics.

ER modeling and dimensional modeling are both forms of data modeling, but it is not necessary to have a single framework to explain both.

Different Disciplines, Different Frameworks

The traditional distinction between logical and physical model is useful in ER modeling, but not in dimensional modeling. In the world of ER modeling, a logical model captures data requirements without assuming particular data storage technologies. It captures business entities and their attributes and relationships.

The logical model is the basis for a physical model, which reflects the requirements of relational storage in tables and columns. For example, many-to-many relationships are resolved, data may be restructured for performance reasons, etc.

This distinction has significantly reduced utility in the dimensional world. A physical dimensional model stores data in a format that is very similar to how it is understood by the business. Aside from the addition of keys, there is very little difference between logical and physical.

The distinction is not useless, but it plays a much smaller part.  So small, in fact, that you will not find it the focus of most books on dimensional modeling. My own book on dimensional modeling, for example, makes first mention of the term "logical model" in the last two paragraphs of the last chapter.  (See the end of this post for a link.)

Dimensional Models: Three Levels of Detail

Instead of having multiple kinds of model, a dimensional model is best understood at three levels of increasing detail.  Like zooming in on a Google map, each reveals additional information about the model. Each level of detail has different uses for different audiences.
  1. Business Requirements
    Requirements are grouped by subject area, correspond to business processes, state measurement requirements in terms of facts and dimensions, and cross reference common dimensions.

    These business requirements clearly convey scope in business terms.  They link directly to the next level of detail in the model, which exposes the concept of table.

  2. High Level Design
    This level of the model defines the same requirements in terms of fact tables and dimension tables, natural keys and surrogate keys, and exposes major attributes of significance.

    At this level of detail, we do not record every column of every table, or even assign data types.  But we do draw table diagrams, and rigorously define several important design elements such as grain, additivity, and slow change requirements.

    This level is useful for design reviews, educating users and developers, and describing project activities.

  3. Detailed Design
    At the lowest level of detail, we expose every column of every table, define data types, provide definitions and sample data, map everything back to source data, and document transformation rules.

    This level of detail is useful for database administrators and ETL architects. It also contains metadata that will be useful for BI developers and end-users.

One Model, Three Levels of Detail

Don't make the assumption that these three levels of the dimensional model are developed in sequence. It is possible to do so, but not always the case.

Most dimensional designers do the bulk of their work at the second level - the high level dimensional design.

As the model nears completion, they will summarize it at a business level, and then develop the detailed model. In other cases, modelers may work through these levels in sequence.

Regardless of the process, in the end,  these three levels all describe the same dimensional model. Like levels of a Google map, they do so at different levels of detail. All should be accounted for.


For more information on what information to capture at each of these levels, see Star Schema: The Complete Reference.  Detailed descriptions and examples can be found in Chapter 18, "How To Design And Document A Dimensional Model."

Image licensed via Creative Commons 2.0 from Patrick Hoesley

Tuesday, June 21, 2011

Allocation Factors Are Not Magic Bullets

Reader interest in bridge tables continues.  Today, I'll address a topic that was raised in the comments: the use of allocation factors.

Allocation factors are often seen as the cure-all remedy to avoid double counting.  Unfortunately, they are not always realistic. And when they are possible, there is often a better solution.

The back-story: double counting

A dimension bridge allows a single fact row to be linked to two or more rows in a dimension.  In a previous post, I showed how this technique allowed two or more salespeople to be linked to a row in an orders fact table.

As I pointed out, the danger is that queries may accidentally count the order dollars repeatedly -- once for each associated salesperson.  This is the danger of double-counting.

The potential solution: allocation

An allocation factor can sometimes be used to avoid this danger. To the bridge, we add a column that indicates the "share" of the facts associated with each group member.  In queries, before we aggregate facts, we apply this allocation factor.

For example, we can add an allocation factor to the salesperson group bridge: 

If two members collaborate on a sale, and split the credit 75/25, there will be two rows in the bridge for their group.  One would contain the group key, the first salesperson key, and a .75 allocation factor.  The other would contain the group key, the second salesperson key, and a .25 allocation factor.  (For sample data, see Figure 9-5 of my book.)

Now, when we connect the bridge to the fact table, we can apply the allocation factor:

select sum(order_facts.order_dollars * salesperson_bridge.allocation)
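
In context, a fuller version of the query might look like this (a sketch, assuming the key names used earlier).  Because each group's allocation factors sum to 1, the allocated dollars remain additive across salespeople.

-- Allocated order dollars by salesperson
SELECT s.salesperson_name,
       SUM(f.order_dollars * b.allocation) AS allocated_dollars
FROM order_facts f
JOIN salesperson_bridge b ON b.sales_group_key = f.sales_group_key
JOIN salesperson s ON s.salesperson_key = b.salesperson_key
GROUP BY s.salesperson_name;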

Seems to be the perfect solution, right?  Well, not always....

There must be a business rule

An allocation factor only works where one exists. Do not come up with your own rules -- e.g. "split transactions evenly among group members." This may not reflect how the business studies their key metrics.

Allocation factors are often not part of the business. If an insurance claim is tracked in a fact table, it may not make sense to allocate it against multiple parties. If a fact table represents an inspection event, it may not make sense to allocate it against multiple infractions.

There must be a definitive and consistent business rule that can be used. If there is no business rule, then obviously allocation will not work. Note, too, that they are even less likely to be found in situations that call for an attribute bridge.

There should be a source of record

So you have a business rule?  You are not out of the woods yet.

Next, you need to make sure you have a source of record for the allocation factor.

For example, many systems that support the sale of securities calculate, and split, broker commissions at the time of sale. These systems provide a nice, clean allocation factor.

If you cannot find a source of record, you should think long and hard about using the "business rule." In all likelihood, it will be subject to exceptions, and may even change over time.

For example, your business may tell you transactions can be allocated evenly among group participants. You may choose to "hard-code" this rule into your ETL process. But one day, you will encounter a situation where the business wants to "override" this rule for a particular transaction. Or a new rule is instituted that one particular salesperson always gets at least 40% of the credit, and so on...

Each time the business changes its mind, ETL rework will be necessary. If you don't have a system of record, think long and hard before hard-coding a business rule.

Consider altering fact table grain

Got a business rule AND a source of record?  Fantastic!  You are very lucky, and you may not even need an allocation factor at all!

When you have a clear source for an allocation factor, consider skipping the bridge altogether. Instead, restate the grain of your fact table as allocated transactions.

In the orders example, we can change the grain of the fact table from "one row per order line" to "order lines allocated to salespeople."

Now, if two salespeople collaborate on a sale, there will be two rows in the fact table.  They will be fully additive, with no risk of double counting the facts. No bridge needed!

Note that this works best if the orders are allocated in the source system.  Otherwise, you will wind up with splits that are not even (33/33/34) or fractional quantities that do not total to the same amount as the operational system.

More to Come

My inbox is full of additional questions and comments about bridge tables.  I promise to post some more about bridged designs in the months to come.

In the meantime, I encourage you to check out Star Schema The Complete Reference, which contains two full chapters on bridge tables.  Chapters 9-10 offer a far deeper treatment than I can provide on this blog, with 30 figures, sample instance charts, sample queries and results, etc.

Bridge photo by Branwebs via Creative Commons

Thursday, May 12, 2011

Dates In Dimension Tables

A reader asks about tracking dates in the fact table vs. dimension tables.
Q: It seems to me that it is sometimes more effective to store date keys in the dimension table -- particularly if that date could be used to describe the rows in many different fact tables.  The alternative would be to duplicate the date key in multiple fact tables.

For example, the effective date of an insurance policy could be used against many fact tables.  Should we store the effective date key in each of the fact tables?  Or store it in a single row in the Policy?

The differences in design would be a star schema (fact table) versus a snowflake (stored in the dimension).

- Robert
Montreal
A:  Many models feature at least one major dimension with a date that might be used to qualify facts in more than one star. This happens in other industries, too.  Examples include dimension tables that represent orders, contracts, tests and so forth.

Avoid placing date_key columns in dimension tables.  It increases SQL complexity, reduces schema understandability, and may impact performance.

Some designers may allow limited use of snowflaking, but not in this situation.

Date in Dimension Table, Day Key in Fact Table

When you have a date that (1) clearly describes something represented by a dimension, and (2) will also be used to group facts, do two things:

  1. Place good old-fashioned dates in the dimension table 
  2. In any star where you may want to filter/group/aggregate facts by some aspect of the date in question, add date keys to the fact table
For example, in Robert's case there might be a Policy dimension table with an attribute called policy_effective_date.  This is a date, not a key.

There is also a star that tracks claims. To support analysis of claim information using the policy effective date, the fact table will contain a day_key_policy_effective.  The fact table may contain other day_keys as well (such as day_key_claim, which supports analysis of claim data by the date of the claim.)
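
For example, grouping claim facts by the policy effective date reaches an aliased DAY table through day_key_policy_effective (a sketch; the fact and attribute names are assumptions):

-- DAY plays the role of "policy effective date" via the alias d
SELECT d.year,
       d.month_name,
       SUM(f.claim_amount) AS claim_amount
FROM claim_facts f
JOIN day d ON d.day_key = f.day_key_policy_effective
GROUP BY d.year, d.month_name;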

Judicious Use of Snowflaking?

There are times when some designers might choose a snowflake path -- placing dimension keys into dimension tables. This happens when there is a set of attributes that appear in multiple dimension tables.

This occurs with geographical data.  The characteristics of a location might appear in a customer dimension table, a warehouse dimension table and a department dimension table.

In this case, the concern is that the ETL routines may not consistently process locations across multiple dimension tables. If their attributes or slow change characteristics are not managed identically, inconsistencies may develop across these tables.

Designers reason that creating one location table solves this problem.  Each dimension that contains location data will contain a location_key. This ensures a single consistent representation of locations.

I discuss this technique in Chapter 7 of Star Schema The Complete Reference, and point out that it opens the door for a different kind of ETL challenge.  If location data is moved to an outrigger, each time there is Type 2 slow change to a location, all related dimension table rows must also undergo Type 2 changes.

My preference is to allow the attributes to repeat in multiple tables, and to increase the QA focus on related ETL programs.

Thanks to Robert for the question.  If you have one of your own, send it to the address in the sidebar.

-Chris
Image credit: Gravityx9 licensed under Creative Commons 2.0

Monday, April 25, 2011

Bridge Tables and Many-to-many Relationships

Recent posts about bridge tables generated a lot of interest. Today, I will look at the many-to-many relationships in these solutions. 

These relationships are intended, necessary, and valid parts of database design. Many software tools do not understand this kind of relationship, so I'll also show you how you can eliminate it, if need be.

Many-To-Many

Previously, I described two uses for bridge tables. Both examples involved many-to-many relationships between tables. Many readers may have thought this was an error. In fact, it was intended.

One case involved a situation where facts must relate to multiple dimension members. Specifically, it was possible for two or more salespeople to collaborate on orders captured in a fact table. The bridged solution looked like this:

Notice that there is a many-to-many relationship between the bridge table and the fact table. This can be understood as follows:
  • A given fact may reference multiple rows in the bridge.  This happens when two salespeople collaborate on an order. 
  • A given row in the bridge may reference multiple facts.  This happens when the same group collaborates on more than one order line.
The bridge table allows us to link the repeating dimension to the facts, but as noted in the post, care must be taken not to double count.

A Legal Relationship

In the world of entity-relationship modeling, we are taught that many-to-many relationships may be present in a logical model, but not  in a physical design. But we are not discussing entity-relationship modeling.

In a dimensional model, this kind of relationship is allowed. It is also 100% compatible with the relational model of data.  Despite what you may think, it is also compatible with your relational database management system.

The figure above can be implemented as a physical design, and it will work. Unfortunately, some of our software tools do not understand this. 

Tool Problems

Many of the software products we use in data warehousing were originally developed for transaction processing systems. Because of this lineage, the many-to-many relationship can be a problem.

For example, suppose your DBMS works best if joins between tables are supported by declaration of primary key/foreign key relationships. This is a problem for the model pictured above. Order_facts and salesperson_bridge can be joined using sales_group_key.  But it is not a primary key in either table.

Luckily, this join will work even if it is not supported by any key constraints. Unfortunately, other tools may not be as forgiving.

Some data modeling tools do not permit a many-to-many relationship in a physical model. This presumably reflects their entity-relationship lineage. And if you use a SQL-generating BI tool to create reports, you may find it identifies bridge tables as fact tables.

Eliminating The Many-To-Many

If your software products cannot live with a many-to-many relationship, there is a simple solution. You can resolve it in the same way an entity-relationship modeler would: create an intersect table.

In the example above, a new table is added that contains one row for each group. An association, or intersect table, links the group to the salespeople. The result looks like this:
For those of us who find humor in data modeling, this is a source of much amusement. The sales_group table consists of a single column. And since the sole column is a surrogate key, the table's entire contents are likely to be a sequence of integers!
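
In DDL, the resolved design might look like this (a sketch; the table and column names follow the example, and the data types are assumptions):

-- One row per group; the surrogate key is the table's only column
CREATE TABLE sales_group (
    sales_group_key INTEGER PRIMARY KEY
);

-- The intersect table takes over the role of the bridge, with
-- conventional primary key/foreign key relationships on both sides
CREATE TABLE sales_group_salesperson (
    sales_group_key INTEGER NOT NULL REFERENCES sales_group,
    salesperson_key INTEGER NOT NULL REFERENCES salesperson,
    PRIMARY KEY (sales_group_key, salesperson_key)
);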

Still: problem solved. Only take this additional step if you really must.

More on Bridges

Bridges are powerful but dangerous. To make proper use of them, there is more to understand.  Additional topics include the impact of slowly changing dimensions, the use of allocation factors, how to prevent end users from double counting... the list goes on.

I will post more about bridges, but I encourage you to also read about them in my book, Star Schema The Complete Reference.  The book goes into much more depth on bridge tables than I can do in a blog post - or even a series of posts.  Two full chapters are dedicated to bridge tables.  Check it out!

- Chris


Bridge photo by Branwebs via Creative Commons

Wednesday, April 6, 2011

Linking Expenditures to Commitments [Q&A]

A reader working on a financial data mart has a question about linking commitments (such as purchase orders) to payments:

Q: Payments can be made after the end of the fiscal year in which the original commitments were made.  So you could have some payments in February that apply to the prior fiscal year, as well as payments made in February that apply to commitments made in the current fiscal year.

How would you handle this in the time dimension?  Do we need multiple rows for February 2010 -- one for funds committed in 2010 and another for funds committed in 2009?
J. in New York

A: You don't need to tinker with the time dimension.  The key to solving this problem is to recognize that financial transactions have multiple dates associated with them.

For example, a payment may have an effective date (the date on the check) and a date of entry into the system, which may lag behind a bit. I've covered this concept in a previous post.

This reader wants to link the payment to a commitment, such as a purchase order.  The purchase order may have been issued in the previous year, as in his example.  We can acknowledge this in the design by adding an additional key in the fact table to reference the date of the commitment.

Hence, the payment fact table may have the following foreign key references to a DAY dimension:
  • day_key_effective (the date of the check or payment)
  • day_key_entered  (the date it was entered into the system)
  • day_key_committed (the date of the corresponding commitment)
Now cash flow can be understood properly in time (via day_key_effective) and the expenditures can be rolled up according to the date the funds were committed (using day_key_committed.)

If you wish to do this, it will be important to determine if/how the source system links each expenditure with a commitment.  You will also find that for each payment, you can probably capture two transaction identifiers:  one for the payment itself, and one for the original commitment.  These may also factor into your payments model as dimensions.

- Chris

Image is from Public Domain Photos

Tuesday, March 29, 2011

Q&A: Dimensional vs ER Model for the Data Warehouse

A question about the fundamental data architecture of the data warehouse:

Q: We are embarking on designing a warehouse for our BI efforts.  Someone asked if we should create the warehouse in a dimensional structure, rather than the typical ER structures.

Everything I’ve read and seen has advocated the warehouse be ER modeled at the lowest level of detail appropriate and then for specific mart needs, the data is modeled dimensionally at a grain that is consistent with the needs for that mart, utilizing conformed dimensions.

My question then is, have you seen any instances where a Warehouse (or BI Warehouse) has been designed and built successfully utilizing a non ER structural approach?
Brenda
Iowa

A: Your question deals with the primary difference between two popular data architectures -- those associated with W.H. Inmon and Ralph Kimball.

Two Architectures

Normalized ER models are great for capturing data, but very difficult to use for analytic reporting.  That's why Inmon believes that, although the enterprise data warehouse should be normalized, the data marts should be dimensional. This approach is the one you have read about.

But it is also possible to use a dimensional model as an integrated repository of atomic data. That is the Kimball philosophy -- dimensional all the way. The repository may be centralized (physical) or distributed (logical). In either case, conformed dimensions ensure compatibility across subject areas.

I explain both architectures in my latest book, but I don't argue for one over the other.  This was a conscious choice, because I wanted it to be useful to anyone who uses dimensional data. I take the same approach in the classes I teach.

Dimensional Works Just Fine

The answer to the original question: Yes, I have seen many data warehouses that are successful without an ER layer.  

Keeping data in a dimensional format does not harm the ability to integrate data or to capture detail. (In some cases it reduces the amount of work needed to load all the data structures.) It also allows dimensional constructs to guide scope from requirements to completion, without translation back and forth into an ER format. Secondary dimensional structures ("derived schemas" or "second line data marts") can always be added as required for specific reporting needs. 

Other Approaches Work Too

Keep in mind that other approaches can be successful, as I have discussed before. Most of us won't get to make the choice that Brenda is considering.  If you already have an Inmon-style architecture in place, you should not change it simply for philosophical reasons. 

The motto of this blog is: be pragmatic, not dogmatic.

- Chris


Image licensed via Creative Commons 2.0 from Patrick Hoesley

Book Signing in DC, Date Change for Boston Seminars

Washington DC - Book Signing

I will be signing books next week at the TDWI World Conference in Washington DC.  You will find me in the main exhibit hall on Tuesday 4/5 from 11:30 am to 1:00 pm.

I will also be teaching on Monday and Tuesday, and conducting half-hour one-on-one sessions on Wednesday.

Hope to see you there! 

Boston Seminars - Date Change

The TDWI Boston Seminar will now be taking place June 13-16.  If you have already signed up, you should have received notification from TDWI. 

I will be teaching three courses that week, including an intensive two-day course on advanced dimensional design.  This is a rare public offering for the two-day course, which is usually only offered through TDWI Onsite Education.

Monday, March 7, 2011

Q&A: Bridges are Part of Dimensional Modeling

Two recent posts on bridge tables generated a lot of questions.  Here is the most common:
When you use a bridge table, isn't that a snowflake? I thought this was not allowed in dimensional modeling.
The use of a bridge is a limited form of snowflaking.  It is also an important feature of the dimensional model.

Bridges are widely accepted, even among those who would otherwise avoid snowflake designs.

A Limited Form of Snowflaking

In a Q&A from last year, I suggested that a snowflake occurs when dimensions are joined to something other than a fact table. By this definition, the examples from the previous two posts are certainly snowflakes.

In the first example, a bridge was used to associate a dimension with a repeating attribute (a company with multiple industries).

In the second example, a bridge was used to associate a fact with multiple dimension rows (a sale with multiple salespeople).

In both cases, a standard one-to-many relationship between fact and dimension does not exist.¹ The bridge table solves this problem.

This usage is widely accepted. In fact, the bridge table is a central feature of dimensional modeling. I devote two full chapters to bridged designs in Star Schema: The Complete Reference.

Other Relationships are Not Modeled

When a bridge table is employed, most dimensional modelers will still refer to the schema as "a star." That's because the bridge is only used in the situations described above -- when the standard master-detail relationship between dimension table and fact table breaks down.

Other kinds of relationships between data elements are not resolved via multiple tables.  In our examples:
  • Repeating attributes are permitted (work phone, home phone, mobile phone) 
  • Dimensions contain partially dependent attributes (brand information present in a product table)
  • Master-detail relationships are not instantiated into separate tables
  • Values associated with codes are not placed in look-up tables
  • Dimensions are not linked to one another
An all-out snowflake design would employ principles of normalization to eliminate these features.
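To make the contrast concrete, here is a sketch of a star-style product dimension; all names and types are assumed for illustration. Brand attributes remain in place rather than being normalized into their own table:

CREATE TABLE product (
  product_key    INTEGER,       -- surrogate key, referenced by fact tables
  product_name   VARCHAR(100),
  sku            VARCHAR(20),
  brand_name     VARCHAR(100),  -- depends on brand, not on product...
  brand_code     VARCHAR(10),   -- ...but kept in place, with no look-up table
  brand_manager  VARCHAR(100)
)

A fully snowflaked design would move the brand columns into a separate brand table and replace them with a foreign key.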

Other Times to Snowflake?

There may be some other situations where limited snowflaking is acceptable.  One is where a set of attributes (e.g. geographic) would otherwise appear in multiple dimension tables. Another is where software tools (the RDBMS or BI software) work better in a snowflake environment. 

These uses are a bit more controversial, and I wrote about them in a previous post.

When it comes to the bridge table, however, there is no controversy.  The bridge is an accepted technique in dimensional modeling, and is often necessary for specific dimensional designs.

- Chris

¹ Bridges may also be used to navigate recursive relationships, as I describe in the book.

Image credit: Gravityx9 licensed under Creative Commons 2.0

Wednesday, February 9, 2011

Bridge to Multi-Valued Dimensions

Here's what you can do when a dimension table and a fact table are not in a one-to-many relationship.

A recent post described how a bridge can be used to resolve repeating attributes in a dimension. Today's post looks at a second use:  facts that must link to multiple rows in a dimension.

Facts with Multi-Valued Dimensions

In most star schemas, each fact table row references exactly one row in each associated dimension. This relationship between dimension table and fact table is variously called "one-to-many", "master-detail" or "parent-child". For example, in a star schema that measures the order-taking process, each row in the fact table will link to exactly one row in the product table, one row in the salesperson table, etc.

But what happens when more than one salesperson can be involved in an order? Suppose there could be two, three, or any number of salespeople? The one-to-many relationship breaks down.

Avoiding the Bridge

As with the repeating attribute from last time, we might try to accommodate this multi-valued dimension by simplifying the relationship. Perhaps we can isolate some "roles" that can be used to recover a clean one-to-many relationship. If there is always a lead salesperson and a secondary salesperson, we can simply place two salesperson keys in the fact table.

This has some disadvantages, though, and they are the same as those from last time.

First, it becomes difficult to produce a report showing all sales that a particular person was involved in. The query needs to select sales where the primary salesperson is the person in question, OR where the secondary salesperson is the person in question. Doing this for all salespeople in a department becomes tricky, especially if we are using a SQL-generating BI tool.
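Here is what such a query might look like. This is a sketch only; the names order_facts, order_dollars, and the two key columns are assumed:

SELECT
  SUM(order_facts.order_dollars) as total_sales
FROM
  order_facts, salesperson
WHERE
  salesperson.salesperson_name = 'Smith'
  AND ( order_facts.primary_salesperson_key = salesperson.salesperson_key
        OR order_facts.secondary_salesperson_key = salesperson.salesperson_key )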

The second problem with simplifying the relationships is that there may be a need for any number of dimension values. Suppose three salespeople collaborate on an order? Do we provide three salesperson keys in the fact table? This compounds the first problem, and it will only be a matter of time before we come across a case where we need four.

A Dimension Bridge

The solution is to build a table that bridges the fact table and dimensions.

When a fact references a group of dimension values, a group will be created. The fact table will contain the group key. This key will join to the bridge table.  The bridge will have a row for each group member, linking to the dimension. An example appears here:
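A minimal sketch of such a design follows; the table and column names (salesperson_group, order_facts, order_dollars) are assumed:

CREATE TABLE salesperson_group (
  salesperson_group_key  INTEGER,  -- identifies the group; repeats for each member
  salesperson_key        INTEGER   -- joins to the salesperson dimension
);

CREATE TABLE order_facts (
  salesperson_group_key  INTEGER,  -- joins to the bridge, not to salesperson
  product_key            INTEGER,
  day_key                INTEGER,
  order_dollars          DECIMAL(12,2)
);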


This simple solution allows individual rows in the fact table to link to any number of salespeople.  Notice that it requires a group to be established even for orders that have only one salesperson.

Using the Bridge

With a dimension bridge in place, it is easy to filter facts for a single dimension member, or to group them by dimension members.  The sum of sales, for example, can be grouped by salesperson: the fact table joins to the bridge, the bridge joins to the dimension, and a simple GROUP BY query is issued.
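Using the assumed names from the sketch above, that query might look like this:

SELECT
  salesperson.salesperson_name,
  SUM(order_facts.order_dollars) as total_sales
FROM
  order_facts, salesperson_group, salesperson
WHERE
  order_facts.salesperson_group_key = salesperson_group.salesperson_group_key
  AND salesperson_group.salesperson_key = salesperson.salesperson_key
GROUP BY
  salesperson.salesperson_name

Note that an order with two salespeople contributes to both of their totals -- the very danger discussed below.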

Dimension bridges are useful in many other situations: multiple citations during an inspection, multiple parties in a dispute or claim, and so forth.

Dangers

As with the attribute bridge, the dimension bridge brings with it the possibility of misuse.

Without careful attention, facts may be double-counted, triple-counted, or worse.  For example, if two people collaborate on a sale, a report of sales by salesperson will show the sale twice.  This kind of report is useful, but creating a grand total would be a mistake. Similar issues crop up if both people are in the same department and someone tries to produce a report showing sales by department.

The solution is to make sure the bridge is only accessible to trained developers and analysts.  A simplified solution can be made available to ad hoc users. In this case, that might mean adding a primary salesperson key to the fact table. This allows ad hoc users to safely join the fact table and dimension table, but their analytic capability will be limited.
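For example, with an assumed primary_salesperson_key added to the fact table, the ad hoc version avoids the bridge entirely:

SELECT
  salesperson.salesperson_name,
  SUM(order_facts.order_dollars) as total_sales
FROM
  order_facts, salesperson
WHERE
  order_facts.primary_salesperson_key = salesperson.salesperson_key
GROUP BY
  salesperson.salesperson_name

Each order is counted exactly once, under its primary salesperson -- at the cost of ignoring the other contributors.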

More On Dimension Bridges

This introduction has only scratched the surface of dimension bridges. As with the previous post, there are many more refinements and complications that can be introduced. These include the impact on ETL, the addition of allocation factors, coping with software that does not permit a many-to-many relationship and more.

These topics will be explored in future posts. You can also read all about bridge tables in Chapters 9 and 10 of Star Schema: The Complete Reference.

Have a question about bridge tables? Send it to the address in the sidebar, and it may be addressed in a Q&A post.

Image by Branwebs via Creative Commons