Tuesday, July 5, 2011

Dimensional Modelers Do Not Focus on Logical vs. Physical

The separation of logical and physical models is useful in entity-relationship modeling. Not so in the world of dimensional models, where the final table structures are strikingly similar to the business model. 

Instead, a dimensional model is best described in three levels of increasing detail.

Entity-Relationship Modeling and Dimensional Modeling

It is well known that students of dimensional modeling have an easier time if they have not had significant exposure to entity-relationship modeling (or ER modeling).

Entity-relationship modelers are tempted to normalize data, design relationships between dimension tables, and so forth. These steps make perfect sense in the world of OLTP systems design, but not for analytic solutions.

Similarly, when scholars of entity-relationship modeling turn their focus to the dimensional world, another kind of mistake can occur.

Given that it is "all modeling," one may be tempted to generalize ER modeling into a generic framework and then apply it to dimensional modeling. This mistake is often made by software vendors, and sometimes by academics.

ER modeling and dimensional modeling are both forms of data modeling, but it is not necessary to have a single framework to explain both.

Different Disciplines, Different Frameworks

The traditional distinction between logical and physical model is useful in ER modeling, but not in dimensional modeling. In the world of ER modeling, a logical model captures data requirements without assuming particular data storage technologies. It captures business entities and their attributes and relationships.

The logical model is the basis for a physical model, which reflects the requirements of relational storage in tables and columns. For example, many-to-many relationships are resolved, data may be restructured for performance reasons, and so forth.

This distinction has significantly reduced utility in the dimensional world. A physical dimensional model stores data in a format that is very similar to how it is understood by the business. Aside from the addition of keys, there is very little difference between logical and physical.

The distinction is not useless, but it plays a much smaller part.  So small, in fact, that it is not the focus of most books on dimensional modeling. My own book on dimensional modeling, for example, makes its first mention of the term "logical model" in the last two paragraphs of the last chapter.  (See the end of this post for a link.)

Dimensional Models: Three Levels of Detail

Instead of having multiple kinds of model, a dimensional model is best understood at three levels of increasing detail.  Like zooming in on a Google map, each level reveals additional information about the model. Each level of detail has different uses for different audiences.
  1. Business Requirements
    Requirements are grouped by subject area, correspond to business processes, state measurement requirements in terms of facts and dimensions, and cross-reference common dimensions.

    These business requirements clearly convey scope in business terms.  They link directly to the next level of detail in the model, which exposes the concept of table.

  2. High Level Design
    This level of the model defines the same requirements in terms of fact tables and dimension tables, natural keys and surrogate keys, and exposes major attributes of significance.

    At this level of detail, we do not record every column of every table, or even assign data types.  But we do draw table diagrams, and rigorously define several important design elements such as grain, additivity, and slow change requirements.

    This level is useful for design reviews, educating users and developers, and describing project activities.

  3. Detailed Design
    At the lowest level of detail, we expose every column of every table, define data types, provide definitions and sample data, map everything back to source data, and document transformation rules. (A brief sketch follows this list.)

    This level of detail is useful for database administrators and ETL architects. It also contains metadata that will be useful for BI developers and end-users.
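For illustration, here is a minimal sketch of one artifact from the detailed design level: DDL for a hypothetical salesperson dimension. The table name, column names, and data types are invented for this example, and detailed design also includes definitions, sample data, source mappings, and transformation rules that DDL alone does not capture.

create table salesperson (
    salesperson_key    integer      not null,  -- surrogate key
    salesperson_id     varchar(10)  not null,  -- natural key from the source system
    salesperson_name   varchar(80)  not null,  -- type 1 slow change
    sales_region       varchar(40)  not null   -- type 2 slow change
);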

One Model, Three Levels of Detail

Don't assume that these three levels of the dimensional model are developed in sequence. It is possible to work that way, but it is not always the case.

Most dimensional designers do the bulk of their work at the second level - the high level dimensional design.

As the model nears completion, they will summarize it at a business level, and then develop the detailed model. In other cases, modelers may work through these levels in sequence.

Regardless of the process, in the end these three levels all describe the same dimensional model. Like the zoom levels of a Google map, they do so at different levels of detail. All should be accounted for.


For more information on what information to capture at each of these levels, see Star Schema: The Complete Reference.  Detailed descriptions and examples can be found in Chapter 18, "How To Design And Document A Dimensional Model."

Image licensed via Creative Commons 2.0 from Patrick Hoesley

Tuesday, June 21, 2011

Allocation Factors Are Not Magic Bullets

Reader interest in bridge tables continues.  Today, I'll address a topic that was raised in the comments: the use of allocation factors.

Allocation factors are often seen as a cure-all for double counting.  Unfortunately, they are not always realistic. And when they are possible, there is often a better solution.

The back-story: double counting

A dimension bridge allows a single fact row to be linked to two or more rows in a dimension.  In a previous post, I showed how this technique allowed two or more salespeople to be linked to a row in an orders fact table.

As I pointed out, the danger is that queries may accidentally count the order dollars repeatedly -- once for each associated salesperson.  This is the problem of double counting.

The potential solution: allocation

An allocation factor can sometimes be used to avoid this danger. To the bridge, we add a column that indicates the "share" of the facts associated with each group member.  In queries, before we aggregate facts, we apply this allocation factor.

For example, we can add an allocation factor to the salesperson group bridge: 

If two members collaborate on a sale and split the credit 75/25, there will be two rows in the bridge for their group.  One would contain the group key, the first salesperson's key, and a .75 allocation factor.  The other would contain the group key, the second salesperson's key, and a .25 allocation factor.  (For sample data, see Figure 9-5 of my book.)
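For instance, a salesperson group bridge with an allocation column might contain rows along these lines (the keys and values here are invented for illustration, not taken from the book's figure):

salesperson_group_key   salesperson_key   allocation
101                     1                 0.75
101                     2                 0.25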

Now, when we connect the bridge to the fact table, we can apply the allocation factor:

select sum(order_facts.order_dollars * salesperson_bridge.allocation)
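A more complete query might look something like the following. The join columns and the salesperson dimension table shown here are assumptions for the sake of illustration; adapt the names to your own schema.

select   salesperson.salesperson_name,
         sum(order_facts.order_dollars * salesperson_bridge.allocation) as allocated_dollars
from     order_facts
join     salesperson_bridge
  on     order_facts.salesperson_group_key = salesperson_bridge.salesperson_group_key
join     salesperson
  on     salesperson_bridge.salesperson_key = salesperson.salesperson_key
group by salesperson.salesperson_name;

Provided the allocation factors for each group sum to one, the weighted order dollars add back up to the original amounts.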

Seems to be the perfect solution, right?  Well, not always....

There must be a business rule

An allocation factor only works where one exists. Do not come up with your own rules -- e.g. "split transactions evenly among group members." This may not reflect how the business studies its key metrics.

Allocation factors are often not part of the business. If an insurance claim is tracked in a fact table, it may not make sense to allocate it against multiple parties. If a fact table represents an inspection event, it may not make sense to allocate it against multiple infractions.

There must be a definitive and consistent business rule that can be used. If there is no business rule, then obviously allocation will not work. Note, too, that they are even less likely to be found in situations that call for an attribute bridge.

There should be a source of record

So you have a business rule?  You are not out of the woods yet.

Next, you need to make sure you have a source of record for the allocation factor.

For example, many systems that support the sale of securities calculate, and split, broker commissions at the time of sale. These systems provide a nice, clean allocation factor.

If you cannot find a source of record, you should think long and hard about using the "business rule." In all likelihood, it will be subject to exceptions, and may even change over time. 

For example, your business may tell you transactions can be allocated evenly among group participants. You may choose to "hard-code" this rule into your ETL process. But one day, you will encounter a situation where the business wants to "override" this rule for a particular transaction. Or a new rule is instituted that a particular salesperson always gets at least 40% of their sales, and so on.

Each time the business changes its mind, ETL rework will be necessary. If you don't have a system of record, think long and hard before hard-coding a business rule.

Consider altering fact table grain

Got a business rule AND a source of record?  Fantastic!  You are very lucky, and you may not even need an allocation factor at all!

When you have a clear source for an allocation factor, consider skipping the bridge altogether. Instead, restate the grain of your fact table as allocated transactions.

In the orders example, we can change the grain of the fact table from "one row per order line" to "order lines allocated to salespeople."

Now, if two salespeople collaborate on a sale, there will be two rows in the fact table.  They will be fully additive, with no risk of double counting the facts. No bridge needed!
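As a simple illustration (the amounts and keys are invented), a $1,000 order line credited 75/25 to two salespeople would be stored as two rows, each carrying its allocated share of the dollars:

salesperson_key   day_key   order_dollars
1                 2411      750.00
2                 2411      250.00

Summing order_dollars across either row, both rows, or any group of salespeople yields correct totals without any weighting in the query.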

Note that this works best if the orders are allocated in the source system.  Otherwise, you will wind up with splits that are not even (33/33/34) or fractional quantities that do not total to the same amounts as the operational system.

More to Come

My inbox is full of additional questions and comments about bridge tables.  I promise to post some more about bridged designs in the months to come.

In the meantime, I encourage you to check out Star Schema: The Complete Reference, which contains two full chapters on bridge tables.  Chapters 9-10 offer a far deeper treatment than I can provide on this blog, with 30 figures, sample instance charts, sample queries and results, etc.

Bridge photo by Branwebs via Creative Commons

Thursday, May 12, 2011

Dates In Dimension Tables

A reader asks about tracking dates in the fact table vs. dimension tables.
Q: It seems to me that it is sometimes more effective to store date keys in the dimension table -- particularly if that date could be used to describe the rows in many different fact tables.  The alternative  would be to duplicate the date key in multiple fact tables.

For example, the effective date of an insurance policy could be used against many fact tables.  Should we store the effective date key in each of the fact tables?  Or store it once, in the Policy dimension?

The difference in design would be a star schema (date key in the fact table) versus a snowflake (date key stored in the dimension).

- Robert
Montreal
A:  Many models feature at least one major dimension with a date that  might be used to qualify facts in more than one star. This happens in other industries, too.  Examples include dimension tables that represent orders, contracts, tests and so forth.

Avoid placing date_key columns in dimension tables.  It increases SQL complexity, reduces schema understandability, and may impact performance.

Some designers may allow limited use of snowflaking, but not in this situation.

Date in Dimension Table, Day Key in Fact Table

When you have a date that (1) clearly describes something represented by a dimension, and (2) will also be used to group facts, do two things:

  1. Place good old-fashioned dates in the dimension table 
  2. In any star where you may want to filter, group, or aggregate facts by some aspect of the date in question, add date keys to the fact table
For example, in Robert's case there might be a Policy dimension table with an attribute called policy_effective_date.  This is a date, not a key.

There is also a star that tracks claims. To support analysis of claim information using the policy effective date, the fact table will contain a day_key_policy_effective.  The fact table may contain other day_keys as well (such as day_key_claim, which supports analysis of claim data by the date of the claim).
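A minimal sketch of this arrangement follows. Aside from the column names mentioned above, the table layouts and data types are assumptions for illustration.

create table policy (
    policy_key              integer      not null,  -- surrogate key
    policy_number           varchar(20)  not null,  -- natural key (illustrative)
    policy_effective_date   date         not null   -- stored as a date, not a key
);

create table claim_facts (
    policy_key                 integer        not null,
    day_key_claim              integer        not null,  -- joins to the day dimension on the claim date
    day_key_policy_effective   integer        not null,  -- joins to the day dimension on the policy effective date
    claim_dollars              decimal(12,2)  not null
);

Grouping or filtering claims by any aspect of the policy effective date then requires only a join from day_key_policy_effective to the day dimension; no extra join through the policy table is needed.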

Judicious Use of Snowflaking?

There are times when some designers might choose a snowflake path -- placing dimension keys into dimension tables. This happens when there is a set of attributes that appear in multiple dimension tables.

This occurs with geographical data.  The characteristics of a location might appear in a customer dimension table, a warehouse dimension table and a department dimension table.

In this case, the concern is that the ETL routines may not consistently process locations across multiple dimension tables. If their attributes or slow change characteristics are not managed identically, inconsistencies may develop across these tables.

Designers reason that creating one location table solves this problem.  Each dimension that contains location data will contain a location_key. This ensures a single consistent representation of locations.
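For example, the snowflaked alternative might look something like this (table and column names are illustrative):

create table location (
    location_key   integer      not null,
    city           varchar(60)  not null,
    state          varchar(30)  not null
);

create table warehouse (
    warehouse_key    integer      not null,
    warehouse_name   varchar(60)  not null,
    location_key     integer      not null  -- references the shared location outrigger
);

The customer and department dimension tables would carry location_key columns in the same way.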

I discuss this technique in Chapter 7 of Star Schema: The Complete Reference, and point out that it opens the door for a different kind of ETL challenge.  If location data is moved to an outrigger, each time there is a Type 2 slow change to a location, all related dimension table rows must also undergo Type 2 changes.

My preference is to allow the attributes to repeat in multiple tables, and to increase the QA focus on related ETL programs.

Thanks to Robert for the question.  If you have one of your own, send it to the address in the sidebar.

-Chris
Image credit: Gravityx9 licensed under Creative Commons 2.0