Thursday, December 20, 2012

Dimensional Models: No E-R Fallback Required

People often suggest to me that a dimensional model cannot handle certain situations. These situations, the assertion holds, require falling back to entity-relationship modeling techniques.

Image by Patrick Hosley,
Licensed by Creative Commons 2.0
When we look together at their examples, though, I've always found a dimensional solution.

Here's a list of things people sometimes do not realize can be handled by standard dimensional techniques.

At the end of the post is a large collection of links to previous posts that treat these topics in more detail.

Note: This article should not be interpreted as a criticism of entity-relationship modeling. Rather, the point is to cover commonly overlooked capabilities of dimensional modeling.


Predictive Analytics: Storage of Granular Detail

I often hear people say: if you want to do predictive analytics, it's important to capture all the relationships among data attributes. This is incorrect.

The job of predictive analytics is to tell us what is important--not the reverse!  Neither data mining nor predictive analytics requires information to be stored in an entity-relationship format.

As people who do this work will tell you, they don't care how the data is supplied. Nor do they care what modelers believe are the important relationships. Many tools will completely de-normalize the data before doing anything else.

In order to support predictive analytics, standard best practices of dimensional modeling apply:
  • Fact tables record business activities and conditions at the lowest level of detail possible in granular, first line stars.
  • Transaction identifiers are carried in the model if available.
  • Surrounding dimensions are enriched with detail to the fullest extent possible.
  • In the dimension tables, changes are tracked and time-stamped (“type 2”).
If we do these things with our dimensional data, we have not lost anything that is required to do predictive analytics.

Many-to-many Relationships: Multi-Valued Dimension Bridge

Usually, each fact in a star schema can be linked to a single member of a dimension. This is often mistaken for a rule, which leads to the incorrect conclusion that dimensional models cannot accommodate a fact that may link to multiple members of a dimension.

(This is often generalized as “dimensional cannot handle many-to-many relationships.”)

In a dimensional model, this situation is referred to as a multi-valued dimension. We use a bridge table to link the facts to the dimension in question. This special table sets up a many-to-many relationship between the facts and dimensions, allowing any number of dimension members to be associated with any number of facts.

Here are some examples:
  • In sales, a bridge table may be used to associate a sale, as recorded in a fact table, with multiple sales people. 
  • In insurance, a single claim (again, tracked in a fact table) can be associated with multiple parties. 
  • In healthcare, a single encounter may be associated with multiple providers or tests. 
  • In government, a single audit or inspection may be associated with multiple findings. 
  • In law enforcement, a single arrest may be associated with multiple charges.
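To make the sales example above concrete, here is a minimal sketch of a bridge table in SQL, run through Python's sqlite3 module. All table and column names, the group key, the weighting factors, and the data are invented for illustration:

```python
# Hypothetical multi-valued dimension bridge: one sale (fact row) is
# credited to two salespeople via a bridge table.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE salesperson_dim (salesperson_key INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE salesperson_bridge (group_key INTEGER, salesperson_key INTEGER,
                                 weighting_factor REAL);
CREATE TABLE sales_facts (group_key INTEGER, sale_dollars REAL);

INSERT INTO salesperson_dim VALUES (1, 'Ann'), (2, 'Bob');
-- group 10 stands for the pair (Ann, Bob), each with half the credit
INSERT INTO salesperson_bridge VALUES (10, 1, 0.5), (10, 2, 0.5);
INSERT INTO sales_facts VALUES (10, 1000.0);
""")

# Allocated credit per salesperson; the weighting factor prevents
# double counting when the bridge fans out the fact row.
rows = con.execute("""
SELECT d.name, SUM(f.sale_dollars * b.weighting_factor)
FROM sales_facts f
JOIN salesperson_bridge b ON f.group_key = b.group_key
JOIN salesperson_dim d ON b.salesperson_key = d.salesperson_key
GROUP BY d.name ORDER BY d.name
""").fetchall()
print(rows)  # [('Ann', 500.0), ('Bob', 500.0)]
```

Whether to allocate with a weighting factor, or to accept the fan-out and report "impact" totals, is a design decision to make with the business.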
Repeating Attributes: Multi-Valued Attribute Bridge

It is sometimes suggested that dimensional models cannot gracefully accommodate repeating attributes. The dimensional solution is again a bridge table. This time, it is placed between the dimension table and an outrigger that contains the attribute in question.

Examples include:
  • Companies that have multiple Standard Industry Classification codes 
  • People with multiple phone numbers or addresses 
  • Accounts with multiple account holders 
  • Patients with multiple diagnoses
  • Documents with multiple keywords.
Recursive Hierarchies: Hierarchy Bridge

Ragged hierarchies, unbalanced hierarchies, or recursive hierarchies are often cited as the downfall of the dimensional model. In fact, a solution exists, and it is extraordinarily powerful. The hierarchy bridge table allows facts to be aggregated either by rolling up or rolling down through the hierarchy, regardless of the number of levels.

Examples include:
  • Parts that are made up of other parts
  • Departments that fall within other departments
  • Geographies that fall within other geographies
  • Companies that own other companies
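A hierarchy bridge can be sketched in miniature as follows. The department names, keys, and budget figures are hypothetical, and a production bridge would typically carry additional columns (such as levels of separation and top/bottom flags); the essential idea is one row per ancestor/descendant pair, including each member paired with itself:

```python
# Hypothetical hierarchy bridge over a recursive department hierarchy:
# Corporate (1) -> Sales (2) -> Inside Sales (3).
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE department_bridge (superior_key INTEGER, subordinate_key INTEGER);
CREATE TABLE budget_facts (department_key INTEGER, budget_dollars REAL);

INSERT INTO department_bridge VALUES
  (1,1), (1,2), (1,3),   -- Corporate and everything beneath it
  (2,2), (2,3),          -- Sales and everything beneath it
  (3,3);                 -- Inside Sales alone
INSERT INTO budget_facts VALUES (1, 100.0), (2, 50.0), (3, 25.0);
""")

# Roll up: total budget for Sales (key 2) and all of its subordinates,
# no matter how many levels deep the hierarchy goes.
total = con.execute("""
SELECT SUM(f.budget_dollars)
FROM department_bridge b
JOIN budget_facts f ON f.department_key = b.subordinate_key
WHERE b.superior_key = 2
""").fetchone()[0]
print(total)  # 75.0
```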
Relationships Between Dimensions: Factless Fact Tables

A star schema does not include relationships between dimension tables. This has led to the misconception that you can't track these relationships.

In fact, any important relationship between dimension tables can and should be captured. It is done using factless fact tables. (Dimensions are never directly linked because of the implications this would have on slow change processing.)

Examples include:
  • Employees filling a job in a department
  • Marketing promotions in effect in a geographical market
  • Students registered for courses
  • The primary care physician assigned to an insured party
  • Brokers assigned to clients
Subtyping: Core and Custom

Another situation where modelers often believe they must “fall back” on ER techniques is when the attributes of a dimension table vary by type. This variation is often misconstrued as calling for the ER construct known as subtyping. Similar variation might also be found in the associated facts.

In the dimensional model, heterogeneous attributes are handled via the core and custom technique.

A core dimension captures common attributes, and type-specific replicas capture the core attributes along with those specific to the subtype. Dimensions can then be joined to the fact table according to the analytic requirement. If there is variation in the facts, the same can be done with fact tables.

Examples include:
  • Products with attributes that vary by type 
  • Customers that have different characteristics depending on whether they are businesses or individuals
  • In insurance, policies that have different characteristics depending on whether they are group or individual policies
  • In healthcare, medical procedures or tests that have different characteristics and result metrics depending on the test
  • In retail, stores that have varying characteristics depending on the type (e.g. owned, franchise, pocket)
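A minimal sketch of the core and custom technique, with invented product data, shows how the two dimension versions share surrogate keys and are swapped in and out per question:

```python
# Hypothetical core and custom dimensions: product_core holds common
# attributes for all products; product_tablet repeats them and adds
# type-specific columns, using the same surrogate keys.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE product_core   (product_key INTEGER PRIMARY KEY,
                             product_name TEXT, category TEXT);
CREATE TABLE product_tablet (product_key INTEGER PRIMARY KEY,
                             product_name TEXT, category TEXT,
                             screen_size TEXT, capacity TEXT);
CREATE TABLE sales_facts    (product_key INTEGER, sale_dollars REAL);

INSERT INTO product_core   VALUES (1, 'iPad Mini', 'Tablet'),
                                  (2, 'MacBook',   'Computer');
INSERT INTO product_tablet VALUES (1, 'iPad Mini', 'Tablet', '7.9"', '32GB');
INSERT INTO sales_facts    VALUES (1, 400.0), (2, 1200.0), (1, 400.0);
""")

# Enterprise-wide question: join the core dimension (all products).
all_sales = con.execute("""
SELECT SUM(sale_dollars) FROM sales_facts f
JOIN product_core c ON f.product_key = c.product_key
""").fetchone()[0]

# Tablet-specific question: swap in the custom dimension.
by_capacity = con.execute("""
SELECT t.capacity, SUM(f.sale_dollars) FROM sales_facts f
JOIN product_tablet t ON f.product_key = t.product_key
GROUP BY t.capacity
""").fetchall()
print(all_sales, by_capacity)  # 2000.0 [('32GB', 800.0)]
```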
Non-Additive Metrics: Snapshots or Summary Tables

In a classic star schema design, facts are recorded at a granular level and “rolled up” across various dimensions at query time. Detractors often assume this means that non-additive facts have no place in a dimensional model.

In fact, dimensional modelers have several tools for handling non-additive facts.

Those that can be broken down into additive components are captured at the component level. This is common for many key business metrics such as:
  • Margin rate: stored as margin amount and cost amount in a single fact table. The ratio is computed after queries aggregate the detail.
  • Yield or conversion percentage: stored as quote count and order count in two separate fact tables, with the ratio computed after aggregation at query time.
Other non-additive facts cannot be broken down into fully additive components. These are usually captured at the appropriate level of detail, and stored in snapshot tables or summary tables. Common examples include:
  • Period-to-date amounts, stored in a snapshot table or summary table
  • Distinct counts, stored in a summary table
  • Non-numeric grades, stored in a transaction-grained fact table
While these non-additive facts are less flexible than additive facts in terms of how they can be used, this is not a result of their dimensional representation.
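The margin-rate example can be demonstrated in a few lines of Python. The figures are made up, but they show why the ratio must be computed after the components are aggregated, rather than averaged across rows:

```python
# Margin rate is non-additive: averaging per-row rates gives the wrong
# answer. Store the additive components and compute the ratio last.
rows = [
    {"margin": 10.0, "cost": 100.0},   # low-margin transaction
    {"margin": 90.0, "cost": 300.0},   # high-margin transaction
]

# Wrong: average of the row-level rates.
naive = sum(r["margin"] / (r["margin"] + r["cost"]) for r in rows) / len(rows)

# Right: aggregate the additive components first, then take the ratio.
total_margin = sum(r["margin"] for r in rows)
total_cost = sum(r["cost"] for r in rows)
margin_rate = total_margin / (total_margin + total_cost)

print(round(naive, 4), round(margin_rate, 4))  # 0.1608 0.2
```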

Conclusion

Every technique mentioned here is part of the dimensional modeling canon. None are stopgaps or workarounds. While some may prove problematic for some of our BI software, these problems are not unique to the dimensional world.

In the end, the dimensional model can represent the same real-world complexities that entity-relationship models can. No ER fallback required.

- Chris


Learn More

All of these topics have been covered previously on this blog.  Here are some links to get you started.

Establishing granular, detailed star schemas:
Multi-valued Dimensions:
Multi-Valued Attributes:
Recursive Hierarchies:
Factless Fact Tables:
Core and Custom:
  • There's not much on this topic on this blog right now, but see my book for more info (details below.)
Non-additive Facts:

Also consider checking out my book, Star Schema: The Complete Reference.  It covers all these topics in much more detail:
  • Granular and detailed stars are covered in Chapter 3, "Stars and Cubes"
  • Multi-valued Dimensions and Multi-Valued attributes are covered in Chapter 9, "Multi-valued Dimensions and Bridges"
  • Hierarchy Bridges are covered in Chapter 10, "Recursive Hierarchies and Bridge Tables"
  • Factless Fact Tables are covered in Chapter 12, "Factless Fact Tables"
  • Core and Custom schemas are covered in Chapter 13, "Type-specific Stars"
  • Non Additive Facts are covered in Chapter 3, "Stars and Cubes," Chapter 11, "Transactions, Snapshots and Accumulating Snapshots" and Chapter 14, "Derived Schemas"

Use the links on this blog to order a copy from Amazon.  There is no additional cost to you, and you will be helping support this  blog.






Monday, November 5, 2012

Name Value Pairs and Dimensional Models

A reader asks how name/value pairs are best translated into a dimension design.
I was wondering if you have ever had to setup a data warehouse with a source system that has the ability to have dynamic "virtual" columns using a set of name/value pair tables...If you have done this or have any cool ideas on how to tackle this setup in a dimension that would be greatly appreciated.
-Joshua
Belmont, CA

This is a common scenario. The preferred solution is to identify the most commonly used "virtual columns" and model them as standard attributes. Bridged solutions are also possible, but are much less flexible from a usability standpoint.

Name Value Pairs

Name/value pairs are often used in OLTP solutions to offer a flexible way to describe an entity.

Rather than store each characteristic of an entity as an attribute, a special table lets you identify unlimited attributes for something. Its main columns capture the name of the attribute and its value. (Hence the moniker "name/value pair".)

For example, suppose you are Apple Inc. Your OLTP system has a product table, but each product has a wide variety of attributes. Rather than model each attribute, you might capture them in a name/value pair table.

The rows in the table for the newly released iPad Mini might look like this:

As you can see, the 18 rows above describe one iPad Mini model. 

Elsewhere in the data model, an associative table (or "intersect table") cross-references these 18 pairs with a single row in the product table for this item.

Apple has 23 other iPad Mini models, each with different values for the capacity attribute, color attribute, carrier attribute, etc. 

In addition, this same table can be used to describe the characteristics of products with different named attributes - a phone, a computer, a notebook, and so forth. New attributes do not require changing the database design. Instead, all that is necessary is to insert the appropriate rows.

In a dimensional model, there are several ways to handle name/value sources. Each involves tradeoffs between flexibility and usability.

Flattened Solution is Most Usable

The preferred method is to model the "names" as explicit dimension attributes.1

In the case of our product, for example, the resultant dimensional model might look like this:

This approach produces the most understandable and usable dimensional model. Each important characteristic is present in the dimension, named according to business terms, and ready to provide context for facts.

Do not be discouraged if every instance of the source entity (product in this case) does not share the same set of attributes in the name/value pair table. Look for 100 or so attributes that are most commonly used. In many cases, this may satisfy most of your analytic requirements.
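The flattening step itself can be sketched as a simple pivot. The product codes and attribute names below are invented for illustration; each name/value row becomes a cell in one dimension row per product:

```python
# Hypothetical name/value pairs from an OLTP source:
# (product_code, attribute_name, attribute_value)
pairs = [
    ("A1450", "color", "Black & Slate"),
    ("A1450", "capacity", "32GB"),
    ("A1455", "color", "White & Silver"),
    ("A1455", "capacity", "16GB"),
]

# The "names" chosen for promotion to explicit dimension columns.
promoted = ["color", "capacity"]

# Pivot: one row per product, one column per promoted name.
dimension = {}
for product, name, value in pairs:
    row = dimension.setdefault(product, {n: None for n in promoted})
    if name in promoted:
        row[name] = value

print(dimension["A1450"])  # {'color': 'Black & Slate', 'capacity': '32GB'}
```

Names not on the promoted list simply do not appear in the flattened dimension, which is exactly the usability trade-off described above.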

Core and Custom Dimensions

If there are simply too many names to flatten into a single dimension table, the next step is to consider "core and custom" models.

If you separate the entity by "types", can you find 100 named attributes for each type?  

If so:
  • Build a single "core" dimension that contains an identifier and any attributes common across all types, along with a surrogate key.
  • For each type, build an additional dimension that replicates the common attributes, plus the 100 or so attributes specific to the type.
  • The core and custom versions should use the same surrogate key domain for each member of the dimension, so that they can be swapped in and out of queries as required.
For example, Apple might do the following:
  • Build a single product table that has the part number, category, product name and a key. This table will have a row for all products, including computers, tablets, phones and music players. This is the "core" product table. 
  • Build additional "custom" dimensions for each category of product. The table pictured in the previous section might be the custom dimension for tablets.
This solution retains the understandability of the previous solution - each attribute is clearly and explicitly named. However, we must now take care to join the appropriate version of the dimension to fact tables - a technical consideration based on the business question being asked.

Bridged Solution Mirrors OLTP Design

The last option is to employ a bridged solution. This technique actually mimics the original OLTP design. An outrigger contains the name/value pairs, and a bridge associates it with the dimension.

For example, Apple's product dimension might be bridged as follows:

This solution has the benefit of providing access to the full range of named attributes. It is also flexible; new named attributes do not require any change to the schema design.2

The disadvantage here is that this solution is the most difficult to use. Facts can be easily double counted, triple counted, etc. It will be necessary to take great care in constructing and qualifying queries, and it may also be necessary to lock down "grand total" functionality in end-user reporting tools.

Both Ways

Remember that this is not an either/or solution. If a bridged outrigger is necessary, consider also capturing the 100 most commonly used attributes in the dimension itself.

This allows you to create a safe "sandbox" for less technically sophisticated users. They are able to access much of the analytic value of your solution, without having to expose them to the complexity and possible dangers of the bridge table.

More Reading
  • Bridges: Much has been written in this blog about bridge tables. The most relevant entry if you want to learn more is Resolve Repeating Attributes With A Bridge Table (1/28/11).  There are also two full chapters in my latest book (see below.)
  • Core and custom dimensions have not been previously discussed here, but you can learn more about them in Star Schema: The Complete Reference.  It dedicates an entire chapter to "Type-Specific Stars."
Help support this blog

This blog is ad-free, and no one pays me to write it.  You can help out:  
  • Use any of the links on these pages to pick up my latest book, Star Schema: The Complete Reference.  
  • Or, if you already have it, use the links to check out any of the other recommended books.

When you do, a small portion of the purchase goes to this blog (you do not pay any extra).


Notes:

1I'll call these "names" rather than "pairs", since it's the name that will translate into a column. For example, "color" is a name; it may participate in several name/value pairs -- one for each possible color.

2The bridge table in this diagram contains product and attribute keys. If there will be many combinations, dimensional modelers often replace the product key with a group key, and store the group key in the dimension table. See Resolve Repeating Attributes With A Bridge Table (1/28/11). 


Wednesday, October 10, 2012

Handling Rankings

Dimensional modelers often struggle with issues surrounding rankings. Learn when to store them, where to store them, and where not to store them, by reading this post.
We need to calculate Top N rankings across several dimensions (account, channel, hour, day, week, month.) There are massive data sets involved. This must be a common problem. Is there an approach you can point us towards?
- Jeremy
Rankings are time and context dependent. When you need to track them, they pose challenges that are similar to non-additive facts such as period-to-date measurements or distinct counts.

The solutions, it turns out, are also similar.

Rankings and their Context

When thinking about how to handle rankings, it is useful to break them down. A ranking is actually made up of several parts. There is the dimension being ranked, the dimensions for which the ranking will be computed,  and the fact that is the source of the ranking.

For example, look at this simple statement:

"Show me the top 10 customers by week and product."
  • The dimension being ranked here is Customer.  
  • The dimensional context of each ranking is a product and week. For example, if we have 20 products, then each week we will need 20 top ten lists -- one for each product.
  • The fact driving the ranking is not explicit in this question. It is probably spending, though it may be based on margin or transaction counts.  
That may sound simple, but things are usually a bit more complicated.

The reader, for example, does not just need a single ranking. He actually needs several -- daily, weekly, monthly. Presumably, these may be mixed and matched with the other dimensions he mentioned -- account and channel. In addition, there may be different rankings based on different facts.

Compute vs. Store

It may have occurred to you at this point that there are a multitude of possible rankings, even for a relatively simple schema.

In addition, once we compute a ranking, there is very little we can do with it. For example, we cannot sum it up over time periods, or other dimensions.  We may be able to compare it to another ranking (a different period or dimension, for example), or treat it as reference data describing the thing being ranked. But that's about it.

For these reasons, many organizations choose to compute rankings within reports, or at query time.  This provides maximum flexibility, and also sidesteps the ugly problem of having to store information that cannot be aggregated.

However, the reader indicates that data volumes are large. Let's assume they are so large that it is not feasible to keep the granular data around to compute rankings. Let's also assume that the rankings he needs are consistent over time. In his case, it may be necessary to compute all needed rankings and then store them.

Rankings fit with Snapshots or Aggregates

A ranking is only useful within the original context of its calculation. We won't really be able to aggregate it. In this respect, it is similar to a distinct count or a period-to-date measurement -- topics I have addressed in the past. (Some period-to-date metrics can be aggregated within a single period, but otherwise the comparison holds.)

The recommended solution is similar as well. Establish the granularity of the various rankings, then keep track of them in fact tables of appropriate grain. These will either be periodic snapshots, or aggregate tables that summarize appropriate time periods. The rank itself will be stored in the fact table, even though it is not additive.

Many developers balk at this solution, observing that it increases the number of fact tables that must be maintained. But note that these fact tables do not introduce any new dimension tables, and they can usually be sourced from transaction-grained stars. And the alternatives have even less appeal.
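Here is a small sketch, with invented data, of precomputing ranks at a (week, product) grain and storing them alongside the aggregated fact. Each context gets its own independent ranking:

```python
# Hypothetical granular facts: (week, product, customer, spend).
from collections import defaultdict

facts = [
    (1, "P1", "C1", 500.0), (1, "P1", "C2", 800.0), (1, "P1", "C3", 200.0),
    (1, "P2", "C1", 100.0), (1, "P2", "C2", 50.0),
]

# Step 1: aggregate spend to the ranking grain (week, product, customer).
totals = defaultdict(float)
for week, product, customer, spend in facts:
    totals[(week, product, customer)] += spend

# Step 2: rank within each (week, product) context and store the rank
# in an aggregate-grained fact structure.
by_context = defaultdict(list)
for (week, product, customer), spend in totals.items():
    by_context[(week, product)].append((customer, spend))

ranking_facts = []
for (week, product), members in by_context.items():
    members.sort(key=lambda m: m[1], reverse=True)
    for rank, (customer, spend) in enumerate(members, start=1):
        ranking_facts.append((week, product, customer, spend, rank))

top_p1 = [r for r in ranking_facts if r[1] == "P1" and r[4] == 1]
print(top_p1)  # [(1, 'P1', 'C2', 800.0, 1)]
```

The stored rank is queryable alongside the fact, but, as noted above, it cannot be meaningfully summed across contexts.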

Rankings don't fit with Transactions

Most of the time, it is not useful to store rankings with transactions. Unless all the things being ranked happen to have transactions that occur at the same time, it will be impossible to compare the ranking with anything -- including other things being ranked!

Resist abstracted models

It is tempting to come up with designs that abstract the concept of a ranking. For example, one can imagine setting up a dimension that describes a ranking, the dimension being ranked, and the period. A fact table can be set up that carries the key for this metadata, plus 10 or 20 keys that will be used to link to top 1, top 2, etc.

Abstracted designs are appealing because they may appear to limit the number of stars to be maintained and don't require schema changes when new rankings are added. But note that the same amount of work must be done to load each ranking.

More importantly, these abstracted designs severely hamper usability. With such a design, it is much harder to merge information about a ranking with other data. A report that contains product level detail, for example, cannot easily pull in info indicating the product appeared in a Top 10 ranking -- this would require decoding metadata and then checking each of the various keys.

Further Reading

As mentioned, some of the challenges related to rankings are similar to those involving distinct counts or period-to-date metrics.  I've covered these topics before:


I also touch on the topic in Chapter 8 of  Star Schema: The Complete Reference, where I address the proper home of a period-to-date measurement.  Use the link to purchase a copy, and you will be helping to support this blog.



Image licensed via Creative Commons 2.0



Friday, September 21, 2012

Business Analytics at Work, Almost

Back in February, business analytics briefly broke into the mainstream consciousness thanks to an article in the New York Times Magazine called How Companies Learn Your Secrets by Charles Duhigg.

When I teach, I frequently use an example from this article. I get lots of email asking me for more information, so I figured I might as well post it here.

Breaking Habits

Duhigg's purpose is to show that human behavior is driven by habit. The key to changing behavior is understanding the underlying habits. 

He tells several stories where businesses use this information to drive positive results.

One of his stories happens to touch on predictive analytics.  

The setup is simple: back in 2002, marketers at Target had realized that shopping habits are disrupted by the arrival of a first-born. Bring expecting parents into stores, and you have an opportunity to cultivate new patterns. The result: new lifetime customers for all manner of products--not just baby items.

The key, then, was to identify people who might be pregnant. With the help of a team of statisticians, Target figured out that they could do just that.

Enter Analytics

Duhigg describes how Target tapped into their vast data reserves, and eventually came up with a market basket profile that was a good predictor of pregnancy. This profile could then be applied to their customer base to identify thousands of people who were likely to be expecting.

They were surprisingly effective. Duhigg tells the story of a gentleman berating a store manager for marketing baby products to his teenage daughter:

“She’s still in high school, and you’re sending her coupons for baby clothes and cribs? Are you trying to encourage her to get pregnant?”

The manager didn’t have any idea what the man was talking about. He looked at the mailer. Sure enough, it was addressed to the man’s daughter and contained advertisements for maternity clothing, nursery furniture and pictures of smiling infants. The manager apologized and then called a few days later to apologize again.

On the phone, though, the father was somewhat abashed. “I had a talk with my daughter,” he said. “It turns out there’s been some activities in my house I haven’t been completely aware of. She’s due in August. I owe you an apology.”
The analytics, it seems, did their job. But did they really?

Analytics and Business Strategy

This anecdote went viral when the Times published Duhigg's article. Ethical issues notwithstanding, it offered an updated version of the apocryphal  "beer and diapers" tale from the 1990's.

But there's more here than market basket analysis. This story illustrates a key aspect of predictive analytics: it's important to engage with the owner of the business problem. In this case, it seems, that did not happen. Instead, someone made a tactical decision about how to act on the analytics.

In this case,

  • A business analyst had already identified something important: pregnancy. 
  • An analytic modeler took things from there, and developed a way to identify pregnant shoppers. 
  • The analyst chose to mail to people identified by the analytic model.

This last step is where the fumble took place. Every analytic initiative must be connected to the owner of the business problem. This person may not be actively engaged in the analytics, but it is her job to decide what to do with the insights. Deciding how to act on analytics is a key part of the process. It involves communication, and requires the attention of executives who are responsible for business objectives.

The Book

Duhigg has now published a book on the same topic. You might want to check it out.  Just remember, this is not a book about analytics -- it's about habits.


The Power of Habit: Why We Do What We Do in Life and Business by Charles Duhigg.  (Random House, 2012)




Monday, July 16, 2012

Q&A: Does Master-Detail Require Multiple Stars in a Travel Data Mart?

A reader asks whether master-detail relationships call for multiple fact tables.

Q: Is there some general principle to use as a guide when deciding to use 1 star or 2 for master-detail situations?

We have a master-detail situation involving trips, and multiple trip expenses per trip. Many of our queries are at the trip level -- e.g. how many trips are for clients in their 30's going to Arizona in July. Many of the queries are at the detail level -- how much is spent on meals by the client's age range?

Both levels of query can be answered with one star, with the fact at the detail level (trip expense line). For trip-level questions - you can count distinct occurrences of the natural key in the trip dimension. But it seems that grinding through all those trip expenses when I just wanted to count trips is inefficient.

So do we have 2 fact tables, one at the trip level, one at the trip expense level? Do the dimensions surround both of them?  Is there a rule?


- Ellen, Ottawa
A: As the reader observes, this is a common situation: when there is a master-detail relationship, is there a need for multiple fact tables?

This case calls for at least two fact tables.
Read on for the details.

Photo by o5com, licensed by Creative Commons

Multiple Fact Tables

After learning the basics of dimensional modeling, the first real world challenge we face is understanding when and how to design multiple fact tables.  Until we learn to think in dimensional terms, the choice can be difficult.

I suggest starting with some basic guidelines.  You probably need different fact tables if:
  1. You have measurements with different periodicity
  2. You have measurements with different levels of detail
The first guideline suggests that if facts do not describe the same event, they probably belong in different fact tables. For example, orders and shipments do not always happen at the same time.  Order dollars and shipment dollars belong in separate fact tables.

The second guideline pertains to facts that do describe the same events.  Information about an order and information about an order line are ostensibly available at the same time, but they have different levels of detail. If there are facts at both of these levels, there will need to be multiple fact tables.

So how would you apply these guidelines?

Think Dimensionally

It is easiest to work these questions out by thinking dimensionally. Forget about master-detail or parent-child for a few minutes, and consider what is being measured. 
  • What are the business questions being asked, and what business metrics do they contain?  
  • Are these metrics available simultaneously?
  • At what level of detail are these metrics available?
The reader cited business questions in her example.  These questions reveal at least two metrics:  number of trips, and expense dollars.

These metrics may be available on different schedules -- for example, a trip may commence before we have the details of all the expenses incurred.  This would argue for multiple fact tables.

But let's suppose that the single source of data is expense reports. We do not have any information about the trip until it is over, at which point we have all the expense details.

In this case, let's think about the level of detail of these metrics. Number of trips and expense dollars seem to share a lot of common detail -- the traveller, the destination, the trip start and end date, and so forth.  

But expense dollars have some additional detail -- the date of the expense item, and an expense category. Since these facts have different levels of detail, it makes sense to include them in separate fact tables.

Factless Fact Table

The trip-level star may contain a factless fact table.  It contains one row per trip, with no explicit facts.  We can determine the number of trips that answer a business question simply by counting rows.

Many teams would prefer to add a fact called number_of_trips and always populate it with the value 1.  This is useful, because it makes the business metric explicit in your schema design.  (I've written about this technique before.)

The trip level star may also contain some summary level facts that describe expenses -- say the total trip cost, the total transportation cost, total meals cost, etc.  More detail on these metrics (such as the vendor, payment method, etc.) can found in the expense item star.
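A sketch of the constant-valued fact, with invented keys and trip data, shows how trip-level questions become simple sums against the trip star rather than distinct counts against expense lines:

```python
# Hypothetical trip-level fact table: one row per trip, with
# number_of_trips always 1 so counting becomes an explicit SUM.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE trip_facts (trip_id TEXT, traveller_key INTEGER,
                         destination_key INTEGER, number_of_trips INTEGER);
INSERT INTO trip_facts VALUES
  ('T1', 1, 10, 1), ('T2', 1, 20, 1), ('T3', 2, 10, 1);
""")

# How many trips went to destination 10? No expense-line grinding,
# and no COUNT(DISTINCT ...) required.
n = con.execute("""
SELECT SUM(number_of_trips) FROM trip_facts WHERE destination_key = 10
""").fetchone()[0]
print(n)  # 2
```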

Deeper into Trip Analysis

Digging deeper, the reader may discover she needs more than two fact tables.  

Trips often include more than one destination.  When it is necessary to study the various segments (or "legs") of a trip, we can define a third fact table that contains one row per segment.

In this case, the reader would have three fact tables:
  1. trip_facts  Grain: one row per trip.  Metrics: Number of trips (always one)
  2. trip_segment_facts  Grain: one row per destination city.  Metrics: number of segments (always one)
  3. trip_expense_facts  Grain: one row per expense item.  Metrics: expense_dollars
Each of these stars will share some common dimensions -- the trip origination date, the trip end date, the traveller, the trip identifier, etc.

Trip_segment_facts will also include destination-specific dimensions: the destination city of the segment, the arrival and departure dates for the segment, etc.

Trip_expense_facts will include dimensions that describe the individual expense items: the date paid, the payment method, the payee, the expense category and subcategory, and so forth.

A conformance matrix will help you keep track of the dimensionality of each star.

Learn More

Thanks to Ellen for the thoughtful question.  If you have a question of your own, please send it in.

Pick up a copy of Star Schema: The Complete Reference and you will be helping support this blog. 

  • Chapter 4 is dedicated to designing multiple fact tables.  
  • Chapter 5 looks closely at conformed dimensions. 
  • Chapter 12 covers factless fact tables.
You can learn more about these topics by referring to these posts:
  • Factless Fact Tables (September 15, 2011) describes factless fact tables that describe events, and the use of a fact with the constant value of 1.
  • The Conformance Matrix (June 5, 2012) describes how to represent the dimensionality of your stars in a matrix format.

The photo in this post is by o5com, licensed by Creative Commons


Tuesday, June 5, 2012

The Conformance Matrix

Conformed dimensions are the linchpins of dimensional models. This post summarizes their use, and describes how to document them in matrix form.

Conformed dimensions: a refresher

Metrics describing different processes can be compared if they are stored in stars which share common dimensions. As I have discussed previously, these compound metrics often turn out to be among the most valuable from a business perspective.
Image by Patrick Hosley
Licensed under CC 2.0

The common dimensions do not have to be physically shared tables.  Each star may reside in a separate database.

As long as the common dimensions, such as "customer" or "product" have the same structure and content, they are said to conform.

Using conformed dimensions, we are able to compare measurements stored in different star schemas through a process called drilling across.
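A minimal sketch of the drill-across process, using made-up data: each star is summarized separately over the conformed dimension, and the two result sets are then merged on the shared dimension value.

```python
# Phase 1: aggregate each star separately over the conformed dimension.
sales =     [("widgets", 100), ("gadgets", 200)]   # (product, sales_dollars)
inventory = [("widgets", 40), ("sprockets", 10)]   # (product, units_on_hand)

def aggregate(rows):
    totals = {}
    for dim_value, fact in rows:
        totals[dim_value] = totals.get(dim_value, 0) + fact
    return totals

sales_by_product = aggregate(sales)
inventory_by_product = aggregate(inventory)

# Phase 2: full outer merge on the conformed dimension value.
merged = {p: (sales_by_product.get(p), inventory_by_product.get(p))
          for p in sales_by_product.keys() | inventory_by_product.keys()}
```

Note that neither star is joined directly to the other; the comparison happens only through the conformed dimension values, which is why identical structure and content matter.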

Planning conformance

Conformed dimensions are therefore the linchpins of the dimensional model. They ensure that each star works on its own, and also works with other stars.

If conformed dimensions are planned in advance, you can implement one star at a time without worrying about incompatibility issues.

This is the core idea behind Ralph Kimball's bus architecture. Conformed dimensions are designed as part of an up-front architecture project. Then, they serve as a semantic "bus."  Just as cards plug into the backplane of a PC, fact tables plug into this dimensional bus.

The concept is also important in other architectures. For W.H. Inmon's Corporate Information Factory, conformance allows process comparison within data marts as well as across data marts.

Without conformed dimensions, subject areas become stovepipes. The opportunity to build cross-process metrics is lost. Worse yet, users also come to distrust the individual data marts. Their thinking: if sales and inventory cannot be compared, there must be something wrong with the data.

Documenting conformance

The conformance plan is a central feature of your dimensional model, so of course it must be documented.

Conformed dimensions are best documented in a matrix format, as in the diagram below.

Image from Star Schema: The Complete Reference by Chris Adamson
 (c) 2010 McGraw-Hill.  Used by permission.
The rows of this diagram correspond to fact tables, and the columns are dimensions.  Where you see a checkmark, it indicates that the fact table makes use of the associated dimension.

The matrix makes it easy to identify compatibility across fact tables. When two fact tables have a checkmark in the same column, that dimension can be used as the basis for comparing the processes (aka drilling across).

Notice that conformed dimensions are depicted with associated "levels."  Salesperson, for example, has three successive levels of conformance: regions, territories, and individual salespeople. (This topic has been covered previously.)

It is possible that a degenerate dimension (a dimension attribute stored within a fact table) may be a conformed dimension. These attributes should also appear on the conformance matrix. In the picture above, order_line may be a degenerate dimension.

Variations on the conformance matrix

The small conformance matrix above focuses on a subject area (sales).  Conformance across subject areas can also be illustrated using a matrix, albeit a larger one.

An enterprise level conformance matrix is a valuable tool. It is a blueprint that can guide incremental implementation. It also helps break down proprietary attitudes toward data among different groups within your business.  One look at the matrix, for example, and it becomes clear that "customer" touches several parts of the business.

Conformance matrices can be produced at different levels of summarization.  A more summarized matrix may contain one row per data mart, rather than one per fact table. This may help guide project planning, or simply make an enterprise level matrix easier to digest.

Similarly, the conformance matrix can be used to map individual facts to dimensions. Architects use this kind of matrix when they are having trouble identifying discrete fact tables. Performing affinity analysis on this kind of matrix reveals facts that share dimensionality. These may be candidates for inclusion in a single star.
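That affinity analysis can be sketched in a few lines of Python. The fact and dimension names below are made up for illustration; facts with identical dimensionality group together as candidates for a single star:

```python
# A fact-to-dimension matrix: each fact mapped to its set of dimensions.
fact_dims = {
    "order_dollars":  {"day", "product", "customer"},
    "order_quantity": {"day", "product", "customer"},
    "units_on_hand":  {"day", "product", "warehouse"},
}

# Group facts by identical dimensionality -- a simple affinity analysis.
groups = {}
for fact, dims in fact_dims.items():
    groups.setdefault(frozenset(dims), []).append(fact)
```

Here the two order facts share dimensionality and might live in one star, while units_on_hand belongs elsewhere.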

Support this blog

Pick up a copy of Star Schema: The Complete Reference and you will be helping support this blog. Chapter 5 is completely dedicated to conformed dimensions.

A lot of information on conformed dimensions also appears in this blog:
  • Conformed Dimensions (Nov 15, 2011) discusses different ways dimensions can conform, and introduces the concept of "levels" of conformance.
There is also a category label for posts referencing conformed dimensions.



Friday, May 18, 2012

Using A Hierarchy Bridge Without A Many-to-Many Relationship

In this post, I describe what you can do to eliminate the many-to-many relationship when using a hierarchy bridge.

The Hierarchy Bridge

A recent post on Recursive Hierarchies and Bridge Tables described how to support a recursive hierarchy in a dimensional model.

The business benefits of this kind of solution can be extremely valuable.

In the example, a special bridge table was used to navigate the relationships between companies and their subsidiaries. When connected to a fact table, this allowed us to roll up transactions via the recursive hierarchy.

Here is what the solution looked like.  Click the picture for a larger version.  For full details, refer to the original post.



Many-to-many is legal, but you may need to resolve it anyway.

You may be wondering about the many-to-many relationship -- the one between the fact table and the bridge table.

This kind of relationship is perfectly valid in a relational database.  Unfortunately, some of our software products may not accept this kind of relationship, making it necessary to eliminate it.

I've written about this before in the context of a multi-valued dimension.  The solution in that case was to resolve the many-to-many relationship by creating a single-column table that consisted solely of a surrogate key.

With a hierarchy bridge, we can resolve the many-to-many without adding one of those silly single-column tables.

Additional table not required

In the recursive hierarchy above, we already have what we need: the company table itself.

Notice that in the picture above, the company table represents the "superior" to which we are aggregating transactions. The bridge links it to facts that describe the activities of subordinates.

To resolve the many-to-many, we will insert a second alias for the company table between the fact table and the bridge.  This alias represents the subordinate.

Now the picture looks like this:




An added benefit of this approach is that we can include information about both the superior and the subordinates in our query results.
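The resolved design can be sketched with an in-memory database. All table and column names here are hypothetical; the key point is that the "sub" alias of the company table sits between the fact table and the bridge, so every join is one-to-many:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE company (company_key INT, company_name TEXT);
    CREATE TABLE company_hierarchy_bridge (superior_key INT, subordinate_key INT);
    CREATE TABLE order_facts (company_key INT, order_dollars INT);
    -- XYZ (key 1) owns ABC (key 2); bridge has self-rows plus pairs.
    INSERT INTO company VALUES (1, 'XYZ'), (2, 'ABC');
    INSERT INTO company_hierarchy_bridge VALUES (1, 1), (1, 2), (2, 2);
    INSERT INTO order_facts VALUES (1, 100), (2, 50);
""")

# Orders for XYZ and its subsidiaries, with both the superior and the
# subordinate company names available in the results.
rows = con.execute("""
    SELECT sup.company_name, sub.company_name, SUM(f.order_dollars)
    FROM order_facts f
    JOIN company sub ON sub.company_key = f.company_key
    JOIN company_hierarchy_bridge b ON b.subordinate_key = sub.company_key
    JOIN company sup ON sup.company_key = b.superior_key
    WHERE sup.company_name = 'XYZ'
    GROUP BY sub.company_name
    ORDER BY sub.company_name
""").fetchall()
```

Because the subordinate alias appears in the query, the results can show which subsidiary contributed each subtotal -- the added benefit mentioned above.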

Learn more

There are several posts describing bridge tables in this blog.  This post has mentioned two of them:
  • Use the category keyword bridges to find more posts.
And of course, you can help support this blog by picking up a copy of Star Schema: The Complete Reference.  It devotes two chapters and 30 illustrations to the topic of bridge tables.


Bridge photo by Branwebs via 
Creative Commons

Wednesday, May 2, 2012

Recursive Hierarchies And Bridge Tables

This post introduces the hierarchy bridge table, which can be used to navigate recursive hierarchies.

High Trestle Bridge at Night
Photo (c) Carol Bodensteiner
Bridge tables are a popular topic among readers of this blog. Perhaps that's because they can be a bit confusing, yet also enable powerful analysis.

Previously, I've focused on the two simplest uses for bridge tables:  the attribute bridge and the dimension bridge.  See links at the end of this post for more on these.

In this post, I'll tackle a third and final use: how a bridge can aid in the navigation of a recursive hierarchy.

Recursive relationships

In the real world, recursive relationships are common. Employees report to other employees, parts are made of other parts, and so forth.

In an E-R model, a recursive relationship might look like this:

Companies may own other companies.  That dotted relationship line is sometimes called a "pig's ear."  It links each company to the company that owns it.

You can read this relationship in two directions:  each company may have a parent company; each company may be the parent of one or more companies.

Looking up and looking down

From a business perspective, this information may be very useful.  For example, suppose you work for a business that sells things to companies. You would like to be able to answer questions like these:
  1. What are January order dollars from Company XYZ and its subsidiaries?
  2. Show me January orders from Company ABC and any companies above it.
The first question "looks down" from company XYZ.  The second question "looks up" from company ABC.

In both cases, we want to use the hierarchy to aggregate details about orders.

The hierarchy bridge

In a dimensional model, the idea is to support the study of facts (such as "order dollars") in the context of this hierarchy. This can be done using a bridge table, as shown here:

The table in the center is a hierarchy bridge.  This kind of table can be a bit confusing.  First, consider its content.

Each row contains a pair of company keys.  Here are the rules for populating the bridge table:
  1. For each company, the bridge contains rows associating it with each of its subordinates.
  2. Each company also has one row associating it with itself.
With respect to item 1, note each company is linked to all subordinates, not just the direct ones. (That's why the keys have the prefixes "superior" and "subordinate" rather than "parent" and "child".)
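The two population rules can be sketched as a small Python routine. The company names are made up; the function walks the parent-child relationships to produce every (superior, subordinate) pair at any depth, plus the self-rows:

```python
# Hypothetical parent -> direct children relationships.
children = {"XYZ": ["A", "B"], "A": ["C"]}

def bridge_rows(children):
    """Apply the two population rules to produce (superior, subordinate) pairs."""
    companies = set(children) | {c for kids in children.values() for c in kids}
    rows = {(c, c) for c in companies}   # rule 2: each company paired with itself

    def descend(superior, node):
        for child in children.get(node, []):
            rows.add((superior, child))  # rule 1: all subordinates, not just direct
            descend(superior, child)

    for c in companies:
        descend(c, c)
    return rows
```

For this hierarchy, XYZ is paired with A, B, and C, even though C is only an indirect subordinate -- which is exactly why "superior" and "subordinate" are better names than "parent" and "child."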

Using the bridge

In the picture above, notice that the fact table contains a company_key.  This represents the company from which each order is taken.  

Normally, we would join it to the company table to study "orders by company."

In this case, however, the bridge is inserted between the fact table and the company table. The bridge acts like a multiplexer--linking each company (on the right) to transactions with any and all of its subordinates (on the left).

This allows us to select a single company on the right (say Company XYZ) and aggregate all orders at or beneath that company.  That's called "looking down." It lets us answer question 1.

We can also use this hierarchy to "look up," as in question 2.  We simply reverse the joins--link company to the subordinate key, and the superior key to the fact table.  Then we can select a single company (say Company ABC) and look upward.
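Both join patterns can be sketched with an in-memory database (all names hypothetical; XYZ, key 1, is the parent of ABC and DEF). Note that each query constrains to a single company, which guards against the double counting discussed below:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE company (company_key INT, company_name TEXT);
    CREATE TABLE company_hierarchy_bridge (superior_key INT, subordinate_key INT);
    CREATE TABLE order_facts (company_key INT, order_dollars INT);
    INSERT INTO company VALUES (1, 'XYZ'), (2, 'ABC'), (3, 'DEF');
    INSERT INTO company_hierarchy_bridge
        VALUES (1, 1), (1, 2), (1, 3), (2, 2), (3, 3);
    INSERT INTO order_facts VALUES (1, 100), (2, 50), (3, 25);
""")

# Question 1, "looking down": orders for XYZ and everything beneath it.
looking_down = con.execute("""
    SELECT SUM(f.order_dollars)
    FROM company c
    JOIN company_hierarchy_bridge b ON b.superior_key = c.company_key
    JOIN order_facts f ON f.company_key = b.subordinate_key
    WHERE c.company_name = 'XYZ'
""").fetchone()[0]

# Question 2, "looking up": reverse the joins to aggregate orders for
# ABC and any companies above it.
looking_up = con.execute("""
    SELECT SUM(f.order_dollars)
    FROM company c
    JOIN company_hierarchy_bridge b ON b.subordinate_key = c.company_key
    JOIN order_facts f ON f.company_key = b.superior_key
    WHERE c.company_name = 'ABC'
""").fetchone()[0]
```

The only difference between the two queries is which bridge column joins to the company table and which joins to the fact table.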

Powerful but dangerous

Of course, the danger of the bridge table should be readily apparent.  As you can see from the diagram above, if we forget to select a single parent company, we can easily double count the orders, triple count them, or worse.

For this reason, writing queries that involve the hierarchy must be governed carefully.  Developers or users with little experience can study the facts without using the bridge.  This will be safe, but they will not be able to leverage the hierarchy.

Other kinds of bridge tables

The hierarchy bridge described in this post is one of three kinds of bridge tables.  I've written previously about the other two kinds:
  • Bridge To Multi-Valued Dimensions (2/9/2011) shows how a bridge can link a single fact to multiple rows in a dimension.  Examples include orders with multiple salespeople, claims with multiple parties, etc.
More on the hierarchy bridge

When it comes to recursive hierarchies, this post only scratches the surface.  If you have questions, send them in.

In the mean time, there's a whole chapter on this topic in my book, Star Schema: The Complete Reference.  That would be Chapter 10:  "Recursive Hierarchies and Bridges." It's a long chapter--the longest one in the book.

And don't forget, when you use the links on this page to buy it from Amazon, you are helping to support this blog.


Copyright (c) Carol Bodensteiner
Used by permission, with many thanks!

Thursday, April 26, 2012

Q&A: Human resources data marts


A reader asks if Human Resources data marts are inherently complex. I run down a list of dimensional techniques he should expect to find:
Q: I will be working on a data mart design project to design star schemas for human resources data. I heard that HR data is more complex than sales or marketing and special techniques need to be applied.

I looked at the star schemas of pre-built analytical applications developed by some vendors for our packaged HR solution. I felt that they are quite complex, and I am wondering whether star design for HR data should be so complex.

If possible, can you please discuss this topic in detail by considering data from any one popular HRMS system and the most common data/reporting requirements, along with the design discussion needed to achieve the stars for those reports?

- Venkat, UK

A: Human Resources applications do indeed tend to use advanced techniques in dimensional design.

Below, I run down a list of topics you will probably need to brush up on. In reality, every subject area requires complete mastery of dimensional modeling, not just the basics.

Note that the complexity you are seeing in packaged solutions may stem from the subject area. Vendors often produce abstracted models to facilitate customization.

Techniques used in HR data marts

No doubt you are accustomed to the transaction-grained stars you encountered in sales.  You will find them in HR as well, but you will also encounter these:
  • Snapshot stars sample one or more metrics at pre-defined intervals.

    In an HR data mart, these may be used to track various kinds of accruals, balances in benefit programs, etc.

  • Accumulating snapshot stars track dimension members through a business process and allow analysis of the elapsed time between milestones.

    These may be used to track the filling of a position, "on-boarding" processes, disciplinary procedures, or applications to benefit programs.

  • Factless fact tables track business processes where the primary metric is the occurrence of an event. They contain no facts, but are used to count rows.

    These are likely to be used for tracking attendance or absence, participation in training courses, etc.

  • Coverage stars are factless fact tables that model conditions.  These are usually in place to support comparison to activities represented in other stars, but may also be leveraged to capture key relationships among dimensions.

    These are likely to be used for linking employees to positions, departments and managers. 
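The factless fact table pattern above can be sketched with made-up attendance data: each row simply records that an event occurred, and the "measurement" is a count of rows (equivalently, the sum of a constant fact of 1).

```python
# A factless fact table for training attendance: (day, employee, course).
# There is no numeric fact column -- each row IS the event.
attendance = [
    ("2012-04-02", "emp1", "SQL Basics"),
    ("2012-04-02", "emp2", "SQL Basics"),
    ("2012-04-09", "emp1", "Star Schema"),
]

# "How many attendances per course?" -- just count rows per course.
counts = {}
for day, employee, course in attendance:
    counts[course] = counts.get(course, 0) + 1
```

A coverage star works the same way, except its rows record conditions (an employee holding a position, say) rather than events.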
Your dimensions will also require reaching beyond the basics:
  • Transaction dimensions capture the effective and expiration date/time for each row in a dimension table.  These are advisable in almost any situation.

    In HR they may be used to  track changes in an employee dimension.

  • Bridge tables for Multi-valued attributes allow you to associate a repeating attribute with a dimension.

    In HR, these are likely to be used to associate an employee with skills, languages, and other important characteristics.

  • Hierarchy bridge tables allow you to aggregate facts through a recursive hierarchy.

    In HR, these are used to navigate reporting structures (employees report to employees, who in turn report to other employees, and so forth) as well as organizational structures.
I would also expect to encounter some complexity in slow-change processing rules.  Human Resource systems carefully audit certain kinds of data changes, tracking the reason for each change. As a result, you may have attributes in your dimension schema that exhibit either type 1 or type 2 behavior, depending on the reason for the change.

Every schema goes beyond the basics

This list could go on, but I think you get the idea.

The only way to design a data mart that meets business needs is to have a well rounded understanding of the techniques of dimensional modeling.

You cannot get very far with nothing more than a grasp of the basics. This holds true in any subject area -- even sales and marketing.  You need the complete toolbox to build a powerful business solution.

Packaged data marts

The complexity that concerns the reader may actually stem from another cause:  he is looking at packaged data mart solutions.

Packaged applications often introduce complexity for an entirely different reason: to support extensibility or customization. For example, facts may be stored row-wise rather than column-wise, and dimensions may contain generic attribute names.

Learn more

This blog contains posts on most of the topics listed above.  Click each header for a link to a related article.  Some have been discussed in multiple posts, but I have included only one link for each.  So do some exploration as well.

In addition, please check out my book Star Schema: The Complete Reference.  When you purchase it from Amazon using the links on this page, you help support this blog.
  • Snapshots and accumulating snapshots are covered in Chapter 11, "Transactions, Snapshots and Accumulating Snapshots"

  • Factless fact tables and coverage stars are covered in Chapter 12, "Factless Fact Tables"

  • Transaction dimensions are covered in Chapter 8, "More Slow Change Techniques"

  • Attribute bridges are covered in Chapter 9, "Multi-valued Dimensions and Bridges"

  • Hierarchy bridges are covered in Chapter 10, "Recursive Hierarchies and Bridges"
Thanks for the question!

- Chris

Send in your own questions to the address in the sidebar. 


Do you have another technique that was useful in an HR data mart? Use the comments.

Image credit: Gravityx9 licensed under Creative Commons 2.0

Friday, April 20, 2012

Big Data and Dimensional Modeling [Video]


I am often asked, "Is the dimensional model obsolete in the world of big data?"

I always respond with a question of my own: "Do you plan to stop measuring your business?"  Of course, the answer is always "no." 

Dimensional modeling is all about measurement, and so it will continue to have a place in our BI programs.

Here, I discuss the topic with Paul Kautza, Director of Education at The Data Warehousing Institute (TDWI).



In this interview, I talk about four ways we use dimensional models:
  1. To capture business requirements (e.g. measurement definitions)
  2. To define and manage a roadmap for a BI program (scope)
  3. As the basis for a database design (star schema, snowflake schema, cube)
  4. As a presentation layer for the business (semantic layer)
Developments in the world of big data may impact the third category, but not always.

Tuesday, April 17, 2012

Q&A: Degenerate Dimensions

A reader asks about degenerate dimensions:
Q: I understand that when a particular dimension value is unique to each fact row (i.e., an invoice number perhaps) it should probably exist as a degenerate dimension in the fact table.  Otherwise, that dimension table will contain as many rows as the fact table. 

However, what about a situation in which a dimension table contains only one column aside from the surrogate key (i.e., a status of some sort) but is not unique to each fact row.  In that case, should it be moved into the fact table as a degenerate dimension?

- Brian

A: Both cases are good applications for a degenerate dimension: a transaction identifier or a low-cardinality attribute that would otherwise need to be a very boring single-column table.

In all cases, the degenerate dimension is entirely optional.

Degenerate dimensions

A degenerate dimension is nothing more than a dimension attribute stored in a fact table, rather than a dimension table.

In all other respects, the degenerate dimension behaves exactly like any other dimension attribute.  It can be used to provide context for facts, qualify queries, sort results, drive master-detail relationships, and so forth.

The usual reason to create a degenerate dimension is simple:  A dimension with one attribute really doesn't need to be in a separate table with a surrogate key.

Two important things to note here:
  1. Degenerate dimensions are optional
  2. There is nothing wrong with a 1:1 relationship between dimension table and fact table
Let's take a closer look at two common uses of degenerate dimensions.

Transaction identifiers as degenerate dimensions

For a transaction fact table, we usually set the grain at the lowest level of detail possible. Often that is an individual transaction such as an order line, invoice line or shipping line.  (For a refresher on grain, see this post from 2009.)

Typically, these things have some kind of unique identifier which is carried into the dimensional model.  Examples include an order line number, invoice line ID, transaction ID, etc.  When these identifiers do not exist, we often create them, as I have discussed in a previous post.

Many designers opt to place these transaction identifiers directly into the fact table as a degenerate dimension.  However, this is entirely optional.  There is nothing wrong with a dimension table being in a 1:1 relationship with the fact table, as I have discussed previously. 

Some designers forgo the degenerate dimension for transaction identifiers because there will be other schemas where the attribute in question is also referenced.  In many cases, our business intelligence software may function better if the attribute in question is located in exactly one place -- a shared dimension table. 

As an aside, note that in these other schemas, the cardinality may be different.  For example, the same order line may be referenced in multiple shipments or invoices.

Other degenerate dimensions

Not all degenerate dimensions are transaction identifiers.  They may simply be dimension attributes with no other suitable home.  The status code mentioned by the reader may be an example.  Another example appeared in a previous Q&A post: Reason Overpaid.

Once again, the use of the degenerate technique is entirely optional. Many designers are comfortable placing the attribute in the fact tables, while others opt for dimension tables with one attribute.

Notice that a dimension table with one attribute does not require a surrogate key for slow change purposes. However, without the surrogate key, the attribute value would be stored in the fact table anyway, as a foreign key reference to a single-column table!

Placing the attribute in a separate table may help with our BI tools, for the same reason given above. Database administrators may also prefer the option because it makes it easier to configure a star-join optimizer.  Adding a surrogate key may reduce the size of the foreign key that will be stored in fact table rows.

Learn More

For more information on this topic, see the following posts:
  • Accumulating Snapshots (October 1, 2010)  Discusses another situation where a dimension table may have the same number of rows as the fact table, or perhaps even fewer.
  • Rule 1: State Your Grain (December 9, 2009) Defines the concept of grain, and discusses the best practice of setting it at the lowest level of detail possible.
Of course, all these topics are also covered in Star Schema: The Complete Reference

You can help support this blog by using the links here to order a copy from Amazon.com.

Image licensed via Creative Commons 2.0
from Patrick Hoesley

Sunday, February 19, 2012

Accumulating snapshots and Type 2 changes

How does one maintain an accumulating snapshot when the dimension that defines its grain can undergo type 2 changes?

A reader recently asked this question via the comments for a recent post.

The short answer: an accumulating snapshot has only one row for each thing being tracked.  If that "thing" can undergo type 2 changes, then the accumulating snapshot should link to only one version -- the current version.

State transitions

Many businesses track items that go through state transitions.  A trouble ticket, for example, is opened, assigned to a support rep, and eventually closed.  A mortgage application is submitted, reviewed, processed, underwritten, and settled upon.

These activities may be represented by a series of fact tables associated with the various process milestones.  Or, they may be captured by a single fact table that records a row for each status change.

When it comes time to study the average time spent at the various stages, however, these options may not be ideal.  In the former case, it will be necessary to drill across several fact tables.  In the latter case it will be necessary to perform correlated subqueries.

The accumulating snapshot model addresses these concerns.

The grain of an accumulating snapshot

An accumulating snapshot receives one row for each item being tracked.  This single row is updated as the item reaches various milestones.  A series of facts are incremented for each day spent in a given stage.  When a milestone is achieved, the appropriate date key is populated.

(For a full refresher on the accumulating snapshot, see the links at the end of this post.)

Unlike other fact tables, then, the accumulating snapshot always reflects the current state of affairs.  In the case of a mortgage processing fact table, for example, each row shows a single application, along with how many days it has spent at each processing stage.

This makes it very easy to study the average time between various stages -- just take the average of the facts in question.  No drilling across, no correlated subqueries.
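That computation can be sketched with made-up snapshot rows. Column names are hypothetical; each row is one mortgage application, and each stage has its own day-count fact:

```python
# Accumulating snapshot: one row per application, one day-count fact
# per processing stage.  Rows are updated in place as items progress.
snapshot = [
    # (application_id, days_in_review, days_in_underwriting)
    ("A1", 5, 10),
    ("A2", 3, 14),
    ("A3", 4, 12),
]

# Average time in underwriting: a plain column average, no drill-across
# and no correlated subqueries.
avg_underwriting = sum(row[2] for row in snapshot) / len(snapshot)
```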

Changes in the defining dimension

Back to the original question, then. What happens if the dimension that defines the grain of the accumulating snapshot can undergo type 2 slow changes?

The goal of the accumulating snapshot is to summarize each real world item in a single row.  Even if the dimension contains multiple rows for a particular item, the fact table must contain only one. If it contained more than one, its usability would be harmed.

As I have written before, this means the defining dimension may have more rows than the accumulating snapshot.  This is OK.  It's just another way in which the accumulating snapshot differs from other types of fact tables.

Our mortgage application, then, must have one and only one row in the fact table -- even if type 2 changes mean there is more than one row for it in the dimension.

Linking to the current row

The question, then, becomes: which row should the fact table link to?

Because the accumulating snapshot is current-valued, it makes the most sense to have it link to the row in the dimension table that represents the current state of the item being tracked.  In our example, this would be the most recent version of the mortgage application.

If you increment the facts in your accumulating snapshot daily (something I suggest in my book), you should also take this opportunity to adjust the key values for any items that have undergone a type 2 change.  If you adjust the facts only when a milestone is met (something you may do if the volume is very high), you may choose to update the key value at this time instead.
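The key adjustment can be sketched as follows (all names hypothetical). During the daily update, each snapshot row is repointed at the current version of its defining dimension row:

```python
# Type 2 dimension rows: (surrogate_key, application_id, is_current).
application_dim = [
    (101, "A1", False),  # superseded version of application A1
    (102, "A1", True),   # current version after a type 2 change
]

# The accumulating snapshot's foreign key, per application.
snapshot = {"A1": 101}

# Repoint each snapshot row at the current dimension row.
current = {app_id: key for key, app_id, is_cur in application_dim if is_cur}
for app_id in snapshot:
    snapshot[app_id] = current[app_id]
```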

Learn more

This and other topics surrounding the accumulating snapshot are discussed in depth in Star Schema: The Complete Reference, in Chapter 11: "Transactions, Snapshots and Accumulating snapshots."

You can help support this blog by using the links in the sidebar to purchase a copy.

You can also read more in these posts:
  • When do you need an accumulating snapshot (January 26, 2012) a recent post that triggered the question, wherein a reader asked how to choose between building an accumulating snapshot and tracking something as a series of type 2 changes. In some cases, the answer is to do both.
  • Q&A: Accumulating snapshots (October 1, 2010) Explores the cardinality relationship between accumulating snapshot and dimension table 

Image Credit:  Creativity103 via Creative Commons