Image by Patrick Hosley, Licensed by Creative Commons 2.0 |
Here's a list of things people sometimes do not realize can be handled by standard dimensional techniques.
At the end of the post is a large collection of links to previous posts treat these topcis in more detail.
Note: This article should not be interpreted as a criticism of entity-relationship modeling. Rather, the point is to cover commonly overlooked capabilities of dimensional modeling.
Predictive Analytics: Storage of Granular Detail
I often hear people say: if you want to do predictive analytics, its important to capture all the relationships among data attributes. This is incorrect.
The job of predictive analytics is to tell us what is important--not the reverse! Neither data mining nor predictive analytics requires information to be stored in an entity-relationship format.
As people who do this work will tell you, they don't care how the data is supplied. Nor do they care what modelers believe are the important relationships. Many tools will completely de-normalize the data before doing anything else.
In order to support predictive analytics, standard best practices of dimensional modeling apply:
- Fact tables record business activities and conditions at the lowest level of detail possible in granular, first line stars.
- Transaction identifiers are carried in the model if available.
- Surrounding dimensions are enriched with detail to the fullest extent possible
- In the dimension tables, changes are tracked and time-stamped (“type 2”).
Many-to-many Relationships: Multi-Valued Dimension Bridge
Usually, each fact in a star schema can be linked to a single member of a dimension. This is often mistaken for a rule, which leads to the incorrect conclusion that dimensional models cannot accommodate a fact that may link to multiple members of a dimension.
(This is often generalized as “dimensional cannot handle many-to-many relationships.”)
In a dimensional model, this situation is referred to as a multi-valued dimension. We use a bridge table to link the facts to the dimension in question. This special table sets up a many-to-many relationship between the facts and dimensions, allowing any number of dimension members to be associated with any number of facts.
Here are some examples:
- In sales, a bridge table may be used to associate a sale, as recorded in a fact table, with multiple sales people.
- In insurance, a single claim (again, tracked in a fact table) can be associated with multiple parties.
- In healthcare, a single encounter may be associated with multiple providers or tests.
- In government, a single audit or inspection may be associated with multiple findings.
- In law enforcement, a single arrest may be associated with multiple charges.
It is sometimes suggested that dimensional models cannot gracefully accommodate repeating attributes. The dimensional solution is again a bridge table. This time, it is placed between the dimension table and an outrigger that contains the attribute in question.
Examples include:
- Companies that have multiple Standard Industry Classification codes
- People with multiple phone numbers or addresses
- Accounts with multiple account holders
- Patients with multiple diagnoses
- Documents with multiple keywords.
Ragged hierarchies, unbalanced hierarchies, or recursive hierarchies are often cited as the downfall of the dimensional model. In fact, a solution exists, and it is extraordinarily powerful. The hierarchy bridge table allows facts to be aggregated either by rolling up or rolling down through the hierarchy, regardless of number of levels.
Examples include:
- Parts that are made up of other parts
- Departments that fall within other departments
- Geographies that fall within other geographies
- Companies that own other companies
A star schema does not include relationships between dimension tables. This has led to the misconception that you can't track these relationships.
In fact, any important relationship between dimension tables can and should be captured. It is done using factless fact tables. (Dimensions are never directly linked because of the implications this would have on slow change processing.)
Examples include:
- Employees filling a job in a department
- Marketing promotions in effect in a geographical market
- Students registered for courses
- The primary care physician assigned to an insured party
- Brokers assigned to clients
Another situation where modelers often believe they must “fall back” on ER techniques is when the attributes of a dimension table vary by type. This variation is often misconstrued as calling for the ER construct known as subtyping. Similar variation might also be found with associated facts
In the dimensional model, heterogeneous attributes are handled via the core and custom technique.
A core dimension captures common attributes, and type specific replicas capture the core attributes along with those specific to the subtype. Dimensions can then be joined to the fact table according to the analytic requirement. If there is variation in the facts, the same can be done with fact tables.
Examples include:
- Products with attributes that vary by type
- Customers that have different characteristics depending on whether they are businesses or individuals
- In insurance, policies that have different characteristics depending on whether they are group or individual polices
- In healthcare, medical procedures or tests that have different characteristics and result metrics depending on the test
- In retail, stores that have varying characteristics depending on the type (e.g. owned, franchise, pocket)
In a classic star schema design, facts are recorded at a granular level and “rolled up” across various dimensions at query time. Detractors often assume this means that non-additive facts have no place in a dimensional model.
In fact, dimensional modelers have several tools for handling non-additive facts.
Those that can be broken down into additive components are captured at the component level. This is common for many key business metrics such as:
- Margin rate: stored as margin amount and cost amount in a single fact table. The ratio is computed after queries aggregate the detail.
- Yield or conversion percentage: stored as quote count and order count in two separate fact tables, with ratio converted after aggregation at query time.
- Period-to-date amounts, stored in a snapshot table or summary table
- Distinct counts, stored in a summary table
- Non-numeric grades, stored in a transaction-grained fact table
Conclusion
Every technique mentioned here is part of dimensional modeling cannon. None are stopgaps or workarounds. While some may prove problematic for some of our BI software, these problems are not unique to the dimensional world.
In the end, the dimensional model can represent the same real-world complexities that entity-relationship models can. No ER fallback required.
- Chris
Learn More
All of these topics have been covered previously on this blog. Here are some links to get you started.
Establishing granular, detailed star schemas:
- Build High Resolution Stars (January 7, 2011).
- Bridge to Multi-Valued Dimensions (February 9, 2011)
- Bridge Tables and Many-to-many Relationships (April 25, 2011)
- See the label bridges for more posts that deal with this topic.
- Resolve Repeating Attributes with a Bridge Table (January 28, 2011)
- Q and A: Bridges are Part of Dimensional Modeling (March 7, 2011)
- See the label bridges for more
- Recursive Hierarchies and Bridge Tables (May 22, 2012)
- Using a Hierarchy Bridge Without a Many-to-many Relationship (May 18, 2012)
- And again, more will appear under the label bridges
- Factless Fact Tables (September 15, 2011)
- See also the label Factless Fact Tables
- There's not much on this topic on this blog right now, but see my book for more info (details below.)
- Handling Rankings (October 10, 2012)
- Storing Non-additive Facts (March 16, 2010)
- More on Distinct Counts (April 27, 2009)
- Dealing with Period-to-date Measurements (April 18, 2009)
- See the category non-additve facts for more
Also consider checking out my book, Star Schema: The Complete Reference. It covers all these topics in much more detail:
- Granular and detailed stars are covered in Chapter 3, "Stars and Cubes"
- Multi-valued Dimensions and Multi-Valued attributes are covered in Chapter 9, "Multi-valued Dimensions and Bridges"
- Hierarchy Bridges are covered in Chapter 10, "Recursive Hierarchies and Bridge Tables"
- Factless Fact Tables are covered in Chapter 12, "Factless Fact Tables"
- Core and Custom schemas are covered in Chapter 13, "Type-specific Stars"
- Non Additive Facts are covered in Chapter 3, "Stars and Cubes," Chapter 11, "Transactions, Snapshots and Accumulating Snapshots" and Chapter 14, "Derived Schemas"
Use the links on this blog to order a copy from Amazon. There is no additional cost to you, and you will be helping support this blog.