Showing posts with label Soapbox. Show all posts
Showing posts with label Soapbox. Show all posts

Tuesday, July 5, 2011

Dimensional Modelers Do Not Focus on Logical vs. Physical

The separation of logical and physical models is useful entity-relationship modeling. Not so in the world of dimensional models, where final table structures are strikingly similar to business models. 

Instead, a dimensional model is best described in three levels of increasing detail.

Entity-Relationship Modeling and Dimensional Modeling

It is well known that students of dimensional modeling have an easier time if they have not had significant exposure to entity-relationship modeling (or ER modeling).

Entity-relationship modelers are tempted to normalize data, design relationships between dimension tables, and so forth. These steps make perfect sense in the world of OLTP systems design, but not for analytic solutions.

Similarly, when scholars of entity-relationship modeling turn their focus to the dimensional world, another kind of mistake can occur.

Given that it is "all modeling," one may be tempted to generalize a generic framework from ER modeling, and then apply it to dimensional modeling. This mistake is often made by software vendors, and sometimes by academics.

ER modeling and dimensional modeling are both forms of data modeling, but it is not necessary to have a single framework to explain both.

Different Disciplines, Different Frameworks

The traditional distinction between logical and physical model is useful in ER modeling, but not in dimensional modeling. In the world of ER modeling, a logical model captures data requirements without assuming particular data storage technologies. It captures business entities and their attributes and relationships.

The logical model is the basis for a physical model, with reflects the requirements of relational storage in tables and columns. For example, many-to-many relationships are resolved, data may be restructured for performance reasons, etc.

This distinction has significantly reduced utility in the dimensional world. A physical dimensional model stores data in a format that is very similar to how it is understood by the business. Aside from the addition of keys, there is very little difference between logical and physical.

The distinction is not useless, but it plays a much smaller part.  So small, in fact, that you will not find it the focus of most books on dimensional modeling. My own book on dimensional modeling, for example, makes first mention of the term "logical model" in the last two paragraphs of the last chapter.  (See the end of this post for a link.)

Dimensional Models: Three Levels of Detail

Instead of having multiple kinds of model, a dimensional model is best understood at three levels of increasing detail.  Like zooming in on a Google map, each reveals additional information about the model. Each level of detail has different uses for different audiences.
  1. Business Requirements
    Requirements are grouped by subject area, correspond to business process, state measurement requirements in terms of facts and dimensions, and cross reference common dimensions.

    These business requirements clearly convey scope in business terms.  They link directly to the next level of detail in the model, which exposes the concept of table.

  2. High Level Design
    This level of the model defines the same requirements in terms of fact tables and dimension tables, natural keys an surrogate keys, and exposes major attributes of significance.

    At this level of detail, we do not record every column of every table, or even assign data types.  But we do draw table diagrams, and rigorously define several important design elements such as grain, additivity, and slow change requirements.

    This level is useful for design reviews, educating users and developers, and describing project activities.

  3. Detailed Design
    At the lowest level of detail, we expose every column of every table, define data types, provide definitions and sample data, map everything back to source data, and document transformation rules.

    This level of detail is useful for database administrators and ETL architects. It also contains metadata that will be useful for BI developers and end-users.

One Model, Three Levels of Detail

Don't make the assumption that these three levels of the dimensional model are developed in sequence. It is possible to do so, but not always the case.

Most dimensional designers do the bulk of their work at the second level - the high level dimensional design.

As the model nears completion, they will summarize it at a business level, and then develop the detailed model. In other cases, modelers may work through these levels in sequence.

Regardless of the process, in the end,  these three levels all describe the same dimensional model. Like levels of a Google map, they do so at different levels of detail. All should be accounted for.


For more information on what information to capture at each of these levels, see Star Schema: The Complete Reference.  Detailed descriptions and examples can be found in Chapter 18, "How To Design And Document A Dimensional Model."

Image licensed via Creative Commons 2.0
 from Patrick Hoesley

Monday, March 7, 2011

Q&A: Bridges are Part of Dimensional Modeling

Two recent posts on bridge tables generated a lot of questions.  Here is the most common:
When you use a bridge table, isn't that a snowflake? I thought this was not allowed in dimensional modeling.
The use of a bridge is a limited form of snowflaking.  It is also an important feature of the dimensional model.

Bridges are widely accepted, even among those who would otherwise avoid snowflake designs.

A Limited form of Snowflaking

In a Q&A from last year, I suggested that a snowflake occurs when dimensions are joined to something other than a fact table. By this definition, the examples from the previous two posts are certainly snowflakes.

In the first example, a bridge was used to associate a dimension with a repeating attribute (a company with multiple industries.)

In the second example, a bridge was used to associate a fact with multiple dimension rows (a sale with multiple salespeople.)

In both cases, a standard one-to-many relationship between fact and dimension does not exist.1 The bridge table solves this problem.

This usage is widely accepted. In fact, the bridge table is a central feature of dimensional modeling. I devote two full chapters to bridged designs in Star Schema: The Complete Reference.

Other Relationships are Not Modeled

When a bridge table is employed, most dimensional modelers will still refer to the schema as "a star." That's because the bridge is only used in the situations described above -- when the standard master-detail relationship between dimension table and fact table breaks down.

Other kinds of relationships between data elements are not resolved via multiple tables.  In our examples:
  • Repeating attributes are permitted (work phone, home phone, mobile phone) 
  • Dimensions contain partially dependent attributes (brand information present in a product table)
  • Master-detail relationships are not instantiated into separate tables
  • Values associated with codes are not placed in look-up tables
  • Dimensions are not linked to one another
An all-out snowflake design would employ principles of normalization to eliminate these features.

Other Times to Snowflake?

There may be some other situations where limited snowflaking is acceptable.  One is where a set of attributes (e.g. geographic) would otherwise appear in multiple dimension tables. Another is where software tools (the RDBMS or BI software) work better in a snowflake environment. 

These uses are a bit more controversial, and I wrote about them in a previous post.

When it comes to the bridge table, however, there is no controversy.  The bridge is an accepted technique in dimensional modeling, and is often necessary for specific dimensional designs.

- Chris

1 Bridges may also be used to navigate recursive relationships, as I describe in the book.

Image credit: Gravityx9 licensed under Creative Commons 2.0

Wednesday, May 19, 2010

Kimball's Approach is Top-Down

Ralph Kimball's approach to data warehousing is frequently mis-characterized as being "bottom-up." This post aims to clear up that misconception.

Bus Architecture

Kimball's bus architecture (or dimensional data warehouse architecture) is an enterprise architecture.  At its core, a set of conformed dimensions ensure a consistent representation of standard terms and data elements across multiple subject areas.  The conformed dimensions describe important things like products, customers, locations, or anything of significance to the business.


The subject areas  are called data marts.  They represent things like manufacturing, sales, invoicing, receivables and so forth.  Data marts don't need to be implemented all at once.  They can be implemented one at a time, as part of an incremental program. Data marts also don't need to be stored in a single database (although they may.)  When they are stored in different databases, the conformance bus ensures consistency and compatibility.

Top-Down

Kimball advocates planning a set of conformed dimensions as an up-front (i.e. strategic) activity.  The conformance bus then serves as the blueprint for a set of integrated data marts, which can be built on whatever schedule makes the most sense.

Kimball and Ross put it this way:
During the limited-duration architecture phase, the team designs a master suite of standardized dimensions and facts that have uniform interpretation across the enterprise...We then tackle the implementation of separate data marts in which each iteration closely adheres to the architecture.

- From The Data Warehouse Toolkit, Second Edition
by Ralph Kimball and Margy Ross (Wiley, 2002)
Because it begins with an enterprise-level framework, then delivers departmental functionality, this is a top-down approach. 

Bottom-Up

A bottom-up approach is one that moves in the opposite direction, beginning with a departmental focus and later evolving into one that has an enterprise focus. This occurs when organizations build stand-alone data marts, then later decide to integrate them.  

Stand-alone data marts are designed and built for departmental use, without an enterprise context. They are cheaper in the short-run, offering a fast path to quick results.  Stand-alone data marts also arrive due to mergers and acquisitions, or through packaged software. 

When there is more than one stand-alone data mart, however, they are likely to exhibit incompatibilities and inconsistencies.  They are sometimes labeled "stovepipes." Faced with these inconsistent data marts, some organizations resolve to retrofit them into a conformance framework.  This can be a difficult and expensive process, requiring extensive rework.

When stand-alone data marts are successfully brought into conformance, a bottom-up path has been followed--one that starts with a departmental solution and moves to enterprise capability.   Bottom-up development is cheaper in the short term but more expensive in the long term. 

While the end result may be compatible with Kimball's vision, clearly the route is not.  If this is news to you, you might want to check out his book. (The link appears beneath the quotation above.)  You can also consult posts on data warehouse architectures and common misconceptions.

-- Chris

Image:  PCI Slot by  Ryan_Franklin_az
Licensed under Creative Commons 2.0

Wednesday, February 3, 2010

Dimensional and Relational: Not Opposites

A common misconception holds that the terms dimensional and relational are opposites. They are not.

The word "dimensional" describes a design method. The word "relational" describes a data storage technology.
  • A dimensional model is a design approach that describes  a process in terms of measurements (known as facts) and their context (dimensions)
  • A star schema is a dimensional model implemented using relational storage technology—that is, in a relational database (RDBMS)
  • A cube is a dimensional model implemented using multi-dimensional storage technology—that is, in a multidimensional database (MDB)
This simple diagram illustrates these concepts:

As you can see from the diagram, a star schema is both relational and dimensional. So is a snowflake schema.

By the way, don't let this confuse you:  most modern day DBMS products accommodate both kinds of storage.

Wednesday, December 16, 2009

Dimensional Designs are not Denormalized

I don’t usually bother to take a stand on issues surrounding terminology or theory.  For this post I am going to make an exception.

I often hear people say that a star schema is a denormalized data model.  There is an element of truth hiding in those words, but I think it describes how some people design stars, and not what a star is.  A better description is unnormalized or not normalized. But even these words do not describe all stars.

Regardless of the words you choose, using the vocabulary of normalization is generally a disservice to the novice designer.  And none of these words should ever by used to define the star schema.

Continue reading to learn why.