Monday, December 13, 2010

Deeper into the Accumulating Snapshot

The most popular posts on this blog discuss accumulating snapshot designs. In this post, I'll share some new insights on the technique, which come from my book Star Schema: The Complete Reference.

In particular, I want to debunk some common misconceptions about accumulating snapshots.

The Accumulating Snapshot

One of the three major kinds of fact table, the accumulating snapshot is the least understood and the most underutilized. 

I provided a basic description of the technique several years ago in the first post to this blog.  The basic concept is simple: in this kind of design, each row in the fact table tracks the changes to something over time.  The idea is to facilitate reports that look at how long things spend in various processing stages.

For example, the post mentioned above describes an accumulating snapshot that tracks the processing of mortgage applications.  Each row in the fact table tracks a single application, and is updated each time it completes a processing stage.  It contains foreign key references to multiple dates:
  • Submitted Date
  • Approval Date
  • Processed Date
  • Underwritten Date
  • Settlement Date
Corresponding to each of these milestones is a fact that indicates how long the application spent in that stage.  In the previous post, I called these facts "lags," but I'll suggest a different way of looking at them in a moment.

These facts make it easy to study the average time items spend in each stage. Without an accumulating snapshot, this might require correlated sub-queries, or the comparison of detail rows from multiple fact tables.
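To make that concrete, here is a sketch of the kind of query these facts support. The table and column names (mortgage_application_facts, processing_lag, and so on) are hypothetical, invented for illustration:

    SELECT   d.year_month,
             AVG(f.processing_lag)   AS avg_days_processing,
             AVG(f.underwriting_lag) AS avg_days_underwriting
    FROM     mortgage_application_facts f
    JOIN     day d ON d.day_key = f.submitted_day_key
    GROUP BY d.year_month;

One pass over a single fact table answers a question that would otherwise require correlating rows from several sources.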

Unique Characteristics

Unlike other kinds of fact tables, an accumulating snapshot is intended to have its rows updated.  When an application reaches a new status, for example, the appropriate date is added.  I've touched on this topic in a prior Q and A post.

Another characteristic that differs from other kinds of fact tables is the size of the accumulating snapshot.  It usually has fewer rows than the dimension table representing the item being processed.  That's because the item itself is likely to undergo type 2 slow changes.  I've noted this in a previous post as well.

Lastly, the accumulating snapshot is essentially current valued.  It is used to study the net effects of a process up to and including the current point in time. 

Non Linear Processes

In the past, I've written that the accumulating snapshot is only effective for processes that follow a consistent, predictable series of steps.  (See my previous books in the sidebar to this blog.)

But the technique can easily be adapted to accommodate processes that are not linear.

For example, suppose our mortgage application may bounce back to the processing stage if the underwriter discovers some missing documentation. Conventional wisdom holds that this will "break" the accumulating snapshot, since it violates the standard sequence of status milestones.

This non-linear behavior is easily accounted for. First, we will rename our "lag" facts to represent the number of days spent at each stage:
  • Days awaiting approval
  • Days processing
  • Days underwriting
  • Days awaiting settlement
The ETL process will be developed to check each open application on a daily basis, and increment the appropriate fact for its processing stage.  
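A minimal sketch of that daily increment, using hypothetical table and column names (an open_flag and current_stage maintained on the application dimension, and a days_processing fact):

    -- Run once per day: bump the counter for applications currently in processing
    UPDATE mortgage_application_facts
    SET    days_processing = days_processing + 1
    WHERE  application_key IN
           (SELECT application_key
            FROM   mortgage_application
            WHERE  open_flag = 'Y'
              AND  current_stage = 'Processing');

A similar statement runs for each of the other stage counters, so whichever stage an application occupies today is the only fact that moves.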

This means it does not matter if an application reaches the underwriting stage and then gets sent back to the processing stage.  If this occurs, the ETL process will simply resume incrementing the "days processing" fact.

One thing that will need to be worked out is what dates to use for each milestone. For the application in question, once the processor finishes with it for the second time, what is the processed date?  The first time it achieved this status, or the second?  Business rules must drive these decisions.

Long Duration Processes

We are also accustomed to thinking of accumulating snapshots as appropriate only for short-duration processes.

However, my customers and students have found success in modeling activities of significant duration. Examples include the tenure of faculty members, the life cycle of a service contract, the duration of a mortgage or insurance policy, and the incarceration of a prisoner.

Some of these processes do not have fixed durations, and may prove to span years or even decades. As long as it is feasible for the ETL process to continually update records for active entities in the process, the approach is effective.

Going Even Deeper

Several other topics are worth considering:
  • You may have an operational system that logs very detailed status changes. It is possible to map a set of hundreds of statuses into a few key milestones for an accumulating snapshot. 
  • If the item or entity being processed undergoes slow changes, update its surrogate key in its corresponding fact table row.  This ensures the fact table always points to the most recent version of the item.
  • Some milestones make useful dimension attributes as well. Incorporate the dates into the dimension table as appropriate.  For example, "date of approval" may be a useful characteristic in the mortgage application dimension.
  • For milestones that have not been achieved, don't include a NULL-valued foreign key in the fact table.  Instead, establish a special row in your day dimension for events that have yet to occur.  Set its attributes to "N/A" and give it the highest possible date supported by your DBMS.  This will simplify joins and query predicates.  (A sketch appears after this list.)
  • If at all possible, source the accumulating snapshot from other fact tables that log status changes as transactions.  This will make the ETL process much simpler. 
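Regarding that special date row, a minimal sketch might look like the following; the column names are hypothetical, and 9999-12-31 is simply one common choice for the highest supported date:

    -- Special day dimension row for milestones that have not yet occurred
    INSERT INTO day (day_key, full_date, year_name, month_name, day_name)
    VALUES (0, DATE '9999-12-31', 'N/A', 'N/A', 'N/A');

Fact table rows simply reference day_key 0 until the milestone is reached, so equijoins and date-range predicates continue to work without special NULL handling.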
These topics are all discussed in the book.

You will find this same kind of deep dive in every topic it covers.  You can help support this blog by using the links in the sidebar to order a copy.

Thanks for your support, and feel free to send in your questions!

Image Credit:  Creativity103 via Creative Commons

Friday, November 5, 2010

Q&A: Star vs. Snowflake

A question from Portugal gives me an excuse to talk about snowflakes, why you might want to avoid them, and why you might want to use them.

Q: In your perspective, when does a star schema start to be a snowflake? If you have a snowflaked dimension, do you consider the model a star-schema? Or if you have for example an outrigger, is the model not a star-schema anymore?


I think the question of when a star becomes a snowflake is really one of semantics, but I'll give an answer.

Readers who just want to know when to use snowflake designs can skip ahead a couple of paragraphs.

Is it a Star or Snowflake?

As I said, this is really a semantic issue, but here is my answer. When dimension tables are linked to other dimension tables, or to anything that is not a fact table, the design is a snowflake.

The presence of an outrigger indicates a snowflake design, and so does a 3NF dimension containing 45 tables (and yes, I have seen that!).
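For readers who prefer to see it in table terms, here is a sketch of an outrigger; the table and column names are invented for illustration:

    -- The product dimension references another dimension-like table (an outrigger),
    -- which makes the design a snowflake rather than a pure star
    CREATE TABLE brand (
        brand_key   INTEGER PRIMARY KEY,
        brand_name  VARCHAR(100),
        brand_owner VARCHAR(100)
    );

    CREATE TABLE product (
        product_key  INTEGER PRIMARY KEY,
        sku          VARCHAR(20),
        product_name VARCHAR(100),
        brand_key    INTEGER REFERENCES brand (brand_key)
    );

In a pure star, brand_name and brand_owner would simply be columns of the product table.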

Is a snowflake a star? I don't have a good answer for this.  When I discuss design options with colleagues, we think of it as an either/or choice. But when a design is complete, we refer to each fact table and its dimensions as a "star," even if there are outriggers.

One thing I am sure of is that both types of design are dimensional designs.  (So is a cube, by the way.)

Thanks to Pedro for the question.  Now I'm going to talk a bit more about snowflakes, for those who are interested.

When in doubt, don't snowflake

Best practices dictate that snowflakes should be avoided.  The reasons for this are wholly pragmatic.
  • They increase ETL complexity
  • They increase query complexity, harming "usability"
  • They reduce the understandability of the model
We could get into a discussion on this, but instead I'd like to look at some situations where these guidelines might not apply.

When snowflaking is acceptable

There are some cases where the guideline against snowflaking may be relaxed.
  • Judicious use of outriggers may be acceptable in limited cases where there are relationships between dimensions that must be browsable.  But also consider a factless fact table as an alternative.
  • Outriggers are necessary when there are repeating attributes such as a product with multiple features or a patient with multiple diagnoses.
  • Outriggers are helpful in situations where there is a recursive hierarchy such as departments that contain other departments, or regions that contain other regions.
  • Snowflaking may be justified when your software products require it.  (e.g. your DBMS or reporting tool.)
In the latter case, you are making a strategic decision rather than a design decision, and I recommend involving several points of view in the decision making process--not just designers.

Read more, and help support this blog:

I devote many pages to snowflake designs in Star Schema The Complete Reference.
  • Chapter 7 discusses snowflakes and hierarchies, and has an in-depth discussion of some of the issues touched on in this post.  
  • Chapter 9 discusses the use of outriggers and  bridge tables to capture repeating dimensions or attributes. 
  • Chapter 10 shows how you can use a bridged design to support a recursive relationship to very powerful effect.
  • Chapter 16 looks at some of the implications of the star vs. snowflake decision on your BI software 
If you use the link above (or the links in the sidebar) to order my book, a portion of the proceeds helps to support this blog.   You can also find the table of contents and other info right here.


Image credit: Gravityx9 licensed under Creative Commons 2.0

Friday, October 22, 2010

Q&A: Star Schema and Referential Integrity

Two questions came in this week regarding referential integrity:
Q: Regarding foreign keys (fact to dimension) do you think it is necessary to implement foreign key constraints in a data warehouse?

Q: What are the pros and cons of having referential integrity. Why have key constraints? I mean it looks like they add maintenance burden.
A: 

If you are following the best practices of star schema design, which include the use of surrogate keys, it is not necessary to enforce referential integrity at the DBMS.

Instead, the ETL process will be responsible for maintaining the integrity of relationships between rows of the tables.

Referential integrity is still a top priority.  But responsibility has been moved to the ETL process.

ETL Must Maintain RI

In fact, for a star schema, the ETL process must maintain referential integrity (RI).  There is no way around this, because the ETL process is the source of all key values that will be inserted into the star.

The ETL process must assign surrogate key values in the dimension tables, and transform natural key values to surrogate key values when loading facts.  Any data that would violate referential integrity must be caught by these processes.  If not, the ETL process is not doing its job.
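As a rough sketch of that key transformation during a fact load (all names are hypothetical, and real ETL tools usually perform this lookup in a pipeline rather than a single SQL statement):

    -- Convert the natural key from the source into the surrogate key of the
    -- current dimension row before inserting the fact
    INSERT INTO order_facts (day_key, customer_key, order_dollars)
    SELECT d.day_key,
           c.customer_key,
           s.order_dollars
    FROM   stage_orders s
    JOIN   day      d ON d.full_date   = s.order_date
    JOIN   customer c ON c.customer_id = s.customer_id
                     AND c.current_version = 'Y';

Any staged row whose natural keys fail to match falls out of the inner joins; a production process would route such rows to an error table rather than silently dropping them.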

Turning Off DBMS Enforcement

If surrogate keys are being properly managed by the ETL process, referential integrity checks at the DBMS level become redundant.  In fact, they may even slow the load down.  For this reason, many database administrators turn off referential integrity constraints while the load process is running.
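The mechanics vary by product; in Oracle, for example, a DBA might bracket the load like this (the constraint and table names are invented):

    ALTER TABLE order_facts DISABLE CONSTRAINT fk_order_facts_customer;
    -- ...run the nightly load...
    ALTER TABLE order_facts ENABLE CONSTRAINT fk_order_facts_customer;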

This is possible for a star schema because a single known process updates the data on a predictable schedule.  Contrast that to an OLTP system, where a variety of application modules may alter data on an unpredictable/chaotic schedule. In such an environment, DBMS-level enforcement makes the most sense.

Turning it On Anyway

This does not necessarily mean you should keep referential integrity turned off.  Some ETL developers like to have it turned back on after the load is executed, as a sort of safety net, to catch any errors.  Also, a database administrator may want to turn it on because other features of the DBMS require it.

No Surrogate Keys?

If your design does not include surrogate keys, you will probably want to have the DBMS enforce referential integrity, since your ETL process is not really managing it.

But in this case, referential integrity may be the least of your problems. You are beholden to the operational system's decisions regarding change history, and you may need to use multi-part compound keys.  See last year's posts Do I really need surrogate keys?  and More On Surrogate Keys for more on why you should be using them.


Thanks for the questions, and keep them coming!

Chris

Friday, October 15, 2010

Q&A: Degenerate Dimensions, ETL and BI

A question from a reader about including dimensions in the fact table:

Q:  Question concerning an argument I am having with a colleague. We have a transaction fact table that will have an attribute called "Reason Overpaid". This attribute can only contain one of 10 values. Is it better to create a "Reason Overpaid" dimension and put a FK in the fact table referencing the dimension, or just have the "Reason Overpaid" description in the fact table?
A:  This is one argument I will not be able to settle.  Either approach is fine.

Stored in the fact table, this attribute would be known as a degenerate dimension.  It is perfectly acceptable there, but you may decide to move it to a separate table for other reasons. 

Degenerate Dimension

A degenerate dimension is nothing more than a dimension attribute stored in the fact table. This technique is commonly employed when there is an attribute left over that doesn't really fit into any of the other dimension tables.

Your "Reason Overpaid" attribute can be stored in the fact table as a degenerate dimension.  You can still use it in exactly the same way as any other dimension attribute -- as a way to filter queries, group results, break subtotals, and so forth.

Keeping it in the fact table avoids unnecessary complexity -- a new table and key attribute to manage and load, a new foreign key lookup when processing facts, and most importantly an extra join to include in queries.
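To make the two options concrete, here is a sketch with invented table and column names:

    -- Option 1: degenerate dimension stored directly in the fact table
    CREATE TABLE payment_facts (
        day_key         INTEGER NOT NULL,
        customer_key    INTEGER NOT NULL,
        payment_dollars DECIMAL(12,2),
        reason_overpaid VARCHAR(50)       -- degenerate dimension
    );

    -- Option 2: the attribute moves to its own dimension table,
    -- and the fact table carries a foreign key instead
    CREATE TABLE reason_overpaid (
        reason_overpaid_key INTEGER PRIMARY KEY,
        reason_overpaid     VARCHAR(50)
    );

Either way, the attribute can filter and group payment_dollars; option 2 simply adds a key to manage and a join to write.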

That said, a dimension table for the attribute may make sense in some situations.

Junk Dimension

If there is more than one degenerate dimension, consider moving them all to a separate dimension table. This is called a junk dimension.  The attributes are not directly related to one another and there is no natural key.  It is populated with the Cartesian product of all possible values.
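A sketch of how such a junk dimension might be populated, using invented attributes alongside the reader's "Reason Overpaid":

    -- Build every combination of the low-cardinality attributes
    -- (the Cartesian product) and assign each combination a surrogate key
    INSERT INTO payment_profile (payment_profile_key, reason_overpaid, payment_type, rush_flag)
    SELECT ROW_NUMBER() OVER (ORDER BY r.reason_overpaid, p.payment_type, f.rush_flag),
           r.reason_overpaid,
           p.payment_type,
           f.rush_flag
    FROM       (SELECT DISTINCT reason_overpaid FROM stage_payments) r
    CROSS JOIN (SELECT DISTINCT payment_type    FROM stage_payments) p
    CROSS JOIN (SELECT DISTINCT rush_flag       FROM stage_payments) f;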

ETL Consistency Concerns

If your "Reason Overpaid" will also appear in other fact tables, worries about ETL consistency may arise.

Degenerate dimensions are still OK in this situation, but now two or more fact tables will contain the attribute, and it will be necessary to be sure it is loaded consistently.  Creating a separate dimension table allows the values to be created exactly once, avoiding any problems that might be created by inconsistent ETL processing.

While I would not go to a separate table for this reason, I do understand why many designers opt to do so.  The next situation is a different story.

BI Tool Capabilities

If your "Reason Overpaid" will also appear in other fact tables, the capabilities of your BI software may come into play.

The scenario is this:  you are configuring your BI tool to auto-generate SQL queries for users.  You'd like to have an item they can request called "Reason Overpaid", but the tool does not understand that it can appear in two places in the database schema.

Creating a dimension table for the attribute solves this problem. Both fact tables can link to the same dimension table. The tool can now have a definitive place to go for "Reason Overpaid", and may even be able to use it as the basis for comparing data in two fact tables. This is a strong reason to go with a separate table.

Luckily, many BI tools can be configured to acknowledge that a dimension may appear in more than one place, in which case this is not an issue.  And if you are building cubes for the purposes of BI reporting, you can trust your developers to choose the right attribute.

If you're interested in reading more about how BI tools may influence your dimensional design, be sure to check Chapter 16, "Design and Business Intelligence" in my latest book, Star Schema: The Complete Reference.

Send in your questions

Have a question of your own about dimensional modeling?  Send it to the blog address in the sidebar.

Chris

Friday, October 1, 2010

Q&A: Accumulating Snapshots

A reader asks about 1:1 relationships in a star schema.
I am designing a superannuation data warehouse that will have a claims accumulating snapshot fact table. This table will contain a number of dates that will be filled in when a claim is being processed. From this we will be able to calculate time spans between the various claims processing stages.

It is intended that the fact table will contain a claim_key that will reference a claims dimension (1:1 relationship ).  I have been told by a few people that a 1:1 relationship is not a recommended design model.
- A reader in Australia

A:  There is absolutely nothing wrong with having a 1:1 relationship.  (That's true for both ER models and dimensional models.) 

Accumulating snapshots often cause the kind of disagreement you mentioned.  They are less common.  People mistakenly generalize characteristics of more common designs, assuming they apply to all stars. 

This is natural, but it leads to three incorrect assumptions:
  1. The fact table is on the "many" end of a 1:Many relationship with each dimension
  2. The fact table will have the most rows
  3. Fact table rows are not updated once they are loaded
These assumptions are all false when it comes to an accumulating snapshot.  (In fact, even a transaction fact table may violate the first two assumptions.) 

You've asked about the first one, but let me debunk all three:

  1. By definition the accumulating snapshot will have a 1:1 relationship with one of the dimension tables.  Why? Because its purpose is to track the progression of some discrete item through a business process.  Its grain is almost always stated in terms of a dimension -- e.g. "One Row per Claim."

  2. If the dimension can undergo type 2 slow changes, the accumulating snapshot will likely have FEWER ROWS than the dimension.  This is desirable, and I have written about it before.

  3. The purpose of the accumulating snapshot is to study lags between key events, while avoiding correlated sub queries.  Unlike other fact tables, this requires the row to be updated as more information becomes known about the item being tracked.

I've written about these topics a few times in the past, including the post mentioned above. I also explore these topics in much greater detail in Chapter 11 of Star Schema: The Complete Reference.

In the book, I also show that an accumulating snapshot can be used even when there is not a fixed, linear series of business steps being tracked.

Anyone who is not familiar with the accumulating snapshot is encouraged to learn more.  It is an essential design technique for many common business requirements.

Thanks for the question, and keep them coming...
-Chris


Image Credit:  Creativity103 via Creative Commons

Friday, August 20, 2010

Surrogate Keys in a CIF

In this Q&A, two more questions about surrogate keys.  Here's the first:
Q:   In a Corporate Information Factory architecture, if the EDW contains a change history for every table, do data marts need surrogate keys?
- Los Angeles, CA
A:  Yes. 

This is similar to a question from last year, in which change history was tracked in an operational system. This time, we're looking at an EDW tracking changes.

First, a little background on the CIF for the uninitiated.

In a Corporate Information Factory (CIF) the Enterprise Data Warehouse (EDW) is an integrated repository of atomic data.  The tables in this repository are normalized.  Data marts take information from this repository and structure it in a dimensional format for departmental access.

Dimension tables are not normalized.  In a CIF, a given dimension may draw data from many tables in the EDW.

For example, a Company dimension might pull data in from a dozen or more normalized tables that include the information on the company, its locations, accounts, lines of business, market data and contact information. 

If the dimension relied exclusively on the unique identifiers of these tables to identify rows, it would require a multi-part key with numerous components.

And if, as you say, your EDW tracks change history to each table, this multi-part key may be compounded by version numbers or date stamps from each source table. This is impractical.  (It also makes it difficult to support type 1 responses in the data marts.)

A related question sometimes comes up in my classes:

Q:  Can I maintain surrogate keys in the EDW?

A:  No.  

Most dimensions do not correspond to a single entity in the EDW, which is normalized, so there is nowhere to store and maintain surrogate keys.  Slow change processing and surrogate key management apply to the dimensional representation of data.

You can maintain a central repository of conformed dimensions, and you can even co-locate them with your EDW.  In fact, this is a good idea.  These master dimension tables can be replicated to your data marts, in whole or in part.

This ensures compatibility across subject areas.  It is a common misconception that conformed dimensions are only important in Kimball's bus architecture.  Even in a CIF, you may need to combine information from multiple data marts.

- Chris

Image:  "Keys" by Takacsi75, licensed under Creative Commons 2.0.

Star Schema: The Complete Reference

Chris Adamson's definitive guide to dimensional design is now available.

Use these links to order from Amazon in your country:


About the Book

This handbook is organized around dimensional design topics rather than vertical industry, so you can find what you need when you need it.

Architecture-neutral, the book’s practical advice can be put to use in any type of data warehouse, including those advocated by W.H. Inmon and Ralph Kimball.

For each design topic, coverage runs deep.  You will learn best practices along with the reasons behind them.  Each concept is explained through real-world examples.  There are over 175 illustrations, including example design diagrams and data instance charts.

The book explains the impact of design choices on BI and ETL processes, and explores situations where you may wish to deviate from best practices.

It provides a comprehensive set of design tasks and deliverables that can be incorporated into any project, regardless of architecture, scope or methodology.

To learn more, see the FAQ.

Contents of the Book

Coverage begins with fundamentals, then follows a logical progression through advanced topics that address real-world complexity. Beginners can read it cover-to-cover, while experts can jump straight to topics of interest.

The book's eighteen chapters are divided into six parts:

      Part I: Fundamentals

      1.  Analytic Databases and Dimensional Design
      2.  Data Warehouse Architectures
      3.  Stars and Cubes

      Part II: Multiple Stars

      4.  A Fact Table for Each Process
      5.  Conformed Dimensions

      Part III: Dimension Design

      6.  More On Dimension Tables
      7.  Hierarchies and Snowflakes
      8.  More Slow Change Techniques
      9.  Multi-Valued Dimensions and Bridges
      10.  Recursive Hierarchies and Bridges

      Part IV: Fact Table Design

      11. Transactions, Snapshots and Accumulating Snapshots
      12. Factless Fact Tables
      13. Type-Specific Stars

      Part V: Performance

      14. Derived Schemas
      15. Aggregates

      Part VI: Tools and Documentation

      16.  Design and Business Intelligence
      17.  Design and ETL
      18.  How to Design and Document a Dimensional Model

How to Order

If you’ve enjoyed Chris's other books, classes, or this blog, please consider supporting this effort.

Use these links to order from Amazon in your country:


 For more info, visit the FAQ on the book.


Star Schema: The Complete Reference
By Christopher Adamson
ISBN 0071744320 / 9780071744324
McGraw-Hill Osborne Media
486 pages
Also available as eBook

Thursday, August 12, 2010

Book Signing 8/17 at TDWI San Diego

If you are attending the TDWI summer conference in San Diego, be sure to stop by my book signing.

I'll be signing my new book in the main exhibit hall on Tuesday (8/17) from 11:30am - 12:30pm.  The book will be available at a decent discount, or you can bring your own copy.

It's also not too late to sign up for my course in intermediate and advanced dimensional design, which will be taking place the same day.

Hope to see you there!

Monday, August 2, 2010

FAQ on Star Schema: The Complete Reference

My third book, Star Schema: The Complete Reference, is now available.

I took a year off from work in order to write this book.  It has an immense amount of detail. 

If you’ve enjoyed my other books, my classes, or this blog, please consider supporting this effort.

Use this link to order a copy from Amazon.com.  Or use the links in the sidebar. 

Here are answers to some questions I have gotten about the book.

Why another book on Star Schema?

I often want to refer people to something they can read on a particular aspect of dimensional design.  Usually, this is harder than it should be.

There are some great books on star schema, but they are organized into chapters based on business cases (my own prior work included).  You can’t open up a book like that and turn to “the chapter” that covers a particular design technique.

If you want to read about snapshot designs, for example, you’ve got to flip back and forth between chapters about inventory, banking, budgeting, etc.

Also, all books target a particular data warehouse architecture—either Inmon’s "Corporate Information Factory" or Kimball’s dimensional “bus architecture.”  Since most of the principles of dimensional design are universal, this can get in the way.

My aim was to create the missing reference on star schema design, and to make it useful to anyone who works with dimensional data—stars, snowflakes or cubes.

How is this book different?

It’s structured into chapters and sections based on design topics, instead of by industries or business scenarios.  This makes it easy to find everything on a particular topic.

It’s also architecturally neutral.  It provides design techniques and best practices without advocating a specific approach to data warehousing.

It provides deep coverage.  It explains best practices and fully explores the reasoning behind them.  It looks at the impact of each technique on BI and ETL processes, and also explores situations where you may wish to deviate from best practices.

Friday, July 2, 2010

Price Waterfall Analysis and Dimensional Design

Many businesses rely on information in the data warehouse to conduct price waterfall analysis.  There is no single way to design a schema to support waterfall analysis; your solution will vary according to the waterfall model being used and the availability of data elements.

Waterfall Analysis

Waterfall analysis represents pricing as a series of "leakages" or deductions from list price, usually grouped into major categories, with various intermediate assessments of price.  Target margins are also sometimes included.  The journey from "list price" to "pocket margin" may be expressed in terms of unit prices, average unit prices, or as percentages.

In the diagram below, for example, waterfall analysis represents "list price" as 100%.  Several types of "leakages" appear as percentage reductions.  The various approximations of price are shown in dark blue; leakages are shown in light blue.  Targets appear in red. 

"Invoice Price," for example is what appears on customer orders, and factors in several discounts.  "Net Price" factors in post-order discounts.  "Product Price" factors in sales and marketing costs, and "Pocket Margin" is the end result of the analysis.

Supporting Waterfall Analysis from a Dimensional Model

From a dimensional design perspective, waterfall analysis is similar to profitability analysis.  There is not a single, "best way" to support waterfall analysis. The sophistication of your solution will vary based on the complexity of the waterfall model and the availability of data.

It is best to start by thinking about the different prices and leakages as additive facts, which implies extended amounts.  While prices and leakages can be specified as unit amounts, waterfall analysis is most flexible when it can be conducted across products, time periods or geography.  Storing unit amounts in a dimensional schema does not serve this well.   Instead, think about the waterfall components as facts that represent extended amounts.  These are fully additive, and can be summarized with maximum flexibility.  (Some graphics of waterfall analysis use unit amounts, but these are computed based on volume totals, rather than the reverse.)

A Value Chain

If thought of as additive facts, the various prices and leakages can be associated with one or more processes, each potentially represented by a fact table.  Together, these fact tables form a value chain, related by a set of common dimensions.  By combining the data from these stars, the waterfall can be constructed.

Next, look at your business's waterfall model itself.  Are the data elements available?  At what level of detail?  Do some represent allocations?  What are the allocation rules?  This will drive the design of stars that capture key components of the waterfall.

In the illustration, the first several columns may be available directly from an order-entry system.  We can envision a single order_facts star, which contains facts capturing these extended amounts for each order line:  list price, volume discounts, corporate discounts, promotion discounts, and invoice price.
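A sketch of what such a star's fact table might look like; the grain is one row per order line, and all names are invented for illustration:

    CREATE TABLE order_facts (
        order_date_key     INTEGER NOT NULL,
        product_key        INTEGER NOT NULL,
        customer_key       INTEGER NOT NULL,
        order_line_id      VARCHAR(20),      -- degenerate dimension
        list_price         DECIMAL(14,2),    -- extended amounts: fully additive
        volume_discount    DECIMAL(14,2),
        corporate_discount DECIMAL(14,2),
        promotion_discount DECIMAL(14,2),
        invoice_price      DECIMAL(14,2)
    );

Because every column is an extended amount, each one can be summed across any combination of dimensions.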

The order entry system may also support rebate processing, allowing the next two facts to be stored in the same table:  rebates and net price.  Or, it may be that these are computed at different periodicity and/or in a different system, in which case they may belong in a separate star.  If stored separately, tying them to the original order will allow them to correlate to an order date rather than the time of the rebate, allowing for "lagged" and "unlagged" flavors of analysis. 

Allocations

As in traditional profitability analysis, some components of waterfall analysis may be allocated.  Where this occurs, you must be sure that there is a definitive source of data.  Alternatively, there must be an agreed upon and consistent set of business rules used to construct allocations.  (Remember that data warehouses publish data; they do not manufacture it.)

The example above requires marketing and advertising costs be available by (or allocated to) quarter, product category and region.  This may be less granular than the data that supported the columns to their left.  Conformed dimensions will ensure that all data can be expressed at a common level of detail.

If there is no source of data or business rule, the desired model cannot and should not be supported by the data warehouse.  It is up to the business to develop the processes and systems to define these data elements, not the data warehouse.

Targets

You may also have targets for the various prices, which are specified by planners as percentages.  These will also need a home somewhere.  They are really non-additive facts, and will likely be stored in fact tables that associate them with time periods and plan version.  Alternatively, they may be incorporated into reports by hand.

Pulling it Together

Waterfall analysis reports, like the one above, can be constructed by retrieving data from the constituent stars and combining it based on the common dimensions (such as products, time, or regions.)  This process is sometimes referred to as "drilling across."
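A sketch of a drill-across query, with invented star and column names; each star is aggregated to the common grain first, and then the result sets are merged:

    SELECT COALESCE(o.product_category, m.product_category) AS product_category,
           o.invoice_price,
           m.marketing_cost
    FROM  (SELECT p.product_category, SUM(f.invoice_price) AS invoice_price
           FROM   order_facts f
           JOIN   product p ON p.product_key = f.product_key
           GROUP BY p.product_category) o
    FULL OUTER JOIN
          (SELECT p.product_category, SUM(f.marketing_cost) AS marketing_cost
           FROM   marketing_facts f
           JOIN   product p ON p.product_key = f.product_key
           GROUP BY p.product_category) m
      ON o.product_category = m.product_category;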

It is also possible to construct a single "derived" star (or cube), in which this consolidation is handled in the ETL process, rather than the reporting process.   This makes the analysis much easier to perform, since much of the hard work is taken care of in advance.

Variation in the Model

Some businesses have models that vary over time or according to products or categories, and may have support for waterfall analysis incorporated directly into operational systems.  This is accomplished on the operational side by setting up a breakdown of each order line according to a set of configurable waterfall components.

Such an operational solution may translate into a dimensional design that represents each transaction line as a series of rows in a single fact table, each line containing a single dollar amount. Each row represents one of the components in the waterfall chart.  A dimension determines which component is represented by each row.

This design approach offers nice flexibility, as the model can be changed without altering the design.  However, the reports themselves may require alteration as the model is refined.  Furthermore, the star itself is limited in its utility; its single fact must be carefully qualified each time it is aggregated.

Wednesday, May 26, 2010

Avoid Using Dates as Keys

This post explores the impact of using a date as the primary key for a dimension table. 

Q: I'd like to hear your thoughts on using dates as keys.  Is it really necessary to have a surrogate key for a date dimension?

Hartford, CT

A:  It is not absolutely necessary.  But you should use a surrogate key anyway.  Here's why.

Suppose you have a DAY dimension which contains one row for each calendar date.  In addition to the full date, it contains columns that represent the year, quarter, month, day of week, holiday status, and so forth.  It also contains columns that represent frequently used combinations of these various attributes.

Users (or developers) can use the columns to do things like filter queries, aggregate facts, sort results, and drive subtotaling.  Database administrators can index these columns to optimize these activities.

You might be tempted to use the full date as the primary key of the table.  This urge is understandable -- why bother creating an additional column called DAY_KEY?  Seems like extra work for nothing, right?

Consistency and Performance Issues

Remember that the primary key of a dimension table will also appear in fact tables.  If a date is used, users or developers will eventually try using it as a "shortcut" in queries, avoiding a join to the DAY table.

This leads to inconsistencies.  Developers may mean well, thinking they are "saving a join."  But they may apply different rules to compute things like fiscal periods.

Use of a day table guarantees consistent computation of all the characteristics of each day.  All the columns are computed exactly once, during the ETL process.  Everyone who uses them gets consistent results.

Moreover, avoiding the DAY table may generate performance issues.  If a report developer wants to filter for a particular day of the week, they may apply SQL date comparison functions to the day column in the fact table.  The database will probably have to make this comparison for each row in the fact table.

A day of the week in the DAY table is far more efficient, because it is pre-computed and indexed.
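The difference is easy to see in SQL; the names are hypothetical, and the date-function syntax varies by DBMS:

    -- The shortcut: a date function applied to every fact row
    SELECT SUM(order_dollars)
    FROM   order_facts
    WHERE  EXTRACT(DOW FROM order_date) = 1;   -- "Monday" in some dialects

    -- The day dimension: a pre-computed, indexed attribute
    SELECT SUM(f.order_dollars)
    FROM   order_facts f
    JOIN   day d ON d.day_key = f.day_key
    WHERE  d.day_of_week = 'Monday';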

You may think you can prevent these mishaps.  But you cannot.  Sooner or later, they will occur. Dates in the fact table will beckon.  A user, developer, or consultant will try to take a shortcut.  Perhaps under your watch, or perhaps after you have gone.  

Guarantee consistency and performance by using a surrogate key.

If you are curious, read on for some other, less important reasons.

Wednesday, May 19, 2010

Kimball's Approach is Top-Down

Ralph Kimball's approach to data warehousing is frequently mis-characterized as being "bottom-up." This post aims to clear up that misconception.

Bus Architecture

Kimball's bus architecture (or dimensional data warehouse architecture) is an enterprise architecture.  At its core, a set of conformed dimensions ensures a consistent representation of standard terms and data elements across multiple subject areas.  The conformed dimensions describe important things like products, customers, locations, or anything of significance to the business.


The subject areas  are called data marts.  They represent things like manufacturing, sales, invoicing, receivables and so forth.  Data marts don't need to be implemented all at once.  They can be implemented one at a time, as part of an incremental program. Data marts also don't need to be stored in a single database (although they may.)  When they are stored in different databases, the conformance bus ensures consistency and compatibility.

Top-Down

Kimball advocates planning a set of conformed dimensions as an up-front (i.e. strategic) activity.  The conformance bus then serves as the blueprint for a set of integrated data marts, which can be built on whatever schedule makes the most sense.

Kimball and Ross put it this way:
During the limited-duration architecture phase, the team designs a master suite of standardized dimensions and facts that have uniform interpretation across the enterprise...We then tackle the implementation of separate data marts in which each iteration closely adheres to the architecture.

- From The Data Warehouse Toolkit, Second Edition
by Ralph Kimball and Margy Ross (Wiley, 2002)
Because it begins with an enterprise-level framework, then delivers departmental functionality, this is a top-down approach. 

Bottom-Up

A bottom-up approach is one that moves in the opposite direction, beginning with a departmental focus and later evolving into one that has an enterprise focus. This occurs when organizations build stand-alone data marts, then later decide to integrate them.  

Stand-alone data marts are designed and built for departmental use, without an enterprise context. They are cheaper in the short-run, offering a fast path to quick results.  Stand-alone data marts also arrive due to mergers and acquisitions, or through packaged software. 

When there is more than one stand-alone data mart, however, they are likely to exhibit incompatibilities and inconsistencies.  They are sometimes labeled "stovepipes." Faced with these inconsistent data marts, some organizations resolve to retrofit them into a conformance framework.  This can be a difficult and expensive process, requiring extensive rework.

When stand-alone data marts are successfully brought into conformance, a bottom-up path has been followed--one that starts with a departmental solution and moves to enterprise capability.   Bottom-up development is cheaper in the short term but more expensive in the long term. 

While the end result may be compatible with Kimball's vision, clearly the route is not.  If this is news to you, you might want to check out his book. (The link appears beneath the quotation above.)  You can also consult posts on data warehouse architectures and common misconceptions.

-- Chris

Image:  PCI Slot by  Ryan_Franklin_az
Licensed under Creative Commons 2.0

Tuesday, April 27, 2010

Basics: What is a Dimensional Model

In a recent post, I asserted that a star schema is the implementation of a dimensional model in a relational database.  Many readers may have missed this point, because it was embedded in a conversation about the term normalization.

Here, then, is a clutter-free discussion of the dimensional model.

Activities and Conditions

A dimensional model is a structured framework for measurement.   

Usually, a dimensional model describes a process or activity.  For example, a retailer might design a dimensional model of sales transactions.

A dimensional model may also describe conditions, as measured at predefined intervals. A water utility might design a dimensional model of reservoir levels, measured daily.

Facts and Dimensions

A dimensional model describes measurement of a process (or conditions) through facts and dimensions.

Facts are the measurements.  For the retailer's sales activities, facts that measure each sales transaction include the quantity sold and the price paid.  For the utility's reservoir conditions, a single fact is measured: gallons on hand.

Facts are not useful without context.  For example, "one hundred units sold," is a measurement of quantity sold, but it has no context.  One hundred of what?  When were they sold?  Where?

Dimensions are used to describe the context of facts.  For the retailer, dimension values provide context for each measurement.  These include the date and time of each purchase, the product sold, and the store in which it was purchased. 

For the utility's model of reservoir status, dimensions include the date the measurement was taken, the reservoir or facility where the measurement was taken, and the inspector who measured it.

If you can define a process, facts and dimensions, you have the core of a dimensional model.  Of course, there is a bit more to it than that (the concept of grain is also essential), but that's the basic idea.

Uses

The dimensional model is most famous for its role as the basis for a star schema.  Before we get to that, though, it's important to recognize that it has other uses.

First, a dimensional model is an excellent way to describe requirements for an analytic system.  A model of process measurement is far more efficient and flexible than a list of specific business questions.  A single dimensional model may be able to answer thousands of questions, including some that have not yet been thought of.

A dimensional model is also an excellent tool for planning your data warehouse strategy, and for managing the scope of implementation projects.  This is particularly true when the model is translated to a database design and linked to sources of data.  (I've written about this before.)
 
Stars, Snowflakes, Cubes

A dimensional model can serve as the basis for a database design.  This, of course, is what it is famous for.  (If you can even say it is famous, that is.)
  • When a dimensional model is implemented in a relational database, it is called a star schema (or sometimes a snowflake schema.)  
  • When implemented in a multidimensional database, it is called a cube.  (I touched on this previously, as well.)
If you work with any of these things, you are working with a dimensional model.

- Chris

Image: Slide together Polyhedra by fdecomite
Licensed under Creative Commons 2.0

Monday, April 5, 2010

TDWI Members: Read Chris's Column in Flashpoint

If you are a member of The Data Warehousing Institute, be sure to check out the April 1, 2010 issue of Flashpoint.

My article The Hidden Value of Dimensional Design explains how you can use dimensional design to cultivate a shared understanding of project scope between business and technical personnel.

This publication is only available to TDWI members.

If you are not a member (or if you want to read more on the topic) have a look at this blog post I wrote in 2007:  Drive Warehouse Strategy With A Dimensional Model.

- Chris

Sunday, March 28, 2010

Pre-order Chris's New Book

Star Schema: The Complete Reference is now available for pre-order at Amazon.com in the US and in several other countries.

[ UPDATE 8/20/10:  Now shipping!  Use the links below to order. For more information, see this post.]

Use these links to pre-order your copy from Amazon in your country:
* At the time of this post, Amazon Canada was not yet taking pre-orders.  However, you can sign up to be notified when pre-ordering is possible.

    Star Schema: The Complete Reference
    By Christopher Adamson
    ISBN 0071744320 / 9780071744324
    McGraw-Hill Osborne Media

    Star Schema: The Complete Reference is Chris Adamson's comprehensive guide to dimensional modeling, written for anyone who works with star schemas, snowflakes or cubes. Its breadth and depth make it the definitive reference, ideal for beginners and experts alike.

    Offering clear and easy to understand explanations of complex topics, it teaches the best practices of dimensional modeling and their underlying rationales, equipping you to make informed design decisions.

    This reference is organized around dimensional design topics rather than vertical industry, so you can find what you need when you need it.

    Architecture-neutral, the book’s practical advice can be put to use in any type of data warehouse, including those advocated by W.H. Inmon and Ralph Kimball.

    Coverage begins with fundamentals, then follows a logical progression through advanced topics that address real-world complexity. All concepts are illustrated through real-world examples.

    The book also explores the impact of design decisions on the BI and ETL processes, and discusses situations where it is useful to deviate from best practices.

    The final chapter provides a comprehensive set of tasks and deliverables that can be incorporated into any project, regardless of architecture, scope or methodology.

    Tuesday, March 16, 2010

    Storing Nonadditive Facts [Q&A]

    It can be challenging to decide what to do with a nonadditive fact. A reader asks if (and when) it might make sense to store a percentage in a fact table:
    Chris,

    What are your thoughts on storing percentages in fact tables?  Many of the calculations seem too complex to do in the BI tool. For instance, we calculate life-to-date percentages of KPIs on products.

    Thanks,
    Adrienne
    Burlington, NC
    Percentages are (usually) non-additive facts.  What this means is that you cannot add them together.  For example, if a computer is sold at 5% margin, and then another is sold at 5% margin, it is a mistake to say that total margin for the two sales is 10%. 

    In introductory texts or courses, you are taught to decompose them into fully additive components.  The idea is that you can sum all the components, then compute the ratio at the last moment. The margin example above is a good application. Storing cost and sales price allows margin to be computed at any level of aggregation.
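    A sketch of that decomposition in action, with hypothetical names; because cost and price are fully additive, the ratio can be computed at whatever level the query groups by:

        -- Margin computed from additive components at the category level
        SELECT p.product_category,
               SUM(f.sales_dollars - f.cost_dollars) / SUM(f.sales_dollars) AS margin_pct
        FROM   sales_facts f
        JOIN   product p ON p.product_key = f.product_key
        GROUP BY p.product_category;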

    However, this only really works when all the components for a particular ratio are stored in the same row, and used within the same scope of aggregation.   As you have discovered, there are some non-additive facts that do not work this way.  A ratio that compares something to a larger pool or category is a good example.  Two or more aggregations are needed to compute this kind of metric.

    It also sounds as if your problem is compounded by the need to use some kind of running-totals in the computation, which represent the lifetime-to-date sums.  These are also problematic because they are also non-additive, or possibly semi-additive.

    Non-additive measurements that are computed from different aggregations are often stored in fact tables for exactly the reason you give: they are difficult to automate in a BI tool. They are also precomputed in situations where they would otherwise harm performance.

    If you decide to pre-compute and store a non-additive fact, there are usually two questions that follow:  Where to store it, and how to use it.  The issues here are very similar to those surrounding period-to-date measurements, as I discussed in a post last year. 

    Where to Store a Nonadditive Fact

    Figuring out the answer to this question is closely linked to how it will be used.  If it is needed at a transaction-level, it must go in a transaction-grained fact table.

    However, if it also needs to be known at a daily or monthly level, that solution is not enough.  Since it is non-additive, it must be computed separately for these aggregations.  A periodic snapshot makes sense here.  Consider forgoing the transaction-level version, unless it is needed as well.

    Also note that "time" is not the only dimension across which we might need a summarized non-additive fact.  For example, you might also want to see it across product categories, customer categories, and so forth.  Each would require its own snapshot-style fact table for discrete storage of the pre-computed fact.

    In fact, this is the biggest problem with storing non-additive facts:  each potential aggregation must be stored separately.

    Nonadditive Facts and BI Software

    Because they are nonadditive (and may involve constituents that are semi-additive), these kinds of facts must be used carefully.  For example, they should never be summed, subtotaled, etc.  Similarly, if we need to show them at different levels of detail, we need to fetch each aggregation separately.

    This is relatively straightforward if the schema is well thought out, well documented, and fully understood by report developers.

    However, it is very hard to make this kind of fact available in a "Self-Service" mode for end users, since most BI tools will easily break both rules.  (BI tools can usually handle simple non-additive facts, like the margin example at the beginning, but not facts that are computed from multiple aggregations.)

    Alternatives may include:
    • Hiding the non-additive fact from users while exposing it to developers
    • Classifying it as a dimension so that it is not aggregated,  or 
    • Blocking certain features of the tool.
    The fact that a non-additive fact may be stored in multiple locations (corresponding to different levels of aggregation) may also befuddle your BI software.
    • It may be possible to make use of  "aggregate navigation" features to teach it to fetch a fact from more than one possible location, but you will need to make sure that the tool always gets it right.  In addition, using such a feature may prevent you from "turning off" the tool's ability to aggregate a fact.  
    • Separate semantic layers for developers vs. users may be the solution to these problems.
    Again, these considerations are less important in cases where trained developers do all the report building.  It is only when leveraging SQL-generating BI tools that you need to worry.

    Another Option

    In some cases, non-additive facts can be linked to a specific member of a dimension, such as an individual customer or  product.  In this case, consider storing it in the dimension table, as a behavioral dimension.  This is most effective when the measurement in question can be refreshed on a pre-defined schedule.

    Thanks to Adrienne for her question, and for consenting to have it reprinted here!

    - Chris

    Do you have a question about dimensional design or data warehousing?  Send it in.  I answer all my email, though sometimes it takes a while.

    Image is by Patrick Hosely licensed under Creative Commons 2.0

    Thursday, March 11, 2010

    "The Star Schema Handbook" Gets New Name, Coming Soon

    I am pleased to announce my next book:  Star Schema: The Complete Reference.   

    It is being published by McGraw-Hill, and should be available this summer. McGraw-Hill has posted some of the publication details and availability information.

    [ UPDATE 8/20/10:  Now shipping!  Use the link below to order in the US. For other countries, or for more information, see this post.]

    Amazon.com is not taking pre-orders yet, but they have added a page for the book, where you can sign up to be notified when it becomes available.

    The book was formerly titled The Star Schema Handbook: The Complete Reference To Dimensional Data Warehouse Design 

    If you pre-ordered under the old title, you'll need to place a new order.

    Many thanks to all those of you who have been sending in inquiries!
    I will provide more information about the book during the weeks to come.

    - Chris

    Tuesday, March 2, 2010

    Just the Facts?

    A reader asks:
    I’m having a problem in picking out facts. For example, are all dates facts -- start date, end date and DoB?  What about personal data like post code, job title, status (active/left), etc.? Would you say that if you can't do math on it, it's not a fact?
    - [Name Withheld], UK
    Good question.  It allows me to cover the basics, as well as touch on some advanced topics.

    First, the Basics..


    Facts vs. Dimensions

    Facts are measurements.  I really prefer that word, because it is a bit more descriptive.  But the rest of the world already knows "Facts".

    Examples that are clearly measurements:
    • order dollars
    • gross pay
    • quantity sold
    • account balance
    In the original question, all the attributes mentioned are dimensions.  (Dates, personal data, and so forth.)

    Dimensions are used to give facts context.  You use dimension values to filter or group facts.  Here are some in action:
    • order dollars (fact) by order date (dimension)
    • gross pay (fact) by department code (dimension)
    • quantity sold (fact) for January (dimension value)
    You can usually "do math" on facts.  The most common example is adding them together, or aggregating them, as might be done in the examples above.

    Don't use that as a criterion to identify facts, though.  Many numbers are not facts. And, to make matters more complicated, some non-numbers are facts.

    Many Numbers are Not Facts

    Some numbers are really dimensions.  You can tell because we use them to give context to facts.  A numeric department code is obviously a dimension.  We wouldn't normally add the values together, but we might use them to break out a measurement -- for example: budget (fact) by department (dimension).

    Less obvious are things like "unit price".  It sounds like it might be a fact. After all, we can certainly add up money, right?   Actually, when we add up sales, we don't add up unit prices, we add up "extended prices".  Extended price is a fact.  It is equivalent to unit price times quantity.

    Look again, and you will also see that unit price behaves like the other dimensions we looked at.  We can use it to group or filter facts. "How many did we sell at a unit price of 10 cents?" That question filters a fact (quantity sold) by a dimension value (10 cent unit price).

    Some Facts are Not Numbers

    There are also facts that are not numbers.  These are rare, but they do occur.

    Test results or grades are the most common example.  You can't do math on "pass" or "fail," but they are measurements.  Facts like this are sometimes called "text facts."  A test_result fact that can take on the values "pass" or "fail" might be stored in a fact table that contains a row for each student taking a particular test on a particular day.

    If they take on a relatively low number of possible values, consider converting text facts into discrete facts.  In this case, we might replace test_result with two facts:  number_passed and number_failed.  For each row in the fact table, one of these will take on the value 1, the other the value 0.  This may seem wasteful, but now it is easy to get a sum(number_passed) or sum(number_failed).
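    A sketch of that conversion during the load, with hypothetical names:

        -- Convert the text fact into two discrete, additive facts
        INSERT INTO test_result_facts (student_key, day_key, test_key, number_passed, number_failed)
        SELECT student_key,
               day_key,
               test_key,
               CASE WHEN test_result = 'pass' THEN 1 ELSE 0 END,
               CASE WHEN test_result = 'fail' THEN 1 ELSE 0 END
        FROM   stage_test_results;

    Once stored this way, SUM(number_passed) and SUM(number_failed) behave like any other additive facts.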

    Some Stars Have No Facts

    If you don't have any facts, you can still have a fact table.  Sometimes, the mere occurrence of an event is all you need to know about.  For example, you might use a fact table to track the fact that an employee was present on a given day.

    Factless fact tables can also track conditions.  For example, each year, circumstances might make employees eligible to participate in particular programs.   Aside from the dimensions (Employee, Year and Program) there might not be any facts to record.  That's OK, you can still count up rows that meet various conditions.

    - Chris

    Do you have a question?  Send it in.  I answer all my email, though sometimes it takes a while.

    Image: Jack Webb as Joe "Just the Facts" Friday,
    from the Public Domain

    Wednesday, February 3, 2010

    Dimensional and Relational: Not Opposites

    A common misconception holds that the terms dimensional and relational are opposites. They are not.

    The word "dimensional" describes a design method. The word "relational" describes a data storage technology.
    • A dimensional model is a design approach that describes  a process in terms of measurements (known as facts) and their context (dimensions)
    • A star schema is a dimensional model implemented using relational storage technology—that is, in a relational database (RDBMS)
    • A cube is a dimensional model implemented using multi-dimensional storage technology—that is, in a multidimensional database (MDB)
    This simple diagram illustrates these concepts:

    As you can see from the diagram, a star schema is both relational and dimensional. So is a snowflake schema.

    By the way, don't let this confuse you:  most modern day DBMS products accommodate both kinds of storage.

    Thursday, January 14, 2010

    Accumulating Snapshots: Are They Necessary? [Q&A]

    A reader asks the following question:

    Q: We have an accumulating snapshot that tracks trouble tickets from submission to closure. The fact table has one row for each trouble ticket. Why do we need it? Can't we just place all this information in the trouble ticket dimension?
    A: You probably do need the fact table, but it might be worth putting some of the information in the dimension table as well.

    Before I explain, here is a quick refresher on the accumulating snapshot.

    Accumulating Snapshot

    An accumulating snapshot is a kind of fact table used to correlate activities surrounding a specified item. In this case, that item is a trouble ticket.


    Accumulating snapshots are useful because they simplify the queries necessary to study elapsed time between events. In the reader's case, that might be the time between submission and closure of a ticket (there are probably some interim steps as well).

    On the operational side, these events are probably recorded in different rows, and maybe even in different systems. Figuring out the days between these events can be messy, especially if averaging results for a large number of tickets. Correlated subqueries will be required, and performance may be poor.

    The accumulating snapshot does the correlation of these activities ahead of time, as part of the ETL process. In this case, it has one row for each ticket. Presumably, there are foreign keys that identify the ticket, the date submitted, and the date closed. An optional fact can record the elapsed time between these dates. This fact is very easy to aggregate and group by any dimension of interest. No need for correlated subqueries.

    You can read more about accumulating snapshots in a prior post.

    The Fact Table is Necessary

    The reader has noted that the fact table has one row per trouble ticket. (That's a good statement of grain, by the way.) Since there is only one row for each ticket, it might seem that the content of the fact table could simply be placed in the ticket dimension.

    This is probably not a good idea. Though your fact table has one row per ticket, your ticket dimension may have more than one, due to type 2 changes that track history.  For a particular trouble ticket, when a type 2 change occurs, a new row is added for the new version of the ticket. A given ticket may therefore have more than one row in the dimension.

    If you were to place a fact in this dimension table, like the days between submission and closure, it would be repeated for each version of the same ticket. Aggregating it would result in double counting.

    The fact is better placed in the fact table, where it can safely be summed, averaged, or whatever is necessary.

    (By the way, the same would occur if you placed the two dates in the dimension table. They would repeat for each version of the ticket. If you try to aggregate the elapsed time between the dates, the same ticket would be counted multiple times -- once for each version.)

    Place Information in the Dimension Also (Not Instead)


    Information tracked by the accumulating snapshot may also be useful in the dimension table. But here, its purpose is to serve as dimensional data.  Not as a replacement for the fact table.

    For example, the dates the ticket was opened and closed could make helpful type 1 attributes in the ticket dimension.  This is especially useful if the dimension is connected to other fact tables, such as one that tracks phone calls to the support center. The dates can be used to group call counts in interesting ways. For example, "How many support calls were generated by tickets opened on January 1 vs. January 2."

    If you want to, you can also place the elapsed days fact in the dimension table. Here, it should be used strictly as a behavioral dimension -- not as a fact.  Use it to group results. Don't try to use it as a fact; that would cause double counting. You might even want to convert it into bands to avoid confusion (1-5 days, 6-10 days, over 10 days.)
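    A sketch of that banding, assuming a hypothetical days_to_close value is available when the dimension row is refreshed:

        -- Behavioral dimension: banded elapsed days, used only to group or filter
        UPDATE ticket
        SET    resolution_band =
               CASE WHEN days_to_close <= 5  THEN '1-5 days'
                    WHEN days_to_close <= 10 THEN '6-10 days'
                    ELSE 'Over 10 days'
               END;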

    - Chris
    Image Credit:  Creativity103 via Creative Commons

    Do you have a question of your own? Send it in. I'd be happy to answer.

    Mobile Safari Glitch Corrected [Announcements]

    I received a report that www.StarSchemaCentral.com was not redirecting properly for some readers using the mobile version of the Safari browser.

    I believe the issue has been corrected.

    If you continue to have problems, you can always use blog.oaktonsoftware.com to get here instead.

    Apologies for the inconvenience.

    -Chris

    Monday, January 11, 2010

    TDWI Members: Read Chris's "Ten Mistakes To Avoid"


    If you are a member of The Data Warehousing Institute, be sure to check out your Q4 2009 benefits.

    I wrote an installment of the "Ten Mistakes" series entitled Ten Mistakes to Avoid in Dimensional Design.  It covers common blunders in the use of dimensional design, both technical and non-technical.

    This publication is only available to TDWI members.

    - Chris