
Thursday, March 28, 2013

Where To Put Dates

A reader is trying to decide if certain dates should be modeled as dimensions of a fact table or as attributes of a dimension table.

I have two attributes that I'm really not sure where best to place: 'Account Open Date' and 'Account Close Date'. In my model, I have [Dim Accounts] as a dimension and [F transact] as a fact table containing account transactions. An account can have many transactions, so the dates have different cardinality than the transactions.
  • I thought to put the dates in the Accounts dimension, but this led to problems: difficulties in calculations related to those dates--like if I want to get the transactions of the accounts that opened in the 4th quarter of 2012, or to get the difference between the date of last transaction and the account opening date, and so on.  In other words I can't benefit from the Date dimension and the hierarchies it contains.
  • So I thought about placing those dates in the fact table, but what made me hesitate is that the granularity of those dates is higher than that of the fact table, so there will be a lot of redundancy.
- Ahmad
Bethlehem, Palestine

This is a common dilemma. Many of our most important dimensions come with a number of possible dates that describe them.

Ahmad is thinking about this problem in the right way: how will my choice affect my ability to study the facts?

It turns out that (1) this is not an either/or question, and (2) granularity is not an issue.

Dates that Describe Important Dimensions

Image licensed via Creative Commons 2.0

The dates are clearly useful dimension attributes.  I suggest that you keep them in the dimension in one of two ways, which I will discuss in a moment.

First, though, let's look at what happens if the dates are only represented as foreign keys in the fact table:

If the dates are not stored in the dimension, the open and close dates are only associated with the Account dimension through the fact table.  The fact table only has records when transactions occur. So it becomes harder to find a list of open accounts, or to find the set of accounts that were active as of a particular date.

An additional factless fact table may help here, but it is far simpler to store the dates in the dimension.

Date as Attribute vs Outrigger

If you plan to represent the dates in your dimension table, you have two choices. You can model the dates themselves as attributes, or you can model a pair of day keys in your account dimension. Either approach is acceptable.

The first option does not expose the richness of your full day dimension for analytic usage, but it may be simpler to use for many business questions. Other questions (like your quarterly example) will require a bit more technical knowledge, but most BI tools help with this.

The second option transforms your star into a (partial) snowflake. The day dimension becomes known as an "outrigger" when it connects to your account dimension. This allows you to explicitly leverage all the attributes of your Day dimension. The cost is some extra joins, which may be confusing and may also disrupt star-join optimization.
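
For illustration, here is a minimal sketch of a query against such an outrigger. The table and column names are hypothetical:

    -- Transactions for accounts opened in Q4 2012,
    -- qualified via the outrigger on the account dimension
    SELECT a.account_number,
           SUM(f.transaction_dollars) AS total_dollars
    FROM   transaction_facts f
    JOIN   accounts a       ON f.account_key = a.account_key
    JOIN   day_dim open_day ON a.open_day_key = open_day.day_key  -- the outrigger
    WHERE  open_day.calendar_quarter = 'Q4'
      AND  open_day.calendar_year = 2012
    GROUP BY a.account_number;

Note the extra join through the account dimension to reach the Day attributes -- this is the cost mentioned above.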

Making the correct choice here involves balancing several perspectives:

  • The business view and usability
  • The capabilities of your BI software front end
  • The capabilities of your DBMS back end software

Day Keys in the Fact Table

Having said all that, it is also useful to represent at least one of these dates in the fact table. The account open date may be a good dimensional perspective for the analysis of facts.

As you observed, this date has different cardinality than the transactions. The account open date for an account remains constant, even if it has dozens of transactions in your fact table. But the fact that it has low cardinality should not stop you from choosing it as a major dimension of your star!

Your account transaction fact table may have a pair of day keys -- one for the date the account was opened, and one for the date of the transaction.
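
A minimal sketch of such a fact table follows; the names are hypothetical:

    CREATE TABLE transaction_facts (
        account_key          INTEGER NOT NULL,  -- references the account dimension
        day_key_transaction  INTEGER NOT NULL,  -- date of the transaction
        day_key_account_open INTEGER NOT NULL,  -- date the account was opened
        transaction_dollars  DECIMAL(12,2)
    );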

If you choose to do this, the account dimension itself should include the open date. The outrigger solution is not necessary since your fact table has full access to the Day dimension.

Note that I do not recommend this for your account closed date, because that date changes. Storing it as a key for every transaction against an account would require a lot of updates to fact table rows once the account becomes closed.

More Information

I've touched on this topic in the past.  In particular, see this post:


Although I edited it out of Ahmad's question, he also cited an issue surrounding the use of NULL for accounts that do not have a closed date. On that topic, see this recent post:



Support this Blog

I maintain this blog in my spare time.  If you find it helpful, you can help support it by picking up a copy of my book:  Star Schema: The Complete Reference.

Use the links on this blog to get a copy of this or any of the other recommended books, and you will be helping to keep this effort going.

Monday, November 5, 2012

Name Value Pairs and Dimensional Models

A reader asks how name/value pairs are best translated into a dimension design.
I was wondering if you have ever had to setup a data warehouse with a source system that has the ability to have dynamic "virtual" columns using a set of name/value pair tables...If you have done this or have any cool ideas on how to tackle this setup in a dimension that would be greatly appreciated.
-Joshua
Belmont, CA

This is a common scenario. The preferred solution is to identify the most commonly used "virtual columns" and model them as standard attributes. Bridged solutions are also possible, but are much less flexible from a usability standpoint.

Name Value Pairs

Name/value pairs are often used in OLTP solutions to offer a flexible way to describe an entity.

Rather than store each characteristic of an entity as an attribute, a special table lets you identify unlimited attributes for something. Its main columns capture the name of the attribute and its value. (Hence the moniker "name/value pair".)

For example, suppose you are Apple Inc. Your OLTP system has a product table, but each product has a wide variety of attributes. Rather than model each attribute, you might capture them in a name/value pair table.

The rows in the table for the newly released iPad Mini might look like this:
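
(A sketch of a few representative rows -- the attribute names and values here are illustrative.)

    attribute_name    attribute_value
    ----------------  ----------------
    Product Line      iPad
    Model             iPad Mini
    Color             Black & Slate
    Capacity          16GB
    Carrier           AT&T
    ...               ...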

In all, 18 such rows describe a single iPad Mini model. 

Elsewhere in the data model, an associative table (or "intersect table") cross-references these 18 pairs with a single row in the product table for this item.

Apple has 23 other iPad Mini models, each with different values for the capacity, color, and carrier attributes, and so on. 

In addition, this same table can be used to describe the characteristics of products with different named attributes - a phone, a computer, a notebook, and so forth. New attributes do not require changing the database design. Instead, all that is necessary is to insert the appropriate rows.

In a dimensional model, there are several ways to handle name/value sources. Each involves tradeoffs between flexibility and usability.

Flattened Solution is Most Usable

The preferred method is to model the "names" as explicit dimension attributes.1

In the case of our product, for example, the resultant dimensional model might look like this:
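
(A rough sketch, with illustrative names -- the real attribute list would come from the most commonly used name/value pairs.)

    CREATE TABLE product (
        product_key   INTEGER NOT NULL,   -- surrogate key
        part_number   VARCHAR(20),        -- natural key
        product_name  VARCHAR(100),
        category      VARCHAR(50),
        color         VARCHAR(30),        -- flattened from "Color" name/value pairs
        capacity      VARCHAR(30),        -- flattened from "Capacity"
        carrier       VARCHAR(30)         -- flattened from "Carrier"
        -- ...one explicit column per commonly used "name"
    );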

This approach produces the most understandable and usable dimensional model. Each important characteristic is present in the dimension, named according to business terms, and ready to provide context for facts.

Do not be discouraged if every instance of the source entity (product in this case) does not share the same set of attributes in the name/value pair table. Look for the 100 or so most commonly used attributes. In many cases, this may satisfy most of your analytic requirements.

Core and Custom Dimensions

If there are simply too many names to flatten into a single dimension table, the next step is to consider "core and custom" models.

If you separate the entity by "types", can you find 100 named attributes for each type?  

If so:
  • Build a single "core" dimension that contains an identifier and any attributes common across all types, along with a surrogate key.
  • For each type, build an additional dimension that replicates the common attributes, plus the 100 or so named attributes specific to that type.
  • The core and custom versions should use the same surrogate key domain for each member of the dimension, so that they can be swapped in and out of queries as required.
For example, Apple might do the following:
  • Build a single product table that has the part number, category, product name and a key. This table will have a row for all products, including computers, tablets, phones and music players. This is the "core" product table. 
  • Build an additional "custom" dimension for each category of product. The table pictured in the previous section might be the custom dimension for tablets.
This solution retains the understandability of the previous solution - each attribute is clearly and explicitly named. However, we must now take care to join the appropriate version of the dimension to fact tables - a technical consideration based on the business question being asked.
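
A minimal sketch of this swapping, using hypothetical names (note that both dimensions share the same product_key values):

    -- A question spanning all products uses the core dimension
    SELECT p.category, SUM(f.sales_dollars)
    FROM   sales_facts f
    JOIN   product p ON f.product_key = p.product_key
    GROUP BY p.category;

    -- A tablet-specific question swaps in the custom dimension
    SELECT t.capacity, SUM(f.sales_dollars)
    FROM   sales_facts f
    JOIN   product_tablet t ON f.product_key = t.product_key
    GROUP BY t.capacity;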

Bridged Solution Mirrors OLTP Design

The last option is to employ a bridged solution. This technique actually mimics the original OLTP design. An outrigger contains the name/value pairs, and a bridge associates it with the dimension.

For example, Apple's product dimension might be bridged as follows:

This solution has the benefit of providing access to the full range of named attributes. It is also flexible; new named attributes do not require any change to the schema design.2

The disadvantage here is that this solution is the most difficult to use. Facts can be easily double counted, triple counted, etc. It will be necessary to take great care in constructing and qualifying queries, and it may also be necessary to lock down "grand total" functionality in end-user reporting tools.
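
To make the danger concrete, here is a hedged sketch of a bridged query, with hypothetical names:

    -- Sales by color, via the attribute bridge
    SELECT av.attribute_value AS color,
           SUM(f.sales_dollars)
    FROM   sales_facts f
    JOIN   product p            ON f.product_key = p.product_key
    JOIN   product_bridge b     ON p.product_key = b.product_key
    JOIN   product_attribute av ON b.attribute_key = av.attribute_key
    WHERE  av.attribute_name = 'Color'   -- constrain to a single named attribute
    GROUP BY av.attribute_value;

    -- Without the WHERE constraint, each fact repeats once per
    -- associated name/value pair, and any grand total is overstated.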

Both Ways

Remember that this is not an either/or solution. If a bridged outrigger is necessary, consider also capturing the 100 most commonly used attributes in the dimension itself.

This allows you to create a safe "sandbox" for less technically sophisticated users. They can access much of the analytic value of your solution without being exposed to the complexity and possible dangers of the bridge table.

More Reading
  • Bridges: Much has been written in this blog about bridge tables. The most relevant entry if you want to learn more is Resolve Repeating Attributes With A Bridge Table (1/28/11).  There are also two full chapters in my latest book (see below.)
  • Core and custom dimensions have not been previously discussed here, but you can learn more about them in Star Schema: The Complete Reference.  It dedicates an entire chapter to "Type-Specific Stars."
Help support this blog

This blog is ad-free, and no one pays me to write it.  You can help out:  
  • Use any of the links on these pages to pick up my latest book, Star Schema: The Complete Reference.  
  • Or, if you already have it, use the links to check out any of the other recommended books.

When you do, a small portion of the purchase goes to this blog (you do not pay any extra).


Notes:

1I'll call these "names" rather than "pairs", since it's the name that will translate into a column. For example, "color" is a name; it may participate in several name/value pairs -- one for each possible color.

2The bridge table in this diagram contains product and attribute keys. If there will be many combinations, dimensional modelers often replace the product key with a group key, and store the group key in the dimension table. See Resolve Repeating Attributes With A Bridge Table (1/28/11). 


Wednesday, October 10, 2012

Handling Rankings

Dimensional modelers often struggle with issues surrounding rankings. Learn when to store them, where to store them, and where not to store them, by reading this post.
We need to calculate Top N rankings across several dimensions (account, channel, hour, day, week, month.) There are massive data sets involved. This must be a common problem. Is there an approach you can point us towards?
- Jeremy
Rankings are time and context dependent. When you need to track them, they pose challenges that are similar to non-additive facts such as period-to-date measurements or distinct counts.

The solutions, it turns out, are also similar.

Rankings and their Context

When thinking about how to handle rankings, it is useful to break them down. A ranking is actually made up of several parts. There is the dimension being ranked, the dimensions for which the ranking will be computed,  and the fact that is the source of the ranking.

For example, look at this simple statement:

"Show me the top 10 customers by week and product."
  • The dimension being ranked here is Customer.  
  • The dimensional context of each ranking is a product and week. For example, if we have 20 products, then each week we will need 20 top ten lists -- one for each product.
  • The fact driving the ranking is not explicit in this question. It is probably spending, though it may be based on margin or transaction counts.  
That may sound simple, but things are usually a bit more complicated.

The reader, for example, does not just need a single ranking. He actually needs several -- daily, weekly, monthly. Presumably, these may be mixed and matched with the other dimensions he mentioned -- account and channel. In addition, there may be different rankings based on different facts.

Compute vs. Store

It may have occurred to you at this point that there are a multitude of possible rankings, even for a relatively simple schema.

In addition, once we compute a ranking, there is very little we can do with it. For example, we cannot sum it up over time periods, or other dimensions.  We may be able to compare it to another ranking (a different period or dimension, for example), or treat it as reference data describing the thing being ranked. But that's about it.

For these reasons, many organizations choose to compute rankings within reports, or at query time.  This provides maximum flexibility, and also sidesteps the ugly problem of having to store information that cannot be aggregated.
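
For example, most databases can compute a ranking at query time with a window function. A minimal sketch, assuming a hypothetical sales_facts star:

    -- Rank customers by spending within each product and week
    SELECT product_key, week_key, customer_key,
           RANK() OVER (PARTITION BY product_key, week_key
                        ORDER BY SUM(spending_dollars) DESC) AS customer_rank
    FROM   sales_facts
    GROUP BY product_key, week_key, customer_key;

    -- Wrap this in a subquery and filter on customer_rank <= 10
    -- to produce the top-ten lists.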

However, the reader indicates that data volumes are large. Let's assume they are so large that it is not feasible to keep the granular data around to compute rankings. Let's also assume that the rankings he needs are consistent over time. In his case, it may be necessary to compute all needed rankings and then store them.

Rankings fit with Snapshots or Aggregates

A ranking is only useful within the original context of its calculation. We won't really be able to aggregate it.  In this respect, it is similar to a distinct count or a period-to-date measurement -- topics I have addressed in the past.  (Some period-to-date metrics can be aggregated within a single period, but otherwise the comparison holds.)

The recommended solution is similar as well. Establish the granularity of the various rankings, then keep track of them in fact tables of appropriate grain. These will either be periodic snapshots, or aggregate tables that summarize appropriate time periods. The rank itself will be stored in the fact table, even though it is not additive.

Many developers balk at this solution, observing that it increases the number of fact tables that must be maintained. But note that these fact tables do not introduce any new dimension tables, and they can usually be sourced from transaction-grained stars. And the alternatives have even less appeal.

Rankings don't fit with Transactions

Most of the time, it is not useful to store rankings with transactions. Unless all the things being ranked happen to have transactions that occur at the same time, it will be impossible to compare the ranking with anything -- including other things being ranked!

Resist abstracted models

It is tempting to come up with designs that abstract the concept of a ranking. For example, one can imagine setting up a dimension that describes a ranking, the dimension being ranked, and the period. A fact table can be set up that carries the key for this metadata, plus 10 or 20 keys that will be used to link to top 1, top 2, etc.

Abstracted designs are appealing because they may appear to limit the number of stars to be maintained and don't require schema changes when new rankings are added. But note that the same amount of work must be done to load each ranking.

More importantly, these abstracted designs severely hamper usability. With such a design, it is much harder to merge information about a ranking with other data. A report that contains product level detail, for example, cannot easily pull in info indicating the product appeared in a Top 10 ranking -- this would require decoding metadata and then checking each of the various keys.

Further Reading

As mentioned, some of the challenges related to rankings are similar to those involving distinct counts or period-to-date metrics.  I've covered these topics before:


I also touch on the topic in Chapter 8 of  Star Schema: The Complete Reference, where I address the proper home of a period-to-date measurement.  Use the link to purchase a copy, and you will be helping to support this blog.



Image licensed via Creative Commons 2.0



Monday, July 16, 2012

Q&A: Does Master-Detail Require Multiple Stars in a Travel Data Mart?

A reader asks whether master-detail relationships call for multiple fact tables.

Q: Is there some general principle to use as a guide when deciding to use 1 star or 2 for master-detail situations?

We have a master-detail situation involving trips, and multiple trip expenses per trip. Many of our queries are at the trip level -- e.g., how many trips are for clients in their 30's going to Arizona in July. Many of the queries are at the detail level -- how much is spent on meals by the client's age range?

Both levels of query can be answered with one star, with the fact at the detail level (trip expense line). For trip-level questions, you can count distinct occurrences of the natural key in the trip dimension. But it seems that grinding through all those trip expenses when I just want to count trips is inefficient.

So do we have 2 fact tables, one at the trip level, one at the trip expense level? Do the dimensions surround both of them?  Is there a rule?


- Ellen, Ottawa
A: As the reader observes, this is a common situation: when there is a master-detail relationship, is there a need for multiple fact tables?

This case calls for at least two fact tables.
Read on for the details.

Photo by o5com, licensed by Creative Commons

Multiple Fact Tables

After learning the basics of dimensional modeling, the first real world challenge we face is understanding when and how to design multiple fact tables.  Until we learn to think in dimensional terms, the choice can be difficult.

I suggest starting with some basic guidelines.  You probably need different fact tables if:
  1. You have measurements with different periodicity
  2. You have measurements with different levels of detail
The first guideline suggests that if facts do not describe the same event, they probably belong in different fact tables. For example, orders and shipments do not always happen at the same time.  Order dollars and shipment dollars belong in separate fact tables.

The second guideline pertains to facts that do describe the same events.  Information about an order and information about an order line are ostensibly available at the same time, but they have different levels of detail. If there are facts at both of these levels, there will need to be multiple fact tables.

So how would you apply these guidelines?

Think Dimensionally

It is easiest to work these questions out by thinking dimensionally. Forget about master-detail or parent-child for a few minutes, and consider what is being measured. 
  • What are the business questions being asked, and what business metrics do they contain?  
  • Are these metrics available simultaneously?
  • At what level of detail are these metrics available?
The reader cited business questions in her example.  These questions reveal at least two metrics:  number of trips, and expense dollars.

These metrics may be available on different schedules -- for example, a trip may commence before we have the details of all the expenses incurred.  This would argue for multiple fact tables.

But let's suppose that the single source of data is expense reports. We do not have any information about the trip until it is over, at which point we have all the expense details.

In this case, let's think about the level of detail of these metrics. Number of trips and expense dollars seem to share a lot of common detail -- the traveller, the destination, the trip start and end date, and so forth.  

But expense dollars have some additional detail -- the date of the expense item, and an expense category. Since these facts have different levels of detail, it makes sense to include them in separate fact tables.

Factless Fact Table

The trip-level star may contain a factless fact table.  It contains one row per trip, with no explicit facts.  We can determine the number of trips that answer a business question simply by counting rows.

Many teams would prefer to add a fact called number_of_trips and always populate it with the value 1.  This is useful, because it makes the business metric explicit in your schema design.  (I've written about this technique before.)
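
Either way, trip-level questions become simple row counts. A sketch with hypothetical names, assuming a single destination at the trip grain:

    -- How many trips were taken to Arizona in July by clients in their 30s?
    SELECT COUNT(*) AS number_of_trips     -- or SUM(number_of_trips)
    FROM   trip_facts f
    JOIN   client c      ON f.client_key = c.client_key
    JOIN   destination d ON f.destination_key = d.destination_key
    JOIN   day_dim t     ON f.trip_start_day_key = t.day_key
    WHERE  c.age_range  = '30-39'
      AND  d.state_name = 'Arizona'
      AND  t.month_name = 'July';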

The trip level star may also contain some summary level facts that describe expenses -- say the total trip cost, the total transportation cost, total meals cost, etc.  More detail on these metrics (such as the vendor, payment method, etc.) can be found in the expense item star.

Deeper into Trip Analysis

Digging deeper, the reader may discover she needs more than two fact tables.  

Trips often include more than one destination.  When it is necessary to study the various segments (or "legs") of a trip, we can define a third fact table that contains one row per segment.

In this case, the reader would have three fact tables:
  1. trip_facts  Grain: one row per trip.  Metrics: Number of trips (always one)
  2. trip_segment_facts  Grain: one row per destination city.  Metrics: number of segments (always one)
  3. trip_expense_facts  Grain: one row per expense item.  Metrics: expense_dollars
Each of these stars will share some common dimensions -- the trip origination date, the trip end date, the traveller, the trip identifier, etc.

Trip_segment_facts will also include destination-specific dimensions: the destination city of the segment, the arrival and departure dates for the segment, etc.

Trip_expense_facts will include dimensions that describe the individual expense items: the date paid, the payment method, the payee, the expense category and subcategory, and so forth.

A conformance matrix will help you keep track of the dimensionality of each star.

Learn More

Thanks to Ellen for the thoughtful question.  If you have a question of your own, please send it in.

Pick up a copy of Star Schema: The Complete Reference and you will be helping support this blog. 

  • Chapter 4 is dedicated to designing multiple fact tables.  
  • Chapter 5 looks closely at conformed dimensions. 
  • Chapter 12 covers factless fact tables.
You can learn more about these topics by referring to these posts:
  • Factless Fact Tables (September 15, 2011) covers factless fact tables that describe events, and the use of a fact with the constant value of 1.
  • The Conformance Matrix (June 5, 2012) describes how to represent the dimensionality of your stars in a matrix format.

The photo in this post is by o5com, licensed by Creative Commons


Thursday, April 26, 2012

Q&A: Human resources data marts


A reader asks if Human Resources data marts are inherently complex. I run down a list of dimensional techniques he should expect to find:
Q: I will be working on a data mart design project to design star schemas for human resources data. I heard that HR data is more complex than sales or marketing and special techniques need to be applied.

I looked at the star schemas of pre-built analytical applications developed by some vendors for our packaged HR solution. I felt that they are quite complex, and I am wondering whether star design for HR data should be so complex.

If possible, can you please discuss this topic in a detailed manner by considering any one of popular HRMS system data and the most common data/reporting requirements along with the design discussion to achieve the star for those reports using the given HRMS data?

- Venkat, UK

A: Human Resources applications do indeed tend to use advanced techniques in dimensional design.

Below, I run down a list of topics you will probably need to brush up on. In reality, every subject area requires complete mastery of dimensional modeling, not just the basics.

Note that the complexity you are seeing in packaged solutions may not stem from the subject area; vendors often produce abstracted models to facilitate customization.

Techniques used in HR data marts

No doubt you are accustomed to the transaction-grained stars you encountered in sales.  You will find them in HR as well, but you will also encounter these:
  • Snapshot stars sample one or more metrics at pre-defined intervals.

    In an HR data mart, these may be used to track various kinds of accruals, balances in benefit programs, etc.

  • Accumulating snapshot stars track dimension members through a business process and allow analysis of the elapsed time between milestones.

    These may be used to track the filling of a position, "on-boarding" processes, disciplinary procedures, or applications to benefit programs.

  • Factless fact tables track business processes where the primary metric is the occurrence of an event. They contain no facts, but are used to count rows.

    These are likely to be used for tracking attendance or absence, participation in training courses, etc.

  • Coverage stars are factless fact tables that model conditions.  These are usually in place to support comparison to activities represented in other stars, but may also be leveraged to capture key relationships among dimensions.

    These are likely to be used for linking employees to positions, departments and managers. 
Your dimensions will also require reaching beyond the basics:
  • Transaction dimensions capture the effective and expiration date/time for each row in a dimension table.  These are advisable in almost any situation.

    In HR they may be used to  track changes in an employee dimension.

  • Bridge tables for Multi-valued attributes allow you to associate a repeating attribute with a dimension.

    In HR, these are likely to be used to associate an employee with skills, languages, and other important characteristics.

  • Hierarchy bridge tables allow you to aggregate facts through a recursive hierarchy.

    In HR, these are used to navigate reporting structures (employees report to employees, who in turn report to other employees, and so forth) as well as organizational structures.
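
As a concrete illustration of that last technique, a hierarchy bridge query might be sketched as follows (the names are hypothetical):

    -- Total absence hours for everyone beneath each manager: the bridge
    -- pairs every employee with each of their direct and indirect reports
    SELECT mgr.employee_name,
           SUM(f.hours_absent) AS total_hours_absent
    FROM   absence_facts f
    JOIN   hierarchy_bridge hb ON f.employee_key = hb.subordinate_key
    JOIN   employee mgr        ON hb.superior_key = mgr.employee_key
    GROUP BY mgr.employee_name;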
I would also expect to encounter some complexity in slow-change processing rules.  Human Resource systems carefully audit certain kinds of data changes, tracking the reason for each change. As a result, you may have attributes in your dimension schema that may exhibit either type 1 or type 2 behavior, depending on the reason for the change.

Every schema goes beyond the basics

This list could go on, but I think you get the idea.

The only way to design a data mart that meets business needs is to have a well rounded understanding of the techniques of dimensional modeling.

You cannot get very far with nothing more than a grasp of the basics. This holds true in any subject area -- even sales and marketing.  You need the complete toolbox to build a powerful business solution.

Packaged data marts

The complexity that concerns the reader may actually stem from another cause:  he is looking at packaged data mart solutions.

Packaged applications often introduce complexity for an entirely different reason: to support extensibility or customization. For example, facts may be stored row-wise rather than column-wise, and dimensions may contain generic attribute names.

Learn more

This blog contains posts on most of the topics listed above.  Click each header for a link to a related article.  Some topics have been discussed in multiple posts, but I have included only one link for each, so do some exploring of your own as well.

In addition, please check out my book Star Schema: The Complete Reference.  When you purchase it from Amazon using the links on this page, you help support this blog.
  • Snapshots and accumulating snapshots are covered in Chapter 11, "Transactions, Snapshots and Accumulating Snapshots"

  • Factless fact tables and coverage stars are covered in Chapter 12, "Factless Fact Tables"

  • Transaction dimensions are covered in Chapter 8, "More Slow Change Techniques"

  • Attribute bridges are covered in Chapter 9, "Multi-valued Dimensions and Bridges"

  • Hierarchy bridges are covered in Chapter 10, "Recursive Hierarchies and Bridges"
Thanks for the question!

- Chris

Send in your own questions to the address in the sidebar. 


Do you have another technique that was useful in an HR data mart? Use the comments.

Image credit: Gravityx9 licensed under Creative Commons 2.0

Tuesday, April 17, 2012

Q&A: Degenerate Dimensions

A reader asks about degenerate dimensions:
Q: I understand that when a particular dimension value is unique to each fact row (e.g., an invoice number perhaps) it should probably exist as a degenerate dimension in the fact table.  Otherwise, that dimension table will contain as many rows as the fact table. 

However, what about a situation in which a dimension table contains only one column aside from the surrogate key (i.e., a status of some sort) but is not unique to each fact row.  In that case, should it be moved into the fact table as a degenerate dimension?

- Brian

Both cases are good applications for a degenerate dimension: a transaction identifier or a low-cardinality attribute that would otherwise need to be a very boring single-column table.

In all cases, the degenerate dimension is entirely optional.

Degenerate dimensions

A degenerate dimension is nothing more than a dimension attribute stored in a fact table, rather than a dimension table.

In all other respects, the degenerate dimension behaves exactly like any other dimension attribute.  It can be used to provide context for facts, qualify queries, sort results, drive master-detail relationships, and so forth.
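
For example, here is a minimal sketch of a query against a degenerate dimension, with hypothetical names:

    -- The invoice number lives in the fact table, yet it groups and
    -- filters results just like any dimension attribute
    SELECT f.invoice_number,
           SUM(f.invoice_dollars) AS invoice_total
    FROM   invoice_facts f
    GROUP BY f.invoice_number;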

The usual reason to create a degenerate dimension is simple:  A dimension with one attribute really doesn't need to be in a separate table with a surrogate key.

Two important things to note here:
  1. Degenerate dimensions are optional
  2. There is nothing wrong with a 1:1 relationship between dimension table and fact table
Let's take a closer look at two common uses of degenerate dimensions.

Transaction identifiers as degenerate dimensions

For a transaction fact table, we usually set the grain at the lowest level of detail possible. Often that is an individual transaction such as an order line, invoice line or shipping line.  (For a refresher on grain, see this post from 2009.)

Typically, these things have some kind of unique identifier which is carried into the dimensional model.  Examples include an order line number, invoice line ID, transaction ID, etc.  When these identifiers do not exist, we often create them, as I have discussed in a previous post.

Many designers opt to place these transaction identifiers directly into the fact table as a degenerate dimension.  However, this is entirely optional.  There is nothing wrong with a dimension table being in a 1:1 relationship with the fact table, as I have discussed previously. 

Some designers forgo the degenerate dimension for transaction identifiers because there will be other schemas where the attribute in question is also referenced.  In many cases, our business intelligence software may function better if the attribute in question is located in exactly one place -- a shared dimension table. 

As an aside, note that in these other schemas, the cardinality may be different.  For example, the same order line may be referenced by multiple shipments or invoices.

Other degenerate dimensions

Not all degenerate dimensions are transaction identifiers.  They may simply be dimension attributes with no other suitable home.  The status code mentioned by the reader may be an example.  Another example appeared in a previous Q&A post: Reason Overpaid.

Once again, the use of the degenerate technique is entirely optional. Many designers are comfortable placing the attribute in the fact tables, while others opt for dimension tables with one attribute.

Notice that a dimension table with one attribute does not require a surrogate key for slow change purposes. Without the surrogate key, however, the attribute value would effectively be stored in the fact table anyway -- as a foreign key referencing a single-column table!

Placing the attribute in a separate table may help with our BI tools, for the same reason given above. Database administrators may also prefer the option because it makes it easier to configure a star-join optimizer.  Adding a surrogate key may reduce the size of the foreign key that will be stored in fact table rows.

Learn More

For more information on this topic, see the following posts:
  • Accumulating Snapshots (October 1, 2010)  Discusses another situation where a dimension table may have the same number of rows as the fact table, or perhaps even more.
  • Rule 1: State Your Grain (December 9, 2009) Defines the concept of grain, and discusses the best practice of setting it at the lowest level of detail possible.
Of course, all these topics are also covered in Star Schema: The Complete Reference

You can help support this blog by using the links here to order a copy from Amazon.com.

Image licensed via Creative Commons 2.0
from Patrick Hoesley

Sunday, February 19, 2012

Accumulating snapshots and Type 2 changes

How does one maintain an accumulating snapshot when the dimension that defines its grain can undergo type 2 changes?

A reader recently asked this question via the comments for a recent post.

The short answer:  an accumulating snapshot has only 1 row for each thing being tracked.  If that "thing" can undergo type 2 changes, then the accumulating snapshot should link to only one version -- the current version.

State transitions

Many businesses track items that go through state transitions.  A trouble ticket, for example, is opened, assigned to a support rep, and eventually closed.  A mortgage application is submitted, reviewed, processed, underwritten, and settled upon.

These activities may be represented by a series of fact tables associated with the various process milestones.  Or, they may be captured by a single fact table that records a row for each status change.

When it comes time to study the average time spent at the various stages, however, these options may not be ideal.  In the former case, it will be necessary to drill across several fact tables.  In the latter case it will be necessary to perform correlated subqueries.

The accumulating snapshot model addresses these concerns.

The grain of an accumulating snapshot

An accumulating snapshot receives one row for each item being tracked.  This single row is updated as the item reaches various milestones.  A series of facts are incremented for each day spent in a given stage.  When a milestone is achieved, the appropriate date key is populated.

(For a full refresher on the accumulating snapshot, see the links at the end of this post.)

Unlike other fact tables, then, the accumulating snapshot always reflects the current state of affairs.  In the case of a mortgage processing fact table, for example, each row shows a single application, along with how many days it has spent at each processing stage.

This makes it very easy to study the average time between various stages -- just take the average of the facts in question.  No drilling across, no correlated subqueries.

Changes in the defining dimension

Back to the original question, then. What happens if the dimension that defines the grain of the accumulating snapshot can undergo type 2 slow changes?

The goal of the accumulating snapshot is to summarize each real world item in a single row.  Even if the dimension contains multiple rows for a particular item, the fact table must contain only one. If it contained more than one, its usability would be harmed.

As I have written before, this means the defining dimension may have more rows than the accumulating snapshot.  This is OK.  It's just another way in which the accumulating snapshot differs from other types of fact tables.

Our mortgage application, then, must have one and only one row in the fact table -- even if type 2 changes mean there is more than one row for it in the dimension.

Linking to the current row

The question, then, becomes which row should the fact table link to?

Because the accumulating snapshot is current-valued, it makes the most sense to have it link to the row in the dimension table that represents the current state of the item being tracked.  In our example, this would be the most recent version of the mortgage application.

If you increment the facts in your accumulating snapshot daily (something I suggest in my book), you should also take this opportunity to adjust the key values for any items that have undergone a type 2 change.  If you adjust the facts only when a milestone is met (something you may do if the volume is very high), you may choose to update the key value at this time instead.
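
A hedged sketch of that key adjustment, assuming the fact table carries the application's natural identifier as a degenerate dimension, and the dimension flags its current rows:

    -- Re-point each snapshot row at the current version of its application
    UPDATE mortgage_pipeline_facts
    SET    application_key =
           (SELECT d.application_key
            FROM   application d
            WHERE  d.application_id = mortgage_pipeline_facts.application_id
              AND  d.is_current = 'Y');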

Learn more

This and other topics surrounding the accumulating snapshot are discussed in depth in Star Schema: The Complete Reference, in Chapter 11: "Transactions, Snapshots and Accumulating snapshots."

You can help support this blog by using the links in the sidebar to purchase a copy.

You can also read more in these posts:
  • When do you need an accumulating snapshot (January 26, 2012) a recent post that triggered the question, wherein a reader asked how to choose between building an accumulating snapshot and tracking something as a series of type 2 changes. In some cases, the answer is to do both.
  • Q&A: Accumulating snapshots (October 1, 2010) Explores the cardinality relationship between accumulating snapshot and dimension table 

Image Credit:  Creativity103 via Creative Commons

Thursday, January 26, 2012

When do you need an accumulating snapshot?

A reader wonders how to decide between two options:  designing an accumulating snapshot vs. tracking status changes within a dimension.

I have a new project to track the status of order transitions. I'm unable to reach a conclusion as to whether to implement an accumulating snapshot or a type 2 slowly changing dimension. ETL integrates a number of individual systems as the order transits each stage. What is the best way to design it?
Kumar
Milton Keynes, UK

Many businesses have one or more central dimensions that undergo a state transition as they touch multiple processes. Is it enough to track the changes to the dimension?  Or is an accumulating snapshot needed?

I'll walk you through some decision criteria.  But first a little refresher on the two design options the reader is considering.

Type 2 changes and timestamps


Type 2 changes track the history of something represented by a dimension.  Each time there is a change to this item, a new row is inserted into the dimension table.

This allows any row in a fact table to be associated with a historically accurate version of the dimension as of the relevant point in time.

In the reader's case, status of the order might be an attribute of an Order dimension.  Modeled as a type 2 change, the dimension holds a status history for each order.  By adding effective and expiration dates to each row, you know exactly when each state transition occurred.

Accumulating snapshots

An accumulating snapshot is a type of fact table that records a single row for something the enterprise tracks closely, such as a trouble ticket or mortgage application--or, in the reader's case, an order.

This fact table contains multiple references to the date dimension -- one for each of the major milestones that the item in question can reach.  In the case of the order, this might be the date of order, the date of credit approval, the date of picking, the date of shipment and the date of delivery.

Unlike other kinds of fact tables, the accumulating snapshot is intended to be updated.  These dates are adjusted each time one of the milestones is reached.

There may also be facts that track the number of days (or minutes) spent between each milestone.  These "lags" are a convenience -- they can be computed from the dates.  (Building them into the fact table makes analysis much easier, but does require that the ETL process revisit rows on a regular basis, rather than only when status changes.)

Avoiding correlated subqueries

If a type 2 slowly changing dimension with timestamps tracks the history, why consider an accumulating snapshot?

The analytic value of the accumulating snapshot is that it allows us to study the time spent at various stages. In the reader's case, it can make it simple to study the average time an order spends in the "picking" stage, for example.

We can do this with a type 2 slowly changing dimension as well, but it will be more difficult to study the average time between stages. For the order in question, computing days spent in the picking stage requires knowing the date of credit approval and the date picked.  These will be in two different rows of the dimension.  Now imagine doing this for all orders placed in January 2012.  This will require a correlated subquery.

The accumulating snapshot pre-correlates these events and places them in a single row.  This makes the queries much easier to write, and they are likely to run faster as well.  The cost, of course, is the increased data integration burden of building the additional fact table.
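
For comparison, here is a minimal sketch of the accumulating snapshot version of that question, with hypothetical names:

    -- Average days in picking, for orders placed in January 2012:
    -- a simple aggregate, with no correlated subquery
    SELECT AVG(f.days_in_picking)
    FROM   order_pipeline_facts f
    JOIN   day_dim d ON f.day_key_ordered = d.day_key
    WHERE  d.month_name = 'January'
      AND  d.calendar_year = 2012;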

Avoiding drilling across

When each of the discrete milestones is captured by a different fact table, lag may be computed without correlated subqueries.  In this case, it will involve drilling across.

For example, separate fact tables track orders, credit approvals, picking and shipping.  Each references the order dimension.  Days spent in the picking stage can be studied by drilling across credit approvals and picking, with results linked by the common order dimension.1

Here, the pressure for an accumulating snapshot is reduced.  It may still be warranted, depending on your reporting tools, developer skills, and user base.

Summary and final advice

In the end, your decision should boil down to the following:
  1. An accumulating snapshot should only be considered if you are studying the time spent between major milestones
  2. If it helps avoid correlated subqueries, it may be a strong option
  3. If it avoids drill-across queries, it may be a useful option
Making the choice will impact several groups -- ETL developers, report developers, and potentially users.  Make sure this is a shared decision.

Also keep in mind the following:
  • If you build an accumulating snapshot, you will probably also want to track status in the dimension as a type 2 change.
  • Accumulating snapshots work best where the milestones are generally linear and predictable.  If they are not, the design and maintenance will be significantly more complex.
Last but not least:
  • The accumulating snapshot should be derived from one or more base fact tables that capture the individual activities.  
When in doubt, build the base transaction-grained fact tables first.  You can always add an accumulating snapshot later. 
Learn more

This is a popular topic for this blog.  Here are some places where you can read more:
  • Q&A: Accumulating snapshots (October 1, 2010) Explores the cardinality relationship between accumulating snapshot and dimension table
And of course, these concepts are covered extensively in my books.  In the latest one, Star Schema: The Complete Reference, the following chapters may be of interest:
  • Chapter 8, "More Slow Change Techniques" discusses time stamped tracking of slow changes
  • Chapter 11, "Transactions, Snapshots and Accumulating Snapshots" explores the accumulating snapshot in detail.
  • Chapter 14, "Derived Schemas", discusses derivation of an accumulating snapshot from transaction-grained stars.
  • Chapter 4, "A Fact Table for Each Process", includes a detailed discussion of drill-across analysis.

Image Credit:  Creativity103 via Creative Commons

1 Note that this scenario applies to the reader, but does not always apply.  Trouble tickets, for example, may be tracked in a single fact table that receives a new row for each status change.  In this case, there is no drill-across option.

Monday, October 17, 2011

Creating transaction identifiers for fact tables

Sometimes, the source data for a fact table does not include a transaction identifier. When this happens, we often create our own. A reader asks if this goes against best practices:

Q: You said that we should not create surrogate keys in fact tables. But we have a source system that does not provide us with identifiers for phone calls. When we get the log files, we create surrogate keys for each call before loading them into the star.

A: This is a common practice, and it is just fine.

You are not creating what dimensional modelers call surrogate keys. You are creating transaction identifiers.  These will be helpful as part of the ETL and QA processes.

Surrogate keys vs. transaction identifiers

Surrogate keys should not be confused with transaction identifiers.

As I wrote in the post referenced in the question, a surrogate key is an attribute that is created to uniquely identify rows in dimension tables. It does not come from a source system; it is created expressly for the dimensional schema.

A transaction identifier is a dimension attribute in a fact table that defines its grain, or at least helps to do so. (If you need a refresher on grain, see this post.)

For example, PAYROLL_FACTS may have the grain "one row per pay stub."  Each pay stub has an identifying number in the source system, which is carried into the star as a degenerate dimension. This attribute is a transaction identifier.1

When a transaction identifier defines grain, the DBA may define it as "the primary key of the fact table." Likewise, an ETL developer or QA analyst may use it as a sort of key to compare the star to operational data or staging data. 

From the dimensional perspective, however, these transaction identifiers are dimensions.  They can be used to filter queries, group or sort the results, and so forth. They will not appear in other tables as foreign keys that reference the fact table.

Creating transaction identifiers

When operational systems do not have good transaction identifiers, we often "cook one up" during the ETL process. They are not surrogate keys in the dimensional modeling sense, though the term might sound appropriate. They are transaction identifiers.

For example, suppose you are designing a star that records phone calls made from your call center. You want the grain of the fact table to be  "one row per phone call."  The data will come from your call switch.  As you've mentioned, it creates a log file, but does not provide an identifier for each call.

In your design, you may define your grain: "one row per date, time, originating number and number dialed." This works just fine, but may be difficult for ETL developers to manage.  Instead, they assign a unique identifier to each call in the log. You can use this identifier to define the grain of the fact table.
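
One minimal sketch of this approach: let the staging table assign the identifier as rows are loaded from the log. (The names here are hypothetical, and the identity syntax varies by database.)

    CREATE TABLE call_stage (
        call_id            BIGINT GENERATED ALWAYS AS IDENTITY,  -- "cooked up" identifier
        call_date          DATE,
        call_time          TIME,
        originating_number VARCHAR(15),
        dialed_number      VARCHAR(15)
    );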

Carried into the fact table, this attribute aids in the data integration and quality assurance process. It can be used to quickly crosscheck the star against the log file.

However, for analytic purposes, it is not used as a key.  It is a dimension attribute -- a "cooked up" transaction identifier.

Other concerns

When you are forced to create your own transaction identifiers, you will have other things to worry about.

First and foremost, you must be sure there is a mechanism to prevent loading of duplicate facts. You will need to understand how the source system generates its log files, to be sure that you extract data that is complete, accurate, and not subject to revision.

Also, note that the artificial identifiers may convey a false sense that you can reach all the way back to the source to find the original transaction. While it may be possible, your identifier will not be the mechanism.

Thanks for the question, and keep them coming.

-Chris

More Info

For more information, see these posts:

Also check out my book, Star Schema: The Complete Reference.  I discuss surrogate keys in Chapter 1, Analytic Databases and Dimensional Design.  Degenerate dimensions and transaction identifiers are covered in Chapter 3, Stars and Cubes.

1Not all degenerate dimensions are transaction identifiers.  For an example that is not a transaction identifier, see my post of October 15, 2010.

Image licensed via Creative Commons 2.0
 from Patrick Hoesley

Tuesday, October 4, 2011

Are bridge tables really fact tables?

A reader observes that bridge tables seem problematic, and wonders if they should just be replaced by factless fact tables.

Q:  I am wondering if all bridge tables are in fact replacements for factless fact tables. The problem with the bridge table as you mention it is that...[you] need to do an expensive join, and there are issues with Cartesian joins/double counting, etc. So the question is whether a bridge table is a practical option as compared to a separate fact.
Ashish
Bangalore, India

A:  The bridge table looks a lot like a fact table, but it is used very differently.

While we avoid joining a fact table to another fact table, we seek to join bridge tables to fact tables. This can have implications for BI tools that generate SQL.

Similarities

A bridge table appears similar to a fact table because it contains multiple foreign keys. This is most evident when you look at an attribute bridge table, which links a dimension to an outrigger. It consists solely of foreign key references to the dimension and the outrigger.

In this respect, the bridge table is very similar to a factless fact table.  Indeed, one might make the argument that a bridge relates a set of dimensions in much the same way that a factless fact table describes conditions.

But there is a very important difference: we never join a fact table to another fact table. Bridges, on the other hand, are intended to be joined with fact tables.

We do not join fact tables to fact tables

You should never join two or more fact tables--either directly or indirectly via shared dimensions. Fact values will repeat if multiple rows in either fact table share the same dimensionality. We receive a Cartesian product of all related facts.  The result is double-counting, or worse. 

Instead of joining fact tables, we use a technique called drilling across. Facts are collected from each table and aggregated to a common level of detail, then merged into a single result set. I wrote about this process earlier this year.1

Many BI tools that generate SQL are able to identify fact tables and automatically invoke drill across logic when required.

We do join bridge tables to fact tables

Bridge tables represent groups. We create them so that a single fact can be associated with a group of values (such as multiple salespeople) rather than a single value.

When we use a bridge table, we link it to other fact tables.2  By doing this, a single fact in the fact table associates with multiple rows in the bridge table.   

With a bridge, we are exploiting the very Cartesian product that we normally seek to avoid.

We are intentionally repeating a single fact for multiple group members.  To avoid incorrect results, it behooves us to group results by member, or to constrain for a single group member.
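
A hedged sketch of a bridged query, with hypothetical names, shows both the grouping and the danger:

    -- Order dollars by salesperson, via a group bridge
    SELECT sp.salesperson_name,
           SUM(f.order_dollars) AS order_dollars
    FROM   order_facts f
    JOIN   salesgroup_bridge sg ON f.salesgroup_key = sg.salesgroup_key
    JOIN   salesperson sp       ON sg.salesperson_key = sp.salesperson_key
    GROUP BY sp.salesperson_name;

    -- Each order's dollars repeat once per group member, so a grand
    -- total over this result would double-count the facts.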

Even if you were to replace a bridge with a factless fact table, this is the behavior you would desire. Rather than drill across, you would link it to other fact tables, in order to associate the bridged values with various facts. Cartesian products and the danger of double counting would remain.

Bridge tables and BI software

A bridge table is not really a fact table. It is not the locus of process measurement.  It describes neither activities nor conditions. It is merely a construct that allows us to deal with repeating values. It is meant to be linked with fact tables, and used with care.

Because a bridge is composed of foreign keys, however, some BI tools may identify it as a fact table. If your tool does this, you will need to prevent it from invoking drill-across logic for queries that involve a bridge.

Your tool may have a facility for this. If it does not, you can hide the bridge by joining it to dimensions within a view.

More info

Thanks to Ashish for the comments. If you have a question about bridge tables, send it to the address on the sidebar.

You can learn more in my book, Star Schema: The Complete Reference.  Two full chapters are dedicated to bridge tables, including 30 diagrams.

See also:
1Factless fact tables which describe conditions are not joined to other fact tables either. When they are compared to other fact tables, we typically use set operators or subqueries.

2This may happen directly, in the case of a dimension bridge, or indirectly, in the case of an attribute bridge.


Image by ahisgett licensed under Creative Commons 2.0

Monday, August 29, 2011

Slowly Changing Facts?

This post looks at the issues faced when facts can change retroactively.
Q: Our source system allows facts to be updated.  We are trying to come up with a way to preserve this history in the data warehouse.  One idea is to use the type 2 technique for facts.  We call it "slowly changing facts."  Is this a good solution?
A: When transactions are changed retroactively, you should update corresponding facts. If you need to track the history of facts, do so in separate audit tables.

To understand why, we will  look at what happens if you record multiple "versions" of a fact in your fact tables. But first, a refresher on the type 2 slow change technique.

Slowly changing dimensions

Dimensional modelers must decide what will happen when the source data for a dimension attribute changes. There are several possible responses to a change; one of them is known as the "Type 2" response.

The type 2 response calls for keeping a sort of "version history" in the dimension table. For example, if a customer's address changes, we insert a second record into the dimension for the same customer.  This way, the old "version" of the customer can be linked to facts that took place prior to the change, and the new version can be linked to new facts.

I covered the basics in more detail in a previous post; you can also refer to the slowly changing dimensions category for more posts on the topic.

Keeping the history of facts

A fact table contains a record of things that happened.  A customer ordered 200 widgets yesterday; another ordered 300 today. Measurement of these events can be aggregated across various dimensions to answer a variety of questions.

But sometimes, what we know about an event changes.  We originally thought 200 widgets were ordered yesterday, but now we know that 250 were ordered.

At first glance,  it may seem like the same "type 2" technique will solve the problem.  When details about a specific transaction are changed, simply insert a new row into the fact table.  The most current version can be flagged for easy access.

Unfortunately, this seriously harms the usability of the fact table. When people study orders, they must now be sure to single out a single version for each transaction to avoid double counting. You can try to resolve this by adding multiple date-stamps to each record, or converting changes into "deltas", but the damage is done. The model is open to serious misuse.

We'll look at the solution in a moment, but first a look at the same problem in a snapshot model.

Snapshots exhibit the same problems

Another kind of fact table, the periodic snapshot, tracks status at predefined intervals.  A financial data mart, for example, might log the balance of each account in the general ledger at the close of business each day.

Here too, our understanding of what happened may change.  At the close of business Monday, a particular account contains US$ 12,000.  On Tuesday, back-dated transactions change that balance to $13,000.

If we try to track both versions of Monday's balance, usability is damaged in the same way.  We must always qualify our analysis for (1) the date we are studying, and (2) the "as of" date of our analysis.  If we don't, grave errors ensue.

Here, usability problems are compounded by density issues.  Tracking the daily balance of each account over 5 years would normally require 1,826 rows per account (365 days x 5 years = 1,825 days, plus 1 leap-day).  If we must also re-record balances for all prior days, the number grows to an astounding 1,668,051.  (This is a triangular number, computed as (n² + n) / 2, where n is the number of days.)

If there are 500 accounts in the general ledger, that's the difference between 913,000 rows in a standard periodic snapshot, and 834,025,500 for one that logs history.  As each additional day is added, the difference increases.

The alternative 

There is only one version of the truth, and this is what belongs in the fact table.

When our understanding about an event has changed, the corresponding facts should be updated. This holds for both transaction models and snapshot models.

If you must be able to produce an audit trail, do so in a separate table.  The audit table need only capture the transaction identifier, the date, and the value of each fact. Record rows for the original fact and each revision.
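
A minimal sketch of such an audit table, with hypothetical names:

    CREATE TABLE order_fact_audit (
        transaction_id VARCHAR(20),    -- identifies the original transaction
        revision_date  DATE,           -- when this version was recorded
        order_dollars  DECIMAL(12,2)   -- value of the fact at that time
        -- ...one column per audited fact
    );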


If need be, this can also be organized dimensionally as a fact table. Keep in mind, however, that this fact table will be subject to the same usability challenges.  And ask the business if all this trouble is really necessary.

For a snapshot model, an audit table may not be necessary; a corresponding transaction table may be able to hold the history. Each transaction that affects status is recorded with two dates: the date of the transaction, and the date we became aware of the information (often a "posted" date).

Thursday, May 12, 2011

Dates In Dimension Tables

A reader asks about tracking dates in the fact table vs. dimension tables.
Q: It seems to me that it is sometimes more effective to store date keys in the dimension table -- particularly if that date could be used to describe the rows in many different fact tables.  The alternative  would be to duplicate the date key in multiple fact tables.

For example, the effective date of an insurance policy could be used against many fact tables.  Should we store the effective date key in each of the fact tables?  Or store it in a single row in the Policy?

The differences in design would be a star schema (fact table) versus a snowflake (stored in the dimension).

- Robert
Montreal
A:  Many models feature at least one major dimension with a date that might be used to qualify facts in more than one star. This happens in other industries besides insurance, too.  Examples include dimension tables that represent orders, contracts, tests and so forth.

Avoid placing date_key columns in dimension tables.  It increases SQL complexity, reduces schema understandability, and may impact performance.

Some designers may allow limited use of snowflaking, but not in this situation.

Date in Dimension Table, Day Key in Fact Table

When you have a date that (1) clearly describes something represented by a dimension, and (2) will also be used to group facts, do two things:

  1. Place good old-fashioned dates in the dimension table 
  2. In any star where you may want to filter/group/aggregate facts by some aspect of the date in question, add date keys to the fact table
For example, in Robert's case there might be a Policy dimension table with an attribute called policy_effective_date.  This is a date, not a key.

There is also a star that tracks claims. To support analysis of claim information using the policy effective date, the fact table will contain a day_key_policy_effective.  The fact table may contain other day_keys as well (such as day_key_claim, which supports analysis of claim data by the date of the claim.)
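
A minimal sketch of a query using that key, with hypothetical names:

    -- Claim dollars analyzed by policy effective date
    SELECT eff.calendar_year,
           SUM(f.claim_dollars) AS claim_dollars
    FROM   claim_facts f
    JOIN   day_dim eff ON f.day_key_policy_effective = eff.day_key
    GROUP BY eff.calendar_year;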

Judicious Use of Snowflaking?

There are times when some designers might choose a snowflake path -- placing dimension keys into dimension tables. This happens when there is a set of attributes that appear in multiple dimension tables.

This occurs with geographical data.  The characteristics of a location might appear in a customer dimension table, a warehouse dimension table and a department dimension table.

In this case, the concern is that the ETL routines may not consistently process locations across multiple dimension tables. If their attributes or slow change characteristics are not managed identically, inconsistencies may develop across these tables.

Designers reason that creating one location table solves this problem.  Each dimension that contains location data will contain a location_key. This ensures a single consistent representation of locations.

I discuss this technique in Chapter 7 of Star Schema: The Complete Reference, and point out that it opens the door for a different kind of ETL challenge.  If location data is moved to an outrigger, each time there is a Type 2 slow change to a location, all related dimension table rows must also undergo Type 2 changes.

My preference is to allow the attributes to repeat in multiple tables, and to increase the QA focus on related ETL programs.

Thanks to Robert for the question.  If you have one of your own, send it to the address in the sidebar.

-Chris
Image credit: Gravityx9 licensed under Creative Commons 2.0