Thursday, April 26, 2012

Q&A: Human resources data marts


A reader asks if Human Resources data marts are inherently complex. I run down a list of dimensional techniques he should expect to find:
Q: I will be working on a data mart design project to design star schemas for human resources data. I heard that HR data is more complex than sales or marketing and special techniques need to be applied.

I looked at the star schemas of pre-built analytical applications developed by some vendors for our packaged HR solution. I felt that they are quite complex and just wondering star design for HR data should be so complex.

If possible, can you please discuss this topic in a detailed manner by considering any one of popular HRMS system data and the most common data/reporting requirements along with the design discussion to achieve the star for those reports using the given HRMS data?

- Venkat, UK

A: Human Resources applications do indeed tend to use advanced techniques in dimensional design.

Below, I run down a list of topics you will probably need to brush up on. In reality, every subject area requires complete mastery of dimensional modeling, not just the basics.

Note that the complexity you are seeing in packaged solutions may stem from the subject area. Vendors often produce abstracted models to facilitate customization.

Techniques used in HR data marts

No doubt you are accustomed to the transaction-grained stars you encountered in sales.  You will find them in HR as well, but you will also encounter these:
  • Snapshot stars sample one or more metrics at pre-defined intervals.

    In an HR data mart, these may be used to track various kinds of accruals, balances in benefit programs, etc.

  • Accumulating snapshot stars track dimension members through a business process and allow analysis of the elapsed time between milestones.

    These may be used to track the filling of a position, "on-boarding" processes, disciplinary procedures, or applications to benefit programs.

  • Factless fact tables track business processes where the primary metric is the occurrence of an event. They contain no facts, but are used to count rows.

    These are likely to be used for tracking attendance or absence, participation in training courses, etc.

  • Coverage stars are factless fact tables that model conditions.  These are usually in place to support comparison to activities represented in other stars, but may also be leveraged to capture key relationships among dimensions.

    These are likely to be used for linking employees to positions, departments and managers. 
Your dimensions will also require reaching beyond the basics:
  • Transaction dimensions capture the effective and expiration date/time for each row in a dimension table.  These are advisable in almost any situation.

    In HR they may be used to  track changes in an employee dimension.

  • Bridge tables for Multi-valued attributes allow you to associate a repeating attribute with a dimension.

    In HR, these are likely to be used to associate an employee with skills, languages, and other important characteristics.

  • Hierarchy bridge tables allow you to aggregate facts through a recursive hierarchy.

    In HR, these are used to navigate reporting structures (employees report to employees, who in turn report to other employees, and so forth) as well as organizational structures.
I would also expect to encounter some complexity in slow-change processing rules.  Human Resource systems carefully audit certain kinds of data changes, tracking the reason for each change. As a result, you may have attributes in your dimension schema that may exhibit either type 1 or type 2 behavior, depending on the reason for the change.

Every schema goes beyond the basics

This list could go on, but I think you get the idea.

The only way to design a data mart that meets business needs is to have a well rounded understanding of the techniques of dimensional modeling.

You cannot get very far with nothing more than a grasp of the basics. This holds true in any subjet area -- even sales and marketing.  You need the complete toolbox to build a powerful business solution.

Packaged data marts

The complexity that concerns the reader may actually stem from another cause:  he is looking at packaged data mart solutions.

Packaged applications often introduce complexity for an entirely different reason: to support extensibility or customization. For example, facts may be stored row-wise rather than column-wise, and dimensions may contain generic attribute names.

Learn more

This blog contains posts on most of the topics listed above.  Click each header for a link to a related article.  Some have been discussed in multiple posts, but I have included only one link for each.  So also do some exploration.

In addition, please check out my book Star Schema: The Complete Reference.  When you purchase it from Amazon using the links on this page, you help support this blog.
  • Snapshots and accumulating snapshots are covered in Chapter 11, "Transactions, Snapshots and Accumulating Snapshots

  • Factless fact tables and coverage stars are covered in Chapter 12, "Factless Fact Tables"

  • Transaction dimensions are covered in Chapter 8, "More Slow Change Techniques"

  • Attribute bridges are covered in Chapter 9, "Multi-valued Dimensions and Bridges"

  • Hierarchy bridges are covered in Chapter 10, "Recursive Hierarchies and Bridges"
Thanks for the question!

- Chris

Send in your own questions to the address in the sidebar. 


Do you have another technique that was useful in an HR data mart? Use the comments.

Image credit: Gravityx9 licensed under Creative Commons 2.0

Friday, April 20, 2012

Big Data and Dimensional Modeling [Video]


I am often asked, "Is the dimensional model obsolete in the world of big data?"

I always respond with a question of my own: "Do you plan to stop measuring your business?"  Of course, the answer is always "no." 

Dimensional modeling is all about measurement, and so it will continue to have a place in our BI programs.

Here, I discuss the topic with Paul Kautza, Director of Education at The Data Warehousing Institute (TDWI).



In this interview, I talk about four ways we use dimensional models:
  1. To capture business requirements (e.g. measurement definitions)
  2. To define and manage a roadmap for a BI program (scope)
  3. As the basis for a database design (star schema, snowflake schema, cube)
  4. As a presentation layer for the business (semantic layer)
Developments in the world of big data may impact the third category, but not always.

Tuesday, April 17, 2012

Q&A: Degenerate Dimensions

A reader asks about degenerate dimensions:
Q: I understand that when a particular dimension value is unique to each fact row (i.e., an invoice number perhaps) it should probably exist as a degenerate dimension in the fact table.  Otherwise, that dimension table will contain as many rows as the fact table. 

However, what about a situation in which a dimension table contains only one column aside from the surrogate key (i.e., a status of some sort) but is not unique to each fact row.  In that case, should it be moved into the fact table as a degenerate dimension?

- Brian

Both cases are good applications for a degenerate dimension: a transaction identifier or a low-cardinality attribute that would otherwise need to be a very boring single-column table.

In all cases, the degenerate dimension is entirely optional.

Degenerate dimensions

A degenerate dimension is nothing more than a dimension attribute stored in a fact table, rather than a dimension table.

In all other respects, the degenerate dimension behaves exactly like any other dimension attribute.  It can be used to provide context for facts, qualify queries, sort results, drive master-detail relationships, and so forth.

The usual reason to create a degenerate dimension is simple:  A dimension with one attribute really doesn't need to be in a separate table with a surrogate key.

Two important things to note here:
  1. Degenerate dimensions are optional
  2. There is nothing wrong with a 1:1 relationship between dimension table and fact table
Lets take a closer look at two common uses of degenerate dimensions.

Transaction identifiers as degenerate dimensions

For a transaction fact table, we usually set the grain at the lowest level of detail possible. Often that is an individual transaction such as an order line, invoice line or shipping line.  (For a refresher on grain, see this post from 2009.)

Typically, these things have some kind of unique identifier which is carried into the dimensional model.  Examples include an order line number, invoice line ID, transaction ID, etc.  When these identifiers do not exist, we often create them, as I have discussed in a previous post.

Many designers opt to place these transaction identifiers directly into the fact table as a degenerate dimension.  However, this is entirely optional.  There is nothing wrong with a dimension table being in a 1:1 relationship with the fact table, as I have discussed previously. 

Some designers forgo the degenerate dimension for transaction identifiers because that there will be other schemas where the attribute in question is also referenced.  In many cases, our business intelligence software may function better if the attribute in question is located in exactly one place -- a shared dimension table. 

As an aside, note that in these other schemas, the cardinality may be different.  For example, the same invoice line may be referenced in multiple shipments or invoices.

Other degenerate dimensions

Not all degenerate dimensions are transaction identifiers.  They may simply be dimension attributes with no other suitable home.  The status code mentioned by the reader may be an example.  Another example appeared in a previous Q&A post: Reason Overpaid.

Once again, the use of the degenerate technique is entirely optional. Many designers are comfortable placing the attribute in the fact tables, while others opt for dimension tables with one attribute.

Notice that a dimension table with one attribute does not require a surrogate key for slow change purposes. However, without the surrogate key, the attribute value would be stored in the fact table anyway, as a foreign key reference to a single column table!

Placing the attribute in a separate table may help with our BI tools, for the same reason given above. Database administrators may also prefer the option because it makes it easier to configure a star-join optimizer.  Adding a surrogate key may reduce the size of the foreign key that will be stored in fact table rows.

Learn More

For more information on this topic, see the following posts:
  • Accumulating Snapshots (October 1, 2010)  Discusses another situation where dimension table may have the same number of rows as the fact table, or perhaps even less.
  • Rule 1: State Your Grain (December 9, 2009) Defines the concept of grain, and discusses the best practice of setting it at the lowest level of detail possible.
Of course, all these topics are also covered in Star Schema: The Complete Reference

You can help support this blog by using the links here to order a copy from Amazon.com.

Image licensed via Creative Commons 2.0
from Patrick Hoesley