Monday, November 5, 2012

Name Value Pairs and Dimensional Models

A reader asks how name/value pairs are best translated into a dimension design.
I was wondering if you have ever had to setup a data warehouse with a source system that has the ability to have dynamic "virtual" columns using a set of name/value pair tables...If you have done this or have any cool ideas on how to tackle this setup in a dimension that would be greatly appreciated.
-Joshua
Belmont, CA

This is a common scenario. The preferred solution is to identify the most commonly used "virtual columns" and model them as standard attributes. Bridged solutions are also possible, but are much less flexible from a usability standpoint.

Name Value Pairs

Name/value pairs are often used in OLTP solutions to offer a flexible way to describe an entity.

Rather than store each characteristic of an entity as an attribute, a special table lets you identify unlimited attributes for something. Its main columns capture the name of the attribute and its value. (Hence the moniker "name/value pair".)

For example, suppose you are Apple Inc. Your OLTP system has a product table, but each product has a wide variety of attributes. Rather than model each attribute, you might capture them in a name/value pair table.

The rows in the table for the newly released iPad Mini might look like this:

As you can see, the 18 rows above describe one iPad Mini model. 

Elsewhere in the data model, and associative table (or "intersect table") cross references these 18 pairs with a single row in the product table for this item.

Apple has 23 other iPad Mini models, each with different of values for the capacity attribute, color attribute, carrier attribute, etc. 

In addition, this same table can be used to describe the characteristics of products with different named attributes - a phone, a computer, a notebook, and so forth. New attributes do not require changing the database design. Instead, all that is necessary is to insert the appropriate rows.

In a dimensional model, there are several ways to handle name/value sources. Each involves tradeoffs between flexibility and usability.

Flattened Solution is Most Usable

The preferred method is to model the "names" as explicit dimension attributes.1

In the case of our product, for example, the resultant dimensional model might nook like this:

This approach produces the most understandable and usable dimensional model. Each important characteristic is present in the dimension, named according to business terms, and ready to provide context for facts.

Do not be discouraged if every instance of the source entity (product in this case) does not share the same set of attributes in the name/value pair table. Look for 100 or so attributes that are most commonly used. In many cases, this may satisfy most of your analytic requirements.

Core and Custom Dimensions

If there are simply too many names to flatten into a single dimension table, the next step is to consider "core and custom" models.

If you separate the entity by "types", can you find 100 named attributes for each type?  

If so:
  • Build a single "core" dimension that contains an identifier and any attributes common across all types, along with a surrogate key.
  • For each type, build an additional dimension that replicates the common attributes, plus contains the 100 or so attribute names specific to the type
  • The core and custom versions should use the same surrogate key domain for each member of the dimension, so that they can be swapped in and out of queries as required.
For example, Apple might do the following:
  • Build a single product table that has the part number, category, product name and a key. This table will have a row for all products, including computers, tablets, phones and music players. This is the "core" product table. 
  • Build additional "custom" dimensions would be built for each category of product. The table pictured in the previous section might be the custom dimension for tablets.
This solution retains the understandability of the previous solution - each attribute is clearly and explicitly named. However, we must now take care to join the appropriate version of the dimension to fact tables - a technical consideration based on the business question being asked.

Bridged Solution Mirrors OLTP Design

The last option is to employ a bridged solution. This technique actually mimics the original OLTP design. An outrigger contains the name/value pairs, and a bridge associates it with the dimension.

For example, Apple's product dimension might be bridged as follows:

This solution has the benefit of providing access to the full range of named attributes. It is also flexible; new named attributes do not require any change to the schema design.2

The disadvantage here is that this solution is the most difficult to use. Facts can be easily double counted, triple counted, etc. It will be necessary to take great care in constructing and qualifying queries, and it may also be necessary to lock down "grand total" functionality in end-user reporting tools.

Both Ways

Remember that this is not an either/or solution. If a bridged outrigger is necessary, consider also capturing the 100 most commonly used attributes in the dimension itself.

This allows you to create a safe "sandbox" for less technically sophisticated users. They are able to access much of the analytic value of your solution, without having to expose them to the complexity and possible dangers of the bridge table.

More Reading
  • Bridges: Much has been written in this blog about bridge tables. The most relevant entry if you want to learn more is Resolve Repeating Attributes With A Bridge Table (1/28/11).  There are also two full chapters in my latest book (see below.)
  • Core and custom dimensions have not been previously discussed here, but you can learn more about them in Star Schema: The Complete Reference.  It dedicates an entire chapter to "Type-Specific Stars."
Help support this blog

This blog is ad-free, and no one pays me to write it.  You can help out:  
  • Use any of the links on these pages to pick up my latest book, Star Schema: The Complete Reference.  
  • Or, if you already have it, use the links to check out any of the other recommended books.

When you do, a small portion of the purchase goes to this blog (you do not pay any extra).


Notes:

1I'll call these "names" rather than "pairs", since its the name that will translate into a column. For example, "color" is a name, it may participate in several name/value pairs -- one for each possible color.

2The bridge table in this diagram contains product and attribute keys. If there will be many combinations, dimensional modelers often replace the product key with a group key, and store the group key in the dimension table. See Resolve Repeating Attributes With A Bridge Table (1/28/11).