Showing posts with label Surrogate Keys.

Wednesday, September 25, 2013

Optimizing warehouse data for business analytics

Business analytics often integrate information from your data warehouse with other sources of data. This post looks at the best practices of warehouse design that make this possible.

I receive a lot of questions regarding the best way to structure warehouse data to support an analytics program. The answer is simple: follow the same best practices you've already learned.

I'll cover these practices from a dimensional modeling perspective. Keep in mind that they apply in any data warehouse, including those modeled in third normal form.


1. Store Granular Facts

Analytic modelers often choose sources external to the data warehouse, even when the warehouse seems to contain relevant data. The number one reason for this is insufficient detail. The warehouse contains summarized data; the analytic model requires detail.

In this situation, the analytic modeler has no choice but to look elsewhere.  Worse, she may be forced to build redundant processes to transform source data and compile history. Luckily, this is not a failure of warehouse design principles; it's a failure to follow standard best practices.

Best practices of dimensional design dictate that we set the grain of base fact tables at the lowest level of detail possible. Need a daily summary of sales? Store the individual order lines. Asked to track the cost of trips? Store detail about each leg.

Dimensional solutions can contain summarized data. This takes the form of cubes, aggregates, or derived schemas. But these summaries should be derived exclusively from detailed data that also lives in the warehouse.
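As a sketch of this principle, here is a minimal example (the table and column names are hypothetical) that stores facts at order-line grain and derives a daily summary from them:

```python
import sqlite3

# Hypothetical granular fact table: one row per order line.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE order_facts (
        order_id INTEGER, line_no INTEGER,
        day_key INTEGER, product_key INTEGER,
        quantity INTEGER, order_dollars REAL
    )
""")
conn.executemany(
    "INSERT INTO order_facts VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1001, 1, 20130925, 1, 2, 20.00),
        (1001, 2, 20130925, 2, 1, 5.50),
        (1002, 1, 20130925, 1, 1, 10.00),
    ],
)

# The daily summary is derived from the granular facts; it never
# replaces them.
daily = conn.execute("""
    SELECT day_key, SUM(order_dollars) AS total_dollars
    FROM order_facts
    GROUP BY day_key
""").fetchall()
```

Because the order lines are retained, the summary can always be rebuilt, and new questions ("sales by product," "sales by order") can be answered from the same base table.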

Like all rules, this rule has exceptions. There are times when the cost/benefit calculus is such that it doesn't make sense to house highly granular data indefinitely. But more often than not, summary data is stored simply because basic best practices were not followed.

2. Build “Wide” Dimensions

The more attributes there are in your reference data (aka dimensions), the more useful source material there is for analytic discovery. So build dimensions that are full of attributes, as many as you can find.

If the grain of your fact table gives the analytics team “observations” to work on, the dimensions give them “variables.”  And the more variables there are, the better the odds of finding useful associations, correlations, or influences.

Luckily, this too is already a best practice. Unfortunately, it is one that is often misunderstood and violated. Misguided modelers frequently break things down into only the essential pieces, or model only to specific requirements.

3. Track Changes to Reference Data (and Use Effective Dating)

When reference data changes, too many dimensional models simply overwrite the corresponding dimension rows, because it is easier.

For example, suppose your company re-brands a product. It's still the same product, but with a new name. You may be tempted to simply update the reference data in your data warehouse. This is easier than tracking changes.  It may even seem to make business sense, because 90% of your reports require this-year-versus-last comparison by product name.

Unfortunately, some very important analysis may require understanding how consumer behavior correlates with the product name. You've lost this in your data set. Best practices help avoid these problems.

Dimensional models should track the change history of reference data. In dimensional speak, this means applying type 2 slow changes as a rule. This preserves the historic context of every fact recorded in the fact table.

In addition, every row in a dimension table should track "effective" and "expiration" dates, as well as a flag indicating which rows are current. This enables the delivery of type 1 behavior (the current value) even as we store type 2 behavior. From an analytic perspective, it also enables useful "what if" analysis.
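This combination of effective dating and a current-row flag can be sketched as follows (the table, column names, and data are hypothetical):

```python
import sqlite3

# Hypothetical product dimension with type 2 change tracking:
# a surrogate key, the natural key, effective/expiration dates,
# and a current-row flag.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE product_dim (
        product_key INTEGER PRIMARY KEY,   -- surrogate key
        product_id  TEXT,                  -- natural key from source
        product_name TEXT,
        effective_date TEXT,
        expiration_date TEXT,
        is_current INTEGER
    )
""")
conn.executemany(
    "INSERT INTO product_dim VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, "P-100", "Old Brand", "2012-01-01", "2013-06-30", 0),
        (2, "P-100", "New Brand", "2013-07-01", "9999-12-31", 1),
    ],
)

# Type 1 style question: what is the product called now?
current = conn.execute(
    "SELECT product_name FROM product_dim "
    "WHERE product_id = 'P-100' AND is_current = 1"
).fetchone()[0]

# Historic ("what if") question: what was it called on a given date?
as_of = conn.execute(
    """SELECT product_name FROM product_dim
       WHERE product_id = 'P-100'
         AND effective_date <= '2013-03-15'
         AND expiration_date >= '2013-03-15'"""
).fetchone()[0]
```

The same dimension answers both questions: the flag delivers type 1 behavior, while the date range preserves the full type 2 history.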

As with all rules, again there are exceptions. In some cases, there may be good reason not to respond to changes in reference data by tracking history. But more often than not, type 1 responses are chosen for the wrong reason: because they are easier to implement.

4. Record Identifying Information, Including Alternate Identifiers

Good dimensional models allow us to trace back to the original source data. To do this, include transaction identifiers (real or manufactured) in fact tables, and maintain identifiers from source systems in dimension tables (these are called "natural keys").

Some of this is just plain necessary in order to get a dimensional schema loaded. For example, if we are tracking changes to a product name in a dimension, we may have multiple rows for a given product. The product's identifier is not a unique identifier, but we must have access to it. If we don't, it would become impossible to load a fact into the fact table.
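A minimal sketch of that lookup during the fact load (all names here are invented for illustration): the natural key from the source transaction is translated into the current surrogate key.

```python
# Hypothetical key map built from the dimension table: natural key
# of each product -> surrogate key of its *current* dimension row.
key_map = {
    "P-100": 2,
    "P-200": 3,
}

def load_fact(source_row):
    """Translate the source's natural key into a surrogate key."""
    product_id = source_row["product_id"]
    if product_id not in key_map:
        # Without the natural key in the dimension, this fact
        # could not be loaded at all.
        raise KeyError(f"No dimension row for natural key {product_id}")
    return {
        "product_key": key_map[product_id],
        "order_dollars": source_row["order_dollars"],
    }

fact = load_fact({"product_id": "P-100", "order_dollars": 9.99})
```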

Identifying information is also essential for business analytics. Data from the warehouse is likely to be combined with data that comes from other places. These identifiers are the connectors that allow analytic modelers to do this.  Without them, it may become necessary to bypass the warehouse.

Your analytic efforts, however, may require blending new data with your enterprise data. And that new data may not come with handy identifiers. You have a better chance blending it with enterprise data if your warehouse also includes alternate identifiers, which can be used to do matching. Include things like phone numbers, email addresses, geographic coordinates—anything that will give the analytics effort a fighting chance of linking up data sources.

Summary

If you've been following the best practices of dimensional modeling, you've produced an asset that maximizes value for analytic modelers:

  • You have granular, detailed event data.
  • You have rich, detailed reference data.
  • You are tracking and time-stamping changes to reference data. 
  • You've got transaction identifiers, business keys, and alternate identifiers.  

It also goes without saying that conformed dimensions are crucial if you hope to sustain a program of business analytics.

Of course, there are other considerations that may cause an analytic modeler to turn her back on the data warehouse. Latency issues, for example, may steer her to operational solutions. Accessibility and procedural issues, too, may get in the way of the analytic process.

But from a database design perspective, the message is simple: follow those best practices!

Further Reading

You can also read more in my book, Star Schema: The Complete Reference.  If you use the links on this page to pick up a copy on Amazon, you will be helping support this blog.  It covers the best practices of dimensional design in depth. For example:

  • Grain, identifiers, keys and basic slow change techniques are covered in Chapter 3, "Stars and Cubes"
  • The place of summary data is covered in Chapter 14, "Derived Schemas" and Chapter 15, "Aggregates"
  • Conformance is covered in Chapter 5, "Conformed Dimensions"
  • Advanced slow change techniques are explored in Chapter 8, "More Slow Change Techniques"

Wednesday, September 21, 2011

Avoid Surrogate Keys for Fact Tables


I am often asked for feedback on designs that include a "surrogate key for the fact table."  There are two common reasons for which they are proposed; both have better alternatives. 

Surrogate keys are for dimension tables

A surrogate key is an attribute that is created to uniquely identify rows in a dimension table. It does not come from a source system; it is created expressly for the dimensional schema.

Surrogate keys for dimension tables serve two important purposes:
  1. They make it easier to track history.  They allow the dimension to capture changes, even if the source does not.  Absent a surrogate key, this would be difficult; the primary key of the dimension would be a concatenation of natural keys and type 2 attributes.
  2. They make it easy to join to the dimension.  The dimensions' surrogate keys appear in fact tables as foreign keys.  They allow the fact to be joined to the appropriate dimension values, without having to use a multi-part key.
A fact table does not require a manufactured data element for either of these purposes.

Tracking change history of facts?  Use a log.

When the value of a fact can change, a surrogate key for the fact table might be proposed. This would theoretically allow the fact table to record the change history of facts, in the same way that a dimension table does.

Unfortunately, storing the history of facts in this manner destroys the usability of the star. The additive properties of the facts are lost, and it becomes much harder to use.

If the facts can change, the fact table should be updated.

To track the history of facts, use an audit table as described in a previous post.  This table can log historic values, or can store the changes as "deltas" in a dimensional format.
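The "deltas" idea can be sketched as follows. This is a hypothetical illustration, not the design from the earlier post; the names and structure are invented:

```python
# When a fact is corrected in place, record the difference in an
# audit log rather than versioning the fact row itself.  The star
# stays additive; the log preserves history.
fact_row = {"order_id": 1001, "order_dollars": 100.0}
audit_log = []

def update_fact(fact, new_value, change_date):
    """Overwrite the fact, logging the delta for audit purposes."""
    delta = new_value - fact["order_dollars"]
    audit_log.append({
        "order_id": fact["order_id"],
        "change_date": change_date,
        "delta_dollars": delta,
    })
    fact["order_dollars"] = new_value

# A correction arrives: the order was actually $90, not $100.
update_fact(fact_row, 90.0, "2011-09-21")
```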

Joining to other fact tables?  Drill across.

The other reason surrogate keys are proposed for fact tables is that it will make them "easier to join." 

Joining fact tables to dimensions is easy; the fact table already contains foreign keys that reference the surrogate keys in dimension tables.  But what about joining to other fact tables?

Because they contain facts meant to be aggregated, fact tables should never be joined to one another. Otherwise, double-counting will ensue.

Facts from multiple stars should be combined by drilling across, as described in a previous post.  Don't try to merge them by joining fact tables.
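A minimal drill-across sketch (the schema and names are hypothetical): aggregate each fact table separately to a common grain, then merge the two result sets, rather than joining fact table to fact table.

```python
import sqlite3

# Two hypothetical stars sharing a conformed product dimension.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales_facts   (product_key INTEGER, order_dollars REAL);
    CREATE TABLE returns_facts (product_key INTEGER, return_dollars REAL);
    INSERT INTO sales_facts   VALUES (1, 100.0), (1, 50.0), (2, 25.0);
    INSERT INTO returns_facts VALUES (1, 10.0);
""")

# Each star is aggregated on its own, then the results are merged
# on the conformed dimension -- no fact-to-fact join, so no
# double-counting of the repeated sales rows.
result = conn.execute("""
    SELECT s.product_key, s.sales, COALESCE(r.returns, 0) AS returns
    FROM (SELECT product_key, SUM(order_dollars) AS sales
          FROM sales_facts GROUP BY product_key) s
    LEFT JOIN (SELECT product_key, SUM(return_dollars) AS returns
               FROM returns_facts GROUP BY product_key) r
      ON s.product_key = r.product_key
    ORDER BY s.product_key
""").fetchall()
```

Had the two fact tables been joined directly, product 1's single return row would have matched both of its sales rows, inflating return dollars.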

If you're not after facts with this proposed fact-to-fact join, you must be after dimension values.  If this is the case, carry forward any/all useful dimensions to other stars in the value chain.

More info

To learn more about surrogate keys, tracking the change history of facts, and drilling across, see the related posts on this blog.
If you find this blog helpful, please consider picking up a copy of my book, Star Schema: The Complete Reference.


Friday, August 20, 2010

Surrogate Keys in a CIF

In this Q&A, two more questions about surrogate keys.  Here's the first:
Q:   In a Corporate Information Factory architecture, if the EDW contains a change history for every table, do data marts need surrogate keys?
- Los Angeles, CA
A:  Yes. 

This is similar to a question from last year, in which change history was tracked in an operational system. This time, we're looking at an EDW tracking changes.

First, a little background on the CIF for the uninitiated.

In a Corporate Information Factory (CIF) the Enterprise Data Warehouse (EDW) is an integrated repository of atomic data.  The tables in this repository are normalized.  Data marts take information from this repository and structure it in a dimensional format for departmental access.

Dimension tables are not normalized.  In a CIF, a given dimension may draw data from many tables in the EDW.

For example, a Company dimension might pull data in from a dozen or more normalized tables that include the information on the company, its locations, accounts, lines of business, market data and contact information. 

If the dimension relied exclusively on the unique identifiers of these tables, its primary key would be a multi-part key with numerous components.

And if, as you say, your EDW tracks change history to each table, this multi-part key may be compounded by version numbers or date stamps from each source table. This is impractical.  (It also makes it difficult to support type 1 responses in the data marts.)

A related question sometimes comes up in my classes:

Q:  Can I maintain surrogate keys in the EDW?

A:  No.  

Most dimensions do not correspond to a single entity in the EDW, which is normalized, so there is nowhere to store and maintain surrogate keys.  Slow change processing and surrogate key management apply to the dimensional representation of data.

You can maintain a central repository of conformed dimensions, and you can even co-locate them with your EDW.  In fact, this is a good idea.  These master dimension tables can be replicated to your data marts, in whole or in part.

This ensures compatibility across subject areas.  It is a common misconception that conformed dimensions are only important in Kimball's bus architecture.  Even in a CIF, you may need to combine information from multiple data marts.

- Chris

Image:  "Keys" by Takacsi75, licensed under Creative Commons 2.0.

Wednesday, May 26, 2010

Avoid Using Dates as Keys

This post explores the impact of using a date as the primary key for a dimension table. 

Q: I'd like to hear your thoughts on using dates as keys.  Is it really necessary to have a surrogate key for a date dimension?

Hartford, CT

A:  It is not absolutely necessary.  But you should use a surrogate key anyway.  Here's why.

Suppose you have a DAY dimension which contains one row for each calendar date.  In addition to the full date, it contains columns that represent the year, quarter, month, day of week, holiday status, and so forth.  It also contains columns that represent frequently used combinations of these various attributes.

Users (or developers) can use the columns to do things like filter queries, aggregate facts, sort results, and drive subtotaling.  Database administrators can index these columns to optimize these activities.

You might be tempted to use the full date as the primary key of the table.  This urge is understandable -- why bother creating an additional column called DAY_KEY?  Seems like extra work for nothing, right?

Consistency and Performance Issues

Remember that the primary key of a dimension table will also appear in fact tables.  If a date is used, users or developers will eventually try using it as a "shortcut" in queries, avoiding a join to the DAY table.

This leads to inconsistencies.  Developers may mean well, thinking they are "saving a join."  But they may apply different rules to compute things like fiscal periods.

Use of a day table guarantees consistent computation of all the characteristics of each day.  All the columns are computed exactly once, during the ETL process.  Everyone who uses them gets consistent results.

Moreover, avoiding the DAY table may generate performance issues.  If a report developer wants to filter for a particular day of the week, they may apply SQL date comparison functions to the day column in the fact table.  The database will probably have to make this comparison for each row in the fact table.

A day of the week in the DAY table is far more efficient, because it is pre-computed and indexed.
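The difference can be sketched as follows (table and column names are hypothetical): the day-of-week value is computed once during ETL and stored in the DAY dimension, so queries filter on a plain, indexable column instead of applying a date function to every fact row.

```python
import sqlite3

# Hypothetical DAY dimension with pre-computed attributes, plus a
# fact table that references it by surrogate key.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE day_dim (
        day_key     INTEGER PRIMARY KEY,   -- surrogate key
        full_date   TEXT,
        day_of_week TEXT                   -- computed once, during ETL
    );
    CREATE TABLE sales_facts (day_key INTEGER, order_dollars REAL);
    INSERT INTO day_dim VALUES
        (1, '2010-05-24', 'Monday'),
        (2, '2010-05-25', 'Tuesday'),
        (3, '2010-05-31', 'Monday');
    INSERT INTO sales_facts VALUES (1, 100.0), (2, 40.0), (3, 60.0);
""")

# Filter on the pre-computed, indexable column -- consistent for
# every user, and no per-row date arithmetic on the fact table.
monday_sales = conn.execute("""
    SELECT SUM(f.order_dollars)
    FROM sales_facts f
    JOIN day_dim d ON f.day_key = d.day_key
    WHERE d.day_of_week = 'Monday'
""").fetchone()[0]
```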

You may think you can prevent these mishaps.  But you cannot.  Sooner or later, they will occur. Dates in the fact table will beckon.  A user, developer, or consultant will try to take a shortcut.  Perhaps under your watch, or perhaps after you have gone.  

Guarantee consistency and performance by using a surrogate key.

If you are curious, read on for some other, less important reasons.

Tuesday, September 8, 2009

More on Surrogate Keys

This post addresses a common follow-up question regarding surrogate keys:
"My source system already tracks changes. In this case, do dimension tables really need their own unique identifiers?"
Yes. There may be exceptions, but they are far less common than you may think, and they limit your future options.

Wednesday, May 20, 2009

Do I really need Surrogate Keys?

Here is the #1 most frequently asked question I receive:
Q: Do I really need surrogate keys?
A: You absolutely must have a unique identifier for each dimension table, one that does not come from a source system. A surrogate key is the best way to handle this, but there are other possibilities.

The case for the surrogate key is entirely pragmatic. Read on for a full explanation.

Dimensions Need their Own Unique Identifier

It is crucial that the dimensional schema be able to handle changes to source data in whatever manner makes most sense from an analytic perspective. This may be different from how the change is handled in the source.

For this reason, every dimension table needs its own unique identifier -- not one that comes from a source system. (For more on handling changes, start with this post.)

A surrogate key makes the best unique identifier. It is simply an integer value, holding no meaning for end users. A single, compact column, it keeps fact table rows small and makes SQL easy to read and write. It is simple to manage during the ETL process.
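A minimal sketch of what this looks like during ETL (the names are hypothetical): one integer sequence assigns surrogate keys, regardless of the natural key's structure.

```python
from itertools import count

# One sequence for the whole dimension; the natural key is kept as
# an ordinary attribute, not as the primary key.
key_sequence = count(start=1)
dimension_rows = []

def add_dimension_row(customer_id, attributes):
    """Insert a new dimension row with the next surrogate key."""
    row = {
        "customer_key": next(key_sequence),  # surrogate key
        "customer_id": customer_id,          # natural key, retained
        **attributes,
    }
    dimension_rows.append(row)
    return row["customer_key"]

k1 = add_dimension_row("C-001", {"name": "Acme"})
# A type 2 change: same natural key, new row, new surrogate key.
k2 = add_dimension_row("C-001", {"name": "Acme Corp"})
```

Note how the type 2 change requires nothing more than handing out the next integer; no per-natural-key sequence, no multi-part key in the fact table.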

Compound Keys Work, But Why Take that Route?

An alternative is to supplement a unique identifier from the source system (also known as a natural key) with a sequence number. This results in a compound key, or multi-part key.

This kind of compound key also allows the dimension to handle changes differently than the source does. But there are several disadvantages:
  • Fact table rows become larger, as they must include the multi-part key for each dimension
  • The ETL process must manage a sequence for each natural key value, rather than a single sequence for a surrogate key
  • SQL becomes more difficult to read, write or debug
  • Multi-part dimension keys can disrupt star join optimizers
So: a compound key takes more space, is not any easier, and may disrupt performance. Why bother?

(By the way, many source system identifiers are already compound keys. Adding a sequence number will make them even larger!)

Sometimes it is suggested that the natural key be supplemented with a date, rather than a sequence. This may simplify ETL slightly, but the rest of the drawbacks remain. Plus, dates are even bigger than a sequence number. Worse, the date will appear in the fact table. That is sure to lead to trouble!

(This is not to say that datestamps on dimension rows are a bad thing. To the contrary, they can be quite useful. They just don't work well as part of a compound key.)

That's the basic answer. I'll respond to some common follow up questions over the coming weeks.

- Chris

Tuesday, May 1, 2007

10 Things You Should Know About that Sample Star Schema

Today, many of us learn about the star schema by studying a sample database that comes with a software product, usually one that covers sales or orders. Here are 10 terms and principles of dimensional modeling to go with that sample schema you've worked with.

The star schema has become a de facto standard for the design of analytic databases. Sample stars are often included with RDBMS software, BI Tools and ETL tools. They are also used for tutorials and training. Almost universally, the sample schema describes a sales or order taking process, similar to the one depicted in the figure below:

Figure 1: A demo schema usually represents orders or sales.

You may have learned about the Star Schema by working with a sample like this one. If so, you probably have an intuitive grasp of star schema design principles. Here are ten terms and principles you should know that describe important features of the sample star.

Most of this is probably readily apparent if you've worked with a sample schema—what may be new is the terminology. The first two you probably know:

  1. Facts are measurements that describe a business process. They are almost always numeric—but not all numeric attributes are facts. You can find facts (or measurements) in almost any analytic request—"Show me sales dollars by product" (sales dollars). "How many widgets were sold by John Smith in May?" (quantity ordered). There are some schemas that do not include facts—we'll look at those in another post.

  2. Dimensions give facts context. They may be textual or numeric. They are used to specify how facts are "filtered" and "broken out" on reports. You can usually find dimensions after the words "by" or "for" in an analytic request. "Show me sales dollars by product" (product). "What are margin dollars by Month and Salesperson?" (month, sales rep).

  3. Dimension tables are wide. Dimension tables usually group together a set of related dimension attributes, though there are situations where a dimension may include a set of attributes not related to one another. Dimension tables are not normalized, and usually have a lot of attributes—far more than appear in most sample schemas. This allows a rich set of detail to be used in analyzing facts. 100 or more columns is not uncommon for some dimensions. For this reason, we often call dimension tables wide.

  4. Dimensions have Surrogate Keys. The primary key for each dimension table is an attribute specifically created for the dimensional schema. It is an integer assigned by the ETL process, and has no inherent meaning. It is not a reused key from a source system, such as a customer ID or product code. We call these attributes natural keys, and they may exist in the star, but do not serve as unique identifiers.

    In the sample schema, customer_key is a surrogate key generated for the star schema; customer_id is a natural key carried over from a source system.

    By assigning surrogate keys, we enable the star to handle changes to source data differently than the source system does. For example, in a source system a customer record may be overwritten, while we want the star schema to track changes. Performance considerations also come into play—a surrogate key avoids the need for multi-column joins.

  5. Type 2 Changes track history. The term "Slowly Changing Dimension" (or SCD) describes how the data warehouse responds to changes in the source of dimensional data. There are several techniques that can be applied when the source of dimension detail changes. The most common is referred to as a "Type 2" change: an entirely new record is written to the dimension table.

    For example, if a customer moves, the record may simply be updated in a source system. But in the star schema, we choose to add a new row to the customer dimension, complete with a new surrogate key. All prior facts remain associated with the "old" customer record; all future facts will be associated with the new record.

  6. Type 1 Changes overwrite history. The Type 1 change is used when source data changes are not deemed significant, or may be the correction of an error. In such cases, we perform an update to an existing row in a dimension. For example, if a customer's gender is updated in the source, we may choose to update it in the corresponding dimension records. All prior facts are now associated with the changed value.

    In addition to Type 1 and Type 2 changes, there are other SCD techniques. Hybrid approaches exist as well. Every design should identify which technique(s) will be used for each attribute of each dimension table.

  7. Fact tables are narrow. A fact table row is usually entirely composed of numeric attributes: the facts, and foreign key references to the dimensions. Because of these characteristics, each fact table row is narrow, at least in contrast with wide dimension rows full of textual values. The narrowness of fact tables is important, because they will accumulate far more rows than dimension tables, and at a much faster rate.

  8. Fact tables are usually sparse. Rows are recorded in the fact table only when there is something to measure. For example, not every customer orders every product from every salesperson each day. Rows are only recorded when there is an order. This helps manage the growth of the fact table. It also saves us from having to filter out a huge number of rows that have no sales dollars when displaying results in a report. (Usually, you don't want a customer sales report to list every product—only the ones they bought. You can use an outer join when you do want every product listed.)

  9. Fact Table Grain. The level of detail represented by a row in a fact table is referred to as its grain. Facts that are recorded with different levels of detail belong in separate fact tables. This avoids an array of reporting difficulties, as well as kludges such as including special rows in dimension tables for "not applicable." Determining the grain of a fact table is an important design step and helps avoid future confusion. (There are times when "not applicable" attributes are necessary, but they are most often a sign of the need for another fact table.)

    In the example, the grain is sales by customer, product, salesperson and date. A better design might capture sales at the order line level of detail.

  10. Additivity. Facts are usually additive. This means they can be summed across any dimension value. For example, order_dollars can be aggregated across customers, products, salespeople, or time periods, producing meaningful results. Additive facts are stored in the fact table. We also store additive facts that might be computed from other facts. (order_dollars might be the sum of extended_cost and margin_dollars, but why include only two out of the three?)

    Some facts are non-additive. For example, margin rate is a percentage. Two sales at 50% margin do not equate to a single sale at 100% margin—this fact is not additive. In the star, we store the fully additive components of margin (order_dollars and margin_dollars) and let front end tools compute the ratio. There are also semi-additive facts, which we will look at in the next post.
Most of these terms and principles can be learned by working with a sample schema. But there are many important principles that the typical "Sales" model does not reveal. In a future post, I'll look at the top 10 things the demo schema does not teach you.
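The additivity point (number 10 above) can be sketched numerically; the figures here are invented. Sum the additive components first, then compute the ratio:

```python
# Two hypothetical sales rows storing only additive facts.
rows = [
    {"order_dollars": 200.0, "margin_dollars": 100.0},  # 50% margin
    {"order_dollars": 100.0, "margin_dollars": 100.0},  # 100% margin
]

total_order = sum(r["order_dollars"] for r in rows)
total_margin = sum(r["margin_dollars"] for r in rows)

# Correct overall margin rate: ratio of the summed additive facts.
margin_rate = total_margin / total_order

# Wrong: averaging the per-row rates treats the ratio as additive,
# and overstates the result.
naive_rate = sum(
    r["margin_dollars"] / r["order_dollars"] for r in rows
) / len(rows)
```

The correct rate is 200/300 (about 67%), while averaging the per-row rates yields 75%, which is why the star stores the additive components and leaves the ratio to the front end.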

Related Posts:

Ten Things You Won't Learn from that Demo Schema

© 2007 Chris Adamson