
Wednesday, September 25, 2013

Optimizing warehouse data for business analytics

Business analytics often integrate information from your data warehouse with other sources of data. This post looks at the best practices of warehouse design that make this possible.

I receive a lot of questions regarding the best way to structure warehouse data to support an analytics program. The answer is simple: follow the same best practices you've already learned.

I'll cover these practices from a dimensional modeling perspective. Keep in mind that they apply in any data warehouse, including those modeled in third normal form.


1. Store Granular Facts

Analytic modelers often choose sources external to the data warehouse, even when the warehouse seems to contain relevant data. The number one reason for this is insufficient detail. The warehouse contains summarized data; the analytic model requires detail.

In this situation, the analytic modeler has no choice but to look elsewhere.  Worse, she may be forced to build redundant processes to transform source data and compile history. Luckily, this is not a failure of warehouse design principles; it's a failure to follow standard best practices.

Best practices of dimensional design dictate that we set the grain of base fact tables at the lowest level of detail possible. Need a daily summary of sales? Store the individual order lines. Asked to track the cost of trips? Store detail about each leg.

Dimensional solutions can contain summarized data. This takes the form of cubes, aggregates, or derived schemas. But these summaries should be derived exclusively from detailed data that also lives in the warehouse.
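
To make this concrete, here is a sketch of that derivation in SQL. The table and column names are hypothetical, and CREATE TABLE AS syntax varies by platform:

  -- Derive a daily sales aggregate from the granular fact table.
  -- Because the order-line detail is stored, this summary can be
  -- rebuilt or reshaped at any time.
  CREATE TABLE sales_day_agg AS
  SELECT
      day_key,
      product_key,
      SUM(quantity)      AS quantity_sold,
      SUM(order_dollars) AS order_dollars
  FROM order_line_facts      -- grain: one row per order line
  GROUP BY day_key, product_key;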

Like all rules, this rule has exceptions. There are times when the cost/benefit calculus is such that it doesn't make sense to house highly granular data indefinitely. But more often than not, summary data is stored simply because basic best practices were not followed.

2. Build “Wide” Dimensions

The more attributes there are in your reference data (aka dimensions), the more useful source material there is for analytic discovery. So build dimensions that are full of attributes,  as many as you can find.

If the grain of your fact table gives the analytics team “observations” to work on, the dimensions give them “variables.”  And the more variables there are, the better the odds of finding useful associations, correlations, or influences.

Luckily, this too is already a best practice. Unfortunately, it is one that is often misunderstood and violated. Misguided modelers frequently strip dimensions down to the essential attributes only, or model only to the specific requirements at hand.

3. Track Changes to Reference Data (and Use Effective Dating)

When reference data changes, too many dimensional models default to updating corresponding dimensions, because it is easier.

For example, suppose your company re-brands a product. It's still the same product, but with a new name. You may be tempted to simply update the reference data in your data warehouse. This is easier than tracking changes.  It may even seem to make business sense, because 90% of your reports require this-year-versus-last comparison by product name.

Unfortunately, some very important analysis may require understanding how consumer behavior correlates with the product name. You've lost this in your data set. Best practices help avoid these problems.

Dimensional models should track the change history of reference data. In dimensional speak, this means application of  type 2 slow changes as a rule. This preserves the historic context of every fact recorded in the fact table.

In addition, every row in a dimension table should track "effective" and "expiration" dates, as well as a flag that identifies which rows are current. This enables the delivery of type 1 behavior (the current value) even as we store type 2 history. From an analytic perspective, it also enables useful "what if" analysis.
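
As an illustration, the current-row flag can deliver type 1 behavior through something as simple as a view. This is a minimal sketch; the customer table and its is_current column are hypothetical:

  -- Expose only the current version of each customer.  Reports that
  -- need type 1 behavior query the view; analysis that needs historic
  -- context queries the base dimension with its type 2 rows.
  CREATE VIEW customer_current AS
  SELECT *
  FROM customer
  WHERE is_current = 'Y';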

As with all rules, again there are exceptions. In some cases, there may be good reason not to respond to changes in reference data by tracking history. But more often than not, type 1 responses are chosen for the wrong reason: because they are easier to implement.

4. Record Identifying Information, Including Alternate Identifiers

Good dimensional models allow us to trace back to the original source data. To do this, include transaction identifiers (real or manufactured) in fact tables, and maintain identifiers from source systems in dimension tables (these are called "natural keys").

Some of this is just plain necessary in order to get a dimensional schema loaded. For example, if we are tracking changes to a product name in a dimension, we may have multiple rows for a given product. The product's identifier no longer uniquely identifies a dimension row, but we must still have access to it; without it, it would be impossible to load a fact into the fact table.
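
For example, here is a hedged sketch of the surrogate key lookup an ETL process might perform while loading facts. The names are hypothetical, and the bind variables stand in for values arriving from the source system:

  -- Find the dimension row that was in effect when the transaction
  -- occurred.  Without the natural key (product_id) in the dimension,
  -- there would be no way to make this match.
  SELECT p.product_key
  FROM product p
  WHERE p.product_id = :source_product_id
    AND :transaction_date BETWEEN p.effective_date
                              AND p.expiration_date;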

Identifying information is also essential for business analytics. Data from the warehouse is likely to be combined with data that comes from other places. These identifiers are the connectors that allow analytic modelers to do this.  Without them, it may become necessary to bypass the warehouse.

Your analytic efforts, however, may require blending new data with your enterprise data. And that new data may not come with handy identifiers. You have a better chance blending it with enterprise data if your warehouse also includes alternate identifiers, which can be used to do matching. Include things like phone numbers, email addresses, geographic coordinates—anything that will give the analytics effort a fighting chance of linking up data sources.

Summary

If you've been following the best practices of dimensional modeling, you've produced an asset that maximizes value for analytic modelers:

  • You have granular, detailed event data.
  • You have rich, detailed reference data.
  • You are tracking and time-stamping changes to reference data. 
  • You've got transaction identifiers, business keys, and alternate identifiers.  

It also goes without saying that conformed dimensions are crucial if you hope to sustain a program of business analytics.

Of course, there are other considerations that may cause an analytic modeler to turn her back on the data warehouse. Latency issues, for example, may steer her to operational solutions. Accessibility and procedural issues, too, may get in the way of the analytic process.

But from a database design perspective, the message is simple: follow those best practices!

Further Reading

You can read more in prior posts on this blog. You can also read more in my book, Star Schema: The Complete Reference.  If you use the links on this page to pick up a copy on Amazon, you will be helping support this blog.

It covers the best practices of dimensional design in depth. For example:

  • Grain, identifiers, keys and basic slow change techniques are covered in Chapter 3, "Stars and Cubes"
  • The place of summary data is covered in Chapter 14, "Derived Schemas" and Chapter 15, "Aggregates"
  • Conformance is covered in Chapter 5, "Conformed Dimensions"
  • Advanced slow change techniques are explored in Chapter 8, "More Slow Change Techniques"

Thursday, April 26, 2012

Q&A: Human resources data marts


A reader asks if Human Resources data marts are inherently complex. I run down a list of dimensional techniques he should expect to find:
Q: I will be working on a data mart design project to design star schemas for human resources data. I have heard that HR data is more complex than sales or marketing data, and that special techniques need to be applied.

I looked at the star schemas of pre-built analytical applications developed by some vendors for our packaged HR solution. I felt that they were quite complex, and I am wondering whether star design for HR data needs to be so complex.

If possible, can you please discuss this topic in detail, using the data of one popular HRMS system and the most common reporting requirements, along with the design needed to achieve the stars for those reports?

- Venkat, UK

A: Human Resources applications do indeed tend to use advanced techniques in dimensional design.

Below, I run down a list of topics you will probably need to brush up on. In reality, every subject area requires complete mastery of dimensional modeling, not just the basics.

Note that the complexity you are seeing in packaged solutions may not stem from the subject area itself. Vendors often produce abstracted models to facilitate customization.

Techniques used in HR data marts

No doubt you are accustomed to the transaction-grained stars you encountered in sales.  You will find them in HR as well, but you will also encounter these:
  • Snapshot stars sample one or more metrics at pre-defined intervals.

    In an HR data mart, these may be used to track various kinds of accruals, balances in benefit programs, etc.

  • Accumulating snapshot stars track dimension members through a business process and allow analysis of the elapsed time between milestones.

    These may be used to track the filling of a position, "on-boarding" processes, disciplinary procedures, or applications to benefit programs.

  • Factless fact tables track business processes where the primary metric is the occurrence of an event. They contain no facts; instead, queries count their rows (a query sketch follows this list).

    These are likely to be used for tracking attendance or absence, participation in training courses, etc.

  • Coverage stars are factless fact tables that model conditions.  These are usually in place to support comparison to activities represented in other stars, but may also be leveraged to capture key relationships among dimensions.

    These are likely to be used for linking employees to positions, departments and managers. 
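
Here is the query sketch promised above, showing how a factless fact table is used. The attendance star and its names are hypothetical:

  -- The "measurement" is the count of rows, one per attendance event.
  SELECT d.month_name, c.course_name, COUNT(*) AS attendances
  FROM attendance_facts f
  JOIN day d    ON f.day_key = d.day_key
  JOIN course c ON f.course_key = c.course_key
  GROUP BY d.month_name, c.course_name;
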
Your dimensions will also require reaching beyond the basics:
  • Transaction dimensions capture the effective and expiration date/time for each row in a dimension table.  These are advisable in almost any situation.

    In HR they may be used to  track changes in an employee dimension.

  • Bridge tables for Multi-valued attributes allow you to associate a repeating attribute with a dimension.

    In HR, these are likely to be used to associate an employee with skills, languages, and other important characteristics.

  • Hierarchy bridge tables allow you to aggregate facts through a recursive hierarchy (see the sketch after this list).

    In HR, these are used to navigate reporting structures (employees report to employees, who in turn report to other employees, and so forth) as well as organizational structures.
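
And here is the hierarchy bridge sketch mentioned above, with hypothetical names. Assume the bridge holds one row for each manager/subordinate pair at any depth, including a self-referencing row for each manager:

  -- Total payroll for everyone under a given manager, at any level
  -- of the reporting hierarchy.
  SELECT SUM(f.salary_paid) AS total_payroll
  FROM employee mgr
  JOIN reporting_bridge b ON mgr.employee_key = b.manager_key
  JOIN payroll_facts f    ON b.subordinate_key = f.employee_key
  WHERE mgr.employee_name = 'Jane Smith';
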
I would also expect to encounter some complexity in slow-change processing rules.  Human Resource systems carefully audit certain kinds of data changes, tracking the reason for each change. As a result, you may have attributes in your dimension schema that may exhibit either type 1 or type 2 behavior, depending on the reason for the change.

Every schema goes beyond the basics

This list could go on, but I think you get the idea.

The only way to design a data mart that meets business needs is to have a well rounded understanding of the techniques of dimensional modeling.

You cannot get very far with nothing more than a grasp of the basics. This holds true in any subject area -- even sales and marketing.  You need the complete toolbox to build a powerful business solution.

Packaged data marts

The complexity that concerns the reader may actually stem from another cause:  he is looking at packaged data mart solutions.

Packaged applications often introduce complexity for an entirely different reason: to support extensibility or customization. For example, facts may be stored row-wise rather than column-wise, and dimensions may contain generic attribute names.

Learn more

This blog contains posts on most of the topics listed above.  Some have been discussed in multiple posts, so it is worth doing some exploration of your own.

In addition, please check out my book Star Schema: The Complete Reference.  When you purchase it from Amazon using the links on this page, you help support this blog.
  • Snapshots and accumulating snapshots are covered in Chapter 11, "Transactions, Snapshots and Accumulating Snapshots"

  • Factless fact tables and coverage stars are covered in Chapter 12, "Factless Fact Tables"

  • Transaction dimensions are covered in Chapter 8, "More Slow Change Techniques"

  • Attribute bridges are covered in Chapter 9, "Multi-valued Dimensions and Bridges"

  • Hierarchy bridges are covered in Chapter 10, "Recursive Hierarchies and Bridges"
Thanks for the question!

- Chris

Send in your own questions to the address in the sidebar. 


Do you have another technique that was useful in an HR data mart? Use the comments.

Image credit: Gravityx9 licensed under Creative Commons 2.0

Thursday, January 26, 2012

When do you need an accumulating snapshot?

A reader wonders how to decide between two options:  designing an accumulating snapshot vs. tracking status changes within a dimension.

I have a new project to track the status of orders as they transition between stages. I'm unable to reach a conclusion as to whether to implement an accumulating snapshot or a type 2 slowly changing dimension. The ETL integrates a number of individual systems as the order transits each stage. What is the best way to design it?
Kumar
Milton Keynes, UK

Many businesses have one or more central dimensions that undergo a state transition as they touch multiple processes. Is it enough to track the changes to the dimension?  Or is an accumulating snapshot needed?

I'll walk you through some decision criteria.  But first a little refresher on the two design options the reader is considering.

Type 2 changes and timestamps


Type 2 changes track the history of something represented by a dimension.  Each time there is a change to this item, a new row is inserted into the dimension table.

This allows any row in a fact table to be associated with a historically accurate version of the dimension as of the relevant point in time.

In the reader's case, status of the order might be an attribute of an Order dimension.  Modeled as a type 2 change, the dimension holds a status history for each order.  By adding effective and expiration dates to each row, you know exactly when each state transition occurred.

Accumulating snapshots

An accumulating snapshot is a type of fact table that records a single row for something the enterprise tracks closely, such as a trouble ticket or mortgage application--or, in the reader's case, an order.

This fact table contains multiple references to the date dimension -- one for each of the major milestones that the item in question can reach.  In the case of the order, this might be the date of order, the date of credit approval, the date of picking, the date of shipment and the date of delivery.

Unlike other kinds of fact tables, the accumulating snapshot is intended to be updated.  These dates are adjusted each time one of the milestones is reached.

There may also be facts that track the number of days (or minutes) spent between each milestone.  These "lags" are a convenience -- they can be computed from the dates.  (Building them into the fact table makes analysis much easier, but does require that the ETL process revisit rows on a regular basis, rather than when status changes.)

Avoiding correlated subqueries

If a type 2 slowly changing dimension with timestamps tracks the history, why consider an accumulating snapshot?

The analytic value of the accumulating snapshot is that it allows us to study the time spent at various stages. In the reader's case, it can make it simple to study the average time an order spends in the "picking" stage, for example.

We can do this with a type 2 slowly changing dimension as well, but it will be more difficult to study the average time between stages. For the order in question, computing days spent in the picking stage requires knowing the date of credit approval and the date picked.  These will be in two different rows of the dimension.  Now imagine doing this for all orders placed in January 2012.  This will require a correlated subquery.

The accumulating snapshot pre-correlates these events and places them in a single row.  This makes the queries much easier to write, and they are likely to run faster as well.  The cost, of course, is the increased data integration burden of building the additional fact table.
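
To make the contrast concrete, here is a hedged sketch. The order_pipeline accumulating snapshot, the order_dim type 2 dimension, and their columns are all hypothetical, and date arithmetic syntax varies by platform:

  -- Against the accumulating snapshot: one row per order, so average
  -- days in picking is a simple aggregate of the stored lag fact.
  SELECT AVG(approval_to_picked_lag) AS avg_days_in_picking
  FROM order_pipeline
  WHERE order_date BETWEEN DATE '2012-01-01' AND DATE '2012-01-31';

  -- Against the timestamped type 2 dimension, the same question
  -- requires correlating two status rows for every order:
  SELECT AVG(picked_date - approved_date) AS avg_days_in_picking
  FROM (SELECT a.order_id,
               a.effective_date AS approved_date,
               (SELECT p.effective_date
                FROM order_dim p
                WHERE p.order_id = a.order_id        -- correlated
                  AND p.order_status = 'Picked') AS picked_date
        FROM order_dim a
        WHERE a.order_status = 'Credit Approved') pairs;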

Avoiding drilling across

When each of the discrete milestones is captured by a different fact table, lag may be computed without correlated subqueries.  In this case, it will involve drilling across.

For example, separate fact tables track orders, credit approvals, picking and shipping.  Each references the order dimension.  Days spent in the picking stage can be studied by drilling across credit approvals and picking, with results linked by the common order dimension.[1]

Here, the pressure for an accumulating snapshot is reduced.  It may still be warranted, depending on your reporting tools, developer skills and user base.

Summary and final advice

In the end, your decision should boil down to the following:
  1. An accumulating snapshot should only be considered if you are studying the time spent between major milestones
  2. If it helps avoid correlated subqueries, it may be a strong option
  3. If it avoids drill-across queries, it may be a useful option
Making the choice will impact several groups -- ETL developers, report developers, and potentially users.  Make sure this is a shared decision.

Also keep in mind the following:
  • If you build an accumulating snapshot, you will probably also want to track status in the dimension as a type 2 change.
  • Accumulating snapshots work best where the milestones are generally linear and predictable.  If they are not, the design and maintenance will be significantly more complex.
Last but not least:
  • The accumulating snapshot should be derived from one or more base fact tables that capture the individual activities.  
When in doubt, build the base transaction-grained fact tables first.  You can always add an accumulating snapshot later. 
Learn more

This is a popular topic for this blog.  Here are some places where you can read more:
  • Q&A: Accumulating snapshots (October 1, 2010) Explores the cardinality relationship between accumulating snapshot and dimension table
And of course, these concepts are covered extensively in my books.  In the latest one, Star Schema: The Complete Reference, the following chapters may be of interest:
  • Chapter 8, "More Slow Change Techniques" discusses time stamped tracking of slow changes
  • Chapter 11, "Transactions, Snapshots and Accumulating Snapshots" explores the accumulating snapshot in detail.
  • Chapter 14, "Derived Schemas", discusses derivation of an accumulating snapshot from transaction-grained stars.
  • Chapter 4, "A Fact Table for Each Process", includes a detailed discussion of drill-across analysis.

Image Credit:  Creativity103 via Creative Commons

[1] Note that this scenario applies to the reader, but does not always apply.  Trouble tickets, for example, may be tracked in a single fact table that receives a new row for each status change.  In this case, there is no drill-across option.

Wednesday, September 21, 2011

Avoid Surrogate Keys for Fact Tables


I am often asked for feedback on designs that include a "surrogate key for the fact table."  There are two common reasons for which they are proposed; both have better alternatives. 

Surrogate keys are for dimension tables

A surrogate key is an attribute that is created to uniquely identify rows in a dimension table. It does not come from a source system; it is created expressly for the dimensional schema.

Surrogate keys for dimension tables serve two important purposes:
  1. They make it easier to track history.  They allow the dimension to capture changes to something, even if the source does not.  Absent a surrogate key, this would be difficult; the primary key of the dimension would be a concatenation of natural keys and type 2 attributes.
  2. They make it easy to join to the dimension.  The dimensions' surrogate keys appear in fact tables as  foreign keys.  They allow the fact to be joined to the appropriate dimension values, without having to use a multi-part key.
A fact table does not require a manufactured data element for either of these purposes.

Tracking change history of facts?  Use a log.

When the value of a fact can change, a surrogate key for the fact table might be proposed. This would theoretically allow the fact table to record the change history of facts, in the same way that a dimension table does.

Unfortunately, storing the history of facts in this manner destroys the usability of the star. The additive properties of the facts are lost, and it becomes much harder to use.

If the facts can change, the fact table should be updated.

To track the history of facts, use an audit table as described in a previous post.  This table can log historic values, or can store the changes as "deltas" in a dimensional format.

Joining to other fact tables?  Drill across.

The other reason surrogate keys are proposed for fact tables is to make them "easier to join."

Joining fact tables to dimensions is easy; the fact table already contains foreign keys that reference the surrogate keys in dimension tables.  But what about joining to other fact tables?

Because they contain facts meant to be aggregated, fact tables should never be joined to one another. Otherwise, double-counting will ensue.

Facts from multiple stars should be combined by drilling across, as described in a previous post.  Don't try to merge them by joining fact tables.

If you're not after facts with this proposed fact-to-fact join, you must be after dimension values.  If this is the case, carry forward any/all useful dimensions to other stars in the value chain.
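
For reference, here is a hedged sketch of a drill-across query, with hypothetical names. Each star is aggregated separately to a common grain; only then are the two result sets merged on the shared dimension attribute. At no point is one fact table joined to another:

  SELECT COALESCE(o.product_name, s.product_name) AS product_name,
         o.order_dollars,
         s.shipment_dollars
  FROM (SELECT p.product_name, SUM(f.order_dollars) AS order_dollars
        FROM order_facts f
        JOIN product p ON f.product_key = p.product_key
        GROUP BY p.product_name) o
  FULL OUTER JOIN
       (SELECT p.product_name, SUM(f.shipment_dollars) AS shipment_dollars
        FROM shipment_facts f
        JOIN product p ON f.product_key = p.product_key
        GROUP BY p.product_name) s
    ON o.product_name = s.product_name;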

More info

Earlier posts on this blog cover surrogate keys, the tracking of fact change history, and drilling across in more detail.
If you find this blog helpful, please consider picking up a copy of my book, Star Schema: The Complete Reference.


Monday, August 29, 2011

Slowly Changing Facts?

This post looks at the issues faced when facts can change retroactively.
Q: Our source system allows facts to be updated.  We are trying to come up with a way to preserve this history in the data warehouse.  One idea is to use the type 2 technique for facts.  We call it "slowly changing facts."  Is this a good solution?
A: When transactions are changed retroactively, you should update corresponding facts. If you need to track the history of facts, do so in separate audit tables.

To understand why, we will  look at what happens if you record multiple "versions" of a fact in your fact tables. But first, a refresher on the type 2 slow change technique.

Slowly changing dimensions

Dimensional modelers must decide what will happen when the source data for a dimension attribute changes. There are several possible responses to a change; one of them is known as the "Type 2" response.

The type 2 response calls for keeping a sort of "version history" in the dimension table. For example, if a customer's address changes, we insert a second record into the dimension for the same customer.  This way, the old "version" of the customer can be linked to facts that took place prior to the change, and the new version can be linked to new facts.

I covered the basics in more detail in a previous post; you can also refer to the slowly changing dimensions category for more posts on the topic.

Keeping the history of facts

A fact table contains a record of things that happened.  A customer ordered 200 widgets yesterday; another ordered 300 today. Measurement of these events can be aggregated across various dimensions to answer a variety of questions.

But sometimes, what we know about an event changes.  We originally thought 200 widgets were ordered yesterday, but now we know that 250 were ordered.

At first glance,  it may seem like the same "type 2" technique will solve the problem.  When details about a specific transaction are changed, simply insert a new row into the fact table.  The most current version can be flagged for easy access.

Unfortunately, this seriously harms the usability of the fact table. When people study orders, they must now be sure to single out a single version for each transaction to avoid double counting. You can try to resolve this by adding multiple date-stamps to each record, or converting changes into "deltas", but the damage is done. The model is open to serious misuse.

We'll look at the solution in a moment, but first a look at the same problem in a snapshot model.

Snapshots exhibit the same problems

Another kind of fact table, the periodic snapshot, tracks status at predefined intervals.  A financial data mart, for example, might log the balance of each account in the general ledger at the close of business each day.

Here too, our understanding of what happened may change.  At the close of business Monday, a particular account contains US$ 12,000.  On Tuesday, back-dated transactions change that balance to $13,000.

If we try to track both versions of Monday's balance, usability is damaged in the same way.  We must always qualify our analysis for (1) the date we are studying, and (2) the "as of" date of our analysis.  If we don't, grave errors ensue.

Here, usability problems are compounded by density issues.  Tracking the daily balance of each account over 5 years would normally require 1,826 rows per account (365 days x 5 years = 1,825 days, plus 1 leap day).  If we must also re-record balances for all prior days, the number grows to an astounding 1,668,051.  (This is a triangular number, computed as (n² + n) / 2, where n is the number of days.)

If there are 500 accounts in the general ledger, that's the difference between 913,000 rows in a standard periodic snapshot, and 834,025,500 for one that logs history.  As each additional day is added, the difference increases.

The alternative 

There is only one version of the truth, and that is what belongs in the fact table.

When our understanding about an event has changed, the corresponding facts should be updated. This holds for both transaction models and snapshot models.

If you must be able to produce an audit trail, do so in a separate table.  The audit table need only capture the transaction identifier, the date, and the value of each fact. Record rows for the original fact and each revision.
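
A minimal sketch of what such an audit table might look like, with hypothetical names:

  -- One row per version of each fact: the original value as loaded,
  -- plus one row for every subsequent revision.
  CREATE TABLE order_fact_audit (
      transaction_id   INTEGER,        -- ties back to the fact row
      version_date     DATE,           -- when this version was recorded
      change_reason    VARCHAR(100),
      quantity_ordered INTEGER,        -- fact values for this version
      order_dollars    DECIMAL(12,2)
  );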


If need be, this can also be organized dimensionally as a fact table. Keep in mind, however, that this fact table will be subject to the same usability challenges.  And ask the business if all this trouble is really necessary.

For a snapshot model, an audit table may not be necessary; a corresponding transaction table may be able to hold the history. Each transaction that affects status is recorded with two dates: the date of the transaction, and the date we became aware of the information (often a "posted" date).

Wednesday, May 26, 2010

Avoid Using Dates as Keys

This post explores the impact of using a date as the primary key for a dimension table. 

Q: I'd like to hear your thoughts on using dates as keys.  Is it really necessary to have a surrogate key for a date dimension?

Hartford, CT

A:  It is not absolutely necessary.  But you should use a surrogate key anyway.  Here's why.

Suppose you have a DAY dimension which contains one row for each calendar date.  In addition to the full date, it contains columns that represent the year, quarter, month, day of week, holiday status, and so forth.  It also contains columns that represent frequently used combinations of these various attributes.

Users (or developers) can use the columns to do things like filter queries, aggregate facts, sort results, and drive subtotaling.  Database administrators can index these columns to optimize these activities.

You might be tempted to use the full date as the primary key of the table.  This urge is understandable -- why bother creating an additional column called DAY_KEY?  Seems like extra work for nothing, right?

Consistency and Performance Issues

Remember that the primary key of a dimension table will also appear in fact tables.  If a date is used, users or developers will eventually try using it as a "shortcut" in queries, avoiding a join to the DAY table.

This leads to inconsistencies.  Developers may mean well, thinking they are "saving a join."  But they may apply different rules to compute things like fiscal periods.

Use of a day table guarantees consistent computation of all the characteristics of each day.  All the columns are computed exactly once, during the ETL process.  Everyone who uses them gets consistent results.

Moreover, avoiding the DAY table may generate performance issues.  If a report developer wants to filter for a particular day of the week, they may apply SQL date comparison functions to the day column in the fact table.  The database will probably have to make this comparison for each row in the fact table.

A day of the week in the DAY table is far more efficient, because it is pre-computed and indexed.
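
The contrast looks something like the sketch below. The names are hypothetical, and the date-function syntax varies by RDBMS:

  -- The shortcut: a date function evaluated against every fact row,
  -- with the day-of-week rule buried in the report.
  SELECT SUM(f.sales_dollars)
  FROM sales_facts f
  WHERE EXTRACT(DOW FROM f.order_date) = 6;     -- platform-specific

  -- The DAY table: the attribute was computed once, during ETL,
  -- and the column can be indexed.
  SELECT SUM(f.sales_dollars)
  FROM sales_facts f
  JOIN day d ON f.day_key = d.day_key
  WHERE d.day_of_week = 'Saturday';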

You may think you can prevent these mishaps.  But you cannot.  Sooner or later, they will occur. Dates in the fact table will beckon.  A user, developer, or consultant will try to take a shortcut.  Perhaps under your watch, or perhaps after you have gone.  

Guarantee consistency and performance by using a surrogate key.

If you are curious, read on for some other, less important reasons.

Wednesday, May 20, 2009

Do I really need Surrogate Keys?

Here is the #1 question that people most frequently ask of me:
Q: Do I really need surrogate keys?
A: You absolutely must have a unique identifier for each dimension table, one that does not come from a source system. A surrogate key is the best way to handle this, but there are other possibilities.

The case for the surrogate key is entirely pragmatic. Read on for a full explanation.

Dimensions Need their Own Unique Identifier

It is crucial that the dimensional schema be able to handle changes to source data in whatever manner makes most sense from an analytic perspective. This may be different from how the change is handled in the source.

For this reason, every dimension table needs its own unique identifier -- not one that comes from a source system. (For more on handling changes, start with this post.)

A surrogate key makes the best unique identifier. It is simply an integer value, holding no meaning for end users. A single, compact column, it keeps fact table rows small and makes SQL easy to read and write. It is simple to manage during the ETL process.

Compound Keys Work, But Why Take that Route?

An alternative is to supplement a unique identifier from the source system (also known as a natural key) with a sequence number. This results in a compound key, or multi-part key.

This kind of compound key also allows the dimension to handle changes differently than the source does. But there are several disadvantages:
  • Fact table rows become larger, as they must include the multi-part key for each dimension
  • The ETL process must manage a sequence for each natural key value, rather than a single sequence for a surrogate key
  • SQL becomes more difficult to read, write or debug
  • Multi-part dimension keys can disrupt star join optimizers
So: a compound key takes more space, is not any easier, and may disrupt performance. Why bother?
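
The difference shows up in every query you write. A hedged sketch, with hypothetical names:

  -- Surrogate key: one compact join column.
  SELECT c.customer_name, SUM(f.order_dollars) AS order_dollars
  FROM order_facts f
  JOIN customer c ON f.customer_key = c.customer_key
  GROUP BY c.customer_name;

  -- Compound key: the multi-part key must be repeated in every join.
  SELECT c.customer_name, SUM(f.order_dollars) AS order_dollars
  FROM order_facts f
  JOIN customer c ON f.customer_id  = c.customer_id
                 AND f.customer_seq = c.customer_seq
  GROUP BY c.customer_name;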

(By the way, many source system identifiers are already compound keys. Adding a sequence number will make them even larger!)

Sometimes it is suggested that the natural key be supplemented with a date, rather than a sequence. This may simplify ETL slightly, but the rest of the drawbacks remain. Plus, dates are even bigger than a sequence number. Worse, the date will appear in the fact table. That is sure to lead to trouble!

(This is not to say that datestamps on dimension rows are a bad thing. To the contrary, they can be quite useful. They just don't work well as part of a compound key.)

That's the basic answer. I'll respond to some common follow up questions over the coming weeks.

- Chris

Wednesday, March 12, 2008

Responding to Star Schema Detractors with Timestamps

Time-stamped dimensions can pack additional analytic punch, compared with standard slow change techniques. There is additional cost in the load process, but it may be worth it, especially if you plan to load historic data.

This is the fourth in a series of posts on Slowly Changing Dimensions. If you are just joining in, you might want to review Part I, Part II and Part III.

Star schema detractors often suggest that a dimensional model throws away valuable detail. Proponents of dimensional design counter that one of their basic design tenets is to set fact table grain at the lowest level of detail possible. Fair enough. But this response does not address the concern fully. What about the dimensions?

Designers who eschew the dimensional model for their enterprise data warehouse point out that they can tell you what a particular customer looked like on a given day, regardless of whether there were any sales. A dimensional design using standard Type 2 techniques does not offer this capability. It writes a new row to the dimension table each time a characteristic changes. But we can only link these versions of the customer to a point in time if there is a transaction recorded in a fact table.

In the over-hyped "religious wars" of data warehousing, this is one criticism of the star schema that I have always felt had some validity. However, it is an easy one to address.

Time Stamped Dimensions

The solution is so simple that I have encountered it being employed numerous times by groups with no formal training in dimensional design. The solution is to add a pair of dates to the dimension table: effective_date and expiration_date.

When a row is added to the dimension table, it is given an effective_date that indicates when the change became effective. Later, when a slow change occurs, two things happen. The new record is inserted with the new effective_date, similar to a standard Type 2 response. In addition, the previous record is updated: its expiration_date is set to the date before the change occurred.
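
In SQL, the two-step response might look like the sketch below. The names and values are hypothetical, and the high expiration date reflects one common convention for flagging current rows (discussed next):

  -- Step 1: expire the prior version as of the day before the change.
  UPDATE customer
  SET expiration_date = DATE '2008-03-11'
  WHERE customer_id = '8472'
    AND expiration_date = DATE '9999-12-31';    -- the current row

  -- Step 2: insert the new version, with a fresh surrogate key.
  INSERT INTO customer (customer_key, customer_id, customer_name,
                        effective_date, expiration_date)
  VALUES (10984, '8472', 'Some Customer, Inc.',
          DATE '2008-03-12', DATE '9999-12-31');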

More ETL Work, but More Analytic Possibilities

This extra step may be a bit of additional work for the ETL developers, but it provides some additional analytic capability. Using these dates, it is possible to:

  1. List a series of changes to something in sequence: order by the effective_date.
  2. Determine the status of something at a point in time: qualify the date in question as greater than or equal to the effective_date and less than or equal to the expiration_date
  3. Quickly find the current value for a given dimension, by qualifying on expiration_date
For all this to work, it is useful to initialize expiration dates of current records to an actual date. This avoids the tortured query predicates that are brought on by the use of NULL values. I prefer to use the highest date value supported by the RDBMS, but that’s a topic for another post.
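
With that convention in place, the point-in-time and current-value lookups from the list above reduce to simple predicates. Again, the names are hypothetical:

  -- What did customer 8472 look like on June 15, 2007?
  SELECT *
  FROM customer
  WHERE customer_id = '8472'
    AND DATE '2007-06-15' BETWEEN effective_date AND expiration_date;

  -- The current version: the high-date convention avoids NULL tests.
  SELECT *
  FROM customer
  WHERE customer_id = '8472'
    AND expiration_date = DATE '9999-12-31';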

As described, this is more of a "Date Stamped" dimension. But if the granularity of significant change is finer than the day, it can easily be extended to include a time component.

Because there is a bit of extra ETL processing, it is worth considering whether the additional analytic value is worth it. But there is also another consideration to take into account, and here the ETL developers may actually save some work.

Another Benefit

Having the timestamps on our conformed dimensions will make it easier to load past history when we build new fact tables in the future. This is especially handy if we are building the warehouse one subject area at a time.

For example, our Customer dimension may be implemented first as part of an orders schema. Later, it may come into play again as part of a receivables schema. When it comes time to load the receivables history into a fact table, each transaction will need to be associated with the correct historic version of the customer. Without timestamps, ETL developers will need to use the operational data to identify exactly what the customer looked like at the time of the transaction, then search for a match in the customer dimension.

With timestamps, all the ETL developers will need is the customer_id. We can easily find the correct version of the customer in the dimension table by referring to the timestamps.

For this simple reason alone, I encourage designers to use time-stamps on the dimensions that are important across the enterprise--the conformed dimensions that form the backbone of the data warehouse.

Tuesday, January 8, 2008

Have it Both Ways With Hybrid Slow Changes

This is the third in a series on Slowly Changing Dimensions. It looks at an often misunderstood approach: the hybrid SCD.

When the source data for a dimension table changes, the Slowly Changing Dimension characteristics of a dimensional model dictate how the warehouse schema should respond. In my previous two posts, I examined the three most common techniques. (See Part I & Part II.) When the source value for a dimension attribute value changes:
  • Type 1 response: Update the existing value in the dimension table
  • Type 2 response: Add a new row to the dimension table, with a new surrogate key
  • Type 3 response: Maintain a single row in the dimension table, with separate columns for current and previous values of the dimension attribute in question.
Sometimes, one single technique does not fit the bill. Requirements will come along that seem to call for both Type 1 and Type 2 treatment. What to do?

“We always want to do it this way. But sometimes we want to do it that way.”

Consider a company building a typical orders star schema for business-to-business sales. Customer companies are represented by a customer dimension table. You are trying to determine what the correct slow change response should be when the name of a customer company changes. In an interview, the first thing you are told is:
“We always want to use the most current name for the company in our analysis…”
This sounds to you like a type 1 change, which looks reasonable. After all, you might reason, one wouldn’t want this year vs. last year comparisons to be befuddled by different company names. But no sooner have you noted this, the speaker continues:
“...except for when we want to use the name that was in effect at the time of the transaction.”
Now things look more complicated. To provide the company name as it existed at the time of an order would require a Type 2 response. But that would conflict with the first requirement.

What to do? Don’t jump for Type 3. That’s not the solution either. A type 3 response will not tell you which value was in effect at the time of a given transaction. (It also usually does not capture more than one previous value of the attribute – see the previous post.)

Having it Both Ways

The hybrid approach allows you to meet both requirements. It requires that you model two dimension attributes that will capture the value in question – company name in this case.
  • company_name_current will capture the current name of the company. This attribute will exhibit Type 1 behavior. When a company name changes, it will be updated.
  • company_name_historic will capture the detailed history of company names. This attribute will exhibit Type 2 behavior. When a company name changes, a new row will be created to capture the new value, and to associate with any subsequent transactions.

This means that when a hybrid change occurs, the ETL process must do at least two things:
  1. Update the _current value for all records that share the same natural key
  2. Insert a new row, initializing both _current and _historic values to the new value
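
A minimal SQL sketch of these two steps, using the Apple example discussed below. The company_id natural key and the surrogate key value are hypothetical:

  -- Step 1 (type 1): update the _current value on every row that
  -- shares the natural key.
  UPDATE customer
  SET company_name_current = 'Apple Inc.'
  WHERE company_id = 'APPL01';

  -- Step 2 (type 2): insert a new row carrying the new value in both
  -- columns; subsequent facts will be linked to this row.
  INSERT INTO customer (customer_key, company_id,
                        company_name_current, company_name_historic)
  VALUES (20115, 'APPL01', 'Apple Inc.', 'Apple Inc.');
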
Years ago, it was popular to refer to this approach as a “Type 4 Change.” With the second edition of The Data Warehouse Toolkit, Kimball and Ross suggest we call this a hybrid change. And for good reason: it is really a bit of Type 1 and a bit of Type 2.

Hybrid Change in Action

Developers often have trouble with this type of design. To fully understand it, look what happens when the name of a company changes.

Suppose that one of the customers was Apple Computer. You may recall that in 2007, the company changed its name from Apple Computer Inc. to Apple Inc.

Initially, the dimension table contains a row that looks like this:

In 2007, Apple changed its name from “Apple Computer Inc.” to “Apple Inc.” The next illustration shows the schema’s hybrid response in yellow:

The first row shows the existing record. The _current value is updated with the new name, “Apple Inc.” Its _historic value is not touched. The next row shows the new record that has been inserted. It shows the new name for both _current and _historic values. Any facts subsequently recorded will be connected to this new row.

To analyze all facts (before and after the name change) using the new name, Apple Inc., you use company_name_current to group query results. To study all facts in the context of the historically accurate name, group query results by company_name_historic.

Repeatable

This process is repeatable. If Apple once again changed the corporate name, say to “Apple Electronics, Inc.”, the ETL process for the customer table would:
  1. Change company_name_current to “Apple Electronics Inc.” for both existing rows.
  2. Add a third row (with a new surrogate key) with _current and _historic values recorded as “Apple Electronics, Inc.” Subsequent facts would be connected to this row.
The Importance of Good Documentation

It is important to document the hybrid change carefully. ETL developers must understand how it is to be loaded, and report/query developers must understand how to use it. Otherwise, the hybrid approach can lead to confusion.

And be sure that it is worth the trouble. You might ask why people want the historic name, if 99% of the time they only need the current name. If it is because they want to be able to reproduce an invoice, perhaps that would be better served by going back to the source system. Then the schema can sport a simpler Type 1 approach.

But if a real analytic need exists, the hybrid approach is the only way to go.

Copyright (c) 2007 Chris Adamson

Thursday, November 15, 2007

How To Document a Type 3 SCD: A pair of Type 1 Responses

Many dimensionally-aware database modeling tools allow you to document the Slow Change characteristics of each dimension attribute. But often, your only choices are Type 1 or Type 2. This post describes what to do when your design calls for the less common Type 3 change. As it turns out, a Type 3 SCD response can be understood as a pair of Type 1 responses.

In a previous post, I pointed out that while we think about slow changes with respect to a source system, we usually document them as part of the dimensional schema. For example, if the customer's name or date of birth changes on the source, what do we do in the star schema? A Type 1 response would be to overwrite a corresponding value in the star schema; a Type 2 response would be to insert a new row for the revised version of the customer. We usually document this behavior by flagging each attribute in the star as "Type 1" or "Type 2", even though the "change" happens on the source.

(Not sure what Slowly Changing Dimensions are? See For Slowly Changing Dimensions, "Change" is Relative for a definition of the concept and explanations of the Type 1 and Type 2 techniques.)

A Type 3 change calls for yet another kind of response to a change in source data. This response maintains before and after versions of the value in separate dimension attributes. Because the before and after values are stored in the same row, either version can be associated with activity (read: facts) that occurred before or after the change.

Schema designers often throw up their arms in disgust when they discover that their modeling tool does not support the concept of a Type 3 change. But this despair may not be warranted; it is possible to document a Type 3 change as a pair of Type 1 changes.

Before we look at how to do this, here's a quick example of a Type 3 slow change.

Type 3 SCD: An Example

Suppose that the dimensional model contains a dimension table called product, and that one of the dimension attributes is category. The category attribute carries a standard classification mechanism used to categorize products, and appears in a large number of reports. An example of a category is "stationery."

A wide variety of products have this category designation, including several types of envelope, note pad, and paper sheets. All these products have rows in the product dimension table with a category of "stationery." Fact tables linked to the product dimension can be analyzed by category, grouping all these products under the value "stationery."

Now let's assume that the business has expanded its product lines, and the old "stationery" category is a bit too broad. Management plans to redesignate existing stationery products as "personal stationery" or "business stationery". Going forward, they want to use the new values for all analysis, even when studying facts that were recorded prior to the changeover. So far, this sounds like a Type 1 change.

The catch: some planning numbers are tied to the old designations. For at least a while, management will need the ability to use the old value to study facts, whether they were recorded before or after the change. (A type 2 response does not help here; it associates all facts recorded before the change with the old value, and all facts recorded after the change with the new value.) And, now that they've thought about it some more, it might be nice to be able to do this every year.

The solution is to maintain two dimensional attributes for the category. The first is meant to carry the current category designation, and might be called category_current. The second is meant to carry the old category designation, and might be called category_previous.

When the category for a product changes, say from "stationery" to "business stationery", we update the category_current to contain the new value "business stationery" and place the old value, "stationery" in category_previous. When management wants to study facts using the new designations, their reports should group facts using the category_current dimension attribute. When they want to study facts using the old categorization scheme, they can use the category_previous attribute.
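
Here is a hedged SQL sketch of that type 3 response, with hypothetical names. Note the ordering of the SET clauses: standard SQL evaluates them against the row's old values, but some platforms evaluate left to right, so assigning category_previous first is the safe order:

  -- Both columns change in place; no new row is inserted.
  UPDATE product
  SET category_previous = category_current,   -- preserve the old value
      category_current  = 'business stationery'
  WHERE product_id = 'ENV-0100';              -- one redesignated product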

Documenting Type 3 Change

As discussed in the previous post, we often document slow change response on an attribute by attribute basis in the dimensional model. For our product example, this might be done as shown in the following illustration.



Many modeling tools allow us to document the SCD characteristic of a column, offering a pick list that allows us to choose what type of response is expected when the source of an attribute changes. But what if your tool does not support Type 3? Or what if you have a tool that automates the construction of an ETL routine, aggregate table or OLAP cube, and the tool does not offer the Type 3 option?

There may be a solution. Notice that when a type 3 change occurs, we update both versions of the attribute. In our example, category_current is updated with the new category value, and category_previous is updated with the prior value (which is the old value of category_current). We effectively have a pair of Type 1 attributes, and can document them as shown below. I've added a highlight to illustrate the pair of attributes in question.


Of course, if you have tools that use this information to do something, such as generate a cube or ETL script, you need to be sure this method will not be disruptive. You will probably need to provide sources for each of the attributes, and it may be challenging to define a source for category_previous, which will probably not be present in the source after the change has occurred. The solution may be to define its source as the category_current value of the product dimension table itself. This, in turn, may require that you join to a copy of the dimension when extracting source data.

Repeatable Process

To round out this discussion of the Type 3 change, I should point out that this slow change process is repeatable. Next year, if the designations change again, we can move category_current values into category_previous, and place the new values in category_current.

Note that we always have access to two values: the current value and the prior value. Not enough? It's also possible to maintain more than two columns for a type 3 response. We might choose to maintain columns called category_current, category_last_year, category_2yrs_ago, etc. Now, any or all of the values can be used to study all facts, regardless of when they were recorded. (A scheme like category_2007, category_2006, category_2005 might work as well, but might require you to redesign reports each year to use the correct attributes.)

More Changes on the Way

That's not quite the end of the story of SCD's. In future posts, I will look at the popular "timestamped" and "hybrid" approaches, as well as run down a few advanced concepts and caveats.

Copyright (c) 2007 Chris Adamson

Tuesday, October 9, 2007

For Slowly Changing Dimensions, "Change" is Relative

There's a difference between the way we think about Slowly Changing Dimensions and the way we document them. In this post, I'll highlight this difference by examining the two most common Slow Change techniques.

The term "slowly changing dimension" originated with Ralph Kimball, who identified three techniques for dealing with changed data. Commonly abbreviated as SCD's, these techniques are applied in any form of dimensional design, regardless of the data warehouse architecture.

In practice, there is a subtle but important difference between the way we think about these changes and the way we describe them in a dimensional design. This sometimes leads to confusion.

Before I explain this important distinction, let me review the difference between surrogate and natural keys, and describe the two most common SCD techniques. (Future posts will look at other slow change techniques.)

Natural Keys and Surrogate Keys

We usually think of dimension tables in a star schema as corresponding to something in a source system. For example, each row in a customer dimension table relates to a single customer in a source system. Each column is loaded from one or more sources, based on a set of rules. The link back to a source system is preserved in the form of a natural key: usually a unique identifier in a source system, such as a customer_id.

But the star schema design does not rely on this natural key, or business key, to uniquely identify rows in dimension tables. Instead, a surrogate key is introduced. This surrogate key gives the dimensional design flexibility to handle changes differently than they are handled in source systems, while preserving the ability to perform joins using a single column.

Type 1 and Type 2 Slow Changes

Slowly changing dimension techniques determine how the dimensional model will respond to changes in the source system. If the customer with id 8472 changes, what do we do with that change? Alert readers may already be concerned about what I mean by "change" here, but let's first recap the two most common techniques.

  • Type 1: Update When the dimensional model responds to a change in source data by updating a column, Kimball calls this a type 1 change. For example, if a customer's date of birth changes, it is probably appropriate to update the corresponding row for that customer in the dimension table.

    Under this scenario, any facts that were already associated with the dimension table row have effectively been revised as well. A report of sales dollars by date of birth, for example, will provide different results immediately before and after the type 1 change is applied. The type 1 change does not preserve history of the attribute value.
  • Type 2: New Row A more common response to a changed data element is to insert a new row into the dimension table. For example, when the address of customer 8472 changes, we create a new row for the customer in the dimension table. This row has a different surrogate key, and the new address. Customer 8472 now has two rows in the dimension, each with its own surrogate key.

    This preserves the history of the attribute, and does not revise any previously stored facts. New facts will be associated with the new version of customer 8472; old facts remain associated with the old version.

For the most part, these two techniques form the basis of a dimensional model's response to change. (Future posts will consider the less common type 3 change, and additional techniques.) While these concepts are fairly easy to understand, it is important to look a bit deeper.

We think about slow changes with respect to the source

Notice the way that the original problem was framed. I asked how the dimensional schema would "respond to changes in the source data." This is how we usually think about the problem, and for good reason. After all, the source data exists before it is loaded into the dimensional schema. If birth_date changes, we overwrite; if address changes, we insert a new record.

Now observe that a change to the source does not always result in a change in the dimensional schema. In the example, a change in address resulted in a new row, not a changed row. No data is changed.

Still, we refer to this process as the occurrence of a type 2 change. Why? Because we think about slow changes with respect to the source data. And there, a change did occur.

We document slow changes with respect to the star

The most common way to document the dimensional schema's response to change is on the dimensional side, on an attribute by attribute basis. For each column in a dimension table, we note how changes in the source data will be handled. Our customer example might be documented as follows:



In the diagram, each non-key attribute is tagged with a 1 or a 2. This indicates whether changes in the source of the attribute should be handled as type 1 or type 2 changes.

Documenting SCD behavior in this way is handy. ETL developers use this information to design a scheme for performing incremental loads. Report developers use this information to understand how facts will be grouped when combined with different dimension attributes.

The only drawback to documenting SCD rules in this way is that it can lead to confusion. By tagging an attribute as a "type 2 SCD" we risk implying that attribute values may change. After all, the "C" in "SCD" stands for "change."

But of course, this attribute does not change. Rather, its classification as a type 2 attribute means "for a given natural key, if the source for this attribute undergoes a change, it will be necessary to insert a new row."

In future posts, I will look at some common misconceptions about slowly changing dimensions, and discuss additional techniques for handling changes.

Copyright (c) 2007 Chris Adamson

Monday, June 11, 2007

Ten Things You Won't Learn from that Demo Schema

Many people learn about dimensional modeling by studying a sample star schema database that comes with a software product. These sample databases are useful learning tools—to a point. Here are 10 things you won't learn by studying that demo.

If you've learned everything you know about star schema by working with a sample database, you probably have a good intuitive grasp of star schema design principles. In a previous post, I provided a list of 10 terms and principles that most sample databases illustrate well.

But there are many important things about a dimensional data warehouse that are not revealed by the typical "Orders" or "Sales" demo. Here are the top 10 things you will not learn from that sample database.

  1. Multiple Fact Tables Most sample databases contain a single star—one fact table and its associated dimension tables. But it is rare to find a business process that can be modeled with a single fact table; it is impossible to find an enterprise that can. Most real-world designs will involve multiple fact tables, sharing a set of common dimensions.

    When facts become available at different times, or with different dimensionality, they almost always belong in separate fact tables. Modeling them in a single fact table can have negative consequences. Mixed grain issues may result, complicating the load and making reporting very difficult. For example, building reports focused on only one of the facts can result in a confusing preponderance of extra rows containing the value zero.

  2. Conformance. With multiple fact tables, it is also important that each star be designed so that it works with the others. A design principle called conformance helps ensure that as we build each new star, it works well with those that came before it, avoiding the dreaded stovepipe. This principle allows a set of star schemas to be planned around a set of common dimensions and implemented incrementally.

  3. Drilling Across. It's also important to understand how to properly build a report that accesses data from multiple stars. A single SQL SELECT statement won't do the job; double counting, or worse, can result. Instead, we follow a process called drilling across, where each star is queried individually. The result sets are then combined based on their common attributes (see the first sketch following this list). These drill-across reports are some of the most powerful in the data warehouse.

  4. Snapshot Fact Tables. The fact table found in most demo stars is usually called a transaction fact table. But there are real-world situations where other types of fact table designs are called for.

    A snapshot design is useful for capturing the result of a series of transactions at a point in time; for example, the balance of each account in the general ledger at the end of each day. This type of design introduces the concept of semi-additivity, which can be a problem for many ad hoc query tools. It makes no sense to add together yesterday's balance and today's balance. It is not uncommon to compute averages based on the data in a snapshot star, but one must be careful here; the SQL AVG() function may not always be what you need (see the second sketch following this list).

  5. Factless Fact Tables. Another type of fact table contains no facts at all. Factless fact tables are useful in situations where there appears to be nothing to measure aside from the occurrence of an event, such as a customer contact. They also come in handy when we want to capture information about conditions for which there may be no event at all, such as eligibility.

    In addition to transaction, snapshot and factless designs, there are other types of fact table as well. It is not uncommon to need more than one, even when modeling a single activity.

  6. Roles and Aliasing. Many business processes involve a dimension in multiple roles. For example, in accounting, a transaction may include the employee who makes a purchase, as well as the employee who approves it. There is no need for separate "Purchaser" and "Approver" dimensions; a single "Employee" dimension will do the job. The fact table will have two foreign key references to the Employee dimension: one that represents the purchaser, and one that represents the approver. We use SQL aliasing when querying this schema in order to capture the two employee roles (see the third sketch following this list).

  7. Advanced Slow Change Techniques. If you are lucky, you were able to learn about Type 1 and Type 2 slowly changing dimension techniques from the demo schema; I described these in a previous post. Often, analytic requirements demand more.

    A Type 3 change allows you to "have it both ways," analyzing all past and future transactions as if the change had occurred (retroactively) or as if it had not occurred at all.

    There are also hybrid approaches, one of which tracks both the "transaction-time" version of the changed data element and its "current" value.

    And then there's the time-stamped dimension technique, also called a transaction dimension. In this version, each row receives an effective date/time and an expiration date/time. This provides Type 2 functionality, but also allows point-in-time analysis of the dimensional data.

  8. Bridge Tables. Perhaps the most confusing technique for the novice dimensional designer is the use of bridge tables. These tables are used when the standard one-to-many relationship between dimension and fact does not apply. There are three situations where bridge tables come in handy:

    An attribute bridge resolves situations where a dimension attribute may repeat multiple times. For example, a dimension table called "Company" may include an attribute called "Industry." Some companies have more than one industry. Rather than flattening into "Industry 1," "Industry 2," and so on, an attribute bridge captures as many industries as needed.

    A dimension bridge resolves situations where an entire dimension may repeat with respect to facts. For example, there may be multiple salespeople involved in a sale. Instead of loading the fact table with multiple salesperson keys, a dimension bridge gracefully manages the group of salespeople.

    A hierarchy bridge resolves situations where a recursive hierarchy exists within a dimension. For example, companies own other companies. At times, users may want to roll up transactions that occur beneath a specific company, or vice versa. Instead of flattening the hierarchy, which imposes limitations and complicates analysis, a hierarchy bridge can be joined to the transaction data in various ways, allowing multiple forms of analysis.

    All bridge implementations have implications for usage, or report building. Improper use of a bridge can result in double counting or incorrect results. Bridges also make deployment of business intelligence tools more difficult.

  9. Derived Schemas. Useful stars can also be derived from existing stars. Often called "second-line" solutions, these derived schemas can accelerate specific types of analysis with powerful results. The merged fact table combines stars to avoid drilling across. The sliced fact table partitions data based on a dimension value, useful in distributed collection and analysis. The pivoted fact table restructures row-wise data for columnar analysis, and vice versa. And set operation fact tables provide precomputed results for union, intersect, and minus operations on existing stars.

  10. Aggregate Schemas. One of the reasons the star schema has become so popular is that it provides strong query performance. Still, there are times when we want results to come back faster. Aggregate schemas partially summarize the data in a base schema, allowing the database to compute query results faster. Of course, designers need to identify the aggregates that will provide the most help, and queries and reports will receive no benefit unless they actually use the aggregate. Aggregates are usually designed as separate tables, instead of providing multiple levels of summarization in a single star. This avoids double-counting errors, and can allow the exploitation of an automated query rewrite mechanism so that applications do not need to be aware of the aggregates.
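
Because item 3 above is easier to see than to describe, here is a minimal sketch of drilling across, assuming hypothetical order_facts and shipment_facts stars that share a product dimension:

    -- Query each star separately at a common grain, then merge the
    -- result sets on the shared dimension attribute.
    SELECT COALESCE(o.product_name, s.product_name) AS product_name,
           o.order_dollars,
           s.shipment_dollars
    FROM (SELECT p.product_name, SUM(f.order_dollars) AS order_dollars
          FROM order_facts f
          JOIN product p ON f.product_key = p.product_key
          GROUP BY p.product_name) o
    FULL OUTER JOIN
         (SELECT p.product_name, SUM(f.shipment_dollars) AS shipment_dollars
          FROM shipment_facts f
          JOIN product p ON f.product_key = p.product_key
          GROUP BY p.product_name) s
      ON o.product_name = s.product_name;

Each star is fully aggregated before the merge, so facts from one star are never repeated against rows of the other.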
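
The caution in item 4 about averages can also be sketched. Assuming a hypothetical daily balance snapshot star, the average daily total balance is computed by summing across accounts within each day, then averaging across days; a bare AVG() over the raw rows would answer a different question:

    -- Balances are semi-additive: they sum across accounts within one
    -- day, but must be averaged, not summed, across days.
    SELECT AVG(day_total) AS avg_daily_total_balance
    FROM (SELECT d.full_date, SUM(f.balance_dollars) AS day_total
          FROM balance_snapshot_facts f
          JOIN day_dim d ON f.day_key = d.day_key
          GROUP BY d.full_date) daily_totals;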
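
And the purchaser/approver example from item 6 translates directly into SQL aliases; again, the table and column names here are hypothetical:

    -- One employee dimension plays two roles through two aliases.
    SELECT purchaser.employee_name AS purchased_by,
           approver.employee_name  AS approved_by,
           SUM(f.purchase_dollars) AS purchase_dollars
    FROM purchase_facts f
    JOIN employee purchaser ON f.purchaser_key = purchaser.employee_key
    JOIN employee approver  ON f.approver_key  = approver.employee_key
    GROUP BY purchaser.employee_name, approver.employee_name;
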
I limited this list to 10 things. That's enough to make the point: a demo schema will only take you so far. When I teach Advanced Star Schema Design courses, I find that even people with many years of experience have questions, and always learn something new.

If you want to learn more, read the books recommended in the sidebar of this blog. Take a class on Advanced Star Schema Design. Interact with your peers at The Data Warehousing Institute conferences. And keep reading this blog. There will always be more to learn.

Related Posts:

10 Things You Should Know About that Sample Star Schema

© 2007 Chris Adamson

Tuesday, May 1, 2007

10 Things You Should Know About that Sample Star Schema

Today, many of us learn about the star schema by studying a sample database that comes with a software product, usually one that covers sales or orders. Here are 10 terms and principles of dimensional modeling to go with that sample schema you've worked with.

The star schema has become a de facto standard for the design of analytic databases. Sample stars are often included with RDBMS software, BI tools, and ETL tools. They are also used for tutorials and training. Almost universally, the sample schema describes a sales or order taking process, similar to the one depicted in the figure below:

Figure 1: A demo schema usually represents orders or sales.

You may have learned about the Star Schema by working with a sample like this one. If so, you probably have an intuitive grasp of star schema design principles. Here are ten terms and principles you should know that describe important features of the sample star.

Most of this is probably readily apparent if you've worked with a sample schema—what may be new is the terminology. The first two you probably know:

  1. Facts are measurements that describe a business process. They are almost always numeric—but not all numeric attributes are facts. You can find facts (or measurements) in almost any analytic request—"Show me sales dollars by product" (sales dollars). "How many widgets were sold by John Smith in May?" (quantity ordered). There are some schemas that do not include facts—we'll look at those in another post.

  2. Dimensions give facts context. They may be textual or numeric. They are used to specify how facts are "filtered" and "broken out" on reports. You can usually find dimensions after the words "by" or "for" in an analytic request. "Show me sales dollars by product" (product). "What are margin dollars by Month and Salesperson?" (month, salesperson).

  3. Dimension tables are wide. Dimension tables usually group together a set of related dimension attributes, though there are situations where a dimension may include a set of attributes not related to one another. Dimension tables are not normalized, and usually have a lot of attributes—far more than appear in most sample schemas. This allows a rich set of detail to be used in analyzing facts. 100 or more columns is not uncommon for some dimensions. For this reason, we often call dimension tables wide.

  4. Dimensions have Surrogate Keys. The primary key for each dimension table is an attribute specifically created for the dimensional schema. It is an integer assigned by the ETL process, and has no inherent meaning. It is not a reused key from a source system, such as a customer ID or product code; we call those attributes natural keys, and while they may exist in the star, they do not serve as unique identifiers.

    In the sample schema, customer_key is a surrogate key generated for the star schema; customer_id is a natural key carried over from a source system.

    By assigning surrogate keys, we enable the star to handle changes to source data differently than the source system does. For example, in a source system a customer record may be overwritten, while we want the star schema to track changes. Performance considerations also come into play—a surrogate key avoids the need for multi-column joins.

  5. Type 2 Changes track history. The term "Slowly Changing Dimension" (or SCD) describes how the data warehouse responds to changes in the source of dimensional data. There are several techniques that can be applied when the source of dimension detail changes. The most common is referred to as a "Type 2" change: an entirely new record is written to the dimension table.

    For example, if a customer moves, the record may simply be updated in a source system. But in the star schema, we choose to add a new row to the customer dimension, complete with a new surrogate key. All prior facts remain associated with the "old" customer record; all future facts will be associated with the new record.

  6. Type 1 Changes overwrite history. The Type 1 change is used when source data changes are not deemed significant, or may be the correction of an error. In such cases, we perform an update to an existing row in a dimension. For example, if a customer's gender is updated in the source, we may choose to update it in the corresponding dimension records. All prior facts are now associated with the changed value.

    In addition to Type 1 and Type 2 changes, there are other SCD techniques. Hybrid approaches exist as well. Every design should identify which technique(s) will be used for each attribute of each dimension table.

  7. Fact tables are narrow. A fact table row is usually entirely composed of numeric attributes: the facts, and foreign key references to the dimensions. Because of these characteristics, each fact table row is narrow, at least in contrast with wide dimension rows full of textual values. The narrowness of fact tables is important, because they will accumulate far more rows than dimension tables, and at a much faster rate.

  8. Fact tables are usually sparse. Rows are recorded in the fact table only when there is something to measure. For example, not every customer orders every product from every salesperson each day. Rows are only recorded when there is an order. This helps manage the growth of the fact table. It also saves us from having to filter out a huge number of rows that have no sales dollars when displaying results in a report. (Usually, you don't want a customer sales report to list every product—only the ones they bought. On the rare occasions when you do want every product listed, you can use an outer join.)

  9. Fact Table Grain The level of detail represented by a row in a fact table is referred to as its grain. Facts that are recorded with different levels of detail belong in separate fact tables. This avoids an array of reporting difficulties, as well as kludges such as including special rows in dimension tables for "not applicable." Determining the grain of a fact table is an important design step and helps avoid future confusion. (There are times when "not applicable" attributes are necessary, but they are most often a sign of the need for another fact table.)

    In the example, the grain is sales by customer, product, salesperson and date. A better design might capture sales at the order line level of detail.

  10. Additivity. Facts are usually additive. This means they can be summed across any dimension value. For example, order_dollars can be aggregated across customers, products, salespeople, or time periods, producing meaningful results. Additive facts are stored in the fact table. We also store additive facts that might be computed from other facts. (order_dollars might be the sum of extended_cost and margin_dollars, but why include only two of the three?)

    Some facts are non-additive. For example, margin rate is a percentage. Two sales at 50% margin do not equate to a single sale at 100% margin—this fact is not additive. In the star, we store the fully additive components of margin (order_dollars and margin_dollars) and let front-end tools compute the ratio, as in the sketch following this list. There are also semi-additive facts, which we will look at in the next post.
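
As a sketch of that last point, the correct way to report margin rate is to sum the additive components first, then compute the ratio. The fact and dimension table names here are hypothetical:

    -- Margin rate is non-additive, so it is never stored in the fact
    -- table; it is computed from additive components at query time.
    SELECT p.product_name,
           SUM(f.margin_dollars) / SUM(f.order_dollars) AS margin_rate
    FROM order_facts f
    JOIN product p ON f.product_key = p.product_key
    GROUP BY p.product_name;
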
Most of these terms and principles can be learned by working with a sample schema. But there are many important principles that the typical "Sales" model does not reveal. In a future post, I'll look at the top 10 things the demo schema does not teach you.

Related Posts:

Ten Things You Won't Learn from that Demo Schema

© 2007 Chris Adamson