Thursday, November 15, 2007

How To Document a Type 3 SCD: A Pair of Type 1 Responses

Many dimensionally-aware database modeling tools allow you to document the Slow Change characteristics of each dimension attribute. But often, your only choices are Type 1 or Type 2. This post describes what to do when your design calls for the less common Type 3 change. As it turns out, a Type 3 SCD response can be understood as a pair of Type 1 responses.

In a previous post, I pointed out that while we think about slow changes with respect to a source system, we usually document them as part of the dimensional schema. For example, if the customer's name or date of birth changes on the source, what do we do in the star schema? A Type 1 response overwrites the corresponding value in the star schema; a Type 2 response inserts a new row for the revised version of the customer. We usually document this behavior by flagging each attribute in the star as "Type 1" or "Type 2," even though the "change" happens on the source.

(Not sure what Slowly Changing Dimensions are? See For Slowly Changing Dimensions, Change Is Relative for a definition of the concept and explanations of the Type 1 and Type 2 techniques.)

A Type 3 change calls for yet another kind of response to a change in source data. This response maintains before and after versions of the value in separate dimension attributes. Because the before and after values are stored in the same row, either version can be associated with activity (read: facts) that occurred before or after the change.

Schema designers often throw up their arms in disgust when they discover that their modeling tool does not support the concept of a Type 3 change. But this despair may not be warranted; it is possible to document a Type 3 change as a pair of Type 1 changes.

Before we look at how to do this, here's a quick example of a Type 3 slow change.

Type 3 SCD: An Example

Suppose that the dimensional model contains a dimension table called product, and that one of the dimension attributes is category. The category attribute carries a standard classification mechanism used to categorize products, and appears in a large number of reports. An example of a category is "stationery."

A wide variety of products have this category designation, including several types of envelope, note pad, and paper sheets. All these products have rows in the product dimension table with a category of "stationery." Fact tables linked to the product dimension can be analyzed by category, grouping all these products under the value "stationery."

Now let's assume that the business has expanded its product lines, and the old "stationery" category is a bit too broad. Management plans to redesignate existing stationery products as "Personal Stationery" or "Business Stationery". Going forward, they want to use the new values for all analysis, even when studying facts that were recorded prior to the changeover. So far, this sounds like a Type 1 change.

The catch: some planning numbers are tied to the old designations. For at least a while, management will need the ability to use the old value to study facts, whether they were recorded before or after the change. (A type 2 response does not help here; it associates all facts recorded before the change with the old value, and all facts recorded after the change with the new value.) And, now that they've thought about it some more, it might be nice to be able to do this every year.

The solution is to maintain two dimensional attributes for the category. The first is meant to carry the current category designation, and might be called category_current. The second is meant to carry the old category designation, and might be called category_previous.

When the category for a product changes, say from "stationery" to "business stationery", we update the category_current to contain the new value "business stationery" and place the old value, "stationery" in category_previous. When management wants to study facts using the new designations, their reports should group facts using the category_current dimension attribute. When they want to study facts using the old categorization scheme, they can use the category_previous attribute.
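As a sketch of this response (Python with SQLite; the table layout is an illustrative assumption based on the example), both assignments can be made in a single UPDATE. Standard SQL evaluates every right-hand side against the row's pre-update values, so the old category lands safely in category_previous:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (
    product_key       INTEGER PRIMARY KEY,  -- surrogate key
    product_id        TEXT,                 -- natural key
    name              TEXT,
    category_current  TEXT,
    category_previous TEXT
);
INSERT INTO product VALUES (1, 'P100', 'Letter Envelope', 'stationery', NULL);
""")

# Type 3 response: preserve the old value, then overwrite with the new one.
# Both SET expressions see the pre-update row, so a single statement suffices.
conn.execute("""
    UPDATE product
    SET category_previous = category_current,
        category_current  = 'business stationery'
    WHERE product_id = 'P100'
""")

row = conn.execute(
    "SELECT category_current, category_previous FROM product "
    "WHERE product_id = 'P100'").fetchone()
print(row)  # ('business stationery', 'stationery')
```

Reports that group by category_current use the new scheme; reports that group by category_previous reproduce the old one, for all facts regardless of when they were recorded.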

Documenting Type 3 Change

As discussed in the previous post, we often document slow change response on an attribute by attribute basis in the dimensional model. For our product example, this might be done as shown in the following illustration.

Many modeling tools allow us to document the SCD characteristic of a column, offering a pick list that allows us to choose what type of response is expected when the source of an attribute changes. But what if your tool does not support Type 3? Or what if you have a tool that automates the construction of an ETL routine, aggregate table or OLAP cube, and the tool does not offer the Type 3 option?

There may be a solution. Notice that when a Type 3 change occurs, we update both versions of the attribute. In our example, category_current is updated with the new value of the category, and category_previous is updated with the prior value (which is the old value of category_current). We effectively have a pair of Type 1 attributes, and can document them as shown below. I've added a highlight to illustrate the pair of attributes in question.

Of course, if you have tools that use this information to do something, such as generate a cube or ETL script, you need to be sure this method will not be disruptive. You will probably need to provide sources for each of the attributes, and it may be challenging to define a source for category_previous, which will probably not be present in the source after the change has occurred. The solution may be to define its source as the category_current value of the product dimension table itself. This, in turn, may require that you join to a copy of the dimension when extracting source data.
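One way to express that self-referencing source (sketched in Python with SQLite; the staging table and its columns are hypothetical): the extract joins incoming rows to a copy of the dimension, so the currently stored category becomes the source value for category_previous:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (product_id TEXT, category_current TEXT, category_previous TEXT);
CREATE TABLE stage_product (product_id TEXT, category TEXT);
INSERT INTO product VALUES ('P100', 'stationery', NULL);
INSERT INTO stage_product VALUES ('P100', 'business stationery');
""")

# The "source" for category_previous is the dimension's own current value,
# obtained by joining the extract to the existing dimension table.
changes = conn.execute("""
    SELECT s.product_id,
           s.category         AS category_current_source,
           d.category_current AS category_previous_source
    FROM stage_product s
    JOIN product d ON d.product_id = s.product_id
    WHERE s.category <> d.category_current   -- changed rows only
""").fetchall()
print(changes)  # [('P100', 'business stationery', 'stationery')]
```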

Repeatable Process

To round out this discussion of the Type 3 change, I should point out that this slow change process is repeatable. Next year, if the designations change again, we can move category_current values into category_previous, and place the new values in category_current.

Note that we always have access to two values: the current value and the prior value. Not enough? It's also possible to maintain more than two columns for a type 3 response. We might choose to maintain columns called category_current, category_last_year, category_2yrs_ago, etc. Now, any or all of the values can be used to study all facts, regardless of when they were recorded. (A scheme like category_2007, category_2006, category_2005 might work as well, but might require you to redesign reports each year to use the correct attributes.)
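The annual rotation with three columns might be sketched as follows (Python with SQLite; the column names follow the example, and "corporate stationery" is a made-up new designation). Because SQL evaluates every right-hand side against the pre-update row, the assignments cannot clobber each other:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (
    product_id         TEXT,
    category_current   TEXT,
    category_last_year TEXT,
    category_2yrs_ago  TEXT
);
INSERT INTO product VALUES ('P100', 'business stationery', 'stationery', NULL);
""")

# Annual rotation: each value shifts one column back,
# and the new designation enters category_current.
conn.execute("""
    UPDATE product
    SET category_2yrs_ago  = category_last_year,
        category_last_year = category_current,
        category_current   = 'corporate stationery'
    WHERE product_id = 'P100'
""")

row = conn.execute(
    "SELECT category_current, category_last_year, category_2yrs_ago "
    "FROM product").fetchone()
print(row)  # ('corporate stationery', 'business stationery', 'stationery')
```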

More Changes on the Way

That's not quite the end of the story of SCD's. In future posts, I will look at the popular "timestamped" and "hybrid" approaches, as well as run down a few advanced concepts and caveats.

Copyright (c) 2007 Chris Adamson

Tuesday, October 9, 2007

For Slowly Changing Dimensions, "Change" is Relative

There's a difference between the way we think about Slowly Changing Dimensions and the way we document them. In this post, I'll highlight this difference by examining the two most common Slow Change techniques.

The term "slowly changing dimension" originated with Ralph Kimball, who identified three techniques for dealing with changed data. Commonly abbreviated as SCD's, these techniques are applied in any form of dimensional design, regardless of the data warehouse architecture.

In practice, there is a subtle but important difference between the way we think about these changes and the way we describe them in a dimensional design. This sometimes leads to confusion.

Before I explain this important distinction, let me review the difference between surrogate and natural keys, and describe the two most common SCD techniques. (Future posts will look at other slow change techniques.)

Natural Keys and Surrogate Keys

We usually think of dimension tables in a star schema as corresponding to something in a source system. For example, each row in a customer dimension table relates to a single customer in a source system. Each column is loaded from one or more sources, based on a set of rules. The link back to a source system is preserved in the form of a natural key, usually a unique identifier in a source system, such as a customer_id.

But the star schema design does not rely on this natural key, or business key, to uniquely identify rows in dimension tables. Instead, a surrogate key is introduced. This surrogate key gives the dimensional design flexibility to handle changes differently than they are handled in source systems, while preserving the ability to perform joins using a single column.

Type 1 and Type 2 Slow Changes

Slowly changing dimension techniques determine how the dimensional model will respond to changes in the source system. If the customer with id 8472 changes, what do we do with that change? Alert readers may already be concerned about what I mean by "change" here, but let's first recap the two most common techniques.

  • Type 1: Update When the dimensional model responds to a change in source data by updating a column, Kimball calls this a type 1 change. For example, if a customer's date of birth changes, it is probably appropriate to update the corresponding row for that customer in the dimension table.

    Under this scenario, any facts that were already associated with the dimension table row have effectively been revised as well. A report of sales dollars by date of birth, for example, will provide different results immediately before and after the type 1 change is applied. The type 1 change does not preserve history of the attribute value.
  • Type 2: New Row A more common response to a changed data element is to insert a new row into the dimension table. For example, when the address of customer 8472 changes, we create a new row for the customer in the dimension table. This row has a different surrogate key, and the new address. Customer 8472 now has two rows in the dimension, each with its own surrogate key.

    This preserves the history of the attribute, and does not revise any previously stored facts. New facts will be associated with the new version of customer 8472; old facts remain associated with the old version.
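The two responses above can be sketched as follows (Python with SQLite; the table layout and the is_current flag are illustrative assumptions, not part of the original example). The birth_date change is handled as Type 1, the address change as Type 2:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer_dim (
    customer_key INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
    customer_id  TEXT,     -- natural key from the source system
    name         TEXT,
    address      TEXT,     -- documented as a type 2 attribute
    birth_date   TEXT,     -- documented as a type 1 attribute
    is_current   INTEGER DEFAULT 1
);
INSERT INTO customer_dim (customer_id, name, address, birth_date)
VALUES ('8472', 'Pat Jones', '12 Elm St', '1970-01-01');
""")

def apply_change(conn, customer_id, attribute, new_value, scd_type):
    """Respond to a changed source attribute per its documented SCD type."""
    if scd_type == 1:
        # Type 1: overwrite in place; history of the value is lost.
        conn.execute(f"UPDATE customer_dim SET {attribute} = ? "
                     "WHERE customer_id = ? AND is_current = 1",
                     (new_value, customer_id))
    elif scd_type == 2:
        # Type 2: expire the current row and insert a new version,
        # which receives its own surrogate key.
        old = conn.execute(
            "SELECT name, address, birth_date FROM customer_dim "
            "WHERE customer_id = ? AND is_current = 1",
            (customer_id,)).fetchone()
        row = dict(zip(("name", "address", "birth_date"), old))
        row[attribute] = new_value
        conn.execute("UPDATE customer_dim SET is_current = 0 "
                     "WHERE customer_id = ? AND is_current = 1",
                     (customer_id,))
        conn.execute("INSERT INTO customer_dim "
                     "(customer_id, name, address, birth_date) "
                     "VALUES (?, ?, ?, ?)",
                     (customer_id, row["name"], row["address"],
                      row["birth_date"]))

apply_change(conn, '8472', 'birth_date', '1970-02-01', scd_type=1)  # overwrite
apply_change(conn, '8472', 'address', '99 Oak Ave', scd_type=2)     # new row

rows = conn.execute("SELECT customer_key, address, is_current "
                    "FROM customer_dim ORDER BY customer_key").fetchall()
# Customer 8472 now has two rows, each with its own surrogate key.
```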

For the most part, these two techniques form the basis of a dimensional model's response to change. (Future posts will consider the less common type 3 change, and additional techniques.) While these concepts are fairly easy to understand, it is important to look a bit deeper.

We think about slow changes with respect to the source

Notice the way that the original problem was framed. I asked how the dimensional schema would "respond to changes in the source data." This is how we usually think about the problem, and for good reason. After all, the source data exists before it is loaded into the dimensional schema. If birth_date changes, we overwrite; if address changes, we insert a new record.

Now observe that a change to the source does not always result in a change in the dimensional schema. In the example, a change in address resulted in a new row, not a changed row. No existing data is changed.

Still, we refer to this process as the occurrence of a type 2 change. Why? Because we think about slow changes with respect to the source data. And there, a change did occur.

We document slow changes with respect to the star

The most common way to document the dimensional schema's response to change is on the dimensional side, on an attribute by attribute basis. For each column in a dimension table, we note how changes in the source data will be handled. Our customer example might be documented as follows:

In the diagram, each non-key attribute is tagged with a 1 or a 2. This indicates whether changes in the source of the attribute should be handled as type 1 or type 2 changes.

Documenting SCD behavior in this way is handy. ETL developers use this information to design a scheme for performing incremental loads. Report developers use this information to understand how facts will be grouped when combined with different dimension attributes.

The only drawback to documenting SCD rules in this way is that it can lead to confusion. By tagging an attribute as a "type 2 SCD" we risk implying that attribute values may change. After all, the "C" in "SCD" stands for "change."

But of course, this attribute does not change. Rather, its classification as a type 2 attribute means "for a given natural key, if the source for this attribute undergoes a change, it will be necessary to insert a new row."

In future posts, I will look at some common misconceptions about slowly changing dimensions, and discuss additional techniques for handling changes.

Copyright (c) 2007 Chris Adamson

Monday, September 10, 2007

Mastering Data Warehouse Aggregates: Errata

This post contains updates and corrections for Mastering Data Warehouse Aggregates: Solutions for Star Schema Performance. I'll update it as needed; hopefully that won't be very often!

Book: Mastering Data Warehouse Aggregates

Mastering Data Warehouse Aggregates

By Chris Adamson
Wiley Publishing Inc., 2006
Many thanks to those of you who have sent feedback, praise, comments and questions. It is good to know that the book has been well received, and is being studied carefully. Happily, I have only been made aware of a couple of errors. Here they are, in page number order.

  • Page 25: Third paragraph, last sentence, "...has summarized order_facts by completely omitting the salesperson and order_type dimension tables."  This sentence should end "customer and order_type dimension tables," in accordance with Figure 1.8.
  • Page 171: Figure 5.7 incorrectly lists a BRAND_KEY in the upper right of the diagram. The column should be labeled PRODUCT_KEY, as described in the text.

  • Page 195: Step 2.1.4 in figure 6.3 should have a connection to step 2.1.5 (as indicated in the text on page 203).
For more information on the book, including the complete table of contents and a sample chapter, visit the Wiley website. The book is available online and at stores everywhere. And, of course, feel free to send me your comments.


Friday, July 6, 2007

Drive Warehouse Strategy with a Dimensional Model

CIO's and managers of successful data warehouses understand that a dimensional model is more than just a design-stage project deliverable. They use the dimensional model to drive data warehouse strategy, capture requirements, set project priorities, and manage project scope.

Fundamentally, a dimensional model deals with the measurement of business processes. It describes how a business process is evaluated, and can be used to frame questions about a process. In this respect, it speaks clearly to the business users of the data warehouse.

A dimensional model also has technical implications. Its definition determines the data sources that must be integrated, how information must be cleansed or standardized, and what queries or reports can be built. In this respect, it speaks clearly to the developers of the data warehouse.

These business and technical characteristics of the dimensional model make it an ideal focal point for managing the entire data warehouse life cycle. A dimensional model can serve as the basis for a shared understanding of warehouse strategy. From a business perspective, it imparts a clear understanding of functional capability; from an I.T. perspective, it supports a clear understanding of technical activity.

  • Warehouse Strategy It is well understood that planning a dimensional model at an enterprise level can enable the incremental implementation of subject area applications. For a Kimball-style dimensional data warehouse, this process is critical to ensuring that stove-pipes are not developed.

    But a high-level dimensional model has additional value, even if you are not building a dimensional data warehouse. Because it provides a clear framework to describe functional capability and technical work-units, a dimensional model is an effective way to plan and document an enterprise data warehouse architecture.

    I use the dimensional model as the central focus of data warehouse strategic plans. It is understood by business and technical constituents, bringing them together with a shared understanding of scope, functionality and technical effort for each subject area. It clearly conveys functionality, while at the same time allowing development activity to be quantified.

  • Requirements Definition In the same way, a dimensional model is an ideal way to capture requirements—whether at a strategic level, or for a subject-area implementation (or data mart.) Report requirements, data requirements, and loading requirements can all be expressed in dimensional terms. This makes each requirement understandable to a variety of audiences, and allows their dependencies to be easily cross-referenced.

    I use a dimensional format to capture requirements and develop specifications for deliverables such as reports, applications, ETL routines, and, of course, schema design.

  • Project Prioritization A dimensional model can be cross referenced with business priorities, report functionality, data availability, load requirements, and several other factors that drive a development roadmap. As a common ground for business and technical interaction, it is an invaluable tool.

    Once priorities are set, the dimensional framework can be used to describe them. By segmenting a dimensional model into a set of sequenced projects, it clearly links both functionality and technical effort to the calendar. At the same time, it enables analysis of the resources that will be required.
  • Managing Scope The dimensional model is an ideal way to define and manage the scope of projects. I use dimensional terms to describe project objectives, what is considered in and out of project scope, and how change requests will be evaluated by project leadership.

    This is particularly useful for implementation projects that employ an "iterative" build process, where scope-creep is an ever-present possibility. By linking project scope to dimensional characteristics (such as grain, sources, or attribution), physical design can be enhanced through iterations without allowing the project to spiral out of control.
A dimensional focus can characterize a variety of deliverables, regardless of the type of architecture your data warehouse employs. Strategy documents, report requirement definitions, project definitions, load specifications... A unified dimensional approach links all these deliverables together, and expresses important information in a way that provides value to business and technical audiences.

© 2007 Chris Adamson

Monday, June 11, 2007

Ten Things You Won't Learn from that Demo Schema

Many people learn about dimensional modeling by studying a sample star schema database that comes with a software product. These sample databases are useful learning tools—to a point. Here are 10 things you won't learn by studying that demo.

If you've learned everything you know about star schema by working with a sample database, you probably have a good intuitive grasp of star schema design principles. In a previous post, I provided a list of 10 terms and principles that most sample databases illustrate well.

But there are many important things about a dimensional data warehouse that are not revealed by the typical "Orders" or "Sales" demo. Here are the top 10 things you will not learn from that sample database.

  1. Multiple Fact Tables Most sample databases contain a single star—one fact table and its associated dimension tables. But it is rare to find a business process that can be modeled with a single fact table; it is impossible to find an enterprise that can. Most real-world designs will involve multiple fact tables, sharing a set of common dimensions.

    When facts become available at different times, or with different dimensionality, they almost always belong in separate fact tables. Modeling them in a single fact table can have negative consequences. Mixed grain issues may result, complicating the load and making reporting very difficult. For example, building reports focused on only one of the facts can result in a confusing preponderance of extra rows containing the value zero.

  2. Conformance With multiple fact tables, it is also important that each star be designed so that it works with others. A design principle called conformance helps ensure that as we build each new star, it works well with those that came before it. This avoids the dreaded stove-pipe. This principle allows a set of star schemas to be planned around a set of common dimensions and implemented incrementally.

  3. Drilling Across It's also important to understand how to properly build a report that accesses data from multiple stars. A single SQL select statement won't do the job. Double counting, or worse, can result. Instead, we follow a process called drilling across, where each star is queried individually. The two result sets are then combined based on their common attributes. These drill across reports are some of the most powerful in the data warehouse.

  4. Snapshot Fact Tables The fact table found in most demo stars is usually called a transaction fact table. But there are real world situations where other types of fact table designs are called for.

    A snapshot design is useful for capturing the result of a series of transactions at a point-in-time; for example, the balance of each account in the general ledger at the end of each day. This type of design introduces the concept of semi-additivity, which can be a problem for many ad hoc query tools. It makes no sense to add together yesterday's balance and today's balance. It is not uncommon to compute averages based on the data in a snapshot star. But one must be careful here; the SQL Average() function may not always be what you need.

  5. Factless Fact Tables Another type of fact table often contains no facts at all. Factless fact tables are useful in situations where there appears to be nothing to measure aside from the occurrence of an event, such as a customer contact. They also come in handy when we want to capture information about which there may be no event at all, such as eligibility.

    In addition to transaction, snapshot and factless designs, there are other types of fact table as well. It is not uncommon to need more than one, even when modeling a single activity.

  6. Roles and Aliasing Many business processes involve a dimension in multiple roles. For example, in accounting a transaction may include the employee who makes a purchase, as well as the employee who approves it. There is no need for separate "Purchaser" and "Approver" dimensions. A single "Employee" dimension will do the job. The fact table will have two foreign key references to the Employee dimension: one that represents the purchaser, and one that represents the approver. We use SQL "aliasing" when querying this schema in order to capture the two employee roles.

  7. Advanced Slow Change Techniques If you are lucky, you were able to learn about Type 1 and Type 2 Slowly Changing Dimension techniques from the demo schema. I described these in a previous post. Often, analytic requirements require more.

    A Type 3 change allows you to "have it both ways," analyzing all past and future transactions as if the change had occurred (retroactively) or not at all.

    There are also hybrid approaches, one of which tracks the "transaction-time" version of the changed data element as well as the "current-value" of the data element.

    And then there's the time-stamped dimension technique, also called a transaction dimension. In this version, each row receives an effective date/time and an expiration date time. This provides Type 2 functionality, but also allows point-in-time analysis of the dimensional data.

  8. Bridge Tables Perhaps the most confusing technique for the novice dimensional designer is the use of bridge tables. These tables are used when the standard one-to-many relationship between dimension and fact does not apply. There are three situations where bridge tables come in handy:

    An attribute bridge resolves situations where a dimension attribute may repeat multiple times. For example, a dimension table called "Company" may include an attribute called "Industry." Some companies have more than one industry. Rather than flattening into "Industry 1," "Industry 2," and so on, an attribute bridge captures as many industries as needed.

    A dimension bridge resolves situations where an entire dimension may repeat with respect to facts. For example, there may be multiple salespeople involved in a sale. Instead of loading the fact table with multiple salesperson keys, a dimension bridge gracefully manages the group of salespeople.

    A hierarchy bridge resolves situations where a recursive hierarchy exists within a dimension. For example, companies own other companies. At times, users may want to roll transactions up that occur beneath a specific company, or vice versa. Instead of flattening the hierarchy, which imposes limitations and complicates analysis, a hierarchy bridge can be joined to the transaction data in various ways, allowing multiple forms of analysis.

    All bridge implementations have implications for usage, or report building. Improper use of a bridge can result in double counting or incorrect results. Bridges also make deployment of business intelligence tools more difficult.

  9. Derived Schemas Useful stars can also be derived from existing stars. Often called "second-line" solutions, these derived schemas can accelerate specific types of analysis with powerful results. The merged fact table combines stars to avoid drilling across. The sliced fact table partitions data based on a dimension value, useful in distributed collection and analysis. The pivoted fact table restructures row-wise data for columnar analysis and vice-versa. And set operation fact tables provide precomputed results for union, intersect and minus operations on existing stars.

  10. Aggregate Schemas One of the reasons the star schema has become so popular is that it provides strong query performance. Still, there are times when we want results to come back faster. Aggregate schemas partially summarize the data in a base schema, allowing the database to compute query results faster. Of course, designers need to identify aggregates that will provide the most help, and queries and reports will receive no benefit unless they actually use the aggregate. Aggregates are usually designed as separate tables, instead of providing multiple levels of summarization in a single star. This avoids double counting errors, and can allow the exploitation of an automated query rewrite mechanism so that applications do not need to be aware of the aggregates.
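The drilling-across process from item 3 can be sketched as follows (Python with SQLite; the order and shipment stars and their columns are hypothetical). Each star is queried separately, grouped by the conformed attribute, and the result sets are merged afterward, which avoids the double counting a single join across both fact tables could produce:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (product_key INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE order_facts (product_key INTEGER, order_dollars REAL);
CREATE TABLE shipment_facts (product_key INTEGER, shipment_dollars REAL);
INSERT INTO product VALUES (1, 'stationery'), (2, 'furniture');
INSERT INTO order_facts VALUES (1, 100.0), (1, 50.0), (2, 400.0);
INSERT INTO shipment_facts VALUES (1, 120.0), (2, 300.0);
""")

# Step 1: query each star individually, grouped by the conformed attribute.
orders = dict(conn.execute("""
    SELECT p.category, SUM(f.order_dollars)
    FROM order_facts f JOIN product p ON p.product_key = f.product_key
    GROUP BY p.category""").fetchall())
shipments = dict(conn.execute("""
    SELECT p.category, SUM(f.shipment_dollars)
    FROM shipment_facts f JOIN product p ON p.product_key = f.product_key
    GROUP BY p.category""").fetchall())

# Step 2: merge the two result sets on the common attribute.
merged = {c: (orders.get(c, 0), shipments.get(c, 0))
          for c in sorted(set(orders) | set(shipments))}
print(merged)  # {'furniture': (400.0, 300.0), 'stationery': (150.0, 120.0)}
```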
I limited this list to 10 things. That's enough to make the point: a demo schema will only take you so far. When I teach Advanced Star Schema Design courses, I find that even people with many years of experience have questions, and always learn something new.

If you want to learn more, read the books recommended in the sidebar of this blog. Take a class on Advanced Star Schema Design. Interact with your peers at The Data Warehousing Institute conferences. And keep reading this blog. There will always be more to learn.

Related Posts:

Top 10 Things You Should Know About that Demo Schema

© 2007 Chris Adamson

Tuesday, May 22, 2007

Data Security: Don't Be the Weak Link

Data Warehouse project teams dedicate a lot of time to protecting personal information—and rightly so. But don’t let your notebook computer become the weak link. I suggest three simple steps you can follow to protect yourself, your company and your customers.

The Letter

Imagine that you have just received a letter that has you more than a little upset. It came from a financial institution where you have an account. It reads,

We are writing to inform you of a recent incident which may have compromised some of your personal information…Recently, a notebook computer belonging to one of our employees was stolen out of a hotel meeting room.

The computer contained confidential information about many of our customers. Data on the computer included your account numbers, balance information, and social security number…

The letter goes on to say that the computer was password protected, and that most computers are stolen for the value of the hardware, not because the thief is looking for confidential data. Still, they have set up a hotline, and will provide resources to support you in the event of identity theft.

Your Reaction

How do you feel? Probably angry. Why was that notebook left unsecured? Why was information about you on it in the first place? One of your first reactions might be to consider taking your account elsewhere.

Now imagine that it was your computer that was stolen. How comfortable would you be, knowing the theft of your computer forced your company to notify several thousand customers that their confidential information may have been breached?

Worse yet, imagine that you were a consultant working for the company. You have caused them a massive headache, and also jeopardized your own company’s contract with the financial institution.

Guarding Confidential Data in the Warehouse

Data warehousing teams are often entrusted with confidential information relating to an organization’s customers, employees and activities. We spend a great deal of time designing solutions that safeguard this information:

  • We create alternate identifiers so that social security numbers are not needed to link data sets together
  • We maintain extract files on secure servers, accessed only by specific users
  • We implement encryption on sensitive data elements
  • We manage ETL processing in a secure environment
  • We provide lookup functions that allow users to find a match to a data element, without providing unfettered access to the element itself
  • We take steps to ensure that log files do not contain sensitive data elements
  • We develop database-level security at column and row levels, and provide user access on a need-to-know basis
  • We audit access to confidential data elements

Most of this work is focused on securing the processing, storage and use of sensitive data elements. And most of the focus is on servers. But what about your PC?

Don’t Be the Weak Link

As someone involved in the design, development or testing of an analytic system (or any system), you may have access to files or extracts that contain confidential data. Some of this could, quite innocently, land on your PC. Your organization may have strict policies about this access.

Whether or not they do, I encourage you to adopt your own policy as well. I do. And it is a simple one:

"Not on MY computer. Period."

Whether developing a database design, architecting an ETL process, or developing a quality assurance process, we are all able to do our jobs without turning our notebook computers into the weak link in the security chain.

I refuse to accept any file that contains sensitive data, whether encrypted or not. When I need to review an extract file, I examine it on a secure server, from a secure workstation. You get the idea.

By keeping this information off our notebooks, we avoid being the cause of one of those dreaded customer letters. By protecting ourselves, we also protect our customers and our company.

Guarding Your PC

Take a few simple steps to avoid being in that uncomfortable position:

  1. Just say No

    Don’t accept any confidential data and store it, even temporarily, on your PC.

  2. Wipe it Out

    If you receive a file that turns out to contain confidential data, get rid of it. Use a “wipe” utility or “secure delete” to completely expunge it from your system. Notify the sender that you have done so, and request a “clean” version of the file.

    There are a variety of utilities that can wipe a file or your hard drive. A list can be found at the bottom of the DBAN page at Sourceforge. One free utility is Heidi Eraser. (Links appear at the end of this article.)

  3. Stay Up to Date

    A physical theft is not the only thing to worry about. Keep your computer's software up to date. Yes, I.T. probably keeps your OS, firewall and critical applications up-to-date.

    But you’ve got other things on that machine. You know you do. Plug-ins. Utilities. Media players. They seem innocent enough: iTunes, WinAmp, Acrobat Reader, Flash Player, Real Player, Java... But these applications and many others can also cause a breach on your machine.

    One place to check what needs updating is the Secunia Software Inspector (link at end of post), which scans for a wide variety of programs and tells you what security updates are needed. This check requires Java – be sure to disable it after you’re done.
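The "wipe" step above can be sketched in a few lines of Python. This is only an illustration of what wipe utilities do, under the assumption of a simple filesystem: on journaling filesystems and SSDs, overwriting a file in place is not guaranteed to destroy every copy of the data, so use a dedicated tool such as Eraser for anything real.

```python
import os
import secrets

def wipe_file(path: str, passes: int = 3) -> None:
    """Overwrite a file's contents in place, then delete it.

    A simplified sketch of a "secure delete". Journaling filesystems
    and SSD wear-leveling can retain old copies, so prefer a dedicated
    wipe utility for real use.
    """
    length = os.path.getsize(path)
    with open(path, "r+b") as f:
        for _ in range(passes):
            f.seek(0)
            f.write(secrets.token_bytes(length))  # random overwrite pass
            f.flush()
            os.fsync(f.fileno())  # push the overwrite out to disk
    os.remove(path)
```

After wiping, notify the sender and request a clean version of the file, as described above.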

So don't be the weak link. Protect your PC and you'll be protecting yourself, your company and your customers.

© 2007 Chris Adamson

Related Links (all open in new browser windows):

DBAN page at Sourceforge:
List of hard drive wipe utilities appears at bottom of page.

Heidi Eraser (a GNU Licensed File Wipe Utility)

Secunia Software Inspector (free service scans your computer for insecure software versions)

Tuesday, May 1, 2007

10 Things You Should Know About that Sample Star Schema

Today, many of us learn about the star schema by studying a sample database that comes with a software product, usually one that covers sales or orders. Here are 10 terms and principles of dimensional modeling to go with that sample schema you've worked with.

The star schema has become a de facto standard for the design of analytic databases. Sample stars are often included with RDBMS software, BI Tools and ETL tools. They are also used for tutorials and training. Almost universally, the sample schema describes a sales or order taking process, similar to the one depicted in the figure below:

Figure 1: A demo schema usually represents orders or sales.

You may have learned about the Star Schema by working with a sample like this one. If so, you probably have an intuitive grasp of star schema design principles. Here are ten terms and principles you should know that describe important features of the sample star.

Most of this is probably readily apparent if you've worked with a sample schema—what may be new is the terminology. The first two you probably know:

  1. Facts are measurements that describe a business process. They are almost always numeric—but not all numeric attributes are facts. You can find facts (or measurements) in almost any analytic request—"Show me sales dollars by product" (sales dollars). "How many widgets were sold by John Smith in May?" (quantity ordered). There are some schemas that do not include facts—we'll look at those in another post.

  2. Dimensions give facts context. They may be textual or numeric. They are used to specify how facts are "filtered" and "broken out" on reports. You can usually find dimensions after the words "by" or "for" in an analytic request. "Show me sales dollars by product" (product). "What are margin dollars by Month and Salesperson?" (month, sales rep).

  3. Dimension tables are wide. Dimension tables usually group together a set of related dimension attributes, though there are situations where a dimension may include a set of attributes not related to one another. Dimension tables are not normalized, and usually have a lot of attributes—far more than appear in most sample schemas. This allows a rich set of detail to be used in analyzing facts. 100 or more columns is not uncommon for some dimensions. For this reason, we often call dimension tables wide.

  4. Dimensions have Surrogate Keys. The primary key for each dimension table is an attribute specifically created for the dimensional schema. It is an integer assigned by the ETL process, and has no inherent meaning. It is not a reused key from a source system, such as a customer ID or product code. We call these attributes natural keys, and they may exist in the star, but do not serve as unique identifiers.

    In the sample schema, customer_key is a surrogate key generated for the star schema; customer_id is a natural key carried over from a source system.

    By assigning surrogate keys, we enable the star to handle changes to source data differently than the source system does. For example, in a source system a customer record may be overwritten, while we want the star schema to track changes. Performance considerations also come into play—a surrogate key avoids the need for multi-column joins.

  5. Type 2 Changes track history. The term "Slowly Changing Dimension" (or SCD) describes how the data warehouse responds to changes in the source of dimensional data. There are several techniques that can be applied when the source of dimension detail changes. The most common is referred to as a "Type 2" change: an entirely new record is written to the dimension table.

    For example, if a customer moves, the record may simply be updated in a source system. But in the star schema, we choose to add a new row to the customer dimension, complete with a new surrogate key. All prior facts remain associated with the "old" customer record; all future facts will be associated with the new record.

  6. Type 1 Changes overwrite history. The Type 1 change is used when source data changes are not deemed significant, or may be the correction of an error. In such cases, we perform an update to an existing row in a dimension. For example, if a customer's gender is updated in the source, we may choose to update it in the corresponding dimension records. All prior facts are now associated with the changed value.

    In addition to Type 1 and Type 2 changes, there are other SCD techniques. Hybrid approaches exist as well. Every design should identify which technique(s) will be used for each attribute of each dimension table.

  7. Fact tables are narrow. A fact table row is usually entirely composed of numeric attributes: the facts, and foreign key references to the dimensions. Because of these characteristics, each fact table row is narrow, at least in contrast with wide dimension rows full of textual values. The narrowness of fact tables is important, because they will accumulate far more rows than dimension tables, and at a much faster rate.

  8. Fact tables are usually sparse. Rows are recorded in the fact table only when there is something to measure. For example, not every customer orders every product from every salesperson each day. Rows are only recorded when there is an order. This helps manage the growth of the fact table. It also saves us from having to filter out a huge number of rows that have no sales dollars when displaying results in a report. (Usually, you don't want a customer sales report to list every product—only the ones they bought. You can use an outer join when you want the latter.)

  9. Fact tables have a grain. The level of detail represented by a row in a fact table is referred to as its grain. Facts that are recorded with different levels of detail belong in separate fact tables. This avoids an array of reporting difficulties, as well as kludges such as including special rows in dimension tables for "not applicable." Determining the grain of a fact table is an important design step and helps avoid future confusion. (There are times when "not applicable" attributes are necessary, but they are most often a sign of the need for another fact table.)

    In the example, the grain is sales by customer, product, salesperson and date. A better design might capture sales at the order line level of detail.

  10. Facts are usually additive. This means they can be summed across any dimension value. For example, order_dollars can be aggregated across customers, products, salespeople, or time periods, producing meaningful results. Additive facts are stored in the fact table. We also store additive facts that might be computed from other facts. (order_dollars might be the sum of extended_cost and margin_dollars, but why store only two of the three?)

    Some facts are non-additive. For example, margin rate is a percentage. Two sales at 50% margin do not equate to a single sale at 100% margin—this fact is not additive. In the star, we store the fully additive components of margin (order_dollars and margin_dollars) and let front end tools compute the ratio. There are also semi-additive facts, which we will look at in the next post.
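The surrogate key and slow-change behavior described in items 4 through 6 can be sketched in a few lines of Python. This is an in-memory illustration, not ETL code; the customer attributes used here (name, address) and the choice of which attribute gets which treatment are hypothetical.

```python
# Sketch of a customer dimension with surrogate keys, illustrating a
# Type 1 response (overwrite) and a Type 2 response (new row).
dimension = []   # rows of the customer dimension
next_key = [1]   # surrogate key sequence, assigned by the ETL process

def current_row(customer_id):
    """Return the most recent dimension row for a natural key."""
    rows = [r for r in dimension if r["customer_id"] == customer_id]
    return rows[-1] if rows else None

def load_customer(customer_id, name, address):
    """Apply a source change: Type 2 on address, Type 1 on name."""
    row = current_row(customer_id)
    if row is None or row["address"] != address:
        # New customer, or Type 2 change: insert a new row with a
        # fresh surrogate key; prior facts keep the old row.
        dimension.append({"customer_key": next_key[0],
                          "customer_id": customer_id,
                          "name": name, "address": address})
        next_key[0] += 1
    elif row["name"] != name:
        # Type 1 change: overwrite in place (every row for this
        # customer), so all prior facts see the corrected value.
        for r in dimension:
            if r["customer_id"] == customer_id:
                r["name"] = name

load_customer(101, "John Smith", "12 Oak St")
load_customer(101, "John Smith", "90 Elm Ave")  # move: Type 2, second row
load_customer(101, "Jon Smith", "90 Elm Ave")   # correction: Type 1, in place
```

Note that customer_id is the natural key carried over from the source; customer_key is the surrogate key that fact rows would reference.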

Most of these terms and principles can be learned by working with a sample schema. But there are many important principles that the typical "Sales" model does not reveal. In a future post, I'll look at the top 10 things the demo schema does not teach you.

Related Posts:

Ten Things You Won't Learn from that Demo Schema

© 2007 Chris Adamson

Tuesday, April 10, 2007

Three Myths About Star Schema

Think Bill Inmon is anti-star schema? Think again. This post debunks three common myths about star schema design.

As I wrote in a previous post, there are three styles of data warehousing that incorporate star schema design: Kimball’s Dimensional Data Warehouse, Inmon’s Corporate Information Factory, and what I call the “stand-alone” data mart.

These architectures make use of the star schema in different ways. And for proponents of each approach, terms like “enterprise data warehouse” and “data mart” have very different connotations. This confusing situation has given rise to some misconceptions about the star schema.

In this post, I will set the record straight on three common myths.

Myth #1: Bill Inmon is anti-Star Schema

Perhaps the most common misconception about the star schema is that Bill Inmon is opposed to it. In fact, Inmon advocates the use of the star schema to build data marts as part of his Corporate Information Factory architecture. Where he does not want dimensional design used is for the granular repository of enterprise data from which data marts are drawn.

So where did this myth come from? In part, it stems from a common oversimplification of data warehousing theory, which distills a large set of technical issues into the catch-phrase “Kimball vs. Inmon.” Because the star schema has become so closely associated with Kimball, "Kimball vs. Inmon" is often misconstrued as a debate over star schema.

Lest you have any doubts that Inmon advocates the use of star schema, all you need to do is peruse some of his work. Or, search out the first edition of Kimball’s seminal work, The Data Warehouse Toolkit. The foreword to the first edition was penned by none other than…Bill Inmon.

Myth #2: The Star Schema is for Aggregated Data

I frequently encounter people who believe that the star schema is used only for aggregated data. While it can be used for aggregated data, it does not have to be used that way. In fact, in a Kimball-style Dimensional Data Warehouse, one guiding principle dictates that fact tables should store the most granular data possible. That is quite the opposite of aggregated data.

So why the confusion? In other architectures, this principle may not be followed. A star schema data mart in an Inmon-style data warehouse may aggregate information that is extracted from a repository of atomic data. A stand-alone data mart may aggregate data drawn from source systems. And even in a Kimball-style architecture, star schemas may be supplemented by aggregate schemas designed to improve performance. (In fact, I’ve written an entire book on this topic.)

But none of these uses of the star preclude its use for the storage of granular, atomic data. Don't fall into the trap of assuming a particular application of the technique fully describes its capabilities or limitations.

Myth #3: Star Schema Data Marts Are Stovepipes

I hear this one in many flavors: using a star schema design leads to “stovepipes” or to “silos” or to “islands of information.” Another variation holds that star schema is for "bottom-up" development. The notion is that if you have multiple star schema data marts, they will not work together.

Whatever words are used, the characterization is false. This myth has probably arisen from a particular application of star schema design: the stand-alone data mart. When a data mart is built in the absence of an enterprise context, it serves an immediate purpose, but may later be found to be incompatible with analytic requirements in related subject areas.

This sort of "failure" is often observed in stand-alone data marts. But it hardly makes sense to blame such failures on the star schema design. You can build a car without a steering wheel, but that does not mean that cars can only drive in straight lines.

In fact, star schema can be very effectively used to avoid stovepipes. The principle of dimensional conformance can be applied, whether designing stars as part of a Dimensional Data Warehouse (Kimball) or a Corporate Information Factory (Inmon). This design technique ensures that common dimensions are used consistently across subject-areas, allowing data from multiple stars to be easily combined into a single report. In technical-speak, it allows applications to “drill-across” schemas or subject areas.

The result is a set of data marts that serve departmental needs, but can also be combined to answer questions that span subject areas. They may be built incrementally, but they work together. Even builders of stand-alone data marts can take measures to ensure that the solution will conform with future data marts. (I'll tackle that topic in a future post.)
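The drill-across technique can be illustrated with plain Python: each star is queried separately at a common grain, and the result sets are merged on the conformed dimension. The stars (orders and shipments), product names, and dollar values below are all hypothetical.

```python
# Drill-across sketch: two result sets, one per star, each summarized
# by the conformed product dimension, then merged into one report.
orders = {"Widget": 1200.0, "Gadget": 800.0}     # from the orders star
shipments = {"Widget": 1100.0, "Gadget": 750.0}  # from the shipments star

# Full outer merge on the conformed dimension value (product).
report = {
    product: {
        "order_dollars": orders.get(product, 0.0),
        "shipment_dollars": shipments.get(product, 0.0),
    }
    for product in sorted(set(orders) | set(shipments))
}

for product, facts in report.items():
    print(product, facts["order_dollars"], facts["shipment_dollars"])
```

The key design point is that both stars summarize to identical dimension values — which is exactly what dimensional conformance guarantees. Without it, the merge step would fail to match rows.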

So remember: Inmon is not anti-star schema. The star schema is not limited to aggregate data. Star schema data marts do not have to be stovepipes.

Copyright (c) 2007 Chris Adamson

Monday, March 26, 2007

Three Data Warehouse Architectures that Use Star Schema

Whether you follow Kimball or Inmon, or you just build stand-alone data marts, this Blog offers practical advice for your star schemas.

Most data warehouses have a data architecture that fits into one of three categories. These styles of data warehousing often use the same terms to mean very different things. The term “data mart,” for example, has very different connotations for proponents of each of the different architectures. As a result, there is a lot of misinformation about the relative merits of each approach.

But dig a little deeper, and it turns out all three architectures share something in common. Each architecture has a place for the star schema – a set of tables designed according to the principles of dimensional modeling.

Three Data Architectures

At the risk of fanning the flames of a debate that is often characterized as “religious,” here is a very brief summary of each of the three styles of data warehouse design.

Corporate Information Factory This approach is advocated by Bill Inmon, Claudia Imhoff, and many others. It involves a central repository of detailed, or “atomic,” data, organized in third normal form. Separate departmental data marts reorganize portions of this data for departmental purposes. These data marts may be built using star schema design.

Dimensional Data Warehouse This approach is advocated by Ralph Kimball, Margy Ross and many others. Atomic data from across the enterprise is stored in star-schema format. The star schemas are designed with a set of “conformed dimensions” allowing them to work together. Data marts are not separate entities, but subject areas within this set of star schemas.

Stand-Alone Data Marts Although there are no major proponents of this approach, it is very popular. Stand-alone data marts focus on a particular business process or department, and are not designed in an enterprise context. Stand-alone data marts may be built using star schema designs, or they may follow other approaches.

This information is summarized in the table below, which also includes some alternate terms used to describe each approach.

Keep in mind that this is a very high level summary, and it is here only to illustrate how the star schema factors in. I encourage readers to seek out more information on each of these approaches.

Something in Common: The Star Schema

Despite the furor over which of these architectures makes the most sense, it should be obvious that they have something in common: each has a place for the star schema. And in all cases, the principles of star schema design are the same.

Regardless of the role of the star schema in your architecture, it must be designed carefully. A poor design will lead to issues, whether you are building a corporate information factory or a dimensional data warehouse. A good design, coupled with proper expectations for your data warehouse architecture, will yield success.

Pragmatic, Not Dogmatic

In my first book on data warehouse design, there is a chapter on data warehouse architecture which advocates the dimensional data warehouse approach. But I have designed star schemas that are used successfully in all three architectures, and the principles of dimensional modeling are the same.

Sometimes people are surprised that I am willing to be involved in projects that follow all three approaches, or that I don’t automatically move to throw away imperfect solutions. But the investment of resources in a data warehouse is a business decision. Design principles may factor in, but so do a variety of other factors – factors over which I am not qualified to weigh in.

Instead, I try to be pragmatic. Yes, I have my preferred approach. But what makes the most sense for the business? Is there already a significant investment in a different architecture? Is there simply not funding for the ideal solution?

The information and resources in this blog are meant to help you, regardless of your architecture. If you use the star schema for anything, you will find help here. From time to time, I will look at concerns specific to one style or another, but this will always be clearly spelled out. Most of the time, we will look at principles that have applicability across the board.

Copyright © 2007 Chris Adamson

Tuesday, March 6, 2007

The Accumulating Snapshot

Although my latest book is about aggregate tables, it covers design basics as well as advanced topics. This excerpt describes how to use an accumulating snapshot design to analyze business processes involving multiple steps.

Excerpted from Mastering Data Warehouse Aggregates, by Chris Adamson.
Used by permission.

Many business processes can be understood in terms of a series of stages or steps that something must undergo. In made-to-order manufacturing, an individual item is ordered, manufactured, quality assured, packaged, and shipped. In banking, a mortgage application is submitted, reviewed by a loan officer, evaluated by an underwriter, and eventually consummated at a closing.

When the individual items that pass through the process are readily identifiable (an individual item being manufactured; a specific mortgage application), and the processing steps or milestones are predictable (the manufacturing process stages, the mortgage approval phases), the status of each item can be tracked by an accumulating snapshot model. Such a model provides a way to understand the status of an individual item at the detail level, the workload at each processing step, and the efficiency with which items move from one stage to the next.

Consider the process by which a mortgage application moves from initial submission to closing. The process begins when an applicant completes a mortgage application with a mortgage officer. A processor then takes over, ensuring that documentation required to evaluate the loan application is gathered. Next, the application is reviewed by an underwriter, who reviews the application and supporting documentation, evaluates potential risk, and approves or rejects the application. The amount approved for underwriting may be less than the amount of the original application. Finally, the mortgage becomes legally binding at the time of closing.

Figure 1, below, shows an accumulating snapshot for the mortgage application process. The grain of this fact table is an application. Each application will be represented by a single row in the fact table. The major milestones are represented by multiple foreign key references to the Day dimension—the date of submission, the date approved by mortgage officer, the date all supporting documentation was complete, the date approved by an underwriter, and the date of closing.

Figure 1: An accumulating snapshot tracking mortgage applications.

Note that it is not necessary to build five Day dimension tables. A single Day table can serve in all these roles; it will simply be aliased at query time or through a series of views. Similarly, the three employees involved in the process are represented by three foreign key references to the Employee dimension table—one for the mortgage officer, one for the processor that assembles the supporting materials, and one for the underwriter.

The facts include several dollar amounts. The amount of the original application is the application_amount. The amounts approved and rejected by the mortgage officer and underwriter are recorded as officer_approved_amount and underwritten_amount. The amount actually loaned at closing is the closing_amount.

When the application is first submitted, the only date that is known is the application submission date. The appropriate day_key value is assigned to the application_submission_date_key. In order to accommodate the dates that are as yet unknown, the Day dimension must contain a row for "unknown" or "n/a." The key value for this row will be assigned to the remaining dates. The application_amount is set to the requested mortgage amount from the application. The remaining dollar amounts are initialized to zero.

As the application works its way through the process, the fact table row will be updated. The various date keys are revised as each milestone is reached. The additional facts are filled in as the appropriate milestones are reached.

The fact table also contains a series of lag columns. These represent the number of days the application spends at each stage. Each is a simple calculation based on milestone dates, populated once it completes the appropriate stage. They allow measurement of processing time.

The fact table also contains a series of quantities, each corresponding to one of the milestones. The value of each count is zero or one, depending on whether the application has reached the corresponding milestone. These facilitate workload reporting, and will take on other values when aggregated. To make it easier to identify the current status of an application, a status attribute may be added to the application dimension or as a degenerate dimension in the fact table.
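The update pattern described above can be sketched in Python. This is an in-memory illustration of one fact row, not ETL code; only a subset of the milestone dates, facts, and counts from the excerpt is shown, and the Day dimension lookup is a stand-in.

```python
from datetime import date

UNKNOWN_DAY = -1   # surrogate key of the "unknown"/"n/a" row in the Day dimension

def day_key(d: date) -> int:
    """Stand-in for a Day dimension lookup (YYYYMMDD used as the key)."""
    return int(d.strftime("%Y%m%d"))

def new_application_row(submitted: date, amount: float) -> dict:
    """Insert-time state: only the submission date is known."""
    return {
        "application_submission_date_key": day_key(submitted),
        "underwriting_date_key": UNKNOWN_DAY,    # milestone not yet reached
        "closing_date_key": UNKNOWN_DAY,
        "application_amount": amount,
        "underwritten_amount": 0.0,              # dollar facts start at zero
        "closing_amount": 0.0,
        "submission_to_underwriting_lag": None,  # lag filled in later
        "submitted_count": 1,                    # milestone counts: 0 or 1
        "underwritten_count": 0,
        "closed_count": 0,
        "_submitted_on": submitted,              # kept to compute lags
    }

def apply_underwriting(row: dict, approved_on: date, amount: float) -> None:
    """Update the fact row in place when the underwriting milestone is reached."""
    row["underwriting_date_key"] = day_key(approved_on)
    row["underwritten_amount"] = amount
    row["submission_to_underwriting_lag"] = (approved_on - row["_submitted_on"]).days
    row["underwritten_count"] = 1

row = new_application_row(date(2007, 3, 1), 250000.0)
apply_underwriting(row, date(2007, 3, 20), 240000.0)
```

Summing the milestone counts across many such rows yields the workload at each stage, and averaging the lag columns measures process efficiency.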

Using this schema, it is possible to study the lifecycle of a particular application, the workload at each processing step, and the efficiency of the overall process itself.

Excerpted from Mastering Data Warehouse Aggregates, by Chris Adamson.
Copyright (c) 2006 by Wiley Publishing, Inc
Used by permission.