Showing posts with label Conformed Dimensions.

Monday, July 16, 2012

Q&A: Does Master-Detail Require Multiple Stars in a Travel Data Mart?

A reader asks whether master-detail relationships call for multiple fact tables.

Q: Is there some general principle to use as a guide when deciding to use 1 star or 2 for master-detail situations?

We have a master-detail situation involving trips, with multiple trip expenses per trip. Many of our queries are at the trip level -- e.g., how many trips are for clients in their 30s going to Arizona in July. Many of the queries are at the detail level -- e.g., how much is spent on meals by the client's age range?

Both levels of query can be answered with one star, with the fact table at the detail level (trip expense line). For trip-level questions, you can count distinct occurrences of the natural key in the trip dimension. But it seems inefficient to grind through all those trip expenses when I just want to count trips.

So do we have 2 fact tables, one at the trip level, one at the trip expense level? Do the dimensions surround both of them?  Is there a rule?


- Ellen, Ottawa
A: As the reader observes, this is a common situation: when there is a master-detail relationship, is there a need for multiple fact tables?

This case calls for at least two fact tables.
Read on for the details.


Multiple Fact Tables

After learning the basics of dimensional modeling, the first real world challenge we face is understanding when and how to design multiple fact tables.  Until we learn to think in dimensional terms, the choice can be difficult.

I suggest starting with some basic guidelines.  You probably need different fact tables if:
  1. You have measurements with different periodicity
  2. You have measurements with different levels of detail
The first guideline suggests that if facts do not describe the same event, they probably belong in different fact tables. For example, orders and shipments do not always happen at the same time.  Order dollars and shipment dollars belong in separate fact tables.

The second guideline pertains to facts that do describe the same events.  Information about an order and information about an order line are ostensibly available at the same time, but they have different levels of detail. If there are facts at both of these levels, there will need to be multiple fact tables.

So how would you apply these guidelines?

Think Dimensionally

It is easiest to work these questions out by thinking dimensionally. Forget about master-detail or parent-child for a few minutes, and consider what is being measured. 
  • What are the business questions being asked, and what business metrics do they contain?  
  • Are these metrics available simultaneously?
  • At what level of detail are these metrics available?
The reader cited business questions in her example.  These questions reveal at least two metrics:  number of trips, and expense dollars.

These metrics may be available on different schedules -- for example, a trip may commence before we have the details of all the expenses incurred.  This would argue for multiple fact tables.

But let's suppose that the single source of data is expense reports. We do not have any information about the trip until it is over, at which point we have all the expense details.

In this case, let's think about the level of detail of these metrics. Number of trips and expense dollars seem to share a lot of common detail -- the traveller, the destination, the trip start and end dates, and so forth.

But expense dollars have some additional detail -- the date of the expense item, and an expense category. Since these facts have different levels of detail, it makes sense to include them in separate fact tables.

Factless Fact Table

The trip-level star may contain a factless fact table.  It contains one row per trip, with no explicit facts.  We can determine the number of trips that answer a business question simply by counting rows.

Many teams would prefer to add a fact called number_of_trips and always populate it with the value 1.  This is useful, because it makes the business metric explicit in your schema design.  (I've written about this technique before.)
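
For example, the trip-level question from the reader -- how many trips were taken by clients in their 30s going to Arizona in July -- becomes a simple aggregate query against the trip-level star. The sketch below assumes hypothetical table and column names (trip_facts, traveller, destination, day); a COUNT(*) over the factless version works just as well as summing number_of_trips.

SELECT
  SUM(trip_facts.number_of_trips) AS number_of_trips
FROM trip_facts
  JOIN traveller   ON traveller.traveller_key     = trip_facts.traveller_key
  JOIN destination ON destination.destination_key = trip_facts.destination_key
  JOIN day         ON day.day_key                 = trip_facts.trip_start_day_key
WHERE traveller.age_range = '30-39'      -- hypothetical attribute values
  AND destination.state   = 'Arizona'
  AND day.month_name      = 'July'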

The trip-level star may also contain some summary level facts that describe expenses -- say, the total trip cost, the total transportation cost, the total meals cost, etc.  More detail on these metrics (such as the vendor, payment method, etc.) can be found in the expense item star.

Deeper into Trip Analysis

Digging deeper, the reader may discover she needs more than two fact tables.  

Trips often include more than one destination.  When it is necessary to study the various segments (or "legs") of a trip, we can define a third fact table that contains one row per segment.

In this case, the reader would have three fact tables:
  1. trip_facts -- Grain: one row per trip.  Metric: number of trips (always one)
  2. trip_segment_facts -- Grain: one row per destination city.  Metric: number of segments (always one)
  3. trip_expense_facts -- Grain: one row per expense item.  Metric: expense_dollars
Each of these stars will share some common dimensions -- the trip origination date, the trip end date, the traveller, the trip identifier, etc.

Trip_segment_facts will also include destination-specific dimensions: the destination city of the segment, the arrival and departure dates for the segment, etc.

Trip_expense_facts will include dimensions that describe the individual expense items: the date paid, the payment method, the payee, the expense category and subcategory, and so forth.
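
To make the grains and the shared dimensionality concrete, here is a rough DDL sketch of the trip-level and expense-level fact tables (the segment-level table follows the same pattern). The names and column lists are illustrative, not a complete design.

CREATE TABLE trip_facts (
  traveller_key        INTEGER NOT NULL,    -- shared dimension
  trip_start_day_key   INTEGER NOT NULL,    -- shared dimension
  trip_end_day_key     INTEGER NOT NULL,    -- shared dimension
  trip_id              VARCHAR(20),         -- degenerate dimension: the trip identifier
  number_of_trips      INTEGER NOT NULL     -- always 1
);

CREATE TABLE trip_expense_facts (
  traveller_key        INTEGER NOT NULL,    -- shared dimension
  trip_start_day_key   INTEGER NOT NULL,    -- shared dimension
  trip_end_day_key     INTEGER NOT NULL,    -- shared dimension
  trip_id              VARCHAR(20),         -- same identifier as in trip_facts
  expense_day_key      INTEGER NOT NULL,    -- detail-level dimension
  expense_category_key INTEGER NOT NULL,    -- detail-level dimension
  expense_dollars      DECIMAL(12,2)        -- the detail-level fact
);

Because the two fact tables reference the same dimension tables, trip-level and expense-level results can be compared by drilling across.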

A conformance matrix will help you keep track of the dimensionality of each star.

Learn More

Thanks to Ellen for the thoughtful question.  If you have a question of your own, please send it in.

Pick up a copy of Star Schema: The Complete Reference and you will be helping support this blog. 

  • Chapter 4 is dedicated to designing multiple fact tables.  
  • Chapter 5 looks closely at conformed dimensions. 
  • Chapter 12 covers factless fact tables.
You can learn more about these topics by referring to these posts:
  • Factless Fact Tables (September 15, 2011) describes factless fact tables that record events, and the use of a fact with the constant value of 1.
  • The Conformance Matrix (June 5, 2012) describes how to represent the dimensionality of your stars in a matrix format.

The photo in this post is by o5com, licensed by Creative Commons


Tuesday, June 5, 2012

The Conformance Matrix

Conformed dimensions are the linchpins of dimensional models. This post summarizes their use, and describes how to document them in matrix form.

Conformed dimensions: a refresher

Metrics describing different processes can be compared if they are stored in stars which share common dimensions. As I have discussed previously, these compound metrics often turn out to be among the most valuable from a business perspective.

The common dimensions do not have to be physically shared tables.  Each star may reside in a separate database.

As long as the common dimensions, such as "customer" or "product," have the same structure and content, they are said to conform.

Using conformed dimensions, we are able to compare measurements stored in different star schemas through a process called drilling across.

Planning conformance

Conformed dimensions are therefore the linchpins of the dimensional model. They ensure that each star works on its own, and also works with other stars.

If conformed dimensions are planned in advance, you can implement one star at a time without worrying about incompatibility issues.

This is the core idea behind Ralph Kimball's bus architecture. Conformed dimensions are designed as part of an up-front architecture project. Then, they serve as a semantic "bus."  Just as cards plug into the backplane of a PC, fact tables plug into this dimensional bus.

The concept is also important in other architectures. For W.H. Inmon's Corporate Information Factory, conformance allows process comparison within data marts as well as across data marts.

Without conformed dimensions, subject areas become stovepipes. The opportunity to build cross-process metrics is lost. Worse yet, users come to distrust the individual data marts.  Their thinking is that if sales and inventory cannot be compared, there must be something wrong with the data.

Documenting conformance

The conformance plan is a central feature of your dimensional model, so of course it must be documented.

Conformed dimensions are best documented in a matrix format, as in the diagram below.

Image from Star Schema: The Complete Reference by Chris Adamson
 (c) 2010 McGraw-Hill.  Used by permission.
The rows of this diagram correspond to fact tables, and the columns are dimensions.  Where you see a checkmark, it indicates that the fact table makes use of the associated dimension.

The matrix makes it easy to identify compatibility across fact tables. When two fact tables have a checkmark in the same column, that dimension can be used as the basis for comparing the processes (aka drilling across).

Notice that conformed dimensions are depicted with associated "levels."  Salesperson, for example, has three successive levels of conformity: regions, territories, and individual salespeople. (This topic has been covered previously.)

It is possible that a degenerate dimension (a dimension attribute stored within a fact table) may be a conforming dimension. These attributes should also appear on the conformance matrix. In the picture above, order_line may be a degenerate dimension.
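
As a rough illustration of the format, here is a small text version of a conformance matrix based on the sales stars used elsewhere on this blog; the rows, columns, and checkmarks are illustrative only.

                    Salesperson   Customer   Product   Proposal   Contract
    proposal_facts       x                      x          x
    order_facts          x           x          x          x          x
    shipment_facts       x           x          x          x          x
    return_facts         x           x          x                     x

(Proposals track prospects rather than customers, so that cell is blank. A full matrix would also include the various date dimensions, shipment, shipper, reason, and so forth.)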

Variations on the conformance matrix

The small conformance matrix above focuses on a subject area (sales).  Conformance across subject areas can also be illustrated using a matrix, albeit a larger one.

An enterprise level conformance matrix is a valuable tool. It is a blueprint that can guide incremental implementation. It also helps break down proprietary attitudes toward data among different groups within your business.  One look at the matrix, for example, and it becomes clear that "customer" touches several parts of the business.

Conformance matrices can be produced at different levels of summarization.  A more summarized matrix may contain one row per data mart, rather than one per fact table. This may help guide project planning, or simply make an enterprise level matrix easier to digest.

Similarly, the conformance matrix can be used to map individual facts to dimensions. Architects use this kind of matrix when they are having trouble identifying discrete fact tables. Performing affinity analysis on this kind of matrix reveals facts that share dimensionality. These may be candidates for inclusion in a single star.

Support this blog

Pick up a copy of Star Schema: The Complete Reference and you will be helping support this blog. Chapter 5 is completely dedicated to conformed dimensions.

A lot of information on conformed dimensions also appears in this blog:
  • Conformed Dimensions (Nov 15, 2011) discusses different ways dimensions can conform, and introduces the concept of "levels" of conformance.
There is also a category label for posts referencing conformed dimensions.


Matrix Falling image by Patrick Hosley
Licensed under CC 2.0

Conformance matrix illustration is from
Star Schema: The Complete Reference by Chris Adamson,  
 Copyright (c) 2010 by McGraw-Hill. Used by permission.

Friday, December 23, 2011

Three ways to drill across

Previous posts have explored the importance of conformed dimensions and drilling across.  This post looks at the process of drilling across in more detail. 

There are three primary methods for drilling across.  These methods can be applied manually, automated with BI software, or performed during the data integration process.

Querying multiple stars

Conformed dimensions ensure compatibility of information in various stars and data marts; drilling across is the process of bringing it together.  (For a refresher, see this post.)
 
For example, suppose we wish to compute "Return Rate" by product.  Return rate is the ratio of returns to shipments.

Information about shipments and returns is captured in two granular star schemas:
  • shipment_facts captures shipment metrics by Salesperson, Customer, Product, Proposal, Contract, Shipment, Shipper, and Shipment Date
  • return_facts captures return metrics by Salesperson, Customer, Product, Contract, Reason and Return Date
Reporting on return rate will require comparing facts from each of these stars. 

Drilling across

Recall that fetching facts from more than one fact table requires careful construction of queries. It is not appropriate to join two fact tables together, nor to link them via shared dimensions. Doing so will double-count facts, triple-count them, or worse.

Instead, the process must be completed in two phases.
  • Phase 1:  Fetch facts from each fact table separately, aggregating them to a common level of detail
  • Phase 2:  Merge these intermediate result sets together based on their common dimensions
In practice, there are several ways that this task can be performed.

Method 1: Issue two queries, then merge the results

The first method for drilling across completes phase 1 on the database, and phase 2 in the application (or on the application server).
  1. Construct two separate queries: the sum of quantity shipped by product, and the sum of quantity returned by product.
  2. Take the two result sets as returned by the DBMS, and merge them based on the common products. Compute the ratio at this point.
While it may seem odd that phase 2 is not performed on the DBMS, note that if the result sets are already sorted, the merge step is trivial.
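
As a sketch, the two phase 1 queries might look like the following. The joins and filters are abbreviated, and product_key is a hypothetical surrogate key; the ORDER BY clauses deliver the sorted result sets that make the application-side merge trivial.

-- Query 1: quantity shipped by product
SELECT product.product, SUM(quantity_shipped) AS quantity_shipped
FROM shipment_facts
  JOIN product ON product.product_key = shipment_facts.product_key
GROUP BY product.product
ORDER BY product.product;

-- Query 2: quantity returned by product
SELECT product.product, SUM(quantity_returned) AS quantity_returned
FROM return_facts
  JOIN product ON product.product_key = return_facts.product_key
GROUP BY product.product
ORDER BY product.product;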

Method 2:  Build temp tables, then join them

The second method performs both phases on the DBMS, making use of temporary tables.
  1. Construct two SQL statements that create temporary tables: the sum of quantity shipped by product, and the sum of quantity returned by product
  2. When these are completed, issue a query that performs a full outer join of these tables on the product names and computes the ratio.
Be sure that the temporary tables are cleaned up.
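
Temporary table syntax varies from DBMS to DBMS; the sketch below uses a generic CREATE TEMPORARY TABLE ... AS form, with joins and filters abbreviated and product_key as a hypothetical surrogate key.

CREATE TEMPORARY TABLE tmp_shipped AS
  SELECT product.product, SUM(quantity_shipped) AS quantity_shipped
  FROM shipment_facts
    JOIN product ON product.product_key = shipment_facts.product_key
  GROUP BY product.product;

CREATE TEMPORARY TABLE tmp_returned AS
  SELECT product.product, SUM(quantity_returned) AS quantity_returned
  FROM return_facts
    JOIN product ON product.product_key = return_facts.product_key
  GROUP BY product.product;

SELECT
  COALESCE(shp.product, rtn.product) AS product,
  rtn.quantity_returned / shp.quantity_shipped AS return_rate
FROM tmp_shipped shp
  FULL OUTER JOIN tmp_returned rtn ON shp.product = rtn.product;

DROP TABLE tmp_shipped;
DROP TABLE tmp_returned;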

Method 3: Join subqueries

Like the previous method, this method performs all the work on the DBMS.  In this case, however, a single query does all the work.

Queries for each fact table are written, then joined together in the FROM clause of a master query.  For example:


SELECT
  COALESCE (shp.product, rtn.product) as Product,
  rtn.quantity_returned / shp.quantity_shipped as ReturnRate
FROM
  ( SELECT product, sum(quantity_shipped) as quantity_shipped
    FROM shipment_facts, product
    WHERE .....
    GROUP BY product
  ) shp
FULL OUTER JOIN
  ( SELECT product, sum(quantity_returned) as quantity_returned
    FROM return_facts, product
    WHERE ....
    GROUP BY product
  ) rtn
 ON
    shp.product = rtn.product

   
The two subqueries in the FROM clause represent phase 1.  Phase 2 is represented by the main SELECT query that joins them and computes the ratio.

Applying these techniques

These techniques may be applied in a variety of ways:
  1. Report developers may write their own queries using one or more of these methods
  2. You may have BI software that can automate drilling across using one or more of these methods
  3. Drilling across may be performed at ETL time using one of these methods (or an incremental variant)
In the latter case, the ETL process builds a new star (or cube) that contains the result of drilling across. This is called a derived schema, or second-line data mart.

Learn More

For more information, see the following resources:


Many pages are devoted to this topic in my books. In the latest one, Star Schema: The Complete Reference, the following chapters may be of interest:
  • Chapter 5, "Conformed Dimensions" discusses these techniques in greater detail.  
  • Chapter 14, "Derived Schemas" looks at special considerations when creating derived stars that pre-compute drill-across comparisons.
  • Chapter 16, "Design and Business Intelligence", discusses how to work with SQL-generating BI software.
More to come on this topic in the future.  If you have questions, send them in.

-Chris



Tuesday, November 15, 2011

Conformed Dimensions

This second post on conformed dimensions explores different ways in which dimensions can conform.

There are several flavors of conformed dimensions. Dimensions may be identical, or may share a subset of attributes that conform.

Conformance basics

Conformed dimensions are central to the discipline of dimensional modeling.  The basics of conformance were introduced in a post from earlier this year.  In a nutshell:


  • Measurements of discrete business processes are captured in individual star schemas (e.g. proposals and orders)
  • Some powerful business metrics combine information from multiple processes.  (e.g. fill rate: the ratio of orders to proposals)
  • We construct these metrics through a process called drilling across
  • Drilling across requires dimensions with the same structure and content (e.g. proposals and orders have a common customer dimension)

For a refresher on these concepts, see "Multiple stars and conformed dimensions" (8/15/2011). 

Physically shared dimensions not required 

When two fact tables share the same dimension table, their conformance is a given. Since the shared dimensions are the same table, we know they will support drilling across.

For example, stars for proposals and orders may share the customer dimension table.  This makes it possible to query proposals by customer and orders by customer, and then merge the results together.

But this process of drilling across does not require shared dimension tables. It works equally well if proposals and orders are in separate data marts in separate databases.

As long as the stars each include dimensions that share the same structure (e.g. a column called customer_name) and content (i.e. the customer values are the same), it will be possible to merge information from the stars. 

Levels of conformance 

It is easy to take this a step further.  We can also observe that there is compatibility between dimensions that are not identical.

If two dimensions share a subset of attributes with the same structure and content, those attributes form a sort of “lowest common denominator” across which we can compare data from the stars.

For example, suppose we establish budgets at the monthly level, and track spending at the daily level.  Clearly, days roll up to months.  If designed correctly, it should be possible to compare data from budget and spending stars by month.

The picture below illustrates the conformance of a MONTH and DAY table graphically.  The ring highlights the shared attributes; any of these can be used as the basis for comparing facts in associated fact tables.

In this case, the two conformed dimensions participate in a natural hierarchy.  Months summarize days. The month table is referred to as a “conformed roll-up” of day.

To successfully drill across, the content of the shared attributes must also be the same.  Instances of month names, for example, must be identical in each table -- "January" and "January" conform; "January" and "JAN." do not.

To guarantee conformance of content, the source of the rollup should be the base dimension table. This also simplifies the ETL process, since it need not reach back to the source a second time. 
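
For example, a month roll-up can be populated directly from the day dimension. The attribute names below are hypothetical, and in practice the ETL process would also assign a surrogate key (month_key) to each new row.

-- Source the conformed roll-up from the base dimension
INSERT INTO month (month_name, month_number, quarter, year_number)
SELECT DISTINCT month_name, month_number, quarter, year_number
FROM day;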

Other kinds of conformance 

Identical tables and conformed roll-ups are the most common kinds of conformed dimensions.  Other kinds are less common. 

Degenerate dimensions (dimensions that appear in a fact table) may also conform. This is particularly useful with transaction identifiers. 

Overlapping dimensions may share a subset of attributes, but not participate in a hierarchy. This is most common with geographical data. 

More Info 

For more information, see the following posts: 
I also write about conformed dimensions extensively in Star Schema, The Complete Reference.  
If you enjoy this blog, you can help support it by picking up a copy!

Photo by Agnes Periapse, licensed under Creative Commons 2.0

Wednesday, September 21, 2011

Avoid Surrogate Keys for Fact Tables


I am often asked for feedback on designs that include a "surrogate key for the fact table."  There are two common reasons for which they are proposed; both have better alternatives. 

Surrogate keys are for dimension tables

A surrogate key is an attribute that is created to uniquely identify rows in a dimension table. It does not come from a source system; it is created expressly for the dimensional schema.

Surrogate keys for dimension tables serve two important purposes:
  1. They make it easier to track history.  They allow the dimension to capture changes, even if the source system does not.  Absent a surrogate key, this would be difficult; the primary key of the dimension would be a concatenation of the natural key and the type 2 attributes.
  2. They make it easy to join to the dimension.  The dimensions' surrogate keys appear in fact tables as foreign keys.  They allow facts to be joined to the appropriate dimension rows without resorting to a multi-part key.
A fact table does not require a manufactured data element for either of these purposes.
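
A quick sketch may help make the distinction concrete; the table and column names are hypothetical. The dimension table carries a surrogate key of its own, while the fact table simply carries foreign keys that reference its dimensions.

CREATE TABLE customer (
  customer_key    INTEGER PRIMARY KEY,  -- surrogate key, generated for the star
  customer_id     VARCHAR(20),          -- natural key from the source system
  customer_name   VARCHAR(100),
  customer_region VARCHAR(50)           -- type 2 changes produce new rows and new keys
);

CREATE TABLE order_facts (
  day_key         INTEGER NOT NULL,     -- foreign key to the day dimension
  customer_key    INTEGER NOT NULL,     -- foreign key to the customer dimension
  product_key     INTEGER NOT NULL,     -- foreign key to the product dimension
  order_dollars   DECIMAL(12,2)
  -- no surrogate key of its own is required
);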

Tracking change history of facts?  Use a log.

When the value of a fact can change, a surrogate key for the fact table might be proposed. This would theoretically allow the fact table to record the change history of facts, in the same way that a dimension table does.

Unfortunately, storing the history of facts in this manner destroys the usability of the star. The additive properties of the facts are lost, and it becomes much harder to use.

If the facts can change, the fact table should be updated.

To track the history of facts, use an audit table as described in a previous post.  This table can log historic values, or can store the changes as "deltas" in a dimensional format.

Joining to other fact tables?  Drill across.

The other reason surrogate keys are proposed for fact tables is to make them "easier to join."

Joining fact tables to dimensions is easy; the fact table already contains foreign keys that reference the surrogate keys in dimension tables.  But what about joining to other fact tables?

Because they contain facts meant to be aggregated, fact tables should never be joined to one another. Otherwise, double-counting will ensue.

Facts from multiple stars should be combined by drilling across, as described in a previous post.  Don't try to merge them by joining fact tables.

If the proposed fact-to-fact join is not meant to fetch facts, it must be meant to fetch dimension values.  In that case, carry forward any useful dimensions to the other stars in the value chain.

More info

To learn more about surrogate keys, check out these posts:
For more on tracking the change history of facts, check out this post:
To learn about drilling across, read this post:
If you find this blog helpful, please consider picking up a copy of my book, Star Schema: The Complete Reference.


Monday, August 15, 2011

Multiple Stars and Conformed Dimensions

The concept of conformed dimensions is central to the discipline of dimensional modeling. This post introduces conformed dimensions; future posts will explore them in more detail. 

In your data warehouse, each star corresponds to a business process. Combining facts from different processes can produce powerful compound metrics. The key to making this work is a set of conformed dimensions.

Conformed dimensions are closely associated with Kimball's "Bus Architecture," but are crucial in any scenario that involves dimensional data.

Multiple Stars

In a dimensional design, each star collects measurements that describe a discrete business process.

If we have two measurements that describe different processes, we place them into separate fact tables.

For example, a sales data mart may contain multiple stars:
  • Proposal information by Salesperson, Prospect, Product, Proposal and Proposal Date
  • Order information by Salesperson, Customer, Product, Proposal, Contract and Order Date
  • Shipping information by Salesperson, Customer, Product, Proposal, Contract, Shipment, Shipper, and Shipment Date
  • Return information by Salesperson, Customer, Product, Contract, Reason and Return Date
In each of these stars, the fact table will record measurements that describe the processes of issuing proposals, taking orders, shipping product and handling returns.

By recording measurements of each process in a different star, we are able to capture information at the most detailed level possible.  We can study each of these processes, complete with attendant details, by accessing the appropriate star.

Cross-process Metrics

Some of the most powerful measurements actually combine information from multiple processes. These metrics require combining facts from different stars.

In the sales data mart, the ratio of orders to proposals is the "close rate," a powerful indicator that sales managers and executives look at on a regular basis.

Similarly, the ratio of returns to shipments is the "return rate," an essential quality control metric.

Drilling Across

When we compare facts from different stars, we don't simply join the fact tables.  To do so might cause double-counting of some facts.

Instead we follow a process that Kimball calls drilling across.  The drill-across process can be broken down into two phases.  In the first phase, each star is queried, and results are aggregated to a common level of detail.  In the second phase, these result sets are merged based on their common dimensions.

For example, to compute the return rate by product for August of 2011, we do the following:
  1. a. Figure out quantity shipped by product for August 2011
    b. Figure out quantity returned by product for August 2011
  2. Merge these amounts based on the common product names and compute the ratio
We may perform this drill-across operation at query time (many BI tools can do this automatically), or we may do it at ETL time, storing the results in a separate star or cube (sometimes called a second-line data mart).

Conformed Dimensions

The key to making all this work is the organization of the dimensions.  As you saw in the example above, we used the dimension values to link our results together: product names were used to merge together shipment and return quantities and compute their ratio.  This would not have worked if the product dimensions for Shipments and Returns had been different.

This is the basic idea behind conformed dimensions.  We say that two dimensions conform if they have the same structure and content.  Both our stars had a product dimension table with a product name attribute, and the product names were specified the same way in each.  Sharing a single physical table is one way to ensure conformance, but it is not required.

Two dimensions can also conform if one has a subset of the other's attributes.  As long as the common attributes have the same structure and content, they are said to conform.

Planning Conformance

By planning a set of conformed dimensions, we ensure that fact tables can be used to compare processes.  This is important within a single data mart, such as the one above, and it is also important when looking across multiple data marts.

Conformed dimensions are the organizing principle in Kimball's architecture.  Conformed dimensions are planned up-front, as a part of a project that establishes a data architecture based on dimensional design. Implementation proceeds once this conformance bus has been planned.

The concept is also important in other architectures. For example, the sales data mart discussed above might be part of a Corporate Information Factory architecture.  Within this data mart, conformance guarantees we can compare shipments to returns, proposals to orders, and so forth.

More to come

In the coming weeks, I will post more about conformed dimensions.  We will look at "levels" of conformance, how to document conformed dimensions, and how different tools work with conformed dimensions. If you have my book, I also encourage you to read Chapters 4 and 5, which look at these concepts in detail.

Photo by Agnes Periapse, licensed under Creative Commons 2.0

Wednesday, May 19, 2010

Kimball's Approach is Top-Down

Ralph Kimball's approach to data warehousing is frequently mis-characterized as being "bottom-up." This post aims to clear up that misconception.

Bus Architecture

Kimball's bus architecture (or dimensional data warehouse architecture) is an enterprise architecture.  At its core, a set of conformed dimensions ensures a consistent representation of standard terms and data elements across multiple subject areas.  The conformed dimensions describe important things like products, customers, locations, or anything of significance to the business.


The subject areas are called data marts.  They represent things like manufacturing, sales, invoicing, receivables and so forth.  Data marts don't need to be implemented all at once.  They can be implemented one at a time, as part of an incremental program. Data marts also don't need to be stored in a single database (although they may). When they are stored in different databases, the conformance bus ensures consistency and compatibility.

Top-Down

Kimball advocates planning a set of conformed dimensions as an up-front (i.e. strategic) activity.  The conformance bus then serves as the blueprint for a set of integrated data marts, which can be built on whatever schedule makes the most sense.

Kimball and Ross put it this way:
During the limited-duration architecture phase, the team designs a master suite of standardized dimensions and facts that have uniform interpretation across the enterprise...We then tackle the implementation of separate data marts in which each iteration closely adheres to the architecture.

- From The Data Warehouse Toolkit, Second Edition
by Ralph Kimball and Margy Ross (Wiley, 2002)
Because it begins with an enterprise-level framework, then delivers departmental functionality, this is a top-down approach. 

Bottom-Up

A bottom-up approach is one that moves in the opposite direction, beginning with a departmental focus and later evolving into one that has an enterprise focus. This occurs when organizations build stand-alone data marts, then later decide to integrate them.  

Stand-alone data marts are designed and built for departmental use, without an enterprise context. They are cheaper in the short-run, offering a fast path to quick results.  Stand-alone data marts also arrive due to mergers and acquisitions, or through packaged software. 

When there is more than one stand-alone data mart, however, they are likely to exhibit incompatibilities and inconsistencies.  They are sometimes labeled "stovepipes." Faced with these inconsistent data marts, some organizations resolve to retrofit them into a conformance framework.  This can be a difficult and expensive process, requiring extensive rework.

When stand-alone data marts are successfully brought into conformance, a bottom-up path has been followed--one that starts with a departmental solution and moves to enterprise capability.   Bottom-up development is cheaper in the short term but more expensive in the long term. 

While the end result may be compatible with Kimball's vision, clearly the route is not.  If this is news to you, you might want to check out his book. (The link appears beneath the quotation above.)  You can also consult posts on data warehouse architectures and common misconceptions.

-- Chris

Image:  PCI Slot by  Ryan_Franklin_az
Licensed under Creative Commons 2.0

Monday, June 11, 2007

Ten Things You Won't Learn from that Demo Schema

Many people learn about dimensional modeling by studying a sample star schema database that comes with a software product. These sample databases are useful learning tools—to a point. Here are 10 things you won't learn by studying that demo.

If you've learned everything you know about star schema by working with a sample database, you probably have a good intuitive grasp of star schema design principles. In a previous post, I provided a list of 10 terms and principles that most sample databases illustrate well.

But there are many important things about a dimensional data warehouse that are not revealed by the typical "Orders" or "Sales" demo. Here are the top 10 things you will not learn from that sample database.

  1. Multiple Fact Tables Most sample databases contain a single star—one fact table and its associated dimension tables. But it is rare to find a business process that can be modeled with a single fact table; it is impossible to find an enterprise that can. Most real-world designs will involve multiple fact tables, sharing a set of common dimensions.

    When facts become available at different times, or with different dimensionality, they almost always belong in separate fact tables. Modeling them in a single fact table can have negative consequences. Mixed grain issues may result, complicating the load and making reporting very difficult. For example, building reports focused on only one of the facts can result in a confusing preponderance of extra rows containing the value zero.

  2. Conformance With multiple fact tables, it is also important that each star be designed so that it works with others. A design principle called conformance helps ensure that as we build each new star, it works well with those that came before it. This avoids the dreaded stove-pipe. This principle allows a set of star schemas to be planned around a set of common dimensions and implemented incrementally.

  3. Drilling Across It's also important to understand how to properly build a report that accesses data from multiple stars. A single SQL select statement won't do the job. Double counting, or worse, can result. Instead, we follow a process called drilling across, where each star is queried individually. The two result sets are then combined based on their common attributes. These drill across reports are some of the most powerful in the data warehouse.

  4. Snapshot Fact Tables The fact table found in most demo stars is usually called a transaction fact table. But there are real world situations where other types of fact table designs are called for.

    A snapshot design is useful for capturing the result of a series of transactions at a point-in-time; for example, the balance of each account in the general ledger at the end of each day. This type of design introduces the concept of semi-additivity, which can be a problem for many ad hoc query tools. It makes no sense to add together yesterday's balance and today's balance. It is not uncommon to compute averages based on the data in a snapshot star. But one must be careful here; the SQL Average() function may not always be what you need.

  5. Factless Fact Tables Another type of fact table often contains no facts at all. Factless fact tables are useful in situations where there appears to be nothing to measure aside from the occurrence of an event, such as a customer contact. They also come in handy when we want to capture information about which there may be no event at all, such as eligibility.

    In addition to transaction, snapshot and factless designs, there are other types of fact table as well. It is not uncommon to need more than one, even when modeling a single activity.

  6. Roles and Aliasing Many business processes involve a dimension in multiple roles. For example, in accounting a transaction may include the employee who makes a purchase, as well as the employee who approves it. There is no need for separate "Purchaser" and "Approver" dimensions. A single "Employee" dimension will do the job. The fact table will have two foreign key references to the Employee dimension--one that represents the purchaser, and one that represents the approver. We use SQL "aliasing" when querying this schema in order to capture the two employee roles. (A sketch of such a query appears after this list.)

  7. Advanced Slow Change Techniques If you are lucky, you were able to learn about Type 1 and Type 2 Slowly Changing Dimension techniques from the demo schema. I described these in a previous post. Often, analytic requirements require more.

    A Type 3 change allows you to "have it both ways," analyzing all past and future transactions as if the change had occurred (retroactively) or not at all.

    There are also hybrid approaches, one of which tracks the "transaction-time" version of the changed data element as well as the "current-value" of the data element.

    And then there's the time-stamped dimension technique, also called a transaction dimension. In this version, each row receives an effective date/time and an expiration date/time. This provides Type 2 functionality, but also allows point-in-time analysis of the dimensional data.

  8. Bridge Tables Perhaps the most confusing technique for the novice dimensional designer is the use of bridge tables. These tables are used when the standard one-to-many relationship between dimension and fact does not apply. There are three situations where bridge tables come in handy:

    An attribute bridge resolves situations where a dimension attribute may repeat multiple times. For example, a dimension table called "Company" may include an attribute called "Industry." Some companies have more than one industry. Rather than flattening into "Industry 1," "Industry 2," and so on, an attribute bridge captures as many industries as needed.

    A dimension bridge resolves situations where an entire dimension may repeat with respect to facts. For example, there may be multiple salespeople involved in a sale. Instead of loading the fact table with multiple salesperson keys, a dimension bridge gracefully manages the group of salespeople.

    A hierarchy bridge resolves situations where a recursive hierarchy exists within a dimension. For example, companies own other companies. At times, users may want to roll up transactions that occur beneath a specific company, or vice versa. Instead of flattening the hierarchy, which imposes limitations and complicates analysis, a hierarchy bridge can be joined to the transaction data in various ways, allowing multiple forms of analysis.

    All bridge implementations have implications for usage, or report building. Improper use of a bridge can result in double counting or incorrect results. Bridges also make deployment of business intelligence tools more difficult.

  9. Derived Schemas Useful stars can also be derived from existing stars. Often called "second-line" solutions, these derived schemas can accelerate specific types of analysis with powerful results. The merged fact table combines stars to avoid drilling across. The sliced fact table partitions data based on a dimension value, useful in distributed collection and analysis. The pivoted fact table restructures row-wise data for columnar analysis and vice-versa. And set operation fact tables provide precomputed results for union, intersect and minus operations on existing stars.

  10. Aggregate Schemas One of the reasons the star schema has become so popular is that it provides strong query performance. Still, there are times when we want results to come back faster. Aggregate schemas partially summarize the data in a base schema, allowing the database to compute query results faster. Of course, designers need to identify aggregates that will provide the most help, and queries and reports will receive no benefit unless they actually use the aggregate. Aggregates are usually designed as separate tables, instead of providing multiple levels of summarization in a single star. This avoids double counting errors, and can allow the exploitation of an automated query rewrite mechanism so that applications do not need to be aware of the aggregates.
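
As promised in item 6, here is a sketch of the aliasing technique. The table and column names are hypothetical; the point is that a single Employee dimension is joined twice, once under each role.

SELECT
  purchaser.employee_name AS purchased_by,
  approver.employee_name  AS approved_by,
  SUM(purchase_facts.purchase_dollars) AS purchase_dollars
FROM purchase_facts
  JOIN employee AS purchaser ON purchaser.employee_key = purchase_facts.purchaser_key
  JOIN employee AS approver  ON approver.employee_key  = purchase_facts.approver_key
GROUP BY purchaser.employee_name, approver.employee_name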
I limited this list to 10 things. That's enough to make the point: a demo schema will only take you so far. When I teach Advanced Star Schema Design courses, I find that even people with many years of experience have questions, and always learn something new.

If you want to learn more, read the books recommended in the sidebar of this blog. Take a class on Advanced Star Schema Design. Interact with your peers at The Data Warehousing Institute conferences. And keep reading this blog. There will always be more to learn.

Related Posts:

Top 10 Things You Should Know About that Demo Schema

© 2007 Chris Adamson