Chris Adamson’s Blog: Top Ten Lists

Showing posts with label Top Ten Lists. Show all posts

Monday, June 11, 2007

Ten Things You Won't Learn from that Demo Schema

Many people learn about dimensional modeling by studying a sample star schema database that comes with a software product. These sample databases are useful learning tools—to a point. Here are 10 things you won't learn by studying that demo.

If you've learned everything you know about star schema by working with a sample database, you probably have a good intuitive grasp of star schema design principles. In a previous post, I provided a list of 10 terms and principles that most sample databases illustrate well.

But there are many important things about a dimensional data warehouse that are not revealed by the typical "Orders" or "Sales" demo. Here are the top 10 things you will not learn from that sample database.

Multiple Fact Tables Most sample databases contain a single star—one fact table and its associated dimension tables. But it is rare to find a business process that can be modeled with a single fact table; it is impossible to find an enterprise that can. Most real-world designs will involve multiple fact tables, sharing a set of common dimensions.

When facts become available at different times, or with different dimensionality, they almost always belong in separate fact tables. Modeling them in a single fact table can have negative consequences. Mixed grain issues may result, complicating the load and making reporting very difficult. For example, building reports focused on only one of the facts can result in a confusing preponderance of extra rows containing the value zero.

Conformance With multiple fact tables, it is also important that each star be designed so that it works with others. A design principle called conformance helps ensure that as we build each new star, it works well with those that came before it. This avoids the dreaded stove-pipe. This principle allows a set of star schemas to be planned around a set of common dimensions and implemented incrementally.

Drilling Across It's also important to understand how to properly build a report that accesses data from multiple stars. A single SQL select statement won't do the job. Double counting, or worse, can result. Instead, we follow a process called drilling across, where each star is queried individually. The two result sets are then combined based on their common attributes. These drill across reports are some of the most powerful in the data warehouse.

Snapshot Fact Tables The fact table found in most demo stars is usually called a transaction fact table. But there are real world situations where other types of fact table designs are called for.

A snapshot design is useful for capturing the result of a series of transactions at a point-in-time; for example, the balance of each account in the general ledger at the end of each day. This type of design introduces the concept of semi-additivity, which can be a problem for many ad hoc query tools. It makes no sense to add together yesterday's balance and today's balance. It is not uncommon to compute averages based on the data in a snapshot star. But one must be careful here; the SQL Average() function may not always be what you need.

Factless Fact Tables Another type of fact table often contains no facts at all. Factless fact tables are useful in situations where there appears to be nothing to measure aside from the occurrence of an event, such as a customer contact. They also come in handy when we want to capture information about which there may be no event at all, such as eligibility.

In addition to transaction, snapshot and factless designs, there are other types of fact table as well. It is not uncommon to need more than one, even when modeling a single activity.

Roles and Aliasing Many business processes involve a dimension in multiple roles. For example, in accounting a transaction may include the employee who makes a purchase, as well as the employee who approves it. There is no need for separate"Purchaser" and "Approver" dimensions. A single "Employee" dimension will do the job. The fact table will have two foreign key references to the Employee dimension--one that represents the purchaser, and one that represents the approver. We use SQL "aliasing" when querying this schema in order to capture the two employee roles.

Advanced Slow Change Techniques If you are lucky, you were able to learn about Type 1 and Type 2 Slowly Changing Dimension techniques from the demo schema. I described these in a previous post. Often, analytic requirements require more.

A Type 3 change allows you to "have it both ways," analyzing all past and future transactions as if the change had occurred (retroactively) or not all.

There are also hybrid approaches, one of which tracks the "transaction-time" version of the changed data element as well as the "current-value" of the data element.

And then there's the time-stamped dimension technique, also called a transaction dimension. In this version, each row receives an effective date/time and an expiration date time. This provides Type 2 functionality, but also allows point-in-time analysis of the dimensional data.

Bridge Tables Perhaps the most confusing technique for the novice dimensional designer is the use of bridge tables. These tables are used when the standard one-to-many relationship between dimension and fact does not apply. There are three situations where bridge tables come in handy:

An attribute bridge resolves situations where a dimension attribute may repeat multiple times. For example, a dimension table called "Company" may include an attribute called "Industry." Some companies have more than one industry. Rather than flattening into "Industry 1," "Industry 2," and so on, an attribute bridge captures as many industries as needed.

A dimension bridge resolves situations where an entire dimension may repeat with respect to facts. For example, there may be multiple salespeople involved in a sale. Instead of loading the fact table with multiple salesperson keys, a dimension bridge gracefully manages the group of salespeople.

A hierarchy bridge resolves situations where a recursive hierarchy exists within a dimension. For example, companies own other companies. At times, users may want to roll transactions up that occur beneath a specific company, or vice versa. Instead of flattening the hierarchy, which imposes limitations and complicates analysis, a hierarchy bridge can be joined to the transaction data in various ways, allowing multiple forms of analysis.

All bridge implementations have implications for usage, or report building. Improper use of a bridge can result in double counting or incorrect results. Bridges also make deployment of business intelligence tools more difficult.

Derived Schemas Useful stars can also be derived from existing stars. Often called "second-line" solutions, these derived schemas can accelerate specific types of analysis with powerful results. The merged fact table combines stars to avoid drilling across. The sliced fact table partitions data based on a dimension value, useful in distributed collection and analysis. The pivoted fact table restructures row-wise data for columnar analysis and vice-versa. And set operation fact tables provide precomputed results for union, intersect and minus operations on existing stars.

Aggregate Schemas One of the reasons the star schema has become so popular is that it provides strong query performance. Still, there are times when we want results to come back faster. Aggregate schemas partially summarize the data in a base schema, allowing the database to compute query results faster. Of course, designers need to identify aggregates that will provide the most help, and queries and reports will receive no benefit unless they actually use the aggregate. Aggregates are usually designed as separate tables, instead of providing multiple levels of summarization in a single star. This avoids double counting errors, and can allow the exploitation of an automated query rewrite mechanism so that applications do not need to be aware of the aggregates.

I limited this list to 10 things. That's enough to make the point: a demo schema will only take you so far. When I teach Advanced Star Schema Design courses, I find that even people with many years of experience have questions, and always learn something new.

If you want to learn more, read the books recommended in the sidebar of this blog. Take a class on Advanced Star Schema Design. Interact with your peers at The Data Warehousing Institute conferences. And keep reading this blog. There will always be more to learn.

Related Posts:

Top 10 Thinks You Should Know About that Demo Schema

© 2007 Chris Adamson

Tuesday, May 1, 2007

10 Things You Should Know About that Sample Star Schema

Posted by Chris Adamson

Today, many of us learn about the star schema by studying a sample database that comes with a software product—usually one that covers sales or orders. Here are 10 terms and principles of dimensional modeling to go with that sample schema you've worked with.

The star schema has become a de facto standard for the design of analytic databases. Sample stars are often included with RDBMS software, BI Tools and ETL tools. They are also used for tutorials and training. Almost universally, the sample schema describes a sales or order taking process, similar to the one depicted in the figure below:

Figure 1: A demo schema usually represents orders or sales.(Click to Enlarge)

You may have learned about the Star Schema by working with a sample like this one. If so, you probably have an intuitive grasp of star schema design principles. Here are ten terms and principles you should know that describe important features of the sample star.

Most of this is probably readily apparent if you've worked with a sample schema—what may be new is the terminology. The first two you probably know:

Facts are measurements that describe a business process. They are almost always numeric—but not all numeric attributes are facts. You can find facts (or measurements) in almost any analytic request—"Show me sales dollars by product" (sales dollars). "How many widgets were sold my John Smith in May?" (quantity ordered). There are some schemas that do not include facts—we'll look at those in another post.

Dimensions give facts context. They may be textual or numeric. They are used to specify how facts are "filtered" and "broken out" on reports. You can usually find dimensions after the words "by" or "for" in an analytic request. "Show me sales dollars by product" (product). "What are margin dollars by Month and Salesperson?" (month, sales rep).

Dimension tables are wide. Dimension tables usually group together a set of related dimension attributes, though there are situations where a dimension may include a set of attributes not related to one another. Dimension tables are not normalized, and usually have a lot of attributes—far more than appear in most sample schemas. This allows a rich set of detail to be used in analyzing facts. 100 or more columns is not uncommon for some dimensions. For this reason, we often call dimension tables wide.

Dimensions have Surrogate Keys. The primary key for each dimension table is an attribute specifically created for the dimensional schema. It is an integer assigned by the ETL process, and has no inherent meaning. It is not a reused key from a source system, such as a customer ID or product code. We call these attributes natural keys, and they may exist in the star, but do not serve as unique identifiers.

In the sample schema, customer_key is a surrogate key generated for the star schema; customer_id is a natural key carried over from a source system.

By assigning surrogate keys, we enable the star to handle changes to source data differently than the source system does. For example, in a source system a customer record may be overwritten, while we want the star schema to track changes. Performance considerations also come into play—a surrogate key avoids the need for multi-column joins.

Type 2 Changes track history. The term "Slowly Changing Dimension" (or SCD) describes how the data warehouse responds to changes in the source of dimensional data. There are several techniques that can be applied when the source of dimension detail changes. The most common is referred to as a "Type 2" change: an entirely new record is written to the dimension table.

For example, if a customer moves, the record may simply be updated in a source system. But in the star schema, we choose to add a new row to the customer dimension, complete with a new surrogate key. All prior facts remain associated with the "old" customer record; all future facts will be associated with the new record.

Type 1 Changes overwrite history. The Type 1 change is used when source data changes are not deemed significant, or may be the correction of an error. In such cases, we perform an update to an existing row in a dimension. For example, if a customer's gender is updated in the source, we may choose to update it in the corresponding dimension records. All prior facts are now associated with the changed value.

In addition to Type 1 and Type 2 changes, there are other SCD techniques. Hybrid approaches exist as well. Every design should identify which technique(s) will be used for each attribute of each dimension table.

Fact tables are narrow. A fact table row is usually entirely composed of numeric attributes: the facts, and foreign key references to the dimensions. Because of these characteristics, each fact table row is narrow, at least in contrast with wide dimension rows full of textual values. The narrowness of fact tables is important, because they will accumulate far more rows than dimension tables, and at a much faster rate.

Fact tables are usually sparse. Rows are recorded in the fact table only when there is something to measure. For example, not every customer orders every product from every salesperson each day. Rows are only recorded when there is an order. This helps manage the growth of the fact table. It also saves us from having to filter out a huge number of rows that have no sales dollars when displaying results in a report. (Usually, you don't want a customer sales report to list every product—only the ones they bought. You can use an outer join when you want the latter.)

Fact Table Grain The level of detail represented by a row in a fact table is referred to as its grain. Facts that are recorded with different levels of detail belong in separate fact tables. This avoids an array of reporting difficulties, as well as kludges such as including special rows in dimension tables for "not applicable." Determining the grain of a fact table is an important design step and helps avoid future confusion. (There are times when "not applicable" attributes are necessary, but they are most often a sign of the need for another fact table.)

In the example, the grain is sales by customer, product, salesperson and date. A better design might capture sales at the order line level of detail.

Additivity. Facts are usually additive. This means they can be summed across any dimension value. For example, order_dollars can be aggregated across customers, products, salespeople, or time periods, producing meaningful results. Additive facts are stored in the fact table. We also store additive facts that might be computed from other facts. (order_dollars might be the sum of extended_cost and margin_dollars, but why include only two out of the three.?

Some facts are non-additive. For example, margin rate is a percentage. Two sales at 50% margin do not equate to a single sale at 100% margin—this fact is not additive. In the star, we store the fully additive components of margin (order_dollars and margin_dollars) and let front end tools compute the ratio. There are also semi-additive facts, which we will look at in the next post.

Most of these terms and principles can be learned by working with a sample schema. But there are many important principles that the typical "Sales" model does not reveal. In a future post, I'll look at the top 10 things the demo schema does not teach you.

Related Posts:

Ten Things You Won't Learn from that Demo Schema

Classes

Chris is scheduled to present at the following events. Course enrollment is open to the general public.

All these courses are also available on site (see below).

August 18, 2019
San Diego, CA
Data Modeling in the Age of Big Data
Registration: TDWI San Diego
August 20, 2019
San Diego, CA
Data Architecture: Managing Information in the Age of Big Data
Registration: TDWI San Diego
August 20, 2019
San Diego, CA
Workshop: Building the Business Case for Advanced Analytics
Registration: TDWI San Diego Strategy Summit
Monday October 21, 2019
San Francisco, CA
TDWI Dimensional Data Modeling Primer: From Requirements to Business Analysis
Registration: TDWI Seminars
Tuesday October 22, 2019
San Francisco, CA
Advanced Dimensional Modeling: Techniques for Practitioners
Registration: TDWI Seminars
Wednesday October 23, 2019
San Francisco, CA
Dimensional Models: What’s New in the Big Data Era
Registration: TDWI Seminars
November 12, 2019
Orlando, FL
Data Architecture: Managing Information in the Age of Big Data
Registration: TDWI Orlando
November 12, 2019
Orlando, FL
The Dimensional Model Refactored: New Techniques for the 21st Century
Registration: TDWI Orlando
November 15, 2019
Orlando, FL
Advanced Dimensional Modeling: Complete Tour of Modern Best Practices
Registration: TDWI Orlando

Onsite Education

You can bring Chris to your team for interactive education.

Dimensional Modeling

Chris provides full-day and expanded two-day courses covering the dimensional design concepts from Star Schema: The Complete Reference.
TDWI Courses

Chris teaches select TDWI courses that cover topics like data BI fundamentals, performance management, business analytics, dashboards and scorecards, and more.

All of Chris's education offerings are provided through TDWI.

For information on onsite offerings, contact TDWI Onsite Education. or Oakton Software