Showing posts with label Grain.

Wednesday, September 25, 2013

Optimizing warehouse data for business analytics

Business analytics often integrate information from your data warehouse with other sources of data. This post looks at the best practices of warehouse design that make this possible.

I receive a lot of questions regarding the best way to structure warehouse data to support an analytics program. The answer is simple: follow the same best practices you've already learned.

I'll cover these practices from a dimensional modeling perspective. Keep in mind that they apply in any data warehouse, including those modeled in third normal form.


1. Store Granular Facts

Analytic modelers often choose sources external to the data warehouse, even when the warehouse seems to contain relevant data. The number one reason for this is insufficient detail. The warehouse contains summarized data; the analytic model requires detail.

In this situation, the analytic modeler has no choice but to look elsewhere. Worse, she may be forced to build redundant processes to transform source data and compile history. Luckily, this is not a failure of warehouse design principles; it's a failure to follow standard best practices.

Best practices of dimensional design dictate that we set the grain of base fact tables at the lowest level of detail possible. Need a daily summary of sales? Store the individual order lines. Asked to track the cost of trips? Store detail about each leg.

Dimensional solutions can contain summarized data. This takes the form of cubes, aggregates, or derived schemas. But these summaries should be derived exclusively from detailed data that also lives in the warehouse.
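As a concrete sketch of this practice, the fact table below is stored at order-line grain, and the daily summary is derived from it rather than stored in its place. All table and column names here (sales_facts, daily_sales_summary, day_key, and so on) are illustrative, not taken from any particular system:

  -- Base fact table at the lowest available grain: one row per order line
  CREATE TABLE sales_facts (
      day_key        INTEGER NOT NULL,    -- reference to the day dimension
      product_key    INTEGER NOT NULL,    -- reference to the product dimension
      customer_key   INTEGER NOT NULL,    -- reference to the customer dimension
      order_id       VARCHAR(20),         -- transaction identifier
      order_line_num INTEGER,
      quantity       INTEGER,
      order_dollars  DECIMAL(12,2)
  );

  -- The daily summary is derived from the detail, not stored in its place
  CREATE VIEW daily_sales_summary AS
  SELECT day_key,
         product_key,
         SUM(order_dollars) AS total_order_dollars,
         SUM(quantity)      AS total_quantity
  FROM   sales_facts
  GROUP  BY day_key, product_key;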

Like all rules, this rule has exceptions. There are times when the cost/benefit calculus is such that it doesn't make sense to house highly granular data indefinitely. But more often than not, summary data is stored simply because basic best practices were not followed.

2. Build “Wide” Dimensions

The more attributes there are in your reference data (aka dimensions), the more useful source material there is for analytic discovery. So build dimensions that are full of attributes,  as many as you can find.

If the grain of your fact table gives the analytics team “observations” to work on, the dimensions give them “variables.”  And the more variables there are, the better the odds of finding useful associations, correlations, or influences.

Luckily, this too is already a best practice. Unfortunately, it is one that is often misunderstood and violated. Misguided modelers frequently break things down into the essential pieces only, or model just to specific requirements.  
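Here is a rough sketch of what a "wide" dimension looks like in practice. The column list is purely illustrative; the point is to carry every descriptive attribute the source systems can supply, not just the ones today's reports require:

  CREATE TABLE product_dim (
      product_key     INTEGER PRIMARY KEY,  -- surrogate key
      product_id      VARCHAR(20),          -- natural key from the source system
      product_name    VARCHAR(100),
      brand           VARCHAR(50),
      category        VARCHAR(50),
      subcategory     VARCHAR(50),
      manufacturer    VARCHAR(100),
      supplier        VARCHAR(100),
      color           VARCHAR(30),
      product_size    VARCHAR(30),
      unit_of_measure VARCHAR(20),
      package_type    VARCHAR(30),
      launch_date     DATE
      -- ...plus any other descriptive attribute the source systems can provide
  );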

3. Track Changes to Reference Data (and Use Effective Dating)

When reference data changes, too many dimensional models default to updating corresponding dimensions, because it is easier.

For example, suppose your company re-brands a product. It's still the same product, but with a new name. You may be tempted to simply update the reference data in your data warehouse. This is easier than tracking changes.  It may even seem to make business sense, because 90% of your reports require this-year-versus-last comparison by product name.

Unfortunately, some very important analysis may require understanding how consumer behavior correlates with the product name. You've lost this in your data set. Best practices help avoid these problems.

Dimensional models should track the change history of reference data. In dimensional speak, this means application of  type 2 slow changes as a rule. This preserves the historic context of every fact recorded in the fact table.

In addition, every row in a dimension table should carry "effective" and "expiration" dates, as well as a flag that identifies the current row. This enables the delivery of type 1 behavior (the current value) even as we store type 2 history. From an analytic perspective, it also enables useful "what if" analysis.
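Continuing the hypothetical product dimension sketched above, the type 2 housekeeping columns and the queries they enable might look like this (all names are illustrative):

  -- Housekeeping columns added to each row of the dimension:
  --   effective_date   DATE     when this version of the row became active
  --   expiration_date  DATE     when it was superseded (e.g. 9999-12-31 while current)
  --   current_flag     CHAR(1)  'Y' for the current version of each product

  -- Type 1 behavior (current values only), delivered from type 2 data:
  SELECT product_id, product_name
  FROM   product_dim
  WHERE  current_flag = 'Y';

  -- Historic context: the version of each product in effect on a given date
  SELECT product_id, product_name
  FROM   product_dim
  WHERE  DATE '2013-06-30' BETWEEN effective_date AND expiration_date;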

As with all rules, again there are exceptions. In some cases, there may be good reason not to respond to changes in reference data by tracking history. But more often than not, type 1 responses are chosen for the wrong reason: because they are easier to implement.

4. Record Identifying Information, Including Alternate Identifiers

Good dimensional models allow us to trace back to the original source data. To do this, include transaction identifiers (real or manufactured) in fact tables, and maintain identifiers from source systems in dimension tables (these are called "natural keys").

Some of this is just plain necessary in order to get a dimensional schema loaded. For example, if we are tracking changes to a product name in a dimension, we may have multiple rows for a given product. The product's identifier no longer uniquely identifies a row, but we must still have access to it; without it, we could not match incoming facts to the correct dimension row when loading the fact table.

Identifying information is also essential for business analytics. Data from the warehouse is likely to be combined with data that comes from other places. These identifiers are the connectors that allow analytic modelers to do this.  Without them, it may become necessary to bypass the warehouse.

Your analytic efforts, however, may require blending new data with your enterprise data. And that new data may not come with handy identifiers. You have a better chance blending it with enterprise data if your warehouse also includes alternate identifiers, which can be used to do matching. Include things like phone numbers, email addresses, geographic coordinates—anything that will give the analytics effort a fighting chance of linking up data sources.
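As an illustrative sketch, here is how an alternate identifier can be used to match externally acquired data to warehouse customers. The table external_web_leads and all column names are hypothetical:

  -- customer_dim carries the surrogate key, the natural key (customer_id),
  -- and alternate identifiers such as email_address and phone_number.

  -- Matching externally acquired data to warehouse customers via an alternate identifier:
  SELECT c.customer_key,
         c.customer_id,
         w.campaign_name,
         w.click_count
  FROM   external_web_leads w
  JOIN   customer_dim       c
    ON   LOWER(w.email) = LOWER(c.email_address)
  WHERE  c.current_flag = 'Y';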

Summary

If you've been following the best practices of dimensional modeling, you've produced an asset that maximizes value for analytic modelers:

  • You have granular, detailed event data.
  • You have rich, detailed reference data.
  • You are tracking and time-stamping changes to reference data. 
  • You've got transaction identifiers, business keys, and alternate identifiers.  

It also goes without saying that conformed dimensions are crucial if you hope to sustain a program of business analytics.

Of course, there are other considerations that may cause an analytic modeler to turn her back on the data warehouse. Latency issues, for example, may steer her to operational solutions. Accessibility and procedural issues, too, may get in the way of the analytic process.

But from a database design perspective, the message is simple: follow those best practices!

Further Reading

You can also read more in my book, Star Schema: The Complete Reference. If you use the links on this page to pick up a copy on Amazon, you will be helping support this blog. It covers the best practices of dimensional design in depth. For example:

  • Grain, identifiers, keys and basic slow change techniques are covered in Chapter 3, "Stars and Cubes"
  • The place of summary data is covered in Chapter 14, "Derived Schemas" and Chapter 15, "Aggregates"
  • Conformance is covered in Chapter 5, "Conformed Dimensions"
  • Advanced slow change techniques are explored in Chapter 8, "More Slow Change Techniques"

Monday, July 16, 2012

Q&A: Does Master-Detail Require Multiple Stars in a Travel Data Mart?

A reader asks whether master-detail relationships call for multiple fact tables.

Q: Is there some general principle to use as a guide when deciding to use 1 star or 2 for master-detail situations?

We have a master-detail situation involving trips, with multiple trip expenses per trip. Many of our queries are at the trip level -- e.g., how many trips are for clients in their 30s going to Arizona in July? Many of the queries are at the detail level -- how much is spent on meals by the client's age range?

Both levels of query can be answered with one star, with the fact at the detail level (trip expense line). For trip-level questions, you can count distinct occurrences of the natural key in the trip dimension. But it seems inefficient to grind through all those trip expense rows when I just want to count trips.

So do we have 2 fact tables, one at the trip level, one at the trip expense level? Do the dimensions surround both of them?  Is there a rule?


- Ellen, Ottawa

A: As the reader observes, this is a common situation: when there is a master-detail relationship, is there a need for multiple fact tables?

This case calls for at least two fact tables. Read on for the details.


Multiple Fact Tables

After learning the basics of dimensional modeling, the first real world challenge we face is understanding when and how to design multiple fact tables.  Until we learn to think in dimensional terms, the choice can be difficult.

I suggest starting with some basic guidelines.  You probably need different fact tables if:
  1. You have measurements with different periodicity
  2. You have measurements with different levels of detail
The first guideline suggests that if facts do not describe the same event, they probably belong in different fact tables. For example, orders and shipments do not always happen at the same time.  Order dollars and shipment dollars belong in separate fact tables.

The second guideline pertains to facts that do describe the same events.  Information about an order and information about an order line are ostensibly available at the same time, but they have different levels of detail. If there are facts at both of these levels, there will need to be multiple fact tables.
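As an illustrative sketch of the second guideline (all names here are hypothetical): a freight charge exists only at the order level, while quantities and dollars exist at the order-line level, so the facts land in two fact tables with different grains.

  -- One row per order: facts that exist only at the order level
  CREATE TABLE order_facts (
      order_date_key INTEGER NOT NULL,
      customer_key   INTEGER NOT NULL,
      order_id       VARCHAR(20),
      freight_charge DECIMAL(12,2)
  );

  -- One row per order line: facts available at the lower level of detail
  CREATE TABLE order_line_facts (
      order_date_key INTEGER NOT NULL,
      customer_key   INTEGER NOT NULL,
      product_key    INTEGER NOT NULL,
      order_id       VARCHAR(20),
      order_line_num INTEGER,
      quantity       INTEGER,
      order_dollars  DECIMAL(12,2)
  );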

So how would you apply these guidelines?

Think Dimensionally

It is easiest to work these questions out by thinking dimensionally. Forget about master-detail or parent-child for a few minutes, and consider what is being measured. 
  • What are the business questions being asked, and what business metrics do they contain?  
  • Are these metrics available simultaneously?
  • At what level of detail are these metrics available?
The reader cited business questions in her example.  These questions reveal at least two metrics:  number of trips, and expense dollars.

These metrics may be available on different schedules -- for example, a trip may commence before we have the details of all the expenses incurred.  This would argue for multiple fact tables.

But let's suppose that the single source of data is expense reports. We do not have any information about the trip until it is over, at which point we have all the expense details.

In this case, let's think about the level of detail of these metrics. Number of trips and expense dollars seem to share a lot of common detail -- the traveller, the destination, the trip start and end date, and so forth.

But expense dollars have some additional detail -- the date of the expense item, and an expense category. Since these facts have different levels of detail, it makes sense to include them in separate fact tables.

Factless Fact Table

The trip-level star may contain a factless fact table.  It contains one row per trip, with no explicit facts.  We can determine the number of trips that answer a business question simply by counting rows.

Many teams would prefer to add a fact called number_of_trips and always populate it with the value 1.  This is useful, because it makes the business metric explicit in your schema design.  (I've written about this technique before.)

The trip-level star may also contain some summary level facts that describe expenses -- say the total trip cost, the total transportation cost, total meals cost, etc. More detail on these metrics (such as the vendor, payment method, etc.) can be found in the expense item star.
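A rough sketch of this trip-level star follows. The names trip_facts and number_of_trips echo the discussion above; every other table and column name is illustrative:

  CREATE TABLE trip_facts (
      trip_start_day_key INTEGER NOT NULL,
      trip_end_day_key   INTEGER NOT NULL,
      traveller_key      INTEGER NOT NULL,
      trip_id            VARCHAR(20),        -- trip identifier
      number_of_trips    INTEGER DEFAULT 1,  -- constant 1; makes the metric explicit
      total_trip_cost    DECIMAL(12,2),      -- optional summary facts
      total_meals_cost   DECIMAL(12,2)
  );

  -- "How many trips did clients in their 30s start in July?"
  SELECT SUM(f.number_of_trips)
  FROM   trip_facts    f
  JOIN   traveller_dim t ON t.traveller_key = f.traveller_key
  JOIN   day_dim       d ON d.day_key       = f.trip_start_day_key
  WHERE  t.age_range  = '30-39'
    AND  d.month_name = 'July';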

Deeper into Trip Analysis

Digging deeper, the reader may discover she needs more than two fact tables.  

Trips often include more than one destination.  When it is necessary to study the various segments (or "legs") of a trip, we can define a third fact table that contains one row per segment.

In this case, the reader would have three fact tables:
  1. trip_facts  Grain: one row per trip.  Metrics: Number of trips (always one)
  2. trip_segment_facts  Grain: one row per destination city.  Metrics: number of segments (always one)
  3. trip_expense_facts  Grain: one row per expense item.  Metrics: expense_dollars
Each of these stars will share some common dimensions -- the trip origination date, the trip end date, the traveller, the trip identifier, etc.

Trip_segment_facts will also include destination-specific dimensions: the destination city of the segment, the arrival and departure dates for the segment, etc.

Trip_expense_facts will include dimensions that describe the individual expense items: the date paid, the payment method, the payee, the expense category and subcategory, and so forth.
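With these grains in place, the reader's detail-level question ("how much is spent on meals by age range?") is answered from the expense star. A sketch, using illustrative table and column names:

  SELECT t.age_range,
         SUM(f.expense_dollars) AS meal_dollars
  FROM   trip_expense_facts   f
  JOIN   traveller_dim        t ON t.traveller_key        = f.traveller_key
  JOIN   expense_category_dim c ON c.expense_category_key = f.expense_category_key
  WHERE  c.category_name = 'Meals'
  GROUP  BY t.age_range;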

A conformance matrix will help you keep track of the dimensionality of each star.

Learn More

Thanks to Ellen for the thoughtful question.  If you have a question of your own, please send it in.

Pick up a copy of Star Schema: The Complete Reference and you will be helping support this blog. 

  • Chapter 4 is dedicated to designing multiple fact tables.  
  • Chapter 5 looks closely at conformed dimensions. 
  • Chapter 12 covers factless fact tables.
You can learn more about these topics by referring to these posts:
  • Factless Fact Tables (September 15, 2011) describes factless fact tables that record events, and the use of a fact with the constant value of 1.
  • The Conformance Matrix (June 5, 2012) describes how to represent the dimensionality of your stars in a matrix format.

The photo in this post is by o5com, licensed by Creative Commons


Friday, January 7, 2011

Build High-Resolution Stars

On TV, a low-resolution photo is never a problem for crime scene investigators. Load it into a computer, zoom in on a reflection, click "enhance", and presto! the criminal is revealed.

In the world of data warehousing, we cannot rely on the CSI effect. If we start with low-resolution data, the detail is lost forever. Eventually, someone will ask a question that requires more detail. And it simply won't be there to answer the question.

A few simple guidelines will help you avoid this unfortunate situation.

A Common Problem

Omission of detail is one of the biggest frustrations caused by legacy designs. It is the number three problem I encounter in design reviews (after failure to use surrogate keys and failure to plan for slowly changing dimensions).

The reason for this is simple: people often design their schema to support current requirements. Unfortunately, this does not work in data warehousing. Analytic requirements constantly change. As a designer, your job is to produce solutions that will answer questions that are not yet known.

This may sound like an impossible task, but it's not. It's one of the strong suits of dimensional design. Three guidelines for high-resolution design will help future-proof your solution.

1. Match Grain to Source Data, Not Requirements

When you're designing a fact table, you need to establish its grain. A statement of grain defines what is represented by a row in a fact table. This is the "resolution" of your measurements. (For a refresher on grain, see the post Rule 1: State Your Grain.)

Don't set grain at the level of detail that meets requirements. Set it at the level of detail at which data is available.

For example, suppose you are designing a fact table that measures sales. Someone asks for daily totals of sales in dollars. Don't assume that daily totals are sufficient. Eventually someone will want to go deeper, perhaps looking for products that are bought together. This requires setting your grain at the order line level of detail.

Once you have identified the process a fact table represents, look at the resolution of the source data. That should guide your decisions on grain. Of course, this needs to be kept within reason. If the available level of detail would result in fact tables that are too large or expensive to manage, you might need to do some summarization. But be careful: summarize now and you will be stuck working with low-resolution data in the future.
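For instance, the "products bought together" question above can only be answered if the grain is the order line. Here is a sketch of such a query, using hypothetical names (sales_facts, order_id, product_key):

  -- Pairs of products appearing on the same order, counted across all orders
  SELECT a.product_key AS product_a,
         b.product_key AS product_b,
         COUNT(*)      AS times_bought_together
  FROM   sales_facts a
  JOIN   sales_facts b
    ON   a.order_id    = b.order_id
   AND   a.product_key < b.product_key   -- avoids duplicate and self pairs
  GROUP  BY a.product_key, b.product_key
  ORDER  BY times_bought_together DESC;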

2. Capture All the Facts that Fit

Remember that each fact table describes a process. When you fill in the facts, don't just include ones that people asked for. Include all available facts that describe that process.

Returning to our sales data, this means you should not stop at sales dollars. What else is known about sales? Quantities, tax, and other information are likely to be readily available, and many operational systems may also offer a rudimentary concept of cost or margin. Include them!

3. Fill Out Dimensions

The last guideline for a high-resolution design involves the dimensions. You can probably guess what it is:

When you design a dimension table, don't just include attributes someone asked for. Find what is available in the source system, and include it all.

A simple "Product Name" and "Product ID" might be all that's needed to support sales by product. But what else do source tables hold that describes your products? Manufactures? Suppliers? Colors? Sizes? Weights? Include all this in your designs.

Remember, dimensions are the source of all context in our reports. The more we fill them out, the more kinds of questions people will be able to ask.

Initial Design is the Right Time

Match your grain, facts and dimensions to the available data, even if that goes beyond the requirements.

Of course, you can recover from errors in any of these areas. This will require modifying your design, ETL process, and so forth. When it's decided that additional detail is needed going forward, you can make these changes.

But it is often difficult to add detail to historic data already in the data warehouse. For example, it may require accessing backup data. Worse, it may put ETL developers through the difficult process of performing slow change processing in reverse.

Think high-resolution today, and you can avoid these problems tomorrow.

Image Credit: Alan Cleaver via Creative Commons 2.0

Wednesday, December 9, 2009

Rule 1: State Your Grain

Make sure you have a statement of grain for each fact table or cube in your dimensional design.

I receive lots of questions from people who are working through an issue with a star or cube. Most of the time, I must counter with a question of my own: “What is the grain?” Without this basic information, it is usually impossible to comment on whatever design issue the person is facing.

Being able to state grain is important, and not just for its value as a conversation starter. Here’s a brief look at what grain is, how to define it, and what happens if you don’t.

What is Grain

A statement of grain identifies what is represented by a single, granular row in a fact table (or an un-aggregated measure in a cube.) Each and every row in the fact table should meet this definition, with no wiggle-room.

When grain is not explicitly defined, or is defined in an ambiguous way, all manner of problems may arise. In addition to hampering your ability to talk about the design with someone else, ill-defined grain can cause severe technical challenges in the reporting process. It may even lead to reports that are just plain wrong.

Defining Grain

There are two ways to state the grain of a fact table. The first way is to use business language, referencing a specific, well-understood artifact of the activity described by the star. For example, a star that describes orders may have the following grain: “Order measurements at the order-line level of detail.” That sums things up pretty well. Each row of the fact table corresponds to a single order line.

Sometimes it is not easy to state grain in this manner. When that’s the case, you use the dimensions in a star to indicate what each unique row in the fact table represents. For example, a star that tracks the processing milestones of mortgage applications might have the following grain: “Processing measurements by application and status.” This fact table will have a new row each time an application (one dimension) undergoes a change in status (another dimension.)

When using dimensional terms to define grain, do not simply rattle off all the dimensions present in the star. Instead, list only those that are necessary to define a unique row. In the mortgage status change example, it is not necessary to mention that the star will also contain dimensions for date, customer, mortgage officer, mortgage product, and so forth.

Business definitions of grain are usually used for transaction fact tables.  Dimensional grain definitions are commonly used for snapshots, accumulating snapshots, derived schemas and aggregates. 
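One practical way to keep a dimensionally stated grain unambiguous is to document (or even enforce) it in the schema itself. A sketch using the mortgage example above, with hypothetical table and column names:

  CREATE TABLE application_status_facts (
      application_key      INTEGER NOT NULL,  -- the application dimension
      status_key           INTEGER NOT NULL,  -- the status dimension
      status_date_key      INTEGER NOT NULL,
      officer_key          INTEGER NOT NULL,
      days_in_prior_status INTEGER,
      -- Grain: one row per application per status, mirroring the statement above.
      -- Only the dimensions needed to define a unique row appear in the constraint.
      CONSTRAINT application_status_grain UNIQUE (application_key, status_key)
  );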

Fuzzy Grain

Poorly defined grain can lead to trouble. Ill-defined grain can mask a situation where a fact table is actually being used to track two or more processes. It can also lead to situations in which the fact table contains two or more levels of aggregation. In the former case, single-process reporting will be hampered. BI developers will be bending over backwards to focus on the relevant subset of data.  In the latter case, double-counting, triple-counting or worse is possible.

This is why “state your grain” is rule #1 in any dimensional design.

-Chris


Image by GravityX9 licensed under Creative Commons 2.0

Tuesday, May 1, 2007

10 Things You Should Know About that Sample Star Schema

Today, many of us learn about the star schema by studying a sample database that comes with a software product, usually one that covers sales or orders. Here are 10 terms and principles of dimensional modeling to go with that sample schema you've worked with.

The star schema has become a de facto standard for the design of analytic databases. Sample stars are often included with RDBMS software, BI Tools and ETL tools. They are also used for tutorials and training. Almost universally, the sample schema describes a sales or order taking process, similar to the one depicted in the figure below:

Figure 1: A demo schema usually represents orders or sales.

You may have learned about the Star Schema by working with a sample like this one. If so, you probably have an intuitive grasp of star schema design principles. Here are ten terms and principles you should know that describe important features of the sample star.

Most of this is probably readily apparent if you've worked with a sample schema—what may be new is the terminology. The first two you probably know:

  1. Facts are measurements that describe a business process. They are almost always numeric—but not all numeric attributes are facts. You can find facts (or measurements) in almost any analytic request—"Show me sales dollars by product" (sales dollars). "How many widgets were sold by John Smith in May?" (quantity ordered). There are some schemas that do not include facts—we'll look at those in another post.

  2. Dimensions give facts context. They may be textual or numeric. They are used to specify how facts are "filtered" and "broken out" on reports. You can usually find dimensions after the words "by" or "for" in an analytic request. "Show me sales dollars by product" (product). "What are margin dollars by Month and Salesperson?" (month, sales rep).

  3. Dimension tables are wide. Dimension tables usually group together a set of related dimension attributes, though there are situations where a dimension may include a set of attributes not related to one another. Dimension tables are not normalized, and usually have a lot of attributes—far more than appear in most sample schemas. This allows a rich set of detail to be used in analyzing facts. 100 or more columns is not uncommon for some dimensions. For this reason, we often call dimension tables wide.

  4. Dimensions have Surrogate Keys. The primary key for each dimension table is an attribute specifically created for the dimensional schema. It is an integer assigned by the ETL process, and has no inherent meaning. It is not a reused key from a source system, such as a customer ID or product code. We call those source system keys natural keys; they may exist in the star, but they do not serve as unique identifiers. (A sketch following this list illustrates the distinction.)

    In the sample schema, customer_key is a surrogate key generated for the star schema; customer_id is a natural key carried over from a source system.

    By assigning surrogate keys, we enable the star to handle changes to source data differently than the source system does. For example, in a source system a customer record may be overwritten, while we want the star schema to track changes. Performance considerations also come into play—a surrogate key avoids the need for multi-column joins.

  5. Type 2 Changes track history. The term "Slowly Changing Dimension" (or SCD) describes how the data warehouse responds to changes in the source of dimensional data. There are several techniques that can be applied when the source of dimension detail changes. The most common is referred to as a "Type 2" change: an entirely new record is written to the dimension table.

    For example, if a customer moves, the record may simply be updated in a source system. But in the star schema, we choose to add a new row to the customer dimension, complete with a new surrogate key. All prior facts remain associated with the "old" customer record; all future facts will be associated with the new record.

  6. Type 1 Changes overwrite history. The Type 1 change is used when source data changes are not deemed significant, or may be the correction of an error. In such cases, we perform an update to an existing row in a dimension. For example, if a customer's gender is updated in the source, we may choose to update it in the corresponding dimension records. All prior facts are now associated with the changed value.

    In addition to Type 1 and Type 2 changes, there are other SCD techniques. Hybrid approaches exist as well. Every design should identify which technique(s) will be used for each attribute of each dimension table.

  7. Fact tables are narrow. A fact table row is usually entirely composed of numeric attributes: the facts, and foreign key references to the dimensions. Because of these characteristics, each fact table row is narrow, at least in contrast with wide dimension rows full of textual values. The narrowness of fact tables is important, because they will accumulate far more rows than dimension tables, and at a much faster rate.

  8. Fact tables are usually sparse. Rows are recorded in the fact table only when there is something to measure. For example, not every customer orders every product from every salesperson each day. Rows are only recorded when there is an order. This helps manage the growth of the fact table. It also saves us from having to filter out a huge number of rows that have no sales dollars when displaying results in a report. (Usually, you don't want a customer sales report to list every product—only the ones they bought. You can use an outer join on the occasions when you do want every product listed.)

  9. Fact Table Grain The level of detail represented by a row in a fact table is referred to as its grain. Facts that are recorded with different levels of detail belong in separate fact tables. This avoids an array of reporting difficulties, as well as kludges such as including special rows in dimension tables for "not applicable." Determining the grain of a fact table is an important design step and helps avoid future confusion. (There are times when "not applicable" attributes are necessary, but they are most often a sign of the need for another fact table.)

    In the example, the grain is sales by customer, product, salesperson and date. A better design might capture sales at the order line level of detail.

  10. Additivity. Facts are usually additive. This means they can be summed across any dimension value. For example, order_dollars can be aggregated across customers, products, salespeople, or time periods, producing meaningful results. Additive facts are stored in the fact table. We also store additive facts that might be computed from other facts. (order_dollars might be the sum of extended_cost and margin_dollars, but why include only two of the three?)

    Some facts are non-additive. For example, margin rate is a percentage. Two sales at 50% margin do not equate to a single sale at 100% margin—this fact is not additive. In the star, we store the fully additive components of margin (order_dollars and margin_dollars) and let front end tools compute the ratio. There are also semi-additive facts, which we will look at in the next post.
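To tie several of these points together (surrogate keys and natural keys from item 4, type 2 changes from item 5, and additivity from item 10), here is a sketch. The table and column names echo the examples above, but the specific DDL, data, and the sales_facts table it joins to are illustrative:

  CREATE TABLE customer_dim (
      customer_key  INTEGER PRIMARY KEY,  -- surrogate key, assigned by the ETL process
      customer_id   VARCHAR(20),          -- natural key carried over from the source system
      customer_name VARCHAR(100),
      city          VARCHAR(50),
      gender        CHAR(1)
  );

  -- A type 2 change: the customer moves, so a new row (and a new surrogate key) is added.
  -- Facts recorded before the move point at customer_key 101; later facts point at 102.
  INSERT INTO customer_dim VALUES (101, 'C-1007', 'Jane Doe', 'Boston', 'F');
  INSERT INTO customer_dim VALUES (102, 'C-1007', 'Jane Doe', 'Chicago', 'F');

  -- Additive facts sum across any dimension; a non-additive ratio such as margin rate
  -- is computed from its additive components at query time.
  SELECT d.city,
         SUM(f.order_dollars)                         AS order_dollars,
         SUM(f.margin_dollars)                        AS margin_dollars,
         SUM(f.margin_dollars) / SUM(f.order_dollars) AS margin_rate
  FROM   sales_facts  f
  JOIN   customer_dim d ON d.customer_key = f.customer_key
  GROUP  BY d.city;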
Most of these terms and principles can be learned by working with a sample schema. But there are many important principles that the typical "Sales" model does not reveal. In a future post, I'll look at the top 10 things the demo schema does not teach you.

Related Posts:

Ten Things You Won't Learn from that Demo Schema

© 2007 Chris Adamson