Chris Adamson’s Blog: Factless Fact Tables

Showing posts with label Factless Fact Tables. Show all posts

Thursday, March 28, 2013

Where To Put Dates

A reader is trying to decide if certain dates should be modeled as dimensions of a fact table or as attributes of a dimension table.

I have two attributes that I'm really not sure where is the best place to place:'Account Open Date' and 'Account Close Date'. In my model, I have [Dim Accounts] as a dimension and [F transact] as a fact table containing accounts transactions. An account can have many transactions, so the dates have different cardinality than the transactions.

I thought to put the dates in the Accounts dimension, but this led to problems: difficulties in calculations related to those dates--like if I want to get the transactions of the accounts that opened in the 4th quarter of 2012, or to get the difference between the date of last transaction and the account opening date, and so on. In other words I can't benefit from the Date dimension and the hierarchies it contains.

So I though about placing those dates in the fact table, but what made me hesitate is that the granularity of those dates is higher than the fact table, so there will be a lot of redundancy.

- Ahmad

Bethlehem, Palestine

This is a common dilemma. Many of our most important dimensions come with a number of possible dates that describe them.

Ahmad is thinking about this problem in the right way: how will my choice affect my ability to study the facts?

It turns out that (1) this is not an either/or question, and (2) granularity is not an issue.

Dates that Describe Important Dimensions

Image licensed via Creative Commons 2.0

from Patrick Hoesley

The dates are clearly useful dimension attributes. I suggest that you keep them in the dimension in one of two ways, which I will discuss in a moment.

First, though, lets look at what happens if the dates are only represented as foreign keys in the fact table:

If the dates are not stored in the dimension, the open and close date are only associated with the Account dimension through the fact table. The fact table only has records when transactions occur. So it becomes harder to find a list of open accounts, or to find the set of accounts that were active as of a particular date.

An additional factless fact table may help here, but it is far more simple to store the dates in the dimension.

Date as Attribute vs Outrigger

If plan to represent the dates in your dimension table, you have two choices. You can model the dates themselves as attributes, or you can model a pair of day keys in your account dimension. Either approach is acceptable.

The first option does not expose the richness of your full day dimension for analytic usage, but it may be simpler to use for many business questions. Other questions (like your quarterly example) will require a bit more technical knowledge, but most BI tools help with this.

The second option transforms your star into a (partial) snowflake. The day dimension becomes known as an "outrigger" when it connects to your account dimension. This allows you to explicitly leverage all the attributes of your Day dimension. The cost is some extra joins, which may be confusing and may also disrupt star-join optimization.

Making the correct choice here involves balancing several perspectives:

The business view and usability
The capabilities of your BI software front end
The capabilities of your DBMS back end software

Day Keys in the Fact Table

Having said all that, it is also useful to represent at least one of these dates in the fact table. The account open date may be a good dimensional perspective for the analysis of facts.

As you observed, this date has different cardinality than the transactions. The account open date for an account remains constant, even if it has dozens of transactions in your fact table. But the fact that it has low cardinality should not stop you from choosing it as a major dimension as your star!

Your account transaction fact table may have a pair of day keys -- one for the date the account was opened, and one for the date of the transaction.

If you choose to do this, the account dimension itself should include the open date. The outrigger solution is not necessary since your fact table has full access to the Day dimension.

Note that I do not recommend this for your account closed date, because that date changes. Storing it as a key for every transaction against an account would require a lot of updates to fact table rows once the account becomes closed.

More Information

I've touched on this topic in the past. In particular, see this post:

Dates in Dimension Tables (5/12/2011)

Although I edited it out of Ahmad's question, he also cited an issue surrounding the use of NULL for accounts that do not have a closed date. On that topic, see this recent post:

Avoid NULL in Dimensions (1/7/2013)

Support this Blog

I maintain this blog in my spare time. If you find it helpful, you can help support it by picking up a copy of my book: Star Schema: The Complete Reference.

Use the links on this blog to get a copy of this or any of the other recommended books, and you will be helping to keep this effort going.

Thursday, December 20, 2012

Dimensional Models: No E-R Fallback Required

Posted by Chris Adamson

People often suggest to me that a dimensional model cannot handle certain situations. These situations, the assertion holds, require falling back to entity-relationship modeling techniques.

Image by Patrick Hosley,
Licensed by Creative Commons 2.0

When we look together at their examples, though, I've always found a dimensional solution.

Here's a list of things people sometimes do not realize can be handled by standard dimensional techniques.

At the end of the post is a large collection of links to previous posts treat these topcis in more detail.

Note: This article should not be interpreted as a criticism of entity-relationship modeling. Rather, the point is to cover commonly overlooked capabilities of dimensional modeling.

Predictive Analytics: Storage of Granular Detail

I often hear people say: if you want to do predictive analytics, its important to capture all the relationships among data attributes. This is incorrect.

The job of predictive analytics is to tell us what is important--not the reverse! Neither data mining nor predictive analytics requires information to be stored in an entity-relationship format.

As people who do this work will tell you, they don't care how the data is supplied. Nor do they care what modelers believe are the important relationships. Many tools will completely de-normalize the data before doing anything else.

In order to support predictive analytics, standard best practices of dimensional modeling apply:

Fact tables record business activities and conditions at the lowest level of detail possible in granular, first line stars.
Transaction identifiers are carried in the model if available.
Surrounding dimensions are enriched with detail to the fullest extent possible
In the dimension tables, changes are tracked and time-stamped (“type 2”).

If we do these things with our dimensional data, we have not lost anything that is that is required to do predictive analytics.

Many-to-many Relationships: Multi-Valued Dimension Bridge

Usually, each fact in a star schema can be linked to a single member of a dimension. This is often mistaken for a rule, which leads to the incorrect conclusion that dimensional models cannot accommodate a fact that may link to multiple members of a dimension.

(This is often generalized as “dimensional cannot handle many-to-many relationships.”)

In a dimensional model, this situation is referred to as a multi-valued dimension. We use a bridge table to link the facts to the dimension in question. This special table sets up a many-to-many relationship between the facts and dimensions, allowing any number of dimension members to be associated with any number of facts.

Here are some examples:

In sales, a bridge table may be used to associate a sale, as recorded in a fact table, with multiple sales people.
In insurance, a single claim (again, tracked in a fact table) can be associated with multiple parties.
In healthcare, a single encounter may be associated with multiple providers or tests.
In government, a single audit or inspection may be associated with multiple findings.
In law enforcement, a single arrest may be associated with multiple charges.

Repeating Attributes: Multi-Values Attribute Bridge

It is sometimes suggested that dimensional models cannot gracefully accommodate repeating attributes. The dimensional solution is again a bridge table. This time, it is placed between the dimension table and an outrigger that contains the attribute in question.

Examples include:

Companies that have multiple Standard Industry Classification codes
People with multiple phone numbers or addresses
Accounts with multiple account holders
Patients with multiple diagnoses
Documents with multiple keywords.

Recursive Hierarchies: Hierarchy Bridge

Ragged hierarchies, unbalanced hierarchies, or recursive hierarchies are often cited as the downfall of the dimensional model. In fact, a solution exists, and it is extraordinarily powerful. The hierarchy bridge table allows facts to be aggregated either by rolling up or rolling down through the hierarchy, regardless of number of levels.

Examples include:

Parts that are made up of other parts
Departments that fall within other departments
Geographies that fall within other geographies
Companies that own other companies

Relationships Between Dimensions: Factless Fact Tables

A star schema does not include relationships between dimension tables. This has led to the misconception that you can't track these relationships.

In fact, any important relationship between dimension tables can and should be captured. It is done using factless fact tables. (Dimensions are never directly linked because of the implications this would have on slow change processing.)

Examples include:

Employees filling a job in a department
Marketing promotions in effect in a geographical market
Students registered for courses
The primary care physician assigned to an insured party
Brokers assigned to clients

Subtyping: Core and Custom

Another situation where modelers often believe they must “fall back” on ER techniques is when the attributes of a dimension table vary by type. This variation is often misconstrued as calling for the ER construct known as subtyping. Similar variation might also be found with associated facts

In the dimensional model, heterogeneous attributes are handled via the core and custom technique.

A core dimension captures common attributes, and type specific replicas capture the core attributes along with those specific to the subtype. Dimensions can then be joined to the fact table according to the analytic requirement. If there is variation in the facts, the same can be done with fact tables.

Examples include:

Products with attributes that vary by type
Customers that have different characteristics depending on whether they are businesses or individuals
In insurance, policies that have different characteristics depending on whether they are group or individual polices
In healthcare, medical procedures or tests that have different characteristics and result metrics depending on the test
In retail, stores that have varying characteristics depending on the type (e.g. owned, franchise, pocket)

Non Additive Metrics: Snapshots or Summary Tables

In a classic star schema design, facts are recorded at a granular level and “rolled up” across various dimensions at query time. Detractors often assume this means that non-additive facts have no place in a dimensional model.

In fact, dimensional modelers have several tools for handling non-additive facts.

Those that can be broken down into additive components are captured at the component level. This is common for many key business metrics such as:

Margin rate: stored as margin amount and cost amount in a single fact table. The ratio is computed after queries aggregate the detail.
Yield or conversion percentage: stored as quote count and order count in two separate fact tables, with ratio converted after aggregation at query time.

Other non-additive facts cannot be broken down into fully additive components. These are usually captured at the appropriate level of detail, and stored in snapshot tables or summary tables. Common examples include:

Period-to-date amounts, stored in a snapshot table or summary table
Distinct counts, stored in a summary table
Non-numeric grades, stored in a transaction-grained fact table

While these non-additive facts are flexible than additive facts in terms of how they can be used, this is not a result of dimensional representation.

Conclusion

Every technique mentioned here is part of dimensional modeling cannon. None are stopgaps or workarounds. While some may prove problematic for some of our BI software, these problems are not unique to the dimensional world.

In the end, the dimensional model can represent the same real-world complexities that entity-relationship models can. No ER fallback required.

- Chris

Learn More

All of these topics have been covered previously on this blog. Here are some links to get you started.

Establishing granular, detailed star schemas:

Build High Resolution Stars (January 7, 2011).

Multi-valued Dimensions:

Bridge to Multi-Valued Dimensions (February 9, 2011)
Bridge Tables and Many-to-many Relationships (April 25, 2011)
See the label bridges for more posts that deal with this topic.

Multi-Valued Attributes:

Resolve Repeating Attributes with a Bridge Table (January 28, 2011)
Q and A: Bridges are Part of Dimensional Modeling (March 7, 2011)
See the label bridges for more

Recursive Hierarchies:

Recursive Hierarchies and Bridge Tables (May 22, 2012)
Using a Hierarchy Bridge Without a Many-to-many Relationship (May 18, 2012)
And again, more will appear under the label bridges

Factless Fact Tables:

Factless Fact Tables (September 15, 2011)
See also the label Factless Fact Tables

Core and Custom:

There's not much on this topic on this blog right now, but see my book for more info (details below.)

Non-additive Facts:

Handling Rankings (October 10, 2012)
Storing Non-additive Facts (March 16, 2010)
More on Distinct Counts (April 27, 2009)
Dealing with Period-to-date Measurements (April 18, 2009)
See the category non-additve facts for more

Also consider checking out my book, Star Schema: The Complete Reference. It covers all these topics in much more detail:

Granular and detailed stars are covered in Chapter 3, "Stars and Cubes"
Multi-valued Dimensions and Multi-Valued attributes are covered in Chapter 9, "Multi-valued Dimensions and Bridges"
Hierarchy Bridges are covered in Chapter 10, "Recursive Hierarchies and Bridge Tables"
Factless Fact Tables are covered in Chapter 12, "Factless Fact Tables"
Core and Custom schemas are covered in Chapter 13, "Type-specific Stars"
Non Additive Facts are covered in Chapter 3, "Stars and Cubes," Chapter 11, "Transactions, Snapshots and Accumulating Snapshots" and Chapter 14, "Derived Schemas"

Use the links on this blog to order a copy from Amazon. There is no additional cost to you, and you will be helping support this blog.

Monday, July 16, 2012

Q&A: Does Master-Detail Require Multiple Stars in a Travel Data Mart

Posted by Chris Adamson

A reader asks whether master-detail relationships call for multiple fact tables.

Q: Is there some general principle to use as a guide when deciding to use 1 star or 2 for master-detail situations?

We have a master-detail situation involving trips, and multiple trip expenses per trip. Many of our queries are at the trip level - eg. how many trips are for clients in their 30's going to Arizona in July. Many of the queries are at the detail level -- how much is spent on means by the client's age range?

Both levels of query can be answered with one star, with the fact at the detail level (trip expense line). For trip-level questions - you can count distinct occurrences of the natural key in the trip dimension. But it seems that grinding through all those trip expenses when I just wanted to count trips is inefficient.

So do we have 2 fact tables, one at the trip level, one at the trip expense level? Do the dimensions surround both of them? Is there a rule?

- Ellen, Ottawa

A: As the reader observes, this is a common situation: when there is a master-detail relationship, is there a need for multiple fact tables?

This case calls for at least two fact tables.

Read on for the details.

Photo by o5com, licensed by Creative Commons

Multiple Fact Tables

After learning the basics of dimensional modeling, the first real world challenge we face is understanding when and how to design multiple fact tables. Until we learn to think in dimensional terms, the choice can be difficult.

I suggest starting with some basic guidelines. You probably need different fact tables if:

You have measurements with different periodicity
You have measurements with different levels of detail

The first guideline suggests that if facts do not describe the same event, they probably belong in different fact tables. For example, orders and shipments do not always happen at the same time. Order dollars and shipment dollars belong in separate fact tables.

The second guideline pertains to facts that do describe the same events. Information about an order and information about an order line are ostensibly available at the same time, but they have different levels of detail. If there are facts at both of these levels, there will need to be multiple fact tables.

So how would you apply these guidelines?

Think Dimensionally

It is easiest to work these questions out by thinking dimensionally. Forget about master-detail or parent-child for a few minutes, and consider what is being measured.

What are the business questions being asked, and what business metrics do they contain?
Are these metrics available simultaneously?
At what level of detail are these metrics available?

The reader cited business questions in her example. These questions reveal at least two metrics: number of trips, and expense dollars.

These metrics may be available on different schedules -- for example, a trip may commence before we have the details of all the expenses incurred. This would argue for multiple fact tables.

But let's suppose that the single source of data is expense reports. We do not have any information about the trip until it is over, at which point we have all the expense details.

In this case, lets think about the level of detail of these metrics. Number of trips and expense dollars seem to share a lot of common detail -- the traveller, the destination, the trip start and end date, and so forth.

But expense dollars have some additional detail -- the date of the expense item, and an expense category. Since these facts have different levels of detail, it makes sense to include them in separate fact tables.

Factless Fact Table

The trip-level star may contain a factless fact table. It contains one row per trip, with no explicit facts. We can determine the number of trips that answer a business question simply by counting rows.

Many teams would prefer to add a fact called number_of_trips and always populate it with the value 1. This is useful, because it makes the business metric explicit in your schema design. (I've written about this technique before.)

The trip level star may also contain some summary level facts that describe expenses -- say the total trip cost, the total transportation cost, total meals cost, etc. More detail on these metrics (such as the vendor, payment method, etc.) can found in the expense item star.

Deeper into Trip Analysis

Digging deeper, the reader may discover she needs more than two fact tables.

Trips often include more than one destination. When it is necessary to study the various segments (or "legs") of a trip, we can define a third fact table that contains one row per segment.

In this case, the reader would have three fact tables:

trip_facts Grain: one row per trip. Metrics: Number of trips (always one)
trip_segment_facts Grain: one row per destination city. Metrics: number of segments (always one)
trip_expense_facts Grain: one row per expense item. Metrics: expense_dollars

Each of these stars will share some common dimensions -- the trip origination date, the trip end date, the traveller, the trip identifier, etc.

Trip_segment_facts will also include destination-specific dimensions: the destination city of the segment, the arrival and departure dates for the segment, etc.

Trip_expense_facts will include dimensions that describe the individual expense items: the date paid, the payment method, the payee, the expense category and subcategory, and so forth.

A conformance matrix will help you keep track of the dimensionality of each star.

Learn More

Thanks to Ellen for the thoughtful question. If you have a question of your own, please send it in.

Pick up a copy of Star Schema: The Complete Reference and you will be helping support this blog.

Chapter 4 is dedicated to designing multiple fact tables.

Chapter 5 looks closely at conformed dimensions.

Chapter 12 covers faceless fact tables.

You can learn more about these topics by referring to these posts:

Multiple Stars and Conformed Dimensions (August 15, 2011) introduces the use of multiple star schemas that share common dimensions.

Factless Fact Tables (September 15, 2011) describes faceless fact tables that describe events, and the use of a fact with the constant value of 1.

The Conformance Matrix (June 5, 2012) describes how to represent the dimensionality of your stars in a matrix format.

The photo in this post is by o5com, licensed by Creative Commons

Thursday, April 26, 2012

Q&A: Human resources data marts

Posted by Chris Adamson

A reader asks if Human Resources data marts are inherently complex. I run down a list of dimensional techniques he should expect to find:

Q: I will be working on a data mart design project to design star schemas for human resources data. I heard that HR data is more complex than sales or marketing and special techniques need to be applied.

I looked at the star schemas of pre-built analytical applications developed by some vendors for our packaged HR solution. I felt that they are quite complex and just wondering star design for HR data should be so complex.

If possible, can you please discuss this topic in a detailed manner by considering any one of popular HRMS system data and the most common data/reporting requirements along with the design discussion to achieve the star for those reports using the given HRMS data?

- Venkat, UK

A: Human Resources applications do indeed tend to use advanced techniques in dimensional design.

Below, I run down a list of topics you will probably need to brush up on. In reality, every subject area requires complete mastery of dimensional modeling, not just the basics.

Note that the complexity you are seeing in packaged solutions may stem from the subject area. Vendors often produce abstracted models to facilitate customization.

Techniques used in HR data marts

No doubt you are accustomed to the transaction-grained stars you encountered in sales. You will find them in HR as well, but you will also encounter these:

Snapshot stars sample one or more metrics at pre-defined intervals.

In an HR data mart, these may be used to track various kinds of accruals, balances in benefit programs, etc.

Accumulating snapshot stars track dimension members through a business process and allow analysis of the elapsed time between milestones.

These may be used to track the filling of a position, "on-boarding" processes, disciplinary procedures, or applications to benefit programs.

Factless fact tables track business processes where the primary metric is the occurrence of an event. They contain no facts, but are used to count rows.

These are likely to be used for tracking attendance or absence, participation in training courses, etc.

Coverage stars are factless fact tables that model conditions. These are usually in place to support comparison to activities represented in other stars, but may also be leveraged to capture key relationships among dimensions.

These are likely to be used for linking employees to positions, departments and managers.

Your dimensions will also require reaching beyond the basics:

Transaction dimensions capture the effective and expiration date/time for each row in a dimension table. These are advisable in almost any situation.

In HR they may be used to track changes in an employee dimension.

Bridge tables for Multi-valued attributes allow you to associate a repeating attribute with a dimension.

In HR, these are likely to be used to associate an employee with skills, languages, and other important characteristics.

Hierarchy bridge tables allow you to aggregate facts through a recursive hierarchy.

In HR, these are used to navigate reporting structures (employees report to employees, who in turn report to other employees, and so forth) as well as organizational structures.

I would also expect to encounter some complexity in slow-change processing rules. Human Resource systems carefully audit certain kinds of data changes, tracking the reason for each change. As a result, you may have attributes in your dimension schema that may exhibit either type 1 or type 2 behavior, depending on the reason for the change.

Every schema goes beyond the basics

This list could go on, but I think you get the idea.

The only way to design a data mart that meets business needs is to have a well rounded understanding of the techniques of dimensional modeling.

You cannot get very far with nothing more than a grasp of the basics. This holds true in any subjet area -- even sales and marketing. You need the complete toolbox to build a powerful business solution.

Packaged data marts

The complexity that concerns the reader may actually stem from another cause: he is looking at packaged data mart solutions.

Packaged applications often introduce complexity for an entirely different reason: to support extensibility or customization. For example, facts may be stored row-wise rather than column-wise, and dimensions may contain generic attribute names.

Learn more

This blog contains posts on most of the topics listed above. Click each header for a link to a related article. Some have been discussed in multiple posts, but I have included only one link for each. So also do some exploration.

In addition, please check out my book Star Schema: The Complete Reference. When you purchase it from Amazon using the links on this page, you help support this blog.

Snapshots and accumulating snapshots are covered in Chapter 11, "Transactions, Snapshots and Accumulating Snapshots

Factless fact tables and coverage stars are covered in Chapter 12, "Factless Fact Tables"

Transaction dimensions are covered in Chapter 8, "More Slow Change Techniques"

Attribute bridges are covered in Chapter 9, "Multi-valued Dimensions and Bridges"

Hierarchy bridges are covered in Chapter 10, "Recursive Hierarchies and Bridges"

Thanks for the question!

- Chris

Send in your own questions to the address in the sidebar.

Do you have another technique that was useful in an HR data mart? Use the comments.

Image credit: Gravityx9 licensed under Creative Commons 2.0

Thursday, September 15, 2011

Factless Fact Tables

Posted by Chris Adamson

This post introduces the concept of the factless fact table and the situations where you would model one.

When a fact table does not contain any facts, it is called a factless fact table. There are two types of factless fact tables: those that describe events, and those that describe conditions. Both may play important roles in your dimensional models.

Factless fact tables for events

Sometimes there seem to be no facts associated with an important business process. Events or activities occur that you wish to track, but you find no measurements. In situations like this, build a standard transaction-grained fact table that contains no facts.

For example, you may be tracking contact events with customers. How often and why we have contact with each customer may be an important factor in focusing retention efforts, targeting marketing campaigns, and so forth.

The factless fact table shown here captures a row each time contact with a customer occurs. Dimensions represent the date and time of each contact, the customer contacted, and the type of communication (e.g. inbound phone call, outbound phone call, robo-call, email campaign, etc.).

While there are no facts, this kind of star schema is indeed measuring something: the occurrence of events. In this case, the event is a contact. Since the events correspond to the grain of the fact table, a fact is not required; users can simply count rows:

SELECT
    customer_name,
    count (contact_type_key)as "CONTACT COUNT"
FROM
    contact_facts...

CUSTOMER_   CONTACT
NAME        COUNT
=========== =======
BURNS, K          8
HANLEY, S        11
ROGERS, S         4
SCANLON, C        8
SMITH, B          8
SMITH, M.E.      12

This fragment of SQL reveals that there really is a fact here: the number of events. Its not necessary to store it, because it coincides with the grain of the fact table.

To acknowledge this, you can optionally add a fact called "contact_count"; it will always contain the constant value 1.

Factless fact tables for conditions

Factless fact tables are also used to model conditions or other important relationships among dimensions. In these cases, there are no clear transactions or events, but the factless fact table will enable useful analysis.

For example, an investment bank assigns a broker to each customer. Since customers may be inactive for periods of time, this relationship may not be visible in transaction-grained fact tables. A factless fact table tracks this important relationship:

Each row in this factless fact table represents a bounded time period during which a broker was assigned to a particular customer. This kind of factless fact table is used to track conditions, coverage or eligibility. In Kimball terminology, it is called a "coverage table."

Comparing conditions with events yields interesting business scenarios. This factless fact table can be compared to one that tracks investment transactions to find brokers who are not interacting with their customers, brokers who conducted transactions with accounts that belong to a different broker, etc.

More on factless fact tables

If you've got questions about factless fact tables, send them to the address in the sidebar of this blog.

You can also read more about factless fact tables in Star Schema: The Complete Reference. Chapter 12 provides detailed coverage of factless fact table design.

Monday, June 11, 2007

Ten Things You Won't Learn from that Demo Schema

Posted by Chris Adamson

Many people learn about dimensional modeling by studying a sample star schema database that comes with a software product. These sample databases are useful learning tools—to a point. Here are 10 things you won't learn by studying that demo.

If you've learned everything you know about star schema by working with a sample database, you probably have a good intuitive grasp of star schema design principles. In a previous post, I provided a list of 10 terms and principles that most sample databases illustrate well.

But there are many important things about a dimensional data warehouse that are not revealed by the typical "Orders" or "Sales" demo. Here are the top 10 things you will not learn from that sample database.

Multiple Fact Tables Most sample databases contain a single star—one fact table and its associated dimension tables. But it is rare to find a business process that can be modeled with a single fact table; it is impossible to find an enterprise that can. Most real-world designs will involve multiple fact tables, sharing a set of common dimensions.

When facts become available at different times, or with different dimensionality, they almost always belong in separate fact tables. Modeling them in a single fact table can have negative consequences. Mixed grain issues may result, complicating the load and making reporting very difficult. For example, building reports focused on only one of the facts can result in a confusing preponderance of extra rows containing the value zero.

Conformance With multiple fact tables, it is also important that each star be designed so that it works with others. A design principle called conformance helps ensure that as we build each new star, it works well with those that came before it. This avoids the dreaded stove-pipe. This principle allows a set of star schemas to be planned around a set of common dimensions and implemented incrementally.

Drilling Across It's also important to understand how to properly build a report that accesses data from multiple stars. A single SQL select statement won't do the job. Double counting, or worse, can result. Instead, we follow a process called drilling across, where each star is queried individually. The two result sets are then combined based on their common attributes. These drill across reports are some of the most powerful in the data warehouse.

Snapshot Fact Tables The fact table found in most demo stars is usually called a transaction fact table. But there are real world situations where other types of fact table designs are called for.

A snapshot design is useful for capturing the result of a series of transactions at a point-in-time; for example, the balance of each account in the general ledger at the end of each day. This type of design introduces the concept of semi-additivity, which can be a problem for many ad hoc query tools. It makes no sense to add together yesterday's balance and today's balance. It is not uncommon to compute averages based on the data in a snapshot star. But one must be careful here; the SQL Average() function may not always be what you need.

Factless Fact Tables Another type of fact table often contains no facts at all. Factless fact tables are useful in situations where there appears to be nothing to measure aside from the occurrence of an event, such as a customer contact. They also come in handy when we want to capture information about which there may be no event at all, such as eligibility.

In addition to transaction, snapshot and factless designs, there are other types of fact table as well. It is not uncommon to need more than one, even when modeling a single activity.

Roles and Aliasing Many business processes involve a dimension in multiple roles. For example, in accounting a transaction may include the employee who makes a purchase, as well as the employee who approves it. There is no need for separate"Purchaser" and "Approver" dimensions. A single "Employee" dimension will do the job. The fact table will have two foreign key references to the Employee dimension--one that represents the purchaser, and one that represents the approver. We use SQL "aliasing" when querying this schema in order to capture the two employee roles.

Advanced Slow Change Techniques If you are lucky, you were able to learn about Type 1 and Type 2 Slowly Changing Dimension techniques from the demo schema. I described these in a previous post. Often, analytic requirements require more.

A Type 3 change allows you to "have it both ways," analyzing all past and future transactions as if the change had occurred (retroactively) or not all.

There are also hybrid approaches, one of which tracks the "transaction-time" version of the changed data element as well as the "current-value" of the data element.

And then there's the time-stamped dimension technique, also called a transaction dimension. In this version, each row receives an effective date/time and an expiration date time. This provides Type 2 functionality, but also allows point-in-time analysis of the dimensional data.

Bridge Tables Perhaps the most confusing technique for the novice dimensional designer is the use of bridge tables. These tables are used when the standard one-to-many relationship between dimension and fact does not apply. There are three situations where bridge tables come in handy:

An attribute bridge resolves situations where a dimension attribute may repeat multiple times. For example, a dimension table called "Company" may include an attribute called "Industry." Some companies have more than one industry. Rather than flattening into "Industry 1," "Industry 2," and so on, an attribute bridge captures as many industries as needed.

A dimension bridge resolves situations where an entire dimension may repeat with respect to facts. For example, there may be multiple salespeople involved in a sale. Instead of loading the fact table with multiple salesperson keys, a dimension bridge gracefully manages the group of salespeople.

A hierarchy bridge resolves situations where a recursive hierarchy exists within a dimension. For example, companies own other companies. At times, users may want to roll transactions up that occur beneath a specific company, or vice versa. Instead of flattening the hierarchy, which imposes limitations and complicates analysis, a hierarchy bridge can be joined to the transaction data in various ways, allowing multiple forms of analysis.

All bridge implementations have implications for usage, or report building. Improper use of a bridge can result in double counting or incorrect results. Bridges also make deployment of business intelligence tools more difficult.

Derived Schemas Useful stars can also be derived from existing stars. Often called "second-line" solutions, these derived schemas can accelerate specific types of analysis with powerful results. The merged fact table combines stars to avoid drilling across. The sliced fact table partitions data based on a dimension value, useful in distributed collection and analysis. The pivoted fact table restructures row-wise data for columnar analysis and vice-versa. And set operation fact tables provide precomputed results for union, intersect and minus operations on existing stars.

Aggregate Schemas One of the reasons the star schema has become so popular is that it provides strong query performance. Still, there are times when we want results to come back faster. Aggregate schemas partially summarize the data in a base schema, allowing the database to compute query results faster. Of course, designers need to identify aggregates that will provide the most help, and queries and reports will receive no benefit unless they actually use the aggregate. Aggregates are usually designed as separate tables, instead of providing multiple levels of summarization in a single star. This avoids double counting errors, and can allow the exploitation of an automated query rewrite mechanism so that applications do not need to be aware of the aggregates.

I limited this list to 10 things. That's enough to make the point: a demo schema will only take you so far. When I teach Advanced Star Schema Design courses, I find that even people with many years of experience have questions, and always learn something new.

If you want to learn more, read the books recommended in the sidebar of this blog. Take a class on Advanced Star Schema Design. Interact with your peers at The Data Warehousing Institute conferences. And keep reading this blog. There will always be more to learn.

Related Posts:

Top 10 Thinks You Should Know About that Demo Schema

© 2007 Chris Adamson

Classes

Chris is scheduled to present at the following events. Course enrollment is open to the general public.

All these courses are also available on site (see below).

August 18, 2019
San Diego, CA
Data Modeling in the Age of Big Data
Registration: TDWI San Diego
August 20, 2019
San Diego, CA
Data Architecture: Managing Information in the Age of Big Data
Registration: TDWI San Diego
August 20, 2019
San Diego, CA
Workshop: Building the Business Case for Advanced Analytics
Registration: TDWI San Diego Strategy Summit
Monday October 21, 2019
San Francisco, CA
TDWI Dimensional Data Modeling Primer: From Requirements to Business Analysis
Registration: TDWI Seminars
Tuesday October 22, 2019
San Francisco, CA
Advanced Dimensional Modeling: Techniques for Practitioners
Registration: TDWI Seminars
Wednesday October 23, 2019
San Francisco, CA
Dimensional Models: What’s New in the Big Data Era
Registration: TDWI Seminars
November 12, 2019
Orlando, FL
Data Architecture: Managing Information in the Age of Big Data
Registration: TDWI Orlando
November 12, 2019
Orlando, FL
The Dimensional Model Refactored: New Techniques for the 21st Century
Registration: TDWI Orlando
November 15, 2019
Orlando, FL
Advanced Dimensional Modeling: Complete Tour of Modern Best Practices
Registration: TDWI Orlando

Onsite Education

You can bring Chris to your team for interactive education.

Dimensional Modeling

Chris provides full-day and expanded two-day courses covering the dimensional design concepts from Star Schema: The Complete Reference.
TDWI Courses

Chris teaches select TDWI courses that cover topics like data BI fundamentals, performance management, business analytics, dashboards and scorecards, and more.

All of Chris's education offerings are provided through TDWI.

For information on onsite offerings, contact TDWI Onsite Education. or Oakton Software