
Thursday, December 20, 2012

Dimensional Models: No E-R Fallback Required

People often suggest to me that a dimensional model cannot handle certain situations. These situations, the assertion holds, require falling back to entity-relationship modeling techniques.

When we look at their examples together, though, I've always been able to find a dimensional solution.

Here's a list of things people sometimes do not realize can be handled by standard dimensional techniques.

At the end of the post is a large collection of links to previous posts that treat these topics in more detail.

Note: This article should not be interpreted as a criticism of entity-relationship modeling. Rather, the point is to cover commonly overlooked capabilities of dimensional modeling.


Predictive Analytics: Storage of Granular Detail

I often hear people say: if you want to do predictive analytics, it's important to capture all the relationships among data attributes. This is incorrect.

The job of predictive analytics is to tell us what is important--not the reverse!  Neither data mining nor predictive analytics requires information to be stored in an entity-relationship format.

As people who do this work will tell you, they don't care how the data is supplied. Nor do they care what modelers believe are the important relationships. Many tools will completely de-normalize the data before doing anything else.

In order to support predictive analytics, standard best practices of dimensional modeling apply:
  • Fact tables record business activities and conditions at the lowest level of detail possible in granular, first line stars.
  • Transaction identifiers are carried in the model if available.
  • Surrounding dimensions are enriched with detail to the fullest extent possible.
  • In the dimension tables, changes are tracked and time-stamped (“type 2”).
If we do these things with our dimensional data, we have not lost anything that is required to do predictive analytics.

Many-to-many Relationships: Multi-Valued Dimension Bridge

Usually, each fact in a star schema can be linked to a single member of a dimension. This is often mistaken for a rule, which leads to the incorrect conclusion that dimensional models cannot accommodate a fact that may link to multiple members of a dimension.

(This is often generalized as “dimensional models cannot handle many-to-many relationships.”)

In a dimensional model, this situation is referred to as a multi-valued dimension. We use a bridge table to link the facts to the dimension in question. This special table sets up a many-to-many relationship between the facts and dimensions, allowing any number of dimension members to be associated with any number of facts.

Here are some examples:
  • In sales, a bridge table may be used to associate a sale, as recorded in a fact table, with multiple sales people. 
  • In insurance, a single claim (again, tracked in a fact table) can be associated with multiple parties. 
  • In healthcare, a single encounter may be associated with multiple providers or tests. 
  • In government, a single audit or inspection may be associated with multiple findings. 
  • In law enforcement, a single arrest may be associated with multiple charges.
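
As a rough sketch of the first example (all table and column names here are hypothetical, not drawn from a particular system), the fact table carries a group key, and the bridge associates each group with its individual salespeople:

    -- Bridge: one row per salesperson in each group
    CREATE TABLE salesperson_group_bridge (
        salesperson_group_key INTEGER NOT NULL,  -- also appears in the fact table
        salesperson_key       INTEGER NOT NULL,  -- references the salesperson dimension
        allocation_factor     DECIMAL(5,4)       -- optional weight to avoid double counting
    );

    -- Sales dollars by individual salesperson, even when a sale has several:
    SELECT s.salesperson_name,
           SUM(f.sale_dollars * b.allocation_factor) AS allocated_dollars
    FROM   sales_facts f
    JOIN   salesperson_group_bridge b
           ON b.salesperson_group_key = f.salesperson_group_key
    JOIN   salesperson s
           ON s.salesperson_key = b.salesperson_key
    GROUP BY s.salesperson_name;
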
Repeating Attributes: Multi-Valued Attribute Bridge

It is sometimes suggested that dimensional models cannot gracefully accommodate repeating attributes. The dimensional solution is again a bridge table. This time, it is placed between the dimension table and an outrigger that contains the attribute in question.

Examples include:
  • Companies that have multiple Standard Industrial Classification codes 
  • People with multiple phone numbers or addresses 
  • Accounts with multiple account holders 
  • Patients with multiple diagnoses
  • Documents with multiple keywords.
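
A minimal sketch of the first example (hypothetical names): the company dimension connects through a bridge to an outrigger that holds the repeating SIC code attribute. As with any bridge, queries that traverse it must guard against double counting.

    -- Outrigger holding the repeating attribute
    CREATE TABLE sic_code (
        sic_code_key    INTEGER PRIMARY KEY,
        sic_code        CHAR(4),
        sic_description VARCHAR(100)
    );

    -- Bridge: one row per company / SIC code combination
    CREATE TABLE company_sic_bridge (
        company_key  INTEGER NOT NULL,  -- references the company dimension
        sic_code_key INTEGER NOT NULL   -- references the outrigger
    );
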
Recursive Hierarchies: Hierarchy Bridge

Ragged hierarchies, unbalanced hierarchies, or recursive hierarchies are often cited as the downfall of the dimensional model. In fact, a solution exists, and it is extraordinarily powerful. The hierarchy bridge table allows facts to be aggregated either by rolling up or rolling down through the hierarchy, regardless of the number of levels.

Examples include:
  • Parts that are made up of other parts
  • Departments that fall within other departments
  • Geographies that fall within other geographies
  • Companies that own other companies
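
To illustrate with the second example (hypothetical names), a hierarchy bridge pairs each department with every department beneath it, including itself. Facts can then be rolled up through any point in the hierarchy, no matter how deep it is:

    CREATE TABLE department_hierarchy_bridge (
        superior_department_key    INTEGER NOT NULL,
        subordinate_department_key INTEGER NOT NULL,
        levels_removed             INTEGER NOT NULL  -- 0 on the self-referencing row
    );

    -- Total spending for a chosen department and everything beneath it:
    SELECT SUM(f.spend_dollars) AS total_spend
    FROM   spending_facts f
    JOIN   department_hierarchy_bridge b
           ON b.subordinate_department_key = f.department_key
    JOIN   department d
           ON d.department_key = b.superior_department_key
    WHERE  d.department_name = 'Operations';
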
Relationships Between Dimensions: Factless Fact Tables

A star schema does not include relationships between dimension tables. This has led to the misconception that you can't track these relationships.

In fact, any important relationship between dimension tables can and should be captured. It is done using factless fact tables. (Dimensions are never directly linked because of the implications this would have on slow change processing.)

Examples include:
  • Employees filling a job in a department
  • Marketing promotions in effect in a geographical market
  • Students registered for courses
  • The primary care physician assigned to an insured party
  • Brokers assigned to clients
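
The student registration example, for instance, might be captured with a factless fact table like this (hypothetical names):

    -- Factless fact table: each row simply records that a relationship exists
    CREATE TABLE registration_facts (
        student_key  INTEGER NOT NULL,  -- references the student dimension
        course_key   INTEGER NOT NULL,  -- references the course dimension
        semester_key INTEGER NOT NULL   -- references the semester dimension
    );

    -- Registrations per course for a given semester:
    SELECT c.course_name, COUNT(*) AS registrations
    FROM   registration_facts f
    JOIN   course   c ON c.course_key   = f.course_key
    JOIN   semester s ON s.semester_key = f.semester_key
    WHERE  s.semester_name = 'Fall 2012'
    GROUP BY c.course_name;
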
Subtyping: Core and Custom

Another situation where modelers often believe they must “fall back” on ER techniques is when the attributes of a dimension table vary by type. This variation is often misconstrued as calling for the ER construct known as subtyping.  Similar variation might also be found with the associated facts.

In the dimensional model, heterogeneous attributes are handled via the core and custom technique.

A core dimension captures common attributes, and type-specific replicas capture the core attributes along with those specific to the subtype. Dimensions can then be joined to the fact table according to the analytic requirement. If there is variation in the facts, the same can be done with fact tables.

Examples include:
  • Products with attributes that vary by type 
  • Customers that have different characteristics depending on whether they are businesses or individuals
  • In insurance, policies that have different characteristics depending on whether they are group or individual policies
  • In healthcare, medical procedures or tests that have different characteristics and result metrics depending on the test
  • In retail, stores that have varying characteristics depending on the type (e.g. owned, franchise, pocket)
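
As a sketch of the first example (hypothetical names), a core product dimension and a custom dimension for one product type repeat the same surrogate key values and common attributes, so either table can join to the fact table as the analysis requires:

    -- Core dimension: attributes common to every product
    CREATE TABLE product_core (
        product_key  INTEGER PRIMARY KEY,
        part_number  VARCHAR(20),
        product_name VARCHAR(100),
        category     VARCHAR(40)
    );

    -- Custom dimension for one type: same keys and common attributes,
    -- plus columns that apply only to tablets
    CREATE TABLE product_tablet (
        product_key      INTEGER PRIMARY KEY,  -- same key domain as product_core
        part_number      VARCHAR(20),
        product_name     VARCHAR(100),
        category         VARCHAR(40),
        screen_size      VARCHAR(20),
        storage_capacity VARCHAR(20),
        carrier          VARCHAR(40)
    );
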
Non Additive Metrics: Snapshots or Summary Tables

In a classic star schema design, facts are recorded at a granular level and “rolled up” across various dimensions at query time. Detractors often assume this means that non-additive facts have no place in a dimensional model.

In fact, dimensional modelers have several tools for handling non-additive facts.

Those that can be broken down into additive components are captured at the component level. This is common for many key business metrics such as:
  • Margin rate: stored as margin amount and cost amount in a single fact table. The ratio is computed after queries aggregate the detail.
  • Yield or conversion percentage: stored as quote count and order count in two separate fact tables, with the ratio computed after aggregation at query time.
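
As a sketch of the first case (hypothetical names, and assuming margin rate is defined as margin over the sum of margin and cost), the ratio is computed only after the additive components have been aggregated:

    SELECT p.category,
           SUM(f.margin_dollars) /
               ( SUM(f.margin_dollars) + SUM(f.cost_dollars) ) AS margin_rate
    FROM   order_facts f
    JOIN   product p ON p.product_key = f.product_key
    GROUP BY p.category;
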
Other non-additive facts cannot be broken down into fully additive components. These are usually captured at the appropriate level of detail, and stored in snapshot tables or summary tables. Common examples include:
  • Period-to-date amounts, stored in a snapshot table or summary table
  • Distinct counts, stored in a summary table
  • Non-numeric grades, stored in a transaction-grained fact table
While these non-additive facts are less flexible than additive facts in terms of how they can be used, this is not a result of their dimensional representation.

Conclusion

Every technique mentioned here is part of the dimensional modeling canon. None are stopgaps or workarounds. While some may prove problematic for some of our BI software, these problems are not unique to the dimensional world.

In the end, the dimensional model can represent the same real-world complexities that entity-relationship models can. No ER fallback required.

- Chris


Learn More

All of these topics have been covered previously on this blog.  Here are some links to get you started.

Establishing granular, detailed star schemas:
Multi-valued Dimensions:
Multi-Valued Attributes:
Recursive Hierarchies:
Factless Fact Tables:
Core and Custom:
  • There's not much on this topic on this blog right now, but see my book for more info (details below.)
Non-additive Facts:

Also consider checking out my book, Star Schema: The Complete Reference.  It covers all these topics in much more detail:
  • Granular and detailed stars are covered in Chapter 3, "Stars and Cubes"
  • Multi-valued Dimensions and Multi-Valued attributes are covered in Chapter 9, "Multi-valued Dimensions and Bridges"
  • Hierarchy Bridges are covered in Chapter 10, "Recursive Hierarchies and Bridge Tables"
  • Factless Fact Tables are covered in Chapter 12, "Factless Fact Tables"
  • Core and Custom schemas are covered in Chapter 13, "Type-specific Stars"
  • Non Additive Facts are covered in Chapter 3, "Stars and Cubes," Chapter 11, "Transactions, Snapshots and Accumulating Snapshots" and Chapter 14, "Derived Schemas"

Use the links on this blog to order a copy from Amazon.  There is no additional cost to you, and you will be helping support this  blog.






Monday, November 5, 2012

Name Value Pairs and Dimensional Models

A reader asks how name/value pairs are best translated into a dimension design.
I was wondering if you have ever had to setup a data warehouse with a source system that has the ability to have dynamic "virtual" columns using a set of name/value pair tables...If you have done this or have any cool ideas on how to tackle this setup in a dimension that would be greatly appreciated.
-Joshua
Belmont, CA

This is a common scenario. The preferred solution is to identify the most commonly used "virtual columns" and model them as standard attributes. Bridged solutions are also possible, but are much less flexible from a usability standpoint.

Name Value Pairs

Name/value pairs are often used in OLTP solutions to offer a flexible way to describe an entity.

Rather than store each characteristic of an entity as an attribute, a special table lets you identify unlimited attributes for something. Its main columns capture the name of the attribute and its value. (Hence the moniker "name/value pair".)

For example, suppose you are Apple Inc. Your OLTP system has a product table, but each product has a wide variety of attributes. Rather than model each attribute, you might capture them in a name/value pair table.

The rows in the table for the newly released iPad Mini might look like this:

As you can see, the 18 rows above describe one iPad Mini model. 

Elsewhere in the data model, an associative table (or "intersect table") cross-references these 18 pairs with a single row in the product table for this item.

Apple has 23 other iPad Mini models, each with different values for the capacity attribute, color attribute, carrier attribute, etc. 

In addition, this same table can be used to describe the characteristics of products with different named attributes - a phone, a computer, a notebook, and so forth. New attributes do not require changing the database design. Instead, all that is necessary is to insert the appropriate rows.

In a dimensional model, there are several ways to handle name/value sources. Each involves tradeoffs between flexibility and usability.

Flattened Solution is Most Usable

The preferred method is to model the "names" as explicit dimension attributes.1

In the case of our product, for example, the resultant dimensional model might look like this:

This approach produces the most understandable and usable dimensional model. Each important characteristic is present in the dimension, named according to business terms, and ready to provide context for facts.

Do not be discouraged if every instance of the source entity (product in this case) does not share the same set of attributes in the name/value pair table. Look for 100 or so attributes that are most commonly used. In many cases, this may satisfy most of your analytic requirements.
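
As a sketch of the ETL logic (the source table and column names are hypothetical, and only three names are shown for brevity), the chosen names are pivoted into explicit columns of the product dimension; names that are absent for a given product simply come through as NULL, which the ETL process can replace with "N/A":

    SELECT p.product_id,
           p.product_name,
           MAX(CASE WHEN nv.attribute_name = 'Color'    THEN nv.attribute_value END) AS color,
           MAX(CASE WHEN nv.attribute_name = 'Capacity' THEN nv.attribute_value END) AS capacity,
           MAX(CASE WHEN nv.attribute_name = 'Carrier'  THEN nv.attribute_value END) AS carrier
    FROM   src_product p
    JOIN   src_product_attribute x ON x.product_id = p.product_id
    JOIN   src_name_value_pair  nv ON nv.pair_id   = x.pair_id
    GROUP BY p.product_id, p.product_name;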

Core and Custom Dimensions

If there are simply too many names to flatten into a single dimension table, the next step is to consider "core and custom" models.

If you separate the entity by "types", can you find 100 named attributes for each type?  

If so:
  • Build a single "core" dimension that contains an identifier and any attributes common across all types, along with a surrogate key.
  • For each type, build an additional dimension that replicates the common attributes, plus contains the 100 or so attribute names specific to the type
  • The core and custom versions should use the same surrogate key domain for each member of the dimension, so that they can be swapped in and out of queries as required.
For example, Apple might do the following:
  • Build a single product table that has the part number, category, product name and a key. This table will have a row for all products, including computers, tablets, phones and music players. This is the "core" product table. 
  • Build additional "custom" dimensions for each category of product. The table pictured in the previous section might be the custom dimension for tablets.
This solution retains the understandability of the previous solution - each attribute is clearly and explicitly named. However, we must now take care to join the appropriate version of the dimension to fact tables - a technical consideration based on the business question being asked.

Bridged Solution Mirrors OLTP Design

The last option is to employ a bridged solution. This technique actually mimics the original OLTP design. An outrigger contains the name/value pairs, and a bridge associates it with the dimension.

For example, Apple's product dimension might be bridged as follows:

This solution has the benefit of providing access to the full range of named attributes. It is also flexible; new named attributes do not require any change to the schema design.2

The disadvantage here is that this solution is the most difficult to use. Facts can be easily double counted, triple counted, etc. It will be necessary to take great care in constructing and qualifying queries, and it may also be necessary to lock down "grand total" functionality in end-user reporting tools.

Both Ways

Remember that this is not an either/or solution. If a bridged outrigger is necessary, consider also capturing the 100 most commonly used attributes in the dimension itself.

This allows you to create a safe "sandbox" for less technically sophisticated users. They are able to access much of the analytic value of your solution, without being exposed to the complexity and possible dangers of the bridge table.

More Reading
  • Bridges: Much has been written in this blog about bridge tables. The most relevant entry if you want to learn more is Resolve Repeating Attributes With A Bridge Table (1/28/11).  There are also two full chapters in my latest book (see below.)
  • Core and custom dimensions have not been previously discussed here, but you can learn more about them in Star Schema: The Complete Reference.  It dedicates an entire chapter to "Type-Specific Stars."
Help support this blog

This blog is ad-free, and no one pays me to write it.  You can help out:  
  • Use any of the links on these pages to pick up my latest book, Star Schema: The Complete Reference.  
  • Or, if you already have it, use the links to check out any of the other recommended books.

When you do, a small portion of the purchase goes to this blog (you do not pay any extra).


Notes:

1I'll call these "names" rather than "pairs", since it's the name that will translate into a column. For example, "color" is a name; it may participate in several name/value pairs -- one for each possible color.

2The bridge table in this diagram contains product and attribute keys. If there will be many combinations, dimensional modelers often replace the product key with a group key, and store the group key in the dimension table. See Resolve Repeating Attributes With A Bridge Table (1/28/11). 


Tuesday, November 15, 2011

Conformed Dimensions

This second post on conformed dimensions explores different ways in which dimensions can conform. 
There are several flavors of conformed dimensions. Dimensions may be identical, or may share a subset of attributes that conform.
Conformance basics

Conformed dimensions are central to the discipline of dimensional modeling.  The basics of conformance were introduced in a post from earlier this year.  In a nutshell:


  • Measurements of discrete business processes are captured in individual star schemas (e.g. proposals and orders)
  • Some powerful business metrics combine information from multiple processes.  (e.g. fill rate: the ratio of orders to proposals)
  • We construct these metrics through a process called drilling across
  • Drilling across requires dimensions with the same structure and content (e.g. proposals and orders have a common customer dimension)

For a refresher on these concepts, see "Multiple stars and conformed dimensions" (8/15/2011). 

Physically shared dimensions not required 

When two fact tables share the same dimension table, their conformance is a given. Since the shared dimensions are the same table, we know they will support drilling across.

For example, stars for proposals and orders may share the customer dimension table.  This makes it possible to query proposals by customer and orders by customer, and then merge the results together.

But this process of drilling across does not require shared dimension tables. It works equally well if proposals and orders are in separate data marts in separate databases.

As long as the stars each include dimensions that share the same structure (e.g. a column called customer_name) and content (i.e. the customer values are the same), it will be possible to merge information from the stars. 
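
As a sketch (hypothetical names), drilling across issues one query per star and merges the result sets on the conformed attribute:

    -- Query against the proposals star
    SELECT c.customer_name, SUM(f.proposal_dollars) AS proposal_dollars
    FROM   proposal_facts f
    JOIN   customer c ON c.customer_key = f.customer_key
    GROUP BY c.customer_name;

    -- Query against the orders star (possibly in a different database)
    SELECT c.customer_name, SUM(f.order_dollars) AS order_dollars
    FROM   order_facts f
    JOIN   customer c ON c.customer_key = f.customer_key
    GROUP BY c.customer_name;

    -- The two result sets are then merged (a full outer join on customer_name),
    -- and fill rate can be computed as order_dollars / proposal_dollars.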

Levels of conformance 

It is easy to take this a step further.  We can also observe that there is compatibility between dimensions that are not identical.

If a subset of attributes from two dimensions shares the same structure and content, those attributes form a sort of “lowest common denominator” across which we can compare data from the stars.

For example, suppose we establish budgets at the monthly level, and track spending at the daily level.  Clearly, days roll up to months.  If designed correctly, it should be possible to compare data from budget and spending stars by month.

The picture below illustrates the conformance of a MONTH and DAY table graphically.  The ring highlights the shared attributes; any of these can be used as the basis for comparing facts in associated fact tables.

In this case, the two conformed dimensions participate in a natural hierarchy.  Months summarize days. The month table is referred to as a “conformed roll-up” of day.

To successfully drill across, the content of the shared attributes must also be the same.  Instances of month names, for example, must be identical in each table -- "January" and "January" conform; "January" and "JAN." do not.

To guarantee conformance of content, the source of the rollup should be the base dimension table. This also simplifies the ETL process, since it need not reach back to the source a second time. 
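
A sketch of that ETL step (hypothetical names): the month roll-up is selected directly from the day dimension, so its content cannot drift from the base table.

    INSERT INTO month_dim (month_name, quarter, fiscal_period, calendar_year)
    SELECT DISTINCT month_name, quarter, fiscal_period, calendar_year
    FROM   day_dim;
    -- A surrogate key for month_dim would be assigned in a subsequent step.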

Other kinds of conformance 

Identical tables and conformed roll-ups are the most common kinds of conformed dimensions.  Other kinds are less common. 

Degenerate dimensions (dimensions that appear in a fact table) may also conform. This is particularly useful with transaction identifiers. 

Overlapping dimensions may share a subset of attributes, but not participate in a hierarchy. This is most common with geographical data. 

More Info 

For more information, see the following posts: 
I also write about conformed dimensions extensively in Star Schema, The Complete Reference.  
If you enjoy this blog, you can help support it by picking up a copy!

Photo by Agnes Periapse, licensed under Creative Commons 2.0

Monday, October 17, 2011

Creating transaction identifiers for fact tables

Sometimes, the source data for a fact table does not include a transaction identifier. When this happens, we often create our own. A reader asks if this goes against best practices:

Q: You said that we should not create surrogate keys in fact tables. But we have a source system that does not provide us with identifiers for phone calls. When we get the log files, we create surrogate keys for each call before loading them into the star.

A: This is a common practice, and it is just fine.

You are not creating what dimensional modelers call surrogate keys. You are creating transaction identifiers.  These will be helpful as part of the ETL and QA processes.

Surrogate keys vs. transaction identifiers

Surrogate keys should not be confused with transaction identifiers.

As I wrote in the post referenced in the question, a surrogate key is an attribute that is created to uniquely identify rows in dimension tables. It does not come from a source system; it is created expressly for the dimensional schema.

A transaction identifier is a dimension attribute in a fact table that defines its grain, or at least helps to do so. (If you need a refresher on grain, see this post.)

For example, PAYROLL_FACTS may have the grain "one row per pay stub."  Each pay stub has an identifying number in the source system, which is carried into the star as a degenerate dimension. This attribute is a transaction identifier.1

When a transaction identifier defines grain, the DBA may define it as "the primary key of the fact table." Likewise, an ETL developer or QA analyst may use it as a sort of key to compare the star to operational data or staging data. 

From the dimensional perspective, however, these transaction identifiers are dimensions.  They can be used to filter queries, group or sort the results, and so forth. They will not appear in other tables as foreign keys that reference the fact table.

Creating transaction identifiers

When operational systems do not have good transaction identifiers, we often "cook one up" during the ETL process. They are not surrogate keys in the dimensional modeling sense, though the term might sound appropriate. They are transaction identifiers.

For example, suppose you are designing a star that records phone calls made from your call center. You want the grain of the fact table to be  "one row per phone call."  The data will come from your call switch.  As you've mentioned, it creates a log file, but does not provide an identifier for each call.

In your design, you may define your grain: "one row per date, time, originating number and number dialed." This works just fine, but may be difficult for ETL developers to manage.  Instead, they assign a unique identifier to each call in the log. You can use this identifier to define the grain of the fact table.

Carried into the fact table, this attribute aids in the data integration and quality assurance process. It can be used to quickly crosscheck the star against the log file.
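
For example, a pair of QA queries might use the identifier to confirm that every call in the staged log file landed in the star exactly once (the table and column names here are hypothetical):

    -- Calls in the log that never made it into the fact table
    SELECT l.call_id
    FROM   call_log_stage l
    LEFT JOIN call_facts f ON f.call_id = l.call_id
    WHERE  f.call_id IS NULL;

    -- Calls that were loaded more than once
    SELECT call_id, COUNT(*) AS row_count
    FROM   call_facts
    GROUP BY call_id
    HAVING COUNT(*) > 1;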

However, for analytic purposes, it is not used as a key.  It is a dimension attribute -- a "cooked up" transaction identifier.

Other concerns

When you are forced to create your own transaction identifiers, you will have other things to worry about.

First and foremost, you must be sure there is a mechanism to prevent loading of duplicate facts. You will need to understand how the source system generates its log files, to be sure that you extract data that is complete, accurate, and not subject to revision.

Also, note that the artificial identifiers may convey a false sense that you can reach all the way back to the source to find the original transaction. While it may be possible, your identifier will not be the mechanism.

Thanks for the question, and keep them coming.

-Chris

More Info

For more information, see these posts:

Also check out my book, Star Schema: The Complete Reference.  I discuss surrogate keys in Chapter 1, Analytic Databases and Dimensional Design.  Degenerate dimensions and transaction identifiers are covered in Chapter 3, Stars and Cubes.

1Not all degenerate dimensions are transaction identifiers.  For an example that is not a transaction identifier, see my post of October 15, 2010.

Image licensed via Creative Commons 2.0
 from Patrick Hoesley

Monday, December 13, 2010

Deeper into the Accumulating Snapshot

The most popular posts on this blog discuss accumulating snapshot designs. In this post, I'll share some new insights on the technique, which come from my book Star Schema The Complete Reference.

In particular, I want to debunk some common misconceptions about accumulating snapshots.

The Accumulating Snapshot

One of the three major kinds of fact table, the accumulating snapshot is the least understood and the most underutilized. 

I provided a basic description of the technique several years ago in the first post to this blog.  The basic concept is simple: in this kind of design, each row in the fact table tracks the changes to something over time.  The idea is to facilitate reports that look at how long things spend in various processing stages.

For example, the post mentioned above describes an accumulating snapshot that tracks the processing of mortgage applications.  Each row in the fact table tracks a single application, and is updated each time it completes a processing stage.  It contains foreign key references to multiple dates:
  • Submitted Date
  • Approval Date
  • Processed Date
  • Underwritten Date
  • Settlement Date
Corresponding to each of these milestones is a fact that indicates how long the application spent in that stage.  In the previous post, I called these facts "lags," but I'll suggest a different way of looking at them in a moment.

These facts make it easy to study the average time items spend in each stage. Without an accumulating snapshot, this might require correlated subqueries or the comparison of detail rows from multiple fact tables.

Unique Characteristics

Unlike other kinds of fact table, the intent is to update rows in an accumulating snapshot.  When an application reaches a new status, for example, the appropriate date is added.  I've touched on this topic in a prior Q and A post.

Another characteristic that differs from other kinds of fact tables is the size of the accumulating snapshot.  It usually has fewer rows than the dimension table representing the item being processed.  That's because the item itself is likely to undergo type 2 slow changes.  I've noted this in a previous post as well.

Lastly, the accumulating snapshot is essentially current valued.  It is used to study the net effects of a process up to and including the current point in time. 

Non Linear Processes

In the past, I've written that the accumulating snapshot is only effective for processes that follow a consistent, predictable series of steps.  (See my previous books in the sidebar to this blog.)

But the technique can easily be adapted to accommodate processes that are not linear.

For example, suppose our mortgage application may bounce back to the processing stage if the underwriter discovers some missing documentation. Conventional wisdom holds that this will "break" the accumulating snapshot, since it violates the standard sequence of status milestones.

This non-linear behavior is easily accounted for. First, we will rename our "lag" facts to represent the number of days spent at each stage:
  • Days awaiting approval
  • Days processing
  • Days underwriting
  • Days awaiting settlement
The ETL process will be developed to check each open application on a daily basis, and increment the appropriate fact for its processing stage.  

This means it does not matter if an application reaches the underwriting stage and then gets sent back to the processing stage.  If this occurs, the ETL process will simply resume incrementing the "days processing" fact.
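
A sketch of that daily ETL step (hypothetical names, and assuming the fact table carries an indicator of each application's current stage):

    -- Add one day to whichever "days ..." fact matches the current stage
    UPDATE mortgage_application_facts
    SET    days_processing = days_processing + 1
    WHERE  current_stage = 'PROCESSING';

    UPDATE mortgage_application_facts
    SET    days_underwriting = days_underwriting + 1
    WHERE  current_stage = 'UNDERWRITING';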

One thing that will need to be worked out is what dates to use for each milestone. For the application in question, once the processor finishes with it for the second time, what is the processed date?  The first time it achieved this status, or the second?  Business rules must drive these decisions.

Long Duration Processes

We are also accustomed to thinking of accumulating snapshots as appropriate only for short-duration processes.

However, my customers and students have found success in modeling activities of significant duration. Examples include the tenure of faculty members, the life cycle of a service contract, the duration of a mortgage or insurance policy, and the incarceration of a prisoner.

Some of these processes do not have fixed durations, and may prove to span years or even decades. As long as it is feasible for the ETL process to continually update records for active entities in the process, the approach is effective.

Going Even Deeper

Several other topics are worth considering:
  • You may have an operational system that logs very detailed status changes. It is possible to map a set of hundreds of statuses into a few key milestones for an accumulating snapshot. 
  • If the item or entity being processed undergoes slow changes, update its surrogate key in its corresponding fact table row.  This ensures the fact table always points to the most recent version of the item.
  • Some milestones make useful dimension attributes as well. Incorporate the dates into the dimension table as appropriate.  For example, "date of approval" may be a useful characteristic in the mortgage application dimension.
  • For milestones that have not been achieved, don't include a NULL-valued foreign key in the fact table.  Instead, establish a special row in your day dimension for events that have yet to occur.  Set its attributes to "N/A" and give it the highest possible date supported by your DBMS.  This will simplify joins and query predicates.
  • If at all possible, source the accumulating snapshot from other fact tables that log status changes as transactions.  This will make the ETL process much simpler. 
These topics are all discussed in the book.

You will find this same kind of deep dive in every topic it covers.  You can help support this blog by using the links in the sidebar to order a copy.

Thanks for your support, and feel free to send in your questions!

Image Credit:  Creativity103 via Creative Commons

Friday, November 5, 2010

Q&A: Star vs. Snowflake

A question from Portugal gives me an excuse to talk about snowflakes, why you might want to avoid them, and why you might want to use them.

Q: In your perspective, when does a star schema start to be a snowflake? If you have a snowflaked dimension, do you consider the model a star-schema? Or if you have for example an outrigger, is the model not a star-schema anymore?


I think the question of when a star becomes a snowflake is really one of semantics, but I'll give an answer.

Readers who just want to know when to use snowflake designs can skip ahead a couple of paragraphs.

Is it a Star or Snowflake?

As I said, this is really a semantic issue, but here is my answer. When dimension tables are linked to other dimension tables, or to anything that is not a fact table, the design is a snowflake.

The presence of an outrigger indicates a snowflake design, and so does a 3NF dimension containing 45 tables (and yes, I have seen that!).

Is a snowflake a star? I don't have a good answer for this.  When I discuss design options with colleagues, we think of it as an either/or choice. But when a design is complete, we refer to each fact table and its dimensions as a "star," even if there are outriggers.

One thing I am sure of is that both types of design are dimensional designs.  (So is a cube, by the way.)

Thanks to Pedro for the question.  Now I'm going to talk a bit more about snowflakes, for those who are interested.

When in doubt, don't snowflake

Best practices dictate that snowflakes should be avoided.  The reasons for this are wholly pragmatic.
  • They increase ETL complexity
  • They increase query complexity, harming "usability"
  • They reduce the understandability of the model
We could get into a discussion on this, but instead I'd like to look at some situations where these guidelines might not apply.

When snowflaking is acceptable

There are some cases where the guideline against snowflaking may be relaxed.
  • Judicious use of outriggers may be acceptable in limited cases where there are relationships between dimensions that must be browsable.  But consider a factless fact table.
  • Outriggers are necessary when there are repeating attributes such as a product with multiple features or a patient with multiple diagnoses.
  • Outriggers are helpful in situations where there is a recursive hierarchy such as departments that contain other departments, or regions that contain other regions.
  • Snowflaking may be justified when your software products require it.  (e.g. your DBMS or reporting tool.)
In the latter case, you are making a strategic decision rather than a design decision, and I recommend involving several points of view in the decision making process--not just designers.

Read more, and help support this blog:

I devote many pages to snowflake designs in Star Schema The Complete Reference.
  • Chapter 7 discusses snowflakes and hierarchies, and has an in-depth discussion of some of the issues touched on in this post.  
  • Chapter 9 discusses the use of outriggers and  bridge tables to capture repeating dimensions or attributes. 
  • Chapter 10 shows how you can use a bridged design to support a recursive relationship to very powerful effect.
  • Chapter 16 looks at some of the implications of the star vs. snowflake decision on your BI software 
If you use the link above (or the links in the sidebar) to order my book, a portion of the proceeds helps to support this blog.   You can also find the table of contents and other info right here.


Image credit: Gravityx9 licensed under Creative Commons 2.0

Friday, October 15, 2010

Q&A: Degenerate Dimensions, ETL and BI

A question from a reader about including dimensions in the fact table:

Q:  Question concerning an argument I am having with a colleague. We have a transaction fact table that will have an attribute called "Reason Overpaid". This attribute can only contain one of 10 values. Is it better to create a "Reason Overpaid" dimension and put a FK in the fact table referencing the dimension, or just have the "Reason Overpaid" description in the fact table?
A:  This is one argument I will not be able to settle.  Either approach is fine.

Stored in the fact table, this attribute would be known as a degenerate dimension.  It is perfectly acceptable there, but you may decide to move it to a separate table for other reasons. 

Degenerate Dimension

A degenerate dimension is nothing more than a dimension attribute stored in the fact table. This technique is commonly employed when there is an attribute left over that doesn't really fit into any of the other dimension tables.

Your "Reason Overpaid" attribute can be stored in the fact table as a degenerate dimension. You can still use it in the exact same way as any other dimension attribute -- as a way to filter queries, group results, break subtotals, and so forth.

Keeping it in the fact table avoids unnecessary complexity -- a new table and key attribute to manage and load, a new foreign key lookup when processing facts, and most importantly an extra join to include in queries.

That said, a dimension table for the attribute may make sense in some situations.

Junk Dimension

If there is more than one degenerate dimension, consider moving them all to a separate dimension table. This is called a junk dimension.  The attributes are not directly related to one another and there is no natural key.  It is populated with the Cartesian product of all possible attribute values. 

ETL Consistency Concerns

If your "Reason Overpaid" will also appear in other fact tables, worries about ETL consistency may arise.

Degenerate dimensions are still OK in this situation, but now two or more fact tables will contain the attribute, and it will be necessary to be sure it is loaded consistently.  Creating a separate dimension table allows the values to be created exactly once, avoiding any problems that might be created by inconsistent ETL processing.

While I would not go to a separate table for this reason, I do understand why many designers opt to do so.  The next situation is a different story.

BI Tool Capabilities

If your "Reason Overpaid" will also appear in other fact tables, the capabilities of your BI software may come into play.

The scenario is this:  you are configuring your BI tool to auto-generate SQL queries for users.  You'd like to have an item they can request called "Reason Overpaid", but the tool does not understand that it can appear in two places in the database schema.

Creating a dimension table for the attribute solves this problem. Both fact tables can link to the same dimension table. The tool can now have a definitive place to go for "Reason Overpaid", and may even be able to use it as the basis for comparing data in two fact tables. This is a strong reason to go with a separate table.

Luckily, many BI tools can be configured to acknowledge that a dimension may appear in more than one place, in which case this is not an issue.  And if you are building cubes for the purposes of BI reporting, you can trust your developers to choose the right attribute.

If you're interested in reading more about how BI tools may influence your dimensional design, be sure to check Chapter 16, "Design and Business Intelligence" in my latest book, Star Schema The Complete Reference.

Send in your questions

Have a question of your own about dimensional modeling?  Send it to the blog address in the sidebar.

Chris

Friday, October 1, 2010

Q&A: Accumulating Snapshots

A reader asks about 1:1 relationships in a star schema.
I am designing a superannuation data warehouse that will have a claims accumulating snapshot fact table. This table will contain a number of dates that will be filled in when a claim is being processed. From this we will be able to calculate time spans between the various claims processing stages.

It is intended that the fact table will contain a claim_key that will reference a claims dimension (1:1 relationship ).  I have been told by a few people that a 1:1 relationship is not a recommended design model.
- A reader in Australia

A:  There is absolutely nothing wrong with having a 1:1 relationship.  (That's true for both ER models and dimensional models.) 

Accumulating snapshots often cause the kind of disagreement you mentioned.  They are less common.  People mistakenly generalize characteristics of more common designs, assuming they apply to all stars. 

This is natural, but it leads to three incorrect assumptions:
  1. The fact table is on the "many" end of a 1:Many relationship with each dimension
  2. The fact table will have the most rows
  3. Fact table rows are not updated once they are loaded
These assumptions are all false when it comes to an accumulating snapshot.  (In fact, even a transaction fact table may violate the first two assumptions.) 

You've asked about the first one, but let me debunk all three:

  1. By definition the accumulating snapshot will have a 1:1 relationship with one of the dimension tables.  Why? Because its purpose is to track the progression of some discrete item through a business process.  Its grain is almost always stated in terms of a dimension -- e.g. "One Row per Claim."

  2. If the dimension can undergo type 2 slow changes, the accumulating snapshot will likely have FEWER ROWS than the dimension.  This is desirable, and I have written about it before.

  3. The purpose of the accumulating snapshot is to study lags between key events, while avoiding correlated sub queries.  Unlike other fact tables, this requires the row to be updated as more information becomes known about the item being tracked.

I've written about these topics a few times in the past, including the post mentioned above. I also explore these topics in much greater detail in Chapter 11 of Star Schema: The Complete Reference.

In the book, I also show that an accumulating snapshot can be used even when there is not a fixed, linear series of business steps being tracked.

Anyone who is not familiar with the accumulating snapshot is encouraged to learn more.  It is an essential design technique for many common business requirements.

Thanks for the question, and keep them coming...
-Chris


Image Credit:  Creativity103 via Creative Commons

Friday, July 2, 2010

Price Waterfall Analysis and Dimensional Design

Many businesses rely on information in the data warehouse to conduct price waterfall analysis.  There is no single way to design a schema to support waterfall analysis; your solution will vary according to the waterfall model being used and the availability of data elements.

Waterfall Analysis

Waterfall analysis represents pricing as a series of "leakages" or deductions from list price, usually grouped into major categories, with various intermediate assessments of price.  Target margins are also sometimes included.  The journey from "list price" to "pocket margin" may be expressed in terms of unit prices, average unit prices, or as percentages.

In the diagram below, for example, waterfall analysis represents "list price" as 100%.  Several types of "leakages" appear as percentage reductions.  The various approximations of price are shown in dark blue; leakages are shown in light blue.  Targets appear in red. 

"Invoice Price," for example is what appears on customer orders, and factors in several discounts.  "Net Price" factors in post-order discounts.  "Product Price" factors in sales and marketing costs, and "Pocket Margin" is the end result of the analysis.

Supporting from a Dimensional Model

From a dimensional design perspective, waterfall analysis is similar to profitability analysis.  There is not a single, "best way" to support waterfall analysis. The sophistication of your solution will vary based on the complexity of the waterfall model and the availability of data.

It is best to start by thinking about the different prices and leakages as additive facts, which implies extended amounts.  While prices and leakages can be specified as unit amounts, waterfall analysis is most flexible when it can be conducted across products, time periods or geography.  Storing unit amounts in a dimensional schema does not serve this well.   Instead, think about the waterfall components as facts that represent extended amounts.  These are fully additive, and can be summarized with maximum flexibility.  (Some graphics of waterfall analysis use unit amounts, but these are computed based on volume totals, rather than the reverse.)

A Value Chain

If thought of as additive facts, the various prices and leakages can be associated with one or more processes, each potentially represented by a fact table.  Together, these fact tables form a value chain, related by a set of common dimensions.  By combining the data from these stars, the waterfall can be constructed. 

Next, look at your business's waterfall model itself.  Are the data elements available?  At what level of detail?  Do some represent allocations?  What are the allocation rules?  This will drive the design of stars that capture key components of the waterfall.

In the illustration, the first several columns may be available directly from an order-entry system.  We can envision a single order_facts star, which contains facts capturing these extended amounts for each order line:  list price, volume discounts, corporate discounts, promotion discounts, and invoice price.

The order entry system may also support rebate processing, allowing the next two facts to be stored in the same table:  rebates and net price.  Or, it may be that these are computed at different periodicity and/or in a different system, in which case they may belong in a separate star.  If stored separately, tying them to the original order will allow them to correlate to an order date rather than the time of the rebate, allowing for "lagged" and "unlagged" flavors of analysis. 

Allocations

As in traditional profitability analysis, some components of waterfall analysis may be allocated.  Where this occurs, you must be sure that there is a definitive source of data.  Alternatively, there must be an agreed upon and consistent set of business rules used to construct allocations.  (Remember that data warehouses publish data; they do not manufacture it.)

The example above requires marketing and advertising costs be available by (or allocated to) quarter, product category and region.  This may be less granular than the data that supported the columns to their left.  Conformed dimensions will ensure that all data can be expressed at a common level of detail.

If there is no source of data or business rule, the desired model cannot and should not be supported by the data warehouse.  It is up to the business to develop the processes and systems to define these data elements, not the data warehouse.

Targets

You may also have targets for the various prices, which are specified by planners as percentages.  These will also need a home somewhere.  They are really non-additive facts, and will likely be stored in fact tables that associate them with time periods and plan version.  Alternatively, they may be incorporated into reports by hand.

Pulling it Together

Waterfall analysis reports, like the one above, can be constructed by querying the constituent stars and combining the results based on the common dimensions (such as products, time, or regions).  This process is sometimes referred to as "drilling across."

It is also possible to construct a single "derived" star (or cube), in which this consolidation is handled in the ETL process, rather than the reporting process.   This makes the analysis much easier to perform, since much of the hard work is taken care of in advance.

Variation in the Model

Some businesses have models that vary over time or according to products or categories, and may have support for waterfall analysis incorporated directly into operational systems.  This is accomplished on the operational side by setting up a breakdown of each order line according to a set of configurable waterfall components.

Such an operational solution may translate into a dimensional design that represents each transaction line as a series of rows in a single fact table, each line containing a single dollar amount. Each row represents one of the components in the waterfall chart.  A dimension determines which component is represented by each row.

This design approach offers nice flexibility, as the model can be changed without altering the design.  However, the reports themselves may require alteration as the model is refined.  Furthermore, the star itself is limited in its utility; its single fact must be carefully qualified each time it is aggregated.

Wednesday, May 26, 2010

Avoid Using Dates as Keys

This post explores the impact of using a date as the primary key for a dimension table. 

Q: I'd like to hear your thoughts on using dates as keys.  Is it really necessary to have a surrogate key for a date dimension?

Hartford, CT

A:  It is not absolutely necessary.  But you should use a surrogate key anyway.  Here's why.

Suppose you have a DAY dimension which contains one row for each calendar date.  In addition to the full date, it contains columns that represent the year, quarter, month, day of week, holiday status, and so forth.  It also contains columns that represent frequently used combinations of these various attributes.

Users (or developers) can use the columns to do things like filter queries, aggregate facts, sort results, and drive subtotaling.  Database administrators can index these columns to optimize these activities.

You might be tempted to use the full date as the primary key of the table.  This urge is understandable -- why bother creating an additional column called DAY_KEY?  Seems like extra work for nothing, right?

Consistency and Performance Issues

Remember that the primary key of a dimension table will also appear in fact tables.  If a date is used, users or developers will eventually try using it as a "shortcut" in queries, avoiding a join to the DAY table.

This leads to inconsistencies.  Developers may mean well, thinking they are "saving a join."  But they may apply different rules to compute things like fiscal periods.

Use of a day table guarantees consistent computation of all the characteristics of each day.  All the columns are computed exactly once, during the ETL process.  Everyone who uses them gets consistent results.

Moreover, avoiding the DAY table may generate performance issues.  If a report developer wants to filter for a particular day of the week, they may apply SQL date comparison functions to the day column in the fact table.  The database will probably have to make this comparison for each row in the fact table.

A day of the week in the DAY table is far more efficient, because it is pre-computed and indexed.
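
To illustrate, compare the two designs (hypothetical names; the date function shown is MySQL's):

    -- Date stored in the fact table: the function is evaluated row by row
    SELECT SUM(order_dollars)
    FROM   order_facts
    WHERE  DAYOFWEEK(order_date) = 7;   -- 7 = Saturday in MySQL

    -- Surrogate key design: filter on a pre-computed, indexed DAY attribute
    SELECT SUM(f.order_dollars)
    FROM   order_facts f
    JOIN   day d ON d.day_key = f.day_key
    WHERE  d.day_of_week = 'Saturday';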

You may think you can prevent these mishaps.  But you cannot.  Sooner or later, they will occur. Dates in the fact table will beckon.  A user, developer, or consultant will try to take a shortcut.  Perhaps under your watch, or perhaps after you have gone.  

Guarantee consistency and performance by using a surrogate key.

If you are curious, read on for some other, less important reasons.

Tuesday, March 2, 2010

Just the Facts?

A reader asks:
I’m having a problem in picking out facts. For example, are all dates facts -- start date, end date, and DoB?  What about personal data like post code, job title, status (active/left), etc.? Would you say that if you can't do math on it, it's not a fact?
- [Name Withheld], UK
Good question.  It allows me to cover the basics, as well as touch on some advanced topics.

First, the Basics..


Facts vs. Dimensions

Facts are measurements.  I really prefer that word, because it is a bit more descriptive.  But the rest of the world already knows "Facts".

Examples that are clearly measurements:
  • order dollars
  • gross pay
  • quantity sold
  • account balance
In the original question, all the attributes mentioned are dimensions.  (Dates, personal data, and so forth.)

Dimensions are used to give facts context.  You use dimension values to filter or group facts.  Here are some in action:
  • order dollars (fact) by order date (dimension)
  • gross pay (fact) by department code (dimension)
  • quantity sold (fact) for January (dimension value)
You can usually "do math" on facts.  The most common example is adding them together, or aggregating them, as might be done in the examples above.

Don't use that as a criterion to identify facts, though.  Many numbers are not facts. And, to make matters more complicated, some non-numbers are facts.

Many Numbers are Not Facts

Some numbers are really dimensions.  You can tell because we use them to give context to facts.  A numeric department code is obviously a dimension.  We wouldn't normally add the values together, but we might use them to break out a measurement -- for example: budget (fact) by department (dimension).

Less obvious are things like "unit price".  It sounds like it might be a fact. After all, we can certainly add up money, right?   Actually, when we add up sales, we don't add up unit prices, we add up "extended prices".  Extended price is a fact.  It is equivalent to unit price times quantity.

Look again, and you will also see that unit price behaves like the other dimensions we looked at.  We can use it to group or filter facts. "How many did we sell at a unit price of 10 cents?" That question filters a fact (quantity sold) by a dimension value (10 cent unit price).

Some Facts are Not Numbers

There are also facts that are not numbers.  These are rare, but they do occur.

Test results or grades are the most common example.  You can't do math on "pass" or "fail," but they are measurements.  Facts like this are sometimes called "text facts."  A test_result fact that can take on the values "pass" or "fail" might be stored in a fact table that contains a row for each student taking a particular test on a particular day.

If they take on a relatively low number of possible values, consider converting text facts into discrete facts.  In this case, we might replace test_result with two facts:  number_passed and number_failed.  For each row in the fact table, one of these will take on the value 1, the other the value 0.  This may seem wasteful, but now it is easy to get a sum(number_passed) or sum(number_failed.)
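
A sketch of the discrete facts in use (hypothetical names):

    -- One row per student per test; the text result is decomposed into
    -- two additive facts, number_passed and number_failed
    SELECT t.test_name,
           SUM(f.number_passed) AS students_passed,
           SUM(f.number_failed) AS students_failed
    FROM   test_result_facts f
    JOIN   test t ON t.test_key = f.test_key
    GROUP BY t.test_name;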

Some Stars Have No Facts

If you don't have any facts, you can still have a fact table.  Sometimes, the mere occurrence of an event is all you need to know about.  For example, you might use a fact table to track the fact that an employee was present on a given day.

Factless fact tables can also track conditions.  For example, each year, circumstances might make employees eligible to participate in particular programs.   Aside from the dimensions (Employee, Year and Program) there might not be any facts to record.  That's OK, you can still count up rows that meet various conditions.

- Chris

Do you have a question?  Send it in.  I answer all my email, though sometimes it takes a while.

Image: Jack Webb as Joe "Just the Facts" Friday,
from the Public Domain

Thursday, January 14, 2010

Accumulating Snapshots: Are They Necessary? [Q&A]

A reader asks the following question:

Q: We have an accumulating snapshot that tracks trouble tickets from submission to closure. The fact table has one row for each trouble ticket. Why do we need it? Can't we just place all this information in the trouble ticket dimension?
A: You probably do need the fact table, but it might be worth putting some of the information in the dimension table as well.

Before I explain, here is a quick refresher on the accumulating snapshot.

Accumulating Snapshot

An accumulating snapshot is a kind of fact table used to correlate activities surrounding a specified item. In this case, that item is a trouble ticket.


Accumulating snapshots are useful because they simplify the queries necessary to study elapsed time between events. In the reader's case, that might be the time between submission and closure of a ticket (there are probably some interim steps as well).

On the operational side, these events are probably recorded in different rows, and maybe even in different systems. Figuring out the days between these events can be messy, especially when averaging results for a large number of tickets. Correlated subqueries will be required, and performance may be poor.

The accumulating snapshot does the correlation of these activities ahead of time, as part of the ETL process. In this case, it has one row for each ticket. Presumably, there are foreign keys that identify the ticket, the date submitted, and the date closed. An optional fact can record the elapsed time between these dates. This fact is very easy to aggregate and group by any dimension of interest. No need for correlated subqueries.
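
For example, a query like this one (hypothetical names) averages resolution time with no correlated subquery at all:

    SELECT d.calendar_year, d.month_name,
           AVG(f.days_to_close) AS avg_days_to_close
    FROM   ticket_facts f
    JOIN   day d ON d.day_key = f.date_submitted_key
    GROUP BY d.calendar_year, d.month_name;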

You can read more about accumulating snapshots in a prior post.

The Fact Table is Necessary

The reader has noted that the fact table has one row per trouble ticket. (That's a good statement of grain, by the way.) Since there is only one row for each ticket, it might seem that the content of the fact table could be simply be placed in the ticket dimension.

This is probably not a good idea. Though your fact table has one row per ticket, your ticket dimension may have more than one, due to type 2 changes that track history.  For a particular trouble ticket, when a type 2 change occurs, a new row is added for the new version of the ticket. A given ticket may therefore have more than one row in the dimension.

If you were to place a fact in this dimension table, like the days between submission and closure, it would be repeated for each version of the same ticket. Aggregating it would result in double counting.

The fact is better placed in the fact table, where it can safely be summed, averaged, or whatever is necessary.

(By the way, the same would occur if you placed the two dates in the dimension table. They would repeat for each version of the ticket. If you try to aggregate the elapsed time between the dates, the same ticket would be counted multiple times -- once for each version.)

Place Information in the Dimension Also (Not Instead)


Information tracked by the accumulating snapshot may also be useful in the dimension table. But here, its purpose is to serve as dimensional data, not as a replacement for the fact table.

For example, the dates the ticket was opened and closed could make helpful type 1 attributes in the ticket dimension.  This is especially useful if the dimension is connected to other fact tables, such as one that tracks phone calls to the support center. The dates can be used to group call counts in interesting ways. For example, "How many support calls were generated by tickets opened on January 1 vs. January 2?"
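A sketch of such a query might look like this (again, table and column names are hypothetical):

    SELECT
        t.date_opened,
        COUNT(*) AS support_calls
    FROM phone_call_facts f                    -- one row per call to the support center
    JOIN ticket_dim t ON t.ticket_key = f.ticket_key
    WHERE t.date_opened IN ('2010-01-01', '2010-01-02')
    GROUP BY t.date_opened;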

If you want to, you can also place the elapsed days fact in the dimension table. Here, it should be used strictly as a behavioral dimension -- not as a fact.  Use it to group results; aggregating it would cause double counting. You might even want to convert it into bands to avoid confusion (1-5 days, 6-10 days, over 10 days).
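If you do band the value, the transformation can be handled during ETL with something along these lines (a sketch; the column and band names are hypothetical):

    CASE
        WHEN days_to_close BETWEEN 1 AND 5  THEN '1-5 days'
        WHEN days_to_close BETWEEN 6 AND 10 THEN '6-10 days'
        ELSE 'Over 10 days'
    END AS resolution_band

The banded value is stored as an ordinary dimension attribute and used only to group results.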

- Chris
Image Credit:  Creativity103 via Creative Commons

Do you have a question of your own? Send it in. I'd be happy to answer.

Wednesday, December 16, 2009

Dimensional Designs are not Denormalized

I don’t usually bother to take a stand on issues surrounding terminology or theory.  For this post I am going to make an exception.

I often hear people say that a star schema is a denormalized data model.  There is an element of truth hiding in those words, but I think it describes how some people design stars, and not what a star is.  A better description is unnormalized or not normalized. But even these words do not describe all stars.

Regardless of the words you choose, using the vocabulary of normalization is generally a disservice to the novice designer.  And none of these words should ever be used to define the star schema.

Continue reading to learn why.

Thursday, November 19, 2009

The Tolkien Effect

Schema designers must be on the lookout for data elements that are known by more than one name.  Equally common is the use of a single name to signify very different things.

It may surprise you to learn that there is an important connection between data warehouse design and The Lord of the Rings. In the books, author JRR Tolkien challenges readers by using many different names, without explanation, for the same character.

The character Aragorn, for example, is also known as Strider, Dunadan, the Heir of Isildur, and several other names and titles.  Each name, it turns out, is associated with a different culture or point of view.

At first, all this can be deeply confusing.  With a little effort and patience, however, things begin to make sense, and the end result can be deeply rewarding.

The Tolkien Effect in Business

The same kind of thing happens when gathering requirements for a dimensional model.  Within a business, it is commonplace to find several names for the same thing.   Different departments, for example, may refer to products in different ways.  Even within a department, there may be multiple names for the same thing.

Depending on your previous experience with the area in question, it may take you some time to realize this is going on. I will never forget the day  I realized that a finance group meant the same thing by Ten-Digit-Department, Level 3 Code and Budget Line. 

It’s crucial to identify these situations, or the capabilities of your model will be crippled.  Data elements of interest in multiple contexts should be given a single, shared definition in your model.  For dimensions in particular, this will be crucial in supporting analysis that crosses subject areas.

These shared dimensions are called conformed dimensions, and they are the key to avoiding stove-pipe subject areas. Even within a subject area, this can be crucial.  The Ten-Digit-Department realization was essential in permitting comparison of budgets to actuals.

The Reverse-Tolkien

The converse is also commonplace: a single name used to signify very different things.  The best example of this is “Sales.”  A salesperson will often use this word to refer to an order or contract.  In finance, however, the word is reserved for the event that allows the recognition of revenue, which is often fulfillment or shipment of the order.

Once again, it is crucial that analysts keep an eye out for these situations; failure to produce consistent, well-defined definitions for each fact or measurement is also a failure of conformance.  The result will be challenges to the accuracy of the data, distrust of the solution, and a lack of user adoption.

What You Can Do

How then to avoid these problems?  Listen.  Don’t assume you know what people mean. Repeat things back in your own words.  Be sure to write down and review definitions of each data element. 

Look out for The Tolkien Effect.  Pay close attention to people who live and work on the cusp of two subject areas or departments, as they will be keenly aware of these kinds of linguistic challenges.  So will the data administrator, if your organization has one.

-Chris

Image Attribution:  lrargerich
CC BY 2.0

Tuesday, September 8, 2009

More on Surrogate Keys

This post addresses a common follow-up question regarding surrogate keys:
"My source system already tracks changes. In this case, do dimension tables really need their own unique identifiers?"
Yes. There may be exceptions, but they are far less common than you may think, and they limit your future options.

Monday, April 27, 2009

More on Distinct Counts

Shortly before my recent post on unique counts, a reader sent in a related question that takes a slightly different perspective:
As fact tables could easily have a hierarchy of keys, the counts using the lowest grain can be counted via a count(*), and anything above that needs a count(distinct column_a).

Do you see many applications accommodating count(*)'s…by that I mean, making separate higher grain fact tables so ad-hoc users do not have to use a count distinct?

P. Petrini
As the reader points out, unique counts (or distinct counts) can be problematic for end users to put together. This may be particularly true if you are using a SQL-generating Business Intelligence (BI) tool.

For example, suppose an orders fact table has one row per order line. Its major dimensions are customer, product, salesperson and day. To count the number of distinct products ordered on a day, you would need to place count(distinct product_name) in your SQL.
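In full, such a query might look something like this (a sketch; table and column names are hypothetical):

    SELECT
        d.full_date,
        COUNT(DISTINCT p.product_name) AS products_ordered
    FROM order_facts f                         -- one row per order line
    JOIN product_dim p ON p.product_key = f.product_key
    JOIN day_dim     d ON d.day_key     = f.day_key
    GROUP BY d.full_date;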

That is something a developer can do in her sleep, but may be a bit much for a power-user. And it may be difficult to get a BI tool to generate this kind of SQL.

The reader correctly points out that this would not be necessary if the grain of the fact table precisely matches whatever it is that needs to be counted. All that would be necessary would be count(*) or something along those lines. So a summary table might help....

The Issue

The problem with this is that the summary table will simplify one and only one kind of count.

There will probably be a variety of different levels or time periods across which you want distinct counts, and each would require its own aggregate.

This could quickly become a lot of work -- simply to make it easier to formulate certain kinds of queries. And while it may make a particular query easier to express, it introduces new complexity for the user, who must now choose precisely the right star for each query.

For example, a summary fact table that has one row for each product for each day makes it easier to count the number of products that sold on a day. But this aggregate must omit other dimensions like salesperson and customer to be useful in this regard. If you also want to count other distinct things by day, each will require its own aggregate. Also want distinct counts by month? More aggregates.
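With an aggregate like that in place, the distinct count does reduce to a simple row count (hypothetical names again):

    SELECT
        d.full_date,
        COUNT(*) AS products_ordered
    FROM daily_product_sales s                 -- summary table: one row per product per day
    JOIN day_dim d ON d.day_key = s.day_key
    GROUP BY d.full_date;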

Not very efficient, unfortunately. The SQL for counting distinct things is simplified, to be sure. But now the ETL process is doing a lot more work, and users are faced with choosing the right aggregate for each query they build.

Alternatives

My suggestion is that, rather than build fact tables that eliminate the need for count(distinct), build and cache reports that do the work. People interested in these counts can access the reports, instead of writing SQL. If these counts are common, you might also find it easier to use an OLAP tool.

Not perfect, I know, but nothing ever is.

This is not to say that there is anything wrong with creating a series of summary tables for this purpose, or perhaps storing some pre-computed counts as discussed earlier.

As people who attend my classes know, my motto is "be pragmatic, not dogmatic." As long as everyone is aware of the pros and cons, and understands how the choice affects each aspect of the data warehouse implementation, go ahead and add summary tables if there is a consensus that they are the best way to make life easier.

Chris

Many thanks to P. Petrini, for consenting to have his question appear here.

If you have a question, you can email it to me. Click my profile picture at the top of this page for the address. I try to answer all questions, and might even print yours here.