Chris Adamson’s Blog: Non-additive Facts

Thursday, December 20, 2012

Dimensional Models: No E-R Fallback Required

People often suggest to me that a dimensional model cannot handle certain situations. These situations, the assertion holds, require falling back to entity-relationship modeling techniques.

Image by Patrick Hosley,
Licensed by Creative Commons 2.0

When we look together at their examples, though, I've always found a dimensional solution.

Here's a list of things people sometimes do not realize can be handled by standard dimensional techniques.

At the end of the post is a large collection of links to previous posts that treat these topics in more detail.

Note: This article should not be interpreted as a criticism of entity-relationship modeling. Rather, the point is to cover commonly overlooked capabilities of dimensional modeling.

Predictive Analytics: Storage of Granular Detail

I often hear people say: if you want to do predictive analytics, it's important to capture all the relationships among data attributes. This is incorrect.

The job of predictive analytics is to tell us what is important--not the reverse! Neither data mining nor predictive analytics requires information to be stored in an entity-relationship format.

As people who do this work will tell you, they don't care how the data is supplied. Nor do they care what modelers believe are the important relationships. Many tools will completely de-normalize the data before doing anything else.

In order to support predictive analytics, standard best practices of dimensional modeling apply:

Fact tables record business activities and conditions at the lowest level of detail possible in granular, first line stars.
Transaction identifiers are carried in the model if available.
Surrounding dimensions are enriched with detail to the fullest extent possible.
In the dimension tables, changes are tracked and time-stamped (“type 2”).

If we do these things with our dimensional data, we have not lost anything that is required to do predictive analytics.

Many-to-many Relationships: Multi-Valued Dimension Bridge

Usually, each fact in a star schema can be linked to a single member of a dimension. This is often mistaken for a rule, which leads to the incorrect conclusion that dimensional models cannot accommodate a fact that may link to multiple members of a dimension.

(This is often generalized as “dimensional cannot handle many-to-many relationships.”)

In a dimensional model, this situation is referred to as a multi-valued dimension. We use a bridge table to link the facts to the dimension in question. This special table sets up a many-to-many relationship between the facts and dimensions, allowing any number of dimension members to be associated with any number of facts.

Here are some examples:

In sales, a bridge table may be used to associate a sale, as recorded in a fact table, with multiple sales people.
In insurance, a single claim (again, tracked in a fact table) can be associated with multiple parties.
In healthcare, a single encounter may be associated with multiple providers or tests.
In government, a single audit or inspection may be associated with multiple findings.
In law enforcement, a single arrest may be associated with multiple charges.
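
As a rough illustration of the sales example, here is a minimal sketch using Python's built-in sqlite3 module. All table and column names are hypothetical. The allocation column, which splits each sale among the members of a group so that totals are not double counted, is an assumption added for this sketch and is not described in the post.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE salesperson (salesperson_key INTEGER PRIMARY KEY, name TEXT);
-- Bridge: one row per (group, member). The allocation weights are an
-- assumption of this sketch; within a group they sum to 1.
CREATE TABLE sales_group_bridge (
    group_key INTEGER, salesperson_key INTEGER, allocation REAL);
CREATE TABLE sales_facts (sale_id INTEGER, group_key INTEGER, amount REAL);

INSERT INTO salesperson VALUES (1, 'Ann'), (2, 'Bob');
INSERT INTO sales_group_bridge VALUES (10, 1, 1.0);
INSERT INTO sales_group_bridge VALUES (20, 1, 0.5), (20, 2, 0.5);
INSERT INTO sales_facts VALUES (100, 10, 200.0), (101, 20, 300.0);
""")

# Sale 100 belongs to Ann alone; sale 101 is shared by Ann and Bob.
allocated = dict(conn.execute("""
    SELECT s.name, SUM(f.amount * b.allocation)
    FROM sales_facts f
    JOIN sales_group_bridge b ON b.group_key = f.group_key
    JOIN salesperson s ON s.salesperson_key = b.salesperson_key
    GROUP BY s.name
""").fetchall())
```

Without some form of allocation, a query through the bridge would count the shared sale once for each salesperson, so report developers need to know when a bridge is in play.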

Repeating Attributes: Multi-Valued Attribute Bridge

It is sometimes suggested that dimensional models cannot gracefully accommodate repeating attributes. The dimensional solution is again a bridge table. This time, it is placed between the dimension table and an outrigger that contains the attribute in question.

Examples include:

Companies that have multiple Standard Industry Classification codes
People with multiple phone numbers or addresses
Accounts with multiple account holders
Patients with multiple diagnoses
Documents with multiple keywords.

Recursive Hierarchies: Hierarchy Bridge

Ragged hierarchies, unbalanced hierarchies, or recursive hierarchies are often cited as the downfall of the dimensional model. In fact, a solution exists, and it is extraordinarily powerful. The hierarchy bridge table allows facts to be aggregated either by rolling up or rolling down through the hierarchy, regardless of number of levels.

Examples include:

Parts that are made up of other parts
Departments that fall within other departments
Geographies that fall within other geographies
Companies that own other companies
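
To make the roll-up idea concrete, the following sketch assumes one common layout for a hierarchy bridge: one row for every ancestor and descendant pair, including each company paired with itself. The table names, column names, and data are hypothetical.

```python
import sqlite3

# Hypothetical schema: names and data are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE company (company_key INTEGER PRIMARY KEY, name TEXT);
-- One row per (ancestor, descendant) pair, including self-pairs.
CREATE TABLE company_hierarchy_bridge (
    ancestor_key INTEGER, descendant_key INTEGER);
CREATE TABLE order_facts (company_key INTEGER, amount REAL);

INSERT INTO company VALUES
    (1, 'Parent'), (2, 'Subsidiary'), (3, 'Sub-subsidiary');
INSERT INTO company_hierarchy_bridge VALUES
    (1, 1), (1, 2), (1, 3),
    (2, 2), (2, 3),
    (3, 3);
INSERT INTO order_facts VALUES (1, 100.0), (2, 40.0), (3, 10.0);
""")

# Roll up: each company's total includes orders of everything beneath it,
# however many levels deep the ownership chain goes.
rollup = dict(conn.execute("""
    SELECT c.name, SUM(f.amount)
    FROM company c
    JOIN company_hierarchy_bridge b ON b.ancestor_key = c.company_key
    JOIN order_facts f ON f.company_key = b.descendant_key
    GROUP BY c.name
""").fetchall())
```

Swapping the roles of ancestor_key and descendant_key in the joins would aggregate in the other direction through the same bridge.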

Relationships Between Dimensions: Factless Fact Tables

A star schema does not include relationships between dimension tables. This has led to the misconception that you can't track these relationships.

In fact, any important relationship between dimension tables can and should be captured. It is done using factless fact tables. (Dimensions are never directly linked because of the implications this would have on slow change processing.)

Examples include:

Employees filling a job in a department
Marketing promotions in effect in a geographical market
Students registered for courses
The primary care physician assigned to an insured party
Brokers assigned to clients
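
The student registration example might be sketched as follows. This is only an illustration with hypothetical table and column names; the point is that the fact table holds dimension keys and no numeric measures.

```python
import sqlite3

# Hypothetical schema for illustration; names are not from the post.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE student (student_key INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE course (course_key INTEGER PRIMARY KEY, title TEXT);
-- Factless fact table: only dimension keys, no numeric measures.
CREATE TABLE registration_facts (
    student_key INTEGER, course_key INTEGER, term TEXT);

INSERT INTO student VALUES (1, 'Ann'), (2, 'Bob');
INSERT INTO course VALUES (10, 'Algebra'), (20, 'Biology');
INSERT INTO registration_facts VALUES
    (1, 10, '2012-Fall'), (2, 10, '2012-Fall'), (1, 20, '2012-Fall');
""")

# Counting rows answers "how many students registered for each course?"
registrations = dict(conn.execute("""
    SELECT c.title, COUNT(*)
    FROM registration_facts r
    JOIN course c ON c.course_key = r.course_key
    GROUP BY c.title
""").fetchall())
```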

Subtyping: Core and Custom

Another situation where modelers often believe they must “fall back” on ER techniques is when the attributes of a dimension table vary by type. This variation is often misconstrued as calling for the ER construct known as subtyping. Similar variation might also be found with associated facts.

In the dimensional model, heterogeneous attributes are handled via the core and custom technique.

A core dimension captures common attributes, and type specific replicas capture the core attributes along with those specific to the subtype. Dimensions can then be joined to the fact table according to the analytic requirement. If there is variation in the facts, the same can be done with fact tables.

Examples include:

Products with attributes that vary by type
Customers that have different characteristics depending on whether they are businesses or individuals
In insurance, policies that have different characteristics depending on whether they are group or individual policies
In healthcare, medical procedures or tests that have different characteristics and result metrics depending on the test
In retail, stores that have varying characteristics depending on the type (e.g. owned, franchise, pocket)

Non Additive Metrics: Snapshots or Summary Tables

In a classic star schema design, facts are recorded at a granular level and “rolled up” across various dimensions at query time. Detractors often assume this means that non-additive facts have no place in a dimensional model.

In fact, dimensional modelers have several tools for handling non-additive facts.

Those that can be broken down into additive components are captured at the component level. This is common for many key business metrics such as:

Margin rate: stored as margin amount and cost amount in a single fact table. The ratio is computed after queries aggregate the detail.
Yield or conversion percentage: stored as quote count and order count in two separate fact tables, with the ratio computed after aggregation at query time.
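
A small worked example may help show why the components are stored rather than the ratio. The sketch below assumes that margin rate means margin amount divided by the sum of margin and cost; that definition and the sample figures are assumptions made for illustration.

```python
# Hypothetical fact rows; margin rate is assumed to be
# margin_amount / (margin_amount + cost_amount).
rows = [
    {"product": "A", "margin_amount": 50.0, "cost_amount": 50.0},   # rate 0.50
    {"product": "B", "margin_amount": 90.0, "cost_amount": 810.0},  # rate 0.10
]

# Sum the additive components first, then compute the ratio once.
total_margin = sum(r["margin_amount"] for r in rows)
total_cost = sum(r["cost_amount"] for r in rows)
margin_rate = total_margin / (total_margin + total_cost)

# Averaging the row-level rates gives a different, misleading number.
row_rates = [
    r["margin_amount"] / (r["margin_amount"] + r["cost_amount"]) for r in rows
]
average_of_rates = sum(row_rates) / len(row_rates)
```

Here the aggregate margin rate works out to 0.14, while the average of the two row-level rates is 0.30, because the second product carries far more volume.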

Other non-additive facts cannot be broken down into fully additive components. These are usually captured at the appropriate level of detail, and stored in snapshot tables or summary tables. Common examples include:

Period-to-date amounts, stored in a snapshot table or summary table
Distinct counts, stored in a summary table
Non-numeric grades, stored in a transaction-grained fact table

While these non-additive facts are less flexible than additive facts in terms of how they can be used, this is not a result of dimensional representation.

Conclusion

Every technique mentioned here is part of dimensional modeling canon. None are stopgaps or workarounds. While some may prove problematic for some of our BI software, these problems are not unique to the dimensional world.

In the end, the dimensional model can represent the same real-world complexities that entity-relationship models can. No ER fallback required.

- Chris

Learn More

All of these topics have been covered previously on this blog. Here are some links to get you started.

Establishing granular, detailed star schemas:

Build High Resolution Stars (January 7, 2011).

Multi-valued Dimensions:

Bridge to Multi-Valued Dimensions (February 9, 2011)
Bridge Tables and Many-to-many Relationships (April 25, 2011)
See the label bridges for more posts that deal with this topic.

Multi-Valued Attributes:

Resolve Repeating Attributes with a Bridge Table (January 28, 2011)
Q and A: Bridges are Part of Dimensional Modeling (March 7, 2011)
See the label bridges for more

Recursive Hierarchies:

Recursive Hierarchies and Bridge Tables (May 22, 2012)
Using a Hierarchy Bridge Without a Many-to-many Relationship (May 18, 2012)
And again, more will appear under the label bridges

Factless Fact Tables:

Factless Fact Tables (September 15, 2011)
See also the label Factless Fact Tables

Core and Custom:

There's not much on this topic on this blog right now, but see my book for more info (details below.)

Non-additive Facts:

Handling Rankings (October 10, 2012)
Storing Non-additive Facts (March 16, 2010)
More on Distinct Counts (April 27, 2009)
Dealing with Period-to-date Measurements (April 18, 2009)
See the category non-additive facts for more

Also consider checking out my book, Star Schema: The Complete Reference. It covers all these topics in much more detail:

Granular and detailed stars are covered in Chapter 3, "Stars and Cubes"
Multi-valued Dimensions and Multi-Valued attributes are covered in Chapter 9, "Multi-valued Dimensions and Bridges"
Hierarchy Bridges are covered in Chapter 10, "Recursive Hierarchies and Bridge Tables"
Factless Fact Tables are covered in Chapter 12, "Factless Fact Tables"
Core and Custom schemas are covered in Chapter 13, "Type-specific Stars"
Non Additive Facts are covered in Chapter 3, "Stars and Cubes," Chapter 11, "Transactions, Snapshots and Accumulating Snapshots" and Chapter 14, "Derived Schemas"

Use the links on this blog to order a copy from Amazon. There is no additional cost to you, and you will be helping support this blog.

Wednesday, October 10, 2012

Handling Rankings

Posted by Chris Adamson

Dimensional modelers often struggle with issues surrounding rankings. Learn when to store them, where to store them, and where not to store them, by reading this post.

We need to calculate Top N rankings across several dimensions (account, channel, hour, day, week, month.) There are massive data sets involved. This must be a common problem. Is there an approach you can point us towards?
- Jeremy

Rankings are time and context dependent. When you need to track them, they pose challenges that are similar to non-additive facts such as period-to-date measurements or distinct counts.

The solutions, it turns out, are also similar.

Rankings and their Context

When thinking about how to handle rankings, it is useful to break them down. A ranking is actually made up of several parts. There is the dimension being ranked, the dimensions for which the ranking will be computed, and the fact that is the source of the ranking.

For example, look at this simple statement:

"Show me the top 10 customers by week and product."

The dimension being ranked here is Customer.

The dimensional context of each ranking is a product and week. For example, if we have 20 products, then each week we will need 20 top ten lists -- one for each product.

The fact driving the ranking is not explicit in this question. It is probably spending, though it may be based on margin or transaction counts.
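
The three parts of a ranking can be seen in a short sketch. It assumes spending is the driving fact and uses made-up data and names. Customer is the dimension being ranked, and each (week, product) pair is one ranking context.

```python
from collections import defaultdict

# Hypothetical transactions: (week, product, customer, spending).
sales = [
    ("W1", "widget", "acme", 100.0),
    ("W1", "widget", "bolt", 250.0),
    ("W1", "widget", "acme", 200.0),
    ("W1", "widget", "cog", 50.0),
    ("W1", "gadget", "cog", 75.0),
    ("W2", "widget", "bolt", 10.0),
]

# Total the driving fact for each customer within each context.
totals = defaultdict(float)
for week, product, customer, amount in sales:
    totals[(week, product, customer)] += amount

by_context = defaultdict(list)
for (week, product, customer), amount in totals.items():
    by_context[(week, product)].append((customer, amount))

def top_n(n):
    """Return the top n customers for every (week, product) context."""
    return {
        context: [
            customer
            for customer, _ in sorted(items, key=lambda x: (-x[1], x[0]))[:n]
        ]
        for context, items in by_context.items()
    }

top_2 = top_n(2)
```

Each context gets its own list, which is why the number of possible rankings multiplies quickly as dimensions and time periods are added.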

That may sound simple, but things are usually a bit more complicated.

The reader, for example, does not just need a single ranking. He actually needs several -- daily, weekly, monthly. Presumably, these may be mixed and matched with the other dimensions he mentioned -- account and channel. In addition, there may be different rankings based on different facts.

Compute vs. Store

It may have occurred to you at this point that there are a multitude of possible rankings, even for a relatively simple schema.

In addition, once we compute a ranking, there is very little we can do with it. For example, we cannot sum it up over time periods, or other dimensions. We may be able to compare it to another ranking (a different period or dimension, for example), or treat it as reference data describing the thing being ranked. But that's about it.

For these reasons, many organizations choose to compute rankings within reports, or at query time. This provides maximum flexibility, and also sidesteps the ugly problem of having to store information that cannot be aggregated.

However, the reader indicates that data volumes are large. Let's assume they are so large that it is not feasible to keep the granular data around to compute rankings. Let's also assume that the rankings he needs are consistent over time. In his case, it may be necessary to compute all needed rankings and then store them.

Rankings fit with Snapshots or Aggregates

A ranking is only useful within the original context of its calculation. We won't really be able to aggregate it. In this respect, it is similar to a distinct count or a period-to-date measurement -- topics I have addressed in the past. (Some period-to-date metrics can be aggregated within a single period, but otherwise the comparison holds.)

The recommended solution is similar as well. Establish the granularity of the various rankings, then keep track of them in fact tables of appropriate grain. These will either be periodic snapshots, or aggregate tables that summarize appropriate time periods. The rank itself will be stored in the fact table, even though it is not additive.

Many developers balk at this solution, observing that it increases the number of fact tables that must be maintained. But note that these fact tables do not introduce any new dimension tables, and they can usually be sourced from transaction-grained stars. And the alternatives have even less appeal.

Rankings don't fit with Transactions

Most of the time, it is not useful to store rankings with transactions. Unless all the things being ranked happen to have transactions that occur at the same time, it will be impossible to compare the ranking with anything -- including other things being ranked!

Resist abstracted models

It is tempting to come up with designs that abstract the concept of a ranking. For example, one can imagine setting up a dimension that describes a ranking, the dimension being ranked, and the period. A fact table can be set up that carries the key for this metadata, plus 10 or 20 keys that will be used to link to top 1, top 2, etc.

Abstracted designs are appealing because they may appear to limit the number of stars to be maintained and don't require schema changes when new rankings are added. But note that the same amount of work must be done to load each ranking.

More importantly, these abstracted designs severely hamper usability. With such a design, it is much harder to merge information about a ranking with other data. A report that contains product level detail, for example, cannot easily pull in info indicating the product appeared in a Top 10 ranking -- this would require decoding metadata and then checking each of the various keys.

Further Reading

As mentioned, some of the challenges related to rankings are similar to those involving distinct counts or period-to-date metrics. I've covered these topics before:

Dealing with Period-to-date Measurements (4/18/2009)

Is a Unique Count a Fact or Dimension? (3/19/2009)

Storing Non-additive Facts (3/16/2010)

I also touch on the topic in Chapter 8 of Star Schema: The Complete Reference, where I address the proper home of a period-to-date measurement. Use the link to purchase a copy, and you will be helping to support this blog.

Image licensed via Creative Commons 2.0

from Patrick Hoesley

Tuesday, March 16, 2010

Storing Nonadditive Facts [Q&A]

Posted by Chris Adamson

It can be challenging to decide what to do with a nonadditive fact. A reader asks if (and when) it might make sense to store a percentage in a fact table:

Chris,

What are your thoughts on storing percentages in fact tables? Many of the calculations seem too complex to do in the BI tool. For instance, we calculate life-to-date percentages of KPIs on products.

Thanks,
Adrienne
Burlington, NC

Percentages are (usually) non-additive facts. What this means is that you cannot add them together. For example, if a computer is sold at 5% margin, and then another is sold at 5% margin, it is a mistake to say that total margin for the two sales is 10%.

In introductory texts or courses, you are taught to decompose them into fully additive components. The idea is that you can sum all the components, then compute the ratio at the last moment. The margin example above is a good application. Storing cost and sales price allows margin to be computed at any level of aggregation.

However, this only really works when all the components for a particular ratio are stored in the same row, and used within the same scope of aggregation. As you have discovered, there are some non-additive facts that do not work this way. A ratio that compares something to a larger pool or category is a good example. Two or more aggregations are needed to compute this kind of metric.
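
As an illustration of a ratio that needs two aggregations, consider each product's share of its category's sales. The sketch below uses hypothetical data; the numerator is aggregated by product and the denominator by category, so the two cannot come from the same row.

```python
from collections import defaultdict

# Hypothetical sales rows: (product, category, amount).
sales = [
    ("laptop", "computers", 600.0),
    ("desktop", "computers", 400.0),
    ("phone", "mobile", 500.0),
]

product_totals = defaultdict(float)   # first aggregation: by product
category_totals = defaultdict(float)  # second aggregation: by category
category_of = {}
for product, category, amount in sales:
    product_totals[product] += amount
    category_totals[category] += amount
    category_of[product] = category

# The ratio combines results from two different scopes of aggregation.
share_of_category = {
    product: total / category_totals[category_of[product]]
    for product, total in product_totals.items()
}
```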

It also sounds as if your problem is compounded by the need to use some kind of running-totals in the computation, which represent the lifetime-to-date sums. These are also problematic because they are also non-additive, or possibly semi-additive.

Non-additive measurements that are computed from different aggregations are often stored in fact tables for exactly the reason you give: they are difficult to automate in a BI tool. They are also precomputed in situations where they would otherwise harm performance.

If you decide to pre-compute and store a non-additive fact, there are usually two questions that follow: Where to store it, and how to use it. The issues here are very similar to those surrounding period-to-date measurements, as I discussed in a post last year.

Where to Store a Nonadditive Fact

Figuring out the answer to this question is closely linked to how it will be used. If it is needed at a transaction-level, it must go in a transaction-grained fact table.

However, if it also needs to be known at a daily or monthly level, that solution is not enough. Since it is non-additive, it must be computed separately for these aggregations. A periodic snapshot makes sense here. Consider forgoing the transaction-level version, unless it is needed as well.

Also note that "time" is not the only dimension across which we might need a summarized non-additive fact. For example, you might also want to see it across product categories, customer categories, and so forth. Each would require its own snapshot-style fact table for discrete storage of the pre-computed fact.

In fact, this is the biggest problem with storing non-additive facts: each potential aggregation must be stored separately.

Nonadditive Facts and BI Software

Because they are nonadditive (and may involve constituents that are semi-additive), these kinds of facts must be used carefully. For example, they should never be summed, subtotaled, etc. Similarly, if we need to show them at different levels of detail, we need to fetch each aggregation separately.

This is relatively straightforward if the schema is well thought out, well documented, and fully understood by report developers.

However, it is very hard to make this kind of fact available in a "Self-Service" mode for end users, since most BI tools will easily break both rules. (BI tools can usually handle simple non-additive facts, like the margin example at the beginning, but not facts that are computed from multiple aggregations.)

Alternatives may include:

Hiding the non-additive fact from users while exposing it to developers
Classifying it as a dimension so that it is not aggregated, or
Blocking certain features of the tool.

The fact that a non-additive fact may be stored in multiple locations (corresponding to different levels of aggregation) may also befuddle your BI software.

It may be possible to make use of "aggregate navigation" features to teach it to fetch a fact from more than one possible location, but you will need to make sure that the tool always gets it right. In addition, using such a feature may prevent you from "turning off" the tool's ability to aggregate a fact.
Separate semantic layers for developers vs. users may be the solution to these problems.

Again, these considerations are less important in cases where trained developers do all the report building. It is only when leveraging SQL-generating BI tools that you need to worry.

Another Option

In some cases, non-additive facts can be linked to a specific member of a dimension, such as an individual customer or product. In this case, consider storing it in the dimension table, as a behavioral dimension. This is most effective when the measurement in question can be refreshed on a pre-defined schedule.

Thanks to Adrienne for her question, and for consenting to have it reprinted here!

- Chris

Do you have a question about dimensional design or data warehousing? Send it in. I answer all my email, though sometimes it takes a while.

Image is by Patrick Hosely licensed under Creative Commons 2.0

Monday, April 27, 2009

More on Distinct Counts

Posted by Chris Adamson

Shortly before my recent post on unique counts, a reader sent in a related question that takes a slightly different perspective:

As fact tables could easily have a hierarchy of keys, the counts using the lowest grain can be counted via a count(*), and anything above that needs a count(distinct column_a).

Do you see many applications accommodating count(*)'s…by that I mean, making separate higher grain fact tables so ad-hoc users do not have to use a count distinct?

P. Petrini

As the reader points out, unique counts (or distinct counts) can be problematic for end users to put together. This may be particularly true if you are using a SQL-generating Business Intelligence (BI) tool.

For example, suppose an orders fact table has one row per order line. Its major dimensions are customer, product, salesperson and day. To count the number of distinct products ordered on a day, you would need to place count(distinct product_name) in your SQL.

That is something a developer can do in her sleep, but may be a bit much for a power-user. And it may be difficult to get a BI tool to generate this kind of SQL.

The reader correctly points out that this would not be necessary if the grain of the fact table precisely matches whatever it is that needs to be counted. All that would be necessary would be count(*) or something along those lines. So a summary table might help....
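
The two approaches can be compared side by side. This sketch uses Python's built-in sqlite3 module with hypothetical table names and data; the summary table has one row per product per day, so a plain count(*) returns the same answer as count(distinct product_name) against the detail.

```python
import sqlite3

# Hypothetical order-line data; names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE order_facts (
    day TEXT, product_name TEXT, customer TEXT, quantity INTEGER);
INSERT INTO order_facts VALUES
    ('2009-04-01', 'widget', 'acme', 2),
    ('2009-04-01', 'widget', 'bolt', 1),
    ('2009-04-01', 'gadget', 'acme', 5),
    ('2009-04-02', 'widget', 'acme', 3);

-- Summary table whose grain is exactly one row per product per day.
CREATE TABLE product_day_summary AS
    SELECT day, product_name, SUM(quantity) AS quantity
    FROM order_facts
    GROUP BY day, product_name;
""")

# Against the detail, a distinct count is required.
distinct_counts = dict(conn.execute("""
    SELECT day, COUNT(DISTINCT product_name)
    FROM order_facts GROUP BY day
""").fetchall())

# Against the summary, counting rows is enough.
summary_counts = dict(conn.execute("""
    SELECT day, COUNT(*)
    FROM product_day_summary GROUP BY day
""").fetchall())
```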

The Issue

The problem with this is that the summary table will simplify one and only one kind of count.

There will probably be a variety of different levels or time periods across which you want distinct counts, and each would require its own aggregate.

This could quickly become a lot of work -- simply to make it easier to formulate certain kinds of queries. And while it may make a particular query easier to express, it introduces new complexity for the user, who must now choose precisely the right star for each query.

For example, a summary fact table that has one row for each product for each day makes it easier to count the number of products that sold on a day. But this aggregate must omit other dimensions like salesperson and customer to be useful in this regard. If you also want to count other distinct things by day, each will require its own aggregate. Also want distinct counts by month? More aggregates.

Not very efficient, unfortunately. The SQL for counting distinct things is simplified, to be sure. But now the ETL process is doing a lot more work, and users are faced with choosing the right aggregate for each query they build.

Alternatives

My suggestion is that, rather than build fact tables that eliminate the need for count(distinct), build and cache reports that do the work. People interested in these counts can access the reports, instead of writing SQL. If these counts are common, you might also find it easier to use an OLAP tool.

Not perfect, I know, but nothing ever is.

This is not to say that there is anything wrong with creating a series of summary tables for this purpose, or perhaps storing some pre-computed counts as discussed earlier.

As people who attend my classes know, my motto is "be pragmatic, not dogmatic." As long as everyone is aware of the pros and cons, and understands how it affects each aspect of the data warehouse implementation, then if there is a consensus that adding summary tables is the best way to make life easier, go ahead and do it.

Chris

Many thanks to P. Petrini, for consenting to have his question appear here.

If you have a question, you can email it to me. Click my profile picture at the top of this page for the address. I try to answer all questions, and might even print yours here.

Saturday, April 18, 2009

Dealing with Period-to-Date Measurements (MTD, YTD and so forth)

Posted by Chris Adamson

A few weeks ago, I touched on some problems you might encounter when trying to store period-to-date measurements in a fact table. Today, I want to address the issue more directly.

Period-to-date measurements can summarize just about any fact over a period of time. A measurement of month-to-date sales, for example, aggregates sales from the beginning of the month to the date in question. Month-to-date, quarter-to-date and year-to-date facts are often used to do things like compute commissions, volume discounts, etc.

Period-to-date Facts and Additivity

Most facts in a star schema can be summed up across any and all dimensions. Sales dollars, for example, can be summed across days, customers, stores, or whatever dimension makes sense to the person doing the analysis. We call this kind of fact fully additive.

The basic issue with storing period-to-date facts is this: they are never fully additive. It does not make any sense to sum up period-to-date measurements that were taken at different times. Month-to-date sales for a customer on January 14, 15 and 16, for example, cannot be meaningfully added together.

Period-to-date measurements are semi-additive; they can be summed up across some dimensions, but not time.

It might make sense to aggregate together several period-to-date measurements if they were taken at the same time. The year-to-date sales of all customers who made a purchase on a particular day, for example, might factor into some useful analysis.
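
A short sketch with made-up numbers shows the difference. Customer names, dates, and amounts are hypothetical, and each customer's entries are assumed to be sorted by day.

```python
from itertools import accumulate

# Hypothetical January sales: customer -> list of (day, amount), sorted by day.
daily_sales = {
    "A": [(1, 200.0), (14, 100.0), (15, 50.0), (16, 25.0)],
    "B": [(16, 100.0)],
}

def month_to_date(entries):
    """Map each day to the running total of sales up to and including it."""
    days = [day for day, _ in entries]
    running_totals = accumulate(amount for _, amount in entries)
    return dict(zip(days, running_totals))

mtd = {customer: month_to_date(e) for customer, e in daily_sales.items()}

# Summing one customer's MTD values across days repeats earlier sales.
summed_across_days = sum(mtd["A"][day] for day in (14, 15, 16))

# The MTD value on the last day is the real figure for that customer.
actual_mtd_on_16th = mtd["A"][16]

# Summing MTD values taken on the same day, across customers, is meaningful.
all_customers_on_16th = sum(values[16] for values in mtd.values())
```

The sum across January 14, 15 and 16 for customer A comes to 1,025, although that customer's sales through the 16th total only 375. Adding the two customers' values for the 16th gives 475, which is a sensible number.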

Storing a period-to-date fact with transactions

Since it makes no sense to sum up period-to-date measurements across different points in time, they have very limited use in a transaction-grained fact table.

For example, a transaction-grained fact table that stores a row for each order-line of each order might contain dimensions that capture the order_line number, date and time of the order, product sold, and so forth. You might store a period-to-date measurement in such a fact table, but there will be little or no opportunity to sum this fact across different order lines. Only orders with the same date/time can be meaningfully summed. And if there are multiple order-lines for a given order, even this may not make sense, since the period-to-date measurement would be stored redundantly.

Stored in this kind of fact table, the period-to-date measurement has very limited use. It can really only be used to study the most granular data in the fact table. For example, it might be used to compute an applicable discount, or a commission percentage that escalates with sales volume.

But is this needed in the fact table? Such metrics are usually highly operational. They may influence other metrics, which have more analytic value and happen to be additive. The commission paid or discount given, for example, may be computed using period-to-date information, but itself is fully additive. This will be useful to store in a fact table.

This does not mean that period-to-date measurements have no place in a transaction-grained fact table. If they are central to the business, or to analysis of the process, then place them there. They may, however, cause excessive growth in the row size, since it is possible to enumerate multiple period-to-date versions of any given fact.

Storing a period-to-date fact with period aggregates or snapshots

The natural home for a period-to-date measurement is a fact table with a periodic-snapshot grain, or a fact table that aggregates across the time dimension.

An aggregate that summarizes the time dimension sums up data from a transaction-based fact table over a particular period. For each period summarized, the resulting rows all summarize the same period. This means that each can contain period-to-date measurements that might be meaningfully summed, at least within the period. (Note I said "might." There is a caveat, which I will get to in a second.)

An aggregate that records monthly sales totals, for example, is potentially a good place to include a quarter-to-date or year-to-date measurement. Each row in the table will summarize the same period, so the period-to-date measurements may be meaningfully summarized.

The same goes for a periodic snapshot fact table. Like a periodic aggregate, each row in a snapshot summarizes a particular period. The main difference is that a snapshot will contain some form of unique status measurement that cannot be found in the transactions, such as a balance or level. Once again, a semi-additive period-to-date fact can be stored here.
Snapshots record the status of things at fixed time intervals. A monthly snapshot of bank accounts, for example, records the day-end balance of each account. This is a logical place to store month-to-date or other such period-to-date measurements.

Caution: Exactly what is being measured period-to-date?

Even within an aggregate or snapshot period, period-to-date facts may not be additive. It is important to ask yourself: exactly what is being measured period-to-date? If it does not coincide with the grain of the table, it may be non-additive.

For example, a monthly aggregate stores sales by product and customer. If we want to store year-to-date customer sales in this table, the same fact will be repeated multiple times if a single customer purchased multiple products. It is not additive across products; it is repeated if the customer bought more than one product.

On the other hand, if the period-to-date measurement captures sales by customer and product, it coincides with the grain of the table. It can be meaningfully aggregated across any dimension but time.
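
The grain mismatch is easy to see with a few rows. The data below is hypothetical: a monthly aggregate by customer and product that also carries a year-to-date figure measured at the customer level.

```python
# Hypothetical monthly aggregate rows, grain: one row per customer and product.
# ytd_customer_sales is measured per customer, so it repeats across products.
rows = [
    {"customer": "acme", "product": "widget",
     "month_sales": 100.0, "ytd_customer_sales": 900.0},
    {"customer": "acme", "product": "gadget",
     "month_sales": 50.0, "ytd_customer_sales": 900.0},
    {"customer": "bolt", "product": "widget",
     "month_sales": 70.0, "ytd_customer_sales": 300.0},
]

# month_sales matches the grain, so it sums safely across products.
total_month_sales = sum(r["month_sales"] for r in rows)

# Summing the customer-level YTD figure counts acme twice.
naive_ytd_total = sum(r["ytd_customer_sales"] for r in rows)

# Taking one value per customer before summing avoids the double count.
ytd_by_customer = {r["customer"]: r["ytd_customer_sales"] for r in rows}
correct_ytd_total = sum(ytd_by_customer.values())
```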

Reporting

The old stand-by is to compute a period-to-date measurement in a query or report. It is calculated from transactions at run-time. This allows access to any conceivable period-to-date metric, and may also be used as a behavioral qualification rather than a fact.

Such queries may be particularly nasty or time consuming, but they can be run during batch windows and cached to compensate.

My next post will be another Q&A. Please send in your questions!

Chris

Thursday, March 19, 2009

Q&A: Is a Unique Count a Fact or Dimension?

Posted by Chris Adamson

A reader recently sent in a question about unique counts. Are they facts, or dimensions?

Here is a paraphrased version of their question. The business example has been changed.

Q: We report on the number of unique customers that visit a location. We do this at daily, monthly and annual levels, and also look at the same thing across all locations. We originally modeled the customer counts as facts. The problem is that if we compute unique customer counts by day, they do not correctly roll up to month. Does it make more sense to treat this as a dimension?

A: First, let me say that customer counts are certainly facts. They are important metrics that are colored by the dimensions we use to define them.

The difficulty with counts of distinct things, as you have discovered, is that they are non-additive facts. As soon as you use a set of dimensions to create a distinct count -- such as day, location and customer -- you have an aggregation that cannot be further summed. If you need to go to the monthly level, or across all locations, you need to go back to the granular data.
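A small Python sketch shows why the roll-up fails. The visit rows are made up for illustration; any data in which a customer appears on more than one day or at more than one location behaves the same way.

```python
# Hypothetical granular data: one row per customer visit.
visits = [
    # (day, location, customer)
    ("2009-03-01", "North", "Ann"),
    ("2009-03-01", "North", "Bob"),
    ("2009-03-02", "North", "Ann"),  # Ann returns the next day
    ("2009-03-02", "South", "Ann"),  # and also visits another location
    ("2009-03-02", "South", "Cy"),
]

# Distinct customers per (day, location).
customers_by_day_location = {}
for day, location, customer in visits:
    customers_by_day_location.setdefault((day, location), set()).add(customer)
daily_counts = {key: len(names) for key, names in customers_by_day_location.items()}

# Summing the daily counts overstates the monthly figure.
summed_from_daily = sum(daily_counts.values())

# The correct monthly count goes back to the granular rows.
actual_monthly = len({customer for _, _, customer in visits})

print(summed_from_daily)  # 5
print(actual_monthly)     # 3
```

Ann is counted three times in the summed figure, once for each day and location combination in which she appears.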

You have a few options.

Option 1: Compute in reports

The first is to compute unique counts in reports. This can be time-consuming, and a major hassle if there are several counts that must be computed frequently. You might be able to schedule the reports to run in a batch window. This deals with the processing issues, but not the report or query complexity.

In my view, however, it is the best option. There are other possibilities, but each has significant drawbacks.

Option 2: Create snapshots or aggregates

A second option is to create some snapshot fact tables that compute the various distinct counts at various periodic intervals. These will contain non-additive facts, and your developers will have to understand that they should not be aggregated. Depending on their design, these tables may more closely resemble aggregate tables.

This approach may require quite a few fact tables, since each of your unique counts will have a different grain -- daily by location, monthly by location, annually by location, daily across all locations, monthly across all locations, etc.
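The sketch below illustrates, in Python, how each grain ends up as its own table. The grain names, the build_snapshot helper and the visit rows are hypothetical; in practice this work would be done by the load process rather than in application code.

```python
from collections import defaultdict

# Hypothetical granular data: one row per customer visit.
visits = [
    ("2009-03-01", "North", "Ann"),
    ("2009-03-01", "North", "Bob"),
    ("2009-03-02", "North", "Ann"),
    ("2009-03-02", "South", "Ann"),
    ("2009-03-02", "South", "Cy"),
]

# Each grain maps a visit to the key of the snapshot row it belongs to.
grains = {
    "daily_by_location": lambda day, location: (day, location),
    "monthly_by_location": lambda day, location: (day[:7], location),
    "monthly_all_locations": lambda day, location: (day[:7],),
}

def build_snapshot(visits, key_fn):
    """Return {row key: distinct customer count} for one grain."""
    customers = defaultdict(set)
    for day, location, customer in visits:
        customers[key_fn(day, location)].add(customer)
    return {key: len(names) for key, names in customers.items()}

# One separate table per grain, each rebuilt from the granular rows.
snapshots = {name: build_snapshot(visits, fn) for name, fn in grains.items()}

print(snapshots["monthly_by_location"])
# {('2009-03', 'North'): 2, ('2009-03', 'South'): 2}
print(snapshots["monthly_all_locations"])
# {('2009-03',): 3}
```

Note that the two monthly-by-location counts sum to 4, while the all-locations count is 3. None of these tables can be derived from another by summing, which is why each one must be built and maintained separately.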

To make life a little easier for report developers, designers sometimes replicate the monthly and annual totals in the daily level snapshot. Developers find this appealing, because all the information to produce subtotals over time can be fetched from a single row, rather than three queries. The flipside is that the monthly and annual data is stored redundantly in a lot of places, and must be maintained consistently and used properly.

Option 3: Behavioral Dimensions

If the current period is the primary focus, a third option is to create behavioral dimensions. For example, in the store dimension, you can add attributes that represent counts of unique customers for the current month and year. I mentioned behavioral dimensions in a previous Q&A post, and the concept is the same here.

Option 4: Creating Special Facts that can be Summed

Lastly, some designers like to create special facts in the fact table that can be used to sum unique counts. If you have a fact table that gets a row each time a customer visits a location, you may be able to tweak it a bit. Add an additional fact called first_visit_of_month that only contains a 1 the first time the customer visits any location during the month. First_visit_of_year will only contain a 1 the first time the customer visits during the year. These facts can be summed to produce unique counts for various time frames, but you need to be careful not to aggregate them across the periods they represent.

Another drawback is that, once again, numerous special facts will be required. To compute unique counts by location, you will need facts like first_visit_this_location_this_month. This fact should not be aggregated across months or locations. That's a lot of rules, and they are likely to be broken.
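Here is a minimal Python sketch of the special-fact technique, covering only the "any location" flags. The function add_first_visit_flags and the sample visits are assumptions for illustration, and the rows are assumed to arrive in date order.

```python
# Hypothetical visit rows, already sorted by date.
visits = [
    # (date, location, customer)
    ("2009-03-01", "North", "Ann"),
    ("2009-03-01", "North", "Bob"),
    ("2009-03-02", "South", "Ann"),
    ("2009-03-15", "North", "Cy"),
    ("2009-04-02", "North", "Ann"),
]

def add_first_visit_flags(visits):
    """Attach first_visit_of_month and first_visit_of_year (0 or 1) to each row."""
    seen_in_month = set()
    seen_in_year = set()
    flagged = []
    for date, location, customer in visits:
        month_key = (customer, date[:7])
        year_key = (customer, date[:4])
        flagged.append({
            "date": date,
            "location": location,
            "customer": customer,
            "first_visit_of_month": 0 if month_key in seen_in_month else 1,
            "first_visit_of_year": 0 if year_key in seen_in_year else 1,
        })
        seen_in_month.add(month_key)
        seen_in_year.add(year_key)
    return flagged

facts = add_first_visit_flags(visits)

# Safe: sum the monthly flag within a single month.
march_unique = sum(f["first_visit_of_month"] for f in facts if f["date"][:7] == "2009-03")

# Safe: sum the yearly flag within a single year.
year_unique = sum(f["first_visit_of_year"] for f in facts if f["date"][:4] == "2009")

# Unsafe: summing the monthly flag across March and April counts Ann twice.
two_month_sum = sum(f["first_visit_of_month"] for f in facts)

print(march_unique)   # 3
print(year_unique)    # 3
print(two_month_sum)  # 4, although only 3 distinct customers visited
```

The last figure is the kind of mistake the rules are meant to prevent: the flag is only valid within the period it represents.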

As you can see, no option looks perfect. Don’t let that lead you to push back on requirements, however. Distinct counts are often very important metrics, particularly when studying customer service or processing efficiency.

Thanks for the question!

- Chris

If you have a question, you can email it to me. Click my profile picture at the top of this page for the address. I try to answer all questions, and might even print yours here.
