I don’t usually bother to take a stand on issues surrounding terminology or theory. For this post I am going to make an exception.
I often hear people say that a star schema is a denormalized data model. There is an element of truth hiding in those words, but I think they describe how some people design stars, not what a star is. A better description is unnormalized or not normalized. But even these words do not describe all stars.
Regardless of the words you choose, using the vocabulary of normalization is generally a disservice to the novice designer. And none of these words should ever be used to define the star schema.
Continue reading to learn why.
Wednesday, December 16, 2009
Wednesday, December 9, 2009
Rule 1: State Your Grain
Posted by
Chris Adamson
Make sure you have a statement of grain for each fact table or cube in your dimensional design.
I receive lots of questions from people who are working through an issue with a star or cube. Most of the time, I must counter with a question of my own: “What is the grain?” Without this basic information, it is usually impossible to comment on whatever design issue the person is facing.
Being able to state grain is important, and not just for its value as a conversation starter. Here’s a brief look at what grain is, how to define it, and what happens if you don’t.
What is Grain
A statement of grain identifies what is represented by a single, granular row in a fact table (or an un-aggregated measure in a cube.) Each and every row in the fact table should meet this definition, with no wiggle-room.
When grain is not explicitly defined, or is defined in an ambiguous way, all manner of problems may arise. In addition to hampering your ability to talk about the design with someone else, ill-defined grain can cause severe technical challenges in the reporting process. It may even lead to reports that are just plain wrong.
Defining Grain
There are two ways to state the grain of a fact table. The first way is to use business language, referencing a specific, well-understood artifact of the activity described by the star. For example, a star that describes orders may have the following grain: “Order measurements at the order-line level of detail.” That sums things up pretty well. Each row of the fact table corresponds to a single order line.
Sometimes it is not easy to state grain in this manner. When that’s the case, you use the dimensions in a star to indicate what each unique row in the fact table represents. For example, a star that tracks the processing milestones of mortgage applications might have the following grain: “Processing measurements by application and status.” This fact table will have a new row each time an application (one dimension) undergoes a change in status (another dimension.)
When using dimensional terms to define grain, do not simply rattle off all the dimensions present in the star. Instead, list only those that are necessary to define a unique row. In the mortgage status change example, it is not necessary to mention that the star will also contain dimensions for date, customer, mortgage officer, mortgage product, and so forth.
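To make this concrete, here is a minimal sketch of a fact table with that grain. The table and column names are hypothetical, and it assumes an application passes through a given status at most once:

    -- Grain: one row per application per status reached (hypothetical names)
    create table mortgage_status_facts (
        application_key       integer not null,  -- grain-defining dimension
        status_key            integer not null,  -- grain-defining dimension
        status_date_key       integer not null,  -- additional dimension, not part of the grain statement
        mortgage_officer_key  integer not null,  -- additional dimension
        days_in_prior_status  integer,           -- a processing measurement
        primary key (application_key, status_key)
    );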
Business definitions of grain are usually used for transaction fact tables. Dimensional grain definitions are commonly used for snapshots, accumulating snapshots, derived schemas and aggregates.
Fuzzy Grain
Poorly defined grain can lead to trouble. Ill-defined grain can mask a situation where a fact table is actually being used to track two or more processes. It can also lead to situations in which the fact table contains two or more levels of aggregation. In the former case, single-process reporting will be hampered. BI developers will be bending over backwards to focus on the relevant subset of data. In the latter case, double-counting, triple-counting or worse is possible.
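One simple way to test a stated grain is to look for rows that violate it. Here is a sketch, assuming a fact table named order_facts whose declared grain is the order line, with degenerate dimensions for the order and order-line numbers; all names are hypothetical:

    -- Any rows returned mean the data sits at more than one level of detail,
    -- or the grain statement does not match reality
    select order_number, order_line_number, count(*) as row_count
    from order_facts
    group by order_number, order_line_number
    having count(*) > 1;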
This is why “state your grain” is rule #1 in any dimensional design.
-Chris
Thursday, November 19, 2009
The Tolkien Effect
Posted by
Chris Adamson
Schema designers must be on the lookout for data elements that are known by more than one name. Equally common is the use of a single name to signify very different things.
It may surprise you to learn that there is an important connection between data warehouse design and The Lord of the Rings. In the books, author JRR Tolkien challenges readers by using many different names, without explanation, for the same character.
The character Aragorn, for example, is also known as Strider, Dunadan, the Heir of Isildur, and several other names and titles. Each name, it turns out, is associated with a different culture or point of view.
At first, all this can be deeply confusing. With a little effort and patience, however, things begin to make sense, and the end result can be deeply rewarding.
The Tolkien Effect in Business
The same kind of thing happens when gathering requirements for a dimensional model. Within a business, it is commonplace to find several names for the same thing. Different departments, for example, may refer to products in different ways. Even within a department, there may be multiple names for the same thing.
Depending on your previous experience with the area in question, it may take you some time to realize this is going on. I will never forget the day I realized that a finance group meant the same thing by Ten-Digit-Department, Level 3 Code and Budget Line.
It’s crucial to identify these situations, or the capabilities of your model will be crippled. Data elements of interest in multiple contexts should be given a single, shared definition in your model. For dimensions in particular, this will be crucial in supporting analysis that crosses subject areas.
These shared dimensions are called conformed dimensions, and they are the key to avoiding stove-pipe subject areas. Even within a subject area, this can be crucial. The Ten-Digit-Department realization was essential in permitting comparison of budgets to actuals.
The Reverse-Tolkien
The converse is also commonplace: a single name used to signify very different things. The best example of this is “Sales.” A salesperson will often use this word to refer to an order or contract. In finance, however, the word is reserved for the event that allows the recognition of revenue, which is often fulfillment or shipment of the order.
Once again, it is crucial that analysts keep an eye out for these situations; failure to produce consistent, well-defined definitions for each fact or measurement is also a failure of conformance. The result will be challenges to the accuracy of the data, distrust of the solution, and a lack of user adoption.
What You Can Do
How then to avoid these problems? Listen. Don’t assume you know what people mean. Repeat things back in your own words. Be sure to write down and review definitions of each data element.
Look out for The Tolkien Effect. Pay close attention to people who live and work on the cusp of two subject areas or departments, as they will be keenly aware of these kinds of linguistic challenges. So will the data administrator, if your organization has one.
-Chris
Tuesday, October 13, 2009
Highly Summarized Dimensions
Posted by
Chris Adamson
From the inbox today, a question about aggregate design:
Q: "If you roll up a dimension that is hierarchical, but you are now left with the highest level of the hierarchy (no further roll-up would be possible in that dimension) - would you maintain a separate dimension for this in the aggregated star?The reader has a potential aggregate that will summarize more than one dimension at a very high level. For example, an aggregate of sales that summarizes the customer dimension by country and summarizes the product dimension by category.
"Or would you join a number of these unrelated top hierarchy levels in a junk dimension?"
- Dublin, Ireland
As a general rule of thumb, try to set up aggregate dimensions that preserve the original separation of the base dimension tables. So rather than have country and category combined as a sort of "junk" dimension, create a rollup of customer called "country" and a rollup of product called "category."
This gives the aggregate star a kind of "symmetry" with the original star. It tends to be the most workable solution with aggregate-aware query tools, databases, etc. It also is the easiest to manage when no tools are aggregate-aware, as it minimizes the effort required to rewrite SQL.
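As a rough sketch, the aggregate star pairs a summarized fact table with rollups of the base dimensions. All names here are hypothetical:

    -- Rollup of the customer dimension
    create table country_dim (
        country_key   integer primary key,
        country_name  varchar(100)
    );

    -- Rollup of the product dimension
    create table category_dim (
        category_key   integer primary key,
        category_name  varchar(100)
    );

    -- Aggregate fact table at country/category grain
    create table sales_facts_agg (
        country_key    integer not null references country_dim,
        category_key   integer not null references category_dim,
        sales_dollars  decimal(18,2)
    );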
If the summary is very small, it may be workable to create a "big wide table" that stores the dimension values directly with the summary facts. This works best if you are using a tool to implement the aggregate that is able to exploit the query rewrite function of a database, like a materialized view or materialized query table. In this case, applications can still query the "base" star, but receive the benefit of the aggregate.
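For example, on a database that supports query rewrite, the "big wide table" version might be declared as a materialized view. This is an Oracle-style sketch with hypothetical names; the exact syntax varies by platform:

    create materialized view sales_by_country_category
    enable query rewrite
    as
    select c.country, p.category, sum(f.sales_dollars) as sales_dollars
    from sales_facts f
    join customer c on c.customer_key = f.customer_key
    join product p on p.product_key = f.product_key
    group by c.country, p.category;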
But do you really need this aggregate? The more highly summarized an aggregate is, the fewer queries or reports it is able to accelerate. It is usually possible to construct aggregates that are not quite so highly summarized, giving a performance boost to a wider range of queries. For example, summarizing a customer dimension by region, rather than country, will allow the aggregate to work with more reports--including those that focus on country.
- Chris
Tuesday, September 8, 2009
More on Surrogate Keys
Posted by
Chris Adamson
This post addresses a common follow-up question regarding surrogate keys:
"My source system already tracks changes. In this case, do dimension tables really need their own unique identifiers?"Yes. There may be exceptions, but they are far less common than you may think, and they limit your future options.
Monday, July 27, 2009
Recommended Books on the Data Warehouse Lifecycle
Posted by
Chris Adamson
Recommended Reading: A new book by Laura Reeves, and a revised edition of the classic Lifecycle Toolkit.
If you've been to any of my classes, you already know that I am a fan of Laura Reeves. She has a pragmatic, get-things-done approach to data warehousing.
You may also know her as co-author of the original edition of The Data Warehouse Lifecycle Toolkit, a book she wrote with Ralph Kimball, Margy Ross and Warren Thornthwaite. (For more on that book, see below.)
Laura has a new book out, which I highly recommend: A Manager's Guide to Data Warehousing.
In this book, she provides a practical guide to planning and executing data warehouse projects. It is written for managers (I.T. and business) who do not necessarily have a technical background in data warehousing.
Laura touches on each phase of the data warehouse lifecycle, providing useful advice without over-burdensome methodology, detailed task lists or the like. This makes it easy to fit her advice into your own organization's development style.
Even if you already have a strong background in dimensional design, you will find this book to be quite useful. You can get it at Amazon.com.
Also Recommended
If you have a dimensional data warehouse, I also urge you to check out The Data Warehouse Lifecycle Toolkit, Second Edition by Ralph Kimball, Margy Ross, Warren Thornthwaite, Joy Mundy and Bob Becker.
This fully revised version of the classic book contains detailed tasks and deliverables to help you manage all phases of the data warehouse lifecycle.
It is an excellent reference for data warehousing professionals. Read more about it at Amazon.com.
The original edition has been a long time recommendation on this blog, and the new edition carries on the standard. (Apologies to Warren Thornthwaite, whose name was previously misspelled here.)
Wednesday, May 20, 2009
Do I really need Surrogate Keys?
Posted by
Chris Adamson
Here is the #1 question that people most frequently ask of me:
Q: Do I really need surrogate keys?
A: You absolutely must have a unique identifier for each dimension table, one that does not come from a source system. A surrogate key is the best way to handle this, but there are other possibilities.
The case for the surrogate key is entirely pragmatic. Read on for a full explanation.
Dimensions Need their Own Unique Identifier
It is crucial that the dimensional schema be able to handle changes to source data in whatever manner makes most sense from an analytic perspective. This may be different from how the change is handled in the source.
For this reason, every dimension table needs its own unique identifier -- not one that comes from a source system. (For more on handling changes, start with this post.)
A surrogate key makes the best unique identifier. It is simply an integer value, holding no meaning for end users. A single, compact column, it keeps fact table rows small and makes SQL easy to read and write. It is simple to manage during the ETL process.
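Here is a minimal sketch, with hypothetical names. The natural key from the source system is carried along as an ordinary attribute, not as the primary key:

    create table customer_dim (
        customer_key     integer primary key,  -- surrogate key, no business meaning
        customer_id      varchar(20),          -- natural key from the source system
        customer_name    varchar(100),
        customer_region  varchar(50)
    );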
Compound Keys Work, But Why Take that Route?
An alternative is to supplement a unique identifier from the source system (also known as a natural key) with a sequence number. This results in a compound key, or multi-part key.
This kind of compound key also allows the dimension to handle changes differently than the source does. But there are several disadvantages:
- Fact table rows become larger, as they must include the multi-part key for each dimension
- The ETL process must manage a sequence for each natural key value, rather than a single sequence for a surrogate key
- SQL becomes more difficult to read, write or debug
- Multi-part dimension keys can disrupt star join optimizers
(By the way, many source system identifiers are already compound keys. Adding a sequence number will make them even larger!)
Sometimes it is suggested that the natural key be supplemented with a date, rather than a sequence. This may simplify ETL slightly, but the rest of the drawbacks remain. Plus, dates are even bigger than a sequence number. Worse, the date will appear in the fact table. That is sure to lead to trouble!
(This is not to say that datestamps on dimension rows are a bad thing. To the contrary, they can be quite useful. They just don't work well as part of a compound key.)
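The point above about SQL readability is easiest to see side by side. A sketch, again with hypothetical names:

    -- With a surrogate key: one join column per dimension
    select d.customer_name, sum(f.sales_dollars) as sales_dollars
    from sales_facts f
    join customer_dim d on d.customer_key = f.customer_key
    group by d.customer_name;

    -- With a compound key (natural key plus sequence): every join repeats both columns
    select d.customer_name, sum(f.sales_dollars) as sales_dollars
    from sales_facts f
    join customer_dim d on d.customer_id = f.customer_id
                       and d.version_seq = f.customer_version_seq
    group by d.customer_name;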
That's the basic answer. I'll respond to some common follow up questions over the coming weeks.
- Chris
Monday, April 27, 2009
More on Distinct Counts
Posted by
Chris Adamson
Shortly before my recent post on unique counts, a reader sent in a related question that takes a slightly different perspective:
Q: "As fact tables could easily have a hierarchy of keys, the counts using the lowest grain can be counted via a count(*), and anything above that needs a count(distinct column_a).
"Do you see many applications accommodating count(*)'s…by that I mean, making separate higher grain fact tables so ad-hoc users do not have to use a count distinct?"
- P. Petrini
As the reader points out, unique counts (or distinct counts) can be problematic for end users to put together. This may be particularly true if you are using a SQL-generating Business Intelligence (BI) tool.
For example, suppose an orders fact table has one row per order line. Its major dimensions are customer, product, salesperson and day. To count the number of distinct products ordered on a day, you would need to place count(distinct product_name) in your SQL.
That is something a developer can do in her sleep, but may be a bit much for a power-user. And it may be difficult to get a BI tool to generate this kind of SQL.
The reader correctly points out that this would not be necessary if the grain of the fact table precisely matches whatever it is that needs to be counted. All that would be necessary would be count(*) or something along those lines. So a summary table might help....
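For example, against the base star the distinct count requires something like the first query below, while a one-row-per-product-per-day aggregate reduces it to a plain count. Names are hypothetical:

    -- Against the order-line star: a distinct count is needed
    select d.calendar_date, count(distinct p.product_name) as products_ordered
    from order_facts f
    join day_dim d on d.day_key = f.day_key
    join product_dim p on p.product_key = f.product_key
    group by d.calendar_date;

    -- Against a product-by-day aggregate: count(*) suffices
    select d.calendar_date, count(*) as products_ordered
    from product_day_agg a
    join day_dim d on d.day_key = a.day_key
    group by d.calendar_date;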
The Issue
The problem with this is that the summary table will simplify one and only one kind of count.
There will probably be a variety of different levels or time periods across which you want distinct counts, and each would require its own aggregate.
This could quickly become a lot of work -- simply to make it easier to formulate certain kinds of queries. And while it may make a particular query easier to express, it introduces new complexity for the user, who must now choose precisely the right star for each query.
For example, a summary fact table that has one row for each product for each day makes it easier to count the number of products that sold on a day. But this aggregate must omit other dimensions like salesperson and customer to be useful in this regard. If you also want to count other distinct things by day, each will require its own aggregate. Also want distinct counts by month? More aggregates.
Not very efficient, unfortunately. The SQL for counting distinct things is simplified, to be sure. But now the ETL process is doing a lot more work, and users are faced with choosing the right aggregate for each query they build.
Alternatives
My suggestion is that, rather than build fact tables that eliminate the need for count(distinct), build and cache reports that do the work. People interested in these counts can access the reports, instead of writing SQL. If these counts are common, you might also find it easier to use an OLAP tool.
Not perfect, I know, but nothing ever is.
This is not to say that there is anything wrong with creating a series of summary tables for this purpose, or perhaps storing some pre-computed counts as discussed earlier.
As people who attend my classes know, my motto is "be pragmatic, not dogmatic." As long as everyone is aware of the pros and cons, and understands how it affects each aspect of the data warehouse implementation, then if there is a consensus that adding summary tables is the best way to make life easier, go ahead and do it.
Chris
Many thanks to P. Petrini, for consenting to have his question appear here.
If you have a question, you can email it to me. Click my profile picture at the top of this page for the address. I try to answer all questions, and might even print yours here
Saturday, April 18, 2009
Dealing with Period-to-Date Measurements (MTD, YTD and so forth)
Posted by
Chris Adamson
A few weeks ago, I touched on some problems you might encounter when trying to store period-to-date measurements in a fact table. Today, I want to address the issue more directly.
Period-to-date measurements can summarize just about any fact over a period of time. A measurement of month-to-date sales, for example, aggregates sales from the beginning of the month to the date in question. Month-to-date, quarter-to-date and year-to-date facts are often used to do things like compute commissions, volume discounts, etc.
Period-to-date Facts and Additivity
Most facts in a star schema can be summed up across any and all dimensions. Sales dollars, for example, can be summed across days, customers, stores, or whatever dimension makes sense to the person doing the analysis. We call this kind of fact fully additive.
The basic issue with storing period-to-date facts is this: they are never fully additive. It does not make any sense to sum up period-to-date measurements that were taken at different times. Month-to-date sales for a customer on January 14, 15 and 16, for example, cannot be meaningfully added together.
Period-to-date measurements are semi-additive; they can be summed up across some dimensions, but not time.
It might make sense to aggregate together several period-to-date measurements if they were taken at the same time. The year-to-date sales of all customers who made a purchase on a particular day, for example, might factor into some useful analysis.
Storing a period-to-date fact with transactions
Since it makes no sense to sum up period-to-date measurements across different points in time, they have very limited use in a transaction-grained fact table.
For example, a transaction-grained fact table that stores a row for each order-line of each order might contain dimensions that capture the order_line number, date and time of the order, product sold, and so forth. You might store a period-to-date measurement in such a fact table, but there will be little or no opportunity to sum this fact across different order lines. Only orders with the same date/time can be meaningfully summed. And if there are multiple order-lines for a given order, even this may not make sense, since the period-to-date measurement would be stored redundantly.
Stored in this kind of fact table, the period-to-date measurement has very limited use. It can really only be used to study the most granular data in the fact table. For example, it might be used to compute an applicable discount, or a commission percentage that escalates with sales volume.
But is this needed in the fact table? Such metrics are usually highly operational. They may influence other metrics, which have more analytic value and happen to be additive. The commission paid or discount given, for example, may be computed using period-to-date information, but itself is fully additive. This will be useful to store in a fact table.
This does not mean that period-to-date measurements have no place in a transaction-grained fact table. If they are central to the business, or to analysis of the process, then place them there. They may, however, cause excessive growth in the row size, since it is possible to enumerate multiple period-to-date versions of any given fact.
Storing a period-to-date fact with period aggregates or snapshots
The natural home for a period-to-date measurement is a fact table with a periodic-snapshot grain, or a fact table that aggregates across the time dimension.
An aggregate that summarizes the time dimension sums up data from a transaction-based fact table over a particular period. For each period summarized, the resulting rows all summarize the same period. This means that each can contain period-to-date measurements that might be meaningfully summed, at least within the period. (Note I said "might." There is a caveat, which I will get to in a second.)
An aggregate that records monthly sales totals, for example, is potentially a good place to include a quarter-to-date or year-to-date measurement. Each row in the table will summarize the same period, so the period-to-date measurements may be meaningfully summarized.
The same goes for a periodic snapshot fact table. Like a periodic aggregate, each row in a snapshot summarizes a particular period. The main difference is that a snapshot will contain some form of unique status measurement that cannot be found in the transactions, such as a balance or level. Once again, a semi-additive period-to-date fact can be stored here.
Snapshots record the status of things at fixed time intervals. A monthly snapshot of bank accounts, for example, records the day-end balance of each account. This is a logical place to store month-to-date or other such period-to-date measurements.
Caution: Exactly what is being measured period-to-date?
Even within an aggregate or snapshot period, period-to-date facts may not be additive. It is important to ask yourself: exactly what is being measured period-to-date? If it does not coincide with the grain of the table, it may be non-additive.
For example, a monthly aggregate stores sales by product and customer. If we want to store year-to-date customer sales in this table, the same fact will be repeated multiple times if a single customer purchased multiple products. It is not additive across products; it is repeated if the customer bought more than one product.
On the other hand, if the period-to-date measurement captures sales by customer and product, it coincides with the grain of the table. It can be meaningfully aggregated across any dimension but time.
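A quick illustration of the danger, assuming a monthly aggregate table with a fact sales_dollars at customer/product/month grain and a fact ytd_customer_sales that is stated by customer only (hypothetical names):

    -- Safe: the fact coincides with the grain of the table
    select customer_key, sum(sales_dollars) as sales_dollars
    from monthly_sales_agg
    where month_key = 200901
    group by customer_key;

    -- Not safe: ytd_customer_sales repeats for every product the customer bought,
    -- so summing it across products overstates the result
    select customer_key, sum(ytd_customer_sales) as ytd_sales
    from monthly_sales_agg
    where month_key = 200901
    group by customer_key;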
Reporting
The old stand-by is to compute a period-to-date measurement in a query or report. It is calculated from transactions at run-time. This allows access to any conceivable period-to-date metric, and may also be used as a behavioral qualification rather than a fact.
Such queries may be particularly nasty or time consuming, but they can be run during batch windows and cached to compensate.
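As a sketch of the run-time approach, month-to-date sales through a given date can be computed directly from the transaction-grained star; names and dates are hypothetical:

    select c.customer_name, sum(f.sales_dollars) as mtd_sales
    from order_facts f
    join day_dim d on d.day_key = f.day_key
    join customer_dim c on c.customer_key = f.customer_key
    where d.calendar_date between date '2009-01-01' and date '2009-01-16'
    group by c.customer_name;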
My next post will be another Q&A. Please send in your questions!
Chris
Friday, March 27, 2009
Is a Snapshot an Aggregate?
Posted by
Chris Adamson
Students in my classes often ask me if a snapshot fact table is an aggregate. My short answer is that it doesn't matter. But here's the long answer.
This is really a matter of semantics more than anything else. What it boils down to is how you define an aggregate table. Is it simply a table that summarizes data from other tables? Then yes, most of the time a snapshot will fit this definition -- but not always. However, I like to reserve the word "aggregate" for a very specific kind of summary. If you do as well, then the answer is no. Never.
Before getting into that, though, here is some background.
Snapshot Fact Tables
A snapshot, for the uninitiated, is a fact table that periodically samples a measurement. A snapshot always includes at least one measurement of status, such as a balance or level. A transaction fact table, in contrast, captures measurements as they occur.
The snapshot is useful for describing the effect of transactions, rather than transactions themselves. The best example is your bank account. It's all well and good if you have a record of each deposit, check, interest payment, fee, direct debit, and so forth. But what is your balance? Aggregating all your transactions is not the most expedient way to figure that out.
Sometimes Snapshots Summarize Transactions, but not Always
People sometimes think of snapshots as aggregates because they can be constructed from a table that contains transactions. That is certainly true in the case of account balances. If you have the transactions, you can compute the snapshots.
In this sense, you might be justified in thinking of a snapshot as an aggregate. But I don't. I'll explain why in just a second.
First, though, it is important to understand that not all snapshots are constructed from transactions. Sometimes, the number of transactions is too numerous to keep around. But in other cases, there simply aren't any. For example, we can't make use of transactions to describe the ambient temperature on the factory floor, or the level of pressure inside an industrial device. In these cases, the snapshot is clearly not a summary.
Snapshots Contain "new" Facts
So how about those other situations -- the ones where there are transactions that can be used to create a snapshot? Isn't a snapshot an aggregate in this case?
It really depends on how you define an aggregate.
The word aggregate is normally reserved for a summary table that does not transform the structure of its corresponding base table. For example, suppose account transactions are recorded in a fact table, with a fact called transaction_amount. A monthly aggregate would have exactly the same fact. Still called transaction_amount, it is the same as the fact in the base table. The only difference is the level of detail: it represents a month's worth of transactions, rather than individual transactions. It is the same measurement.
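In SQL terms, such an aggregate is nothing more than a rollup of the base fact, with no change in meaning; a sketch with hypothetical names:

    -- Same fact, same meaning: only the level of detail changes
    create table account_facts_monthly as
    select f.account_key,
           d.month_key,
           sum(f.transaction_amount) as transaction_amount
    from account_transaction_facts f
    join day_dim d on d.day_key = f.day_key
    group by f.account_key, d.month_key;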
Account balance (or other status measurements like levels) is not the same as what the base fact table records. It is not an aggregation of the transactions that occurred during the snapshot period; it is something more. It describes the net effect of the account transactions. It is a different measurement.
Conclusion
I use the term derived table to describe any table in a dimensional design that uses another as its source. I save the word aggregate for non-transformative summaries.
To sum things up, then:
Yes, a snapshot sometimes summarizes a transaction fact table.
But if you accept these definitions for aggregate tables and derived tables, the answer is different:
A snapshot is sometimes a derived table, but never an aggregate.
This is just a matter of semantics, and not very important to me. Call it whatever you want, and I will be fine with it. The important thing is that you have a dimensional design that enables useful analytics.
- Chris
Tuesday, March 24, 2009
Feed Updated, Now Contains Full Posts
Posted by
Chris Adamson
If you follow this blog in a newsreader, you will be happy to know that the RSS feed now contains full posts, rather than the first couple of paragraphs. That should make your reading experience a lot easier.
The change seems to have been glitch-free, but if there are any issues please let me know.
Chris
Thursday, March 19, 2009
Q&A: Is a Unique Count a Fact or Dimension?
Posted by
Chris Adamson
A reader recently sent in a question about unique counts. Are they facts, or dimensions?
Here is a paraphrased version of their question. The business example has been changed.
Q: We report on the number of unique customers that visit a location. We do this at daily, monthly and annual levels, and also look at the same thing across all locations. We originally modeled the customer counts as facts. The problem is that if we compute unique customer counts by day, they do not correctly roll up to month. Does it make more sense to treat this as a dimension?
A: First, let me say that customer counts are certainly facts. They are important metrics that are colored by the dimensions we use to define them.
The difficulty with counts of distinct things, as you have discovered, is that they are non-additive facts. As soon as you use a set of dimensions to create a distinct count -- such as day, location and customer -- you have an aggregation that cannot be further summed. If you need to go to the monthly level, or across all locations, you need to go back to the granular data.
You have a few options.
Option 1: Compute in reports
The first is to compute unique counts in reports. This can be time consuming, and a major hassle if there are several counts that must be computed frequently. You might be able to schedule the reports to run in a batch window. This deals with the processing issues, but not the report or query complexity.
In my view, however, it is the best option. There are other possibilities, but each has significant drawbacks.
Option 2: Create snapshots or aggregates
A second option is to create some snapshot fact tables that compute the various distinct counts at various periodic intervals. These will contain non-additive facts, and your developers will have to understand that they should not be aggregated. Depending on their design, these tables may more closely resemble aggregate tables.
This approach may require quite a few fact tables, since each of your unique counts will have a different grain -- daily by location, monthly by location, annually by location, daily across all locations, monthly across all locations, etc.
To make life a little easier for report developers, designers sometimes replicate the monthly and annual totals in the daily-level snapshot. Developers find this appealing, because all the information needed to produce subtotals over time can be fetched from a single row, rather than assembled from three separate queries. The flip side is that the monthly and annual data is stored redundantly in a lot of places, and must be maintained consistently and used properly.
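As a sketch of what two of these tables might look like, using the same hypothetical visit_facts and day_dim tables as above, the loads could resemble the following; remember that the stored counts are non-additive once written.

    -- Monthly unique customers by location (non-additive once stored)
    INSERT INTO monthly_location_visitors (month_key, location_key, unique_customers)
    SELECT d.month_key, f.location_key, COUNT(DISTINCT f.customer_key)
    FROM visit_facts f
    JOIN day_dim d ON d.day_key = f.day_key
    GROUP BY d.month_key, f.location_key;

    -- A separate table is needed for each grain, for example monthly across all locations
    INSERT INTO monthly_visitors (month_key, unique_customers)
    SELECT d.month_key, COUNT(DISTINCT f.customer_key)
    FROM visit_facts f
    JOIN day_dim d ON d.day_key = f.day_key
    GROUP BY d.month_key;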
Option 3: Behavioral Dimensions
If the current period is the primary focus, a third option is to create behavioral dimensions. For example, in the store dimension, you can add attributes that represent counts of unique customers for the current month and year. I mentioned behavioral dimensions in a previous Q&A post, and the concept is the same here.
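A minimal sketch of the refresh, assuming the store/location dimension is a table called location keyed by location_key, and assuming a hypothetical current-month key; the exact UPDATE syntax varies a little by platform. The attribute is simply overwritten each time it is refreshed.

    -- Refresh the behavioral attribute for the current month
    UPDATE location l
    SET unique_customers_this_month = (
        SELECT COUNT(DISTINCT f.customer_key)
        FROM visit_facts f
        JOIN day_dim d ON d.day_key = f.day_key
        WHERE f.location_key = l.location_key
          AND d.month_key = 200912    -- hypothetical current-month key
    );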
Option 4: Creating Special Facts that can be Summed
Lastly, some designers like to create special facts in the fact table that can be used to sum unique counts. If you have a fact table that gets a row each time a customer visits a location, you may be able to tweak it a bit. Add an additional fact called first_visit_of_month that only contains a 1 the first time the customer visits any location during the month. First_visit_of_year will only contain a 1 the first time the customer visits during the year. These facts can be summed to produce unique counts for various time frames, but you need to be careful not to aggregate them across the periods they represent.
Another drawback is that, once again, numerous special facts will be required. To compute unique counts by location, you will need facts like first_visit_this_location_this_month. This fact should not be aggregated across months or locations. That's a lot of rules, and they are likely to be broken.
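Here is one way such a flag might be derived during the load, sketched with a window function over the same hypothetical tables; visit_id stands in for whatever uniquely identifies a granular row.

    -- Flag a customer's first visit row of the month with a 1, all other rows with 0
    SELECT f.visit_id,
           CASE WHEN ROW_NUMBER() OVER (
                     PARTITION BY f.customer_key, d.month_key
                     ORDER BY d.full_date, f.visit_id) = 1
                THEN 1 ELSE 0
           END AS first_visit_of_month
    FROM visit_facts f
    JOIN day_dim d ON d.day_key = f.day_key;

    -- Once stored, SUM(first_visit_of_month) grouped by month reproduces the monthly
    -- unique count, but it must never be summed across months or sliced by location.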
As you can see, no option looks perfect. Don’t let that lead you to push back on requirements, however. Distinct counts are often very important metrics, particularly when studying customer service or processing efficiency.
Thanks for the question!
- Chris
If you have a question, you can email it to me. Click my profile picture at the top of this page for the address. I try to answer all questions, and might even print yours here.
Privacy Policy
Posted by
Chris Adamson
Recent changes to the Google AdSense program prompted me to define a privacy policy for this blog. Here it is:
Oakton Software does not gather any information about your visits to this blog, or the pages you view.
I wish the policy could be that simple, but you should also know the following:
Blogger:
The blog is hosted by Google's Blogger.com.
If you have a Google, Gmail or Blogger account, Google may log information about your visits to this blog. The Blogger privacy policy can be found by visiting the Google Privacy Center and clicking on the link for Blogger.
Newsletter:
If you subscribe to the newsletter, the information you provide is used to send occasional email updates. These always include an opt-out link.
The mailing list is maintained and managed by an external service provider, which follows a confirmed opt-in policy to ensure compliance with anti-spam legislation.
Ads By Google:
UPDATE 12/16/2010: I have removed ads from this blog. While they helped support it, I felt they were detracting from its value. If you'd like to help support StarSchemaCentral, please use the links in the sidebars to buy books from Amazon.com.
Monday, February 16, 2009
Accounting Transactions Have Multiple Dates
Posted by
Chris Adamson
Building a financial data mart? Don't let the multiple dates associated with accounting transactions confuse you.
A standard feature of financial systems often befuddles star schema designers. When someone inputs a transaction into a financial system, the transaction has at least two associated dates. One represents the date that the event happened, and the second represents the date the transaction was entered into the system. The first is usually called the transaction date, while the second is called the date posted. Sometimes, there are more dates as well.
For example, if a check is written, the record in the accounting system will have two dates. The transaction date will be the date on the check. The date posted will be the day that someone in the accounting department actually entered the transaction into the system. (Some financial systems may have other dates as well, such as an audit date or an edited date that applies to comments.)
Don’t let these dates confuse you. If you are designing a financial data mart, you do not need to choose between them; both dates will be useful, and both should appear in your model. A fact table that logs general ledger activity, for example, will have one row per transaction. That row can contain two references to a day dimension; one represents the transaction date and one represents the posting date. Name the keys appropriately; for example, you can call them day_key_transaction and day_key_posted.
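For illustration, a report that needs both dates can join the day dimension once per role; the fact table and column names here (gl_facts, day_dim, fiscal_period, full_date, amount) are hypothetical.

    -- The day dimension plays two roles, so it is joined once for each role
    SELECT t.fiscal_period,
           p.full_date AS date_posted,
           f.amount
    FROM gl_facts f
    JOIN day_dim t ON t.day_key = f.day_key_transaction
    JOIN day_dim p ON p.day_key = f.day_key_posted;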
The transaction date will be the one that is used to drive most reporting. This is the date that properly associates the transaction with an accounting period. You would use it to summarize transactions by day, month or fiscal period, for example.
The date posted will be of lesser use, but may also occasionally serve analytic purposes--especially when compared to the transaction date. For example, auditors may want to look for transactions posted one or more months after they actually took place. Managers might want to improve their efficiency by studying lags between transaction date and date posted.
If you are also going to build a monthly snapshot by account, then the star for account transactions may serve as its source. In that case, the date posted will help the ETL developers maintaining the snapshot. During each processing window, they can simply look for any transactions that were posted prior to the last time the snapshot was updated, and process only those rows.
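A sketch of that incremental filter, using the same hypothetical gl_facts and day_dim names and assuming the last refresh date is tracked somewhere:

    -- Pick up only the transactions posted since the snapshot was last refreshed
    SELECT f.*
    FROM gl_facts f
    JOIN day_dim p ON p.day_key = f.day_key_posted
    WHERE p.full_date > DATE '2009-01-31';    -- last refresh date, however you track it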
- Chris
Q&A: Choosing the Right Time Dimension for a Snapshot
Posted by
Chris Adamson
In this Q&A, a reader asks how to designate a time dimension for a fact table that is a snapshot.
Q: We are designing a new star. The fact table is a monthly snapshot. It will capture several metrics each month for each open account. My question is how to handle the time dimension. We cannot agree whether it makes more sense to have a month dimension, or to use a month end date from a day dimension. What makes the most sense?
- Name withheld by request
First, I want to commend you for stating the grain of your fact table, and stating it clearly. Grain confusion can cause many problems, and makes a question like yours even more difficult to answer. Snapshots always have a time component to their grain; in your case it is monthly. Your fact table will have one row per account per month.
Now to your question: I strongly prefer the use of a month dimension for a monthly snapshot. This approach is unambiguous. It is self-evident what each row represents; the chance for any confusion is minimal.
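For illustration, here is a minimal sketch of that grain with a month dimension; all table and column names are hypothetical.

    -- One row per account per month
    CREATE TABLE account_monthly_snapshot (
        month_key              INTEGER NOT NULL,    -- refers to the month dimension
        account_key            INTEGER NOT NULL,    -- refers to the account dimension
        period_end_balance     DECIMAL(18,2),
        average_daily_balance  DECIMAL(18,2)
    );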
When a monthly snapshot is designed with a day dimension, the designers are generally clear on what each row represents. However, report developers or other users may misinterpret the data now or in the future.
Designers who use a day dimension with a monthly snapshot generally choose the last day of the period represented, as you suggested. The problem with this is that there are many elements of a typical day dimension that simply do not apply to a monthly summary. For example, a snapshot representing March 2009 would have the day 3/31/09 associated with it. The day dimension has a lot of information about March 31, 2009 that does not apply to the snapshot, such as what day of the week it was, whether it was a holiday, and so forth. It would be all too easy to misinterpret this information as being applicable to the facts.
Even if everyone is clear on this at the start, a year or two down the road there are likely to be new members of your development team who may misinterpret the data. The presence of a date can also lead to confusion about what a row represents. If your monthly summary contains an average daily balance, for example, a user may mistake it for a balance as of the day in question.
Confusion will also arise over what date to use for the current period. During March 2009, you may be updating the snapshot rows for each account on a daily basis. Using a day dimension, ETL developers may assign whatever date in March the row was last updated, so the date ends up serving double duty: designating the snapshot period and recording the load date.
Sometimes, there is a strong desire among technical team members to use a day dimension with a monthly snapshot. If you cannot convince them otherwise, then you can avoid some confusion by creating a view for use with the monthly snapshot that hides the day-level attributes. You may be able to do the same kind of “hiding” of non-applicable dimension attributes within your BI tool. Use the month-end date for each fact table row. If the ETL team wants to track the day each row was loaded, they can do this with a second relationship to the day dimension. (This trick works nicely even if you do go with a month dimension.)
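If you are stuck with a day dimension, the view might look something like this sketch, exposing only the month-level attributes; the table and column names are hypothetical.

    -- Expose only the month-level attributes; hide full date, day of week, holiday flags
    CREATE VIEW snapshot_month AS
    SELECT day_key,
           month_name,
           fiscal_period,
           calendar_year
    FROM day_dim;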
Regardless of your approach, I also advise you to name the facts carefully. If you want to include a balance as of the end of the period, name it period_end_balance. An average daily balance for the period in question should be called average_daily_balance. A column name like balance is simply too vague, and bound to be misused.
Thanks for your question!
- Chris
If you have a question, click my profile picture at the top of the page for my address. I try to respond to all messages, and may even post an answer here.
Friday, January 23, 2009
Q&A: Customers with No Sales in Fact Table
Posted by
Chris Adamson
In this post, Chris answers a reader's question about reporting on customers with no sales in the fact table.
New Q&A Feature
I get a lot of questions about dimensional modeling from readers of the blog, my books, and people who attend my classes. I try to answer all of them. I thought it would be interesting to post some of them here, for all to read.
This week, I'm answering a question about reporting on customers with no sales.
In general what is the recommended approach for reporting customers that have no sales in the fact table for MTD or YTD? Or, for that matter, any dimension member where no fact exists for some time period?
- J. Dolan
Usually, something like “customers with no sales for a period” is computed as part of the reporting process. Developers often bristle at this kind of report, because it requires using a correlated subquery. This makes it difficult to do with a SQL-generating BI tool, and can be slow. And if you want to use the list of customers with no sales as a filter for another query, things can get really hairy.
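For reference, the report-time version is typically a NOT EXISTS test built on a correlated subquery; the table names (customer, sales_facts, day_dim) and the month key value here are hypothetical.

    -- Customers with no sales in the current month
    SELECT c.customer_key, c.customer_name
    FROM customer c
    WHERE NOT EXISTS (
        SELECT 1
        FROM sales_facts f
        JOIN day_dim d ON d.day_key = f.day_key
        WHERE f.customer_key = c.customer_key
          AND d.month_key = 200901    -- hypothetical current-month key
    );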
An alternative is to pre-compute this information, storing the results in a table that summarizes reporting periods. This is likely to be a factless fact table, with foreign keys to dimension tables representing the period summarized and the dimension in question--customers in your case. If you require monthly and yearly slices, you will want multiple summaries. This approach may simplify reporting, but adds complexity to the ETL process.
If the current period is the only period of concern, as in your question, then it is also possible to pre-compute the information and store the result directly in the customer dimension table, as a flag. The flag would specify whether the customer has placed an order during the current period. In your case, it would be two flags -- one for month to date and one for year to date.
This kind of attribute is called a behavioral dimension because it takes behavioral activity recorded in the fact table and transforms it into a dimension. Behavioral dimensions are very powerful, because they can be used as simple filters for other queries.
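A sketch of the month-to-date flag refresh and of its use as a filter, under the same hypothetical names as above (the year-to-date flag would be maintained the same way); UPDATE syntax varies a little by platform.

    -- Set the month-to-date flag during the ETL refresh
    UPDATE customer c
    SET ordered_this_month = CASE WHEN EXISTS (
            SELECT 1
            FROM sales_facts f
            JOIN day_dim d ON d.day_key = f.day_key
            WHERE f.customer_key = c.customer_key
              AND d.month_key = 200901)    -- hypothetical current-month key
        THEN 'Y' ELSE 'N' END;

    -- The flag then works as a simple filter in any other query
    SELECT customer_key, customer_name
    FROM customer
    WHERE ordered_this_month = 'N';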
Again, this concept trades ETL complexity for reporting flexibility and performance. But isn't that what data warehousing is all about?
Thanks to J. Dolan for agreeing to have his question posted. If you have a question, send it in. I try to answer all my email, though it sometimes takes a while!
Chris