Showing posts with label Aggregates. Show all posts
Showing posts with label Aggregates. Show all posts

Tuesday, October 13, 2009

Highly Summarized Dimensions

From the inbox today, a question about aggregate design:
Q:  "If you roll up a dimension that is hierarchical, but you are now left with the highest level of the hierarchy (no further roll-up would be possible in that dimension) - would you maintain a separate dimension for this in the aggregated star?

"Or would you join a number of these unrelated top hierarchy levels in a junk dimension?"

- Dublin, Ireland
The reader has a potential aggregate that will summarize more than one dimension at a very high level.  For example, an aggregate of sales that summarizes the customer dimension by country and summarizes the product dimension by category.

As a general rule of thumb, try to set up aggregate dimensions that preserve the original separation of the base dimension tables.  So rather than have country and category combined as a sort of "junk" dimension, create a rollup of customer called "country" and a rollup of product called "category." 

This gives the aggregate star a kind of "symmetry" with the original star.  It tends to be the most workable solution with aggregate-aware query tools, databases, etc.  It also is the easiest to manage when no tools are aggregate-aware, as it minimizes the effort required to rewrite SQL.

If the summary is very small, it may be workable to create a "big wide table" that stores the dimension values directly with the summary facts.  This works best if you are using a tool to implement the aggregate that is able to exploit the query rewrite function of a database, like a materialized view or materialized query table.  In this case, applications can still query the "base" star, but receive the benefit of the aggregate.

But do you really need this aggregate?  The more highly summarized an aggregate is, the fewer queries or reports it is able to accelerate.   It is usually possible to construct aggregates that are not quite so highly summarized, giving a performance boost to a wider range of queries.  For example, summarizing a customer dimension by region, rather than country, will allow the aggregate to work with more reports--including those that focus on country.

- Chris

Image credit:  ZapTheDingbat

Saturday, April 18, 2009

Dealing with Period-to-Date Measurements (MTD, YTD and so forth)

A few weeks ago, I touched on some problems you might encounter when trying to store period-to-date measurements in a fact table. Today, I want to address the issue more directly.

Period-to-date measurements can summarize just about any fact over a period of time. A measurement of month-to-date sales , for example, aggregates sales from the beginning of the month to the date in question. Month-to-date, quarter-to-date and year-to-date facts are often used to do things like compute commissions, volume discounts, etc.

Period-to-date Facts and Additivity

Most facts in a star schema can be summed up across any and all dimensions. Sales dollars, for example, can be summed across days, customers, stores, or whatever dimension makes sense to the person doing the analysis. We call this kind of fact fully additive.

The basic issue with storing period-to-date facts is this: they are never fully additive. It does not make any sense to sum up period-to-date measurements that were taken at different times. Month-to-date sales for a customer on January 14, 15 and 16, for example, cannot be meaningfully added together.

Period-to-date measurements are semi-additive; the can be summed up across some dimensions, but not time.

It might make sense to aggregate together several period-to-date measurements if they were taken at the same time. The year-to-date sales of all customers who made a purchase on a particular day, for example, might factor into some useful analysis.

Storing a period-to-date fact with transactions

Since it makes no sense to sum up period-to-date measurements across different points in time, they have very limited use in a transaction-grained fact table.

For example, a transaction-grained fact table that stores a row for each order-line of each order might contain dimensions that capture the order_line number, date and time of the order, product sold, and so forth. You might store a period-to-date measurement in such a fact table, but there there will be little or no opportunity to sum this fact across different order lines. Only orders with the same date/time can be meaningfully summed. And if there are multiple order-lines for a given order, even this may not make sense, since the period-to-date measurement would be stored redundantly.

Stored in this kind of fact table, the period-to-date measurement has very limited use. It can really only be used to study the most granular data in the fact table. For example, it might be used to compute an applicable discount, or a commission percentage that escalates with sales volume.

But is this needed in the fact table? Such metrics are usually highly operational. They may influence other metrics, which have more analytic value and happen to be additive. The commission paid or discount given, for example, may be computed using period-to-date information, but itself is fully additive. This will be useful to store in a fact table.

This does not mean that period-to-date measurements have no place in a transaction-grained fact table. If they are central to the business, or to analysis of the process, then place them there. They may, however, cause excessive growth in the row size, since it is possible to enumerate multiple period-to-date versions of any given fact.

Storing a period-to-date fact with period aggregates or snapshots

The natural home for a period-to-date measurement is a fact table with a periodic-snapshot grain, or a fact table that aggregates across the time dimension.

An aggregate that summarizes the time dimension sums up data from a transaction-based fact table over a particular period. For each period summarized, the resulting rows all summarize the same period. This means that each can contain period-to-date measurements that might be meaningfully summed, at least within the period. (Note I said "might." There is a caveat, which I will get to in a second.)

An aggregate that records monthly sales totals, for example, is potentially a good place to include a quarter-to-date or year-to-date measurement. Each row in the table will summarize the same period, so the period-to-date measurements may be meaningfully summarized.

The same goes for a periodic snapshot fact table. Like a periodic aggregate, each row in a snapshot summarizes a particular period. The main difference is that a snapshot will contain some form of unique status measurement that cannot be found in the transactions, such as a balance or level. Once again, a semi-additive period-to-date fact can be stored here.
Snapshots record the status of things at fixed time intervals. A monthly snapshot of bank accounts, for example, records the day-end balance of each account. This is a logical place to store month-to-date or other such

Caution: Exactly what is being measured period-to-date?

Even within an aggregate or snapshot period, period-to-date facts may not be additive. It is important to ask yourself: exactly what is being measured period-to-date? If it does not coincide with the grain of the table, it may be non-additive.

For example, a monthly aggregate stores sales by product and customer. If we want to store year-to-date customer sales in this table, the same fact will be repeated multiple times if a single customer purchased multiple products. It is not additive across products; it is repeated if the customer bought more than one product.

On the other hand, if the period-to-date measurement captures sales by customer and product, it coincides with the grain of the table. It can be meaningfully aggregated across any dimension but time.

Reporting

The old stand-by is to compute a period-to-date measurement in a query or report. It is calculated from transactions at run-time. This allows access to any conceivable period-to-date metric, and may also be used as a behavioral qualification rather than a fact.

Such queries may be particularly nasty or time consuming, but they can be run during batch windows and cached to compensate.

My next post will be another Q&A. Please send in your questions!

Chris

Friday, March 27, 2009

Is a Snapshot an Aggregate?

Students in my classes often ask me if a snapshot fact table is an aggregate. My short answer is that it doesn't matter. But here's the long answer.

This is really a matter of semantics more than anything else. What it boils down to is how you define an aggregate table. Is it simply a table that summarizes data from other tables? Then yes, most of the time a snapshot will fit this definition -- but not always. However, I like to reserve the word "aggregate" for a very specific kind of summary. If you do as well, then the answer is no. Never.

Before getting into that, though, here is some background.

Snapshot Fact Tables

A snapshot, for the uninitiated, is a fact table that periodically samples a measurement. A snapshot always includes at least one measurement of status, such as a balance or level. A transaction fact table, in contrast, captures measurements as they occur.

The snapshot is useful for describing the effect of transactions, rather than transactions themselves. The best example is your bank account. Its all well and good if you have a record of each deposit, check, interest payment, fee, direct debit, and so forth. But what is your balance? Aggregating all your transactions is not the most expedient way to figure that out.

Sometimes Snapshots Summarize Transactions, but not Always

People sometimes think of snapshots as aggregates because they can be constructed from a table that contains transactions. That is certainly true in the case of account balances. If you have the transactions, you can compute the snapshots.

In this sense, you might be justified in thinking of a snapshot as an aggregate. But I don't. I'll explain why in just a second.

First, though, it is important to understand that not all snapshots are constructed from transactions. Sometimes, the number of transactions is too numerous to keep around. But in other cases, there simply aren't any. For example, we can't make use of transactions to describe the ambient temperature on the factory floor, or the level of pressure inside an industrial device. In these cases, the snapshot is clearly not a summary.

Snapshots Contain "new" Facts

So how about those other situations -- the ones where there are transactions that can be used to create a snapshot? Isn't a snapshot an aggregate in this case?

It really depends on how you define an aggregate.

The word aggregate is normally reserved for a summary table that does not transform the structure of its corresponding base table. For example, suppose account transactions are recorded in a fact table, with a fact called transaction_amount. A monthly aggregate would have exactly the same fact. Still called transaction_amount, is the same as the fact in the base table. The only difference is in level of detail: it represents a month's worth of transactions, rather than individual transactions. It is the same measurement.

Account balance (or other status measurements like levels) is a not the same as what the base fact table records. It is not an aggregation of transaction that occurred during the snapshot period; it is something more. It describes the net effect of the account transactions. It is a different measurement.

Conclusion

I use the term derived table to describe any table in a dimensional design uses another as its source. I save the word aggregate for non-transformative summaries.

To sum things up, then:

Yes, a snapshot sometimes summarizes a transaction fact table.

But if you accept these definitions for aggregate tables and derived tables, the answer is different:

A snapshot is is sometimes a derived table, but never an aggregate.

This is just a matter of semantics, and not very important to me. Call it whatever you want, and I will be fine with it. The important thing is that you have a dimensional design that enables useful analytics.

- Chris

Tuesday, June 10, 2008

Available Now: Designing a Performance Layer

My article "Designing A Performance Layer: Aggregates and Derived Tables" appears in the current issue of The Business Intelligence Journal (Vol 13, No. 1), available to members of The Data Warehouse Institute.

This article shows you how to create a performance layer for your dimensional data warehouse -- a set of cubes and tables designed specifically for performance.

The article covers two types of performance design: aggregate tables and derived tables.
  • Aggregate Tables store partially summarized copies of existing data. These tables (or cubes) improve performance for queries that do not require granular detail. The article fully explains aggregates, including summarization of metrics, hierarchies, and implementation with or without aggregate navigators.
  • Derived Tables store restructured copies of existing data. They do not necessarily summarize information, but they may. Derived tables deliver performance benefits by streamlining the query and reporting processes. In the article, I show how to supplement an existing schema with merge, pivot, and set operation fact tables (or cubes).
The full article appears in The Business Intelligence Journal, which is available in print and on line for TDWI members.

Monday, June 11, 2007

Ten Things You Won't Learn from that Demo Schema

Many people learn about dimensional modeling by studying a sample star schema database that comes with a software product. These sample databases are useful learning tools—to a point. Here are 10 things you won't learn by studying that demo.

If you've learned everything you know about star schema by working with a sample database, you probably have a good intuitive grasp of star schema design principles. In a previous post, I provided a list of 10 terms and principles that most sample databases illustrate well.

But there are many important things about a dimensional data warehouse that are not revealed by the typical "Orders" or "Sales" demo. Here are the top 10 things you will not learn from that sample database.

  1. Multiple Fact Tables Most sample databases contain a single star—one fact table and its associated dimension tables. But it is rare to find a business process that can be modeled with a single fact table; it is impossible to find an enterprise that can. Most real-world designs will involve multiple fact tables, sharing a set of common dimensions.

    When facts become available at different times, or with different dimensionality, they almost always belong in separate fact tables. Modeling them in a single fact table can have negative consequences. Mixed grain issues may result, complicating the load and making reporting very difficult. For example, building reports focused on only one of the facts can result in a confusing preponderance of extra rows containing the value zero.

  2. Conformance With multiple fact tables, it is also important that each star be designed so that it works with others. A design principle called conformance helps ensure that as we build each new star, it works well with those that came before it. This avoids the dreaded stove-pipe. This principle allows a set of star schemas to be planned around a set of common dimensions and implemented incrementally.

  3. Drilling Across It's also important to understand how to properly build a report that accesses data from multiple stars. A single SQL select statement won't do the job. Double counting, or worse, can result. Instead, we follow a process called drilling across, where each star is queried individually. The two result sets are then combined based on their common attributes. These drill across reports are some of the most powerful in the data warehouse.

  4. Snapshot Fact Tables The fact table found in most demo stars is usually called a transaction fact table. But there are real world situations where other types of fact table designs are called for.

    A snapshot design is useful for capturing the result of a series of transactions at a point-in-time; for example, the balance of each account in the general ledger at the end of each day. This type of design introduces the concept of semi-additivity, which can be a problem for many ad hoc query tools. It makes no sense to add together yesterday's balance and today's balance. It is not uncommon to compute averages based on the data in a snapshot star. But one must be careful here; the SQL Average() function may not always be what you need.

  5. Factless Fact Tables Another type of fact table often contains no facts at all. Factless fact tables are useful in situations where there appears to be nothing to measure aside from the occurrence of an event, such as a customer contact. They also come in handy when we want to capture information about which there may be no event at all, such as eligibility.

    In addition to transaction, snapshot and factless designs, there are other types of fact table as well. It is not uncommon to need more than one, even when modeling a single activity.

  6. Roles and Aliasing Many business processes involve a dimension in multiple roles. For example, in accounting a transaction may include the employee who makes a purchase, as well as the employee who approves it. There is no need for separate"Purchaser" and "Approver" dimensions. A single "Employee" dimension will do the job. The fact table will have two foreign key references to the Employee dimension--one that represents the purchaser, and one that represents the approver. We use SQL "aliasing" when querying this schema in order to capture the two employee roles.

  7. Advanced Slow Change Techniques If you are lucky, you were able to learn about Type 1 and Type 2 Slowly Changing Dimension techniques from the demo schema. I described these in a previous post. Often, analytic requirements require more.

    A Type 3 change allows you to "have it both ways," analyzing all past and future transactions as if the change had occurred (retroactively) or not all.

    There are also hybrid approaches, one of which tracks the "transaction-time" version of the changed data element as well as the "current-value" of the data element.

    And then there's the time-stamped dimension technique, also called a transaction dimension. In this version, each row receives an effective date/time and an expiration date time. This provides Type 2 functionality, but also allows point-in-time analysis of the dimensional data.

  8. Bridge Tables Perhaps the most confusing technique for the novice dimensional designer is the use of bridge tables. These tables are used when the standard one-to-many relationship between dimension and fact does not apply. There are three situations where bridge tables come in handy:

    An attribute bridge resolves situations where a dimension attribute may repeat multiple times. For example, a dimension table called "Company" may include an attribute called "Industry." Some companies have more than one industry. Rather than flattening into "Industry 1," "Industry 2," and so on, an attribute bridge captures as many industries as needed.

    A dimension bridge resolves situations where an entire dimension may repeat with respect to facts. For example, there may be multiple salespeople involved in a sale. Instead of loading the fact table with multiple salesperson keys, a dimension bridge gracefully manages the group of salespeople.

    A hierarchy bridge resolves situations where a recursive hierarchy exists within a dimension. For example, companies own other companies. At times, users may want to roll transactions up that occur beneath a specific company, or vice versa. Instead of flattening the hierarchy, which imposes limitations and complicates analysis, a hierarchy bridge can be joined to the transaction data in various ways, allowing multiple forms of analysis.

    All bridge implementations have implications for usage, or report building. Improper use of a bridge can result in double counting or incorrect results. Bridges also make deployment of business intelligence tools more difficult.

  9. Derived Schemas Useful stars can also be derived from existing stars. Often called "second-line" solutions, these derived schemas can accelerate specific types of analysis with powerful results. The merged fact table combines stars to avoid drilling across. The sliced fact table partitions data based on a dimension value, useful in distributed collection and analysis. The pivoted fact table restructures row-wise data for columnar analysis and vice-versa. And set operation fact tables provide precomputed results for union, intersect and minus operations on existing stars.

  10. Aggregate Schemas One of the reasons the star schema has become so popular is that it provides strong query performance. Still, there are times when we want results to come back faster. Aggregate schemas partially summarize the data in a base schema, allowing the database to compute query results faster. Of course, designers need to identify aggregates that will provide the most help, and queries and reports will receive no benefit unless they actually use the aggregate. Aggregates are usually designed as separate tables, instead of providing multiple levels of summarization in a single star. This avoids double counting errors, and can allow the exploitation of an automated query rewrite mechanism so that applications do not need to be aware of the aggregates.
I limited this list to 10 things. That's enough to make the point: a demo schema will only take you so far. When I teach Advanced Star Schema Design courses, I find that even people with many years of experience have questions, and always learn something new.

If you want to learn more, read the books recommended in the sidebar of this blog. Take a class on Advanced Star Schema Design. Interact with your peers at The Data Warehousing Institute conferences. And keep reading this blog. There will always be more to learn.

Related Posts:

Top 10 Thinks You Should Know About that Demo Schema

© 2007 Chris Adamson