
Wednesday, October 10, 2012

Handling Rankings

Dimensional modelers often struggle with issues surrounding rankings. Learn when to store them, where to store them, and where not to store them, by reading this post.
We need to calculate Top N rankings across several dimensions (account, channel, hour, day, week, month.) There are massive data sets involved. This must be a common problem. Is there an approach you can point us towards?
- Jeremy
Rankings are time and context dependent. When you need to track them, they pose challenges that are similar to non-additive facts such as period-to-date measurements or distinct counts.

The solutions, it turns out, are also similar.

Rankings and their Context

When thinking about how to handle rankings, it is useful to break them down. A ranking is actually made up of several parts. There is the dimension being ranked, the dimensions for which the ranking will be computed,  and the fact that is the source of the ranking.

For example, look at this simple statement:

"Show me the top 10 customers by week and product."
  • The dimension being ranked here is Customer.  
  • The dimensional context of each ranking is a product and week. For example, if we have 20 products, then each week we will need 20 top ten lists -- one for each product.
  • The fact driving the ranking is not explicit in this question. It is probably spending, though it may be based on margin or transaction counts.  
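Breaking a ranking into these parts maps directly onto code. Here is a minimal sketch that computes Top N rankings from transaction-grained data; the tuple layout and the use of spending as the ranking fact are illustrative assumptions, not a prescribed design.

```python
from collections import defaultdict

def top_n(transactions, n=10):
    """Rank customers by total spending within each (week, product) context.

    transactions: iterable of (week, product, customer, amount) tuples.
    Returns {(week, product): [(customer, total), ...]} with the top n
    customers per context, highest spending first.
    """
    # Sum the driving fact at the grain of customer within each context.
    totals = defaultdict(float)
    for week, product, customer, amount in transactions:
        totals[(week, product, customer)] += amount

    # Group the ranked dimension's totals by context (week, product).
    contexts = defaultdict(list)
    for (week, product, customer), total in totals.items():
        contexts[(week, product)].append((customer, total))

    # One Top N list per context.
    return {
        ctx: sorted(members, key=lambda m: m[1], reverse=True)[:n]
        for ctx, members in contexts.items()
    }
```

Note that each (week, product) pair gets its own list, which is exactly why 20 products imply 20 top-ten lists per week.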
That may sound simple, but things are usually a bit more complicated.

The reader, for example, does not just need a single ranking. He actually needs several -- daily, weekly, monthly. Presumably, these may be mixed and matched with the other dimensions he mentioned -- account and channel. In addition, there may be different rankings based on different facts.

Compute vs. Store

It may have occurred to you at this point that there are a multitude of possible rankings, even for a relatively simple schema.

In addition, once we compute a ranking, there is very little we can do with it. For example, we cannot sum it up over time periods, or other dimensions.  We may be able to compare it to another ranking (a different period or dimension, for example), or treat it as reference data describing the thing being ranked. But that's about it.

For these reasons, many organizations choose to compute rankings within reports, or at query time.  This provides maximum flexibility, and also sidesteps the ugly problem of having to store information that cannot be aggregated.

However, the reader indicates that data volumes are large. Let's assume they are so large that it is not feasible to keep the granular data around to compute rankings. Let's also assume that the rankings he needs are consistent over time. In his case, it may be necessary to compute all needed rankings and then store them.

Rankings fit with Snapshots or Aggregates

A ranking is only useful within the original context of its calculation. We won't really be able to aggregate it. In this respect, it is similar to a distinct count or a period-to-date measurement -- topics I have addressed in the past. (Some period-to-date metrics can be aggregated within a single period, but otherwise the comparison holds.)

The recommended solution is similar as well. Establish the granularity of the various rankings, then keep track of them in fact tables of appropriate grain. These will either be periodic snapshots, or aggregate tables that summarize appropriate time periods. The rank itself will be stored in the fact table, even though it is not additive.
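As a concrete sketch, a monthly ranking stored at the grain of month, product, and ranked customer might look like this. The table and column names are hypothetical, and SQLite stands in for whatever database you actually use.

```python
import sqlite3

# Illustrative grain: one row per month, product, and ranked customer.
# The rank is stored as a fact even though it is not additive.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE monthly_customer_rank_fact (
        month_key    INTEGER NOT NULL,
        product_key  INTEGER NOT NULL,
        customer_key INTEGER NOT NULL,
        total_spend  NUMERIC NOT NULL,
        spend_rank   INTEGER NOT NULL  -- meaningful only within month/product
    )""")
conn.executemany(
    "INSERT INTO monthly_customer_rank_fact VALUES (?,?,?,?,?)",
    [(201210, 1, 501, 9000, 1),
     (201210, 1, 502, 7500, 2)])

# The rank can only be used within its original context:
top_customer = conn.execute("""
    SELECT customer_key FROM monthly_customer_rank_fact
    WHERE month_key = 201210 AND product_key = 1 AND spend_rank = 1
""").fetchone()
```

Because the query must always qualify on the full context (month and product here), nothing tempts a user to sum or average the rank across contexts.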

Many developers balk at this solution, observing that it increases the number of fact tables that must be maintained. But note that these fact tables do not introduce any new dimension tables, and they can usually be sourced from transaction-grained stars. And the alternatives have even less appeal.

Rankings don't fit with Transactions

Most of the time, it is not useful to store rankings with transactions. Unless all the things being ranked happen to have transactions that occur at the same time, it will be impossible to compare the ranking with anything -- including other things being ranked!

Resist abstracted models

It is tempting to come up with designs that abstract the concept of a ranking. For example, one can imagine setting up a dimension that describes a ranking, the dimension being ranked, and the period. A fact table can be set up that carries the key for this metadata, plus 10 or 20 keys that will be used to link to top 1, top 2, etc.

Abstracted designs are appealing because they may appear to limit the number of stars to be maintained and don't require schema changes when new rankings are added. But note that the same amount of work must be done to load each ranking.

More importantly, these abstracted designs severely hamper usability. With such a design, it is much harder to merge information about a ranking with other data. A report that contains product level detail, for example, cannot easily pull in info indicating the product appeared in a Top 10 ranking -- this would require decoding metadata and then checking each of the various keys.

Further Reading

As mentioned, some of the challenges related to rankings are similar to those involving distinct counts or period-to-date metrics.  I've covered these topics before:


I also touch on the topic in Chapter 8 of  Star Schema: The Complete Reference, where I address the proper home of a period-to-date measurement.  Use the link to purchase a copy, and you will be helping to support this blog.



Image licensed via Creative Commons 2.0



Thursday, April 26, 2012

Q&A: Human resources data marts


A reader asks if Human Resources data marts are inherently complex. I run down a list of dimensional techniques he should expect to find:
Q: I will be working on a data mart design project to design star schemas for human resources data. I heard that HR data is more complex than sales or marketing and special techniques need to be applied.

I looked at the star schemas of pre-built analytical applications developed by some vendors for our packaged HR solution. I felt that they are quite complex, and I am wondering whether star design for HR data should be so complex.

If possible, can you please discuss this topic in a detailed manner by considering any one of popular HRMS system data and the most common data/reporting requirements along with the design discussion to achieve the star for those reports using the given HRMS data?

- Venkat, UK

A: Human Resources applications do indeed tend to use advanced techniques in dimensional design.

Below, I run down a list of topics you will probably need to brush up on. In reality, every subject area requires complete mastery of dimensional modeling, not just the basics.

Note that the complexity you are seeing in packaged solutions may stem from the subject area. Vendors often produce abstracted models to facilitate customization.

Techniques used in HR data marts

No doubt you are accustomed to the transaction-grained stars you encountered in sales.  You will find them in HR as well, but you will also encounter these:
  • Snapshot stars sample one or more metrics at pre-defined intervals.

    In an HR data mart, these may be used to track various kinds of accruals, balances in benefit programs, etc.

  • Accumulating snapshot stars track dimension members through a business process and allow analysis of the elapsed time between milestones.

    These may be used to track the filling of a position, "on-boarding" processes, disciplinary procedures, or applications to benefit programs.

  • Factless fact tables track business processes where the primary metric is the occurrence of an event. They contain no facts, but are used to count rows.

    These are likely to be used for tracking attendance or absence, participation in training courses, etc.

  • Coverage stars are factless fact tables that model conditions.  These are usually in place to support comparison to activities represented in other stars, but may also be leveraged to capture key relationships among dimensions.

    These are likely to be used for linking employees to positions, departments and managers. 
Your dimensions will also require reaching beyond the basics:
  • Transaction dimensions capture the effective and expiration date/time for each row in a dimension table.  These are advisable in almost any situation.

    In HR they may be used to  track changes in an employee dimension.

  • Bridge tables for Multi-valued attributes allow you to associate a repeating attribute with a dimension.

    In HR, these are likely to be used to associate an employee with skills, languages, and other important characteristics.

  • Hierarchy bridge tables allow you to aggregate facts through a recursive hierarchy.

    In HR, these are used to navigate reporting structures (employees report to employees, who in turn report to other employees, and so forth) as well as organizational structures.
I would also expect to encounter some complexity in slow-change processing rules. Human Resource systems carefully audit certain kinds of data changes, tracking the reason for each change. As a result, you may have attributes in your dimension schema that exhibit either type 1 or type 2 behavior, depending on the reason for the change.
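To illustrate that last point, here is a minimal sketch of reason-dependent slow change processing. The reason codes, attribute names, and row layout are all hypothetical; a real HR dimension would carry many more attributes and a surrogate key pipeline.

```python
# Changes whose reasons must be audited historically get type 2 treatment;
# simple corrections are overwritten in place (type 1). Codes are illustrative.
TYPE_2_REASONS = {"PROMOTION", "TRANSFER", "TITLE_CHANGE"}

def apply_change(dimension_rows, employee_id, attribute, new_value, reason, today):
    """dimension_rows: list of dicts with 'employee_id', a 'current' flag,
    'effective_date', 'expiration_date', and attribute columns."""
    current = next(r for r in dimension_rows
                   if r["employee_id"] == employee_id and r["current"])
    if reason in TYPE_2_REASONS:
        # Type 2: expire the current row and insert a new version.
        current["current"] = False
        current["expiration_date"] = today
        new_row = dict(current, current=True,
                       effective_date=today, expiration_date=None)
        new_row[attribute] = new_value
        dimension_rows.append(new_row)
    else:
        # Type 1: overwrite in place, rewriting history.
        current[attribute] = new_value
    return dimension_rows
```

The key design point is that the type 1 vs. type 2 decision is driven by the change reason supplied by the source system, not by which attribute changed.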

Every schema goes beyond the basics

This list could go on, but I think you get the idea.

The only way to design a data mart that meets business needs is to have a well rounded understanding of the techniques of dimensional modeling.

You cannot get very far with nothing more than a grasp of the basics. This holds true in any subject area -- even sales and marketing.  You need the complete toolbox to build a powerful business solution.

Packaged data marts

The complexity that concerns the reader may actually stem from another cause:  he is looking at packaged data mart solutions.

Packaged applications often introduce complexity for an entirely different reason: to support extensibility or customization. For example, facts may be stored row-wise rather than column-wise, and dimensions may contain generic attribute names.

Learn more

This blog contains posts on most of the topics listed above.  Click each header for a link to a related article.  Some have been discussed in multiple posts, but I have included only one link for each, so do some exploring on your own as well.

In addition, please check out my book Star Schema: The Complete Reference.  When you purchase it from Amazon using the links on this page, you help support this blog.
  • Snapshots and accumulating snapshots are covered in Chapter 11, "Transactions, Snapshots and Accumulating Snapshots"

  • Factless fact tables and coverage stars are covered in Chapter 12, "Factless Fact Tables"

  • Transaction dimensions are covered in Chapter 8, "More Slow Change Techniques"

  • Attribute bridges are covered in Chapter 9, "Multi-valued Dimensions and Bridges"

  • Hierarchy bridges are covered in Chapter 10, "Recursive Hierarchies and Bridges"
Thanks for the question!

- Chris

Send in your own questions to the address in the sidebar. 


Do you have another technique that was useful in an HR data mart? Use the comments.

Image credit: Gravityx9 licensed under Creative Commons 2.0

Monday, August 29, 2011

Slowly Changing Facts?

This post looks at the issues faced when facts can change retroactively.
Q: Our source system allows facts to be updated.  We are trying to come up with a way to preserve this history in the data warehouse.  One idea is to use the type 2 technique for facts.  We call it "slowly changing facts."  Is this a good solution?
A: When transactions are changed retroactively, you should update corresponding facts. If you need to track the history of facts, do so in separate audit tables.

To understand why, we will  look at what happens if you record multiple "versions" of a fact in your fact tables. But first, a refresher on the type 2 slow change technique.

Slowly changing dimensions

Dimensional modelers must decide what will happen when the source data for a dimension attribute changes. There are several possible responses to a change; one of them is known as the "Type 2" response.

The type 2 response calls for keeping a sort of "version history" in the dimension table. For example, if a customer's address changes, we insert a second record into the dimension for the same customer.  This way, the old "version" of the customer can be linked to facts that took place prior to the change, and the new version can be linked to new facts.

I covered the basics in more detail in a previous post; you can also refer to the slowly changing dimensions category for more posts on the topic.

Keeping the history of facts

A fact table contains a record of things that happened.  A customer ordered 200 widgets yesterday; another ordered 300 today. Measurement of these events can be aggregated across various dimensions to answer a variety of questions.

But sometimes, what we know about an event changes.  We originally thought 200 widgets were ordered yesterday, but now we know that 250 were ordered.

At first glance,  it may seem like the same "type 2" technique will solve the problem.  When details about a specific transaction are changed, simply insert a new row into the fact table.  The most current version can be flagged for easy access.

Unfortunately, this seriously harms the usability of the fact table. When people study orders, they must now be careful to isolate a single version of each transaction to avoid double counting. You can try to resolve this by adding multiple date-stamps to each record, or converting changes into "deltas", but the damage is done. The model is open to serious misuse.

We'll look at the solution in a moment, but first a look at the same problem in a snapshot model.

Snapshots exhibit the same problems

Another kind of fact table, the periodic snapshot, tracks status at predefined intervals.  A financial data mart, for example, might log the balance of each account in the general ledger at the close of business each day.

Here too, our understanding of what happened may change.  At the close of business Monday, a particular account contains US$ 12,000.  On Tuesday, back-dated transactions change that balance to $13,000.

If we try to track both versions of Monday's balance, usability is damaged in the same way.  We must always qualify our analysis for (1) the date we are studying, and (2) the "as of" date of our analysis.  If we don't, grave errors ensue.

Here, usability problems are compounded by density issues.  Tracking the daily balance of each account over 5 years would normally require 1,826 rows per account (365 days x 5 years = 1,825 days, plus 1 leap-day).  If we must also re-record balances for all prior days, the number grows to an astounding 1,668,051.  (This is a triangular number, computed as (n² + n) / 2, where n is the number of days.)

If there are 500 accounts in the general ledger, that's the difference between 913,000 rows in a standard periodic snapshot, and 834,025,500 for one that logs history.  As each additional day is added, the difference increases.
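The arithmetic above can be verified directly:

```python
DAYS = 365 * 5 + 1     # five years of daily snapshots, plus one leap day: 1,826
ACCOUNTS = 500

standard_rows = DAYS * ACCOUNTS           # one row per account per day: 913,000
triangular = (DAYS * DAYS + DAYS) // 2    # re-recording all prior days: 1,668,051
history_rows = triangular * ACCOUNTS      # versioned snapshot: 834,025,500
```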

The alternative 

There is only one version of the truth, and that is what belongs in the fact table.

When our understanding about an event has changed, the corresponding facts should be updated. This holds for both transaction models and snapshot models.

If you must be able to produce an audit trail, do so in a separate table.  The audit table need only capture the transaction identifier, the date, and the value of each fact. Record rows for the original fact and each revision.
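A minimal sketch of this pattern follows, using SQLite as a stand-in database. The table and column names are illustrative; the point is that the fact table is updated in place while every version lands in the audit table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE order_facts (
    transaction_id INTEGER PRIMARY KEY,
    order_date     TEXT,
    quantity       INTEGER
);
CREATE TABLE order_facts_audit (
    transaction_id INTEGER,
    change_date    TEXT,
    quantity       INTEGER
);
""")

def record_fact(conn, txn_id, order_date, change_date, quantity):
    """One version of the truth in the fact table; every version
    (original and revisions) logged in the audit table."""
    conn.execute("INSERT OR REPLACE INTO order_facts VALUES (?,?,?)",
                 (txn_id, order_date, quantity))
    conn.execute("INSERT INTO order_facts_audit VALUES (?,?,?)",
                 (txn_id, change_date, quantity))

record_fact(conn, 1, "2011-08-28", "2011-08-28", 200)  # original
record_fact(conn, 1, "2011-08-28", "2011-08-29", 250)  # revision
```

Queries against order_facts never need to filter out stale versions, while the audit trail remains available for the rare analysis that needs it.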


If need be, this can also be organized dimensionally as a fact table. Keep in mind, however, that this fact table will be subject to the same usability challenges.  And ask the business if all this trouble is really necessary.

For a snapshot model, an audit table may not be necessary; a corresponding transaction table may be able to hold the history. Each transaction that affects status is recorded with two dates: the date of the transaction, and the date we became aware of the information (often a "posted" date).

Friday, March 27, 2009

Is a Snapshot an Aggregate?

Students in my classes often ask me if a snapshot fact table is an aggregate. My short answer is that it doesn't matter. But here's the long answer.

This is really a matter of semantics more than anything else. What it boils down to is how you define an aggregate table. Is it simply a table that summarizes data from other tables? Then yes, most of the time a snapshot will fit this definition -- but not always. However, I like to reserve the word "aggregate" for a very specific kind of summary. If you do as well, then the answer is no. Never.

Before getting into that, though, here is some background.

Snapshot Fact Tables

A snapshot, for the uninitiated, is a fact table that periodically samples a measurement. A snapshot always includes at least one measurement of status, such as a balance or level. A transaction fact table, in contrast, captures measurements as they occur.

The snapshot is useful for describing the effect of transactions, rather than transactions themselves. The best example is your bank account. It's all well and good if you have a record of each deposit, check, interest payment, fee, direct debit, and so forth. But what is your balance? Aggregating all your transactions is not the most expedient way to figure that out.

Sometimes Snapshots Summarize Transactions, but not Always

People sometimes think of snapshots as aggregates because they can be constructed from a table that contains transactions. That is certainly true in the case of account balances. If you have the transactions, you can compute the snapshots.

In this sense, you might be justified in thinking of a snapshot as an aggregate. But I don't. I'll explain why in just a second.

First, though, it is important to understand that not all snapshots are constructed from transactions. Sometimes, the number of transactions is too numerous to keep around. But in other cases, there simply aren't any. For example, we can't make use of transactions to describe the ambient temperature on the factory floor, or the level of pressure inside an industrial device. In these cases, the snapshot is clearly not a summary.

Snapshots Contain "new" Facts

So how about those other situations -- the ones where there are transactions that can be used to create a snapshot? Isn't a snapshot an aggregate in this case?

It really depends on how you define an aggregate.

The word aggregate is normally reserved for a summary table that does not transform the structure of its corresponding base table. For example, suppose account transactions are recorded in a fact table, with a fact called transaction_amount. A monthly aggregate would have exactly the same fact. Still called transaction_amount, it is the same as the fact in the base table. The only difference is the level of detail: it represents a month's worth of transactions, rather than individual transactions. It is the same measurement.
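The distinction can be made concrete with a small sketch, again using SQLite and illustrative names. Note that the aggregate carries the same fact, transaction_amount, only at a coarser grain:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE transaction_facts (
    account_key INTEGER, month_key INTEGER, transaction_amount NUMERIC)""")
conn.executemany("INSERT INTO transaction_facts VALUES (?,?,?)",
                 [(1, 200901, 100), (1, 200901, -40), (1, 200902, 25)])

# A true aggregate: same fact, same measurement, coarser grain.
conn.execute("""
CREATE TABLE monthly_transaction_agg AS
SELECT account_key, month_key,
       SUM(transaction_amount) AS transaction_amount
FROM transaction_facts
GROUP BY account_key, month_key
""")
```

A snapshot built from the same source would instead carry a running balance, a different measurement that no GROUP BY over a single period can produce.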

Account balance (or other status measurements, like levels) is not the same as what the base fact table records. It is not an aggregation of the transactions that occurred during the snapshot period; it is something more. It describes the net effect of the account's transactions. It is a different measurement.

Conclusion

I use the term derived table to describe any table in a dimensional design that uses another as its source. I save the word aggregate for non-transformative summaries.

To sum things up, then:

Yes, a snapshot sometimes summarizes a transaction fact table.

But if you accept these definitions for aggregate tables and derived tables, the answer is different:

A snapshot is sometimes a derived table, but never an aggregate.

This is just a matter of semantics, and not very important to me. Call it whatever you want, and I will be fine with it. The important thing is that you have a dimensional design that enables useful analytics.

- Chris

Monday, February 16, 2009

Q&A: Choosing the right time dimension for a Snapshot

In this Q&A, a reader asks how to designate a time dimension for a fact table that is a snapshot.
Q: We are designing a new star. The fact table is a monthly snapshot. It will capture several metrics each month for each open account. My question is how to handle the time dimension. We cannot agree whether it makes more sense to have a month dimension, or to use a month end date from a day dimension. What makes the most sense?

- Name withheld by request

First, I want to commend you for stating the grain of your fact table, and stating it clearly. Grain confusion can cause many problems, and makes a question like yours even more difficult to answer. Snapshots always have a time component to their grain; in your case it is monthly. Your fact table will have one row per account per month.

Now to your question: I strongly prefer the use of a month dimension for a monthly snapshot. This approach is non-ambiguous. It is self-evident what each row represents; the chance for any confusion is minimal.

When a monthly snapshot is designed with a day dimension, the designers are generally clear on what each row represents. However, report developers or other users may misinterpret the data now or in the future.

Designers who use a day dimension with a monthly snapshot generally choose the last day of the period represented, as you suggested. The problem with this is that there are many elements of a typical day dimension that simply do not apply to a monthly summary. For example, a snapshot representing March 2009 would have the day 3/31/09 associated with it. The day dimension has a lot of information about March 31, 2009 that does not apply to the snapshot, such as what day of the week it was, whether it was a holiday, and so forth. It would be all too easy to misinterpret this information as being applicable to the facts.

Even if everyone is clear on this at the start, a year or two down the road there are likely to be new members of your development team who may misinterpret the data. The presence of a date can also lead to confusion about what a row represents. If your monthly summary contains an average daily balance, for example, a user may confuse this for a balance as of the day in question.

Confusion will also arise over what date to use for the current period. During March 2009, you may be updating the snapshot rows for each account on a daily basis. Using a day dimension, ETL developers may assign the particular date in March, which now serves double duty -- designating the snapshot period and the date loaded.

Sometimes, there is a strong desire among technical team members to use a day dimension with a monthly snapshot. If you cannot convince them otherwise, then you can avoid some confusion by creating a view for use with the monthly snapshot that hides the day-level attributes. You may be able to do the same kind of “hiding” of non-applicable dimension attributes within your BI tool. Use the month-end date for each fact table row. If the ETL team wants to track the day each row was loaded, they can do this with a second relationship to the day dimension. (This trick works nicely even if you do go with a month dimension.)
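The "hiding" view might look something like the following sketch. SQLite stands in for your database, the table and column names are hypothetical, and the month-end filter uses SQLite's date modifiers; the idea is simply that the view exposes only attributes that apply at a monthly grain.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE day_dim (
    day_key      INTEGER PRIMARY KEY,
    full_date    TEXT,
    day_of_week  TEXT,   -- not applicable at a monthly grain
    holiday_flag TEXT,   -- not applicable at a monthly grain
    month_name   TEXT,   -- applicable at a monthly grain
    year         INTEGER -- applicable at a monthly grain
);
-- Expose only month-end rows and month-level attributes:
CREATE VIEW month_dim AS
SELECT day_key AS month_key, month_name, year
FROM day_dim
WHERE full_date = date(full_date, 'start of month', '+1 month', '-1 day');
""")
conn.executemany("INSERT INTO day_dim VALUES (?,?,?,?,?,?)",
                 [(20090331, "2009-03-31", "Tuesday", "N", "March", 2009),
                  (20090315, "2009-03-15", "Sunday",  "N", "March", 2009)])
```

Report developers who query through month_dim never see day_of_week or holiday_flag, so they cannot misapply them to monthly facts.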

Regardless of your approach, I also advise you to name the facts carefully. If you want to include a balance as of the end of the period, name it period_end_balance. An average daily balance for the period in question should be called average_daily_balance. A column name like balance is simply too vague, and bound to be misused.

Thanks for your question!

- Chris

If you have a question, click my profile picture at the top of the page for my address. I try to respond to all messages, and may even post an answer here.

Monday, June 11, 2007

Ten Things You Won't Learn from that Demo Schema

Many people learn about dimensional modeling by studying a sample star schema database that comes with a software product. These sample databases are useful learning tools—to a point. Here are 10 things you won't learn by studying that demo.

If you've learned everything you know about star schema by working with a sample database, you probably have a good intuitive grasp of star schema design principles. In a previous post, I provided a list of 10 terms and principles that most sample databases illustrate well.

But there are many important things about a dimensional data warehouse that are not revealed by the typical "Orders" or "Sales" demo. Here are the top 10 things you will not learn from that sample database.

  1. Multiple Fact Tables Most sample databases contain a single star—one fact table and its associated dimension tables. But it is rare to find a business process that can be modeled with a single fact table; it is impossible to find an enterprise that can. Most real-world designs will involve multiple fact tables, sharing a set of common dimensions.

    When facts become available at different times, or with different dimensionality, they almost always belong in separate fact tables. Modeling them in a single fact table can have negative consequences. Mixed grain issues may result, complicating the load and making reporting very difficult. For example, building reports focused on only one of the facts can result in a confusing preponderance of extra rows containing the value zero.

  2. Conformance With multiple fact tables, it is also important that each star be designed so that it works with others. A design principle called conformance helps ensure that as we build each new star, it works well with those that came before it. This avoids the dreaded stove-pipe. This principle allows a set of star schemas to be planned around a set of common dimensions and implemented incrementally.

  3. Drilling Across It's also important to understand how to properly build a report that accesses data from multiple stars. A single SQL select statement won't do the job. Double counting, or worse, can result. Instead, we follow a process called drilling across, where each star is queried individually. The two result sets are then combined based on their common attributes. These drill across reports are some of the most powerful in the data warehouse.

  4. Snapshot Fact Tables The fact table found in most demo stars is usually called a transaction fact table. But there are real world situations where other types of fact table designs are called for.

    A snapshot design is useful for capturing the result of a series of transactions at a point-in-time; for example, the balance of each account in the general ledger at the end of each day. This type of design introduces the concept of semi-additivity, which can be a problem for many ad hoc query tools. It makes no sense to add together yesterday's balance and today's balance. It is not uncommon to compute averages based on the data in a snapshot star. But one must be careful here; the SQL Average() function may not always be what you need.

  5. Factless Fact Tables Another type of fact table often contains no facts at all. Factless fact tables are useful in situations where there appears to be nothing to measure aside from the occurrence of an event, such as a customer contact. They also come in handy when we want to capture information about which there may be no event at all, such as eligibility.

    In addition to transaction, snapshot and factless designs, there are other types of fact table as well. It is not uncommon to need more than one, even when modeling a single activity.

  6. Roles and Aliasing Many business processes involve a dimension in multiple roles. For example, in accounting a transaction may include the employee who makes a purchase, as well as the employee who approves it. There is no need for separate "Purchaser" and "Approver" dimensions. A single "Employee" dimension will do the job. The fact table will have two foreign key references to the Employee dimension -- one that represents the purchaser, and one that represents the approver. We use SQL "aliasing" when querying this schema in order to capture the two employee roles.

  7. Advanced Slow Change Techniques If you are lucky, you were able to learn about Type 1 and Type 2 Slowly Changing Dimension techniques from the demo schema. I described these in a previous post. Often, analytic requirements require more.

    A Type 3 change allows you to "have it both ways," analyzing all past and future transactions as if the change had occurred (retroactively) or not at all.

    There are also hybrid approaches, one of which tracks the "transaction-time" version of the changed data element as well as the "current-value" of the data element.

    And then there's the time-stamped dimension technique, also called a transaction dimension. In this version, each row receives an effective date/time and an expiration date/time. This provides Type 2 functionality, but also allows point-in-time analysis of the dimensional data.

  8. Bridge Tables Perhaps the most confusing technique for the novice dimensional designer is the use of bridge tables. These tables are used when the standard one-to-many relationship between dimension and fact does not apply. There are three situations where bridge tables come in handy:

    An attribute bridge resolves situations where a dimension attribute may repeat multiple times. For example, a dimension table called "Company" may include an attribute called "Industry." Some companies have more than one industry. Rather than flattening into "Industry 1," "Industry 2," and so on, an attribute bridge captures as many industries as needed.

    A dimension bridge resolves situations where an entire dimension may repeat with respect to facts. For example, there may be multiple salespeople involved in a sale. Instead of loading the fact table with multiple salesperson keys, a dimension bridge gracefully manages the group of salespeople.

    A hierarchy bridge resolves situations where a recursive hierarchy exists within a dimension. For example, companies own other companies. At times, users may want to roll transactions up that occur beneath a specific company, or vice versa. Instead of flattening the hierarchy, which imposes limitations and complicates analysis, a hierarchy bridge can be joined to the transaction data in various ways, allowing multiple forms of analysis.

    All bridge implementations have implications for usage, or report building. Improper use of a bridge can result in double counting or incorrect results. Bridges also make deployment of business intelligence tools more difficult.

  9. Derived Schemas Useful stars can also be derived from existing stars. Often called "second-line" solutions, these derived schemas can accelerate specific types of analysis with powerful results. The merged fact table combines stars to avoid drilling across. The sliced fact table partitions data based on a dimension value, useful in distributed collection and analysis. The pivoted fact table restructures row-wise data for columnar analysis and vice-versa. And set operation fact tables provide precomputed results for union, intersect and minus operations on existing stars.

  10. Aggregate Schemas One of the reasons the star schema has become so popular is that it provides strong query performance. Still, there are times when we want results to come back faster. Aggregate schemas partially summarize the data in a base schema, allowing the database to compute query results faster. Of course, designers need to identify aggregates that will provide the most help, and queries and reports will receive no benefit unless they actually use the aggregate. Aggregates are usually designed as separate tables, instead of providing multiple levels of summarization in a single star. This avoids double counting errors, and can allow the exploitation of an automated query rewrite mechanism so that applications do not need to be aware of the aggregates.
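The drill-across process described in item 3 can be sketched as follows. The star names and result values are hypothetical; each dict stands in for the result set of a separate aggregate query against one star, summarized to a common grain.

```python
# Each star is queried separately, then the result sets are merged on their
# shared dimension attributes -- never joined in a single SELECT, which
# risks double counting.
def drill_across(results_a, results_b):
    """results_*: {common_attribute: fact_value} from two separate queries.
    Returns {common_attribute: (fact_a, fact_b)}, defaulting missing
    facts to 0."""
    merged = {}
    for key in set(results_a) | set(results_b):
        merged[key] = (results_a.get(key, 0), results_b.get(key, 0))
    return merged

orders    = {"Widget": 1000, "Gadget": 400}   # from a hypothetical orders star
shipments = {"Widget": 900}                   # from a hypothetical shipments star

report = drill_across(orders, shipments)
```

Because each star is summarized to the common attributes before the merge, facts at different grains never meet in a single join.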
I limited this list to 10 things. That's enough to make the point: a demo schema will only take you so far. When I teach Advanced Star Schema Design courses, I find that even people with many years of experience have questions, and always learn something new.

If you want to learn more, read the books recommended in the sidebar of this blog. Take a class on Advanced Star Schema Design. Interact with your peers at The Data Warehousing Institute conferences. And keep reading this blog. There will always be more to learn.

Related Posts:

Top 10 Things You Should Know About that Demo Schema

© 2007 Chris Adamson