Sunday, March 28, 2010

Pre-order Chris's New Book

Star Schema: The Complete Reference is now available for pre-order at Amazon.com in the US and in several other countries.

[ UPDATE 8/20/10:  Now shipping!  Use the links below to order. For more information, see this post.]

Use these links to pre-order your copy from Amazon in your country:
* At the time of this post, Amazon Canada was not yet taking pre-orders.  However, you can sign up to be notified when pre-ordering is possible.

    Star Schema: The Complete Reference
    By Christopher Adamson
    ISBN 0071744320 / 9780071744324
    McGraw-Hill Osborne Media

    Star Schema: The Complete Reference is Chris Adamson's comprehensive guide to dimensional modeling, written for anyone who works with star schemas, snowflakes or cubes. Its breadth and depth make it the definitive reference, ideal for beginners and experts alike.

    Offering clear and easy to understand explanations of complex topics, it teaches the best practices of dimensional modeling and their underlying rationales, equipping you to make informed design decisions.

    This reference is organized around dimensional design topics rather than vertical industry, so you can find what you need when you need it.

    Architecture-neutral, the book’s practical advice can be put to use in any type of data warehouse, including those advocated by W.H. Inmon and Ralph Kimball.

    Coverage begins with fundamentals, then follows a logical progression through advanced topics that address real-world complexity. All concepts are illustrated through real-world examples.

    The book also explores the impact of design decisions on the BI and ETL processes, and discusses situations where it is useful to deviate from best practices.

    The final chapter provides a comprehensive set of tasks and deliverables that can be incorporated into any project, regardless of architecture, scope or methodology.

    Tuesday, March 16, 2010

    Storing Nonadditive Facts [Q&A]

    It can be challenging to decide what to do with a nonadditive fact. A reader asks if (and when) it might make sense to store a percentage in a fact table:
    Chris,

    What are your thoughts on storing percentages in fact tables?  Many of the calculations seem too complex to do in the BI tool. For instance, we calculate life-to-date percentages of KPIs on products.

    Thanks,
    Adrienne
    Burlington, NC
    Percentages are (usually) non-additive facts.  What this means is that you cannot add them together.  For example, if a computer is sold at 5% margin, and then another is sold at 5% margin, it is a mistake to say that total margin for the two sales is 10%. 

    In introductory texts or courses, you are taught to decompose them into fully additive components.  The idea is that you can sum all the components, then compute the ratio at the last moment. The margin example above is a good application. Storing cost and sales price allows margin to be computed at any level of aggregation.

    However, this only really works when all the components for a particular ratio are stored in the same row, and used within the same scope of aggregation.   As you have discovered, there are some non-additive facts that do not work this way.  A ratio that compares something to a larger pool or category is a good example.  Two or more aggregations are needed to compute this kind of metric.

    It also sounds as if your problem is compounded by the need to use some kind of running-totals in the computation, which represent the lifetime-to-date sums.  These are also problematic because they are also non-additive, or possibly semi-additive.

    Non-additive measurements that are computed from different aggregations are often stored in fact tables for exactly the reason you give: they are difficult to automate in a BI tool. They are also precomputed in situations where they would otherwise harm performance.

    If you decide to pre-compute and store a non-additive fact, there are usually two questions that follow:  Where to store it, and how to use it.  The issues here are very similar to those surrounding period-to-date measurements, as I discussed in a post last year. 

    Where to Store a Nonadditive Fact

    Figuring out the answer to this question is closely linked to how it will be used.  If it is needed at a transaction-level, it must go in a transaction-grained fact table.

    However, if it also need to be known at a daily or monthly level, that solution is not enough.  Since it is non-additive, it must be computed separately for these aggregations.  A periodic snapshot makes sense here.  Consider forgoing the transaction-level version, unless it is needed as well.

    Also note that "time" is not the only dimension across which we might need a summarized non-additive fact.  For example, you might also want to see it across product categories, customer categories, and so forth.  Each would require its own snapshot-style fact table for discrete storage of the pre-computed fact.

    In fact, this is the biggest problem with storing non-additive facts:  each potential aggregation must be stored separately.

    Nonadditive Facts and BI Software

    Because they are nonadditive,  (and may involve constituents that are semi-additive), these kind of facts must be used carefully.  For example, they should never be summed, subtotaled, etc.  Similarly, if we need to show them at different levels of detail, we need to fetch each aggregation separately.

    This is relatively straight forward if the schema is well thought out, well documented, and fully understood by report developers.

    However, it is very hard to make this kind of fact available in a "Self-Service" mode for end users, since most BI tools will easily break both rules.  (BI tools can usually handle simple non-additive facts, like the margin example at the beginning, but not facts that are computed from multiple aggregations.)

    Alternatives may include:
    • Hiding the non-additive fact from users while exposing it to developers
    • Classifying it as a dimension so that it is not aggregated,  or 
    • Blocking certain features of the tool.
    The fact that a non-additive fact may be stored in multiple locations (corresponding to different levels of aggregation) may also befuddle your BI software.
    • It may be possible to make use of  "aggregate navigation" features to teach it to fetch a fact from more than one possible location, but you will need to make sure that the tool always gets it right.  In addition, using such a feature may prevent you from "turning off" the tool's ability to aggregate a fact.  
    • Separate semantic layers for developers vs. users may be the solution to these problems.
    Again, these considerations are less important in cases where trained developers do all the report building.  It is only when leveraging SQL-generating BI tools that you need to worry.

    Another Option

    In some cases, non-additive facts can be linked to a specific member of a dimension, such as an individual customer or  product.  In this case, consider storing it in the dimension table, as a behavioral dimension.  This is most effective when the measurement in question can be refreshed on a pre-defined schedule.

    Thanks to Adrienne for her question, and for consenting to have it reprinted here!

    - Chris

    Do you have a question about dimensional design or data warehousing?  Send it in.  I answer all my email, though sometimes it takes a while.

    Image is by Patrick Hosely licensed under Creative Commons 2.0

    Thursday, March 11, 2010

    "The Star Schema Handbook" Gets New Name, Coming Soon

    I am pleased to announce my next book:  Star Schema: The Complete Reference.   

    It is being published by McGraw-Hill, and should be available this summer. McGraw-Hill has posted some of the publication details and availability information.

    [ UPDATE 8/20/10:  Now shipping!  Use the link below to order in the US. For other countries, or for more information, see this post.]

    Amazon.com is not taking pre-orders yet, but they have added a page for the book, where you can sign up to be notified when becomes available.

    The book was formerly titled The Star Schema Handbook: The Complete Reference To Dimensional Data Warehouse Design 

    If you pre-ordered under the old title, you'll need to place a new order.

    Many thanks to all those of you who have been sending in inquiries!
    I will provide more information about the book during the weeks to come.

    - Chris

    Tuesday, March 2, 2010

    Just the Facts?

    A reader asks:
    I’m having a problem in picking out facts. For example are all dates facts -- start date, end date and DoB?  What about personal data like post code, job title, status (active /left)  etc. Would you say that if you cant do math on it, its not a fact?
    - [Name Withheld], UK
    Good question.  It allows me to cover the basics, as well as touch on some advanced topics.

    First, the Basics..


    Facts vs. Dimensions

    Facts are measurements.  I really prefer that word, because it is a bit more descriptive.  But the rest of the world already knows "Facts".

    Examples that are clearly measurements:
    • order dollars
    • gross pay
    • quantity sold
    • account balance
    In the original question, all the attributes mentioned are dimensions.  (Dates, personal data, and so forth.)

    Dimensions are used to give facts context.  You use dimension values to filter or group facts.  Here are some in action:
    • order dollars (fact) by order date (dimension)
    • gross pay (fact) by department code (dimension)
    • quantity sold (fact) for January (dimension value)
    You can usually "do math" on facts.  The most common example is adding them together, or aggregating them, as might be done in the examples above.

    Don't use that as a criterion to to identify facts, though.  Many numbers are not facts. And, to make matters more complicated, and some non-numbers are facts.

    Many Numbers are Not Facts

    Some numbers are really dimensions.  You can tell because we use them to give context to facts.  A numeric department code is obviously a dimension.  We wouldn't normally add the values together, but we might use them to break out a measurement -- for example: budget (fact) by department (dimension).

    Less obvious are things like "unit price".  It sounds like it might be a fact. After all, we can certainly add up money, right?   Actually, when we add up sales, we don't add up unit prices, we add up "extended prices".  Extended price is a fact.  It is equivalent to unit price times quantity.

    Look again, and you will also see that unit price behaves like the other dimensions we looked at.  We can use it to group or filter facts. "How many did we sell at a unit price of 10 cents?" That question filters a fact (quantity sold) by a dimension value (10 cent unit price).

    Some Facts are Not Numbers

    There are also facts that are not numbers.  These are rare, but they do occur.

    Test results or grades are the most common example.  You can't do math on "pass" or "fail," but they are measurements.  Facts like this are sometimes called "text facts."  A test_result fact that can take on the values "pass" or "fail" might be stored in a fact table that contains a row for each student taking a particular test on a particular day.

    If they take on a relatively low number of possible values, consider converting text facts into discrete facts.  In this case, we might replace test_result with two facts:  number_passed and number_failed.  For each row in the fact table, one of these will take on the value 1, the other the value 0.  This may seem wasteful, but now it is easy to get a sum(number_passed) or sum(number_failed.)

    Some Stars Have No Facts

    If you don't have any facts, you can still have a fact table.  Sometimes, the mere occurrence of an event is all you need to know about.  For example, you might use a fact table to track the fact that a employee was present on a given day.

    Factless fact tables can also track conditions.  For example, each year, circumstances might make employees eligible to participate in particular programs.   Aside from the dimensions (Employee, Year and Program) there might not be any facts to record.  That's OK, you can still count up rows that meet various conditions.

    - Chris

    Do you have a question?  Send it in.  I answer all my email, though sometimes it takes a while.

    Image: Jack Webb as Joe "Just the Facts" Friday,
    from the Public Domain