Showing posts with label Architecture. Show all posts
Showing posts with label Architecture. Show all posts

Monday, March 20, 2017

Tapping Into Non-Relational Data

Modern BI and Analytics programs use non-reatlional data management for six key functions. You should be cognizant of these functions when you add non-relational technology to your data architecture.





Kinds of Non-Relational Storage

Most IT professionals are familiar with the RDBMS. Relational databases store data in tables that are defined in advance. The definition specifies the columns that comprise the table, their data types, and so forth. The design is referred to as a data model or schema.

Relational storage is immensely useful, but it is not the only game in town. There are several alternative types of data storage including:
  • Key-value stores store data in associative arrays comprised of sort keys and associated data values. These flexible data structures can be distributed across nodes of commodity hardware, and are manipulated using distributed processing algorithms (map-reduce). Hadoop functions as a key-value store.
  • Document stores track collections of documents that have self-defining structure, often represented in XML or JSON formats. A document store may be built on top of a key-value store. MongoDB is a document store.
  • Graph databases store the connections between things as explicit data structures (similar to pointers). These are stored separately from the things they connect. This contrasts with the RDBMS approach, where relationship information is stored within the things being associated (keys within tables). Neo4j is a graph database. 
Reasons for Non-Relational Storage

In the age of big data, organizations are tapping into these forms of storage for a variety of reasons. Here are just a few:
  • No model: not requiring a predefined schema enables storing new data that has yet to be explored and modeled.
  • Low cost: the cost of data management can be significantly lower than RDBMS storage.
  • Better fit: some business use cases are a natural fit to alternative paradigms.
But don’t simply add a box to your data architecture for non-relational data. You must plan for specific usage paradigms, and make sure your architecture and processes support them.

Uses for Non-Relational Storage

There are six major use cases for non-relational data in modern data architectures:
  • Capture  Non-relational storage facilitates intake of raw data.  New data sets are captured in non-relational data stores, where they become “raw material” for various uses. A non-relational data store such as Hadoop can be used to capture data without having to model it first. This is often referred to as a data lake. I prefer to call it a landing zone.
  • Explore  Non-relational storage facilitates exploration and discovery. Exploration is the search for value in new data sets. Exploration applies analytic methods to captured data, often combining it with existing enterprise data. The goal is to find value in the data, and to identify things that will be worth tracking on a regular basis. 
  • Archive  Non-relational storage serves as an archive. Data for which immediate value has not been identified is moved to an archive. From here it can be fetched for future use. Archiving data helps ensure the data lake does not become the fabled “data swamp.” An alternative to archiving unused data is simply to purge it.
  • Deploy  Non-relational storage supports production analytics. When value is found in data, it is transitioned to a production environment and processes are automated to keep it up to date. Deployments range from simple reports to complex analytic models.
  • Augment  Non-relational storage serves as a staging area for the data warehouse. In many cases, the insights gained from exploration prove valuable enough to track on a regular basis. Augmentation is the process of adding elements to a relational data warehouse that come from non-relational sources or analytic processes. A credit score, for example, might be incorporated into a customer dimension.
  • Extend:  Non-relational storage expands what can be maintained in the data warehouse. Sometimes there is lasting value in non-relational data, but is not appropriate to migrate it to relational storage. In such cases, the non-relational data is moved to a non-relational extension of the data warehouse. Applications can link relational and non-relational data. For example, non-relational XML documents may made available for “drill down” from a dimensional cube.
In addition to these six primary use cases, non relational platforms may serve several utility functions. These include staging, data standardization, cleansing, and so forth.

Learn More

Join me for my course Data Modeling in the Age of Big Data, offered exclusively through TDWI. At the time of this writing, it is offered next at TDWI Chicago on May 11, 2017. You can also bring this course to your site through TDWI Onsite Education. For more information contact me.


Sunday, April 24, 2016

Chris Adamson on Modeling Challenges

In a recent interview, the folks at WhereScape asked me some questions about data modeling challenges.



In Business Intelligence, modeling is a social activity. You cannot design a good model alone. You have to go out and talk to people.

As a modeler, your job is to facilitate consensus among all interested parties. Your models need to reflect business needs first and foremost. They must also balance a variety of other concerns — including program objectives, the behavior of your reporting and visualization tools, your data integration tools, and your DBMS.

It’s also important to understand what information resources are available. You need to verify that it is possible to fill the model with actual enterprise data. This means you need to profile and understand potential data sources. If you don’t consider sources of data, your designs are nothing more than wishful thinking.

When considering a non-relational data sources, resist the urge to impose structure before you explore it. You’ve got to understand the data before you spend time building a model around it.

Check out the video above, where I discuss these and other topics. For a full-sized version, visit the WhereScape page.

Monday, August 15, 2011

Multiple Stars and Conformed Dimensions

The concept of conformed dimensions is central to the discipline of dimensional modeling. This post introduces conformed dimensions; future posts will explore them in more detail. 

In your data warehouse, each star corresponds to a business process. Combining facts from different processes can produce powerful compound metrics. The key to making this work is a set of conformed dimensions.

Conformed dimensions are closely associated with Kimball's "Bus Architecture," but are crucial in any scenario that involves dimensional data.

Multiple Stars

In a dimensional design, each star captures collects measurements that describe a discrete business process.

If we have two measurements that describe different processes, we place them into separate fact tables.

For example, a sales data mart may contain multiple stars:
  • Proposal information by Salesperson, Prospect, Product, Proposal and Proposal Date
  • Order information by Salesperson, Customer, Product, Proposal, Contract and Order Date
  • Shipping information by Salesperson, Customer, Product, Proposal, Contract, Shipment, Shipper, and Shipment Date
  • Return information by Salesperson, Customer, Product, Contract, Reason and Return Date
In each of these stars, the fact table will record measurements that describe the processes of issuing proposals, taking orders, shipping product and handling returns.

By recording measurements of each process in a different star, we are able to capture information at the most detailed level possible.  We can study each of these processes, complete with attendant details, by accessing the appropriate star.

Cross-process Metrics

Some of the most powerful measurements actually combine information from multiple processes. These metrics require combining facts from different stars.

In the sales data mart, the ratio of proposals to orders is the "close rate," a powerful indicator that sales managers and executives look at on a regular basis.

Similarly, the ratio of shipments to returns is the "return rate," an essential quality control metric.

Drilling Across

When we compare facts from different stars, we don't simply join the fact tables.  To do so might cause double-counting of some facts.

Instead we follow a process that Kimball calls drilling across.  The drill-across process can be broken down into two phases.  In the first phase, each star is queried, and results are aggregated to a common level of detail.  In the second phase, these result sets are merged based on their common dimensions.

For example, to compute the return rate by product for August of 2011, we do the following:
  1. a. Figure out quantity shipped by product for August 2011
    b. Figure out quantity returned by product for August 2011
  2. Merge these amounts based on the common product names and compute the ratio
We may perform this drill across operation at query time (many BI tools can do this automatically), or we may do it at ETL time, storing the results in a separate star or cube (sometimes called a second-line data mart.) 

Conformed Dimensions

The key to making all this work is the organization of the dimensions.  As you saw in the example above, we used the dimension values to link our results together: product names were used to merge together shipment and return quantities and compute their ratio.  This would not have worked if the product dimensions for Shipments and Returns had been different.

This is the basic idea behind conformed dimensions.  We say that two dimensions conform if they have the same structure and content.  Both our stars, had a product dimension table with a product name attribute, and the product names were specified the same way in each.  Sharing a single physical table is one way to ensure conformance, but it is not required.

Two dimensions can also conform if one has a subset of the other's attributes.  As long as the common attributes have the same structure and content, they are said to conform.

Planning Conformance

By planning a set of conformed dimensions, we ensure that fact tables can be used to compare processes.  This is important within a single data mart, such as the one above, and it is also important when looking across multiple data marts.

Conformed dimensions are the organizing principle in Kimball's architecture.  Conformed dimensions are planned up-front, as a part of a project that establishes a data architecture based on dimensional design. Implementation proceeds once this conformance bus has been planned.

The concept is also important in other architectures. For example, the sales data mart discussed above might be part of Corporate Information Factory architecture.  Within this data mart, conformance guarantees we can compare shipments to returns, proposals to orders, and so forth.

More to come

In the coming weeks, I will post more about conformed dimensions.  We will look at "levels" of conformance, how to document conformed dimensions, and how different tools work with conformed dimensions. If you have my book, I also encourage you to read Chapters 4 and 5, which look at these concepts in detail.

Photo by Agnes Periapse, licensed under Creative Commons 2.0

Tuesday, March 29, 2011

Q&A: Dimensional vs ER Model for the Data Warehouse

A question about the fundamental data architecture of the data warehouse:

Q: We are embarking on designing a warehouse for our BI efforts.  Someone asked if we should create the warehouse in a dimensional structure, rather than the typical ER structures.

Everything I’ve read and seen has advocated the warehouse be ER modeled at the lowest level of detail appropriate and then for specific mart needs, the data is modeled dimensionally at a grain that is consistent with the needs for that mart, utilizing conformed dimensions.

My question then is, have you seen any instances where a Warehouse (or BI Warehouse) has been designed and built successfully utilizing a non ER structural approach?
Brenda
Iowa

A: Your question deals with the primary difference between two popular data architectures -- those associated with W.H. Inmon and Ralph Kimball.

Two Architectures

Normalized ER models are great for capturing data, but very difficult to use for analytic reporting.  That's why Inmon believes that, although the enterprise data warehouse should be normalized, the data marts should be dimensional. This approach is the one you have read about.

But it is also possible to use a dimensional model as an integrated repository of atomic data. That is the Kimball philosophy -- dimensional all the way. This repository may be centralized (physical) or distributed (logical) repository. In either case, conformed dimensions ensure compatibility across subject ares.

I explain both architectures in my latest book, but I don't argue for one over the other.  This was a conscious choice, because I wanted it to be useful to anyone who uses dimensional data. I take the same approach in the classes I teach.

Dimensional Works Just Fine

The answer to the original question: Yes, I have seen many data warehouses that are successful without an ER layer.  

Keeping data in a dimensional format does not harm the ability to integrate data or to capture detail. (In some cases it reduces the amount of work needed to load all the data structures.) It also allows dimensional constructs to guide scope from requirements to completion, without translation back and forth into an ER format. Secondary dimensional structures ("derived schemas" or "second line data marts") can always be added as required for specific reporting needs. 

Other Approaches Work Too

Keep in mind that other approaches can be successful, as I have discussed before. Most of us won't get to make the choice that Brenda is considering.  If you already have an Inmon-style architecture in place, you should not change it simply for philosophical reasons. 

The motto of this blog is: be pragmatic, not dogmatic.

- Chris


Image licensed via Creative Commons 2.0
 from Patrick Hoesley

Wednesday, May 19, 2010

Kimball's Approach is Top-Down

Ralph Kimball's approach to data warehousing is frequently mis-characterized as being "bottom-up." This post aims to clear up that misconception.

Bus Architecture

Kimball's bus architecture (or dimensional data warehouse architecture) is an enterprise architecture.  At its core, a set of conformed dimensions ensure a consistent representation of standard terms and data elements across multiple subject areas.  The conformed dimensions describe important things like products, customers, locations, or anything of significance to the business.


The subject areas  are called data marts.  They represent things like manufacturing, sales, invoicing, receivables and so forth.  Data marts don't need to be implemented all at once.  They can be implemented one at a time, as part of an incremental program. Data marts also don't need to be stored in a single database (although they may.)  When they are stored in different databases, the conformance bus ensures consistency and compatibility.

Top-Down

Kimball advocates planning a set of conformed dimensions as an up-front (i.e. strategic) activity.  The conformance bus then serves as the blueprint for a set of integrated data marts, which can be built on whatever schedule makes the most sense.

Kimball and Ross put it this way:
During the limited-duration architecture phase, the team designs a master suite of standardized dimensions and facts that have uniform interpretation across the enterprise...We then tackle the implementation of separate data marts in which each iteration closely adheres to the architecture.

- From The Data Warehouse Toolkit, Second Edition
by Ralph Kimball and Margy Ross (Wiley, 2002)
Because it begins with an enterprise-level framework, then delivers departmental functionality, this is a top-down approach. 

Bottom-Up

A bottom-up approach is one that moves in the opposite direction, beginning with a departmental focus and later evolving into one that has an enterprise focus. This occurs when organizations build stand-alone data marts, then later decide to integrate them.  

Stand-alone data marts are designed and built for departmental use, without an enterprise context. They are cheaper in the short-run, offering a fast path to quick results.  Stand-alone data marts also arrive due to mergers and acquisitions, or through packaged software. 

When there is more than one stand-alone data mart, however, they are likely to exhibit incompatibilities and inconsistencies.  They are sometimes labeled "stovepipes." Faced with these inconsistent data marts, some organizations resolve to retrofit them into a conformance framework.  This can be a difficult and expensive process, requiring extensive rework.

When stand-alone data marts are successfully brought into conformance, a bottom-up path has been followed--one that starts with a departmental solution and moves to enterprise capability.   Bottom-up development is cheaper in the short term but more expensive in the long term. 

While the end result may be compatible with Kimball's vision, clearly the route is not.  If this is news to you, you might want to check out his book. (The link appears beneath the quotation above.)  You can also consult posts on data warehouse architectures and common misconceptions.

-- Chris

Image:  PCI Slot by  Ryan_Franklin_az
Licensed under Creative Commons 2.0

Tuesday, April 10, 2007

Three Myths About Star Schema

Think Bill Inmon is anti-star schema? Think again. This post debunks three common myths about star schema design.

As I wrote in a previous post, there are three styles of data warehousing that incorporate star schema design: Kimball’s Dimensional Data Warehouse, Inmon’s Corporate Information Factory, and what I call the “stand-alone” data mart.

These architectures make use of the star schema in different ways. And for proponents of each approach, terms like “enterprise data warehouse” and “data mart” have very different connotations. This confusing situation has given rise to some misconceptions about the star schema.

In this post, I will set the record straight on three common myths.

Myth #1: Bill Inmon is anti-Star Schema

Perhaps the most common misconception about the star schema is that Bill Inmon is opposed to it. In fact, Inmon advocates the use of the star schema to build data marts as part of his Corporate Information Factory architecture. Where he does not want dimensional design used is for the granular repository of enterprise data from which data marts are drawn.

So where did this myth come from? In part, it stems from a common oversimplification of data warehousing theory, which distills a large set of technical issues into the catch-phrase “Kimball vs. Inmon.” Because the star schema has become so closely associated with Kimball, "Kimball vs. Inmon" is often misconstrued as a debate over star schema.

Lest you have any doubts that Inmon advocates the use of star schema, all you need to do is peruse some of his work. Or, search out the first edition of Kimball’s seminal work, The Data Warehouse Toolkit. The forward to the first edition was penned by none other than…Bill Inmon.

Myth #2: The Star Schema is for Aggregated Data

I frequently encounter people who believe that the star schema is used only for aggregated data. While it can be used for aggregated data, it does not have to be used that way. In fact, in a Kimball-style Dimensional Data Warehouse, one guiding principle dictates that fact tables should store the most granular data possible. That is quite the opposite of aggregated data.

So why the confusion? In other architectures, this principle may not be followed. A star schema data mart in an Inmon-style data warehouse may aggregate information that is extracted from a repository of atomic data. A stand-alone data mart may aggregate data drawn from source systems. And even in a Kimball-style architecture, star schemas may be supplemented by aggregate schemas designed to improve performance. (In fact, I’ve written an entire book on this topic.)

But none of these uses of the star preclude its use for the storage of granular, atomic data. Don't fall into the trap of assuming a particular application of the technique fully describes its capabilities or limitations.

Myth #3: Star Schema Data Marts Are Stovepipes

I hear this one in many flavors: using a star schema design leads to “stovepipes” or to “silos” or to “islands of information.” Another variation holds that star schema is for "bottom-up" development. The notion is that if you have multiple star schema data marts, they will not work together.

Whatever words are used, the characterization is false. This myth has probably arisen from a particular application of star schema design: the stand-alone data mart. When a data mart is built in the absence of an enterprise context--a standalone data mart--it serves an immediate purpose, but may later found to be incompatible with analytic requirements in related subject areas.

This sort of "failure" is often observed in stand alone data marts. But it hardly makes sense to blame such failures on the star schema design. You can build a car without a steering wheel, but that does not mean that cars can only drive in straight lines.

In fact, star schema can be very effectively used to avoid stovepipes. The principle of dimensional conformance can be applied, whether designing stars as part of a Dimensional Data Warehouse (Kimball) or a Corporate Information Factory (Inmon). This design technique ensures that common dimensions are used consistently across subject-areas, allowing data from multiple stars to be easily combined into a single report. In technical-speak, it allows applications to “drill-across” schemas or subject areas.

The result is a set of data marts that serve departmental needs, but can also be combined to answer questions that span subject areas. They may be built incrementally, but they work together. Even builders of stand-alone data marts can take measures to ensure that the solution will conform with future data marts. (I'll tackle that topic in a future post.)

So remember: Inmon is not anti-star schema. The star schema is not limited to aggregate data. Star schema data marts do not have to be stovepipes.

Copyright (c) 2007 Chris Adamson

Monday, March 26, 2007

Three Data Warehouse Architectures that Use Star Schema

Whether you follow Kimball or Inmon, or you just build stand-alone data marts, this Blog offers practical advice for your star schemas.

Most data warehouses have a data architecture that fits into one of three categories. These styles of data warehousing often use the same terms to mean very different things. The term “data mart,” for example, has very different connotations for proponents of each of the different architectures. As a result, there is a lot of misinformation about the relative merits of each approach.

But dig a little deeper, and it turns out all three architectures share something in common. Each architecture has a place for the star schema – a set of tables designed according to the principles of dimensional modeling.

Three Data Architectures

At the risk of fanning the flames of a debate that is often characterized as “religious,” here is a very brief summary of each of the three styles of data warehouse design.

Corporate Information Factory This approach is advocated by Bill Inmon, Claudia Imhoff, and many others. It involves a central repository of detailed or “atomic,” data, organized in third normal form. Separate departmental data marts reorganize portions of this data for departmental purposes. These data marts may be built using star schema design.

Dimensional Data Warehouse This approach is advocated by Ralph Kimball, Margy Ross and many others. Atomic data from across the enterprise is stored in star-schema format. The star schemas are designed with a set of “conformed dimensions” allowing them to work together. Data marts are not separate entities, but subject areas within this set of star schemas.

Stand-Alone Data Marts Although there are no major proponents of this approach, it is very popular. Stand-alone data marts focus on a particular business process or department, and are not designed in an enterprise context. Stand alone data marts may be built using star schema designs, or follow other approaches.


This information is summarized in the table below, which also includes some alternate terms used to describe each approach.

Keep in mind that this is a very high level summary, and it is here only to illustrate how the star schema factors in. I encourage readers to seek out more information on each of these approaches.

Something in Common: The Star Schema

Despite the furor over which of these architectures makes the most sense, it should be obvious that they have something in common: each has a place for the star schema. And in all cases, the principles of star schema design are the same.

Regardless of the role of the star schema in your architecture, it must be designed carefully. A poor design will lead to issues, whether you are building a corporate information factory or a dimensional data warehouse. A good design, coupled with proper expectations for your data warehouse architecture, will yield success.

Pragmatic, Not Dogmatic

In my first book on data warehouse design, there is a chapter on data warehouse architecture which advocates the dimensional data warehouse approach. But I have designed star schemas that are used successfully in all three architectures, and the principles of dimensional modeling are the same.

Sometimes people are surprised that I am willing to be involved in projects that follow all three approaches, or that I don’t automatically move to throw away imperfect solutions. But the investment of resources in a data warehouse is a business decision. Design principles may factor in, but so do a variety of other factors – factors over which I am not qualified to weigh in.

Instead, I try to be pragmatic. Yes, I have my preferred approach. But what makes the most sense for the business? Is there already a significant investment in a different architecture? Is there simply not funding for the ideal solution?

The information and resources in this blog are meant to help you, regardless of your architecture. If you use the star schema for anything, you will find help here. From time to time, I will look at concerns specific to one style or another, but this will always be clearly spelled out. Most of the time, we will look at principles that have applicability across the board.


Copyright © 2007 Chris Adamson