Monday, September 25, 2017

Dimensional Models: Now More Than Ever

Do new technologies and methods render the dimensional model obsolete?



The top question from readers of this blog continues to be: "Is the dimensional model still relevant?"

It is easy to understand why people ask this question:

  • Our BI programs have expanded beyond data warehousing to include performance management, analytics, and governance functions. 
  • Methods have evolved and been streamlined, thanks to the application of agile principles. 
  • Technologies have advanced to include NoSQL solutions, schema-less paradigms, and virtualization options. 

In a recent article for TDWI’s Upside, I discuss these changes to our data management processes and their impact on the dimensional model.

The conclusion: these pressures actually increase the importance of the dimensional model.

Here are a few points made in the article:
  • Although NoSQL technologies are contributing to the evolution of data management platforms, they are not rendering relational storage extinct. It is still necessary to track key business metrics over time, and on this front relational storage reigns. In part, this explains why several big data initiatives seek to support relational processing on top of platforms such as Hadoop. Nonrelational technology is evolving to support relational; the future still contains stars.
  • [A dimensional view of data] grows in importance as the underlying storage of data assets grows in complexity. The dimensional model is the business's entry point into the sprawling repositories of available data and the focal point that makes sense of it all.
  • The dimensional model of a business process provides a representation of information needs that simultaneously drives the traditional facts and dimensions of a data mart, the key performance indicators of performance dashboards, the variables of analytics models, and the reference data managed by governance and MDM. In this light, the dimensional model becomes the nexus of a holistic approach to managing BI, analytics, and governance programs.

As we move to treat information as a business asset, the dimensional model has become a critical success factor. Yet many organizations are spread so thin that this critical skill is often missing. Be sure that doesn’t apply to your business!

For the full discussion, check out the article: Dimensional Models in the Big Data Era (Chris Adamson, April 12, 2017, TDWI’s Upside).



Learn More

Join me for three days of dimensional modeling education in New York next month!
  • TDWI New York Seminar, October 23-25.  Earn a certificate and 24 CPE credits.  Check out the sidebar of this blog for additional dates.  You can also bring my courses on site.

Monday, June 5, 2017

In Praise of the Whiteboard

Modeling tools are great, but early-stage modeling activities are best conducted on a whiteboard. A whiteboard is better for collaborative development and handles rapid change more effectively. And there are inexpensive solutions if one is not readily available.

When I lead seminars that cover modeling techniques, this question always comes up: “What’s the best modeling tool?” 

My answer: Start modeling activities on a whiteboard. Break out the software later, once the model has stabilized.

Using a whiteboard, you will get to a better solution, faster. 

I find this to be true across the spectrum of BI project types and modeling techniques. Examples include:
  • Dimensional models (OLAP projects)
  • Strategy maps (Performance Management projects)
  • Influence Diagrams (Business Analytics projects)
  • Causal loop diagrams (Business Analytics projects)
There are two reasons a whiteboard works best for this kind of work: it supports collaboration, and it is better suited to the rapid changes common in early-stage modeling. In short, a whiteboard is inherently agile.

Collaboration

The best models are produced by small groups, not individuals. Collaboration generates useful and creative ideas which reach beyond what a seasoned modeler can do alone. 

Each of the techniques listed above requires collaboration between business and technical personnel. And within either of these realms, a diversity of perspectives always produces better results. Brainstorming is the name of the game.

Use of a modeling tool quashes the creativity and spontaneity of brainstorming sessions. You may have experienced this yourself.

Imagine five people in a room, one person’s laptop connected to a projector. Four people call out ideas, but the facilitator with the laptop can only respond to one at a time. The session becomes frustrating to all participants, no matter how good the facilitator is. The team loses ideas, and participants lose enthusiasm.

Now imagine the same five people in front of a whiteboard, each holding a pen. Everyone is able to get their ideas onto the board. While this may seem like anarchy, it helps ensure that no ideas are lost and it keeps everyone engaged. The result is always a better model, developed faster.

Rapid change

The other reason to start on a whiteboard is practical: it is easy to erase, change, and redraw. And you will be doing a lot of these things if you are collaborating.

Imagine a group is sketching out a model, and decides to make a major change. Perhaps one fact table is to be split into two. Or a single input parameter is to be decomposed into four. If a modeling tool is in use, making the change will require deletion of elements, addition of new elements, and perhaps a few dialog boxes, check boxes, and warning messages. The tool gets in the way of the creative process.

Now imagine a whiteboard is in use. A couple of boxes are drawn, some lines erased, some new lines added. The free flow of ideas continues, uninterrupted. Once again, this is what you want.

Unimpeded collaboration produces better results, faster.

No whiteboard? No problem.

Don’t let the lack of a whiteboard in your cube or project room stop you.

There are many brands of inexpensive whiteboard sheets that cling to the wall; these are readily available from retailers such as Amazon.com. They have the additional benefit of being easy to relocate if you are forced to move to another room.



There are also several kinds of whiteboard-style notebooks. These are useful if you are working alone or in a group of two. They provide the same benefit of being able to collaborate and quickly change your minds, but in a smaller format. The one that I carry is called Wipebook.


I learned about these and similar solutions from clients and students in my seminars.

…and then the tool

All this is not to say that modeling tools are bad. To the contrary, they are essential. Once the ideas have been firmed up, a modeling tool is the next step.

Modeling tools allow you to get ideas into a form that can be reviewed and revised. They support division of labor for doing required “grunt work” — such as filling in business definitions, technical characteristics, and other metadata. And they produce useful documentation for developers, maintainers, and consumers of your solutions.

But when you’re getting started, use a whiteboard!

Monday, March 20, 2017

Tapping Into Non-Relational Data

Modern BI and analytics programs use non-relational data management for six key functions. You should be aware of these functions when you add non-relational technology to your data architecture.





Kinds of Non-Relational Storage

Most IT professionals are familiar with the RDBMS. Relational databases store data in tables that are defined in advance. The definition specifies the columns that comprise the table, their data types, and so forth. The design is referred to as a data model or schema.
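
For example, a table’s structure must be declared before rows can be stored. Here is a minimal sketch using Python’s built-in sqlite3 module; the table and column names are illustrative assumptions, not a recommended design:

    import sqlite3

    # The schema is declared before any data is loaded: the columns and
    # their data types are fixed up front. (Names are illustrative.)
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE customer (
            customer_id   INTEGER PRIMARY KEY,
            customer_name TEXT NOT NULL,
            city          TEXT
        )
    """)

    # Every row must conform to the declared structure.
    conn.execute("INSERT INTO customer VALUES (?, ?, ?)", (1, "Acme", "Helena"))
    conn.commit()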

Relational storage is immensely useful, but it is not the only game in town. There are several alternative types of data storage, including the following (see the sketch after this list):
  • Key-value stores store data in associative arrays composed of sort keys and associated data values. These flexible data structures can be distributed across nodes of commodity hardware, and are manipulated using distributed processing algorithms (map-reduce). Hadoop functions as a key-value store.
  • Document stores track collections of documents that have self-defining structure, often represented in XML or JSON formats. A document store may be built on top of a key-value store. MongoDB is a document store.
  • Graph databases store the connections between things as explicit data structures (similar to pointers). These are stored separately from the things they connect. This contrasts with the RDBMS approach, where relationship information is stored within the things being associated (keys within tables). Neo4j is a graph database. 
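
To make the contrast concrete, here is a minimal Python sketch of the same order represented in each paradigm. The structures and names are simplified illustrations only, and are not tied to any particular product:

    # Key-value: an associative array of keys and opaque values.
    kv_store = {
        "order:1001": '{"customer": "Acme", "total": 40.00}',
    }

    # Document: a self-describing structure (JSON-like). No schema is
    # required, and documents in a collection may differ in shape.
    order_document = {
        "order_id": 1001,
        "customer": {"name": "Acme", "city": "Helena"},
        "lines": [
            {"product": "widget", "qty": 3},
            {"product": "gadget", "qty": 1},
        ],
    }

    # Graph: connections are explicit data structures, stored separately
    # from the nodes they associate.
    nodes = {"c1": {"type": "customer"}, "o1001": {"type": "order"}}
    edges = [("c1", "PLACED", "o1001")]
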
Reasons for Non-Relational Storage

In the age of big data, organizations are tapping into these forms of storage for a variety of reasons. Here are just a few:
  • No model: not requiring a predefined schema enables storing new data that has yet to be explored and modeled.
  • Low cost: the cost of data management can be significantly lower than RDBMS storage.
  • Better fit: some business use cases are a natural fit to alternative paradigms.
But don’t simply add a box to your data architecture for non-relational data. You must plan for specific usage paradigms, and make sure your architecture and processes support them.

Uses for Non-Relational Storage

There are six major use cases for non-relational data in modern data architectures:
  • Capture  Non-relational storage facilitates intake of raw data.  New data sets are captured in non-relational data stores, where they become “raw material” for various uses. A non-relational data store such as Hadoop can be used to capture data without having to model it first. This is often referred to as a data lake. I prefer to call it a landing zone.
  • Explore  Non-relational storage facilitates exploration and discovery. Exploration is the search for value in new data sets. Exploration applies analytic methods to captured data, often combining it with existing enterprise data. The goal is to find value in the data, and to identify things that will be worth tracking on a regular basis. 
  • Archive  Non-relational storage serves as an archive. Data for which immediate value has not been identified is moved to an archive. From here it can be fetched for future use. Archiving data helps ensure the data lake does not become the fabled “data swamp.” An alternative to archiving unused data is simply to purge it.
  • Deploy  Non-relational storage supports production analytics. When value is found in data, it is transitioned to a production environment and processes are automated to keep it up to date. Deployments range from simple reports to complex analytic models.
  • Augment  Non-relational storage serves as a staging area for the data warehouse. In many cases, the insights gained from exploration prove valuable enough to track on a regular basis. Augmentation is the process of adding elements to a relational data warehouse that come from non-relational sources or analytic processes. A credit score, for example, might be incorporated into a customer dimension (see the sketch below).
  • Extend  Non-relational storage expands what can be maintained in the data warehouse. Sometimes there is lasting value in non-relational data, but it is not appropriate to migrate it to relational storage. In such cases, the non-relational data is moved to a non-relational extension of the data warehouse. Applications can link relational and non-relational data. For example, non-relational XML documents may be made available for “drill down” from a dimensional cube.
In addition to these six primary use cases, non-relational platforms may serve several utility functions. These include staging, data standardization, cleansing, and so forth.
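
To illustrate the augmentation case from the list above, here is a minimal sketch in Python using pandas. It assumes a customer dimension table and a set of credit scores produced by an analytic process; the column names are hypothetical:

    import pandas as pd

    # Existing customer dimension (columns are hypothetical).
    customer_dim = pd.DataFrame({
        "customer_key": [1, 2, 3],
        "customer_name": ["Acme", "Globex", "Initech"],
    })

    # Scores produced by an analytic process run against non-relational data.
    credit_scores = pd.DataFrame({
        "customer_key": [1, 2, 3],
        "credit_score": [720, 640, 810],
    })

    # Augment: add the derived attribute to the dimension.
    customer_dim = customer_dim.merge(credit_scores, on="customer_key", how="left")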

Learn More

Join me for my course Data Modeling in the Age of Big Data, offered exclusively through TDWI. At the time of this writing, it is offered next at TDWI Chicago on May 11, 2017. You can also bring this course to your site through TDWI Onsite Education. For more information, contact me.


Sunday, March 19, 2017

Data Alone Does Not Change People’s Minds

On NPR’s Hidden Brain podcast, cognitive neuroscientist Tali Sharot discusses the role of data in changing people’s behavior.

From Data to Action

The goal of analytics is to have a positive impact on the performance of your organization. To have an impact, you usually need to convince people to change their behavior.

This is required whether you want to convince a CEO to adopt a new strategy, a manager to allocate resources differently, or a knowledge worker to change their processes.

That’s why data visualization and data storytelling have become key skill sets for modern analytics professionals.


Data is Not Enough

How do you convince people to change their behaviors? Many analysts fall into the trap of letting the data speak for itself.

On a recent episode of NPR’s Hidden Brain podcast, cognitive neuroscientist Tali Sharot explains that data alone won’t do the job.

Most people are familiar with the concept of confirmation bias, where we tend to accept data that supports our existing opinions. Sharot suggests there are ways to override this kind of bias.

Some key takeaways:
  • People evaluate new information based on what they already believe
  • Strongly held false beliefs are difficult to change with data
  • Fear tends to lead to inaction, rather than action
  • Positive feedback or hope is a powerful motivator if you want to change people’s actions
This is a fascinating listen for anyone interested in telling stories with data. Not only does it offer suggestions on how to change people’s behavior, it also illustrates the power of tracking results and making them available to people.

I’ve pre-ordered Sharot’s upcoming book, The Influential Mind. You should too!

Recommended Podcast Apps

I have received a lot of positive feedback from people who enjoy listening to the podcasts I mention on this blog. Several people have asked me how to listen to podcasts.

You can, of course, simply click on the play button in the posts. But you can also subscribe to podcasts using a smartphone app. This lets you listen on the go, and also notifies you when new episodes are available.

Here are two apps I recommend if you use an iOS device:
  • Castro Podcast Player is perfect if you are new to podcasts, or if you subscribe to a handful of podcasts.
  • Overcast: Podcast Player is good for people who subscribe to a large number of podcasts. It is more complex, but allows you to set up multiple playlists and priorities.

Wednesday, January 25, 2017

Avoid the Unintended Consequences of Analytic Models

Cathy O’Neil’s Weapons of Math Destruction is a must-read for analytics professionals and data scientists.

In a world where it is acceptable for people to say, “I’m not good at math,” it’s tempting to lean on analytic models as the arbiters of truth.

But like anything else, analytic models can be done poorly. And sometimes, you must look outside your organization to spot the damages.

The Nature of Analytic Insights

Traditional OLAP focuses on the objective aspects of business information. “The person who placed this $40 order is 39 years old and lives in Helena, Montana.” No argument there.

But analytics go beyond simple descriptive assertions. Analytic insights are derived from mathematical models that make inferences or predictions, often using statistics and data mining.1

This brings in the messy world of probability. The result is a different kind of insight: “This person is likely to default on their payment.” How likely? What degree of certainty is needed to turn them away?

When you make a decision based on analytics, you are playing the odds at best. But what if the underlying model is flawed?

Several things can go wrong with the model itself:
  • It is a poor fit to the business situation
  • It is based on inappropriate proxy metrics
  • It uses training data that reinforces past errors or injustices
  • It is so complex that it is not understood by those who use it to make decisions
And here is the worst news: whether or not you manage to avoid these pitfalls, a model can seem to be “working” for one area of your business, while causing damage elsewhere.

The first step in learning to avoid these problems is knowing what to look for.

Shining a Light on Hidden Damages

In Weapons of Math Destruction, Cathy O’Neil teaches you to identify a class of models that do serious harm. This harm might otherwise go unnoticed, since the negative impacts are often felt outside the organization.2 She calls these models “weapons of math destruction.”

O’Neil defines a WMD as a model with three characteristics:
  • Opacity – the workings of the model are not accessible to those it impacts
  • Scale – the model has the potential to impact large numbers of people
  • Damage – the model is used to make decisions that may negatively impact individuals
The book explores models that have all three of these characteristics. It exposes their hidden effects on familiar areas of everyday life – choosing a college, getting a job, or securing a loan. It also explores their effects on parts of our culture that might not be familiar to the reader, such as sentencing in the criminal justice system.

Misaligned Incentives

O’Neil’s book is not a blanket indictment of analytics. She points out that analytic models can have wide ranging benefits. This occurs when everyone’s best interests line up.

For example, as Amazon’s recommendation engine improves, both Amazon and their customers benefit. In this case, the internal incentive to improve lines up with the external benefits.

WMDs occur when these interests conflict. O’Neil finds this to be the case for models that screen job applications. If these models reduce the number of résumés that HR staff must consider, they are deemed “good enough” to use. They may also exclude valid candidates from consideration, but there is no internal incentive to improve them. The fact that they harm outside parties may even go unnoticed.

Untangling Impact from Intent

WMDs can seem insidious, but they are often born of good intentions. O’Neil shows that it is important to distinguish between the business objective and the model itself. It’s possible to have the best of intentions, but produce a model that generates untold damage.

The hand-screening of job applications, for example, has been shown to be inherently biased. Who would argue against “solving” this problem by replacing the manual screening with an objective model?

This may be a noble intention, but O’Neil shows that it fails miserably when the model internalizes the very same biases. Couple that with misaligned incentives for improvement, and the WMD fuels a vicious cycle that can have precisely the opposite of the intended effect.

Learning to Spot Analytic Pitfalls

The first step to avoiding analytics gone awry is to learn what to look for.

“Data scientists all too often lose sight of the folks at the receiving end of the transaction,” O’Neil writes in the introduction. This book is the vaccine that helps prevent that mistake.

If you work in the field of analytics, Weapons of Math Destruction is an essential read.


Notes:

1. OLAP and Analytics are two of the key service areas of a modern BI program. To learn more about what distinguishes them, see The Three Pillars of Modern BI (Feb 9, 2005).

2. But not always. For example, some of the models explored in the book have negative impacts on employees.