Tuesday, May 22, 2007

Data Security: Don't Be the Weak Link

Data Warehouse project teams dedicate a lot of time to protecting personal information—and rightly so. But don’t let your notebook computer become the weak link. I suggest three simple steps you can follow to protect yourself, your company and your customers.

The Letter

Imagine that you have just received a letter that has you more than a little upset. It came from a financial institution where you have an account. It reads,

We are writing to inform you of a recent incident which may have compromised some of your personal information…Recently, a notebook computer belonging to one of our employees was stolen out of a hotel meeting room.

The computer contained confidential information about many of our customers. Data on the computer included your account numbers, balance information, and social security number…

The letter goes on to say that the computer was password protected, and that most computers are stolen for the value of the hardwarenot because the thief is looking for confidential data. Still, they have set up a hotline, and will provide resources to support you in the event of identity theft.

Your Reaction

How do you feel? Probably angry. Why was that notebook left unsecured? Why was information about you on it in the first place? One of your first reactions might be to consider taking your account elsewhere.

Now imagine that it was your computer that was stolen. How comfortable would you be, knowing the theft of your computer forced your company to notify several thousand customers that their confidential information may have been breached?

Worse yet, imagine that you were a consultant working for the company. You have caused them a massive headache, and also jeopardized your own company’s contract with the financial institution.

Guarding Confidential Data in the Warehouse

Data warehousing teams are often entrusted with confidential information relating to an organization’s customers, employees and activities. We spend a great deal of time designing solutions that safeguard this information:

  • We create alternate identifiers so that social security numbers are not needed to link data sets together
  • Extract files are maintained on secure servers, and accessed only by specific users
  • We implement encryption on sensitive data elements
  • We manage ETL processing in a secure environment
  • We provide lookup functions that allowing users to find a match to a data element, without providing unfettered access to the element itself
  • We take steps to ensure that log files do not contain sensitive data elements
  • We develop database-level security at column and row levels, and provide user access on a need-to-know basis
  • We audit access to confidential data elements

Most of this work is focused on securing the processing, storage and use of sensitive data elements. And most of the focus is on servers. But what about your PC?

Don’t Be the Weak Link

As someone involved in the design, development or testing of an analytic system (or any system), you may have access to files or extracts that contain confidential data. Some of this could, quite innocently, land on your PC. Your organization may have strict policies about this access.

Whether or not they do, I encourage you to adopt your own policy as well. I do. And it is a simple one:

"Not on MY computer. Period."

Whether developing database design, architecting an ETL process, or developing a quality assurance process, we are all able to do our jobs without turning our notebook computer to the weak link in the security chain.

I refuse to accept any file that contains sensitive data, whether encrypted or not. When I need to review an extract file, I examine it on secure server, from a secure workstation. You get the idea.

By keeping this information off our notebooks, we avoid being the cause of one of those dreaded customer letters. By protecting ourselves, we also protect our customers and our company.

Guarding Your PC

Taking a few simple steps to avoid being in that uncomfortable position.

  1. Just say No

    Don’t accept any confidential data and store it, even temporarily, on your PC.

  2. Wipe it Out

    If you receive a file that turns out to contain confidential data, get rid of it. Use a “wipe” utility or “secure delete” to completely expunge it from your system. Notify the sender that you have done so, and request a “clean” version of the file.

    There are a variety of utilities that can wipe a file or your hard drive. A list can be found at the bottom of the DBAN page at Sourceforge. One free utility is Heidi Eraser. (Links appear at the end of this article.)

  3. Stay Up to Date

    A physical theft is not the only thing there is to worry about. Keep you computer software up to date. Yes, I.T. probably keeps you OS, firewall and critical applications up-to-date.

    But you’ve got other things on that machine. You know you do. Plug-ins. Utilities. Media players. They are innocent enough: iTunes, WinAmp, Acrobat Reader, Flash Player, Real Player, Java...But these applications and many others can also cause a breach on your machine.

    One place to check what needs updating is the Secunia Software Inspector (link at end of post), which scans for a wide variety of programs and tells you what security updates are needed. This check requires Java – be sure to disable it after you’re done.

So don't be the weak link. Protect your PC and you'll be protecting yourself, your company and your customers.

© 2007 Chris Adamson

Related Links (all open in new browser windows):

DBAN page at Sourceforge: http://dban.sourceforge.net/
List of hard drive wipe utilities appears at bottom of page.

Heidi Eraser (a GNU Licensed File Wipe Utility) http://www.heidi.ie/eraser/default.php

Secunia Software Inspector (free service scans your computer for insecure software versions) http://secunia.com/software_inspector/


Tuesday, May 1, 2007

10 Things You Should Know About that Sample Star Schema

Today, many of us learn about the star schema by studying a sample database that comes with a software productusually one that covers sales or orders. Here are 10 terms and principles of dimensional modeling to go with that sample schema you've worked with.

The star schema has become a de facto standard for the design of analytic databases. Sample stars are often included with RDBMS software, BI Tools and ETL tools. They are also used for tutorials and training. Almost universally, the sample schema describes a sales or order taking process, similar to the one depicted in the figure below:

Figure 1: A demo schema usually represents orders or sales.(Click to Enlarge)

You may have learned about the Star Schema by working with a sample like this one. If so, you probably have an intuitive grasp of star schema design principles. Here are ten terms and principles you should know that describe important features of the sample star.

Most of this is probably readily apparent if you've worked with a sample schema—what may be new is the terminology. The first two you probably know:

  1. Facts are measurements that describe a business process. They are almost always numeric—but not all numeric attributes are facts. You can find facts (or measurements) in almost any analytic request—"Show me sales dollars by product" (sales dollars). "How many widgets were sold my John Smith in May?" (quantity ordered). There are some schemas that do not include facts—we'll look at those in another post.

  2. Dimensions give facts context. They may be textual or numeric. They are used to specify how facts are "filtered" and "broken out" on reports. You can usually find dimensions after the words "by" or "for" in an analytic request. "Show me sales dollars by product" (product). "What are margin dollars by Month and Salesperson?" (month, sales rep).

  3. Dimension tables are wide. Dimension tables usually group together a set of related dimension attributes, though there are situations where a dimension may include a set of attributes not related to one another. Dimension tables are not normalized, and usually have a lot of attributes—far more than appear in most sample schemas. This allows a rich set of detail to be used in analyzing facts. 100 or more columns is not uncommon for some dimensions. For this reason, we often call dimension tables wide.

  4. Dimensions have Surrogate Keys. The primary key for each dimension table is an attribute specifically created for the dimensional schema. It is an integer assigned by the ETL process, and has no inherent meaning. It is not a reused key from a source system, such as a customer ID or product code. We call these attributes natural keys, and they may exist in the star, but do not serve as unique identifiers.

    In the sample schema, customer_key is a surrogate key generated for the star schema; customer_id is a natural key carried over from a source system.

    By assigning surrogate keys, we enable the star to handle changes to source data differently than the source system does. For example, in a source system a customer record may be overwritten, while we want the star schema to track changes. Performance considerations also come into play—a surrogate key avoids the need for multi-column joins.

  5. Type 2 Changes track history. The term "Slowly Changing Dimension" (or SCD) describes how the data warehouse responds to changes in the source of dimensional data. There are several techniques that can be applied when the source of dimension detail changes. The most common is referred to as a "Type 2" change: an entirely new record is written to the dimension table.

    For example, if a customer moves, the record may simply be updated in a source system. But in the star schema, we choose to add a new row to the customer dimension, complete with a new surrogate key. All prior facts remain associated with the "old" customer record; all future facts will be associated with the new record.

  6. Type 1 Changes overwrite history. The Type 1 change is used when source data changes are not deemed significant, or may be the correction of an error. In such cases, we perform an update to an existing row in a dimension. For example, if a customer's gender is updated in the source, we may choose to update it in the corresponding dimension records. All prior facts are now associated with the changed value.

    In addition to Type 1 and Type 2 changes, there are other SCD techniques. Hybrid approaches exist as well. Every design should identify which technique(s) will be used for each attribute of each dimension table.

  7. Fact tables are narrow. A fact table row is usually entirely composed of numeric attributes: the facts, and foreign key references to the dimensions. Because of these characteristics, each fact table row is narrow, at least in contrast with wide dimension rows full of textual values. The narrowness of fact tables is important, because they will accumulate far more rows than dimension tables, and at a much faster rate.

  8. Fact tables are usually sparse. Rows are recorded in the fact table only when there is something to measure. For example, not every customer orders every product from every salesperson each day. Rows are only recorded when there is an order. This helps manage the growth of the fact table. It also saves us from having to filter out a huge number of rows that have no sales dollars when displaying results in a report. (Usually, you don't want a customer sales report to list every product—only the ones they bought. You can use an outer join when you want the latter.)

  9. Fact Table Grain The level of detail represented by a row in a fact table is referred to as its grain. Facts that are recorded with different levels of detail belong in separate fact tables. This avoids an array of reporting difficulties, as well as kludges such as including special rows in dimension tables for "not applicable." Determining the grain of a fact table is an important design step and helps avoid future confusion. (There are times when "not applicable" attributes are necessary, but they are most often a sign of the need for another fact table.)

    In the example, the grain is sales by customer, product, salesperson and date. A better design might capture sales at the order line level of detail.

  10. Additivity. Facts are usually additive. This means they can be summed across any dimension value. For example, order_dollars can be aggregated across customers, products, salespeople, or time periods, producing meaningful results. Additive facts are stored in the fact table. We also store additive facts that might be computed from other facts. (order_dollars might be the sum of extended_cost and margin_dollars, but why include only two out of the three.?

    Some facts are non-additive. For example, margin rate is a percentage. Two sales at 50% margin do not equate to a single sale at 100% margin—this fact is not additive. In the star, we store the fully additive components of margin (order_dollars and margin_dollars) and let front end tools compute the ratio. There are also semi-additive facts, which we will look at in the next post.
Most of these terms and principles can be learned by working with a sample schema. But there are many important principles that the typical "Sales" model does not reveal. In a future post, I'll look at the top 10 things the demo schema does not teach you.

Related Posts:

Ten Things You Won't Learn from that Demo Schema

© 2007 Chris Adamson