Working With External Data (Part 2 of X)

February 2, 2010

This is the second post of a series related to working with external data for analysis or modeling purposes. You can read the first post HERE or read the “cliff notes” summary below.


Part 1 “Cliff Notes”

1.    Know what information you are hoping to derive from the data.
2.    Methodically narrow down your data for relevant data points. More data does not guarantee better or more accurate information.
3.    Some refinement considerations include:
a.    Time frames. Limit your data set to a span of time commensurate with a minimum level of technology as well as a consistent expectation of regulatory / industry standard requirements.
b.    Good fit. Consider points related to your industry, service offering / value proposition, and loss form categories.
c.    Duplicate records. When working with multiple external data sources, keep an eye out for duplicate event records.
d.    Consistency. Be consistent in how you analyze data points.

NOTE: Collecting and right sizing external data is useful for comparing your internal loss events to external loss events, understanding what a worst-case loss could like for your company and possibly incorporating into a data set to be used for modeling purposes.


In this post, I want to focus on right-sizing data points in your data set commiserate with the size of your company.

1.     Determine the minimum value where right-sizing is not worth the effort. When faced with hundreds or even thousands of data points – there is going to be a “magic number” of records where the number of records lost does not warrant right sizing. The more you understand your business processes and partner with the appropriate stakeholder (business partners, marketing, legal, privacy, etc…) the easier making this determination should be. They *should* be the subject matter expert(s) on these matters and be leveraged whenever possible.

TIME-OUT. For the uber-privacy / legal folks our there, the loss of just one record is not desirable. But, we have to be reasonable and acknowledge that are thresholds where the “squirm factor” varies.

2.    Understand the type of data lost / compromised. This is really easy to overlook. Some data loss events involve just customer (consumer) data, other events may include just employee data, and some events may include both types of data. Understanding the type of data lost could prove useful in determining which right-sizing method to use.

3.    Right-sizing factors (proportions). This is where things get interesting. It also where objectivity and consistency have to be demonstrated. Whether we are performing risk assessments, right-sizing data points, or collecting information to draw conclusions from – it is important that we are as objective as possible; reducing subjectivity whenever possible. The key point I want to make here is that if used appropriately and consistently, a right-sizing exercise is more objective in nature then stating it happened to company X so it could happen to us without any analysis whatsoever. You may want to make a brief note as to why you chose a certain right-sizing factor in case you need reminded at a later point for whatever reasons. Let’s look at a few right-sizing options (keep in mind that we are building upon what we covered in the first post):

a.    # of Employees. If a data point is from a company in the same industry or has the same value proposition – using number of employees could be a good right-sizing factor. Some inferences can be made around the number of employees. Is it unreasonable that if a company half of your size loses 10,000 records and is obligated to protect data equally as well as your company, that the same event could happen to your company for 20,000 records? Using number of employees is also useful when the data point only involves confidential employee data.

b.    Revenue. Revenue could be another right-sizing factor. Maybe the data point is from a different industry where staffing types / level differ from your industry but the value proposition is related (property & casualty insurance versus health insurance).

c.    Equity. In some cases, equity can be used as a right-sizing data point. # of employees or revenue may not be appropriate or the proportions could be unrealistic. Equity could be a third option.

d.    Others. There could be some other right sizing factors depending on your industry or the problem you need to make decisions for. Just make sure that whatever that factor is, that it is generally regarded as a sound comparative measurement factor and document any assumptions. Again, I cannot underscore the need for objectivity and consistency.

NOTE: Keep an eye out for some data points that have been right-sized but the calculated value far exceeds the total number of records you have in your organization. Consider another right-sizing factor or change to the maximum amount of records you have; and of course, document. Some situations may warrant keeping the value (like a risk assessment related to merger or acquisition where the number of records you are obligated to protect now- doubles, triples, etc…). What you don’t want is someone calling you out on a data point that is not even realistic for your organization because of a simple oversight (I speak from experience).

4.    Sources of Information. In order to right-size data points – especially in the context of the factors above (employees, revenue, equity), you have to get information about the companies related to that external data point. I would submit that the information you need is available in most of the cases – it is just a matter of time and creativity.

a.    Internal information. You need to collect your own internal information for the right sizing factors before you can right-size against external data point company information. HR may be able to give you # of employees by year and hopefully there are numerous internal authoritative resources that have your revenue and equity values (if you are publically traded – this information is publically available; though you should still confer / validate with internal sources).

b.    External information. Be Creative. Yahoo business, Dun & Bradstreet, company name Google searches combined with “annual report” or “corporate filings”, company websites and Fortune 100, 500 or 1000 lists; these are stating points. Just remember to make sure you are right-sizing in the context of the same year the incident occurred in – and be consistent.

In the next post for this series, we will look at analyzing a right-sized data set to begin collecting information. For example, does the data resemble a statistical model? Does it resemble your internal data points? What if my data set has too few data points?