Working With External Data (Part 1 of X)

November 21, 2009

In early October I began reviewing three external data repositories containing “loss event” data. I think it is important to state up front that what you are about to read is the result of being guided by a real risk modeler at the company I work for. Modelers are methodical and consistent, and they have high expectations of quality – sort of like engineers. I understand information security; he understands modeling. I get to do the mundane work – he gets to build the mathematical relationships and distributions. No matter what, though, I have to be able to explain everything in the model as well as maintain it going forward. So, in this series, I want to share some observations and lessons I learned from the “gathering external data” exercise.

Really understand what you are looking to get from the data.
It is too easy to jump into these data sets, perform some simple statistical calculations, and then communicate outrageous findings to an audience. For my employer’s purposes, we wanted to use “some” of this external data in a loss model – specifically, to help establish a distribution of the possible number of records that could be lost, and of the potential loss magnitude per event, across various types of security incidents. (Notice I said possible, not probable.) The reality is that most companies do not have dozens, let alone hundreds, of loss events with which to develop loss models without resorting to external data. So, one of the benefits of using external data in a loss model is that it can really help you understand worst-case loss magnitude, also known as “tail risk”; internal data will tend to have more influence on the mean of the model. For two of the data sources – dataloss.org and privacyrights.org – the number of records lost was the key data point. For the third, a non-public data consortium source, the cost of security-related events (not necessarily data loss events) was the most useful. Below are some considerations for narrowing a data set down from all data points to “some” (a sketch of how the narrowed set can feed a tail distribution follows these considerations).

a.    Time. Technology and the regulatory landscape change quickly. Thus, it is preferable to limit data points to a period where a minimum level of technology can be assumed, along with a consistent expectation of regulatory / industry standard requirements. For our purposes, we only chose data points dating back to 2005. Again, this time range will vary from model to model, person to person, company to company, and industry to industry (see the filtering sketch after Note 2 below).

Note 1: One record in the dataloss.org set goes back to 1903. Seriously.

Note 2: In the dataloss.org data set dated 9/30/2009, there were 2013 data points. Using only records from 2005 through 9/30/2009 reduced the set to 1945.
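As a minimal sketch of this kind of time filter, here is what it could look like in Python. The file name and the “Date Made Public” column are hypothetical stand-ins for whatever the actual export uses:

    import csv
    from datetime import date

    # Hypothetical file and column names -- adjust to the actual export.
    # Dates are assumed to be in YYYY-MM-DD form.
    WINDOW_START = date(2005, 1, 1)
    WINDOW_END = date(2009, 9, 30)

    def in_window(row):
        """Keep only events disclosed within the chosen time window."""
        disclosed = date.fromisoformat(row["Date Made Public"])
        return WINDOW_START <= disclosed <= WINDOW_END

    with open("dataloss_export.csv", newline="") as f:
        events = [row for row in csv.DictReader(f) if in_window(row)]

    print(len(events))  # for the 9/30/2009 set: 2013 rows down to 1945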

b.    Good Fit. Not all data points are a good fit for your analysis. Security control expectations vary from industry to industry, so you need a way of methodically reviewing data points to determine which ones fit. Below are just a few considerations (a sketch of encoding them as a filter follows Note 4):

i.    Industry. Most data sets are not industry specific – they contain data points spanning all kinds of industries. The transportation industry has a different value proposition than the financial services industry. So, depending on your model, points outside your industry may not be relevant.

ii.    Service or Value Proposition. Somewhat related to industry, but some services and value propositions are shared across industries – think of health care insurance and property and casualty insurance. Both have to protect confidential information. This does not mean that if I am in the financial services industry I would include ALL healthcare industry data points; it just means I am acknowledging there is a shared value proposition and that some data points – depending on the loss form – can be used for my purposes.

iii.    Loss Form Categories. When I talk about loss form categories, I am referring mostly to the Basel II Operational Risk Categories (Level 1): “Internal Fraud”, “External Fraud”, “Employment Practices and Workplace Safety”, “Clients, Products & Business Practices”, “Damage to Physical Assets”, “Business Disruption and System Failures”, and “Execution, Delivery & Process Management”. Most data loss events will map to only a few of these categories, and in some instances the categories may not even be applicable to your needs, your company, or your industry. Still, classifying each data point against these categories – or another category framework more relevant for your company / industry – lets you refine your data set in a methodical and unbiased manner.

Note 3: After applying my good-fit criteria, the number of dataloss.org data points used in my model dropped from 1945 (Note 2 above) to 84.

Note 4: Of those 84 data points, 9 were categorized as “Internal Fraud”, 37 as “External Fraud”, and 38 as “Execution, Delivery & Process Management”.
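To make the good-fit review methodical, the criteria can be encoded as an explicit filter plus a category map. A minimal sketch of that idea – the industry list, the breach-type values, and the mapping below are illustrative assumptions, not my actual criteria:

    # Illustrative good-fit filter. The industry list and the breach-type ->
    # Basel II mapping are made-up examples, not the actual review criteria.
    RELEVANT_INDUSTRIES = {"financial services", "insurance"}

    BASEL_II_LEVEL_1 = {
        "insider theft": "Internal Fraud",
        "hacking": "External Fraud",
        "lost laptop": "Execution, Delivery & Process Management",
        "misdirected mail": "Execution, Delivery & Process Management",
    }

    def classify(event):
        """Return a Basel II Level 1 category, or None if the event is not
        a good fit (wrong industry or unmapped loss form)."""
        if event.get("industry", "").lower() not in RELEVANT_INDUSTRIES:
            return None
        return BASEL_II_LEVEL_1.get(event.get("breach_type", "").lower())

    good_fit = []
    for event in events:  # 'events' is the time-filtered set from above
        category = classify(event)
        if category is not None:
            event["basel_category"] = category
            good_fit.append(event)

The point of writing the criteria down as code (or even just as a checklist) is that every data point gets judged the same way, which matters for the consistency issue discussed below.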

c.    Duplicate Records. When you are using multiple data sets, you have to assume there is duplication of data points between them. This was definitely the case for the dataloss.org and privacyrights.org data sets. To compound matters, expect that for a certain percentage of the duplicates the details will differ. This is not a super big deal – just understand that you will have duplicate data points and will have to choose one of them (see the matching sketch after Note 5).

Note 5: OK, for some duplicates the variance in the details may be so wide – with neither time to determine which record is more correct nor a valid source to settle it – that you could throw them both out.
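One way to surface likely duplicates across the two public sets is to build a rough matching key and compare the records that collide. Everything below is an assumption for illustration: the field names, the key choice (organization plus disclosure date), and the “details roughly agree” rule:

    # 'dataloss_events' and 'privacyrights_events' stand in for the two
    # parsed sets; field names are illustrative assumptions.
    dataloss_events = [
        {"organization": "Acme Bank", "date": "2007-03-01", "records_lost": "100000"},
    ]
    privacyrights_events = [
        {"organization": "ACME BANK ", "date": "2007-03-01", "records_lost": "120000"},
    ]

    def match_key(event):
        """Rough matching key: normalized org name plus disclosure date."""
        return (event["organization"].strip().lower(), event["date"])

    by_key = {}
    for event in dataloss_events + privacyrights_events:
        by_key.setdefault(match_key(event), []).append(event)

    deduped = []
    for group in by_key.values():
        if len(group) == 1:
            deduped.append(group[0])
            continue
        counts = [int(e["records_lost"]) for e in group]
        if max(counts) <= 2 * min(counts):
            deduped.append(group[0])  # details roughly agree: keep one copy
        # else: variance too wide, no authoritative source -- drop them all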

d.    Consistency. You have to be consistent in your approach to reviewing data points. Distributing the work among numerous people can be problematic if they are not all aligned on the goals of the exercise and calibrated on determining whether a data point meets the criteria for inclusion.
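Finally, to close the loop on the tail-risk point from the beginning of this post: once the set is narrowed, the surviving per-event record counts can be fitted to a heavy-tailed distribution. A minimal sketch, assuming the good_fit set from above and using a lognormal purely as an illustration – it is not necessarily the distribution our modeler chose:

    import math
    import statistics

    # Fit a lognormal to the per-event record counts by taking logs and
    # computing the sample mean and standard deviation.
    counts = [int(e["records_lost"]) for e in good_fit if int(e["records_lost"]) > 0]
    logs = [math.log(c) for c in counts]
    mu = statistics.mean(logs)
    sigma = statistics.stdev(logs)

    # Read off the 95th percentile of the fitted lognormal -- one rough
    # way to put a number on "tail risk".
    z95 = 1.645  # standard normal 95th percentile
    tail_95 = math.exp(mu + z95 * sigma)
    print(f"~95th percentile of records lost per event: {tail_95:,.0f}")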

In the next post, I will focus more on “right-sizing” data points. In other words, adjusting data points to be commensurate with your particular company.

Note 6: Please do not take any of my remarks about errors in the dataloss.org or privacyrights.org data sets as an attack on the fine folks maintaining them. My intent in raising these points is about taking personal responsibility for knowing the data points you are deriving information from. It is too easy for our business partners, and even others in the security industry, to raise the “garbage in, garbage out” argument when trying to understand risk or loss models.
