Finally, A Black Swan

Anyone who has read Nassim Nicholas Taleb’s work will know the story of the black swan. In a nutshell, the story goes that up until about 220 years ago, the Olde World thought that all swans were white. The basis for this was of course purely empirical – out of the millions of confirmed swan sightings over many hundreds of years, every single one was white. Until the discovery of Australia that is, when a proposition with hundreds of years of supporting data was instantly invalidated with the first sighting of a black swan. And so Taleb coins the term “Black Swan” (note the capitalisation) to describe a phenomenon that is an outlier (in the statistical sense), that has extreme impact, and is treated in retrospect as predictable.

And so Danger’s massively public data loss last week can be considered a Black Swan. Already the ignorant are calling it a failure of the cloud as a model, or reporting it with such gross simplicity that it makes an infrastructure guy like me want to start punching things (mainly the so-called “tech” reporters who have penned such articles). And I have no doubt that whatever the root cause is, it will in retrospect be deemed predictable so that people can have warm fuzzies about “the cloud” once more, knowing that such things couldn’t happen again because we now know how to identify and mitigate such things in advance. Just like the recent economic crisis.

The truth of the matter is, I’m bloody glad that this has happened because it takes things like this for senior management types to understand that when it comes to infrastructure, you usually get what you pay for. The problem with the current cloud offerings is that you are not really told what you are getting – you are just told how cheap it is. And so when these senior management types look at the cost of internal IT versus external cloud, they often immediately assume apples for apples and thus conclude external cloud to be massively cheaper. Newsflash: it is never an apples for apples comparison, and if it were I very much doubt there would be a big enough price difference to warrant large-scale evacuation of corporate datacenters.

In order to analyse this a bit further, I’m going to play the part of ignorant reporter for a sec and describe the problem as follows: During some kind of storage array maintenance, something went wrong, a large amount of data was lost and no backup was available from which to restore the data.

So to me, Danger’s failure was not so much grounded in technology as it was in a complete failure of IT to understand the business side of their operation, and thus a complete failure to consider the impact of a Black Swan. During my time in the enterprise, I have seen array level failures that well and truly satisfy the requirements of “outlier” and “large impact”. Thankfully, I have been fortunate enough to work with people who had the intelligence to see that such failures could not possibly be predicted or were too expensive to mitigate in light of the risk. If you ever catch me for a beer, ask me about these experiences. I am certainly much better for having been through them, and so will you be better for having heard about them.

But the difference is, these failures were with internal infrastructure at companies whose core business was not IT related. This puts a massively different spin on the risk equation, because a public losing of face for the company in the event of such failures was never a possibility. Minor loss of revenue? Maybe. A regulatory wrist slapping? Possible. But a complete and utter public shaming and loss of confidence in the core service the company was providing? Never on the cards. And so as catastrophic as these things were, the cost of mitigation completely outweighed the risks. And so we lived with them. But cloud-based services cannot live with such risks. And mitigation of these kinds of failure is usually expensive. All of a sudden a “cut corners to save costs, at all cost” model of cloud computing doesn’t look so great, does it? I’m not saying that such a model is _the_ cloud model; it’s _a_ model that hopefully now won’t seem as attractive as it may have previously.

I’m hoping the impact of Danger’s failure will force cloud providers to provide more disclosure about their infrastructure, in order to allow businesses to make better decisions about the reliability and risks of various providers. At the very least, it should allow for better differentiation between the providers based on these factors. And almost certainly, the actual cost of things will become more evident and the cloud will probably get more expensive than it is today.

As for internal IT people, we can take something away from this too. We can read the dumbed down explanations of what happened and know there are any number of specifics that will not be covered and possibilities that may have had an equal likelihood of occurring. We should not just publicly speculate about such things; we should speculate and then ask the questions of our own infrastructures. You may think synchronous replication and SRM is protecting you, but what if a RAID group goes down and corrupts data with it, and that corruption is instantly replicated to your DR site? You may think that putting 100 VMs onto a single blade is an acceptable risk if a single blade fails, but what if the whole chassis becomes unavailable? You may deem both of these as extremely unlikely, but that doesn’t mean you should wave them off. Think about them, and communicate them.
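To put a rough number on the blade question, here is a minimal Python sketch of the blast radius. The 100 VMs per blade is the figure from the paragraph above; the 16 blades per chassis and the `Chassis` class itself are purely my assumptions for illustration, so plug in your own hardware.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chassis:
    """A hypothetical blade chassis; the numbers used below are illustrative."""
    blades: int
    vms_per_blade: int

    def vms_lost(self, failed_blades: int) -> int:
        """Return how many VMs go offline when `failed_blades` blades fail together."""
        if not 0 <= failed_blades <= self.blades:
            raise ValueError("failed_blades must be between 0 and blades")
        return failed_blades * self.vms_per_blade


# 100 VMs per blade comes from the post; 16 blades per chassis is an assumption.
chassis = Chassis(blades=16, vms_per_blade=100)

single_blade = chassis.vms_lost(1)                # the failure you planned for
whole_chassis = chassis.vms_lost(chassis.blades)  # the outlier you waved off

print(f"One blade fails:     {single_blade} VMs down")
print(f"Whole chassis fails: {whole_chassis} VMs down")
```

The arithmetic is trivial, which is rather the point: the number for the unlikely case is easy to produce, so there is little excuse for leaving it out of the conversation with the business.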

In closing, I hope you have gathered that the point of this post is not about what’s right or wrong from an infrastructure design perspective. It’s about knowing the risks, including the outliers, communicating those risks to your customers (either internal or external) or the decision makers (i.e. the business people) and, most important of all, recording what decisions were made, by whom and why. So when that Black Swan actually appears, rather than being lost for words and looking stupid, you will either have mitigating plans in place or a record that this was a risk you were willing to live with for whatever reason. And for fuck’s sake, ensure that your customers also accept the unmitigated risks before onboarding onto your platform.

Because that Black Swan will appear, believe me. Or I’m not Australian.


One Response to “Finally, A Black Swan”

  1. Steve Chambers Says:

    And I thought this was going to be a tourist article about Leeds Castle (where they have black swans – note the _lack of_ capitalisation).

    Good post, mate, I’m sure there’s more Cloud fun and games on the way!
