Start Simple

This post is part of a series based on a presentation I did to the London VMware User Group on February 25th, 2010 about the reality of Enterprise scale internal cloud platforms. To find other posts in the series, just look for the tag “Bringing the Cloud Down to Earth”.

With all the hype surrounding Cloud, it’s easy to lose your head when planning what your first release is going to offer. Well maybe not you, but other people around you who aren’t as grounded will almost certainly lose theirs. “ZOMG we’ll have vMotion to Amazon and single click clone-to-new-VM for production workloads BOOM! HEADSHOT!!!”. Calm down, Doug. You should always remember Gall’s Law, which in a nutshell says any complex system that works is bound to have evolved from a simple system that worked. Amen to that. It’s also worth remembering that simple does not necessarily mean easy. As Einstein said, “It can scarcely be denied that the supreme goal of all theory is to make the irreducible basic elements as simple and as few as possible without having to surrender the adequate representation of a single datum of experience.” (i.e. make it as simple as possible, but no simpler).

Use Case Elimination
But how simple is simple? The correct answer is actually driven by your timeline and resources. You need to figure out what you can reasonably deliver within those constraints, and the easiest way to do that is by getting together a list of use cases, and then taking to them with a hatchet (unless you have unlimited time and resources, in which case sit back and have another caramel soy Latte).

When compiling your initial list of use cases, make sure you flesh them out in enough detail to justify the final shortened list for the initial release – you’ll likely need to show solid reasoning behind your choices to non-technical people both inside and outside of the project. You need to be able to show how much work something like “archive VM” actually entails – what does archive actually mean? Is there a time limit on it? Does it move to another storage tier? How do you bring it back? What happens if the cluster it used to run on doesn’t have capacity for it anymore? Will you allow something that has been off the network for 6 months to just come back on? Do you need an isolated environment where VMs can be “re-socialised” (patched, AV updates and so on)? Or is the point of the archive that it needs to be retrieved in exactly the same state as when it was archived? What about Windows machines that are domain members – after what period will their machine account password have expired and broken domain authentication? What did you do with the IP address / DNS entry of the guest when it was archived? What about all the other enterprise systems it may have been registered with when it was alive – do you need to re-register? There are many other questions like these associated with many different use cases – you don’t have to answer them all up front, but you will need answers to them before you can implement the feature.

And make sure you consider both the forward and reverse scenarios of your use cases. You can’t archive without retrieval, you shouldn’t automate creation without automating deletion, and so on. Remember when SRM first came out and had failover but didn’t have failback? Don’t let that be you.
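The forward / reverse rule is easy to check mechanically once the use case list is written down. Here is a minimal sketch in Python, assuming a hypothetical mapping of forward use cases to their reverse operations (the names are purely illustrative and not taken from any real product):

```python
# Hypothetical mapping of forward use cases to their reverse operations.
# These names are illustrative only.
REVERSE_OF = {
    "archive": "retrieve",
    "create": "delete",
    "failover": "failback",
}


def missing_reverses(planned):
    """Return the forward use cases in `planned` whose reverse is not planned."""
    planned_set = set(planned)
    return sorted(
        forward
        for forward, reverse in REVERSE_OF.items()
        if forward in planned_set and reverse not in planned_set
    )


# A release that archives VMs but has no way of bringing them back.
release_one = ["create", "delete", "archive"]
print(missing_reverses(release_one))
```

Anything the check reports is a candidate for either adding the reverse operation or taking the forward one to the hatchet as well.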

When considering items for the hatchet, think about what the main driver for your Cloud implementation is. Stuff like user controlled snapshots, archiving and datacenter migration (live or not) may sound great at first glance but if you aren’t doing it today do you really need to offer it in the first release of your internal Cloud? Do you need to have the capability to run production workloads on your first release? You should not underestimate the value and complexity of delivering nothing more than a self-service, rapid provisioning, pay-as-you-go platform for non-production workloads.

Finally, try to think fresh when prioritising / eliminating items. There are probably things on there that are important in the context of the way you do things today, but may be less important in the world of Cloud. Let’s look at the archive use case again – it could be that archiving is attractive today only because of the pain associated with “new” requests. But if your users have a low governance self service option that gives them a new machine in an hour or less, the use case for archiving may be more limited than you originally thought.

The other aspect of starting simple to consider is what VM size choices to offer. Once again, it’s easy to get carried away – do you have a fixed ratio of compute power to memory, or leave it completely variable? How do you factor vCPUs into that – could someone request a low amount of aggregate compute power and memory but ask for 4 vCPUs? In which case, what’s your strategy for ensuring that people get what they pay for / don’t get an inconsistent experience? How about service tiers – as I’ve mentioned previously, do people have support and no support options? Is there a tier that includes backup?

EC2 is a great example of service evolution. From the recent announcement of support availability to the constant new mixes of memory and compute capabilities… they didn’t offer everything they have today from day one. Instead, their initial release had very few options, and they added more options based on customer demand. And you should do the same – don’t stress too much about what goes into a small / medium / large if that’s all you’ll offer to begin with. Nothing is set in stone, and service catalogues aren’t difficult to change. The best thing you can do is just get something out there and change it in response to _actual_ consumption and feature requests – not what you _think_ those things will look like.
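To show how little a first service catalogue needs to contain, here is a rough sketch in Python of a small / medium / large offering. The vCPU and memory figures are made up for illustration, and the fixed memory-to-vCPU ratio is just one possible answer to the ratio question raised earlier, not a recommendation:

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass(frozen=True)
class VmSize:
    vcpus: int
    memory_gb: int


# Hypothetical first-release catalogue; the numbers are assumptions.
CATALOGUE = {
    "small": VmSize(vcpus=1, memory_gb=2),
    "medium": VmSize(vcpus=2, memory_gb=4),
    "large": VmSize(vcpus=4, memory_gb=8),
}


def resolve_size(name):
    """Look up a size by name, rejecting anything not in the catalogue."""
    try:
        return CATALOGUE[name.lower()]
    except KeyError:
        choices = ", ".join(sorted(CATALOGUE))
        raise ValueError(f"unknown size {name!r}; choose from: {choices}") from None


def has_fixed_ratio(catalogue):
    """True if every size has the same memory-to-vCPU ratio."""
    ratios = {Fraction(s.memory_gb, s.vcpus) for s in catalogue.values()}
    return len(ratios) == 1


print(resolve_size("Medium"))
print(has_fixed_ratio(CATALOGUE))
```

Because the whole catalogue is one small table, adding an extra size later in response to actual demand is a one-line change.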

Wow – nearly 1000 words on this one… surely something in there must be useful to someone 🙂


