API First

This post is part of a series based on a presentation I did to the London VMware User Group on February 25th, 2010 about the reality of Enterprise scale internal cloud platforms. To find other posts in the series, just look for the tag “Bringing the Cloud Down to Earth”.

I cannot stress enough how important this point is, and at the same time I can’t convey how difficult it is to do. The difficulty doesn’t lie in the technical part (although that is far from easy) – it’s linked to one of the other points, which I called “The Importance of Showing Others”. The sad reality of life in the Enterprise, and even in the outside world to a certain extent, is that non-technical people generally need to see a UI in order to get a sense of what something does. As I once heard a sales guy say, “it’s very difficult to sell an API”. Sadder still, the better the UI, the better that thing is perceived to be.

But although this is an unfortunate state of affairs (IMHO), you can work it to your advantage. There’s a saying which goes something like “you can’t polish a turd”… well actually, yes you can. You could have absolute rubbish under the hood, but dress it up in a nice UI and 99% of non-technical people will be sold. Case in point, the Windows operating system (heheh, that was a cheap shot).

Now I’ll admit, I don’t know the solution to this problem. In my experience, I have failed to convince others of the importance of developing the API first, and they have only come to realise it later (but thankfully not when it was too late). Given that, I’m going to talk about what I mean by ‘API’ rather than how to do it first.

When I say API in the context of Enterprise Cloud, I mean _your_ API. This is probably easier to describe with an image:

Developing your own API is important because you really need to abstract away the underlying systems that your service will sit on.
For example, “Archive VM” is probably a fairly lightweight operation on the underlying virtual infrastructure layer. Essentially, it’s a power operation and a storage migration at most. But different platforms may implement these actions in different ways. On vSphere, for example, a storage migration can be done online, which may not be possible on other platforms. Or you may have the capability to skip the virtual infrastructure layer entirely and have the storage array do the work. You may decide that an archive in the internal Cloud will mean leaving the VM running but moving the machine to a cluster with a much higher VM density, and that archiving on an external Cloud will mean shutting the VM down but not actually performing any storage operations. Inside the guest, the implications are much, much greater and will likely require interaction with a whole bunch of different systems related to the guest itself (AD, DNS, Monitoring, Backup, etc.). So first you have to understand what the implications of such an operation are, then you can derive the base of the API method (what args will be involved etc.), and then after a good deal more analysis you can move to implementation. I like to flesh these details out by “pseudo coding” the operation at hand.
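To make the abstraction idea a bit more concrete, here is a minimal Python sketch. Every name in it (`Platform`, `InternalCloud`, `ExternalCloud`, `CloudAPI`) is made up for illustration, and the two archive policies are just the examples from the paragraph above rather than a recommendation. The only point is that the caller asks for "archive" and never needs to know which platform does the work.

```python
from __future__ import annotations

from abc import ABC, abstractmethod


class Platform(ABC):
    """One underlying system the service sits on (hypothetical interface)."""

    @abstractmethod
    def archive_vm(self, vm_id: str) -> list[str]:
        """Archive a VM and return the platform-level steps that were taken."""


class InternalCloud(Platform):
    def archive_vm(self, vm_id: str) -> list[str]:
        # Example policy from the text: leave the VM running, move it to a denser cluster.
        return [f"migrate {vm_id} to high-density cluster"]


class ExternalCloud(Platform):
    def archive_vm(self, vm_id: str) -> list[str]:
        # Example policy from the text: shut the VM down, no storage operations.
        return [f"shut down {vm_id}"]


class CloudAPI:
    """'Your' API: one archive call, whatever platform hosts the VM."""

    def __init__(self, placement: dict[str, Platform]):
        self._placement = placement  # maps vm_id to the platform hosting it

    def archive_vm(self, vm_id: str) -> list[str]:
        return self._placement[vm_id].archive_vm(vm_id)


api = CloudAPI({"vm-001": InternalCloud(), "vm-002": ExternalCloud()})
print(api.archive_vm("vm-001"))
print(api.archive_vm("vm-002"))
```

Swapping a platform, or changing what "archive" means on one of them, then only touches the class for that platform and not the consumers of the API.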

Pseudo Code
For those who haven’t done it before, pseudo coding is basically writing the logical skeleton of an operation in plain English – you don’t need to have written a piece of real code in your life. In fact, you probably don’t want real developers doing this kind of thing, because they more than likely won’t have the required ‘big picture’ that the infrastructure people will have.

So let’s have a look at how the process works. It’s always best to do this with at least one other person, and in my experience it’s most productive to lock yourselves away in the nearest cafe to nail it out – it’s a time-consuming process, and interruptions are costly. It’s also a very iterative process, and you’ll likely find yourself identifying stuff in one use case that impacts another seemingly unrelated one. Let’s continue with our “Archive” use case and see how it might play out.

[xml]
Assumptions
– archival = workload offline > 1 month
– archive capacity is always available

Do Archive (VM ID)
    Shutdown VM
    Move VM to archive storage tier
Done
[/xml]
OK, nice and simple. It’s important to capture the assumptions in the same document as the pseudo code itself so you don’t get flooded with “what about…” when others read it. To take this a little further, you might want to wrap some governance into such an operation, as there is likely some kind of ongoing cost implication. In order to do so, you’ll probably need to do something with the requestor’s identity along the way.

[xml]
Assumptions
– archival = workload offline > 1 month
– archive capacity is always available

Do Archive (VM ID, Requestor ID)
    Forward request for approval
    If approved
        Shutdown VM
        Move VM to archive storage tier
    Else
        Flag request denied
        Do nothing
    End If
Done
[/xml]
OK, a little more meat on it now. Depending on how you have implemented your capacity engine, you may need to do something there. Let’s not worry about the implementation details too much for now, and use “Log” as a generic verb. While we’re here, let’s throw in a little more error handling. Not to the nth degree, just something to show we are thinking about it on some level.

[xml]
Assumptions
– archival = workload offline > 1 month
– archive capacity is always available

Do Archive (VM ID, Requestor ID)
    Forward request for approval
    If approved
        Shutdown VM
        Move VM to archive storage tier
        If successful
            Log freed cluster resources (CPU, RAM, Storage, IP)
            Log Archive success
        Else
            Log Archive fail
        End If
    Else
        Flag request denied
        Do nothing
    End If
Done
[/xml]
Getting better. Now, depending on whether you already had a “Move Storage” use case, there’s an opportunity for re-use in this operation. But there is also the chance that this use case will impact “Move Storage” if you wrapped up power operations in your “Move Storage” operation… for the archive use case, you don’t want the VM to come back up. Let’s change our code to reflect this.

[xml]
Assumptions
– archival = workload offline > 1 month
– archive capacity is always available

Do Archive (VM ID, Requestor ID)
    Forward request for approval
    If approved
        Shutdown VM
        Move Storage --> check Move Storage pseudo code for power operations, need a "leave shutdown" flag or something
        If successful
            Log freed cluster resources (CPU, RAM, Storage, IP)
            Log Archive success
        Else
            Log Archive fail
        End If
    Else
        Flag request denied
        Do nothing
    End If
Done
[/xml]
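To show what that "leave shutdown" note might turn into, here is a small Python sketch. It assumes a hypothetical "Move Storage" operation that powers the VM back on when it finishes unless told otherwise; the `VM` class, the tier names and the flag name are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class VM:
    vm_id: str
    power: str = "on"
    tier: str = "standard"


def move_storage(vm: VM, target_tier: str, leave_shutdown: bool = False) -> VM:
    """Hypothetical Move Storage that wraps its own power operations."""
    vm.power = "off"          # power down for the move
    vm.tier = target_tier     # stand-in for the real storage migration
    if not leave_shutdown:
        vm.power = "on"       # default: bring the VM back up afterwards
    return vm


def archive(vm: VM) -> VM:
    # Archive re-uses Move Storage but must not let the VM come back up.
    return move_storage(vm, "archive", leave_shutdown=True)


archived = archive(VM("vm-001"))
print(archived)
```

Defaulting the flag to `False` keeps any existing "Move Storage" callers behaving as before, which is the kind of cross-use-case impact this exercise is meant to surface.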
OK, you’re beginning to see that the rabbit hole could go much deeper. For example, what are the downstream implications of “Log Archive success”? Do you need to build some kind of “Archive Management” service to handle stuff like recording all the details of the VM at the time of archival in order to assist with the retrieval process, or scheduling periodic “wake-ups” to apply patches, AV updates, machine account password syncs etc.? There are many possible implications of any given action, and writing them down pseudo code style and then sharing them with the group goes a long way towards identifying such details. It’s good to capture this kind of stuff in a section like we did with the assumptions, so our initial analysis may end up something like this:

[xml]
Assumptions
– archival = workload offline > 1 month
– archive capacity is always available

Do Archive (VM ID, Requestor ID)
    Forward request for approval
    If approved
        Shutdown VM
        Move Storage --> check Move Storage pseudo code for power operations, need a "leave shutdown" flag or something
        If successful
            Log freed cluster resources (CPU, RAM, Storage, IP)
            Log Archive success --> see Questions
        Else
            Log Archive fail
        End If
    Else
        Flag request denied
        Do nothing
    End If
Done

Questions
– Do we require some kind of Archive management service that handles inventory of archived machines, wake up schedules, wake up activities, etc?
[/xml]
As I said, there’s no right or wrong level of detail to go to in this process so I’ll leave it at that for now.
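For the curious, here is roughly how the "Do Archive" pseudo code above might look once it reaches a developer. This is only a sketch in Python under heavy assumptions: the approval, shutdown, storage and logging steps are passed in as plain callables with made-up names, because the real systems behind them are exactly the implementation detail we have been avoiding.

```python
from typing import Callable, List


def do_archive(
    vm_id: str,
    requestor_id: str,
    approve: Callable[[str, str], bool],
    shutdown_vm: Callable[[str], None],
    move_storage: Callable[[str], bool],
    log: Callable[[str], None],
) -> bool:
    """Follow the Do Archive pseudo code; return True if the archive succeeded."""
    # Forward request for approval
    if not approve(vm_id, requestor_id):
        log(f"archive denied: {vm_id} (requested by {requestor_id})")
        return False  # do nothing else
    shutdown_vm(vm_id)
    # Move Storage is expected to leave the VM shut down (see the note in the pseudo code)
    if move_storage(vm_id):
        log(f"freed cluster resources for {vm_id}: CPU, RAM, Storage, IP")
        log(f"archive success: {vm_id}")
        return True
    log(f"archive fail: {vm_id}")
    return False


# Usage with stand-in callables; a real service would call the approval,
# virtual infrastructure and capacity systems here instead.
events: List[str] = []
ok = do_archive(
    "vm-001",
    "alice",
    approve=lambda vm, who: True,
    shutdown_vm=lambda vm: events.append(f"shutdown {vm}"),
    move_storage=lambda vm: True,
    log=events.append,
)
print(ok, events)
```

Notice that the open item from the Questions section (what "Log Archive success" should trigger downstream) is still hiding behind the `log` callable, which is why it is worth writing down before anyone codes it.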

Only Human?
Of course, the other reason for developing your API first is that it may not be a human that is interacting with your service. Or put another way, do you actually know what will provide the UI for your customers? This can be as much a political issue as anything else. The best way to sidestep it is to avoid the issue entirely by delivering an API that another system can consume rather than providing Yet Another UI.

And ultimately, there may not even be any human involvement. An application may be requesting that new machines be spun up or down, or that existing machines be made larger or smaller in accordance with the application load. Granted, most applications in the Enterprise aren’t anywhere near this today, but this kind of nirvana begins with you building an API for your developers to consume.
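As a rough illustration of an application driving the API with no human in the loop, here is a tiny Python sketch. The thresholds, the `scale_to` method and the `FakeCloudAPI` class are all assumptions made up for the example; a real consumer would call whatever methods your own API ends up exposing.

```python
def desired_vm_count(current: int, cpu_load: float, low: float = 0.3,
                     high: float = 0.7, minimum: int = 1, maximum: int = 10) -> int:
    """Decide how many VMs the application wants for an average CPU load (0.0 to 1.0)."""
    if cpu_load > high:
        return min(current + 1, maximum)   # busy: ask for one more machine
    if cpu_load < low:
        return max(current - 1, minimum)   # idle: hand one machine back
    return current                         # in the comfortable band: no change


class FakeCloudAPI:
    """Stand-in for your API; it records the count instead of touching real infrastructure."""

    def __init__(self, vm_count: int):
        self.vm_count = vm_count

    def scale_to(self, count: int) -> None:
        self.vm_count = count


api = FakeCloudAPI(vm_count=2)
for load in (0.9, 0.9, 0.5, 0.1):
    api.scale_to(desired_vm_count(api.vm_count, load))
    print(load, api.vm_count)
```

The application only ever talks to the API object, so the same loop would work whether the machines live on the internal or an external Cloud.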
