What works and what doesn't work in software development? For the first 10 years or so of my career, we followed a strict waterfall development model. Then in 2008 we started switching to an agile development model. In 2011 we added DevOps principles and practices. Our product team has also become increasingly global. This blog is about our successes and failures, and what we've learned along the way.

The postings on this site are my own and don't necessarily represent IBM's positions, strategies or opinions. The comments may not represent my opinions, and certainly do not represent IBM in any way.

Friday, June 29, 2012

DevOps Days Open Space: Private Cloud Myths, Facts, and Tips

I'm at the DevOps Days conference this week, and some of my favorite sessions have been the Open Space collaborations. I was the scribe for the session on "Private Cloud: Myths, Facts, and Tips", so here is what we discussed:

Myth: The cloud will not fail.
Fact: It will fail, so write software that can handle failures.  In fact, entire data centers will fail.  Have at least two geographically remote clouds that can fail over to each other.
Tip: The Yahoo guys have found that the more technology you have to prevent a data center from failing, the more points of failure you introduce into the system.  They've found that it's much more effective to make it easy to fail over to a new data center.  In fact, they fail over during peak traffic times on purpose: for practice, to build confidence, and as a way to take servers out of production for a little while to upgrade them.
Tip: Infrastructure automation makes it easy to create a new backup system.
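
To make the failover idea concrete, here is a minimal sketch in Python. It is not how Yahoo does it; the DataCenter class and the choose_active and failover_drill functions are hypothetical names I made up for illustration. It only shows the shape of the logic: always route to the first healthy site, and practice taking the active site out of rotation on purpose.

```python
from dataclasses import dataclass


@dataclass
class DataCenter:
    # Hypothetical record for one site; a real system would track much more.
    name: str
    healthy: bool = True


def choose_active(preferred_order):
    """Return the first healthy data center in priority order, or None."""
    for dc in preferred_order:
        if dc.healthy:
            return dc
    return None


def failover_drill(preferred_order):
    """Deliberately drain the active site and return (drained, replacement)."""
    active = choose_active(preferred_order)
    if active is None:
        raise RuntimeError("no healthy data center available")
    active.healthy = False  # simulate pulling the site for an upgrade
    replacement = choose_active(preferred_order)
    if replacement is None:
        active.healthy = True  # nothing to fail over to, so undo the drill
        raise RuntimeError("no second data center to fail over to")
    return active, replacement


east = DataCenter("east")
west = DataCenter("west")
drained, now_serving = failover_drill([east, west])
print(drained.name, "->", now_serving.name)
```

With only one site in the list, the drill refuses to run, which is the point of having at least two geographically remote clouds.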

Myth: The cloud will move my enterprise forward into the future.
Fact:  There are issues you have to solve first.
Tip: Teach your developers how to write stateless applications.
Tip: A tool like Netflix's Chaos Monkey, which randomly kills running instances, can help your developers learn how to write stateless applications and how to handle failures of different services.
Tip: Getting people to write stateless code is a cultural and a management issue.
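
A small Python sketch of what "stateless" means in practice, under some assumptions of mine: SharedStore stands in for an external database or cache, and run_with_chaos is a toy stand-in for a chaos-monkey style test, not the real tool. The workers keep nothing in memory between requests, so replacing one before every request does not lose any data.

```python
import random


class SharedStore:
    """Stand-in for an external store (database, cache). Hypothetical."""

    def __init__(self):
        self._data = {}

    def increment(self, key):
        self._data[key] = self._data.get(key, 0) + 1
        return self._data[key]


class StatelessWorker:
    """Holds no per-request state; everything durable lives in the store."""

    def __init__(self, store):
        self.store = store

    def handle(self, user):
        # Count this user's visits in the shared store, not on the worker.
        return self.store.increment(user)


def run_with_chaos(requests, pool_size=3, seed=0):
    """Serve requests while killing and replacing a random worker each time."""
    rng = random.Random(seed)
    store = SharedStore()
    workers = [StatelessWorker(store) for _ in range(pool_size)]
    results = []
    for user in requests:
        victim = rng.randrange(len(workers))
        workers[victim] = StatelessWorker(store)  # "kill" and replace
        results.append(rng.choice(workers).handle(user))
    return results


print(run_with_chaos(["alice", "alice", "bob", "alice"]))
```

If the visit counter lived on the worker instead of in the store, the counts would reset every time a worker was replaced.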

Myth: Any workload can run in the cloud.
Fact: Some workloads will not behave in a cloud-like manner.  You have to design applications for that.
Tip: A best practice is to tell developers that their application could be moved to the public cloud at any time, without notice. 
Tip: A full penetration test on a regular basis is extremely valuable, especially if its output is a set of failing tests that your developers can fix!

Myth: My current ops team can run a private cloud easily.
Fact: You need to train people on how to run a cloud.  It's not traditional ops.  You have to learn to let go.  You have to give people control over their own machines.  You have to let them provision their own VMs.
Tip: Sysadmins should learn how to write infrastructure code. Otherwise, they may be reduced to swapping out broken servers in racks.
Tip: One way to prevent people from overloading your cloud capacity is to put fixed leases on the machines.  If you have infrastructure as code, you can provision a new system in a few minutes.  So, for developer unit test VMs, it's fine to delete those after 24 hours.  The developer may manually delete VMs even earlier.  Have a way to flag the small percentage of machines that should live for a long time.
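
Here is a rough Python sketch of the fixed-lease idea. The VM record, the long_lived flag, and the expired_vms function are names I invented for illustration; a real cloud would query its own inventory API. The 24-hour default matches the developer unit test case above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

DEFAULT_LEASE = timedelta(hours=24)  # lease for developer unit test VMs


@dataclass
class VM:
    # Hypothetical inventory record for one virtual machine.
    name: str
    created_at: datetime
    long_lived: bool = False  # flag for the few machines that must stay up


def expired_vms(vms, now, lease=DEFAULT_LEASE):
    """Return the VMs whose lease has run out and that are not flagged."""
    return [
        vm for vm in vms
        if not vm.long_lived and now - vm.created_at >= lease
    ]


now = datetime(2012, 6, 29, 12, 0)
inventory = [
    VM("unit-test-1", now - timedelta(hours=30)),
    VM("unit-test-2", now - timedelta(hours=2)),
    VM("build-server", now - timedelta(days=90), long_lived=True),
]
for vm in expired_vms(inventory, now):
    print("delete", vm.name)
```

A scheduled job could run this check and delete whatever it returns, while developers remain free to delete their own VMs earlier.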


  1. "If you have infrastructure as code, you can provision a new system in a few minutes." I've gotten some feedback on this point - in reality, you MAY BE able to provision a new system in a few minutes. It all depends on how your system is set up, how long it takes to deploy a VM, how long it takes to upload the software and install it, etc. Some simple deployments we have done on our servers take 5-10 minutes (a single Linux VM with a simple Tomcat server); others take closer to an hour (upgrading the OS; installing a new asset library).
