What works and what doesn't work in software development? For the first 10 years or so of my career, we followed a strict waterfall development model. Then in 2008 we started switching to an agile development model. In 2011 we added DevOps principles and practices. Our product team has also become increasingly global. This blog is about our successes and failures, and what we've learned along the way.

The postings on this site are my own and don't necessarily represent IBM's positions, strategies or opinions. The comments may not represent my opinions, and certainly do not represent IBM in any way.

Wednesday, November 14, 2012

Test automation got you down?

It's here - my full article on Making Your Automated Tests Faster, on the Enterprise DevOps blog!  Thank you again to the dozens of people who contributed their ideas at DevOps Days Rome.

This reminds me of my earlier post on "death by build", and how a build that takes too long, due in part to slow test automation, can really hamper a project: Is Shift-Left Agile? And Death By Build

Tuesday, November 6, 2012

Interview about DevOps and SmartCloud Continuous Delivery

Here's an interview I recently gave to a colleague of mine, Tiarnán Ó Corráin.  This seems like a good time to reiterate that these views are my own, and not the official views of IBM:

What is DevOps?

You can think of it as an extension of the principles and practices of Agile.  Where the Agile methodology breaks down the barriers between development and business goals, DevOps is breaking down the barriers between development and operations.  It's not all about tools; it's about people and processes as well.  Both Agile and DevOps have a goal of delivering reliable software to the business more quickly, and ultimately, making more money!

How does DevOps do that?

Well, traditionally there has been a gap between a development system and a production system: installing new machines, installing the software, scripting, and so forth. Getting from a working development system to a working production system involved all of these manual steps, and each one introduced a potential point of failure.

In addition, because setting up test machines was so time-consuming and error-prone, developers would often assemble their test machines in a quick and dirty way, on a single server, with the cheapest, simplest components.  Production systems, on the other hand, would have multiple servers, configured for clustering and fail-over, with firewalls between some components.  This meant that the developers weren't testing the software in an environment similar to the one where it would run in production.  And because of that, some bugs were never found until the software was deployed into production.

One of the primary tenets of DevOps is that you should automate every step in creating production systems.  Everything from preparing the machine, to installing the latest software, to starting the services, and testing them, should be fully automated and repeatable.  And when creating production systems is automated, you can also use production-like systems for development and test work.
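
As a minimal sketch of that tenet (host names and step names here are hypothetical, not from any real tool), every step from machine preparation to smoke testing becomes a plain function, so the whole sequence is scripted and repeatable:

```python
# Sketch of a fully automated path to a running system.  Each step is
# a plain function, so the whole sequence is repeatable and can be
# rerun identically for production, test, and development machines.

def prepare_machine(host):
    return f"{host}: OS configured"

def install_software(host, version):
    return f"{host}: app {version} installed"

def start_services(host):
    return f"{host}: services started"

def smoke_test(host):
    return f"{host}: smoke test passed"

def deploy(host, version):
    # Run every step in order; any exception aborts the deployment.
    return [
        prepare_machine(host),
        install_software(host, version),
        start_services(host),
        smoke_test(host),
    ]

if __name__ == "__main__":
    for line in deploy("test-web-01", "2.0"):
        print(line)
```

Because the same `deploy()` runs everywhere, a "production-like" test machine is just the same script pointed at a different host.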

How does virtualization help that?

When we're working in a cloud environment, deploying a virtual machine is something that can be scripted and automated.  We use infrastructure code to automate deploying the machine, installing the software, and (re)starting the services, and then we check that code into our source code repository and version it just like the application code.  So effectively the process of deploying a new system becomes part of the development process.
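
A tiny illustration of what "infrastructure code" means in practice (the package and service names are invented for the example): the script declares the desired state, and rerunning it converges without repeating work, which is what makes it safe to check into source control and treat like application code:

```python
# Minimal "infrastructure as code" sketch: declare the desired state,
# then converge the machine toward it.  Rerunning is a no-op, which
# is why the script can be versioned and run repeatedly.

DESIRED = {"packages": {"nginx"}, "services": {"nginx"}}

def converge(current):
    actions = []
    for pkg in sorted(DESIRED["packages"] - current["packages"]):
        actions.append(f"install package {pkg}")
        current["packages"].add(pkg)
    for svc in sorted(DESIRED["services"] - current["services"]):
        actions.append(f"start service {svc}")
        current["services"].add(svc)
    return actions

state = {"packages": set(), "services": set()}
print(converge(state))  # first run does the work
print(converge(state))  # second run finds nothing to do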

Presumably that makes testing easier?

Very much so.  Our virtualization technology means we can deploy production-like environments as part of the development process.  That's the direction development processes have been trending recently.  We already have continuous integration: RTC (Rational Team Concert) triggers automatic builds when changes are submitted, and we run unit tests against those builds.

Now, take that to the next level with continuous delivery: after changes trigger builds, those builds trigger deployments, and when the deployments are complete, the builds trigger automated tests.  What it means is that as part of the development process, we have production-like servers running the latest code.  This allows us to run automated tests, including performance verification, against a production-like environment as part of the development process.
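
The chain of triggers described here can be sketched in a few lines (the function names are illustrative, not any particular product's API): a submitted change triggers a build, a good build triggers a deployment, and a finished deployment triggers the tests.

```python
# Sketch of a continuous delivery chain: change -> build -> deploy ->
# automated tests, each stage triggered by the success of the last.

def build(change):
    return {"change": change, "build_ok": True}

def deploy(build_result):
    if not build_result["build_ok"]:
        raise RuntimeError("build failed; nothing to deploy")
    return {**build_result, "deployed": True}

def run_tests(deployment):
    # Tests (including performance verification) run against the
    # deployed, production-like environment.
    return {**deployment, "tests_passed": True}

def on_change_submitted(change):
    return run_tests(deploy(build(change)))

result = on_change_submitted("defect-1234-fix")
print(result)
```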

Taking Agile to the next level?

Yes.  It accelerates development, because it takes away some of the uncertainty about deployment: if I can capture every part of the deployment process in my development and testing process, I have more confidence about what I'm going to deploy.

Deploying test systems automatically also saves developers and testers a lot of time!  On my own development team, it's normal for us to deploy dozens of new servers every day, and delete the old ones just as often.

Can you tell me a bit about your own role?

I'm on the advanced technology team that works on DevOps.  We are driving an internal transformation within IBM, to encourage our own development teams to adopt DevOps principles and practices.  In addition, we are creating tools to help IBM's enterprise customers adopt DevOps themselves.  The first tool we built to sell is SmartCloud Continuous Delivery v2.0, and it's shipping to customers this week.  SmartCloud Continuous Delivery is currently targeted at customers who want to improve their dev/test cycle.  We believe this is the easiest place for our enterprise customers to start taking advantage of these new technologies.  We have other tools to help with production deployments, like SmartCloud Control Desk.

How is it going down in the market?

These ideas are gaining real traction, both within and outside of IBM.  Internally, we already have several adopters of continuous delivery for dev/test.  For instance, Rational Collaborative Lifecycle Management is using our code, and other teams like SmartCloud Provisioning 2.1 have custom continuous delivery solutions that are very similar to ours.  And of course, we're using it ourselves -- SmartCloud Continuous Delivery is self-hosting.

What would you say to any teams that are interested in this approach?

If anyone would like to evaluate the SmartCloud Continuous Delivery product, please check out our website.  We have free trials available.

Even if you're not a good candidate for SmartCloud Continuous Delivery, your team may be able to use several of the DevOps principles and practices.  Check out our Enterprise DevOps blog for ideas, and feel free to contact me about that as well.  IBM even offers DevOps consulting workshops.

Monday, October 8, 2012

DevOps Days Open Space: Making Your Automated Tests Faster

One part of my job is helping other teams adopt DevOps in general, and continuous delivery in particular.  But I have a problem: many of them have a suite of automated tests that run slowly; so slowly that they only run a build, and the tests that run in the build, about once per day.  (Automated test run times of 8-24 hours are not uncommon.)  There are several reasons why this is the case, including:

  • The artifacts that are produced from the build, and then copied over to the test servers, are very large (greater than 1 GB in size).  Also, sometimes the artifacts are copied across continents.
  • Sometimes there are multiple versions of the build artifacts that must be copied to different test servers after the build.  A typical product I deal with will support at least a dozen platforms; a few support around 100 different platforms, when you multiply the number of supported operating system versions times the number of different components (client, server, gateway, etc.) times 2 (for 32- and 64-bit hardware).
  • Often, the database(s) for the product must be loaded with a large amount of test data, which can take a long time to copy and load.
  • Many products have separate test teams writing test automation.  Testers who are not developers tend to write tests that run through the UI, and those tests are usually slower than developers' code-level unit tests.
Running builds and tests often, so developers know quickly when they make a change that breaks something else, is a key goal of both continuous integration and continuous delivery.  Ideally, a developer should get feedback on whether their code is "ok", using a quick personal build and test run, within 5 minutes.  Anything over 10 minutes is definitely too slow; the developer will probably move on to something else, make more changes, and forget exactly what was changed for that particular test.

Once the quick tests pass, the developer can run a full set of tests and then integrate the tested changes.  Or, in cases where a full set of tests is extremely slow, the developer can integrate his or her code changes once the quick tests pass, and then let the daily build run the full set of tests.

In this DevOps Days open space session, we brainstormed ways to make automated tests run more quickly.  We focused more on quick builds for personal tests, but most of these ideas would make the full set of tests faster too.  Many thanks to the dozens of smart people who contributed their ideas.  I don't even have their names, but they know who they are.  I'm sure we'll use several of these ideas right away.

Watch for an article with more details on each of these, coming soon...

Fail quickly

Run a quick smoke test first

Run a small set of tests that fail often next

Run slow tests last, or not at all

See also: Remove slow tests
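
As a rough sketch of the "fail quickly" ordering above (the test names, timings, and failure counts are invented), a suite can be sorted so smoke tests run first, frequent failers next, and slow tests last:

```python
# Order tests to fail quickly: smoke tests first, tests that fail
# often next, then everything else fastest-first.

tests = [
    {"name": "full_ui_regression", "avg_seconds": 3600, "recent_failures": 0},
    {"name": "smoke_login",        "avg_seconds": 5,    "recent_failures": 0, "smoke": True},
    {"name": "flaky_checkout",     "avg_seconds": 60,   "recent_failures": 4},
]

def priority(t):
    # Sort key: smoke tests sort before non-smoke tests, then more
    # recent failures sort earlier, then shorter runtimes sort earlier.
    return (not t.get("smoke", False), -t["recent_failures"], t["avg_seconds"])

ordered = sorted(tests, key=priority)
print([t["name"] for t in ordered])
```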

Run in parallel

Run test buckets in parallel

Use snapshots of databases or VMs to make it easier to run tests in parallel
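
A minimal sketch of running test buckets in parallel (the bucket names are hypothetical, and `run_bucket()` stands in for whatever actually dispatches a bucket to a test VM or runner):

```python
# Run independent test buckets concurrently with a thread pool.
from concurrent.futures import ThreadPoolExecutor

def run_bucket(bucket):
    # Real code would execute the bucket's tests here, ideally against
    # its own VM or database snapshot so buckets don't interfere.
    return f"{bucket}: passed"

buckets = ["unit", "api", "ui", "perf"]
with ThreadPoolExecutor(max_workers=len(buckets)) as pool:
    results = list(pool.map(run_bucket, buckets))  # order is preserved
print(results)
```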

Break up tests into smaller groups

Divide your application into components, and test the changed components

Automatically determine which tests to run when code changes

Save time on I/O

Mock responses

Use LXC (Linux Containers)

Move servers and data so they are close to one another
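
To make the "mock responses" idea concrete, here is a small sketch using Python's standard `unittest.mock` (the `fetch_price` function and its client are hypothetical): the slow network call is replaced with a canned response, so the test does no I/O at all.

```python
# Mock a slow network dependency so the test avoids real I/O.
from unittest import mock

def fetch_price(client, sku):
    # In production, client.get() would make a slow HTTP request.
    return client.get(f"/price/{sku}")["price"]

fake_client = mock.Mock()
fake_client.get.return_value = {"price": 42}
print(fetch_price(fake_client, "ABC-1"))  # no network round trip
```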

Make your test infrastructure faster

See also: Use snapshots of databases or VMs to make it easier to run tests in parallel

Cache what you can

Remove some tests

Remove tests that never fail

Remove slow tests

Replace some UI tests with code-level tests

Replace some tests with monitoring

Friday, July 6, 2012

DevOps Days Open Space: DevOps for Legacy Code and Real Servers

I proposed the topic for this Open Space: DevOps for Legacy Code and Real Servers.  Here are some of the insights I gleaned from this session.

Legacy servers
  • Cloud platforms are evolving to manage real, legacy servers in addition to virtual machines.
  • Chef, for example, can manage both physical and virtual servers.  It can also manage clean OS installations as well as update existing servers.  There's a tool called Blueprint that will attempt to reverse engineer Chef automation for an existing server.
  • It's difficult to re-create systems that weren't automated in the first place.  However, it greatly reduces your risk if you invest the time and effort to do that.  What if the server was destroyed in a fire or something?
  • Sometimes people have even lost the source code for applications that are running in production.  That is a very risky state to be in.
  • Another option is to clone the system into a VM first, snapshot it, and then do your exploratory work on the VM.
  • You can also copy some of your production web traffic to your staging servers.
  • Or, you can start deploying new applications to VMs, and gradually shift your enterprise code to VMs.
Mainframe systems
  • Mainframes are the backbone of many legacy systems, and they are not going away.  
  • People who are used to working with mainframes have a different culture and language than people who are used to developing new web applications.  There's a communication gap to bridge before they can benefit from DevOps principles and practices.
  • One option is to just get an enterprise's web applications to adopt DevOps and punt on the mainframe applications.  But why can't we do the same thing for mainframe applications?
  • Mainframes have limited logging and monitoring systems.  Why?
  • Mainframes have limited tooling.  Why?
  • It's very difficult to see what's going on within an application.
  • It's very difficult to debug applications.
  • Deployments have to be completed with zero downtime.
  • LPARs, CICS regions, etc. could actually be considered a type of virtualization.  Is there a way we could make them behave more like VMs?
  • Could mainframe developers take some of the best practices from .Net and Java?
  • A more open place, like a university, might be more willing to experiment with DevOps first.
Universal Principles
These principles from DevOps can apply to legacy servers and mainframes just as easily:
  • Source Control Everything (including infrastructure code)
  • Version Control Everything (including infrastructure code)
  • Automate Everything (including infrastructure code)
  • Test Driven Development: Test First, Test Everything
  • Test for Operational Quality (performance, transaction load, security, etc.)
  • Agility
  • Focus on the Business Outcome, not the features or requirements
  • Improve teamwork between Dev and Ops
  • Collect metrics so you can find problems earlier

Friday, June 29, 2012

DevOps Days Open Space: How Can Ops Teams Give Feedback to Dev Teams?

This was another interesting Open Space that I participated in: How Can Ops Teams Give Feedback to Dev Teams?

Chaos Monkey can teach developers where things might break.  You need to couple that with some sort of monitoring tools so you can find bugs of the performance/throughput/overload type as well.

People in ops would like developers to program more defensively.  Developers are not generally taught how to do this.  It's also not usually part of their culture. 

One great way to teach developers is by writing tests that fail.  Developers are great at fixing tests that fail.

Another best practice is to embed developers in operations and vice versa.  Some companies have done this with teams of people for months or years at a time.  Others rotate people between the teams for one day every couple of weeks.  Set it up like an apprenticeship, where people can start out with a mentor and gradually become responsible for their own things.

Operations people can review code!

Developers can have pagers!

Product managers need to care about operational constraints and include those in the requirements that they put on the development teams.

You need to get everyone in the company to think about business value and happy customers.  Constantly.

You need to get everyone in the company to watch dashboards.  Give each person a few graphs to watch on a dashboard.

DevOps Days Open Space: Private Cloud Myths, Facts, and Tips

I'm at the DevOps Days conference this week, and some of my favorite sessions have been the Open Space collaborations.  I was the scribe for the session on "Private Cloud: Myths, Facts, and Tips", so here is what we discussed:

Myth: The cloud will not fail.
Fact: It will fail, so write software that can handle failures.  In fact, entire data centers will fail.  Have at least two geographically remote clouds that can fail over to each other.
Tip: The Yahoo guys have found that the more technology you have to prevent a data center from failing, the more points of failure you introduce into the system.  They've found that it's much more effective to make it easy to fail over to a new data center.  In fact, they fail over during peak traffic times on purpose: for practice, to build confidence, and as a way to take servers out of production for a little while to upgrade them.
Tip: Infrastructure automation makes it easy to create a new backup system.

Myth: The cloud will move my enterprise forward into the future.
Fact:  There are issues you have to solve first.
Tip: Teach your developers how to write stateless applications.
Tip: The chaos monkey can help your developers learn how to write stateless applications, and how to handle failures of different services.
Tip: Getting people to write stateless code is a cultural and a management issue.

Myth: Any workload can run in the cloud.
Fact: Some workloads will not behave in a cloud-like manner.  You have to design applications for that.
Tip: A best practice is to tell developers that their application could be moved to the public cloud at any time, without notice. 
Tip: A full penetration test on a regular basis is extremely valuable.  Especially if the output of that is a set of tests that failed, that your developers can fix!

Myth: My current ops team can run a private cloud easily.
Fact: You need to train people on how to run a cloud.  It's not traditional ops.  You have to learn to let go.  You have to give people control over their own machines.  You have to let them provision their own VMs.
Tip: Sysadmins should learn how to write infrastructure code.  Either that or they may be reduced to changing out broken servers on racks.
Tip: One way to prevent people from overloading your cloud capacity is to put fixed leases on the machines.  If you have infrastructure as code, you can provision a new system in a few minutes.  So, for developer unit test VMs, it's fine to delete those after 24 hours.  The developer may manually delete VMs even earlier.  Have a way to flag the small percentage of machines that should live for a long time.
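
A lease reaper along those lines might look like the following sketch (the VM records and field names are hypothetical): short-lived test VMs expire after 24 hours unless explicitly flagged to live longer.

```python
# Fixed VM leases: delete unflagged VMs older than 24 hours.
import datetime

LEASE = datetime.timedelta(hours=24)

def expired(vm, now):
    # VMs flagged "keep" are the small percentage that live a long time.
    return not vm.get("keep", False) and now - vm["created"] > LEASE

now = datetime.datetime(2012, 6, 29, 12, 0)
vms = [
    {"name": "unittest-vm", "created": now - datetime.timedelta(hours=30)},
    {"name": "demo-vm", "created": now - datetime.timedelta(hours=30), "keep": True},
]
to_delete = [vm["name"] for vm in vms if expired(vm, now)]
print(to_delete)  # only the unflagged, expired VM
```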

Thursday, June 28, 2012

The Women at DevOps Days

I recently posted a tweet about DevOps Days Mountain View 2012, and I feel the need to explain it a bit: https://twitter.com/DukeAMO/status/218541579219116032/photo/1

First of all, why did all of the women there that evening decide to get together, in a cage of all places?  Well, DevOps Days was held in a very odd sort of place, the Silicon Valley Cloud Center.  It's a former data center, apparently, and the evening dinner and party was basically in the garage.  For some reason, the garage is partitioned into several large cages, surrounded by chain link fences.  We thought it would be fun to get together, and the easiest place to do that was in one of the cages, because there were tables in there.  The fact that it was slightly subversive to do that was part of the fun.

Did we do that to make some sort of political statement?  No, not really.  The fact of the matter is, when you're a woman at an operations conference, it's very obvious that you're in a minority.  I haven't done any scientific studies, but I'm going to guess that the ratio of men to women attendees at both DevOps Days and the larger Velocity conference this week was around 95% to 5%.  It's a shame that more women aren't attracted to this field, because it's quite rewarding, and there's no reason for women to stay away.  But the women who do buck the trend are rare birds, and we don't mind being a little bit different.  We also get used to it.  On a day-to-day basis, I don't even notice that the vast majority of the people I work with are men.  It's normal.

We had our little toast together, and had a great time talking for the rest of the evening.  By the time I left, there were just as many men at our table as there were women, and we were happy to have them.

Here's to the rare birds!

Thursday, May 24, 2012

Are you treating your servers like snowflakes or rubber stamps?

This is more of an agile blog, but I thought I'd cross-post this from our DevOps team blog, just in case anyone's interested:

Are you treating your servers like snowflakes or rubber stamps?

I also submitted this proposal for DevOps Days Mountain View 2012.

How IBM is using DevOps to build DevOps tools

Comments on both are welcome!

Friday, January 27, 2012

How can you make remote development work?

I have a few years of experience working from home, and several years of experience working with cross-site teams. A few things are essential for a remote worker. A remote worker needs to be:
  • Motivated
  • Independent
  • Self-managing
  • Responsible
  • Good at time management
  • Excellent at communications (oral and written)
  • Honest and trustworthy
  • Good at ignoring distractions
  • Willing to ask for help if needed

While most of these are disciplines that you can get better at with practice, some are harder to learn.

Working from home can be great, but it's a special skill, because it's easy to get distracted with things that need to be done around the house. Also, there's no one there to see if you're goofing off, so you have to motivate yourself to focus on the task at hand. It's easier to focus and be productive if you have specific tasks that are due on a regular (daily) basis. Scrum helps with this by providing a quick daily checkpoint meeting.

Ideally, remote workers should be in the same time zone as each other, or at least close to the same time zone. I've found that working with people 5 hours away (the UK, for example) is workable. You just learn to block off your mornings for communication and meetings, and do independent work in the afternoons and evenings. Working with people 7 or more hours away is painful. You end up losing days of work because you can't find someone who can help you get past a blocking issue. Perhaps you send an email describing a problem and asking for help... then the person on the other end misunderstands something... and now you're into a second lost day. We work with people in India, but they're actually on a 12-8 PM working schedule, so we at least have a couple of hours of overlapping time in the Eastern US. While you *can* call people early in the morning or late in the evening to keep things moving along, nobody likes to do that on a daily basis.

One other tip, especially for software engineers, is that it really helps to bring people into the office for a week or so, once or twice a year. First of all, face to face communication is high-bandwidth, so you can get a lot of work done in a short amount of time (planning, design work, strategy sessions, training, etc.). People tend to trust and respect each other a little more once they've met in person. They're also more comfortable asking questions, or asking each other for favors, which means people are more productive when they get home. Finally, it builds a stronger team and improves morale.

A year or two ago, multinational development teams (even within one scrum team) were the norm here. But we're trying to consolidate projects locally as much as we can now. My current project is probably 75% local and 25% remote.