Welcome

What works and what doesn't work in software development? For the first 10 years or so of my career, we followed a strict waterfall development model. Then in 2008 we started switching to an agile development model. In 2011 we added DevOps principles and practices. Our product team has also become increasingly global. This blog is about our successes and failures, and what we've learned along the way.



The postings on this site are my own and don't necessarily represent IBM's positions, strategies or opinions. The comments may not represent my opinions, and certainly do not represent IBM in any way.

Friday, August 8, 2014

SoftLayer Gotchas, Tips, and Tricks

NOTE: THIS IS OUTDATED!
Here are many things I learned about SoftLayer the hard way, through trial and error.  Of course, SoftLayer's offerings are evolving, so some of these "facts" will change over time, but reading through this list should still help.

Some of this is RHEL-centric, because I do the majority of my SoftLayer work on RHEL.

Useful links

Support

  • SoftLayer tech support via the online ticketing system is very good.  I open tickets whenever I get stuck, and they get back to me quickly.  They also know what they're talking about; no high school students running the help desk here.

Networking

  • You can't run the SoftLayer SSL VPN and another VPN client at the same time.  It will seem like it's working, but you won't be able to connect to the SoftLayer systems.
  • Use the SoftLayer private network to administer your systems.  It's more secure and saves money.
  • If you're using the private network, the private network must be on eth0, and you must have a static route: 10.0.0.0/8 via your private subnet gateway.  Without that static route, you won't be able to connect to other SoftLayer servers on the private network, and they won't be able to connect to you.  SoftLayer compute instances and bare-metal servers have that route by default; it's usually only missing if you re-install the OS yourself, which typically happens when you're running a hypervisor within SoftLayer and creating your own VMs there.
  • If you're deploying CCIs or VMs or servers with a public Internet connection, you must never enable root SSH access with an easily guessed ID and password, even for a minute!  The machine will be hacked very quickly, by Internet worms that try to log in to all available IP addresses with common IDs and passwords.  Monitor bandwidth usage regularly to ensure that none of the systems are consuming huge amounts of bandwidth; this is a symptom of an Internet worm that has managed to install itself.
  • If you need to ensure that some of your servers are on the same VLAN, use SAN storage, not a local disk, for the primary disk.  If you use a local disk, your server will probably be assigned to a different VLAN, and you won't be able to move it later.  
  • If you want multiple bare-metal servers to be on the same VLAN, order all of them at the same time and be consistent with your storage configuration (local vs. SAN disks).
  • VLAN spanning: By default, servers on different private VLANs can't communicate with each other.  There's a setting where you can enable VLAN spanning, and then all of your private VLANs will be able to communicate with each other (regardless of data center).  
  • When creating new servers, you don't get a choice of subnet (only the VLAN).  If you want to assign servers to specific subnets, request Portable IP Subnets (public or private) and use those instead.  
  • Use the subnet details page (https://control.softlayer.com/network/subnets/*) to "sign out" portable IP addresses by entering the hostnames in the Notes field.  
  • For public IP addresses, you can set up Reverse DNS from the same subnet details page.
  • You can add a dedicated hardware firewall in front of any public subnets, and configure firewall rules.
  • If you want very flexible control over your firewall and VPN access, you can add a Vyatta gateway (or an HA pair of Vyatta gateways) instead.  Each Vyatta gateway can manage 1 "pod", where a pod is all of your own VLANs behind the same router pair (public + private) in the same datacenter.
  • ESX servers will only be connected to the SoftLayer private network.  Their public NICs are disabled.  This was never a problem for us.
  • The private SoftLayer network is not intended for large file transfers.  It is a shared resource, and file transfer speeds can be slow at times.  The public network will give you faster file transfers.
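The root-SSH warning above can be enforced in sshd_config.  Here's a minimal sketch; it edits a local demo copy of the file, so on a real server you'd point CONF at /etc/ssh/sshd_config, reload sshd afterwards, and keep an existing session open while testing so you don't lock yourself out:

```shell
# Minimal sketch: disable password logins for root and force key-based auth.
# We edit a local demo copy; on a real RHEL server, set CONF to
# /etc/ssh/sshd_config and run "service sshd reload" afterwards.
CONF="./sshd_config.demo"
printf '#PermitRootLogin yes\n#PasswordAuthentication yes\n' > "$CONF"

# Uncomment-and-set both directives, whether or not they were commented out.
sed -i 's/^#\?PermitRootLogin.*/PermitRootLogin without-password/' "$CONF"
sed -i 's/^#\?PasswordAuthentication.*/PasswordAuthentication no/' "$CONF"
```

With these two settings, root can only log in with an SSH key, and password guessing gets you nowhere.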

Compute Instances (a.k.a. CCIs)

  • These are essentially VMs running on a Xen hypervisor.
  • If you try to install an unsupported OS (such as RHEL 6.4) on a CCI, you may make the CCI unusable.  You will also not be able to back up and restore the CCI.  It's a bad idea.
  • On Xen, dmidecode doesn't work. 
  • You can't install VMware drivers on Xen.  You can, however, order a vCenter CCI from SoftLayer and use that to manage your SoftLayer ESX servers.
  • The maximum size for the primary disk is 100 GB.  You can add additional disks if needed.
  • LVM is not supported.  If you try to use LVM to make multiple disks look like a single partition, you may make the CCI unusable.  You will also not be able to back up and restore the CCI.  This is also a bad idea.
  • Provisioning time may vary between datacenters. When ordering a CCI, you can select an option to deploy it to the first available datacenter if provisioning speed is more important than where it ends up.
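Since dmidecode doesn't work, one way to confirm you're on a Xen guest from inside a CCI is the sysfs hypervisor node.  This is just a sketch; the file only exists on Xen-style guests:

```shell
# Sketch: identify the hypervisor without dmidecode.
# /sys/hypervisor/type exists on Xen guests and contains "xen".
if [ -r /sys/hypervisor/type ]; then
  HV="$(cat /sys/hypervisor/type)"
else
  HV="none"   # likely bare metal or a non-Xen hypervisor
fi
echo "hypervisor: ${HV}"
```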

Bare Metal Servers

  • If you install the available APF (Advanced Policy Firewall), it will probably conflict with iptables unless you know what you're doing.
  • Bare-metal servers are not available with RHEL.  CentOS is available.
  • If you don't see the server configuration you want in the web order form, you may be able to order exactly what you want by starting up a sales chat.
  • If you try to install an unsupported OS (such as RHEL 6.4) on a bare-metal server, you will not be able to back up and restore the server.  This is a bad idea.

Red Hat Tips

  • Bare-metal servers are not available with RHEL.  CentOS is available.  Some of our software requires RHEL, and in some cases, we have been able to work around this by installing the prereq packages from the CentOS repo before installing that software.
  • To add a static route to the private network on RHEL/CentOS: add the route to /etc/sysconfig/network-scripts/route-eth0 so the setting will be persisted, and also run "route add -net 10.0.0.0 netmask 255.0.0.0 dev eth0" to enable it immediately.  Google it for details.
  • OS options are somewhat limited; RHEL 6 is the latest version of RHEL that SoftLayer supports (currently 6.5).
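The static-route tip above looks like this in practice.  The gateway address below is a placeholder (substitute your private subnet's gateway), and the sketch writes a local demo copy instead of the real /etc/sysconfig/network-scripts/route-eth0:

```shell
# Sketch: persist the 10.0.0.0/8 private-network route on RHEL/CentOS.
# GATEWAY is a placeholder; use your own private subnet gateway address.
GATEWAY="10.0.0.1"
# Real path: /etc/sysconfig/network-scripts/route-eth0 (demo copy used here).
echo "10.0.0.0/8 via ${GATEWAY} dev eth0" > ./route-eth0.demo
# To enable the route immediately, without waiting for a network restart:
# route add -net 10.0.0.0 netmask 255.0.0.0 dev eth0
```

The route-eth0 file makes the route survive reboots and network restarts; the route command makes it live right away.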
Hope this helps!  I welcome your comments below.

Friday, May 16, 2014

Bootstrap scripts for SoftLayer: Chef, Eureka, Openstack Devstack, Tomcat

I've published sample scripts that can be used to bootstrap some popular software packages.  The samples are AS-IS open source (not supported by IBM).  

The samples out there right now are: Chef workstation, Chef Solo configuration (solo.rb, node and role files), Eureka (NetflixOSS), Devstack (OpenStack developers' distribution), and Apache Tomcat 7.  The Eureka and Tomcat bootstrap scripts also use Chef Solo, while the DevStack script is just a straight .sh script.

If you tweak the scripts a little to use your own information and provide pointers to them in SoftLayer's "provision script URL" field, your systems will be automatically configured for you when you create them or re-image them.
See the README.md file for details: https://github.com/amfred/softlayerbootstrap
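For illustration, a provision script can be as small as the sketch below.  The URL and filenames are placeholders, not from the repo; SoftLayer downloads whatever the "provision script URL" points to and runs it as root at provision time.  Here the script is just written to a local demo file:

```shell
# Sketch of a minimal provision script (URL and paths are placeholders).
# SoftLayer fetches the "provision script URL" and runs it as root on first boot.
cat > ./provision.sh.demo <<'EOF'
#!/bin/sh
set -e
# Fetch and run your real bootstrap script, e.g. one of the Chef Solo samples:
curl -fsSL https://example.com/my-bootstrap.sh -o /tmp/bootstrap.sh
sh /tmp/bootstrap.sh
EOF
chmod +x ./provision.sh.demo
```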

Wednesday, August 21, 2013

How to make life easier for your remote employees

I've already written one post about setting up teams with remote workers.  However, I didn't really focus on the cultural changes that employees in the regular office can make to ensure life is easier and more efficient for their remote colleagues.  Cultural changes are always difficult, but these are worth the effort if you want your entire team to be productive:


  • Be available for communication.  People are going to have questions, and if they're remote, they can't walk down the hall to ask them.  Be available via chat, instant message, text message, and phone.  Check your email.  Respond to messages from remote employees at least as quickly as messages from locals, and remember that remote employees can't just stop by if it's urgent.  It's great to set aside a couple of hours each day when you won't be interrupted, but make sure there are plenty of hours when your remote workers can reach you.  Set up your "do not disturb" time so it's either a time when your remote employees are not working, or when they are also on their "do not disturb" time.
  • Be available in off hours, especially when working across time zones.  Most remote employees will be very respectful of your working hours, and will only call you in off hours if they're stuck, and then they'll keep it quick.  If they do not have the freedom to call or text message you in off hours, they will lose hours of productive time on a regular basis.  They will learn to stop asking questions, and either spend the time trying to figure it out themselves using the Internet, or implement something that may or may not be what you want, and ask for feedback on it the next day.  Worried that you'll end up working too much in off hours?  Cross that bridge if you come to it, and allow for flexible schedules.
  • Allow for flexible schedules.  If someone ends up working late one night, they should be able to leave early or come in late another day that week.  Also, consider whether people in different time zones should work a shifted schedule.  For example, I've worked with teams in India that worked from 12-8 PM every day, so they would be at work for at least a few hours when the U.S. employees were at work.  Now that I'm living in California, I'll work an hour very early in the morning to sync up with people in Europe and on the east coast of the U.S., then I'll take a couple of hours off to get the kids off to school, then I'll finish up my work day.  My "do not disturb" time is in the late afternoon, when my colleagues are not working.  I also try to schedule all of my personal appointments late in the day.  Sometimes I work from 10-11 PM to sync up with people in China.  Flexible schedules are great as long as people are available for communication when the rest of the team is working.
  • Be careful about setting meeting times.  Keep meetings short, small, and focused.  Set meetings during everyone's working hours, or take turns having meetings during your off hours and during others' off hours.  Consider whether it's better to schedule a meeting or pick up the phone.
  • Be careful about who you invite to meetings.  Again, keep your meetings small.  But if you're having meetings where your remote colleagues have something to add, then invite them and make sure you provide facilities for them to join in (such as a good conference calling phone system, plus screen sharing).  Retrospectives and planning meetings, meetings where you set team policies, and town hall meetings are good examples.
  • Avoid using the mute button during conference call discussions.  If it's a one-way presentation, then it's fine for everyone else to be on mute to block background noise.  But if it's a discussion, don't put people on mute while you have a side discussion.  It's disrespectful, and people know it's happening because the sound of the background noise changes.
  • Be very clear in your communication.  Write carefully, especially if it's an email message rather than real-time communication.  Also, be explicit about work that is required, and work that is optional.  Are you assigning a task, or are you tossing around ideas?  Do you need a lot of help with something, or do you want to be pointed in the right direction?  Is anything blocking you?  Are you getting pulled away from your main tasks to work on something else?
  • Be smart about communication modes.  Use email when you have to carefully consider what you write, or to report on your status and hand off work at the end of the day.  Email and mailing lists are terrible ways to have involved discussions.  Use the phone (either an impromptu call or a meeting) when you have much to discuss.  Use instant messages or texts for quick questions.  If your instant messages or email messages are getting long, it's time to switch to the phone.  Use mailing lists for broadcast messages that need to go to a group of people, but if it turns into a discussion, move the discussion to a forum or wiki, send the link to the mailing list, and politely move the discussion off of the mailing list and into the forum/wiki.  If discussing a work item (defect, feature, etc.), discuss it via comments on the work item, for future reference.  Get all of your tips, setup information, and troubleshooting information into a forum or wiki, or write documentation that stays with the work item.
  • Have blameless retrospectives.  Every week or two, get together as a team to discuss what is working and what is not working, in an honest, blame-free environment.
Anyone have more ideas to add?  As always, I welcome your comments!

Wednesday, November 14, 2012

Test automation got you down?

It's here - my full article on Making Your Automated Tests Faster, on the Enterprise DevOps blog!  Thank you again to the dozens of people who contributed their ideas at DevOps Days Rome.

This reminds me of my earlier post on "death by build", and how a build that takes too long, due in part to slow test automation, can really hamper a project: Is Shift-Left Agile? And Death By Build

Tuesday, November 6, 2012

Interview about DevOps and SmartCloud Continuous Delivery

Here's an interview I recently gave to a colleague of mine, Tiarnán Ó Corráin.  This seems like a good time to reiterate that these views are my own, and not the official views of IBM:


What is DevOps?

You can think of it as an extension of the principles and practices of Agile.  Where the Agile methodology breaks down the barriers between development and business goals, DevOps is breaking down the barriers between development and operations.  It's not all about tools; it's about people and processes as well.  Both Agile and DevOps have a goal of delivering reliable software to the business more quickly, and ultimately, making more money!

How does DevOps do that?

Well, traditionally there has been a problem in going from a development system to a production system: installing new machines, installing the software, scripting and so forth. Getting from a working development system to a working production system involved all of these manual steps, and introduced many points of failure.

In addition, because setting up test machines was so time-consuming and error-prone, developers would often assemble their test machines in a quick and dirty way, on a single server, with the cheapest, simplest components.  Production systems, on the other hand, would have multiple servers, configured for clustering and fail-over, with firewalls between some components.  This meant that the developers weren't testing the software in an environment similar to the one where it would run in production.  And because of that, some bugs were never found until the software was deployed into production.

One of the primary tenets of DevOps is that you should automate every step in creating production systems.  Everything from preparing the machine, to installing the latest software, to starting the services, and testing them, should be fully automated and repeatable.  And when creating production systems is automated, you can also use production-like systems for development and test work.

How does virtualization help that?

When we're working in a cloud environment, deploying a virtual machine is something that can be scripted and automated.  We use infrastructure code to automate deploying the machine, installing the software, and (re)starting the services, and then we check that code into our source code repository and version it just like the application code.  So effectively the process of deploying a new system becomes part of the development process.

Presumably that makes testing easier?

Very much so.  Our virtualization technology means we can deploy production-like environments as part of the development process.  It's the way the development process has been trending recently.  We already have continuous integration: RTC (Rational Team Concert) triggers automatic builds when changes are submitted, and we run unit tests against those builds.

Now, take that to the next level with continuous delivery: after changes trigger builds, those builds trigger deployments, and when the deployments are complete, the builds trigger automated tests.  What this means is that, as part of the development process, we have production-like servers running the latest code.  This allows us to run automated tests, including performance verification, against a production-like environment as part of the development process.

Taking Agile to the next level?

Yes.  It accelerates development, because it takes away some of the uncertainty about deployment: if I can capture every part of the deployment process in my development and testing process, I have more confidence about what I'm going to deploy.

Deploying test systems automatically also saves developers and testers a lot of time!  On my own development team, it's normal for us to deploy dozens of new servers every day, and delete the old ones just as often.

Can you tell me a bit about your own role?

I'm on the advanced technology team that works on DevOps.  We are driving an internal transformation within IBM, to encourage our own development teams to adopt DevOps principles and practices.  In addition, we are creating tools to help IBM's enterprise customers adopt DevOps themselves.  The first tool we're bringing to market is SmartCloud Continuous Delivery v2.0, and it's shipping to customers this week.  SmartCloud Continuous Delivery is currently targeted at customers who want to improve their dev/test cycle.  We believe this is the easiest place for our enterprise customers to start taking advantage of these new technologies.  We have other tools to help with production deployments, like SmartCloud Control Desk.

How is it going down in the market?

These ideas are gaining real traction, both within and outside of IBM.  Internally, we already have several adopters of continuous delivery for dev/test.  For instance, Rational Collaborative Lifecycle Management is using our code, and other teams like SmartCloud Provisioning 2.1 have custom continuous delivery solutions that are very similar to ours.  And of course, we're using it ourselves -- SmartCloud Continuous Delivery is self-hosting.

What would you say to any teams that are interested in this approach?

If anyone would like to evaluate the SmartCloud Continuous Delivery product, please check out our website.  We have free trials available.

Even if you're not a good candidate for SmartCloud Continuous Delivery, your team may be able to use several of the DevOps principles and practices.  Check out our Enterprise DevOps blog for ideas, and feel free to contact me about that as well.  IBM even offers DevOps consulting workshops.

Monday, October 8, 2012

DevOps Days Open Space: Making Your Automated Tests Faster


One part of my job is helping other teams adopt DevOps in general, and continuous delivery in particular.  But I have a problem: many of them have a suite of automated tests that run slowly; so slowly that they only run a build, and the tests that run in the build, about once per day.  (Automated test run times of 8-24 hours are not uncommon.)  There are several reasons why this is the case, including:

  • The artifacts that are produced from the build, and then copied over to the test servers, are very large (greater than 1 GB in size).  Also, sometimes the artifacts are copied across continents.
  • Sometimes there are multiple versions of the build artifacts that must be copied to different test servers after the build.  A typical product I deal with will support at least a dozen platforms; a few support around 100 different platforms, when you multiply the number of supported operating system versions times the number of different components (client, server, gateway, etc.) times 2 (for 32- and 64-bit hardware).
  • Often, the database(s) for the product must be loaded with a large amount of test data, which can take a long time to copy and load.
  • Many products have separate test teams writing test automation.  Testers who are not developers tend to write tests that run through the UI, and those tests are usually slower than developers' code-level unit tests.
Running builds and tests often, so developers know quickly when they make a change that breaks something else, is a key goal of both continuous integration and continuous delivery.  Ideally, a developer should get feedback on whether their code is "ok", using a quick personal build and test run, within 5 minutes.  Anything over 10 minutes is definitely too slow; the developer will probably move on to something else, make more changes, and forget exactly what was changed for that particular test.

Once the quick tests pass, the developer can run a full set of tests and then integrate the tested changes.  Or, in cases where a full set of tests is extremely slow, the developer can integrate his or her code changes once the quick tests pass, and then let the daily build run the full set of tests.
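The two-stage flow above can be sketched as a small wrapper script.  The test functions here are stand-ins for your real suite commands:

```shell
# Sketch: run the quick smoke suite first, and only pay for the slow full
# suite when it passes.  run_smoke_tests/run_full_tests are stand-ins for
# your real test commands.
run_smoke_tests() { echo "smoke tests: ok"; }
run_full_tests()  { echo "full tests: ok"; }

if run_smoke_tests; then
  RESULT="$(run_full_tests)"
else
  RESULT="smoke tests failed; full suite skipped"
fi
echo "$RESULT"
```

The developer gets the smoke-test verdict within minutes, and the slow suite only runs on changes that have already cleared that bar.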

In this DevOps Days open space session, we brainstormed ways to make automated tests run more quickly.  We focused more on quick builds for personal tests, but most of these ideas would make the full set of tests faster too.  Many thanks to the dozens of smart people who contributed their ideas.  I don't even have their names, but they know who they are.  I'm sure we'll use several of these ideas right away.

Watch for an article with more details on each of these, coming soon...

Fail quickly


Run a quick smoke test first

Run a small set of tests that fail often next

Run slow tests last, or not at all

See also: Remove slow tests


Run in parallel


Run test buckets in parallel

Use snapshots of databases or VMs to make it easier to run tests in parallel


Break up tests into smaller groups


Divide your application into components, and test the changed components

Automatically determine which tests to run when code changes


Save time on I/O


Mock responses

Use LXC (Linux Containers)

Move servers and data so they are close to one another

Make your test infrastructure faster

See also: Use snapshots of databases or VMs to make it easier to run tests in parallel

Cache what you can


Remove some tests


Remove tests that never fail

Remove slow tests

Replace some UI tests with code-level tests

Replace some tests with monitoring


Friday, July 6, 2012

DevOps Days Open Space: DevOps for Legacy Code and Real Servers

I proposed the topic for this open space session: DevOps for Legacy Code and Real Servers.  Here are some of the insights I gleaned from this session.

Legacy servers
  • Cloud platforms are evolving to manage real, legacy servers in addition to virtual machines.
  • Chef, for example, can manage both physical and virtual servers.  It can also manage clean OS installations as well as update existing servers.  There's a tool called Blueprint that will attempt to reverse engineer Chef automation for an existing server.
  • It's difficult to re-create systems that weren't automated in the first place.  However, it greatly reduces your risk if you invest the time and effort to do that.  What if the server was destroyed in a fire or something?
  • Sometimes people have even lost the source code for applications that are running in production.  That is a very risky state to be in.
  • Another option is to clone the system into a VM first, snapshot it, and then do your exploratory work on the VM.
  • You can also copy some of your production web traffic to your staging servers.
  • Or, you can start deploying new applications to VMs, and gradually shift your enterprise code to VMs.
Mainframe systems
  • Mainframes are the backbone of many legacy systems, and they are not going away.  
  • People who are used to working with mainframes have a different culture and language than people who are used to developing new web applications.  There's a communication gap to bridge before they can benefit from DevOps principles and practices.
  • One option is to just get an enterprise's web applications to adopt DevOps and punt on the mainframe applications.  But why can't we do the same thing for mainframe applications?
  • Mainframes have limited logging and monitoring systems.  Why?
  • Mainframes have limited tooling.  Why?
  • It's very difficult to see what's going on within an application.
  • It's very difficult to debug applications.
  • Deployments have to be completed with zero downtime.
  • LPARs, CICS regions, etc. could actually be considered a type of virtualization.  Is there a way we could make them behave more like VMs?
  • Could mainframe developers take some of the best practices from .Net and Java?
  • A more open place, like a university, might be more willing to experiment with DevOps first.
Universal Principles
These principles from DevOps can apply to legacy servers and mainframes just as easily:
  • Source Control Everything (including infrastructure code)
  • Version Control Everything (including infrastructure code)
  • Automate Everything (including infrastructure code)
  • Test Driven Development: Test First, Test Everything
  • Test for Operational Quality (performance, transaction load, security, etc.)
  • Agility
  • Focus on the Business Outcome, not the features or requirements
  • Improve teamwork between Dev and Ops
  • Collect metrics so you can find problems earlier