Production Support

Developers like to keep on developing new things. Having to deal with users who raise pesky bugs is not what they signed up for. Creating a culture that looks at the whole life cycle of code, including maintenance, is a key step to achieving functional Dev Ops.

I recall a lecture at university, in a software engineering module, where the lecturer asked the class what percentage of the software life cycle we thought was spent on each of the stages: analysis, design, development, testing and maintenance.

Few in the class got the figure right for the maintenance stage: over fifty percent, and that was a conservative estimate. The course focused on the waterfall methodology; Agile had not yet gained traction with universities in the late nineties. Yet the percentage surely holds true whatever the development methodology. Code lives for a long time.

Whether you develop software that is sold to users at a distance or build in-house systems where the users are closer, and whether you use waterfall, Scrum or Dev Ops, there will still be code out there that requires bug fixes, enhancements and extensions.

Yes, your team is working on the next big version, and you are excited about it, but the customer cannot upgrade their system this month, or even this year. They ask, “Can you apply the shiny new feature to the old version of the code? Thanks.”

That new version of the product will often be based on layers of old, bug-ridden code that we simply don’t have the budget to fix. We want to rewrite it, to refactor away the code smells and overcomplicated patterns, but it works… Kind of… Most of the time.

Wrong

Has a line of code ever stood out to you as being wrong? You tend to see it every time you revisit the class. Maybe the variables are named poorly or the logic is too complicated for what it does. You worry about it but don’t want to mess with it; it’s like pulling a loose thread and unravelling the entire piece of cloth. It works, for now, although you can’t be sure what it does. You’ve ignored it, but it haunts your sleep.
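
To make this concrete, here is a minimal, entirely hypothetical sketch of the sort of line being described, next to the clearer version you would rather see. The names and the VAT calculation are invented for illustration, not taken from any real codebase.

```python
# A purely hypothetical example of a line that "looks wrong": terse names,
# a magic number and tangled logic hide the intent.
def calc(d, f):
    return d + d * 0.175 * (1 if f else 0)


# A clearer version of what appears to be the same intent. Writing it is
# easy; being confident the behaviour is unchanged, without a regression
# test around the original, is the hard part.
VAT_RATE = 0.175


def add_vat(net_amount: float, vat_applies: bool) -> float:
    """Return the gross amount, adding VAT only when it applies."""
    return net_amount * (1 + VAT_RATE) if vat_applies else net_amount
```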

This is the kind of line that is going to cause an issue in production sooner or later. Perhaps the code will be changed by a developer who does not fully understand its purpose. Perhaps it looks wrong because it is. You should fix it as soon as you can, but you feel straitjacketed because there is not enough time to regression test the functionality.

When this line does break something, what then? Most teams can react immediately and sort out the issue. It will hurt. It will delay other work. But it will finally be resolved. The cost will be more than the development and testing effort alone; there will be a system outage or a customer will be affected. Better to resolve these issues sooner.

This sort of problem is technical debt. We can spot it manually, we can find it with static analysis, and peer review can catch it too. However, it takes time (resources) to fix. Fix the problem soon after it is made, for example shortly after the developer commits it to source control, and it is easy (cheap) to fix.

Delay that fix by one month, perhaps until it is caught in acceptance testing, and the cost goes up. The developer can’t remember what the code was supposed to do; even after one month, the context switch back in time is difficult. It is harder to package a patch for the testing environment. There is more red tape to complete: bug reports logged, updated, reviewed and closed.

Delay it longer, say until it is found shortly after release to production, and the cost goes up dramatically. Not only does the fixer have to go back in time, figure out what the code was supposed to do, fix it and deploy it somewhere it can be tested, but getting the fix into production could mean having to schedule downtime or send updates to clients.

What if it’s years later before it surfaces as a problem? Perhaps the developer is no longer working for the company, the team has disbanded and no one wants to admit to ever having had anything to do with it. The cost of fixing rises with the time that has passed since the line was written.

One of the big draws of Dev Ops is that this sort of production fix no longer carries long delays or high costs. Developers check code into trunk knowing that their change will go through a suite of testing, from user acceptance tests for functionality, through static analysis to ensure the code is compliant, to security testing, capacity checks and other non-functional checks.

Dev Ops pipelines are largely automated, so the cost of making a change and taking it through the pipeline, including release, should be negligible. Testing is automated, as is code analysis, so code smells and bug-prone code do not even make it into production. Developers get fast feedback on their commits, allowing them to self-correct before moving on to something new and reducing the cost of that context switch into the past to fix something written last month.
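
As a rough sketch of what “largely automated” means, the gate for every commit can be reduced to an ordered list of checks where any failure stops the change from moving further. The specific tools named below (pytest, flake8, bandit) are assumptions chosen for illustration; substitute whatever your pipeline actually runs.

```python
# Minimal sketch of an automated pipeline gate. Every commit runs the same
# ordered stages, and a failure in any stage stops the change from being
# promoted towards production. The tool commands are illustrative choices.
import subprocess
import sys

STAGES = [
    ("unit tests", ["pytest", "-q"]),
    ("static analysis", ["flake8", "src"]),
    ("security scan", ["bandit", "-r", "src"]),
]


def run_pipeline() -> int:
    for name, command in STAGES:
        print(f"Running {name}...")
        result = subprocess.run(command)
        if result.returncode != 0:
            print(f"{name} failed; the change goes no further.")
            return result.returncode
    print("All stages passed; the change can be promoted.")
    return 0


if __name__ == "__main__":
    sys.exit(run_pipeline())
```

The value is less in the script itself than in the fact that the same checks run on every commit, so feedback arrives while the change is still fresh in the developer’s mind.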

The best way to stop bugs in production is to remove them before they get there. The layers of automated testing reduce the number of bugs that make it through.

Some bugs will still get into production. It is inevitable. When computer systems run in the wild, they can behave in unpredictable ways. Dev Ops is designed to fix those problems quickly and to add preventive measures so that the same thing does not happen again.

Once we find a new failure, we add a test for it. Build up the tests over time so that the same failure cannot happen again. Our test suite becomes richer and richer, and confidence to release increases.
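
As a sketch of what that looks like in practice: suppose a production incident showed that an empty discount code crashed the order total calculation. Once the bug is fixed, a small regression test pins the behaviour down so the same failure cannot return unnoticed. The module, function and incident number below are hypothetical.

```python
# Hypothetical regression test added after a production incident. The
# billing module, function name and incident number are invented for
# illustration; the point is that every fixed bug leaves a test behind.
import pytest

from billing import calculate_order_total  # hypothetical module


def test_empty_discount_code_is_treated_as_no_discount():
    """Regression test for incident 1234: an empty discount code used to
    raise an exception in production instead of applying no discount."""
    total = calculate_order_total(items=[10.0, 5.0], discount_code="")
    assert total == pytest.approx(15.0)
```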

My company is a long way from Dev Ops. We have a continuous build system in place and a standard release procedure that is well documented, but much of it is still run manually. We will improve this over time, eventually being able to fix code quickly, but meanwhile we have to work on a culture shift in how we write and design code and deliver it to operations.

All too often, developers say that they are done as soon as the basic case is catered for. They haven’t considered edge cases, or that users might not follow the exact set of steps that they used when testing in a sandbox.

If these issues are not found, they pass into production and become somebody else’s problem. Many never surface, such as low-probability edge cases. But they may occur. What then? The production support team will be the first to know about the problem.

Our production support team reacts quickly to user and system issues, but can become overwhelmed at times. Some issues are simply buried too deeply in the code to be resolved quickly. When issues arise that they cannot solve themselves or the number of concurrent issues simply grows too high, they seek help from the development teams.

This has an impact on the amount of project work that we can complete, and that impact is not very visible. The team simply looks like it completed fewer story points in the sprint, when closer analysis shows it had to deal with multiple production issues.

A sense of resentment builds up that development has been stopped due to the influx of issues. But this is the wrong attitude to have. If production is down, we can’t make money. If we don’t make money, I won’t have a job. It is in everyone’s interest to keep production bug-free.

Even within a scrum team, the impact of production issues is sometimes hard to see. Consider a developer who commits at the stand-up to finishing a four-hour task that day. However, by the time he or she gets back to his or her desk, there is a production support ticket waiting to be worked on. It takes up the rest of the day.

The earliest the rest of the team might hear about this is at the next day’s stand-up. Maybe another developer could have picked up the task? Maybe the QA was waiting for something. In truth, this is a little contrived, but it can happen, especially with remote teams.

In The DevOps Handbook, the authors describe the Andon Cord from Toyota manufacturing plants. When something goes wrong on the production line, say a part is missing or the wrong part arrives, the worker can pull the cord to alert their line manager.

If the manager cannot solve the problem within a set amount of time, the entire production line stops. This allows the entire plant to fix the issue, learn from it, and take steps to ensure that it does not happen again.

There is little point in other sections of the line continuing to work. Anything upstream, supplying the problem area, will create a backlog of blocked work items that need to be stored while waiting for the problem to be fixed. Anything further down the line will exhaust its supply and begin to starve.

The only option is to stop everything. Then all appropriate resources can be targeted at a solution. Each part of the line understands that it is not its own section’s output that matters; it is the number of complete cars that pass quality control.

Can this example teach us anything about software development? Analysis, design, development, the various stages of testing, and monitoring in the wild can all be considered different parts of the production line.

Any one of them can be overloaded or starved, just like the Toyota line. If production is full of bugs, issues and problems, it will be difficult to release new features to it. Every part of the line is vulnerable. Every activity can have an impact on the entire delivery.

We could make use of our own Andon Cord. If a developer is blocked or requisitioned to solve a production line problem, they need to alert their team. The team can either rally around to help solve the production issue, or at least react to the missing developer by moving tasks around.

The Andon Cord should also be invoked to pause development of new features if something breaks the continuous integration build; there is no point having automated tests if they fail all the time. No one will know what change caused the build to fail.

Any commits onto an already broken build will make it more difficult to fix whatever caused the break. Change the culture of development so that a problem with any single part of the software production line becomes everyone’s problem.
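
One lightweight way to wire up such an Andon Cord is to check the state of the trunk build before starting new feature work and to alert the whole team if it is red. The sketch below assumes a CI server that exposes build status as JSON and a team chat webhook; both URLs and the response format are assumptions, not any real CI server’s API.

```python
# A minimal "Andon Cord" sketch: refuse to start new feature work while the
# trunk build is red, and alert the whole team instead. The CI status
# endpoint, its JSON shape and the chat webhook URL are assumptions made
# for illustration.
import sys

import requests

CI_STATUS_URL = "https://ci.example.com/api/builds/main/latest"  # hypothetical
TEAM_WEBHOOK = "https://chat.example.com/hooks/dev-team"         # hypothetical


def pull_the_cord() -> int:
    build = requests.get(CI_STATUS_URL, timeout=10).json()
    if build.get("status") == "passed":
        print("Build is green; carry on with feature work.")
        return 0

    # The build is broken: make it everyone's problem, not just the last
    # committer's, and block new feature work until it is green again.
    requests.post(
        TEAM_WEBHOOK,
        json={
            "text": "Andon Cord pulled: the main build is "
                    f"{build.get('status')}. Stop committing new features "
                    "until it is green.",
        },
        timeout=10,
    )
    return 1


if __name__ == "__main__":
    sys.exit(pull_the_cord())
```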

Try to see the entire software ecosystem, including code in production, as a long production line, where problems at any point have a direct impact on every other point. Use Dev Ops principles such as monitoring and automation to make it easier to spot and fix issues at any point on the line.

Actions

  1. Try to quantify how much it costs to release a small bug fix into production in terms of the release process. Does the difficulty make developers reluctant to fix issues?
  2. Who in your organisation takes responsibility for resolving production issues? Are they the same people that wrote the code? What feedback mechanism do you use to improve the quality of delivered code?
  3. If operations staff in your company are overwhelmed with production support issues, and the developers are continuing to develop new features rather than resolving the issues, consider how an Andon Cord for your organisation might work. What changes in culture are required to allow the entire production system to switch into fix-mode to resolve issues?