The Checklist Manifesto is a 2010 book by Atul Gawande, an American surgeon and public health researcher. I read it a while ago and it reinforced my belief that we all make mistakes. Unsurprisingly, the book is all about checklists. That may not sound like much, but a checklist can be a great help in reducing mistakes. The book takes us through a number of real-world examples: Gawande talks to builders, pilots and doctors in order to learn how to make surgery safer. I do recommend reading it in full, but I’ll take you through it quickly.
Building a skyscraper
In the past, when you wanted to build something big, a cathedral say, you would hire a master builder. It would be one person’s responsibility to design, engineer and oversee the construction. Over time buildings have become larger and more complicated. Architecture and engineering split off from construction. Further splits followed until there were specialisations everywhere. All of these people have to work together, and the different systems they are responsible for have to fit together. There is too much for a single master builder to oversee, so the industry had to look for another way to coordinate.
The author is shown the construction schedule for one building. It lists, line by line and day by day, what needs to be done: the concrete floor to be done on this day, the steel to be delivered the next. Colour coding shows critical steps. Updates are made on a computer but printed out weekly, more often when it’s busy, and posted up on the wall so everyone can see them.
Not that everything always follows this plan. All buildings are expected to settle somewhat during construction. With the building in question this has happened faster than expected and unevenly, and water is pooling on one of the upper floors. So an update must be made. At a minimum, clean-up is required and the schedule has to be adjusted. More importantly, is it a serious construction defect and, if so, what is to be done about it? A new communication checklist is created specifying who has to talk to whom, about what, and by what date. While the initial plan is never going to be perfect, it can be reviewed and updated; indeed it is expected that it will be.
Flying a plane
Individual aircraft accidents attract a lot of attention. As a consequence a great deal of effort is put into flying safely. Despite any fears of flying, it really is one of the safest ways to travel. A typical airline pilot will train and then fly for years, perhaps for their entire career, without experiencing a major incident. Yet they still have to be able to react immediately and correctly if something goes wrong. That doesn’t just happen.
In the cockpit of each plane there is a handbook, which differs depending on the type of plane and perhaps the airline. The first few pages are a series of checklists detailing normal operations: before starting engines, before pulling away from the gate, before taxiing to the runway and so on. A pilot may have done these actions thousands of times. However, familiarity can lead to complacency. These checklists are a reminder of what must be done. The rest of the book is a great many checklists detailing non-normal operations: smoke in the cockpit, different warning lights, dead engine, disabled copilot, etc. A pilot may have experienced some of them and may have trained for some, and others will be new. With potentially hundreds of lives on the line they must be able to follow the correct procedure. These checklists are that procedure. Importantly, if there is an accident, there will be an investigation. An investigation will lead to a report which, if necessary, can lead to the checklists being updated.
Using central lines
A central line is a catheter placed into a large vein. It is frequently needed when treating critically ill patients to take blood or administer drugs without the need for repeated needle punctures. However the central line itself carries a risk of infection. To avoid infection when putting in central lines a doctor should:
- Wash their hands with soap.
- Clean the patient’s skin with chlorhexidine antiseptic.
- Put sterile drapes over the entire patient.
- Wear a mask, hat, sterile gown and gloves.
- Put a sterile dressing over the insertion site once the line is in.
Did that always get done? At Johns Hopkins Hospital the 10-day line-infection rate was 11%.
They created a checklist and authorised nurses to stop doctors if they were found to be skipping steps. Nurses were also to check with the doctor each day whether a line was still necessary, to minimise how long lines stayed in. With a year of monitoring the infection rate went down to effectively zero. There were only two infections during the entire period. Calculations suggest these measures and the enforcement of the checklist prevented 43 infections and 8 deaths, and saved 2 million dollars.
Software development
How does this apply to us? You could read the stories above and believe that checklists are like a magic wand that fixes all problems. Magic wands are fanciful, though, and that makes the whole idea easy to dismiss.
Building skyscrapers seems to have the most obvious parallel to building software. That story seems to suggest a waterfall methodology. I’m sure that can work if you know enough about what to expect from your systems. I think the physical building industry has an advantage there. Steel is steel, concrete is concrete and plumbing is plumbing. These things are known quantities. Actually there are all sorts of different types of each of these, but I don’t think that changes the point. Each one is a known quantity even if it is specific, such as A36 steel. In our industry you can go out and pick up, say, a database, but they’re all slightly different.
That said, I think this system is different from the waterfall model. Yes, there is up-front design, but the reaction to problems seems different. Constructing a building is a process that is in motion. Supplies have been ordered, builders employed and the skyscraper is already half built. You can’t just go back to the design phase and start again. Instead they have to cope with what they have and what is achievable. If a steel beam turns out to have been manufactured slightly wrong and not all of the bolts fit, what happens? In the book they end up welding the beams together instead. It’s a different fixing technique from the one they intended but sufficiently strong. Maybe this is what can happen in the waterfall model, but it’s not how it was taught.
The most interesting part of that section was the communication checklist. If something unexpected happens, who has to know, what discussions do they have to have and what sign-off is needed? This sort of thing certainly happens informally all the time. People find problems and then talk about them. However, without a formal process it is more likely that mistakes slip through. Someone isn’t consulted, or they are but their concerns don’t make their way on to the ticket. Perhaps formalising this ourselves could bring some benefit.
Flying a plane is perhaps the poster child for checklists. We might not think of it like this, but checklists have been standard within the industry for decades. I didn’t initially think there was much to take away from here. The most important thing these checklists did was ensure the correct response in emergency situations. That’s not normally how I work. Sometimes there’s a deadline tomorrow, but it doesn’t feel like the same thing.
It may be very relevant for different parts of our industry. If you discover a security vulnerability in your software, what should you do? There are probably a lot of things:
- Investigate the vulnerability: how easy it is to trigger and how much access it gives.
- See if there is any evidence of it being used in the wild.
- Inform customers of the vulnerability and ways to mitigate it.
- Control the release of information to prevent anyone from leveraging the vulnerability.
What order do these go in, and how do you manage the ones that are in conflict? I can imagine this happening in a piecemeal manner, but it would probably go better if there was a plan, a checklist to go through. There are other emergency situations like data breaches, critical loss of data or the discovery of malware in the system. The specific checklists for these are probably less relevant for developers and more for managers, support staff and specialists. However, the daily routines that go alongside them, like regular backups, strong authentication and good coding practices, are relevant for developers. The specifics of these routines can create problems. It’s easy to, say, require strong passwords that change often, but that could just lead people to write them on post-it notes and stick them to their monitors. These lists need to create useful behaviour, not another problem.
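The ordering question for the vulnerability steps above can be treated as a dependency problem. The following is only a sketch: the step names and, more importantly, the dependencies between them are my own assumptions for illustration, not something from the book or from any standard incident process. It uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Hypothetical steps and dependencies, for illustration only.
# Each key maps to the set of steps that must be finished before it starts.
DEPENDS_ON = {
    "investigate severity": set(),
    "check for use in the wild": {"investigate severity"},
    "prepare mitigation advice": {"investigate severity"},
    "inform customers": {"check for use in the wild", "prepare mitigation advice"},
    "publish details": {"inform customers"},
}


def ordered_steps(depends_on):
    """Return the steps in an order that respects every dependency."""
    return list(TopologicalSorter(depends_on).static_order())


if __name__ == "__main__":
    for number, step in enumerate(ordered_steps(DEPENDS_ON), start=1):
        print(f"{number}. {step}")
```

Writing the dependencies down forces the conflicts into the open. If two steps each claim to need the other first, `static_order()` raises a `CycleError`, which is a prompt for a conversation rather than something code can resolve.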
The lesson of central lines is the easiest. Students training to be doctors normally start with good qualifications and then spend a long time training: not just their degree but multiple years of on-the-job training. After all this training these people, or at least some of them, couldn’t put in a central line properly. That’s not to say that they weren’t skilled. It’s just that they are people and we can all make mistakes. Maybe they forgot one of the steps, didn’t think all of the steps were always important, or skimped sometimes. Maybe they just didn’t have easy access to all the tools. With a procedure to follow and enforcement of that procedure, infections essentially disappeared. I think that however good or skilled you are, procedures can make the result better.
Making checklists
Software engineering does have checklists. The one that springs to mind is for the Definition of Done. However there are a few lessons to learn about checklists:
- Your first version will be wrong.
- Too many items on a list is bad.
- Too few items on a list is bad.
- Leaving all the checks to the end can be too late; use stages or multiple lists.
- A checklist for one site may not be right at another site.
- It’s not just having a checklist, it’s following the checklist.
So don’t expect to sit down, write a list and be done with it. A checklist that is going to work will require experimentation, practice and revision.
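To make the "stages or multiple lists" point concrete, here is a minimal sketch of a Definition of Done split into stages, where work stops at the first stage that still has unchecked items. The `Stage` class, the `first_blocked_stage` helper and the example items are all hypothetical names I made up for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Stage:
    """One short checklist that must be complete before moving on."""
    name: str
    items: list[str]
    done: set[str] = field(default_factory=set)

    def check(self, item: str) -> None:
        # Refuse items that are not on this list, so typos are caught early.
        if item not in self.items:
            raise ValueError(f"{item!r} is not on the {self.name!r} list")
        self.done.add(item)

    def outstanding(self) -> list[str]:
        # Items still unchecked, in their original order.
        return [item for item in self.items if item not in self.done]


def first_blocked_stage(stages: list[Stage]) -> Stage | None:
    """Return the earliest stage with unchecked items, or None if all are done."""
    for stage in stages:
        if stage.outstanding():
            return stage
    return None


if __name__ == "__main__":
    # Hypothetical stages and items.
    stages = [
        Stage("before coding", ["acceptance criteria agreed"]),
        Stage("before review", ["unit tests written", "static analysis clean"]),
        Stage("before release", ["release notes updated"]),
    ]
    stages[0].check("acceptance criteria agreed")
    stages[1].check("unit tests written")
    blocked = first_blocked_stage(stages)
    print(blocked.name, blocked.outstanding())
```

Keeping each stage short also speaks to the "too many items" lesson: nobody has to read the release items while they are still getting ready for review.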
On balance
I liked the book and think some lessons can be teased out from it. However it’s not as simple as telling people to use checklists and leaving them to get on with it:
- Look at places where problems and mistakes occur and think about what steps might be possible to avoid them.
- Work with people so that checklists are useful rather than burdensome.
- Think about who has responsibility for different parts of the checklist.
- Take advantage of “automated checklists” when you can: static analysis tools and unit testing.
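As a rough illustration of an "automated checklist", the sketch below treats each checklist item as a small function that inspects some source text and returns whether it passes. The three checks are toy stand-ins I chose for the example; in practice a real static analysis tool or test suite would do this job far better.

```python
from __future__ import annotations

import re
from typing import Callable


# Hypothetical checks: each takes source text and returns True when it passes.
def no_todo_markers(source: str) -> bool:
    return re.search(r"\bTODO\b", source) is None


def ends_with_newline(source: str) -> bool:
    return source.endswith("\n")


def no_trailing_whitespace(source: str) -> bool:
    return all(line == line.rstrip() for line in source.splitlines())


# The checklist itself: a readable item name mapped to the function that checks it.
CHECKS: dict[str, Callable[[str], bool]] = {
    "no TODO markers": no_todo_markers,
    "ends with newline": ends_with_newline,
    "no trailing whitespace": no_trailing_whitespace,
}


def failed_checks(source: str, checks: dict[str, Callable[[str], bool]] = CHECKS) -> list[str]:
    """Return the names of the checklist items that the source fails."""
    return [name for name, check in checks.items() if not check(source)]


if __name__ == "__main__":
    sample = "x = 1  # TODO tidy up"
    for name in failed_checks(sample):
        print(f"FAILED: {name}")
```

The appeal is that nobody has to remember to run through these items by hand, which leaves the human checklist free for the things a tool can’t judge.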