About British Airways, Root Cause Analysis, and Flying Aeroflot

On Saturday, May 27th, The Guardian, together with other UK media outlets, reported that:

“British Airways cancelled all flights from Heathrow and Gatwick on Saturday due to a major IT failure causing severe disruption to its global operations…”

The Financial Times, in turn, called British Airways CEO Alex Cruz “the man who fell to earth”.

The man whose name was once put on a plane by loyal coworkers now finds his reputation, and his company’s, in tatters. The three-day British Airways (BA) IT outage left more than 75,000 travelers worldwide stranded. At the time of this writing, the initial estimate of damages was 150,000 British pounds.

Mr. Cruz somewhat incomprehensibly blamed the problem on a “power surge”.  The electrical power company servicing BA stated that no other client of theirs in the same geographic region experienced a power surge.

Technical observers parsing the news established the following fairly plausible scenario: after a loss of power, an operator’s error during power restoration led to an unexpected power surge; poorly executed datacenter failover processes then corrupted data in the spare datacenter and rendered it useless for recovery purposes.

TechCrunch reports that this is the sixth crash of the airline’s web site this year. In addition, in the period leading up to the outage, some travelers noted a deteriorating customer experience, as well as how long the system was taking to process travel registrations.

So who is at fault here? What can be learned? How can organizations better prepare for an outage like this? Can anything be done, or is this just a streak of bad luck?

Right away, as one would expect, the cuts to IT resources at British Airways and the offshore outsourcing were named as the culprits. The company predictably answered that one had nothing to do with the other. Instead, BA cited an error made during the power restoration procedure by an operator at the service that managed its electrical supply.

Anti-fragility guru Nassim Nicholas Taleb offered his favorite explanation, contrasting BA’s fragile IT systems with the robust, antifragile systems of Netflix.

Netflix uses its Chaos Monkey tool to selectively take down parts of its IT system in order to test the robustness and readiness of its failover and continuity processes.
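To make the idea concrete, here is a minimal sketch of a chaos-style failover test in Python. It is not Netflix’s Chaos Monkey and does not use its API; the `Service`, `FailoverRouter` and `chaos_round` names are hypothetical, invented for this illustration. The sketch only shows the principle: deliberately take down one randomly chosen replica and check that requests are still served.

```python
import random


class Service:
    """Hypothetical stand-in for one replica of a service."""

    def __init__(self, name):
        self.name = name
        self.healthy = True

    def handle(self, request):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} handled {request}"


class FailoverRouter:
    """Tries each replica in order and returns the first successful answer."""

    def __init__(self, replicas):
        self.replicas = replicas

    def handle(self, request):
        for replica in self.replicas:
            try:
                return replica.handle(request)
            except ConnectionError:
                continue  # fail over to the next replica
        raise RuntimeError("all replicas are down")


def chaos_round(router, rng):
    """Take down one random replica and report whether requests still succeed."""
    victim = rng.choice(router.replicas)
    victim.healthy = False
    try:
        router.handle("health-check")
        return victim.name, True
    except RuntimeError:
        return victim.name, False
    finally:
        victim.healthy = True  # restore the replica after the experiment


if __name__ == "__main__":
    rng = random.Random()
    router = FailoverRouter([Service("primary"), Service("standby")])
    for _ in range(5):
        victim, survived = chaos_round(router, rng)
        print(f"took down {victim}: {'survived' if survived else 'OUTAGE'}")
```

With two replicas every round should survive; with a single replica the same test reports an outage, which is exactly the kind of weakness such an exercise is meant to expose before a real failure does.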

We at GQP, LLC (http://www.gqpartners.com) can offer the following analysis based on the available information:

  • In fairness to British Airways: it is significantly more difficult to bring robustness, resilience and agility into an existing enterprise system like British Airways’, which is composed of many legacy applications alongside some new ones, than it is to operate a new, modern, well-designed system like that of Netflix. That said, British Airways surely could have adopted Chaos Monkey and similar good practices from companies such as Netflix. Clearly it has not.
  • Cutting costs and increasing the load on IT staff will lead to failures. Outsourcing can be of value when it is combined with more effective work processes and system design; in that case, it can bring significant efficiencies. Simply shifting jobs overseas is not transformation; it is labor market arbitrage. The two should not be confused. With fragile, unstable IT systems, it was a mistake on the part of British Airways management to engage in cost cutting without first creating effective, smart automation that leads to better productivity and job satisfaction. How motivated were the IT workers whose colleagues’ jobs had just been outsourced and who now had to do more work for the same pay? Not very, and it shows.
  • The fact that the company blames the situation on the human error of a single operator points to a glaring lack of automation. Automated, well-designed processes hold up much better under pressure. While not completely error-proof, they are better at following a procedure than humans are. Of course, such processes have to be tested first, hence Netflix’s use of Chaos Monkey, among other tools.
  • Lack of automation is also a reason that there are so many versions of the account of the outage, combined with rumors, uneducated and educated guesses, and outright hearsay about what actually happened. Had British Airways’ IT been properly logging the steps it executed, there would be journal logs, akin to a black box, to be used for recovery, for root cause analysis, and ultimately for making sure that there are no such outages in the future.
  • Finally, it does not seem likely that British Airways was properly monitoring its customers’ experience. Had it had modern customer experience and application performance monitoring tools and practices, it would have been warned of the coming trouble while it still had time to react. Instead it chose to use its own travelers as canaries in the coal mine. This turned out to be bad for everybody.
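The “black box” point above can be illustrated with a small Python sketch. It is an assumption-laden toy, not a description of any system British Airways runs: `StepJournal`, `run_procedure` and the step names are all hypothetical. The point is only that an automated procedure which journals every step leaves behind an unambiguous record of what ran, what failed, and what was never attempted.

```python
import json
import time


class StepJournal:
    """Append-only record of every step an automated procedure executes."""

    def __init__(self):
        self.entries = []

    def record(self, step, status, detail=""):
        self.entries.append(
            {"time": time.time(), "step": step, "status": status, "detail": detail}
        )

    def dump(self):
        """Return the journal as one JSON object per line."""
        return "\n".join(json.dumps(entry) for entry in self.entries)


def run_procedure(steps, journal):
    """Run (name, callable) steps in order, journaling each one.

    Stops at the first failing step and returns False; returns True
    only if every step succeeded.
    """
    for name, action in steps:
        journal.record(name, "started")
        try:
            action()
        except Exception as exc:
            journal.record(name, "failed", repr(exc))
            return False
        journal.record(name, "succeeded")
    return True


def failing_sync_check():
    # Hypothetical failure, used only to show what the journal captures.
    raise RuntimeError("standby is out of sync")


if __name__ == "__main__":
    journal = StepJournal()
    ok = run_procedure(
        [
            ("isolate_load", lambda: None),
            ("verify_standby_in_sync", failing_sync_check),
            ("switch_traffic_to_standby", lambda: None),
        ],
        journal,
    )
    print("completed" if ok else "aborted")
    print(journal.dump())
```

In this run the journal shows that the first step succeeded, the second failed with a recorded reason, and the third was never started. That is the kind of evidence a root cause analysis needs and that rumor cannot supply.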

In today’s connected economy, justice for neglecting one’s customers comes swiftly:

The Sunday Times reports that:

“British Airways is expected to be downgraded to a three-star rating because of declining standards — leaving it with a worse rating than Russia’s Aeroflot.”

“Demotion to a three-star airline would put BA, which once boasted the slogan ‘the world’s favorite airline’, on a par with Uzbekistan Airways, Ethiopian Airlines, Myanmar Airways and Ryanair.”

Has British Airways hit bottom? Time will tell. The rest of us should learn from this experience.