Facebook, WhatsApp, AWS: How to prevent the software outages that threaten the services we rely on
The internet seemed to stand still when Facebook, Instagram, and WhatsApp, and separately Amazon Web Services, went down for hours. To American users of those services, that may have seemed like a minor inconvenience. Elsewhere, it crippled essential communication for billions of people across the world.
Commercial software is the engine that powers many parts of our lives. Recent outages show how close that engine comes to causing real harm when it shuts down or malfunctions.
With a user base of around two billion, WhatsApp’s disappearance meant a loss of messaging between family members and businesses around the world, and broken in-app payments in countries that have effectively come to rely exclusively on the messaging service.
The ease of use and overall low costs of these consumer software services have led governments, schools, NGOs, non-profits, and many more organizations to weave them into their daily operations. This adoption further drives up the services’ engagement and user base metrics, so the companies behind them actively encourage it. In July this year, WhatsApp promoted an event to support 100 NGOs in India (where it has some 400 million users) to enable their missions to “provide vulnerable, disempowered, and minority communities with easy and secure access to vital information and support.”
Even as more core services weave these products into daily operations, software outages aren’t going to magically stop. What companies are building now far exceeds the ability of any single person, team, or organization to mentally model how these systems are built, much less how they function under unanticipated pressure or otherwise unexpected conditions. On top of that, user and organizational demands are only accelerating the pace and complexity of software development.
Facebook (now Meta) was surprisingly transparent in its post-incident report on what happened. However, organizations (Facebook included) rarely publish detailed, open analyses of their outages. The desire to save face and move on quickly is driven in large part by competitive, legal, and financial concerns, paired with a desire to avoid bad publicity. Those concerns shouldn’t stop companies from publishing such analyses. The technology industry has an immense body of commoditized, siloed knowledge that we can share to learn from each other and push software safety forward.
The airline industry’s approach
In the late 1990s, the U.S. airline industry had a treacherous safety record. Accidents and fatalities were at an all-time high, and carriers were faced with the challenges that come with scale. If their accident rates remained the same while global air traffic continued to grow at projected rates, there would be on average one major jet crash per week by 2015.
They avoided this outcome because a group of airline executives, unionized pilots, and federal regulators banded together to voluntarily share data about aviation incidents and accidents across carriers—despite their own legal, competitive, and financial concerns. Eventually, mechanics, air traffic controllers, and the Federal Aviation Administration (FAA) came on board as well. The FAA noted that the “key to this approach is a longstanding commitment to sharing data through an open and collaborative safety culture to detect risks and address problems before accidents occur.”
It’s time for the technology industry to follow suit. The good news? We don’t have to start from scratch. In 2019, Nora Jones founded the Learning From Incidents (LFI) community. Comprising several hundred software practitioners (with experience ranging from behemoths like Facebook, Amazon, and Netflix to smaller-scale organizations and academic institutions), technology leaders, and researchers, LFI stands to reshape how the software industry thinks about incidents, software reliability, and the critical role people play in keeping these systems running.
Along with investing in incident analysis and publishing what they’ve learned, organizations can share their incident reports in the VOID, a database of public incident reports created to give the industry a place to share and learn from these incidents.
These outages have revealed the scale and power software-based services now have in our lives. Without a concerted effort to share information as an industry, the outages are only going to get worse.
This article was originally published in Fortune on December 16, 2021.
Courtney Nash is a researcher focused on system safety and failures in complex sociotechnical systems. An erstwhile cognitive neuroscientist, she has always been fascinated by how people learn, and the ways memory influences how they solve problems. Over the past two decades, she’s held a variety of editorial, program management, research, and management roles at Holloway, Fastly, O’Reilly Media, Microsoft, and Amazon. She lives in the mountains where she skis, rides bikes, and herds dogs and kids.