In the early hours of April 10, 2014, a series of errors led to a massive, multi-state outage in the emergency call management centers (ECMCs) that handle 911 calls in seven geographically dispersed states. The incident originated at an obscure but critical call routing hub in Englewood, Colo., and ended up knocking out the emergency communication infrastructure for more than 11 million people.
The outages affected the entire state of Washington. The routing software at the center of the incident is deployed and managed by a Colorado emergency communication service provider named Intrado. Intrado’s own report noted that 6,410 calls were made to emergency call centers in Washington, North Carolina and Minnesota during the outage. Of those calls, 5,618 failed and only 792 succeeded, a failure rate of roughly 87 percent. The FCC found that approximately 1,000 additional calls failed in California, Florida, Pennsylvania, and South Carolina.
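As a quick check on the arithmetic, the failure rate can be recomputed from the call counts Intrado reported. The sketch below is illustrative only: the variable names are invented for the example, and it assumes the 87 percent figure refers to Intrado's three-state count.

```python
# Call counts from Intrado's report (Washington, North Carolina, Minnesota).
failed_calls = 5618
completed_calls = 792

# Total calls placed during the outage window in those three states.
total_calls = failed_calls + completed_calls

# Fraction of calls that never reached a call center.
failure_rate = failed_calls / total_calls

print(f"Total calls: {total_calls}")
print(f"Failure rate: {failure_rate:.1%}")
```

The roughly 1,000 additional failures the FCC identified in the other four states are left out here, because no matching count of completed calls is given for them.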
The hub software deployed at the Englewood facility is designed to keep track of the trunk assignment – a method of resource sharing widely used in telecom systems – for 911 calls assigned to numerous public safety answering points (PSAPs) across the country. Some part of the broader system’s architecture leading into the Englewood facility relied on a centralized, legacy trunking system that failed catastrophically.
Once this failure occurred, the software stopped generating trunk assignments. Calls no longer reached the Englewood facility and, therefore, were no longer routed to the appropriate local call centers.
While the Federal Communications Commission officially attributed the outages to a “preventable coding error” in system software, its 30-page analysis of the incident paints a bleaker picture. Specifically, the coding error at the Englewood Intrado ECMC merely stopped non-IP-enabled trunk assignments, preventing calls routed there from reaching the appropriate PSAPs.
The reality, it seems, is that the failure was caused as much by aging emergency call management infrastructure as it was by any coding error. The situation was compounded by at least two instances of human and communication error.
The outage was prolonged in part by an insufficient alarm management system. Several smaller alarms were triggered, but these were not enough to attract the attention of Intrado employees. Had a broader alarm system been in place, the outage likely would not have lasted nearly as long. In addition, there was a clear lack of communication between Intrado and some of its third-party contractors.
During the outage, Intrado was also having issues with call centers in Oregon. The Oregon outage ultimately proved unrelated to the Washington outage, but for several hours early on April 10, Intrado and CenturyLink worked under the mistaken impression that the two were related. As Intrado noted in its publicly filed reply comments, this diverted its attention from the true cause of the multi-state outage.
Intrado operates a redundant hub in Miami, which is designed to take on rerouted 911 traffic in situations like this one. Had the company been aware of what was going on, it could have automatically or manually rerouted all calls to that center, avoiding the outages altogether. Once the company did realize what was happening, some six hours into the incident, it began routing traffic to Miami, which resolved the problem immediately.
In all, some 11,313,787 people were potentially affected across 81 PSAPs in 60 counties in (from most impacted to least) Washington, Minnesota, Pennsylvania, Florida, South Carolina, North Carolina and California. In other words, the outage affected 3.5 percent of the U.S. population.
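The 3.5 percent figure can be sanity-checked with a short calculation. The U.S. population value below is an assumption not stated in the source: roughly 318.9 million, in line with the Census Bureau's mid-2014 estimate.

```python
# Population potentially affected, as reported above.
affected_population = 11_313_787

# ASSUMPTION: approximate U.S. population in mid-2014 (not from the source).
us_population_2014 = 318_900_000

# Share of the national population served by the affected PSAPs.
share = affected_population / us_population_2014

print(f"Share of U.S. population affected: {share:.1%}")
```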
There were no deaths as a result. The few calls that were completed are believed to have been automatically routed through the Miami ECMC.
In addition to Intrado, which is at the center of the incident, a number of other emergency communication service providers were affected and filed Network Outage Reporting System (NORS) reports as well, including CenturyLink, Comcast, TeleCommunication Systems, Inc., Verizon Business, AT&T Wireless (Cingular) and Frontier.
The enormous breadth and geographic dispersion of the outage, the FCC says, was in part attributable to an architecture that consolidated critical 911 functions in two locations serving multiple states, without adequate safeguards in place.
“While this consolidation lowered the cost of 911 operations for the [local exchange carrier], the outage clearly showed that consolidation can result in too much dependence on a few critical elements if providers do not ensure the effective operation of adequate diversity and redundancy in the design and execution of the network,” the FCC wrote.
“The April 2014 multistate outage was far more than a simple software error on an otherwise uneventful spring evening in Englewood, Colorado. It was a vivid example of the vulnerabilities that IP-supported architectures may present, without sufficient network safeguards and clear lines of accountability,” the FCC continued. “The issues raised in the outage go to the heart of providing reliable 911 service. Regardless of what party implements a particular component of 911 service, there must be network reliability and clear accountability from call placement to call completion.
“As the Nation transitions to new methods of communications, we need to take care to ensure that our inherent trust in the 911 system does not get lost in that transition.”