In response to the 24-hour outage that hit its Internet voice and video chat platform last week, Skype’s Chief Information Officer has revealed that an error in some versions of the company’s software client was to blame.
CIO Lars Rabbe, writing on Skype’s ‘The Big Blog,’ confirmed reports of the outages and posted a detailed explanation of what went wrong. He said the company was taking steps to prevent further outages.
On December 22, a cluster of servers responsible for offline instant messaging at Skype became overloaded. As a result, some Skype clients for Windows running version 5.0.0.152 received delayed responses from the overloaded servers, failed to process them properly, and crashed. These crashes affected an estimated 20 percent of all Skype users. When those users restarted their clients, the resulting flood of reconnection traffic quickly overwhelmed the supernodes at the heart of the company’s infrastructure.
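Rabbe’s post describes the trigger but not the code, so the Python sketch below is purely hypothetical (none of the names or message formats come from Skype’s client). It illustrates the general failure mode: a handler that assumes every server reply is timely and well formed crashes on a delayed, truncated response, while a defensive handler simply drops it and carries on.

```python
import json
from typing import Optional

def fragile_handle(raw_response: bytes) -> str:
    # Assumes the reply is always complete, well-formed JSON with a "body" field.
    msg = json.loads(raw_response)   # raises json.JSONDecodeError on truncated data
    return msg["body"]               # raises KeyError if the field is missing

def robust_handle(raw_response: bytes) -> Optional[str]:
    # Defensive version: a malformed or unexpected reply is dropped instead of
    # taking the whole client down.
    try:
        msg = json.loads(raw_response)
    except (ValueError, UnicodeDecodeError):
        return None                  # ignore the bad reply; the client keeps running
    return msg.get("body") if isinstance(msg, dict) else None

if __name__ == "__main__":
    delayed_partial = b'{"body": "hel'        # truncated reply from an overloaded server
    print(robust_handle(delayed_partial))     # -> None; the client survives
    # fragile_handle(delayed_partial)         # -> json.JSONDecodeError; the client "crashes"
```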
Skype runs on a P2P network in which supernodes act as a directory, supporting Skype clients, establishing connections between them, and forming local node clusters. Supernodes are themselves ordinary clients promoted to the role, so the crashing machines took the supernodes they were hosting down with them: although only about 20 percent of clients failed, the failure produced a 25 percent reduction in overall supernode resources, which placed too heavy a burden on the remaining supernodes.
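A back-of-the-envelope calculation shows why that burden grew so quickly. The sketch below uses only the article’s 25 percent figure; the assumption that traffic spreads evenly over the surviving supernodes is ours, not a description of Skype’s load balancing.

```python
def surviving_load_multiplier(fraction_lost: float) -> float:
    """Relative load on each surviving supernode, assuming the same total demand
    is spread evenly over whichever supernodes remain."""
    return 1.0 / (1.0 - fraction_lost)

print(surviving_load_multiplier(0.25))   # ~1.33x normal load per surviving supernode
```

And that is before counting the extra reconnection traffic from the roughly 20 percent of clients that restarted.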
While Skype plans for failures of this sort, its system could not withstand the increased load created as users restarted their clients and attempted to reconnect. Rabbe believes the increased load triggered a failsafe feature on the Skype supernodes, causing them to shut down. That, in turn, heaped still more traffic on the few remaining supernodes, setting off a domino effect that led to the 24-hour outage.
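That domino effect can be captured in a toy model: each supernode trips its failsafe when its share of the total load exceeds a safety threshold, and its share is dumped onto the survivors. The thresholds and load figures below are illustrative assumptions, not Skype’s actual failsafe parameters.

```python
def simulate_cascade(total_load: float, supernodes: int, capacity_per_node: float) -> int:
    """Return how many supernodes survive once the overload cascade settles."""
    while supernodes > 0:
        load_per_node = total_load / supernodes
        if load_per_node <= capacity_per_node:
            break                # the survivors can absorb the load; the cascade stops
        supernodes -= 1          # an overloaded node trips its failsafe and drops out
    return supernodes

# Assumed healthy state: 100 units of load spread over 100 supernodes, each safe up to 1.2.
print(simulate_cascade(total_load=100, supernodes=100, capacity_per_node=1.2))   # 100 survive
# After 25 percent of supernodes vanish and reconnect traffic adds roughly 60 percent more load:
print(simulate_cascade(total_load=160, supernodes=75, capacity_per_node=1.2))    # collapses to 0
```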
To fix the problem, Skype introduced hundreds of instances of its own software into the P2P network to act as supernodes and provide the capacity needed to accelerate recovery. The process was repeated until the system was completely restored on December 24.
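The recovery follows the same toy model in reverse: add supernode capacity in repeated waves until each node’s share of the load drops back under the safety threshold. Again, the batch size and load figures are assumed for illustration and are not Skype’s.

```python
def supernodes_needed(total_load: float, capacity_per_node: float, batch_size: int = 100) -> int:
    """Add supernodes in batches until each node's share of the load is safe again."""
    supernodes = 0
    while supernodes == 0 or total_load / supernodes > capacity_per_node:
        supernodes += batch_size   # each wave of new dedicated supernodes adds capacity
    return supernodes

print(supernodes_needed(total_load=160, capacity_per_node=1.2))   # 200 nodes bring load under 1.2
```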
Skype is working to prevent future outages like last week’s by bolstering its automatic update system with more frequent hotfixes, researching ways to detect problems more promptly and recover systems more quickly, reviewing its bug testing processes, and continually examining its capacity and increasing resiliency where necessary.