December 29, 2010, 1:53PM

Skype: Client Error Causes Global Outage

In response to a 24-hour outage last week on its internet voice and video chat platform, Skype's Chief Information Officer has revealed that an error in some versions of the company's software client was to blame.

CIO Lars Rabbe, writing on Skype’s ‘The Big Blog,’ confirmed reports of outages last week and posted a detailed explanation of what went wrong. He said the company was taking steps to prevent further outages.

On December 22, a cluster of servers responsible for offline instant messaging at Skype became overloaded. As a result, some Skype clients running version 5.0.0.152 on Windows received delayed responses from the overloaded servers. Those clients did not process the delayed responses properly and crashed. The crashes affected an estimated 20 percent of all Skype users. Those users then restarted their clients, causing a new flood of traffic to the supernodes that quickly overwhelmed the company's infrastructure.

Skype runs on a peer-to-peer (P2P) network in which supernodes act as a directory: they support Skype clients, establish connections between clients, and create local node clusters. Because supernodes are themselves Skype clients, the crashes hit them too. Although only 20 percent of clients failed, the failure caused a 25 percent reduction in overall supernode resources, which placed too heavy a burden on the remaining supernodes.

While Skype plans for failures of this sort, its system could not withstand the increased load brought on by users restarting their clients as they attempted to reconnect. Rabbe believes the increased load triggered a failsafe feature on the Skype supernodes, causing them to shut down. That, in turn, heaped more traffic on the few remaining supernodes, causing a domino effect that led to the 24-hour outage.

To fix the problem, Skype introduced hundreds of instances of Skype software into the P2P network to act as supernodes and provide the capacity to accelerate the recovery. This process was repeated until the system was completely restored on December 24.
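
The domino effect and the fix described above can be illustrated with a toy model. The sketch below is not Skype's actual failsafe logic: the function name, the capacity figures, the load of 80 units, and the rule that traffic is split evenly across live supernodes are all assumptions made for illustration. Each round, any supernode whose share of the traffic exceeds its capacity shuts down, and the survivors absorb its share.

```python
def cascade_history(capacities, total_load):
    """Return the count of live supernodes after each round of a toy cascade.

    Assumed rule (hypothetical, not Skype's real failsafe): load is split
    evenly across live nodes, and any node whose share exceeds its capacity
    shuts down at the end of the round.
    """
    alive = list(capacities)
    history = [len(alive)]
    while alive:
        share = total_load / len(alive)
        survivors = [c for c in alive if c >= share]
        if len(survivors) == len(alive):
            break  # every node can carry its share, so the network is stable
        alive = survivors
        history.append(len(alive))
    return history


# Made-up capacities for eight supernodes carrying 80 units of load.
healthy = [12, 14, 15, 17, 18, 21, 28, 41]
print(cascade_history(healthy, 80))    # [8]: stable at 10 units per node

# Lose 25 percent of the supernodes (two of eight) with the load unchanged.
degraded = [12, 14, 17, 21, 28, 41]
print(cascade_history(degraded, 80))   # [6, 5, 4, 3, 2, 1, 0]: full collapse

# Add four extra supernodes of capacity 20 each to the degraded network.
reinforced = degraded + [20, 20, 20, 20]
print(cascade_history(reinforced, 80))  # [10]: stable at 8 units per node
```

In this toy model, removing a quarter of the nodes is enough to push the weakest survivor over its limit, and each shutdown raises the share carried by the rest until none remain. Adding nodes lowers the per-node share below every node's capacity, which is the same idea as introducing extra supernode instances to speed recovery.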

Skype is working to prevent future outages like the one that occurred last week by bolstering its automatic update system with more frequent hotfixes, researching ways to detect problems more promptly and recover systems more quickly, reviewing bug testing processes, and continually examining its capacity and increasing resiliency when necessary.


Copyright © 2012 threatpost.com