Post-mortem for disruptions on 15th November 2016

Summary

At 08:12, power was lost to both the primary and secondary utility feeds to our data centre in Reading. Power was maintained by the A and B UPS feeds without service interruption, and the backup generators began starting automatically. They started successfully, but when they took the full building load they experienced an “excitation” fault which triggered an emergency shutdown. The data centre staff made repeated attempts to restart the generators, but these were unsuccessful. The UPS batteries were exhausted at 08:32, resulting in complete power loss at the data centre.

The emergency generator engineer was called to the site and was able to modify the configuration parameters of the generators, enabling a successful start-up. This restored power to the data centre at 08:50. However, there were further power events, the last of which was resolved at 10:04.

Impact

During the power outage

  • All servers and services hosted at the Reading data centre were offline during each period of power loss.
  • A subset of our IP addresses, allocated to servers in our Dunsfold data centre, experienced intermittent routing issues. This was caused by a misconfiguration in the continuity settings of our cross-site BGP routing, which allows IPs to be portable between data centres (a sketch of this arrangement follows this list).
  • The memset.com website was unavailable intermittently due to the issue above.
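
The post does not detail the cross-site routing design, so the following is a minimal sketch of the usual pattern for portable prefixes: each prefix is announced from both sites, with a higher BGP local-preference at its home site so traffic only moves when that site withdraws its routes. All names, prefixes and preference values here are hypothetical, and the toy model only mimics the path selection that would really happen in the BGP layer.

    # Toy model (hypothetical names/values): portable-prefix failover.
    READING, DUNSFOLD = "reading", "dunsfold"

    def announcements(prefix, home, sites=(READING, DUNSFOLD)):
        # Each site announces the prefix; the home site wins via local-pref.
        return {site: {"prefix": prefix, "local_pref": 200 if site == home else 100}
                for site in sites}

    def best_path(routes, up):
        # BGP-style selection: highest local-pref among sites still announcing.
        live = {s: r for s, r in routes.items() if s in up}
        if not live:
            return None  # prefix unreachable, as customers saw intermittently
        return max(live, key=lambda s: live[s]["local_pref"])

    routes = announcements("192.0.2.0/24", home=READING)
    print(best_path(routes, up={READING, DUNSFOLD}))  # "reading" in normal operation
    print(best_path(routes, up={DUNSFOLD}))           # fails over to "dunsfold"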

Following restoration of power

  • Power fluctuations during the incident damaged part of our power infrastructure, which had to be replaced.
  • Power fluctuations also caused some localised file system and hardware failures on virtual server hosts and customer dedicated servers.
  • RAID arrays on several of our Miniserver host servers required a resync. This reduced the IO available to the virtual servers hosted on them, which in turn may have impacted customer Miniserver performance (see the sketch after this list). This did not affect our next generation of Miniserver hosts.
  • The unclean power shutdowns and the subsequent power fluctuations caused file system corruption on a small number of virtual and dedicated servers.
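
The post does not say which RAID implementation the affected hosts use. As an illustration of one common mitigation, Linux software RAID (md) exposes sysctls that cap resync throughput so a rebuild does not starve guest IO; a minimal sketch, assuming md RAID and root access, with illustrative limits:

    # Illustrative only: cap Linux md resync throughput (KB/s per device)
    # so a rebuilding array leaves IO headroom for the guests it hosts.
    MIN_KBPS = 1000    # rate the kernel will maintain even under load
    MAX_KBPS = 20000   # ceiling while customer workloads are running

    def set_resync_limits(min_kbps, max_kbps):
        with open("/proc/sys/dev/raid/speed_limit_min", "w") as f:
            f.write(str(min_kbps))
        with open("/proc/sys/dev/raid/speed_limit_max", "w") as f:
            f.write(str(max_kbps))

    set_resync_limits(MIN_KBPS, MAX_KBPS)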

Recovery and Remedial Actions

A major incident was declared immediately and all available Memset staff were tasked with dealing with the incident. This included: replacing failed infrastructure and server hardware at the data centre from our supply of hot spares, fixing customer file systems, and making additional staff available for customer communications.
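
The post does not describe the file system repairs in detail. For unclean shutdowns on ext-family file systems, the standard tool is fsck run against the unmounted device; a minimal sketch, with a hypothetical device name:

    # Illustrative only: repair a corrupted guest file system offline.
    # fsck must run on an unmounted device; -f forces a full check and
    # -y auto-answers "yes" to repair prompts.
    import subprocess

    DEVICE = "/dev/mapper/guest-root"  # hypothetical volume name

    subprocess.run(["fsck", "-f", "-y", DEVICE], check=True)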

We already maintain an internal audit programme under which we undertake regular reviews of our suppliers’ continuity programmes, specifically including reviews of all generator, UPS and environmental controls supporting the data centre. Our supplier has provided immediate, additional records, which have been validated by third parties, and has given assurance that it will continue to provide evidence of its testing and maintenance regimes.

The configuration change applied to the generators during the restoration of power fixed the immediate “excitation” fault. This will be validated over the next week and tested using specialist equipment. We are confident that, once this is completed, the problem will not recur.

Our network engineers took immediate action to mitigate any disruption from further power interruptions. A new routing policy has now been deployed which fully addresses the issues identified in this incident.

To mitigate performance degradation due to RAID array syncing, we had already enabled a new feature on our next generation host servers which greatly decreases the amount of data transfer required for a resync. In light of this incident, we have decided to roll this feature out to our previous generation hosts as well; this will not cause additional downtime.
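
The post does not name the feature. One mechanism with exactly this effect on Linux md arrays is the write-intent bitmap, which records dirtied regions so a resync copies only the blocks changed while the array was degraded; assuming md RAID, a bitmap can be added to a live array with the stock mdadm tool:

    # Illustrative only: enable an internal write-intent bitmap on every
    # active md array, online. With a bitmap, a resync revisits only the
    # regions written while the array was degraded, not the whole disk.
    import re
    import subprocess

    def md_arrays():
        # Parse active array names (md0, md1, ...) out of /proc/mdstat.
        with open("/proc/mdstat") as f:
            return re.findall(r"^(md\d+) : active", f.read(), re.MULTILINE)

    for name in md_arrays():
        # "mdadm --grow ... --bitmap=internal" adds the bitmap without downtime.
        subprocess.run(["mdadm", "--grow", f"/dev/{name}", "--bitmap=internal"],
                       check=True)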

We have replaced all faulty power infrastructure and failed hardware. We are confident that this has remediated the immediate issues and weaknesses which our review attributed to the power outage and subsequent fluctuations.

Conclusion

The root cause of this incident was the failure of the utility power supply. The subsequent failure of the generators to supply load resulted in the power outage.

This incident has brought to light several issues, which we have mitigated or are in the process of mitigating.

Based on our investigations, internal incident review and supplier evidence, and subject to further validation, we do not believe this problem will recur.

We wholeheartedly apologise to all our customers who were affected by this disruption and would like to take this opportunity to thank them for their patience.

Posted November 2016 by Annalisa O'Rourke, in General