At 08:12, power was lost to both the primary and secondary utility feeds to our data centre in Reading. Power was maintained by the A and B UPS feeds without service interruption, and the backup generators began starting automatically. They started successfully, but when they took the full building load they experienced an “excitation” fault which triggered an emergency shutdown. The data centre staff made repeated attempts to restart the generators, but these were unsuccessful. As a result, the UPS batteries were exhausted at 08:32, causing complete power loss at the data centre.
The emergency generator engineer was called to the site and was able to modify the configuration parameters of the generators, enabling a successful start-up. This restored power to the data centre at 08:50. However, there were further power events, the last of which was resolved at 10:04.
A major incident was declared immediately and all available Memset staff were tasked with dealing with the incident. This included: replacing failed infrastructure and server hardware at the data centre from our supply of hot spares, repairing customer file systems, and making additional staff available for customer communications.
We already maintain an internal audit programme under which we undertake regular reviews of our suppliers’ continuity programmes. This specifically incorporates reviews of all generator, UPS and environmental controls supporting the data centre. Our supplier has immediately provided additional records, which have been validated by third parties, and has given assurance that it will continue to provide evidence of its testing and maintenance regimes.
The configuration change applied to the generators during the restoration of power fixed the immediate “excitation” fault. This will be validated over the next week and tested using specialist equipment. We are confident that, once this validation is complete, the problem will not recur.
Our network engineers took immediate action to mitigate any disruption from further power interruptions. A new routing policy has now been deployed which fully addresses the issues identified during this incident.
To mitigate performance degradation caused by RAID array resyncing, we had already enabled a new feature on our next-generation host servers which greatly decreases the amount of data transfer required for a resync. In light of this incident, we have decided to roll this feature out to our previous-generation hosts as well. This will not cause additional downtime.
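The report does not name the resync feature, but on Linux software RAID (md) one common mechanism with this effect is a write-intent bitmap, which limits a post-outage resync to recently written regions rather than the whole array. The sketch below is a hypothetical illustration of that technique, not necessarily the feature deployed; the device name /dev/md0 is a placeholder.

```shell
# Hypothetical example: add a write-intent bitmap to an existing
# Linux md array so that an unclean shutdown triggers only a
# partial resync of dirty regions, not a full-array rebuild.
# /dev/md0 is a placeholder device name; requires root.
mdadm --grow --bitmap=internal /dev/md0

# Confirm the bitmap is active on the array.
mdadm --detail /dev/md0 | grep -i bitmap
```

Enabling the bitmap is an online operation on the array metadata, which is consistent with the report's statement that the rollout causes no additional downtime.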
We have replaced all faulty power infrastructure and failed hardware. We are confident that this has remediated the immediate issues and weaknesses that our review attributes to the power outage and subsequent fluctuations.
The root cause of this incident was the failure of the utility power supply. The subsequent failure of the generators to supply load resulted in the power outage.
This has brought to light several issues which we have mitigated or are in the process of mitigating.
Based on our investigations, internal incident review and supplier evidence, and subject to further validation, we do not believe this problem will recur.
We wholeheartedly apologise to all our customers who were affected by this disruption and would like to take this opportunity to thank them for their patience.