At 6 am on 6th May 2018 we experienced a layer 2 broadcast storm: broadcast packets racing round and round the switch network, multiplying until they overwhelm all other traffic.
This severely degraded routing to all servers in the Reading Data Centre, causing packet loss or complete connectivity failure.
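To illustrate the mechanism described above, here is a minimal, hedged simulation (not a model of Memset's actual network): each switch floods a broadcast frame out of every port except the one it arrived on, so any loop keeps frames alive and redundant links multiply them.

```python
# Hedged sketch of broadcast-storm growth. Switch names and topologies
# are illustrative only; real switches flood unknown/broadcast frames
# out of every port except the ingress port.

from collections import deque

def simulate(links, start, steps):
    """links maps a switch to its neighbours; returns frames in flight per step."""
    frames = deque([(start, None)])          # (current switch, switch it came from)
    counts = []
    for _ in range(steps):
        next_frames = deque()
        for switch, came_from in frames:
            for neighbour in links[switch]:
                if neighbour != came_from:   # flood every port except the ingress
                    next_frames.append((neighbour, switch))
        frames = next_frames
        counts.append(len(frames))
    return counts

# Four fully meshed switches (many redundant loops) versus a loop-free star:
mesh = {"A": ["B", "C", "D"], "B": ["A", "C", "D"],
        "C": ["A", "B", "D"], "D": ["A", "B", "C"]}
star = {"A": ["B", "C", "D"], "B": ["A"], "C": ["A"], "D": ["A"]}

print(simulate(mesh, "A", 5))   # frames double every step: [3, 6, 12, 24, 48]
print(simulate(star, "A", 5))   # frames die at the edge:    [3, 0, 0, 0, 0]
```

In the looped topology the frame count grows exponentially, which is why a storm can overwhelm all other traffic within seconds; in the loop-free topology the same broadcast dies out after one hop.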
Recovery and remedial actions
To stop the broadcast storm we powered down half of our switch network, removing any possible network loops. Because our switch network is redundant, powering down half of it reduced the storm's ability to propagate. The storm largely subsided at this point and connectivity was mostly restored by around 10:30. As parts of the network were still not responding, we continued to reboot and reconfigure switches to fix them.
We carried on rebooting switches and reconnecting them to the network in order to restore redundancy and clear the last remnants of the broadcast storm. We did not want to leave the network in an unstable state, so we continued with our emergency maintenance until the network was back to its original, fully redundant state. This work caused some short, intermittent gaps in connectivity (around 5 minutes) for some customers.
By 17:40 we had rebooted all the switches and reconnected the entire network, at which point service returned to normal, although a few customers still had issues which we continued to work on.
We believe that the broadcast storm was caused by a software crash in one of our switches. However, once a broadcast storm has started it is very difficult to identify which bit of equipment caused it. We are continuing to examine switch logs as a matter of urgency to try to track down the exact cause.
We believe the root cause of this problem was malfunctioning switch software; it would never have happened had the switch software (spanning tree) been operating correctly. We are urgently investigating mitigations to prevent this from happening again.
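Spanning tree's job, which the failed switch was unable to perform, is to logically block redundant links until only a loop-free tree remains. The following is a simplified sketch of that idea using a breadth-first search; real spanning tree (IEEE 802.1D and its successors) elects a root bridge and exchanges BPDUs rather than running BFS, and the topology here is illustrative only.

```python
# Simplified illustration of loop prevention: keep only the links that
# first reach each switch, leaving redundant links logically blocked.
# This is a conceptual stand-in for spanning tree, not the protocol itself.

def spanning_tree(links, root):
    """BFS from the root; return the loop-free subset of links."""
    tree = {switch: [] for switch in links}
    seen, queue = {root}, [root]
    while queue:
        node = queue.pop(0)
        for neighbour in links[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                tree[node].append(neighbour)   # forwarding link
                tree[neighbour].append(node)   # links are bidirectional
                queue.append(neighbour)        # other redundant links stay blocked
    return tree

# Four fully meshed switches contain six links and many loops:
mesh = {"A": ["B", "C", "D"], "B": ["A", "C", "D"],
        "C": ["A", "B", "D"], "D": ["A", "B", "C"]}
tree = spanning_tree(mesh, "A")

# A tree over 4 switches has exactly 3 links, hence no loops.
links_in_tree = sum(len(neighbours) for neighbours in tree.values()) // 2
print(links_in_tree)   # 3
```

With the redundant links blocked, a flooded frame can never return to a switch it has already visited, which is precisely the guarantee that was lost during this incident.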
A full incident review will begin on Tuesday when normal working hours resume. As soon as we have more information on this incident we will publish a full RCA. Once our investigation is complete, we will contact our customers with details of the findings and the mitigations that have been applied.
UPDATE: INCIDENT 6TH MAY 2018
Our network team have been investigating the cause of the outage on Sunday 6th May. We have identified that a broadcast storm originated from a group of switches that form part of our external infrastructure network. We believe one of these switches entered into a failure mode which affected its ability to perform network loop prevention, resulting in the broadcast storm. Unfortunately, due to the nature of the event, we are unable to ascertain the precise cause of the failure as the logging from these switches was disrupted.
We have also been able to identify how the broadcast storm was able to affect our core network in the Reading datacentre. By propagating into the core network, the storm caused far more disruption than if it had been confined to our external infrastructure network.
Reduced logging and monitoring, together with difficulty accessing remote management on some parts of the network, increased the time it took us to resolve the incident.
While the Cross Site Private Network was affected, it did not propagate the broadcast storm to Dunsfold, and we can confirm that the Dunsfold network was not affected in any way.
At our Dunsfold data centre we have already split our network into separate layer-2 broadcast domains and the same network separation was planned for this quarter in Reading. During the incident on Sunday we took the opportunity to effect this change, which stopped the storm within the core network. This change will massively reduce the impact should a similar incident occur by preventing any broadcast storms propagating to or from the core network.
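The containment property described above can be sketched as follows: broadcasts flood freely within a layer-2 domain, but a layer-3 (routed) boundary does not forward them, so a storm in one domain never reaches the other. The switch names and topology below are illustrative assumptions, not Memset's actual layout.

```python
# Hedged sketch of layer-2 broadcast domain separation: flooding stops
# at the edge of a domain because routers drop broadcast frames rather
# than forwarding them. Names and topology are hypothetical.

def flood(domain, start):
    """Return every switch a broadcast reaches within one layer-2 domain."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(domain.get(node, []))
    return seen

# Two layer-2 domains joined only by a router, not by a switch link:
external_domain = {"ext1": ["ext2"], "ext2": ["ext1"]}
core_domain     = {"core1": ["core2"], "core2": ["core1"]}

# A storm originating on ext1 floods the external domain only.
print(sorted(flood(external_domain, "ext1")))   # ['ext1', 'ext2']
```

Because the core switches simply do not appear in the external domain's flooding graph, a storm there cannot consume core network capacity, which is the effect the network separation achieves.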
During the incident, our external infrastructure network switches were rebooted, which restored normal operation to this section of the network.
We are continuing to investigate and research future mitigations. In order to safely test potential fixes, we have built a complete replica of the part of the network where the broadcast storm originated. We are using this lab setup to evaluate potential switch firmware upgrades; having examined the firmware change log, we believe an upgrade may prevent the problem from recurring. However, switch firmware upgrades can introduce new problems, so we want to be confident they work properly before introducing them to our infrastructure.
We are also using the lab to review and tune other potential remediation settings that we can apply to the switches in order to prevent similar problems from happening in the future.
Should you have any questions following the incident, please contact firstname.lastname@example.org and we will respond to your email during normal working hours. If your query is of a technical nature, please log into your Memset account to submit a support ticket.