Post-mortem for disruptions on 4th May 2017
Further to the notification received in the week commencing 24th April, Memset began scheduled maintenance at 22:00 on Wednesday 03/05/17 to update the hypervisor on our previous-generation Windows Miniserver hosts. This update was required to address several critical vulnerabilities recently announced for the Xen hypervisor.
Following the update, the new version of Xen presented the virtual devices to the Windows OS with a new PCI ID number. This change had two unforeseen results:
- All versions of Windows treated the virtual network adapter presented to them as a new adapter, and as a new adapter it was initialised without configuration. Our maintenance team was able to apply primary IP address configuration during the maintenance window. However, due to the nature of the remaining missing configuration (secondary IP and VLAN IP addresses), it was not possible to restore it during the maintenance window; further investigation was required to identify the affected servers. Once they had been identified, scripts were created to apply the secondary IP or VLAN IP address configuration.
- Servers running Windows 2003 failed to boot following the update, because Windows 2003 was unable to mount the virtual hard disk presented with the new PCI ID number.
Windows virtual servers without secondary IP addresses or VLAN IP addresses were restored to normal service during the maintenance window.
Windows virtual servers with secondary IP addresses or VLAN IP addresses may have been without the configuration required for these IP addresses to function. We were able to re-apply this configuration automatically on Thursday 04/05/17.
A small proportion of our customer virtual server estate runs Windows 2003. These servers were unavailable following the maintenance.
Recovery and remedial actions
From the maintenance window onwards, and through the course of Thursday 04/05/17, we re-applied configuration to additional IP addresses. This was done in a series of script runs, so the precise time at which an affected server recovered will vary. Our support team was also responding to tickets and reconfiguring IP addresses manually throughout the day.
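The scripts themselves are not reproduced in this report. As an illustration only, the following is a minimal Python sketch of how commands to re-add additional IPv4 addresses to an adapter might be generated. The adapter name and the addresses are hypothetical placeholders (documentation ranges), and the use of netsh is an assumption rather than a description of the tooling Memset actually used.

```python
import ipaddress


def build_netsh_commands(interface, addresses):
    """Return one 'netsh' command string per additional IPv4 address.

    interface: adapter name as shown by Windows (hypothetical value below).
    addresses: iterable of IPv4 CIDR strings such as "192.0.2.10/24".
    """
    commands = []
    for cidr in addresses:
        # ip_interface validates the input and separates address and netmask.
        iface = ipaddress.ip_interface(cidr)
        commands.append(
            f'netsh interface ip add address "{interface}" '
            f"{iface.ip} {iface.netmask}"
        )
    return commands


# Hypothetical record of additional addresses, keyed by adapter name.
additional = {"Local Area Connection": ["192.0.2.10/24", "198.51.100.7/25"]}

for adapter, cidrs in additional.items():
    for command in build_netsh_commands(adapter, cidrs):
        print(command)
```

The sketch only prints the commands; running them on each server, and working out which servers need them, is left out.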
Windows 2003 servers were restored by moving them from the upgraded hosts to hosts running an older version of Xen. All of these servers were restored by 19:00 on 04/05/17.
To reduce the risk of issues such as this occurring in future maintenance windows, we are reviewing and improving our testing and change management procedures. Our incident reviews have highlighted some areas for improvement, which we will be implementing in the coming weeks.
A contributing factor to the length of the incident was that our current monitoring system monitors only primary IP addresses. We have now completed the initial testing phase of a new monitoring system that will improve our response to incidents such as this, as well as allow us to provide greater functionality to our customers. Memset will communicate when this additional monitoring will be deployed across customer environments.
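To illustrate the gap described above, the following is a minimal Python sketch of a check that probes every address recorded for a server rather than only the primary one. It is not the new monitoring system. The inventory layout, the server names, the addresses and the choice of a TCP connection to port 3389 as the probe are all assumptions made for the example.

```python
import socket


def tcp_probe(address, port=3389, timeout=2.0):
    """Return True if a TCP connection to address:port succeeds.

    Port 3389 (RDP) is an assumed default, not a documented choice.
    """
    try:
        with socket.create_connection((address, port), timeout=timeout):
            return True
    except OSError:
        return False


def find_unreachable(inventory, probe=tcp_probe):
    """Map each server name to the addresses that failed the probe.

    inventory: {server: {"primary": str, "additional": [str, ...]}}
    Servers with no failing address are omitted from the result.
    """
    failures = {}
    for server, addrs in inventory.items():
        candidates = [addrs["primary"], *addrs.get("additional", [])]
        down = [address for address in candidates if not probe(address)]
        if down:
            failures[server] = down
    return failures


# Hypothetical inventory; the stub probe stands in for real network checks.
inventory = {
    "server-a": {"primary": "192.0.2.1", "additional": ["192.0.2.2"]},
    "server-b": {"primary": "192.0.2.3"},
}
reachable = {"192.0.2.1", "192.0.2.3"}
print(find_unreachable(inventory, probe=lambda address: address in reachable))
```

With the stub probe, server-a is reported because its additional address does not respond even though its primary address does. A check limited to primary addresses would not detect that condition.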
Our incident review process has also highlighted that customer communications during the incident were not frequent enough. Based on these findings, we will be reviewing and improving our communications strategy in the coming weeks to ensure that we post regular updates to our status page throughout the course of an incident.
The root cause of this incident was unexpected behaviour within Windows when presented with new PCI subsystem IDs for devices.
Our incident management process has highlighted shortcomings in our pre-release testing process for this particular type of maintenance, and has confirmed already-noted limitations in our current monitoring toolset. These will be addressed before future maintenance is scheduled.
Based on our investigations and internal incident review, we believe the likelihood of incidents of this nature recurring in future maintenance windows is low.