27th August At 11:15am UK time, Host #13 started exhibiting a fault relating to the storage controller. After failing over to the redundant unit and replacing a component, the issue was deemed corrected and an RFO was promptly issued. Service was restored at 11:45am.
At 10:20pm we again saw an issue relating to the storage controller. As the unit had already been replaced, it was evident the errors were coming from the host itself, specifically the motherboard PCI connection to the storage controllers. Service was restored at 11:05pm and a replacement was scheduled.
28th August At 5:00am the host was replaced with an interim unit to ensure the stability of the associated VMs. As the interim unit is intended to provide stability rather than be tuned for performance, this caused some intermittent load issues, which were alleviated by migrating a selection of VMs with the owners' permission.
29th August No issues were seen on the host, which remained stable throughout the day.
30th August RFO issued following 24 hours of stability monitoring.
31st August At 9:00pm the host will be replaced with an upgraded unit; the work will be as described above. Downtime should be no longer than 20 minutes.
Cause: Failed host motherboard PCI lane causing intermittent disconnects from the distributed storage.
Fix: Replacement of the failed host with an upgraded unit.
Sincere apologies to all affected users. Please make use of the upgraded host as a gesture of goodwill. We will take steps to ensure issues like this do not recur.