Incident Report: Bizvoip Cloud PBX

Title: Cloud PBX Platform unreachable
Date: 23 June 2020
Status: Resolved
Severity Level: 1
Start date: 1st failure: 17 June 2020, 08:41 to 12:15
2nd failure: 17 June 2020, 20:20 to 18 June 2020, 01:00

Details:

On 17 June 2020 at 08:41, we experienced a core network outage in our Johannesburg POP. The outage affected substantially all Cloud PBX core and related services.

Core Cloud PBX service was restored at approximately 10:05.  All interconnects were back in service by 11:15. 

The outage was caused by a hardware failure of switch 1 in the switch stack in one of our cabinets. The failure isolated this cabinet, and with it our VMware and NFS server infrastructure, from the rest of the network, which affected many services.

Emergency maintenance was performed in the evening to replace the failed switch. Only one step in the maintenance was assessed as having a service impact: the point at which the live core router was switched back to the primary router. That step was planned to start after 22:00. We considered the prior steps (the physical replacement of the failed switch) safe and started that work at 20:00.

However, when the replacement switch was powered on, it unexpectedly became the switch stack “master” (despite having had its flash erased). As a result, the configuration of the other switches was overwritten by the empty configuration on the new switch.

This caused a complete disruption of core switching in the Johannesburg POP. The outage was severe because it disconnected our VMware environment from the network storage system. As a result, most virtual machines crashed due to disk I/O errors. Each virtual machine then had to be checked, have its filesystems checked, and be restarted.

Impact:
All Cloud PBX core services.

Root Cause:
The root cause of the outage was equipment failure: the hardware failure of switch 1 in the core switch stack.

Corrective Actions:

  1. Replaced faulty equipment.
  2. Restore full redundancy for stacking cables.
  3. Set the intended master switches to high priority in the stacks to ensure that they retain the master role in the scenario we experienced.
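
To illustrate corrective action 3, the following Python sketch models a simplified stack master election. It assumes that the member with the highest configured priority wins and that ties are broken by the lowest MAC address. The actual election rules depend on the switch vendor and firmware, which this report does not specify, so the rule, the names (StackMember, elect_master) and all values below are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StackMember:
    name: str       # hypothetical label for the switch
    priority: int   # configured stack priority (higher is preferred)
    mac: str        # MAC address, used only as a tie-breaker


def elect_master(members):
    """Return the member that wins under the assumed election rule.

    Assumed rule (not taken from any vendor documentation):
    highest priority wins; ties go to the lowest MAC address.
    """
    if not members:
        raise ValueError("stack has no members")
    return min(members, key=lambda m: (-m.priority, m.mac.lower()))


# With equal default priorities, a blank replacement switch can win the tie.
default_stack = [
    StackMember("replacement-switch", priority=1, mac="00:00:00:00:00:01"),
    StackMember("existing-switch-2", priority=1, mac="00:00:00:00:00:02"),
]

# With an explicit high priority, the intended master wins regardless of MAC.
pinned_stack = [
    StackMember("replacement-switch", priority=1, mac="00:00:00:00:00:01"),
    StackMember("existing-switch-2", priority=15, mac="00:00:00:00:00:02"),
]

print(elect_master(default_stack).name)  # replacement-switch
print(elect_master(pinned_stack).name)   # existing-switch-2
```

Under this assumed rule, leaving every member at the same priority lets an arbitrary tie-breaker decide the master, whereas pinning a high priority on the intended master makes the outcome deterministic. The mechanism in the actual incident may have differed from this model.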

For any queries, please contact support@bizvoip.co.za or call 0876543210.

Regards,
Vincent Swart
Chief Technical Officer
