Affected
Degraded performance from 11:35 PM to 11:55 PM, Partial outage from 11:55 PM to 6:26 AM, Major outage from 6:26 AM to 7:32 PM, Operational from 7:32 PM to 10:46 PM
- Resolved
The server has now been running for over 4 hours and there have not been any errors. We are now confident that the issue is resolved.
- Update
We are happy to report that the RAID rebuild completed and everything looks stable at this time. We believe the PCI adapter was the cause of the hard drives showing as failed or disappearing from the BIOS.
We are going to keep this incident open for a few more hours and will closely monitor the server for any further issues.
- Monitoring
The PCI Adapter has been replaced and the second hard drive has been added back in. We will continue to monitor the server closely over the next few hours and will update this incident if there are any further issues.
- Investigating
After talking with the datacenter, we feel we might have a lead on what is causing the drives to show as failed or fall out of connection. It appears this server setup uses a PCI adapter when there is more than one hard drive installed. This morning we had the datacenter bypass the PCI adapter and plug the main drive directly into the motherboard. The server has now been running stable for nearly 3 hours.
So we suspect the PCI adapter might be failing. Within the hour we are going to take the server offline once again to plug the second drive back in and use a new PCI adapter. If this does not resolve the issue, then we will have to migrate to a new server, but we have high hopes that the PCI adapter is at fault.
- Monitoring
As you may have noticed, the server is back online. Please note that this is most likely temporary. The datacenter did not feel that the problem was with the motherboard, so we have booted the server with just the one drive that has all the data.
We are going to let this run for a couple of hours and see if it remains stable. If it does, our next plan of action is most likely to set up a second server and then migrate the data to it. If the hard drive does fall out of connection or throws any errors, then we will be forced to rebuild the server and restore backups.
We will update this post once we have our next plan of action.
- Identified
We are currently talking with the datacenter about trying a motherboard replacement. The fact that the drive shows up fine for a period of time and then disappears makes us think the motherboard may be malfunctioning.
This will be our last attempt to rectify this issue and if this does not work, we will have to proceed with disaster recovery.
- Investigating
Unfortunately, the resync of the RAID array failed at 82%, so we are continuing to explore our options and checking whether the new hard drive itself has issues.
- Monitoring
We hopefully have some good news, but we are not out of the woods yet. We had the datacenter swap the drives around from their original mounting locations, which will often refresh the BIOS so it sees the drives again. This could have been a case of the main drive getting bumped and having a loose connection, or the BIOS simply having issues reading the drives.
Now that the main drive is showing up, the server booted correctly and the RAID has been resyncing for the last 20 minutes or so. Once the RAID resync completes, everything should be good to go. We will update once the resync is complete or if there are any further issues along the way.
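For anyone curious how resync progress is tracked: assuming a Linux software (md) RAID — which may differ from this server's actual setup — progress is reported in `/proc/mdstat`. The sketch below parses a sample dump; the array name and figures are illustrative, not taken from this server:

```python
import re

# A sample /proc/mdstat excerpt as it might appear mid-resync
# (device names and numbers are illustrative).
MDSTAT = """\
md0 : active raid1 sdb1[1] sda1[0]
      976630336 blocks super 1.2 [2/2] [UU]
      [============>........]  resync = 62.5% (610393960/976630336) finish=48.2min speed=126512K/sec
"""

def resync_progress(mdstat_text):
    """Return the resync percentage from an mdstat dump, or None if no resync is running."""
    match = re.search(r"resync\s*=\s*([\d.]+)%", mdstat_text)
    return float(match.group(1)) if match else None

print(resync_progress(MDSTAT))  # 62.5
```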
- Update
After further investigation, it looks like the second drive might have failed during the resync of the RAID array. This is extremely rare, but it has been known to happen. Before we jump to the extreme of reprovisioning the server, reconfiguring it, and restoring backups, we are still going to work on recovering the data without a complete rebuild.
If we do have to resort to a full rebuild, we will email all customers with further details. While we work to recover the data as it stands, updates to this incident will be limited, as we want to put our full focus on attempting to recover the data.
If it does come to the point that we need to rebuild and restore backups, it could realistically take a few days to fully restore the server. So we want everyone to be aware that this will not be a quick process; it will take time to restore everything correctly without rushing and causing more errors.
- Investigating
It appears that was not the cause after all. Again we are seeing services falling off after a period of time, and we are continuing to troubleshoot. This is looking to be a bit more complicated than we anticipated, and at this point we do not have an accurate timeframe for how soon this will be resolved.
- Monitoring
We found that after the reboot, one of the server's security programs was triggering some extreme DDoS protection that was freezing the server and causing processes to fail. We believe we have disabled the features that caused this and have also reached out to the vendor to find out why it happened.
If this does happen again, we will disable the service completely until we can work this out with the vendor. We are going to continue to monitor over the next few hours.
- Update
We are still investigating what happened during the RAID rebuild. At this time we don't have any further information, but we hope to have more details very soon.
- Investigating
Something happened during the RAID rebuild that has caused several parts of the server to become inaccessible. We are investigating and will update as we have more information.
- Monitoring
We found that the newly installed drive has the latest firmware, so there is no need for further reboots. We are waiting for the RAID rebuild to complete, and then we will consider this issue closed.
- Update
The hard drive has been replaced and the server is back online. Now we will proceed to rebuild the RAID and update the firmware on the drive. Expect 1-2 reboots, and then the replacement will be complete.
- Update
The datacenter has already taken the server offline to replace the drive. We will update as things progress.
- Identified
Our monitoring systems have informed us that one of the hard drives on the server has failed. We are going to proceed immediately with a request to have the hard drive replaced. We anticipate the drive will be replaced within the next couple of hours. During this time, the server will go offline for about 30 minutes. After the replacement is done, a few reboots will be required to rebuild the RAID and update the firmware on the drive.
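For context on how such a failure gets flagged: one common check is the overall-health line from SMART tooling such as `smartctl -H`. The sketch below parses a sample of that output; the sample text and function name are illustrative assumptions, not details from our monitoring system:

```python
# Sample output in the shape smartctl -H produces (illustrative, not
# taken from the affected server).
SAMPLE_OUTPUT = """\
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
"""

def smart_health(smartctl_output):
    """Return 'PASSED' or 'FAILED' from smartctl -H style output, else None."""
    for line in smartctl_output.splitlines():
        if "overall-health self-assessment test result" in line:
            return line.rsplit(":", 1)[1].strip().rstrip("!")
    return None

print(smart_health(SAMPLE_OUTPUT))  # FAILED
```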