Investigating issue with Thunder server

Resolved

Major outage

Started over 1 year agoLasted 2 days

Affected

Web Hosting Servers

Thunder

Updates

Resolved
March 08, 2024 at 2:59 PM
Resolved
March 08, 2024 at 2:59 PM
This incident has been resolved.
Monitoring
March 07, 2024 at 7:22 PM
Monitoring
March 07, 2024 at 7:22 PM
The server is now back online. It looks to be a hardware issue that is similar we faced with another server in this datacenter. Here is the response from the datacenter:
you are on a new 7950X3D CPU and motherboard.
I'm assuming it is the same issue we've seen with a few other 7950X servers. They would randomly reboot over and over out of no where. We are guessing we got sent a bad batch from the manufacture as we have hundreds of others running fine. We've seen online some others have experienced the same thing with some 7950x's. We started buying 7950X3D, same thing, a little more expensive, but are faster and use less power, seem to be more stable.
We strongly believe this should resolve the downtime issue, but we will continue to monitor this closely.
Identified
March 07, 2024 at 6:41 PM
Identified
March 07, 2024 at 6:41 PM
The datacenter is moving the hard drives to completely new server setup to rule out any hardware issues. We update as more information comes in.
Update
March 07, 2024 at 6:24 PM
Update
March 07, 2024 at 6:24 PM
It looks like the server is stuck in a constant reboot. The datacenter is actively working on it and checking the hardware and running some quick hardware tests.
Investigating
March 07, 2024 at 6:01 PM
Investigating
March 07, 2024 at 6:01 PM
Well it appears it has crashed again. We are investigating.
Resolved
March 07, 2024 at 4:04 PM
Resolved
March 07, 2024 at 4:04 PM
We have now concluded our investigation on the outage of the Thunder server. Looking through the logs and the action taken on the server to bring it back online, we have come to the following conclusions and timeline of events.
1. The server crashed for no apparent reason. This has been an ongoing issue across our whole server fleet. We have setup our servers with kdump that will produce a kernel dump that can be used for analysis. We have submitted the kernel dumps to Cloudlinux from several servers and they stated it will take a few weeks to analyze these.
  
  Kernel dumps save the contents of the system memory for analysis and average around 4 GB in size. They require specially trained technicians to read the output of the kernel dump and it is not something we could do.
2. Once we found the server to be down, we discovered the server was not booting back into the kernel and instead was landing on a Grub screen. At that point we began our troubleshooting process and also brought the support staff from Cloudlinux to investigate. We also contacted the datacenter and requested a USB be installed with a rescue system installed.
  
  Discovering the cause of a no boot situation is very lengthy process and requires setting up a rescue system to boot into and mounting the original operating system of the server. This all takes extended amounts of time, along with coordinating with the the various staff and CL support team.
3. After server hours of investigation, we found that upon reboot, one the NVMe hard drives that makes up the Raid-1 array on the server, had fallen out of the raid array. To simplify this, a Raid-1 array basically mirrors all the data across 2 drives. If one fails or falls out of the array, the other disk still contains the data on the drive.
  
  In this case the one hard drive fell out of one of the arrays, that contains the boot records. With new server technology, there are different boot records stored on each drive and they are not mirrored like the older systems. So when the one drive fell out of the array, it contained part of the boot records for EFI that had to be reinstalled.
4. At this point we went to reinstall the boot grub config and attempted to reboot, but still it went back to booting into the grub screen. At this point we had the staff from Cloudlinux attempt this as well, and still the reboot failed.
5. This is when we made the decision to bring in outside help. We have one of the best Linux technicians in the country on standby, that we have developed a working relationship over the fast few years. While he does come with a premium price per hour, we felt it had reached a point that we had exhausted all our attempts to bring the server back online.
  
  At this point we gave him a quick rundown of the state of the server and began his work. After a few attempts to reinstall the boot records himself, he still faced the same issues of a non-booting server. He continued to work the server for another 2 hours and we finally able to get the boot records to stick and the server booted correctly. To verify even further, we performed a second manual reboot and the server booted in the operating system as it should.
At this point we will consider the server running stable, but we are still awaiting the analysis of the various kernel dumps that have been provided. There were some changes made since the random reboots/crashes started occurring and this is the first one on 3 weeks, compared to where they were happing several times a week, across several server. So we feel this might have been an isolated case with the hard drive dropping out of the raid array, leading to the kernel panic and crash.
We will continue to work with the Cloudlinux team to find the cause of the crashes.
Monitoring
March 06, 2024 at 10:31 PM
Monitoring
March 06, 2024 at 10:31 PM
We are happy to state that the server is back online. We will be monitoring this closely and we are also going to have Cloudlinux examine a kernel dump that was generated on the crash, along with investigating why the one hard drive partially fell out of the raid array.
We understand many customers wanted a timeline on when this server was going to be back online, but in situations like this, it is impossible to give a timeline. In the first few hours, we would have thought the server would be back up in an hour or so, but that was not the case. If we give a timeline and miss it, then there is gonna be negative feed back on that.
Even further, if we ended up having to reinstall, it could have taken another day. In any outage we would love to give a timeline, but it not always feasible to do that, such in this case.
We thank everything for their patience! If anything further pops up, we will update accordingly.
Update
March 06, 2024 at 9:09 PM
Update
March 06, 2024 at 9:09 PM
Just a short update. We have pulled in one of the top linux technicians in the country and he has now taken over operations. he is actively working on the server as we speak and hope to have good news soon.
Update
March 06, 2024 at 7:53 PM
Update
March 06, 2024 at 7:53 PM
The troubleshooting process is still ongoing. We may have tracked this down to a raid corruption on the server, but we are still not 100% sure on that.
We appreciate everyone's patience. We know this is a long outage and we understand everyone's frustrations. We are facing the same frustrations, but some server issues are not always so cut and dry and can take hours and hours of troubleshooting. If at all possible we would rather take a little extra time and try to repair the server, than perform a full rebuild that is not a quick process and "can" come with it's own issues.
We will continue to update as things progress.
Update
March 06, 2024 at 5:05 PM
Update
March 06, 2024 at 5:05 PM
At this time there is not anything new to report. We are continuing to track down the issue and bring the boot records into place. These types of issues are not quick fixes and we do expect this to be an extended outage. We are doing everything we can to repair the state of the server and avoid a full server reinstall.
We post as soon as we have further details.
Update
March 06, 2024 at 2:41 PM
Update
March 06, 2024 at 2:41 PM
Just a short update, but we have had the datacenter install a rescue USB and we will continue our troubleshooting to see why it is not properly booting into the kernel. We have also brought aboard the Cloudlinux support team and we are working together to identify and resolve the kernel issue.
We have had clients reach out and ask about backups. In a worst case scenario, where we would have to reinstall the server and restore backups, we have backups within 6 hours of the outage.
Update
March 06, 2024 at 12:22 PM
Update
March 06, 2024 at 12:22 PM
We are continuing to work on the server and track down why it is not booting into the kernel. This may take a few hours to attempt to get the boot records reinstalled and the server booted back up.
Identified
March 06, 2024 at 11:13 AM
Identified
March 06, 2024 at 11:13 AM
We found upon notification of this server being down, that the server crashed and is not botting into the kernel properly. We are working on this and will update as more information comes in.
Investigating
March 06, 2024 at 10:39 AM
Investigating
March 06, 2024 at 10:39 AM
We are currently investigating an issue with the server Thunder. Our engineers have been alerted and further details will be provided if necessary.

All systems operational

About This Site

Investigating issue with Thunder server

MonsterMegs - Investigating issue with Thunder server – Incident details

All systems operational

About This Site

Investigating issue with Thunder server