MonsterMegs - Investigating issue with Thunder server – Incident details

All systems operational

About This Site

Welcome to MonsterMegs' status page. If you want to keep on top of any disruptions in service to our website, control panel, or hosting platform, this is the page to check. We report minor/major outages and scheduled maintenance to this status page. However, issues affecting a small number of customers may be reported directly to affected customers in MyMonsterMegs (https://my.monstermegs.com). If you're experiencing an issue you do not see reported on this page, please log into MyMonsterMegs to view any alerts our team has added to your account.

Investigating issue with Thunder server

Resolved
Major outage
Started 7 months ago, lasted 2 days

Affected

Web Hosting Servers

Partial outage from 10:39 AM to 11:13 AM, Major outage from 11:13 AM to 12:23 PM, Operational from 12:23 PM to 6:01 PM, Major outage from 6:01 PM to 2:59 PM

Thunder

Partial outage from 10:39 AM to 11:13 AM, Major outage from 11:13 AM to 12:23 PM, Operational from 12:23 PM to 6:01 PM, Major outage from 6:01 PM to 2:59 PM

Updates
  • Resolved
    Resolved
    This incident has been resolved.
  • Monitoring
    Monitoring

    The server is now back online. It looks to be a hardware issue similar to one we faced with another server in this datacenter. Here is the response from the datacenter:

    You are on a new 7950X3D CPU and motherboard.
    I'm assuming it is the same issue we've seen with a few other 7950X servers. They would randomly reboot over and over out of nowhere. We are guessing we got sent a bad batch from the manufacturer, as we have hundreds of others running fine. We've seen online that others have experienced the same thing with some 7950Xs. We started buying the 7950X3D instead; they are a little more expensive, but they are faster, use less power, and seem to be more stable.

    We strongly believe this should resolve the downtime issue, but we will continue to monitor this closely.

  • Identified
    Identified

    The datacenter is moving the hard drives to a completely new server setup to rule out any hardware issues. We will update as more information comes in.

  • Investigating
    Update

    It looks like the server is stuck in a constant reboot loop. The datacenter is actively working on it, checking the hardware and running some quick diagnostic tests.

  • Investigating
    Investigating

    Well, it appears the server has crashed again. We are investigating.

  • Resolved
    Resolved

    We have now concluded our investigation into the outage of the Thunder server. After looking through the logs and the actions taken to bring the server back online, we have come to the following conclusions and timeline of events.

    1. The server crashed for no apparent reason. This has been an ongoing issue across our whole server fleet. We have set up our servers with kdump, which produces a kernel dump that can be used for analysis. We have submitted kernel dumps from several servers to Cloudlinux, and they stated it will take a few weeks to analyze them.

      Kernel dumps save the contents of the system memory for analysis and average around 4 GB in size. They require specially trained technicians to read, and interpreting them is not something we can do in-house. (A short check for confirming kdump is armed is sketched just after this list.)

    2. Once we found the server to be down, we discovered that it was not booting back into the kernel and was instead landing on a GRUB screen. At that point we began our troubleshooting process and also brought in the support staff from Cloudlinux to investigate. We also contacted the datacenter and requested that a USB drive with a rescue system be installed.

      Diagnosing a no-boot situation is a very lengthy process: it requires booting into a rescue system and mounting the server's original operating system from there (a rough sketch of that mount-and-chroot sequence appears further below in this update). All of this takes extended amounts of time, along with coordinating with the various datacenter staff and the CL support team.

    3. After several hours of investigation, we found that upon reboot, one of the NVMe drives that make up the RAID-1 array on the server had fallen out of the array. To simplify, a RAID-1 array mirrors all data across two drives; if one drive fails or falls out of the array, the other disk still contains the data.

      In this case the drive fell out of one of the arrays, the one that contains the boot records. With newer server technology, separate boot records are stored on each drive and are not mirrored the way they were on older systems. So when that drive fell out of the array, the part of the EFI boot records it contained had to be reinstalled. (A sketch of how a degraded array can be spotted appears at the end of this update.)

    4. We then reinstalled the GRUB boot configuration and attempted to reboot, but the server still came back to the GRUB screen. We had the staff from Cloudlinux attempt this as well, and the reboot still failed.

    5. This is when we made the decision to bring in outside help. We have one of the best Linux technicians in the country on standby, with whom we have developed a working relationship over the past few years. While he does come at a premium price per hour, we felt we had reached a point where we had exhausted all of our own attempts to bring the server back online.

      At this point we gave him a quick rundown of the state of the server and he began his work. After a few attempts to reinstall the boot records himself, he still faced the same issue of a non-booting server. He continued to work on the server for another two hours, and he was finally able to get the boot records to stick and the server booted correctly. To verify even further, we performed a second manual reboot and the server booted into the operating system as it should.
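
    For those curious about the kernel-dump side mentioned in point 1, the sketch below shows one way to confirm that a crash kernel is reserved and kdump is armed on a Linux server. It is illustrative only: the paths are the standard kexec interfaces, and the script is an assumption on the editor's part, not our actual monitoring tooling.

      #!/usr/bin/env python3
      """Minimal sketch: confirm a crash kernel is reserved and kdump is armed.

      Assumes a Linux host with the standard kexec/kdump interfaces; it is an
      illustration, not the exact tooling used on these servers.
      """
      from pathlib import Path

      def crash_kernel_reserved() -> bool:
          # The kernel command line must carry a crashkernel= reservation
          # so kdump has memory to load its capture kernel into.
          return "crashkernel=" in Path("/proc/cmdline").read_text()

      def capture_kernel_loaded() -> bool:
          # "1" means a capture kernel is loaded, so a panic will produce a vmcore.
          return Path("/sys/kernel/kexec_crash_loaded").read_text().strip() == "1"

      if __name__ == "__main__":
          print("crashkernel reservation:", "ok" if crash_kernel_reserved() else "MISSING")
          print("capture kernel loaded:  ", "ok" if capture_kernel_loaded() else "NOT LOADED")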

    At this point we consider the server to be running stable, but we are still awaiting the analysis of the various kernel dumps that have been provided. Some changes were made after the random reboots/crashes started occurring, and this is the first crash in 3 weeks, compared to when they were happening several times a week across several servers. So we feel this may have been an isolated case of the hard drive dropping out of the RAID array, leading to the kernel panic and crash.
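
    Point 2 above mentions booting a rescue system and mounting the original operating system. As a rough illustration of what that involves, the sketch below walks through a typical mount-and-chroot sequence. The device names are placeholders, not Thunder's actual disk layout.

      #!/usr/bin/env python3
      """Rough sketch of a rescue-system repair session: mount the installed
      system from a rescue boot and chroot into it. Device names below
      (/dev/md2, /dev/nvme0n1p1) are placeholders, not Thunder's real layout.
      """
      import subprocess

      def run(cmd):
          print("+", " ".join(cmd))
          subprocess.run(cmd, check=True)

      # Mount the original root filesystem (assumed here to be an md RAID-1 device).
      run(["mount", "/dev/md2", "/mnt"])

      # The EFI system partition lives on each drive individually (it is not
      # mirrored), so it is mounted separately; the partition shown is only an example.
      run(["mount", "/dev/nvme0n1p1", "/mnt/boot/efi"])

      # Bind the virtual filesystems the chroot will need.
      for fs in ("dev", "proc", "sys"):
          run(["mount", "--bind", f"/{fs}", f"/mnt/{fs}"])

      # A shell inside the installed system, from which the bootloader can be
      # reinstalled (the exact command depends on the distribution).
      run(["chroot", "/mnt", "/bin/bash"])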

    We will continue to work with the Cloudlinux team to find the cause of the crashes.
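
    Finally, for point 3: a two-disk RAID-1 that has lost a member shows up in /proc/mdstat with a [U_] or [_U] status instead of [UU]. The sketch below is one way such a degraded array could be flagged; it is a generic illustration, not the monitoring we run in production.

      #!/usr/bin/env python3
      """Minimal sketch: flag Linux software-RAID (md) arrays that are missing a
      member by parsing /proc/mdstat. Generic illustration only.
      """
      import re
      from pathlib import Path

      def degraded_arrays():
          degraded = []
          current = None
          for line in Path("/proc/mdstat").read_text().splitlines():
              header = re.match(r"^(md\d+)\s*:", line)
              if header:
                  current = header.group(1)
              # The status line looks like: "... blocks super 1.2 [2/1] [U_]"
              status = re.search(r"\[\d+/\d+\]\s*\[([U_]+)\]", line)
              if status and current and "_" in status.group(1):
                  degraded.append(current)
          return degraded

      if __name__ == "__main__":
          bad = degraded_arrays()
          if bad:
              print("Degraded arrays:", ", ".join(bad))
          else:
              print("All md arrays have their full set of members.")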

  • Monitoring
    Monitoring

    We are happy to state that the server is back online. We will be monitoring this closely, and we are also going to have Cloudlinux examine a kernel dump that was generated during the crash, along with investigating why the one hard drive partially fell out of the RAID array.

    We understand many customers wanted a timeline for when this server would be back online, but in situations like this it is impossible to give one. In the first few hours we thought the server would be back up within an hour or so, but that was not the case. If we give a timeline and miss it, there is going to be negative feedback on that.

    Even further, if we had ended up having to reinstall, it could have taken another day. In any outage we would love to give a timeline, but it is not always feasible to do that, as in this case.

    We thank everyone for their patience! If anything further pops up, we will update accordingly.

  • Identified
    Update

    Just a short update. We have pulled in one of the top Linux technicians in the country and he has now taken over operations. He is actively working on the server as we speak, and we hope to have good news soon.

  • Identified
    Update

    The troubleshooting process is still ongoing. We may have tracked this down to RAID corruption on the server, but we are still not 100% sure of that.

    We appreciate everyone's patience. We know this is a long outage and we understand everyone's frustration; we are facing the same frustration. But some server issues are not so cut and dried, and they can take hours and hours of troubleshooting. If at all possible, we would rather take a little extra time and try to repair the server than perform a full rebuild, which is not a quick process and "can" come with its own issues.

    We will continue to update as things progress.

  • Identified
    Update

    At this time there is nothing new to report. We are continuing to track down the issue and get the boot records into place. These types of issues are not quick fixes, and we do expect this to be an extended outage. We are doing everything we can to repair the state of the server and avoid a full server reinstall.

    We will post as soon as we have further details.

  • Identified
    Update

    Just a short update: we have had the datacenter install a rescue USB, and we will continue our troubleshooting to see why the server is not properly booting into the kernel. We have also brought aboard the Cloudlinux support team, and we are working together to identify and resolve the kernel issue.

    We have had clients reach out and ask about backups. In a worst-case scenario, where we would have to reinstall the server and restore backups, we have backups taken within 6 hours of the outage.

  • Identified
    Update

    We are continuing to work on the server and track down why it is not booting into the kernel. It may take a few hours to get the boot records reinstalled and the server booted back up.

  • Identified
    Identified

    Upon being notified that this server was down, we found that the server had crashed and is not booting into the kernel properly. We are working on this and will update as more information comes in.

  • Investigating
    Investigating

    We are currently investigating an issue with the server Thunder. Our engineers have been alerted and further details will be provided if necessary.