MonsterMegs - Notice history

All systems operational

About This Site

Welcome to MonsterMegs' status page. If you want to keep on top of any disruptions in service to our website, control panel, or hosting platform, this is the page to check. We report minor/major outages and scheduled maintenance to this status page. However, issues affecting a small number of customers may be reported directly to affected customers in MyMonsterMegs (https://my.monstermegs.com). If you're experiencing an issue you do not see reported on this page, please log into MyMonsterMegs to view any alerts our team has added to your account.

Website - Operational
100% uptime (Mar 2024: 100%, Apr 2024: 100%, May 2024: 100%)

Customer Portal - Operational
100% uptime (Mar 2024: 100%, Apr 2024: 100%, May 2024: 100%)

Thunder - Operational
98% uptime (Mar 2024: 100%, Apr 2024: 93.77%, May 2024: 99.09%)

Hurricane - Operational
100% uptime (Mar 2024: 100%, Apr 2024: 100%, May 2024: 100%)

Storm - Operational
100% uptime (Mar 2024: 100%, Apr 2024: 100%, May 2024: 100%)

Lightning - Operational
100% uptime (Mar 2024: 100%, Apr 2024: 100%, May 2024: 99.83%)

DNS-1 - Operational
100% uptime (Mar 2024: 100%, Apr 2024: 100%, May 2024: 100%)

DNS-2 - Operational
100% uptime (Mar 2024: 100%, Apr 2024: 100%, May 2024: 100%)

DNS-3 - Operational
100% uptime (Mar 2024: 100%, Apr 2024: 100%, May 2024: 100%)

DNS-4 - Operational
100% uptime (Mar 2024: 100%, Apr 2024: 100%, May 2024: 100%)

US Backup Storage Daily - Operational
100% uptime (Mar 2024: 100%, Apr 2024: 100%, May 2024: 100%)

US Backup Storage Weekly - Operational
100% uptime (Mar 2024: 100%, Apr 2024: 100%, May 2024: 100%)

EU Backup Storage Daily - Operational
100% uptime (Mar 2024: 100%, Apr 2024: 100%, May 2024: 100%)

EU Backup Storage Weekly - Operational
100% uptime (Mar 2024: 100%, Apr 2024: 100%, May 2024: 100%)

Zabbix Monitoring Server US - Operational
100% uptime (Mar 2024: 100%, Apr 2024: 100%, May 2024: 100%)

Zabbix Monitoring Server EU - Operational
100% uptime (Mar 2024: 100%, Apr 2024: 100%, May 2024: 100%)

Third Party: Cloudflare → Cloudflare Sites and Services → CDN/Cache - Operational
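
For a sense of what these percentages mean in wall-clock terms, the snippet below is a minimal sketch (not part of our tooling) that converts a monthly uptime percentage into approximate hours of downtime; for example, Thunder's 93.77% in April 2024 works out to roughly 45 hours offline.

    # Minimal sketch: convert a monthly uptime percentage into approximate downtime hours.
    def downtime_hours(uptime_percent: float, days_in_month: int) -> float:
        """Hours of downtime implied by an uptime percentage over a month."""
        total_hours = days_in_month * 24
        return (1 - uptime_percent / 100) * total_hours

    if __name__ == "__main__":
        # Thunder, April 2024: 93.77% uptime over 30 days -> about 44.9 hours of downtime.
        print(f"{downtime_hours(93.77, 30):.1f} hours")
        # Lightning, May 2024: 99.83% uptime over 31 days -> about 1.3 hours of downtime.
        print(f"{downtime_hours(99.83, 31):.1f} hours")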

Notice history

May 2024

Reboot and Failed Raid Drive - Storm
  • Resolved

    The firmware upgrades have been completed and everything is back online. We found an issue with earlier firmware versions of the Samsung 990 Pro that could cause the disk to fail well before the end of its expected lifespan. We applied the latest firmware that addresses that issue, and both drives in the server are now on the latest production revision. (A brief sketch of how drive firmware revisions can be checked is included after this incident's updates.)

    With that said, we do not anticipate any further hard drive failures.

  • Update

    The server has been back online for about 30 minutes now. We are waiting for the raid to finish rebuilding and then we will proceed with the firmware updates.

  • Update

    The server has just gone down for the hard drive replacement. Please anticipate several shorter downtimes over the next couple of hours as we apply these firmware updates in rescue mode.

  • Update

    We checked the server in rescue mode and the drive did not show up. We have gotten the server back online with the single drive, and the datacenter will be replacing the failed drive very shortly. Once the server is back online, we will perform firmware updates on all internal server components in the hope that this rectifies the hard drives failing or bricking themselves before the end of their lifespan.

  • Identified

    We are going to perform an emergency reboot to check a reported failed hard drive in our raid setup. While doing this, we will most likely also apply firmware updates to the motherboard and hard drives to try to resolve these hard drives repeatedly being reported as failed.

    We will update as we determine the course of action.
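
For readers wondering how drive firmware revisions like the ones mentioned above can be verified, the snippet below is a minimal sketch (not our exact tooling) that reads the model and firmware revision of each NVMe controller from the standard Linux sysfs attributes; it assumes a Linux host and does not list the specific Samsung 990 Pro firmware versions involved.

    # Minimal sketch: list NVMe controllers with their model and firmware revision
    # by reading the standard Linux sysfs attributes.
    from pathlib import Path

    def nvme_firmware_report() -> None:
        for ctrl in sorted(Path("/sys/class/nvme").glob("nvme*")):
            model = (ctrl / "model").read_text().strip()
            firmware = (ctrl / "firmware_rev").read_text().strip()
            print(f"{ctrl.name}: model={model!r} firmware={firmware!r}")

    if __name__ == "__main__":
        nvme_firmware_report()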

Apr 2024

Emergency Downtime - Storm
  • Completed
    April 18, 2024 at 10:46 AM

    Not sure why our last update did not post, but the hard drive replacement has been completed. We also modified the boot records across all drives to fix issues with boot records recently becoming unavailable. This should lead to a more stable environment compared to the last couple of months.

  • Update
    April 18, 2024 at 12:48 AM
    In progress

    The server is back online temporarily. We first had to reinstall the boot records on the single working drive. Now that we know it boots, we are going to contact the datacenter to replace the failed hard drive. After they replace the drive, the server should boot normally and we will re-add the drive to the raid array. (A brief sketch of how the array's rebuild status can be checked is included after this incident's updates.) How quickly the datacenter can install the new drive will determine when the maintenance window closes. Until they install the hard drive, the server will remain online.

    The reinstall of the boot records took longer than expected because the boot folder had been moved and the boot entries had been modified to point to the folder's new location, which caused havoc when regenerating the boot records, since the files listed the wrong folder locations. This is not something that usually happens, so it took a while to track down.

    We anticipate the datacenter will replace the drive within the next hour or two.

  • Update
    April 18, 2024 at 12:17 AM
    In progress

    The maintenance is still in progress, but it is taking longer than we expected. We still hope to have the server back up by the close of the maintenance window. If there will be any further delays, we will update accordingly.

  • In progress
    April 17, 2024 at 11:00 PM
    Maintenance is now in progress
  • Planned
    April 17, 2024 at 11:00 PM

    We are scheduling an emergency downtime window of 2 hours, starting at 6pm CST and lasting until 8pm CST. During this emergency downtime, we will take the server offline to investigate a failed/offline hard drive in our raid array, and we will also perform some maintenance to resolve some of the crashes and reboot issues we have suffered in the past.

    While we are scheduling a 2-hour window, we believe this will be resolved within 30-40 minutes or so. We understand this is short notice, but we do not want to risk the other drive failing and being forced to reinstall and restore backups.
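
As a side note on how a rebuild like the one described above can be watched from the server itself: assuming the array is Linux software RAID (mdraid), the sketch below parses /proc/mdstat and reports degraded arrays and any resync/recovery progress. It is a simplified illustration, not our actual monitoring, and it will show nothing for hardware RAID controllers.

    # Minimal sketch: flag degraded Linux software-RAID (md) arrays and report any
    # rebuild/resync progress by scanning /proc/mdstat.
    import re

    def mdstat_report(path: str = "/proc/mdstat") -> None:
        current = None
        with open(path) as f:
            for line in f:
                m = re.match(r"^(md\d+)\s*:", line)
                if m:
                    current = m.group(1)
                    continue
                # A status like [2/2] [UU] means both mirrors are present; an
                # underscore (e.g. [2/1] [U_]) marks a missing member.
                m = re.search(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]", line)
                if m and current:
                    configured, working, flags = int(m.group(1)), int(m.group(2)), m.group(3)
                    state = "OK" if "_" not in flags else "DEGRADED"
                    print(f"{current}: {working}/{configured} devices active [{flags}] -> {state}")
                # Rebuild lines look like: "... recovery = 12.3% (...) finish=42.0min ..."
                m = re.search(r"(recovery|resync)\s*=\s*([\d.]+%).*finish=(\S+)", line)
                if m and current:
                    print(f"{current}: {m.group(1)} {m.group(2)} complete, estimated finish {m.group(3)}")

    if __name__ == "__main__":
        mdstat_report()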

Investigating issue with Thunder server
  • Resolved

    We are closing this incident for now. We are still working with the datacenter on a fix. This seems to be related to AlmaLinux and the way it handles the GRUB and boot partitions when the disk drives are in a raid format. We are exploring options to bypass this and copy the EFI boot partitions across all drives. (A rough illustration of how those partitions can be compared across drives is included after this incident's updates.)

  • Update

    We found that the server, for some reason, just dropped all of its hard drives. Upon doing a full server reset, the drives showed back up. We are having the datacenter look for a resolution so this does not happen again. Hopefully it can be addressed with a firmware update of some sort and is not the motherboard itself. Either way, we will update as we gather more information on a resolution.

  • Monitoring

    The server is back online, but we still need to investigate the reason for the crash and the reboot into the BIOS. We will update this when we have more info.

  • Update

    We have found that the server crashed and rebooted to the BIOS screen. The datacenter is currently looking into this to see if it is a hardware issue.

  • Investigating

    We are currently investigating an issue with the server Thunder. Our engineers have been alerted and further details will be provided if necessary.
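
For background on the idea of copying the EFI boot partitions across all drives (mentioned in the resolved note above), the sketch below is a hypothetical illustration of how one could confirm that two mounted EFI system partitions contain identical files by comparing checksums. The mount points are placeholders, not our actual layout, and actually syncing the partitions would be done with dedicated tooling rather than a script like this.

    # Minimal sketch: compare two mounted EFI system partitions file-by-file using
    # SHA-256 checksums, to confirm the boot files are mirrored.
    import hashlib
    from pathlib import Path

    def tree_hashes(root: Path) -> dict:
        """Map each file's path (relative to root) to its SHA-256 digest."""
        hashes = {}
        for path in sorted(root.rglob("*")):
            if path.is_file():
                hashes[str(path.relative_to(root))] = hashlib.sha256(path.read_bytes()).hexdigest()
        return hashes

    def compare_esps(esp_a: str, esp_b: str) -> None:
        a, b = tree_hashes(Path(esp_a)), tree_hashes(Path(esp_b))
        for rel in sorted(set(a) | set(b)):
            if a.get(rel) != b.get(rel):
                print(f"differs or missing: {rel}")

    if __name__ == "__main__":
        # Hypothetical mount points for the two drives' EFI system partitions.
        compare_esps("/boot/efi", "/mnt/efi2")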

Unable to access cPanel on all servers
  • Resolved

    We got ahold of cPanel and the licenses have been restored.

  • Identified

    We have just noticed that cPanel and WHM are not accessible on any of our servers. After further investigation, we found that all of our cPanel licenses are listed as suspended. This appears to be a billing error on cPanel's side, and we are already in contact with them.

    They recently migrated to a new billing system, and this appears to be an error, as outlined in this forum post on cPanel's website:

    https://support.cpanel.net/hc/en-us/community/posts/22571293184023

Mar 2024

Investigating issue with Thunder server
  • Resolved
    This incident has been resolved.
  • Monitoring

    The server is now back online. It looks to be a hardware issue similar to one we faced with another server in this datacenter. Here is the response from the datacenter:

    "You are on a new 7950X3D CPU and motherboard. I'm assuming it is the same issue we've seen with a few other 7950X servers. They would randomly reboot over and over out of nowhere. We are guessing we got sent a bad batch from the manufacturer, as we have hundreds of others running fine. We've seen online that some others have experienced the same thing with some 7950Xs. We started buying 7950X3Ds instead; they are a little more expensive, but faster, use less power, and seem to be more stable."

    We strongly believe this should resolve the downtime issue, but we will continue to monitor this closely.

  • Identified

    The datacenter is moving the hard drives to a completely new server setup to rule out any hardware issues. We will update as more information comes in.

  • Update

    It looks like the server is stuck in a constant reboot loop. The datacenter is actively working on it, checking the hardware and running some quick hardware tests.

  • Investigating

    Well it appears it has crashed again. We are investigating.

  • Resolved

    We have now concluded our investigation of the outage on the Thunder server. Looking through the logs and the actions taken on the server to bring it back online, we have come to the following conclusions and timeline of events.

    1. The server crashed for no apparent reason. This has been an ongoing issue across our whole server fleet. We have set up our servers with kdump, which produces a kernel dump that can be used for analysis. We have submitted kernel dumps from several servers to Cloudlinux, and they stated it will take a few weeks to analyze them. (A brief sketch of how kdump readiness can be checked is included after this incident's updates.)

      Kernel dumps save the contents of the system memory for analysis and average around 4 GB in size. They require specially trained technicians to read the output, and it is not something we can do in-house.

    2. Once we found the server to be down, we discovered it was not booting back into the kernel and instead was landing on a GRUB screen. At that point we began our troubleshooting process and also brought in the support staff from Cloudlinux to investigate. We also contacted the datacenter and requested that a USB drive with a rescue system be installed.

      Discovering the cause of a no-boot situation is a very lengthy process and requires setting up a rescue system to boot into and mounting the server's original operating system. This all takes extended amounts of time, along with coordinating with the various datacenter staff and the Cloudlinux support team.

    3. After several hours of investigation, we found that upon reboot, one of the NVMe hard drives that make up the RAID-1 array on the server had fallen out of the array. To simplify this, a RAID-1 array basically mirrors all the data across 2 drives. If one fails or falls out of the array, the other disk still contains all of the data.

      In this case, the hard drive fell out of one of the arrays that contains the boot records. On newer systems, separate boot records are stored on each drive and are not mirrored like on older systems. So when the one drive fell out of the array, it took with it part of the EFI boot records, which had to be reinstalled.

    4. At this point we went to reinstall the GRUB boot configuration and attempted to reboot, but the server still went back to booting into the GRUB screen. We had the staff from Cloudlinux attempt this as well, and the reboot still failed.

    5. This is when we made the decision to bring in outside help. We have one of the best Linux technicians in the country on standby, with whom we have developed a working relationship over the past few years. While he does come at a premium price per hour, we felt it had reached the point where we had exhausted all of our attempts to bring the server back online.

      We gave him a quick rundown of the state of the server and he began his work. After a few attempts to reinstall the boot records himself, he still faced the same non-booting server. He continued to work on the server for another 2 hours and was finally able to get the boot records to stick, and the server booted correctly. To verify even further, we performed a second manual reboot and the server booted into the operating system as it should.

    At this point we consider the server to be running stable, but we are still awaiting the analysis of the various kernel dumps that have been provided. Some changes were made since the random reboots/crashes started occurring, and this is the first crash in 3 weeks, compared to when they were happening several times a week across several servers. So we feel this might have been an isolated case of the hard drive dropping out of the raid array, leading to the kernel panic and crash.

    We will continue to work with the Cloudlinux team to find the cause of the crashes.

  • Monitoring

    We are happy to state that the server is back online. We will be monitoring this closely, and we are also going to have Cloudlinux examine a kernel dump that was generated during the crash, along with investigating why the one hard drive partially fell out of the raid array.

    We understand many customers wanted a timeline for when this server would be back online, but in situations like this it is impossible to give one. In the first few hours, we would have thought the server would be back up in an hour or so, but that was not the case. If we give a timeline and miss it, there is going to be negative feedback on that.

    Even further, if we had ended up having to reinstall, it could have taken another day. In any outage we would love to give a timeline, but it is not always feasible to do so, as in this case.

    We thank everyone for their patience! If anything further pops up, we will update accordingly.

  • Update

    Just a short update. We have pulled in one of the top Linux technicians in the country and he has now taken over operations. He is actively working on the server as we speak, and we hope to have good news soon.

  • Update

    The troubleshooting process is still ongoing. We may have tracked this down to a raid corruption on the server, but we are still not 100% sure on that.

    We appreciate everyone's patience. We know this is a long outage and we understand everyone's frustrations. We are facing the same frustrations, but some server issues are not always cut and dried and can take hours and hours of troubleshooting. If at all possible, we would rather take a little extra time and try to repair the server than perform a full rebuild, which is not a quick process and can come with its own issues.

    We will continue to update as things progress.

  • Update

    At this time there is not anything new to report. We are continuing to track down the issue and bring the boot records into place. These types of issues are not quick fixes and we do expect this to be an extended outage. We are doing everything we can to repair the state of the server and avoid a full server reinstall.

    We will post as soon as we have further details.

  • Update

    Just a short update, but we have had the datacenter install a rescue USB and we will continue our troubleshooting to see why it is not properly booting into the kernel. We have also brought aboard the Cloudlinux support team and we are working together to identify and resolve the kernel issue.

    We have had clients reach out and ask about backups. In a worst-case scenario, where we would have to reinstall the server and restore backups, we have backups taken within 6 hours of the outage.

  • Update

    We are continuing to work on the server and track down why it is not booting into the kernel. It may take a few hours to get the boot records reinstalled and the server booted back up.

  • Identified

    Upon being notified that this server was down, we found that it had crashed and was not booting into the kernel properly. We are working on this and will update as more information comes in.

  • Investigating

    We are currently investigating an issue with the server Thunder. Our engineers have been alerted and further details will be provided if necessary.
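
Since kdump and kernel dumps come up repeatedly in this incident's updates, here is a minimal sketch of how one can check from userspace that a crash kernel is actually loaded and that memory has been reserved for it via the crashkernel= boot parameter. It assumes a Linux host with the standard /sys and /proc interfaces and is an illustration, not our monitoring configuration.

    # Minimal sketch: verify that kdump's crash kernel is loaded and that the
    # crashkernel= parameter is present on the kernel command line.
    from pathlib import Path

    def kdump_ready() -> bool:
        # "1" in this file means a crash (kexec) kernel is loaded and ready.
        loaded = Path("/sys/kernel/kexec_crash_loaded").read_text().strip() == "1"
        # The crash kernel needs reserved memory, set via the crashkernel= boot option.
        reserved = "crashkernel=" in Path("/proc/cmdline").read_text()
        return loaded and reserved

    if __name__ == "__main__":
        print("kdump ready" if kdump_ready() else "kdump NOT ready")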
