[SOLVED] Investigating Gradual Lock-up and Failure of Proxmox

KingDigweed

Hi all,

Hoping for some guidance on investigating a strange problem I've been running into with my new Proxmox machine - strange enough that I can't really make the title of this post any less vague than it currently is! As a heads up, the next couple of paragraphs are background info - I hope I'm only including useful information, but I understand if people skim over it.

Background

A month ago I bought a used Dell OptiPlex 7050 to replace an old PowerEdge I'd been running Proxmox on for a couple of years. It's a pretty simple setup - no ZFS, HA, etc. A new SSD formatted as Ext4 has Proxmox installed on it and is also used to run my VMs/CTs. I have disabled swap on this system to help with SSD wear, but am very careful to make sure my VMs never run out of memory. There's also an Intel i350-T2 NIC - I've bridged its two ports, and these are used solely by an OPNsense VM as the WAN and LAN ports. The built-in Ethernet port on the motherboard is for Proxmox and all other VMs. I also have two hard drives, formatted as Ext4 and mounted as simple directories; a couple of CTs use mount points to store things on them.
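To give an idea of how the CTs use those drives: they're just bind mounts of host directories, set up along the lines of the below (the paths here are placeholders rather than my exact ones):
Code:
# bind-mount a directory from one of the Ext4 data drives into a container
pct set 381 -mp0 /mnt/hdd1/media,mp=/media
# which shows up in /etc/pve/lxc/381.conf as:
# mp0: /mnt/hdd1/media,mp=/media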

The issue I'm currently experiencing is the same as what I was experiencing two weeks ago (more detail further on). After much troubleshooting I eventually decided to simply return the machine for another. Being used hardware (although from a reputable seller), I thought this was worth a shot. When I received the replacement, I reinstalled Proxmox from fresh and then very gradually began restoring each of my VMs from backup. At first, just the OPNsense VM to get my router up and running. After a couple of days of stability, I decided it was "OK" and restored some more CTs, waiting at least a day or two before adding more back. For the last couple of days, all VMs and CTs have been back and working - about 72 hours of everything running fine. About 30 minutes ago, however, the same problem as before seemingly struck again.

The Problem

To describe the problem: Proxmox eventually dies completely and goes unreachable. I can't get to the GUI, ping the host, SSH in, etc. But it doesn't happen instantly. Today, I was working on my Plex container, just tweaking very basic settings in the web GUI. I then noticed things starting to load slowly as I navigated around, before failing to load entirely and the web GUI seeming to break. When I tried refreshing the page, reopening in a new tab, or checking the web GUIs of other services running in the same container, they all failed. It was then about 15-30 seconds before the Proxmox GUI was totally unreachable and internet access was lost (because the VM running my OPNsense router also died, I guess). I had no choice but to power down the machine using the Intel vPro utility and reboot.

A quick note about this: Proxmox runs at 192.168.10.5, and the vPro controller runs on the same Ethernet port but at 192.168.10.4. Both hosts went down as the system crashed, though after a couple of minutes the vPro controller seemed to come back to life and I was able to issue a shutdown command before the connection dropped again.

Once the machine powered off and rebooted, everything seemed to come back OK. At the moment I have my suspicions about two containers in particular - one running a Project Zomboid LGSM server and the other running Plex, qBittorrent, the "Arr" apps, etc. As a result, I'm keeping these powered off for the meantime. This might be totally off; it's just that I restored those most recently, and a couple of days later these issues came back...

What I've Found So Far

I've attached some log files that I think will be of use - note that the crash happened at around 21:35 on 27/11/2022. From glancing at them, I see multiple references to "general protection fault" and other info at the time of the crash, but I feel out of my depth, so hopefully others can get the right info out of the logs (and out of me)!
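In case it's useful, these are the sort of commands I've been using to pull the relevant bits out of the logs (nothing clever, just grep and journalctl):
Code:
grep -i "general protection" /var/log/kern.log
grep -iA20 "call trace" /var/log/kern.log
# kernel messages from the previous boot, if the journal is persistent:
journalctl -k -b -1 --no-pager | tail -n 200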

BIOS is up to date - I did this when I received the machine. Temps are fine - even under the heaviest load this machine sees, temperatures never exceed 55°C; usual temps are around 40°C. Proxmox is up to date as well - 7.3-3. RAM is not overcommitted, nor approaching full: I have 24GB and only 8GB or so are allocated. I am not aware of any of my VMs or CTs getting close to using their full RAM allocation.
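For completeness, this is roughly how I'm checking those figures (sensors comes from the lm-sensors package, which isn't installed by default):
Code:
pveversion   # Proxmox VE version and running kernel
sensors      # CPU temperatures (lm-sensors package)
free -h      # RAM usage, and confirms swap is off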

On the first machine, I installed the intel-microcode package, but the problem persisted. I have therefore not yet done this on this machine.
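For anyone wanting to try the same, this is roughly what I did on the first machine (PVE 7.x sits on Debian Bullseye, so the non-free component needs enabling first):
Code:
# add "non-free" to the Debian lines in /etc/apt/sources.list, e.g.:
#   deb http://deb.debian.org/debian bullseye main contrib non-free
apt update
apt install intel-microcode
# reboot, then check the update was applied:
dmesg | grep -i microcode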

Running pct fsck 381 on my Plex container, for example, first returns: MMP interval is 10 seconds and total wait time is 42 seconds. Please wait..., and after waiting:
Code:
/dev/mapper/pve-vm--381--disk--0: recovering journal
/dev/mapper/pve-vm--381--disk--0: Clearing orphaned inode 9472 (uid=0, gid=0, mode=0100600, size=0)
/dev/mapper/pve-vm--381--disk--0: clean, 50490/524288 files, 1005597/2097152 blocks
When I first restored this container from backup, I ran an fsck and it returned "clean" with no MMP interval message. I ran the same command immediately after rebooting from the crash and received the above. After my system began crashing on the old OptiPlex which I returned, I also started seeing this MMP interval message.
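As far as I understand it, that message comes from ext4's multi-mount protection (MMP) feature, which makes fsck wait out the update interval to be sure nothing else has the volume mounted. This is a quick way to check whether MMP is actually enabled on the volume (happy to be corrected if I've misread it):
Code:
tune2fs -l /dev/mapper/pve-vm--381--disk--0 | grep -i mmp
# or look at the feature list in the superblock:
dumpe2fs -h /dev/mapper/pve-vm--381--disk--0 | grep -i features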

Please ask as much as you'd like - I'll do anything to get to the root of this!

Many thanks,

Chris
 

Attachments

  • kern.log (308.1 KB)
  • syslog.txt (527.8 KB)
  • messages.txt (152.3 KB)
  • daemon.log (214.6 KB)
Could you test the 5.19 kernel as well [0]? We've seen issues with a few CPUs. It may help improve the situation.

This also seems interesting:
Code:
Nov 27 21:38:30 OptiPlex7050 kernel: [450131.413668] CIFS: VFS: \\192.168.0.3\Proxmox BAD_NETWORK_NAME: \\192.168.0.3\Proxmox

I'd also suggest adding the CTs back slowly, testing before adding the next one, to see if perhaps it's related to one of them.


[0] https://forum.proxmox.com/threads/opt-in-linux-5-19-kernel-for-proxmox-ve-7-x-available.115090/
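For reference, the opt-in kernel from the linked thread can be installed like this; reboot into it afterwards:
Code:
apt update
apt install pve-kernel-5.19
# after the reboot, confirm the running kernel with:
uname -r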
 
Thank you very much for the reply Mira. I was just reading through that thread and will give it a go.

What does that BAD_NETWORK_NAME error suggest to you? I have a CIFS share hosted at that address which I was restoring backups from; however, I have disabled it in the GUI for the last couple of days, so I'm surprised it's still making lots of references to it...
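For what it's worth, this is roughly how I've been checking that the share really is disabled (the storage name below is just an example, not my actual one):
Code:
pvesm status        # a disabled storage should show up as "disabled" here
mount -t cifs       # check whether the kernel still has the share mounted
# it can also be disabled from the CLI rather than the GUI:
pvesm set backup-cifs --disable 1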

Totally agree with the approach of slowly adding back CTs until things get cranky. I have been doing this so far, and it's why I'm suspicious of the two CTs I mentioned - things were stable until I added these back. I'm just wary of barking up the wrong tree with that...

I'll be sure to report back with the new kernel findings.
 
Hi all,

It's now been practically a week on the new kernel, so I decided to update this thread.

I've now successfully restored all previous VMs & containers, and even created new ones - all without a single crash! I'm almost certain that moving to the 5.19 kernel is what sorted it, so thank you for the suggestion, Mira.

Kind regards,

Chris
 