Proxmox web GUI, SSH, and qm commands stop working after the machine has been up for a few minutes

codym

Jul 11, 2022
Hello all, I am coming here after trying to research this on my own for the last day or two.
I recently enabled GPU passthrough and moved all of my VMs over to this machine from a Supermicro with similar specs that ran 7.2. The configuration is as follows:

HP Z440 workstation
Xeon E5-2699A v4
128GB ECC 2133
3 drive pools
-boot pool/vmdisk pool - 2x 512GB Crucial mirror
-vm disk pool 2 - 2x 256GB SanDisk SSD mirror
-hard drive pool - 2x 14TB WD datacenter drives mirror
Quadro M4000
Quadro P4000
Mellanox ConnectX-2 10GbE

The issue seems to happen when any VM with a GPU is modified or has a GPU passed through and is then started (or sometimes even if it isn't started). I have also noticed that RAM usage climbs when a VM boot disk is on the hard drive array (possibly an I/O limitation). RAM usage is always 106GB when the issue is happening, but on a fresh boot with all VMs running only 32GB is used.
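For anyone trying to track the same kind of RAM growth, here is a minimal Python sketch that reads /proc/meminfo-style text and reports how much memory is in use. The function names are made up for illustration. One assumption worth stating: on a ZFS host the ARC is generally counted as used memory rather than as page cache, so a number like this can include ZFS caching and not only VM memory.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field: value in kB}."""
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info


def used_gib(info):
    """Memory not available for new allocations, in GiB (MemTotal - MemAvailable)."""
    return (info["MemTotal"] - info["MemAvailable"]) / (1024 ** 2)
```

On the host this could be called as used_gib(parse_meminfo(open("/proc/meminfo").read())) right after boot and again when the GUI starts to hang, to compare the two readings.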

The VMs I have are:
LM for Plex, 85GB SSD, 7000GB HDD on the HDD pool, 8GB RAM, 6 cores
Windows gaming VM, 150GB SSD on VM pool 2, 16GB RAM, 8 cores, Quadro M4000
NS1 for DNS and NTP, 32GB SSD, 4GB RAM, 2 cores
general Windows VM, 6 cores, 8GB RAM, 240GB SSD on HDD pool
HAProxy VM, 2 cores, 4GB RAM, 64GB SSD on boot pool
internal services Ubuntu VM, 2 cores, 4GB RAM, 64GB SSD on boot pool
NS2 name server, 2 cores, 4GB RAM, 32GB SSD on boot pool
(I think that's it; I can't check because the web GUI is down)
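As a quick sanity check, the allocations in the list above can be added up. This is only a tally of the numbers as posted, assuming the list is complete (the poster was not sure), and it ignores per-VM overhead on the host.

```python
# (name, ram_gb, cores) copied from the VM list above
VMS = [
    ("LM (Plex)", 8, 6),
    ("Windows gaming", 16, 8),
    ("NS1", 4, 2),
    ("general Windows", 8, 6),
    ("HAProxy", 4, 2),
    ("internal services", 4, 2),
    ("NS2", 4, 2),
]


def totals(vms):
    """Return (total RAM in GB, total cores) allocated across the VMs."""
    total_ram = sum(ram for _, ram, _ in vms)
    total_cores = sum(cores for _, _, cores in vms)
    return total_ram, total_cores


ram_gb, cores = totals(VMS)
print(f"allocated: {ram_gb} GB RAM, {cores} cores")
```

If the list is complete, the configured VM memory is well below the 106GB seen when the problem occurs, so something other than the VM allocations would have to account for the difference.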


What happens is: after booting, the Proxmox GUI loads fine and I can modify, start, and stop VMs. After a little while, the start and restart commands begin to fail with a systemd timeout, and eventually the whole web interface stops responding, causing my HAProxy to show a 503 (backend server cannot handle request). SSH takes ages to respond and eventually accepts the password but never logs in. The weird thing is that all VMs keep 100% performance and are all accessible on the network. I saw another thread that pointed to something related to D-Bus overload. It seems like systemd is stuck on something and can't get past it, a bit like a switch loop or broadcast storm, and I am at a loss for what it could be. I am on the very latest version of Proxmox 8, as downloaded from the website.
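One thing that can be checked while SSH still responds is whether host processes are stuck in uninterruptible sleep (state "D" in /proc/<pid>/stat). That this is related to the hang described above is an assumption, not something confirmed in this thread. A rough Python sketch, with made-up function names:

```python
import os


def parse_stat(stat_line):
    """Return (pid, comm, state) from one /proc/<pid>/stat line."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    left = stat_line.index("(")
    right = stat_line.rindex(")")
    pid = int(stat_line[:left])
    comm = stat_line[left + 1:right]
    state = stat_line[right + 1:].split()[0]
    return pid, comm, state


def blocked_processes(proc_root="/proc"):
    """List (pid, comm) for processes currently in state 'D'."""
    blocked = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric directories are processes
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited or the line was unreadable
        if state == "D":
            blocked.append((pid, comm))
    return blocked
```

Calling blocked_processes() a few times in a row and seeing the same PIDs each time would suggest those tasks are waiting on I/O or a driver and not making progress.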

Any help would be ever so appreciated!!
 
I believe I have fixed my issue. I am not sure exactly what fixed the issue as I changed multiple things at once, but I will share what I did in case it helps someone.

RAM compatibility issues. I believe this was a timing issue or possibly a faulty DIMM. I had 2 sets of 4x 16GB DIMMs. The two sets had very similar specs and actually came from the same Cisco server, but some numbers were different, and when one set was installed Proxmox threw memory errors in the console.

The second thing I did was give Proxmox a dedicated AMD GPU to do its thing with. My Intel Xeon has no integrated GPU, and the 2 Quadros were being passed through with the NVIDIA drivers disabled, leaving Proxmox nothing. I used AMD because, with the NVIDIA drivers disabled for passthrough, using an NVIDIA card for the host would break things.

Setting the proper GPU as default in the BIOS. I set the AMD GPU as default in the BIOS to prevent any hardware handoff issues when vfio rips the Quadros away from the OS.
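To confirm which driver each GPU actually ends up bound to after boot, the PCI entries in sysfs can be read directly. Below is a small Python sketch under the assumption that the standard /sys/bus/pci/devices layout is present; the function name is invented for illustration. It keeps only devices whose PCI base class is 0x03 (display controller) and reports the driver each one is bound to, if any.

```python
import os


def display_devices(pci_root="/sys/bus/pci/devices"):
    """Return {pci_address: driver name or None} for display-class PCI devices."""
    result = {}
    for addr in sorted(os.listdir(pci_root)):
        dev = os.path.join(pci_root, addr)
        try:
            with open(os.path.join(dev, "class")) as fh:
                pci_class = int(fh.read().strip(), 16)  # e.g. "0x030000"
        except (OSError, ValueError):
            continue
        if pci_class >> 16 != 0x03:  # base class 0x03 = display controller
            continue
        link = os.path.join(dev, "driver")
        # The "driver" symlink points at the bound driver; it is absent if unbound.
        result[addr] = os.path.basename(os.readlink(link)) if os.path.islink(link) else None
    return result
```

With the setup described here, one would expect the two Quadros to show vfio-pci and the AMD card to show its own host driver, though that expectation is an assumption and not output from the poster's machine.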

Resetting the BIOS entirely. I ran a full BIOS reset to clear any potential issues left over from the previous server configs.

Once I did all of these, literally all of my issues went away. The main reason I wasn't able to detect the memory issues earlier is that no display output was accessible to Proxmox, and the errors only showed on the physical display.

I believe the main cause was the memory. When the console and systemd eventually broke again after I finally got a display working, I started getting "MCE records pool full" messages, which, on top of the MCE hardware errors, gave me the final clue that RAM was the issue.
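Since these errors were only visible on the physical console, it can help to pull them out of the kernel log instead (for example from saved dmesg or journalctl -k output). A minimal Python sketch follows; the pattern is a guess at useful keywords, not an exhaustive list, and the function name is made up.

```python
import re

# Keywords that commonly appear in machine-check and memory error messages.
# This list is an assumption; extend it for your own logs.
MCE_PATTERN = re.compile(r"\bmce\b|hardware error|\bedac\b", re.IGNORECASE)


def hardware_error_lines(log_text):
    """Return the lines of a kernel log that mention MCE/EDAC/hardware errors."""
    return [line for line in log_text.splitlines() if MCE_PATTERN.search(line)]
```

Feeding it the text of a saved kernel log returns only the matching lines, which makes it easy to see whether memory errors were being logged long before the GUI stopped responding.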

Until I can get more of the same RAM I will have to run at 64GB, but since all VMs only use about 36GB combined, that leaves plenty for ZFS caching.
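To see how much of that leftover memory ZFS is really using for caching, the ARC counters in /proc/spl/kstat/zfs/arcstats can be read. Here is a small Python sketch, assuming the usual three-column "name type data" layout of that file; the function names are made up.

```python
def parse_arcstats(text):
    """Parse arcstats-style text into a dict of {counter name: integer value}."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows have exactly three columns: name, type, data.
        if len(fields) == 3 and fields[2].isdigit():
            stats[fields[0]] = int(fields[2])
    return stats


def arc_size_gib(stats):
    """Current ARC size in GiB, taken from the 'size' counter."""
    return stats["size"] / (1024 ** 3)
```

On the host, arc_size_gib(parse_arcstats(open("/proc/spl/kstat/zfs/arcstats").read())) gives the current cache size, which can be compared with the roughly 28GB that 64GB minus 36GB leaves free.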

I love proxmox so much and have enjoyed using it, and appreciate all that the developers have put into making it work so well. I am glad my server is finally working!
 