Hello all, I am coming here after trying to research this on my own for the last day or two.
I recently enabled GPU passthrough and moved all of my VMs over to this machine from a Supermicro with similar specs that was running Proxmox 7.2. The configuration is as follows:
HP Z440 workstation
Xeon E5-2699A v4
128 GB ECC 2133
3 drive pools:
- Boot pool / VM disk pool: 2x 512 GB Crucial (mirror)
- VM disk pool 2: 2x 256 GB SanDisk SSD (mirror)
- Hard drive pool: 2x 14 TB WD datacenter drives (mirror)
Quadro M4000
Quadro P4000
Mellanox ConnectX-2 10GbE
The issue seems to happen when any VM with a GPU is modified or has a GPU passed through and is then started (and sometimes even when neither has happened). I have also noticed that RAM usage goes up if a VM's boot disk is on the hard drive pool (possibly an I/O limitation). RAM usage is always 106 GB when the issue is happening, whereas with all VMs running after a fresh boot only 32 GB is used.
The VMs I have are:
LM for Plex: 85 GB SSD, plus 7000 GB HDD on the HDD pool, 8 GB RAM, 6 cores
Windows gaming VM: 150 GB SSD on VM pool 2, 16 GB RAM, 8 cores, Quadro M4000
NS1 for DNS and NTP: 32 GB SSD, 4 GB RAM, 2 cores
General Windows VM: 240 GB SSD on the HDD pool, 8 GB RAM, 6 cores
HAProxy VM: 64 GB SSD on the boot pool, 4 GB RAM, 2 cores
Internal services Ubuntu VM: 64 GB SSD on the boot pool, 4 GB RAM, 2 cores
NS2 name server: 32 GB SSD on the boot pool, 4 GB RAM, 2 cores
(I think that's it; I can't check because the web GUI is down.)
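For reference, here is a quick tally of what the VMs above are allocated, as a small Python snippet. The numbers come straight from my list, which is from memory, so they may be slightly off; the VM labels in the dict are just my own names for them.

```python
# RAM (GB) and cores per VM, as listed above (from memory; may be incomplete)
vms = {
    "LM (Plex)": (8, 6),
    "Windows gaming": (16, 8),
    "NS1": (4, 2),
    "General Windows": (8, 6),
    "HAProxy": (4, 2),
    "Internal services (Ubuntu)": (4, 2),
    "NS2": (4, 2),
}

total_ram = sum(ram for ram, _ in vms.values())
total_cores = sum(cores for _, cores in vms.values())

HOST_RAM_GB = 128
OBSERVED_RAM_GB = 106  # host RAM usage whenever the problem shows up

print(f"RAM allocated to VMs: {total_ram} GB of {HOST_RAM_GB} GB")
print(f"Cores allocated to VMs: {total_cores}")
print(f"Used beyond VM allocations when stuck: {OBSERVED_RAM_GB - total_ram} GB")
```

If the list is right, the guests are only configured for 48 GB in total, so roughly 58 GB of the 106 GB is going to something other than the VMs' configured memory when the hang happens.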
What happens is that after booting, the Proxmox GUI loads fine and I can modify/start/stop VMs, but after it has been running for a little while, start/restart commands begin to fail with a systemd timeout. Eventually the whole web interface stops responding, so my HAProxy shows a 503 (backend server cannot handle the request). SSH takes ages to connect and eventually accepts the password but never logs in. The weird thing is that all the VMs keep running at full performance and are all reachable on the network.
I saw another thread where something similar was related to D-Bus overload. It seems like the system gets stuck on something and can't get past it, a "switch loop/broadcast storm" kind of behavior with systemd, but I am at a loss for what it could be. I am on the very latest version of Proxmox 8, as downloaded from the website.
Any help would be ever so appreciated!!