New Proxmox host crashes without logs when running 2 or more Q35 VMs.

compiz

Hi, I used to have an older Proxmox host (Ryzen 5 3600) at Hetzner. I closed it and rented a new one with an AMD Ryzen 7 PRO 8700GE. I have a PBS server, so migrating my VMs was easy.
Now I am facing a problem. I have a Windows Server 2022 VM that is Q35 (v9.0) and a few Linux VMs (i440fx, latest) based on Debian and Ubuntu cloud image templates. When I run one Q35 VM plus the i440fx VMs, everything is fine! But when I try to start a second Q35 VM, Windows 11 or macOS, the host crashes/reboots, and there are no logs showing what is causing this!
I am not overusing RAM; in fact the host still has 8 GB free even if I power them all on. The storage on the host is two 512 GB NVMe drives in RAID 0. I don't have any device passthrough, although as a second question I would like to ask whether I can pass the iGPU through to either the macOS VM or one of the Windows 11 VMs. My main problem is the crashes, though.
At first I thought it was the RAM, because this usually happens when RAM usage passes 34 GB. I had Hetzner stress test the host's CPU and RAM for 30 minutes: 0 errors.
I was reading the forum here; most similar problems involve passthrough, but I am not passing anything through, and it only happens with 2 or more Q35 VMs. I can start 20 i440fx VMs and all is fine CPU-, RAM- and storage-wise. I have tried the last 3 kernels and the issue remains in all of them, or at least I didn't see any fix for it in any of them.

Also, when the server comes back online after the crash, it can run the 2 Q35 VMs without issue until I reboot or power on the second one again; then it crashes again, and so on.
Installed using the tool/guide found here: https://github.com/chmaikos/proxmox-hetzner

Any ideas or suggestions?
Host info:
CPU: AMD Ryzen 7 PRO 8700GE
RAM: 64GB DDR5
Storage: 2x512GB NVMe Raid 0 ZFS
PVE: 8.2.4
Kernel: Linux pve 6.8.12-1-pve #1 SMP PREEMPT_DYNAMIC PMX 6.8.12-1 (2024-08-05T16:17Z) x86_64
In all VMs I am using host as CPU setting


Kind regards,
George
 
I would not think that this is a kernel issue. Can you post the output of 'dmesg', the two VM configs (qm config ID) and your versions (pveversions -v) here?
 
If you haven't limited the ZFS ARC memory size, it will use up to 50% of your host RAM.
The Proxmox ISO installer nowadays sets it to 10%.
As you've done a manual installation, it seems the 50% default limit is used.
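
For reference, a rough sketch of what those percentages mean in bytes on a 64 GB host like the one in the first post. The helper name arc_max_bytes is made up for this example, and it treats 64 GB as 64 GiB; the result is the kind of number the zfs_arc_max module parameter takes.

```shell
#!/usr/bin/env bash
# Hypothetical helper: ARC limit in bytes for a given RAM size (GiB)
# and a percentage of that RAM. Integer arithmetic, so the result is
# rounded down to a whole byte.
arc_max_bytes() {
    local ram_gib=$1 percent=$2
    echo $(( ram_gib * 1024 * 1024 * 1024 * percent / 100 ))
}

# 10% of a 64 GiB host
arc_max_bytes 64 10    # prints 6871947673
```

The limit currently in effect can be read from /sys/module/zfs/parameters/zfs_arc_max (0 means the built-in default is used). Setting it is usually done with a line such as "options zfs zfs_arc_max=6871947673" in /etc/modprobe.d/zfs.conf, but check the Proxmox ZFS documentation for the exact steps on your boot setup.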
 
I have limited the ZFS ARC to 10%, following the guide above, if I recall correctly. Also, I can see on the host (and via btop) that total RAM usage is below 70%. The crash can also happen right after a reboot with only the 2 VMs running, which use 16 GB (Windows Server) and 4 GB (macOS), so RAM usage is low in general.
sure:
Code:
qm config 1030
agent: 1,fstrim_cloned_disks=1
audio0: device=ich9-intel-hda,driver=none
balloon: 8192
bios: ovmf
boot: order=virtio0
cores: 4
cpu: host
efidisk0: local-zfs:vm-1030-disk-0,efitype=4m,pre-enrolled-keys=1,size=1M
machine: pc-q35-9.0
memory: 16384
meta: creation-qemu=7.0.0,ctime=1662478071
name: Windows2022
net0: virtio=2B:F2:F5:C2:53:08,bridge=vmbr1
numa: 0
onboot: 1
ostype: win11
scsihw: virtio-scsi-pci
smbios1: uuid=b116f805-91c2-46e7-8311-56a7344ca810
sockets: 1
startup: order=5,up=60
tags: Windows,Server
tpmstate0: local-zfs:vm-1030-disk-1,size=1M,version=v2.0
vga: vmware,memory=512
virtio0: local-zfs:vm-1030-disk-2,discard=on,size=200G
vmgenid: 380cd18e-fd10-4278-b43e-5504a70a2bc1

and

Code:
qm config 1300
args: -device isa-applesmc,osk="ourhardworkbythesewordsguardedpleasedontsteal(c)AppleComputerInc" -smbios type=2 -device usb-kbd,bus=ehci.0,port=2 -global nec-usb-xhci.msi=off -global ICH9-LPC.acpi-pci-hotplug-with-bridge-support=off -cpu Haswell-noTSX,vendor=GenuineIntel,+invtsc,+hypervisor,kvm=on,vmware-cpuid-freq=on
audio0: device=ich9-intel-hda,driver=none
bios: ovmf
boot: order=virtio0;net0
cores: 8
cpu: host
efidisk0: local-zfs:vm-1300-disk-0,efitype=4m,size=1M
machine: q35
memory: 4096
meta: creation-qemu=7.2.0,ctime=1679247525
name: MacOSSonoma
net0: vmxnet3=5A:3C:02:58:6F:01,bridge=vmbr1
numa: 0
ostype: other
scsihw: virtio-scsi-pci
smbios1: uuid=ce37cfce-1f82-4b57-83ae-de19d7c25c29
sockets: 1
tags: macos;sonora
vga: vmware
virtio0: local-zfs:vm-1300-disk-1,cache=unsafe,iothread=1,size=64G
vmgenid: 96215bca-3187-4e8d-9996-0df7ffe6a254

dmesg is too long, can I somehow export it to a txt file? In the terminal the output is so long that the first lines scroll out of the buffer.
Also, I added the audio device on both VMs today as a test for something else; it wasn't there when they were originally created.
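
On the export question: redirecting the output to a file is enough. A minimal sketch follows; the helper name save_output and the file names are made up for this example.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command and save everything it prints
# (stdout and stderr) to the file given as the first argument.
save_output() {
    local outfile=$1
    shift
    "$@" > "$outfile" 2>&1
}

# Example calls on the Proxmox host (run as root):
#   save_output /root/dmesg.txt dmesg -T
#   save_output /root/dmesg-previous-boot.txt journalctl -k -b -1
```

The second call asks journalctl for the kernel messages of the previous boot, which is the boot that ended in the crash. It only returns something if the journal is kept across reboots, and a hard reset may still cut the log off before anything useful was written.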

Code:
pveversions -v
-bash: pveversions: command not found

pveversion -v
proxmox-ve: 8.2.0 (running kernel: 6.8.12-1-pve)
pve-manager: 8.2.4 (running version: 8.2.4/faa83925c9641325)
proxmox-kernel-helper: 8.1.0
proxmox-kernel-6.8: 6.8.12-1
proxmox-kernel-6.8.12-1-pve-signed: 6.8.12-1
ceph-fuse: 17.2.7-pve3
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx9
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-4
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.1
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.3
libpve-access-control: 8.1.4
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.0.7
libpve-cluster-perl: 8.0.7
libpve-common-perl: 8.2.2
libpve-guest-common-perl: 5.1.4
libpve-http-server-perl: 5.1.0
libpve-network-perl: 0.9.8
libpve-rs-perl: 0.8.9
libpve-storage-perl: 8.2.3
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.4.0-3
proxmox-backup-client: 3.2.7-1
proxmox-backup-file-restore: 3.2.7-1
proxmox-firewall: 0.5.0
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.2.3
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.6
proxmox-widget-toolkit: 4.2.3
pve-cluster: 8.0.7
pve-container: 5.1.12
pve-docs: 8.2.3
pve-edk2-firmware: 4.2023.08-4
pve-esxi-import-tools: 0.7.1
pve-firewall: 5.0.7
pve-firmware: 3.13-1
pve-ha-manager: 4.0.5
pve-i18n: 3.2.2
pve-qemu-kvm: 9.0.2-2
pve-xtermjs: 5.3.0-3
qemu-server: 8.2.4
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.4-pve1
 
We have the same issue with the same server at Hetzner.

When we ask for a hardware replacement, the servers stop rebooting, but the hardware does not seem to be the core of the issue: when running stress tests inside a VM the server is fine; it only reboots under real-world load.

Sometimes it reboots only once in a month or more, sometimes 3-5 times a day. Very, very strange...

So far we have tried kernels 6.8.8-4-pve and 6.5.13-5-pve; both have the same issue. We even tried editing the GRUB command line, such as:
GRUB_CMDLINE_LINUX_DEFAULT="consoleblank=0 nomodeset noapic pci=assign-busses apicmaintimer idle=poll reboot=cold,hard"
or
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash processor.max_cstate=1 idle=nomwait"
None of them has made any difference.
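
When testing kernel parameters like these, it can be worth confirming after the reboot that they were really applied, since the running kernel's command line is visible in /proc/cmdline. A small sketch; the function name has_kernel_param is made up for this example.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if the given parameter appears as a whole
# word in a kernel command line. The command line defaults to the one
# of the running kernel (/proc/cmdline) when no second argument is given.
has_kernel_param() {
    local param=$1 cmdline=${2:-$(cat /proc/cmdline)}
    case " $cmdline " in
        *" $param "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example on the host, after rebooting:
#   has_kernel_param idle=nomwait && echo "applied" || echo "missing"
```

If a parameter is missing, one possible reason is that the edited file is not the one the bootloader reads. With GRUB, changes to /etc/default/grub only take effect after running update-grub and rebooting.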


The similarity with your issue is that you are using "host" in the CPU settings, but we use BTRFS instead of ZFS and all of our VMs are Ubuntu.
 
Well, I have only had it for less than 40 days, so I can't speak for a month of uptime, but it definitely does that multiple times a day when I try to work on 2 or more Q35 VMs. The odd thing is that it runs the i440fx VMs fine, as many as I want, or at least it hasn't crashed on me in that scenario.
I forgot to mention that I am also running a pfSense VM, which powers on first in order to provide network to the rest of the VMs. That is also an i440fx VM.
 
I am not sure that it's a Q35 or i440fx issue, because we had servers with 70+ days of uptime that randomly started rebooting. The kernel on them is pinned and we even disabled `apt-get update`.

In my opinion something updated or changed in the last month, but I am not sure what would auto-update...
 
Check if the ARC size is correct with arc_summary.
This is with three and a half days of uptime:


It doesn't seem to be related to ARC because, as I said, it can crash right after a reboot with only the pfSense, Windows Server 2022 and macOS VMs running, which in total take 2+16+4=22 GB of RAM out of 64. I doubt the system has time to build up cache in the 2 minutes between the reboot and the next crash.
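
The RAM arithmetic can be cross-checked against the VM configs instead of being added up by hand. A small sketch, assuming the "memory:" line format shown in the qm config output earlier in the thread; the function name sum_memory_mib is made up for this example.

```shell
#!/usr/bin/env bash
# Hypothetical helper: add up the "memory:" values (in MiB) found in
# `qm config` output read from standard input. Other keys, such as
# "balloon:" or "vga: vmware,memory=512", are ignored.
sum_memory_mib() {
    awk -F': *' '$1 == "memory" { total += $2 } END { print total + 0 }'
}

# The two values posted in this thread: 16384 (VM 1030) and 4096 (VM 1300)
printf 'memory: 16384\nmemory: 4096\n' | sum_memory_mib    # prints 20480
```

On the host this could be fed with something like: for id in 1030 1300; do qm config "$id"; done | sum_memory_mib. The two posted configs add up to 20480 MiB (20 GiB), which together with the 2 GB for pfSense matches the 22 GB figure.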
 

Attachments

  • arc.txt (32.3 KB)
For me this was a clean install, directly of 8.2, after which I ran:
apt update && apt dist-upgrade
so it was, and still is, fully updated from day 0.
Most of the VMs were created on the 3600 system, but I have also tried with freshly created VMs on this host with the same result; that's why I am keeping the old VMs created on the 3600.
I have now asked Hetzner for KVM access. I went into the BIOS and saw that SR-IOV was disabled. I have enabled it, since I will later try to pass the iGPU through to a VM, but first I want the system to be stable before I try PCIe passthrough.
 
"pin the kernel to version in 6.5.x" I did that; it does not fix the issue for me. I think I read a topic with this suggestion yesterday; it seems to be for an issue unrelated to what I describe.
But after enabling SR-IOV in the BIOS of the machine, I have so far been able to run a couple of Q35 VMs without the server crashing. I am not ready to call it a day and say it was this setting, though; I need to run more tests. I will post updates here.
 
Try pc-q35-5.1
Didn't work with that either.
Also, enabling SR-IOV didn't solve the problem either; it makes the crashes less frequent, but it still crashes.
If I may also add: at home I have a "mini PC", an ASRock B660 with an Intel i3-13100. I have installed an RX 6600 in it, which fits in this little case, and a Coral.ai TPU. I pass the TPU through to Home Assistant OS in order to run Frigate, and I pass the RX 6600 through to a Windows 11 VM where I run some old games.
On the same PC, I am trying to set up a Windows Server 2022 VM. I made it Q35 as well because my plan was to pass through the CPU's iGPU, but well before I get to the part where I pass through the iGPU, when I start this VM while the other 2 are running, the host crashes as well!
So I am pretty sure something doesn't work well with Q35 when more than 2 VMs are running it.
 
Does the server have DDR5 RAM?
Can you try setting it to a fixed speed in the BIOS, somewhat lower than the maximum speed the installed RAM should be capable of?
I've had similar issues with 4800 MHz RAM. It is now running seemingly steady at 4400 MHz.
 
Yep, the server has DDR5 RAM.
I have tried the new kernel that was released today; the server still crashes when I try to launch a second Q35 VM.
Also, over the past 2 weeks, with no changes, it has crashed at random times without me starting any new VM, just the standard: pfsense
 
