[SOLVED] After updating to Proxmox 9, virtual machines crash.

uzumo

Since updating to Proxmox VE 9, my virtual machines have been crashing for no apparent reason.

When it happens, the virtual machine enters the internal-error state.

The issue only occurs on Proxmox VE 9, not on Proxmox VE 8.

It does not happen every time, but it does occur in environments where GPU passthrough is enabled.

If anyone knows the cause or how to troubleshoot this error, I would be grateful if you could let me know.

version
Code:
pveversion
pve-manager/9.0.6/49c767b70aeb6648 (running kernel: 6.14.8-2-pve)

error
Code:
Aug 25 14:13:29 pve1 QEMU[449875]: error: kvm run failed Bad address
Aug 25 14:13:29 pve1 QEMU[449875]: RAX=ffff968d5f2a7e08 RBX=000000000000019c RCX=ffff968d5f2a7e08 RDX=ffffed78588951f8
Aug 25 14:13:29 pve1 QEMU[449875]: RSI=0000000000000000 RDI=fffff80060b32bc2 RBP=ffff968d5f2fa048 RSP=ffff8104bd756698
Aug 25 14:13:29 pve1 QEMU[449875]: R8 =000000000000019c R9 =0000000000000005 R10=ffff968d4f3cb040 R11=ffff8405b7b3d19c
Aug 25 14:13:29 pve1 QEMU[449875]: R12=0000000000000005 R13=0000000000000083 R14=ffff968d5f2a7e08 R15=ffff8405b7b3d000
Aug 25 14:13:29 pve1 QEMU[449875]: RIP=fffff80060b0ff52 RFL=00050283 [--S---C] CPL=0 II=0 A20=1 SMM=0 HLT=0
Aug 25 14:13:29 pve1 QEMU[449875]: ES =002b 0000000000000000 ffffffff 00c0f300 DPL=3 DS   [-WA]
Aug 25 14:13:29 pve1 QEMU[449875]: CS =0010 0000000000000000 00000000 00209b00 DPL=0 CS64 [-RA]
Aug 25 14:13:29 pve1 QEMU[449875]: SS =0018 0000000000000000 00000000 00409300 DPL=0 DS   [-WA]
Aug 25 14:13:29 pve1 QEMU[449875]: DS =002b 0000000000000000 ffffffff 00c0f300 DPL=3 DS   [-WA]
Aug 25 14:13:29 pve1 QEMU[449875]: FS =0053 0000000000000000 00013c00 0040f300 DPL=3 DS   [-WA]
Aug 25 14:13:29 pve1 QEMU[449875]: GS =002b ffffe0813dfc0000 ffffffff 00c0f300 DPL=3 DS   [-WA]
Aug 25 14:13:29 pve1 QEMU[449875]: LDT=0000 0000000000000000 ffffffff 00c00000
Aug 25 14:13:29 pve1 QEMU[449875]: TR =0040 ffffe0813dfd0000 00000067 00008b00 DPL=0 TSS64-busy
Aug 25 14:13:29 pve1 QEMU[449875]: GDT=     ffffe0813dfd1fb0 00000057
Aug 25 14:13:29 pve1 QEMU[449875]: IDT=     ffffe0813dfcf000 00000fff
Aug 25 14:13:29 pve1 QEMU[449875]: CR0=80050033 CR2=0000002f651fee88 CR3=00000000001ae000 CR4=00350ef8
Aug 25 14:13:29 pve1 QEMU[449875]: DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
Aug 25 14:13:29 pve1 QEMU[449875]: DR6=00000000ffff07f0 DR7=0000000000000400
Aug 25 14:13:29 pve1 QEMU[449875]: EFER=0000000000000d01
Aug 25 14:13:29 pve1 QEMU[449875]: Code=00 00 4e 8d 1c 02 48 2b d1 73 09 4c 3b d9 0f 87 6e 01 00 00 <0f> 10 04 11 48 83 c1 10 f6 c1 0f 74 12 48 83 e1 f0 0f 10 0c 11 0f 11 00 0f 28 c1 48 83 c1
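
In case it helps anyone comparing crashes, here is a rough sketch of how the relevant lines could be pulled out of the journal. The function name `kvm_failures` is just something made up for this post, and it assumes the dump format shown above (a `kvm run failed` line followed by register lines, including `RIP=`).

```shell
# Keep only the "kvm run failed" line and the RIP line of each QEMU register dump.
# Reads journal text on stdin, so it works on live output or on a saved log file.
kvm_failures() {
    grep -E 'QEMU\[[0-9]+\]: (error: kvm run failed|RIP=)'
}

# Example usage on the host (assumes the systemd journal holds the QEMU messages):
#   journalctl -b --no-pager | kvm_failures
```

Seeing the same RIP offset across several crashes would suggest the guest is failing at the same instruction each time.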

<vmid>.conf
Code:
agent: 1
args: -cpu host,hv_passthrough,-hypervisor,level=35,+vmx,guest>
balloon: 0
bios: ovmf
boot: order=ide0;ide1;virtio0
cores: 20
cpu: host,flags=+pdpe1gb
efidisk0: local-zfs:vm-923-disk-0,efitype=4m,pre-enrolled-keys>
hookscript: local:snippets/rx9070_reset.sh
hostpci0: 0000:04:00,pcie=1,rombar=0,x-vga=1
hostpci1: 0000:83:00,pcie=1
hostpci2: 0000:01:00,pcie=1
ide0: none,media=cdrom
ide1: none,media=cdrom
machine: pc-q35-9.2+pve1
memory: 49152
meta: creation-qemu=8.1.5,ctime=1718161181
name: etc1
net0: virtio=BC:24:11:9E:2C:37,bridge=vmbr0,firewall=1,mtu=1,q>
net1: virtio=BC:24:11:CF:87:C7,bridge=vmbr1,firewall=1,mtu=1,q>
numa: 0
onboot: 1
ostype: win11
rng0: source=/dev/urandom
scsihw: virtio-scsi-single
smbios1: ---
sockets: 1
tablet: 1
tags: default
tpmstate0: local-zfs:vm-923-disk-1,size=4M,version=v2.0
vga: none
virtio0: local-zfs:vm-923-disk-2,iothread=1,size=80G
vmgenid: ---

lspci
Code:
root@pve1:~# lspci -nns 04:00
04:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 48 [RX 9070/9070 XT] [1002:7550] (rev c0)
04:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 48 HDMI/DP Audio Controller [1002:ab40]

root@pve1:~# lspci -nns 01:00
01:00.0 Non-Volatile memory controller [0108]: Sandisk Corp WD_BLACK SN7100 NVMe SSD (DRAM-less) [15b7:5045] (rev 01)

root@pve1:~# lspci -nns 83:00
83:00.0 USB controller [0c03]: Renesas Electronics Corp. uPD720201 USB 3.0 Host Controller [1912:0014] (rev 03)
 
Could this be the same error as the problem that was fixed in kernel 6.14.8-2?

 
With PCI passthrough of the RX 9070 XT configured, I compared performance: Proxmox VE 9 was 11% slower than Proxmox VE 8.

It seems that some change is still having a negative impact.

I am going back to Proxmox VE 8 for now, but I can test again whenever more information becomes available.
 
Regarding performance: the drop appeared once viommu was configured, and after disabling viommu the performance was equivalent again.

It hasn't been long since the change, but so far it hasn't crashed.
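
For anyone who wants to check their own VM: as far as I understand, the vIOMMU option is part of the `machine:` line in the VM config. Below is a small sketch that reports whether it is set. The helper name `has_viommu` is made up for this post, and the VM ID 923 in the usage comment is only inferred from the disk names in my config.

```shell
# Report whether a VM config sets a vIOMMU on its machine: line.
# Reads the config text (as printed by `qm config <vmid>`) on stdin.
has_viommu() {
    if grep -Eq '^machine:.*viommu=(intel|virtio)'; then
        echo "viommu enabled"
    else
        echo "viommu not set"
    fi
}

# Example usage on the host (VM ID is an assumption):
#   qm config 923 | has_viommu
```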

 
This issue only occurs in Proxmox VE 9, but I'm unsure under what conditions it happens and am struggling to resolve it.
If anyone knows how to investigate this, I would appreciate your response.

Until the cause is identified, I will revert to Proxmox VE 8...

Code:
Aug 29 00:33:47 pve1 QEMU[34078]: error: kvm run failed Bad address
Aug 29 00:33:47 pve1 QEMU[34078]: RAX=ffff90830707be08 RBX=000000000000019c RCX=ffff90830707be08 RDX=00006c7e812b81f8
Aug 29 00:33:47 pve1 QEMU[34078]: RSI=0000000000000000 RDI=fffff80761872bc2 RBP=ffff90830705d448 RSP=ffffb58a4daae698
Aug 29 00:33:47 pve1 QEMU[34078]: R8 =000000000000019c R9 =0000000000000005 R10=ffff9083067ba940 R11=fffffd018833419c
Aug 29 00:33:47 pve1 QEMU[34078]: R12=0000000000000005 R13=0000000000000083 R14=ffff90830707be08 R15=fffffd0188334000
Aug 29 00:33:47 pve1 QEMU[34078]: RIP=fffff8076184ff52 RFL=00050212 [----A--] CPL=0 II=0 A20=1 SMM=0 HLT=0
Aug 29 00:33:47 pve1 QEMU[34078]: ES =002b 0000000000000000 ffffffff 00c0f300 DPL=3 DS   [-WA]
Aug 29 00:33:47 pve1 QEMU[34078]: CS =0010 0000000000000000 00000000 00209b00 DPL=0 CS64 [-RA]
Aug 29 00:33:47 pve1 QEMU[34078]: SS =0018 0000000000000000 00000000 00409300 DPL=0 DS   [-WA]
Aug 29 00:33:47 pve1 QEMU[34078]: DS =002b 0000000000000000 ffffffff 00c0f300 DPL=3 DS   [-WA]
Aug 29 00:33:47 pve1 QEMU[34078]: FS =0053 0000000000000000 0000bc00 0040f300 DPL=3 DS   [-WA]
Aug 29 00:33:47 pve1 QEMU[34078]: GS =002b ffffa401d10e1000 ffffffff 00c0f300 DPL=3 DS   [-WA]
Aug 29 00:33:47 pve1 QEMU[34078]: LDT=0000 0000000000000000 ffffffff 00c00000
Aug 29 00:33:47 pve1 QEMU[34078]: TR =0040 ffffa401d10f1000 00000067 00008b00 DPL=0 TSS64-busy
Aug 29 00:33:47 pve1 QEMU[34078]: GDT=     ffffa401d10f2fb0 00000057
Aug 29 00:33:47 pve1 QEMU[34078]: IDT=     ffffa401d10f0000 00000fff
Aug 29 00:33:47 pve1 QEMU[34078]: CR0=80050033 CR2=0000000000000000 CR3=00000000001ae000 CR4=00350ef8
Aug 29 00:33:47 pve1 QEMU[34078]: DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
Aug 29 00:33:47 pve1 QEMU[34078]: DR6=00000000ffff07f0 DR7=0000000000000400
Aug 29 00:33:47 pve1 QEMU[34078]: EFER=0000000000000d01
Aug 29 00:33:47 pve1 QEMU[34078]: Code=00 00 4e 8d 1c 02 48 2b d1 73 09 4c 3b d9 0f 87 6e 01 00 00 <0f> 10 04 11 48 83 c1 10 f6 c1 0f 74 12 48 83 e1 f0 0f 10 0c 11 0f 11 00 0f 28 c1 48 83 c1
 
Here is a similar thread about an internal-error event.

The cause seems to be in pve-qemu-kvm or qemu-server rather than the kernel, since the crash does not occur on Proxmox VE 8 with kernel 6.14.8-2 or 6.14.11-1.

I hope it will be resolved.

 
Thanks for the reply.

Yes, the BIOS is the latest official release, 3.04.
There is a beta BIOS, but I cannot use a beta.

The microcode should have been updated during the migration to PVE 9.

These CPU flags are necessary for me.
Some of them are needed to suppress error output.
Others are needed to hide the fact that the guest is virtualized.
On PVE 8 there is no error with the same setup.

CPU: Intel Core Ultra 7 265K
M/B: ASRock Z890 Pro RS WiFi White
GPU: Radeon RX 9070 XT 16GB GDDR6
 
It will take some time to move back to PVE 9 for testing, but I will temporarily change the CPU flags and see whether the crash reproduces.
I will also need to disable viommu=intel.
 
I have moved back to PVE 9 and am using it, but the crash no longer reproduces.

The difference is that I am now running kernel 6.14.11-1-pve, but I am not sure whether that is what fixed it.

I will keep watching it for a while longer.
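
Since the kernel version is the only difference I can point to, here is a minimal sketch for recording which kernel a host is running, so a crash or a clean run can be matched to it. The helper name `running_kernel` is made up for this post, and it assumes the `pveversion` output format shown in my first post.

```shell
# Extract the running kernel from `pveversion` output read on stdin.
running_kernel() {
    sed -n 's/.*(running kernel: \(.*\)).*/\1/p'
}

# Example usage on the host:
#   pveversion | running_kernel
# To stay on one kernel while testing, proxmox-boot-tool can list and pin kernels
# (which versions are installed on your host is an assumption):
#   proxmox-boot-tool kernel list
#   proxmox-boot-tool kernel pin 6.14.11-1-pve
```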