I have a Debian VM with an NVIDIA A40 passed through to it that freezes at random (anywhere from a day to two weeks between freezes). When it happens, the VM becomes completely unresponsive and I have to kill it. There is nothing in the VM logs, but the host logs do capture a little at the moment the VM freezes (full dump further down).
I'm using an ASUS Prime X670-P motherboard and have tried upgrading the BIOS (at least a couple of times during troubleshooting over the last couple of months).
The card is passed through as a raw device with all functions, and ROM-Bar enabled.
Other details:
VM kernel: 6.18.5+deb13-amd64
VM NVIDIA driver: 590.48.01
Host kernel: 6.17.2-2-pve (with additional boot arguments "pcie_aspm=off pcie_port_pm=off")
Kernels and driver have been upgraded a couple of times through the troubleshooting process.
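For anyone who wants to rule out the boot arguments not being applied, this is roughly how they can be confirmed as active on the host. It is only a sketch: the helper name `missing_boot_args` is made up for illustration, and it does nothing more than compare the arguments listed above against the contents of `/proc/cmdline`.

```python
from pathlib import Path

# The two arguments mentioned above for the host kernel.
REQUIRED_ARGS = ("pcie_aspm=off", "pcie_port_pm=off")


def missing_boot_args(cmdline: str, required=REQUIRED_ARGS) -> list[str]:
    """Return the required arguments that are absent from a kernel command line.

    Hypothetical helper for illustration; it only does an exact token match.
    """
    present = set(cmdline.split())
    return [arg for arg in required if arg not in present]


if __name__ == "__main__":
    # /proc/cmdline holds the command line the running kernel was booted with.
    cmdline = Path("/proc/cmdline").read_text()
    missing = missing_boot_args(cmdline)
    print("all arguments active" if not missing else f"missing: {missing}")
```

Run on the host after a reboot; an empty result means both arguments made it onto the running kernel's command line.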
Although this is not a critical issue, it's really annoying, and I would be very grateful for any ideas.
Here are the host logs from the latest crash:
Feb 17 17:53:24 hypervisor2 QEMU[7737]: error: kvm run failed Bad address
Feb 17 17:53:24 hypervisor2 QEMU[7737]: RAX=0000000000000001 RBX=00000000000001e8 RCX=00007fbd08399620 RDX=00007fbd51cd1000
Feb 17 17:53:24 hypervisor2 QEMU[7737]: RSI=00000000000001e8 RDI=00007fbd0839a648 RBP=0000000003cc4e70 RSP=00007fbd3cddbf48
Feb 17 17:53:24 hypervisor2 QEMU[7737]: R8 =0000000000000001 R9 =0000000000000001 R10=00007fbd3cddc2b0 R11=00007fbd3cddc2c8
Feb 17 17:53:24 hypervisor2 QEMU[7737]: R12=0000000000000fff R13=00007fbd0839a648 R14=00007fbd08399620 R15=00007fbd517b9f30
Feb 17 17:53:24 hypervisor2 QEMU[7737]: RIP=00007fbd1ae68d21 RFL=00000202 [-------] CPL=3 II=0 A20=1 SMM=0 HLT=0
Feb 17 17:53:24 hypervisor2 QEMU[7737]: ES =0000 0000000000000000 00000000 00000000
Feb 17 17:53:24 hypervisor2 QEMU[7737]: CS =0033 0000000000000000 ffffffff 00a0fb00 DPL=3 CS64 [-RA]
Feb 17 17:53:24 hypervisor2 QEMU[7737]: SS =002b 0000000000000000 ffffffff 00c0f300 DPL=3 DS [-WA]
Feb 17 17:53:24 hypervisor2 QEMU[7737]: DS =0000 0000000000000000 00000000 00000000
Feb 17 17:53:24 hypervisor2 QEMU[7737]: FS =0000 00007fbd3cdde6c0 00000000 00000000
Feb 17 17:53:24 hypervisor2 QEMU[7737]: GS =0000 0000000000000000 00000000 00000000
Feb 17 17:53:24 hypervisor2 QEMU[7737]: LDT=0000 0000000000000000 00000000 00000000
Feb 17 17:53:24 hypervisor2 QEMU[7737]: TR =0040 fffffe5d6c5cc000 00004087 00008b00 DPL=0 TSS64-busy
Feb 17 17:53:24 hypervisor2 QEMU[7737]: GDT= fffffe5d6c5ca000 0000007f
Feb 17 17:53:24 hypervisor2 QEMU[7737]: IDT= fffffe0000000000 00000fff
Feb 17 17:53:24 hypervisor2 QEMU[7737]: CR0=80050033 CR2=00007fe05ded1000 CR3=0000000106bda000 CR4=00750ef0
Feb 17 17:53:24 hypervisor2 QEMU[7737]: DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
Feb 17 17:53:24 hypervisor2 QEMU[7737]: DR6=00000000ffff0ff0 DR7=0000000000000400
Feb 17 17:53:24 hypervisor2 QEMU[7737]: EFER=0000000000201d01
Feb 17 17:53:24 hypervisor2 QEMU[7737]: Code=10 85 d2 74 6b 66 0f 1f 44 00 00 48 8b 54 c7 70 48 83 c0 01 <89> b2 8c 00 00 00 48 8b 97 98 01 00 00 39 42 10 77 e5 8b 41 10 85 c0 74 43 b8 30 0>
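In case it helps anyone reading the dump: QEMU wraps the byte at RIP in angle brackets on the `Code=` line, so the bytes of the instruction that was executing start at `<89>`. Below is a minimal sketch for pulling those bytes out so they can be handed to a disassembler. The function name `bytes_at_rip` is hypothetical, and the code assumes the journal line has the same layout as the one above (including the truncated final token, which it simply drops).

```python
import re

HEX_BYTE = re.compile(r"[0-9a-fA-F]{2}")
MARKED_BYTE = re.compile(r"<[0-9a-fA-F]{2}>")


def bytes_at_rip(code_line: str) -> list[int]:
    """Return the bytes from the <xx>-marked byte onward in a QEMU 'Code=' line.

    Hypothetical helper for illustration. Parsing stops at the first token
    that is not a complete two-digit hex byte (e.g. a truncated '0>').
    """
    _, _, dump = code_line.partition("Code=")
    tokens = dump.split()
    for i, tok in enumerate(tokens):
        if MARKED_BYTE.fullmatch(tok):
            result = [int(tok.strip("<>"), 16)]
            for nxt in tokens[i + 1:]:
                if not HEX_BYTE.fullmatch(nxt):
                    break  # truncated or unexpected token: stop here
                result.append(int(nxt, 16))
            return result
    return []  # no marked byte found on this line


if __name__ == "__main__":
    line = ("Feb 17 17:53:24 hypervisor2 QEMU[7737]: Code=10 85 d2 74 6b 66 0f "
            "1f 44 00 00 48 8b 54 c7 70 48 83 c0 01 <89> b2 8c 00 00 00 48 8b "
            "97 98 01 00 00 39 42 10 77 e5 8b 41 10 85 c0 74 43 b8 30 0>")
    print(" ".join(f"{b:02x}" for b in bytes_at_rip(line)))
```

The resulting byte string can then be decoded with any x86-64 disassembler and compared against the register values in the dump.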