Hello everyone!
I recently started using Proxmox 8.2.7 (Linux 6.8.12-3-pve) to host a few VMs at home. Unfortunately, the host has crashed three times in the past two weeks for no apparent reason. I've attached the full log from shortly before the latest crash as a text file; this section is likely the most relevant:
Nov 12 18:41:18 pve systemd[1]: Started 110.scope.
Nov 12 18:41:19 pve kernel: tap110i0: entered promiscuous mode
Nov 12 18:41:19 pve kernel: vmbr0: port 12(tap110i0) entered blocking state
Nov 12 18:41:19 pve kernel: vmbr0: port 12(tap110i0) entered disabled state
Nov 12 18:41:19 pve kernel: tap110i0: entered allmulticast mode
Nov 12 18:41:19 pve kernel: vmbr0: port 12(tap110i0) entered blocking state
Nov 12 18:41:19 pve kernel: vmbr0: port 12(tap110i0) entered forwarding state
Nov 12 18:41:19 pve pvedaemon[1380391]: <root@pam> end task UPID:pve:001528E0:01BB1DEA:6733933E:qmstart:110:root@pam: OK
Nov 12 18:41:21 pve pvedaemon[1386802]: starting vnc proxy UPID:pve:00152932:01BB1F00:67339341:vncproxy:110:root@pam:
Nov 12 18:41:21 pve pvedaemon[1377281]: <root@pam> starting task UPID:pve:00152932:01BB1F00:67339341:vncproxy:110:root@pam:
Nov 12 18:41:32 pve kernel: kvm_pr_unimpl_wrmsr: 2 callbacks suppressed
Nov 12 18:41:32 pve kernel: kvm_amd: kvm [1386740]: vcpu0, guest rIP: 0xfffff872ef13b455 Unhandled WRMSR(0xc0010115) = 0x0
Nov 12 18:41:33 pve kernel: kvm_amd: kvm [1386740]: vcpu1, guest rIP: 0xfffff872ef13b455 Unhandled WRMSR(0xc0010115) = 0x0
Nov 12 18:41:34 pve kernel: kvm_amd: kvm [1386740]: vcpu2, guest rIP: 0xfffff872ef13b455 Unhandled WRMSR(0xc0010115) = 0x0
Nov 12 18:41:34 pve kernel: kvm_amd: kvm [1386740]: vcpu3, guest rIP: 0xfffff872ef13b455 Unhandled WRMSR(0xc0010115) = 0x0
Nov 12 18:41:34 pve kernel: kvm_amd: kvm [1386740]: vcpu4, guest rIP: 0xfffff872ef13b455 Unhandled WRMSR(0xc0010115) = 0x0
Nov 12 18:41:34 pve kernel: kvm_amd: kvm [1386740]: vcpu5, guest rIP: 0xfffff872ef13b455 Unhandled WRMSR(0xc0010115) = 0x0
Nov 12 18:41:34 pve kernel: kvm_amd: kvm [1386740]: vcpu6, guest rIP: 0xfffff872ef13b455 Unhandled WRMSR(0xc0010115) = 0x0
Nov 12 18:41:34 pve kernel: kvm_amd: kvm [1386740]: vcpu7, guest rIP: 0xfffff872ef13b455 Unhandled WRMSR(0xc0010115) = 0x0
Nov 12 18:41:34 pve kernel: kvm_amd: kvm [1386740]: vcpu8, guest rIP: 0xfffff872ef13b455 Unhandled WRMSR(0xc0010115) = 0x0
Nov 12 18:41:34 pve kernel: kvm_amd: kvm [1386740]: vcpu9, guest rIP: 0xfffff872ef13b455 Unhandled WRMSR(0xc0010115) = 0x0
-- Reboot --
Nov 12 18:44:51 pve kernel: Linux version 6.8.12-3-pve (build@proxmox) (gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC PMX 6.8.12-3 (2024-10-23T11:41Z) ()
Nov 12 18:44:51 pve kernel: Command line: initrd=\EFI\proxmox\6.8.12-3-pve\initrd.img-6.8.12-3-pve root=ZFS=rpool/ROOT/pve-1 boot=zfs quiet iommu=pt nvme_core.default_ps_max_latency_us=0 pcie_aspm=off
Nov 12 18:44:51 pve kernel: KERNEL supported cpus:
Nov 12 18:44:51 pve kernel: Intel GenuineIntel
Nov 12 18:44:51 pve kernel: AMD AuthenticAMD
Nov 12 18:44:51 pve kernel: Hygon HygonGenuine
Nov 12 18:44:51 pve kernel: Centaur CentaurHauls
Nov 12 18:44:51 pve kernel: zhaoxin Shanghai
Nov 12 18:44:51 pve kernel: BIOS-provided physical RAM map:
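In case it helps with catching the next crash: I could switch journald to persistent storage so the full kernel log survives the reboot (assuming it is still on the default volatile storage), roughly like this:

```shell
# With the default Storage=auto, journald persists logs to disk
# as soon as this directory exists:
mkdir -p /var/log/journal
systemctl restart systemd-journald

# After the next crash, the kernel messages of the previous boot
# can then be read with:
journalctl -b -1 -k
```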
The hardware is new and has only been in operation for two weeks:
Supermicro H13SAE-MF, AMD Ryzen 9 7900X, 2x 32GB G.Skill Ripjaws S5 DDR5-5600 DIMM CL28-34-34-89, 2x Samsung 990 Pro in RAID1 for the system, and a Broadcom 9500-16i HBA that is passed through to a TrueNAS VM.
Alongside approximately 15 Linux VMs, I run one Windows 11 VM, which I suspect is causing the issue. The server ran smoothly for several days until today, when I connected to the Windows 11 VM via RDP and opened a browser.
Here is the Windows 11 VM config:
agent: 1
bios: ovmf
boot: order=ide0
cores: 24
cpu: host
efidisk0: local-zfs:vm-110-disk-0,efitype=4m,pre-enrolled-keys=1,size=1M
ide0: local-zfs:vm-110-disk-1,cache=writeback,discard=on,size=75G
machine: pc-q35-9.0
memory: 12228
meta: creation-qemu=9.0.2,ctime=1730993987
name: Windows-11-Workstation
net0: virtio=BC:24:11:78:49:E1,bridge=vmbr0
numa: 0
ostype: win11
scsihw: virtio-scsi-single
smbios1: uuid=13047b88-4a0f-4329-a711-4d5b2d77d798
sockets: 1
tpmstate0: local-zfs:vm-110-disk-2,size=4M,version=v2.0
vmgenid: 56515997-c429-4240-a2ad-e0cb00888441
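The config above is taken straight from the host; for anyone who wants to compare with their own setup, it can be dumped with:

```shell
# Print VM 110's configuration as stored in /etc/pve/qemu-server/110.conf
qm config 110
```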
All firmware is up to date (mainboard, NVMe drives, Broadcom HBA). I'm not sure how to narrow down the cause and hope someone here can offer a tip. I'd prefer not to revert to ESXi.
Best regards,
Karsten
P.S.:
I also noticed that I repeatedly lose the network connection to the Proxmox host and all VMs for a few seconds. Each time this happens, my MikroTik switch logs a "link down" on the port. I have already swapped both the port and the cable.
As a side note: after the Proxmox RAID1 unexpectedly degraded one day with the log messages below, I added the kernel parameters they suggest:
Nov 09 01:29:18 pve kernel: nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
Nov 09 01:29:18 pve kernel: nvme nvme0: Does your device have a faulty power saving mode enabled?
Nov 09 01:29:18 pve kernel: nvme nvme0: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" and report a bug
Nov 09 01:29:18 pve kernel: nvme 0000:05:00.0: Unable to change power state from D3cold to D0, device inaccessible
Nov 09 01:29:18 pve kernel: nvme nvme0: Disabling device after reset failure: -19
Nov 09 01:29:18 pve kernel: zio pool=rpool vdev=/dev/disk/by-id/nvme-eui.0025384441405fc9-part3 error=5 type=2 offset=1987419951104 size=53248 flags=1572992
Nov 09 01:29:18 pve kernel: zio pool=rpool vdev=/dev/disk/by-id/nvme-eui.0025384441405fc9-part3 error=5 type=2 offset=1189416304640 size=4096 flags=1572992
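For reference, I applied those parameters by editing /etc/kernel/cmdline and refreshing the boot entries (this system boots from ZFS via proxmox-boot-tool, as the kernel command line above shows):

```shell
# /etc/kernel/cmdline (single line):
#   root=ZFS=rpool/ROOT/pve-1 boot=zfs quiet iommu=pt nvme_core.default_ps_max_latency_us=0 pcie_aspm=off

# Copy the updated command line and initrd to all registered ESPs:
proxmox-boot-tool refresh
```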