Intermittent Crashes on Host (Proxmox reboots)

kishkaru · Jun 13, 2024

Hello, I installed Proxmox 8.2.2 on one of my hosts this past week. The running kernel is 6.8.4-3-pve.
The machine has a Core i9-10900 (10C/20T) on an MSI Z590 board (with updated BIOS) and 128GB of RAM.

Roughly once per day, the host crashes and reboots itself. All running VMs are ungracefully rebooted as a result.
Prior to Proxmox, this host was running Windows Server 2022 bare metal and was rock solid stable.

I have enabled persistent system journals, but they don't show anything useful around the time of the crash. As an example from the most recent crash:

Code:

Jun 12 23:26:06 proxmox-01 postfix/smtp[322568]: connect to in1-smtp.messagingengine.com[103.168.172.218]:25: Connection timed out
Jun 12 23:26:06 proxmox-01 postfix/smtp[322568]: C685F20F0A: to=<redacted>, relay=none, delay=258150, delays=258000/0.02/150/0, dsn=4.4.1, status=defer>
Jun 12 23:28:35 proxmox-01 postfix/qmgr[2000]: 5AEA920E34: from=<root@proxmox-01.mynet.lan>, size=828, nrcpt=1 (queue active)
Jun 12 23:28:35 proxmox-01 postfix/error[323422]: 5AEA920E34: to=<redacted>, relay=none, delay=124420, delays=124420/0.01/0/0.01, dsn=4.4.1, status=def>
-- Boot 076d5fd44f054065a7fbdf43bdd8e0b1 --
Jun 12 23:34:42 proxmox-01 kernel: Linux version 6.8.4-3-pve (build@proxmox) (gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PR>
Jun 12 23:34:42 proxmox-01 kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-6.8.4-3-pve root=/dev/mapper/pve-root ro quiet intel_iommu=on iommu=pt
Jun 12 23:34:42 proxmox-01 kernel: KERNEL supported cpus:
Jun 12 23:34:42 proxmox-01 kernel:   Intel GenuineIntel
Jun 12 23:34:42 proxmox-01 kernel:   AMD AuthenticAMD
Jun 12 23:34:42 proxmox-01 kernel:   Hygon HygonGenuine
Jun 12 23:34:42 proxmox-01 kernel:   Centaur CentaurHauls
Jun 12 23:34:42 proxmox-01 kernel:   zhaoxin   Shanghai
Jun 12 23:34:42 proxmox-01 kernel: BIOS-provided physical RAM map:

However, much later on, when VMs are powering back on one by one automatically after the reboot, I see this in the system journal:

Code:

Jun 12 23:36:12 proxmox-01 pvedaemon[2029]: VM 107 qmp command failed - VM 107 qmp command 'guest-ping' failed - got timeout
Jun 12 23:36:56 proxmox-01 kernel: irq 16: nobody cared (try booting with the "irqpoll" option)
Jun 12 23:36:56 proxmox-01 kernel: CPU: 13 PID: 0 Comm: swapper/13 Tainted: P           O       6.8.4-3-pve #1
Jun 12 23:36:56 proxmox-01 kernel: Hardware name: Micro-Star International Co., Ltd. MS-7D06/MPG Z590 GAMING CARBON WIFI (MS-7D06), BIOS 1.B0 06/12/2023
Jun 12 23:36:56 proxmox-01 kernel: Call Trace:
Jun 12 23:36:56 proxmox-01 kernel:  <IRQ>
Jun 12 23:36:56 proxmox-01 kernel:  dump_stack_lvl+0x48/0x70
Jun 12 23:36:56 proxmox-01 kernel:  dump_stack+0x10/0x20
Jun 12 23:36:56 proxmox-01 kernel:  __report_bad_irq+0x30/0xd0
Jun 12 23:36:56 proxmox-01 kernel:  note_interrupt+0x2e1/0x320
Jun 12 23:36:56 proxmox-01 kernel:  handle_irq_event+0x79/0x80
Jun 12 23:36:56 proxmox-01 kernel:  handle_fasteoi_irq+0x7d/0x200
Jun 12 23:36:56 proxmox-01 kernel:  __common_interrupt+0x3e/0xb0
Jun 12 23:36:56 proxmox-01 kernel:  common_interrupt+0x9f/0xb0
Jun 12 23:36:56 proxmox-01 kernel:  </IRQ>
Jun 12 23:36:56 proxmox-01 kernel:  <TASK>
Jun 12 23:36:56 proxmox-01 kernel:  asm_common_interrupt+0x27/0x40
Jun 12 23:36:56 proxmox-01 kernel: RIP: 0010:cpuidle_enter_state+0xce/0x470
Jun 12 23:36:56 proxmox-01 kernel: Code: 17 03 ff e8 f4 ee ff ff 8b 53 04 49 89 c6 0f 1f 44 00 00 31 ff e8 02 07 02 ff 80 7d d7 00 0f 85 e7 01 00 00 fb 0f 1f>
Jun 12 23:36:56 proxmox-01 kernel: RSP: 0018:ffffa7db801abe50 EFLAGS: 00000246
Jun 12 23:36:56 proxmox-01 kernel: RAX: 0000000000000000 RBX: ffff980f7c2c15c8 RCX: 0000000000000000
Jun 12 23:36:56 proxmox-01 kernel: RDX: 000000000000000d RSI: 0000000000000000 RDI: 0000000000000000
Jun 12 23:36:56 proxmox-01 kernel: RBP: ffffa7db801abe88 R08: 0000000000000000 R09: 0000000000000000
Jun 12 23:36:56 proxmox-01 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000002
Jun 12 23:36:56 proxmox-01 kernel: R13: ffffffffa846fa00 R14: 0000001ff9ce38f4 R15: 0000000000000002
Jun 12 23:36:56 proxmox-01 kernel:  cpuidle_enter+0x2e/0x50
Jun 12 23:36:56 proxmox-01 kernel:  call_cpuidle+0x23/0x60
Jun 12 23:36:56 proxmox-01 kernel:  do_idle+0x207/0x260
Jun 12 23:36:56 proxmox-01 kernel:  cpu_startup_entry+0x2a/0x30
Jun 12 23:36:56 proxmox-01 kernel:  start_secondary+0x119/0x140
Jun 12 23:36:56 proxmox-01 kernel:  secondary_startup_64_no_verify+0x184/0x18b
Jun 12 23:36:56 proxmox-01 kernel:  </TASK>
Jun 12 23:36:56 proxmox-01 kernel: handlers:
Jun 12 23:36:56 proxmox-01 kernel: [<0000000030ee8955>] i801_isr [i2c_i801]
Jun 12 23:36:56 proxmox-01 kernel: [<0000000051ff6f10>] azx_interrupt [snd_hda_codec]
Jun 12 23:36:56 proxmox-01 kernel: Disabling IRQ #16

VM #107 is a Windows Server VM with a SATA controller and HBA PCIe passed through. I've verified that the IOMMU group for each device is unique, and has no other devices in each group.
Otherwise, the VM has a standard pc-q35 machine with all VirtIO hardware and drivers installed inside OS. Memory ballooning is disabled.
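For anyone wanting to repeat the IOMMU group check, here is a minimal sketch that prints every group with the PCI addresses inside it. It assumes the usual sysfs layout under /sys/kernel/iommu_groups; the function name list_iommu_groups and its optional root argument are made up for illustration.

```shell
#!/bin/sh
# List each IOMMU group with the PCI addresses of the devices inside it.
# The optional first argument overrides the sysfs root (an assumption made
# so the function can be tried against a fake directory tree).
list_iommu_groups() {
    root="${1:-/sys/kernel/iommu_groups}"
    for dev in "$root"/*/devices/*; do
        [ -e "$dev" ] || continue          # glob matched nothing
        group="${dev#"$root"/}"            # strip the root prefix
        group="${group%%/*}"               # keep only the group number
        printf 'group %s: %s\n' "$group" "${dev##*/}"
    done
}

# Usage on the host: list_iommu_groups | sort -V
```

A passed-through device that shares its group with anything other than its own functions would show up here as several addresses under the same group number.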

How can I investigate the unexpected reboots?
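One possible starting point is to look at what the journal recorded right before each boot marker. This is only a sketch: it relies on the "-- Boot <id> --" separator lines that journalctl prints when output spans several boots (as in the excerpt above), and the helper name lines_before_boot is hypothetical.

```shell
#!/bin/sh
# Print the last N lines logged before each "-- Boot <id> --" marker,
# followed by the marker itself. Reads journal text on stdin.
lines_before_boot() {
    n="${1:-5}"
    awk -v n="$n" '
        /^-- Boot / {
            s = (c > n) ? c - n + 1 : 1       # first buffered line to show
            for (i = s; i <= c; i++) print buf[i]
            print                             # the boot marker line
            c = 0                             # start a fresh buffer
            next
        }
        { buf[++c] = $0 }                     # remember lines of this boot
    '
}

# Usage on the host: journalctl --no-pager | lines_before_boot 20
```

For a single crash, `journalctl -b -1 -e` jumps to the end of the previous boot directly. If the machine resets hard, the final messages may never reach disk, so an empty tail does not rule out a kernel-level fault.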

gfngfn256 · Jun 13, 2024

Are you doing any PCIe passthrough? If you are, I remember reading about trouble with that CPU.

What I would try (anyway) is pinning to the 6.5 kernel & see if that fixes things.

kishkaru · Jun 13, 2024

Yes I am, as mentioned above: a SATA controller and an HBA card, both passed through to VM #107.

gfngfn256 · Jun 13, 2024

kishkaru said:
as mentioned above

Don't know how I missed it (too much coffee!). Try either of the following (or both):
1. Remove all passthrough (for testing).
2. Pin to kernel 6.5.

kishkaru · Jun 13, 2024

Looks like 6.5.13-5 is the latest 6.5.x kernel. Installed:

Code:

$ apt install proxmox-kernel-6.5.13-5-pve-signed
$ apt install proxmox-headers-6.5.13-5-pve
$ proxmox-boot-tool kernel pin 6.5.13-5-pve

Pinned:

Code:

$ proxmox-boot-tool kernel list
Manually selected kernels:
None.

Automatically selected kernels:
6.5.13-5-pve
6.8.4-3-pve

Pinned kernel:
6.5.13-5-pve

And rebooted host. Fingers crossed!
What is it about the newer 6.8.x kernel that makes PCIe passthrough unstable on these older CPUs?

kishkaru · Jun 13, 2024

Unfortunately, the host crashed once again today, this time on the 6.5.13-5 kernel. I'm seeing the same IRQ #16 error as before in the system log.
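To see which drivers currently share IRQ 16, the matching row of /proc/interrupts can be pulled out. A small sketch, with a hypothetical helper name irq_row and an optional file argument added only so it can be tried on sample data:

```shell
#!/bin/sh
# Print the /proc/interrupts row for one IRQ number.
# $1 = IRQ number, $2 = optional path to an interrupts-style file.
irq_row() {
    awk -v irq="$1" '$1 == (irq ":")' "${2:-/proc/interrupts}"
}

# Usage on the host: irq_row 16
```

The earlier trace lists i801_isr and azx_interrupt as the registered handlers on that line. "nobody cared" means neither handler claimed the interrupt, so the source may be some other device wired to the same legacy line; that is a guess to check, not a diagnosis.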

I'm also seeing this part marked in red in the boot sequence. It is the first time it has appeared in the log. I don't know if it's related to the prior crash:

Code:

Jun 13 12:39:20 proxmox-01 kernel: sd 4:0:0:0: [sdh] Synchronizing SCSI cache
Jun 13 12:39:20 proxmox-01 kernel: ata4.00: Entering standby power mode
Jun 13 12:39:20 proxmox-01 kernel: ata4.00: disable device
Jun 13 12:39:21 proxmox-01 kernel: sd 0:0:0:0: [sda] Synchronizing SCSI cache
Jun 13 12:39:21 proxmox-01 kernel: sd 0:0:1:0: [sdb] Synchronizing SCSI cache
Jun 13 12:39:21 proxmox-01 kernel: sd 0:0:3:0: [sdd] tag#2581 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=0s
Jun 13 12:39:21 proxmox-01 kernel: sd 0:0:3:0: [sdd] tag#2581 CDB: Read(10) 28 00 00 00 00 00 00 01 00 00
Jun 13 12:39:21 proxmox-01 kernel: I/O error, dev sdd, sector 0 op 0x0:(READ) flags 0x0 phys_seg 32 prio class 2
Jun 13 12:39:21 proxmox-01 kernel: sd 0:0:3:0: [sdd] tag#2582 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=0s
Jun 13 12:39:21 proxmox-01 kernel: sd 0:0:3:0: [sdd] tag#2582 CDB: Read(10) 28 00 00 00 00 22 00 01 00 00
Jun 13 12:39:21 proxmox-01 kernel: I/O error, dev sdd, sector 34 op 0x0:(READ) flags 0x0 phys_seg 32 prio class 2
Jun 13 12:39:21 proxmox-01 kernel: sd 0:0:3:0: [sdd] tag#2583 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=0s
Jun 13 12:39:21 proxmox-01 kernel: sd 0:0:3:0: [sdd] tag#2583 CDB: Read(10) 28 00 00 00 80 00 00 01 00 00
Jun 13 12:39:21 proxmox-01 kernel: I/O error, dev sdd, sector 32768 op 0x0:(READ) flags 0x0 phys_seg 32 prio class 2
Jun 13 12:39:21 proxmox-01 kernel: sd 0:0:2:0: [sdc] Synchronizing SCSI cache
Jun 13 12:39:21 proxmox-01 postfix/smtp[1976]: connect to in1-smtp.messagingengine.com[103.168.172.217]:25: Connection timed out
Jun 13 12:39:21 proxmox-01 kernel: sd 0:0:3:0: [sdd] Synchronizing SCSI cache

The next step might be to turn off the Windows VM that has the PCIe passthrough devices, to try to isolate the cause.
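A sketch of that isolation step, assuming the VM ID 107 from earlier in the thread. The wrapper isolate_vm and the QM variable are hypothetical conveniences; the underlying calls are `qm shutdown` and `qm set <vmid> --onboot 0`, the latter so the VM does not start again automatically after the next crash.

```shell
#!/bin/sh
# Shut a VM down and stop it from auto-starting at host boot.
# QM can be overridden (for example QM=echo) to preview the commands.
QM="${QM:-qm}"

isolate_vm() {
    vmid="$1"
    "$QM" shutdown "$vmid" && "$QM" set "$vmid" --onboot 0
}

# Usage on the host: isolate_vm 107
```

If the host then stays up for a few days, the passthrough devices become the main suspect; `qm set 107 --onboot 1` restores the autostart afterwards.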

gfngfn256 · Jun 14, 2024

I have a hunch (based on your above output) that your PCIe controller may be entering a power/sleep state & not recovering. You may have to check drivers/BIOS for sleep states. Another thing you may want to look into is CPU microcode/firmware for Intel. Maybe see here & here. I don't have much experience with it, so do your homework!
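For the microcode part, a quick way to see which revision the CPU is running is to read it from /proc/cpuinfo. This is a minimal sketch; the helper name microcode_revisions and its optional file argument are assumptions for illustration.

```shell
#!/bin/sh
# Print the distinct microcode revisions reported in a cpuinfo-style file.
# $1 = optional path, defaulting to /proc/cpuinfo.
microcode_revisions() {
    awk -F': *' '/^microcode/ { print $2 }' "${1:-/proc/cpuinfo}" | sort -u
}

# Usage on the host: microcode_revisions
```

More than one line of output would mean the cores disagree on the revision. `journalctl -k | grep -i microcode` shows what the kernel reported about microcode at boot, which can be compared against the revision Intel currently publishes for this CPU.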

If you isolate the problem to the PCI passthrough, you may want to start over and revisit the method you used to set up that specific passthrough.
