Hello, I installed Proxmox 8.2.2 on one of my hosts this past week. The running kernel is 6.8.4-3-pve.
The machine has a Core i9-10900 (10C/20T) on an MSI Z590 board (with updated BIOS) and 128GB of RAM.
Roughly once per day, the host crashes and reboots itself. All running VMs are ungracefully rebooted as a result.
Prior to Proxmox, this host was running Windows Server 2022 on bare metal and was rock solid.
I have enabled persistent system journals, but they show nothing useful leading up to the reboot. As an example from the most recent crash:
Code:
Jun 12 23:26:06 proxmox-01 postfix/smtp[322568]: connect to in1-smtp.messagingengine.com[103.168.172.218]:25: Connection timed out
Jun 12 23:26:06 proxmox-01 postfix/smtp[322568]: C685F20F0A: to=<redacted>, relay=none, delay=258150, delays=258000/0.02/150/0, dsn=4.4.1, status=defer>
Jun 12 23:28:35 proxmox-01 postfix/qmgr[2000]: 5AEA920E34: from=<root@proxmox-01.mynet.lan>, size=828, nrcpt=1 (queue active)
Jun 12 23:28:35 proxmox-01 postfix/error[323422]: 5AEA920E34: to=<redacted>, relay=none, delay=124420, delays=124420/0.01/0/0.01, dsn=4.4.1, status=def>
-- Boot 076d5fd44f054065a7fbdf43bdd8e0b1 --
Jun 12 23:34:42 proxmox-01 kernel: Linux version 6.8.4-3-pve (build@proxmox) (gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PR>
Jun 12 23:34:42 proxmox-01 kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-6.8.4-3-pve root=/dev/mapper/pve-root ro quiet intel_iommu=on iommu=pt
Jun 12 23:34:42 proxmox-01 kernel: KERNEL supported cpus:
Jun 12 23:34:42 proxmox-01 kernel: Intel GenuineIntel
Jun 12 23:34:42 proxmox-01 kernel: AMD AuthenticAMD
Jun 12 23:34:42 proxmox-01 kernel: Hygon HygonGenuine
Jun 12 23:34:42 proxmox-01 kernel: Centaur CentaurHauls
Jun 12 23:34:42 proxmox-01 kernel: zhaoxin Shanghai
Jun 12 23:34:42 proxmox-01 kernel: BIOS-provided physical RAM map:
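To put a number on the silent window in the excerpt above, here is a minimal Python sketch that measures the time between the last journal entry before the `-- Boot` marker and the first entry after it. The function name `crash_window` is my own, the sample lines are trimmed copies of the log above, and the year is an assumption because syslog-style timestamps do not carry one.

```python
from datetime import datetime

BOOT_MARKER = "-- Boot "


def crash_window(lines, year=2024):
    """Return (last_before, first_after) timestamps around the first boot marker.

    `year` is an assumption: the journal's short timestamps omit it.
    """
    def stamp(line):
        # Syslog-style prefix is 15 characters, e.g. "Jun 12 23:28:35".
        return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")

    for i, line in enumerate(lines):
        if line.startswith(BOOT_MARKER):
            if i == 0 or i + 1 >= len(lines):
                return None  # nothing to compare against
            return stamp(lines[i - 1]), stamp(lines[i + 1])
    return None


# Trimmed copies of the lines around the boot marker in the excerpt above.
sample = [
    "Jun 12 23:28:35 proxmox-01 postfix/error[323422]: 5AEA920E34:",
    "-- Boot 076d5fd44f054065a7fbdf43bdd8e0b1 --",
    "Jun 12 23:34:42 proxmox-01 kernel: Linux version 6.8.4-3-pve",
]

before, after = crash_window(sample)
gap = after - before
print(f"no journal entries for {int(gap.total_seconds())} seconds before the new boot")
```

The gap only bounds when the crash happened. The host was otherwise quiet, so the reset could have occurred at any point in that window.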
However, about two minutes into the new boot, while the VMs are automatically powering back on one by one, I see this in the system journal:
Code:
Jun 12 23:36:12 proxmox-01 pvedaemon[2029]: VM 107 qmp command failed - VM 107 qmp command 'guest-ping' failed - got timeout
Jun 12 23:36:56 proxmox-01 kernel: irq 16: nobody cared (try booting with the "irqpoll" option)
Jun 12 23:36:56 proxmox-01 kernel: CPU: 13 PID: 0 Comm: swapper/13 Tainted: P O 6.8.4-3-pve #1
Jun 12 23:36:56 proxmox-01 kernel: Hardware name: Micro-Star International Co., Ltd. MS-7D06/MPG Z590 GAMING CARBON WIFI (MS-7D06), BIOS 1.B0 06/12/2023
Jun 12 23:36:56 proxmox-01 kernel: Call Trace:
Jun 12 23:36:56 proxmox-01 kernel: <IRQ>
Jun 12 23:36:56 proxmox-01 kernel: dump_stack_lvl+0x48/0x70
Jun 12 23:36:56 proxmox-01 kernel: dump_stack+0x10/0x20
Jun 12 23:36:56 proxmox-01 kernel: __report_bad_irq+0x30/0xd0
Jun 12 23:36:56 proxmox-01 kernel: note_interrupt+0x2e1/0x320
Jun 12 23:36:56 proxmox-01 kernel: handle_irq_event+0x79/0x80
Jun 12 23:36:56 proxmox-01 kernel: handle_fasteoi_irq+0x7d/0x200
Jun 12 23:36:56 proxmox-01 kernel: __common_interrupt+0x3e/0xb0
Jun 12 23:36:56 proxmox-01 kernel: common_interrupt+0x9f/0xb0
Jun 12 23:36:56 proxmox-01 kernel: </IRQ>
Jun 12 23:36:56 proxmox-01 kernel: <TASK>
Jun 12 23:36:56 proxmox-01 kernel: asm_common_interrupt+0x27/0x40
Jun 12 23:36:56 proxmox-01 kernel: RIP: 0010:cpuidle_enter_state+0xce/0x470
Jun 12 23:36:56 proxmox-01 kernel: Code: 17 03 ff e8 f4 ee ff ff 8b 53 04 49 89 c6 0f 1f 44 00 00 31 ff e8 02 07 02 ff 80 7d d7 00 0f 85 e7 01 00 00 fb 0f 1f>
Jun 12 23:36:56 proxmox-01 kernel: RSP: 0018:ffffa7db801abe50 EFLAGS: 00000246
Jun 12 23:36:56 proxmox-01 kernel: RAX: 0000000000000000 RBX: ffff980f7c2c15c8 RCX: 0000000000000000
Jun 12 23:36:56 proxmox-01 kernel: RDX: 000000000000000d RSI: 0000000000000000 RDI: 0000000000000000
Jun 12 23:36:56 proxmox-01 kernel: RBP: ffffa7db801abe88 R08: 0000000000000000 R09: 0000000000000000
Jun 12 23:36:56 proxmox-01 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000002
Jun 12 23:36:56 proxmox-01 kernel: R13: ffffffffa846fa00 R14: 0000001ff9ce38f4 R15: 0000000000000002
Jun 12 23:36:56 proxmox-01 kernel: cpuidle_enter+0x2e/0x50
Jun 12 23:36:56 proxmox-01 kernel: call_cpuidle+0x23/0x60
Jun 12 23:36:56 proxmox-01 kernel: do_idle+0x207/0x260
Jun 12 23:36:56 proxmox-01 kernel: cpu_startup_entry+0x2a/0x30
Jun 12 23:36:56 proxmox-01 kernel: start_secondary+0x119/0x140
Jun 12 23:36:56 proxmox-01 kernel: secondary_startup_64_no_verify+0x184/0x18b
Jun 12 23:36:56 proxmox-01 kernel: </TASK>
Jun 12 23:36:56 proxmox-01 kernel: handlers:
Jun 12 23:36:56 proxmox-01 kernel: [<0000000030ee8955>] i801_isr [i2c_i801]
Jun 12 23:36:56 proxmox-01 kernel: [<0000000051ff6f10>] azx_interrupt [snd_hda_codec]
Jun 12 23:36:56 proxmox-01 kernel: Disabling IRQ #16
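To make the report above easier to compare across boots, here is a small Python sketch that pulls the IRQ number and the registered handlers out of a "nobody cared" trace. The function name `parse_bad_irq` is hypothetical, and the regular expressions are an assumption based only on the format seen in this excerpt.

```python
import re


def parse_bad_irq(lines):
    """Extract the IRQ number and (handler, module) pairs from a 'nobody cared' report."""
    irq = None
    handlers = []
    in_handlers = False
    for line in lines:
        # Drop the journal prefix up to and including "kernel: ".
        msg = line.split("kernel: ", 1)[-1].strip()
        m = re.match(r"irq (\d+): nobody cared", msg)
        if m:
            irq = int(m.group(1))
        elif msg == "handlers:":
            in_handlers = True
        elif in_handlers:
            h = re.match(r"\[<[0-9a-f]+>\] (\S+)(?: \[(\S+)\])?", msg)
            if h:
                handlers.append((h.group(1), h.group(2)))
            else:
                in_handlers = False  # first non-handler line ends the list
    return irq, handlers


# Relevant lines copied from the journal excerpt above.
report = [
    'Jun 12 23:36:56 proxmox-01 kernel: irq 16: nobody cared (try booting with the "irqpoll" option)',
    "Jun 12 23:36:56 proxmox-01 kernel: handlers:",
    "Jun 12 23:36:56 proxmox-01 kernel: [<0000000030ee8955>] i801_isr [i2c_i801]",
    "Jun 12 23:36:56 proxmox-01 kernel: [<0000000051ff6f10>] azx_interrupt [snd_hda_codec]",
    "Jun 12 23:36:56 proxmox-01 kernel: Disabling IRQ #16",
]

irq, handlers = parse_bad_irq(report)
print(irq, handlers)
```

For this excerpt the result is IRQ 16 with two host-side handlers, `i801_isr` and `azx_interrupt`, neither of which claimed the interrupt.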
VM #107 is a Windows Server VM with a SATA controller and an HBA passed through via PCIe. I've verified that each passed-through device sits in its own IOMMU group, with no other devices in that group.
Otherwise, the VM uses a standard pc-q35 machine type with all-VirtIO hardware, and the VirtIO drivers are installed inside the guest OS. Memory ballooning is disabled.
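For anyone who wants to repeat the IOMMU group check described above, here is a rough Python sketch that lists the devices in each group by walking `/sys/kernel/iommu_groups`, which is where the kernel exposes them when the IOMMU is enabled. The helper names `iommu_groups` and `is_isolated` are my own, and the PCI address in the last lines is a placeholder, not one of the devices on this host.

```python
from pathlib import Path


def iommu_groups(root="/sys/kernel/iommu_groups"):
    """Map each IOMMU group name to the sorted list of device addresses in it."""
    groups = {}
    base = Path(root)
    if not base.is_dir():
        return groups  # IOMMU disabled or sysfs not available
    for group in base.iterdir():
        devices_dir = group / "devices"
        if devices_dir.is_dir():
            groups[group.name] = sorted(d.name for d in devices_dir.iterdir())
    return groups


def is_isolated(groups, pci_address):
    """True if the device is found and is the only member of its group."""
    for devices in groups.values():
        if pci_address in devices:
            return len(devices) == 1
    return False


groups = iommu_groups()
for name in sorted(groups, key=int):
    print(f"group {name}: {' '.join(groups[name])}")

# Placeholder address: substitute the device that is passed through.
print(is_isolated(groups, "0000:00:00.0"))
```

A device that shares its group with anything else would show up on the same `group N:` line as the other members.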
How can I investigate the unexpected reboots?