Hello,
On two separate occasions, a single VM has become unresponsive on our Proxmox hosts (same hardware configuration each time).
When this happens, trying to restart the VM fails with the following error:
Code:
TASK ERROR: timeout waiting on systemd
The other VMs stay up. So far, the only way we have found to get the affected VM working again is to reboot the host (after live-migrating the other VMs away).
In the logs, we see this message:
Code:
2024-08-13T10:41:55.734697+02:00 vmh005 kernel: [1806928.911824] BUG: unable to handle page fault for address: ffffffff84172444
2024-08-13T10:41:55.734714+02:00 vmh005 kernel: [1806928.912321] #PF: supervisor read access in kernel mode
2024-08-13T10:41:55.734716+02:00 vmh005 kernel: [1806928.912675] #PF: error_code(0x0000) - not-present page
2024-08-13T10:41:55.734716+02:00 vmh005 kernel: [1806928.913000] PGD e336e3a067 P4D e336e3b067 PUD e336e3c063 PMD 0
2024-08-13T10:41:55.734721+02:00 vmh005 kernel: [1806928.913315] Oops: 0000 [#1] PREEMPT SMP NOPTI
2024-08-13T10:41:55.734722+02:00 vmh005 kernel: [1806928.913629] CPU: 62 PID: 5414 Comm: CPU 1/KVM Tainted: P O 6.8.8-2-pve #1
2024-08-13T10:41:55.734724+02:00 vmh005 kernel: [1806928.913953] Hardware name: Supermicro AS -1125HS-TNR/H13DSH, BIOS 1.4 04/19/2023
2024-08-13T10:41:55.734724+02:00 vmh005 kernel: [1806928.914273] RIP: 0010:kvm_arch_vcpu_ioctl_run+0xc20/0x1760 [kvm]
2024-08-13T10:41:55.734740+02:00 vmh005 kernel: [1806928.914650] Code: 00 00 0c 00 00 74 02 0f 0b 48 89 df e8 49 3d 18 00 41 89 c4 83 f8 01 0f 84 8f 06 00 00 f6 83 68 0b 00 00 02 0f 85 aa 07 00 00 <65> 48 8b 05 c8 27 a9 3e a8 aa 0f 85 3a 04 00 00 8b 43 20 89 83 00
2024-08-13T10:41:55.734741+02:00 vmh005 kernel: [1806928.915306] RSP: 0018:ff4884a2f56539a0 EFLAGS: 00010046
2024-08-13T10:41:55.734741+02:00 vmh005 kernel: [1806928.915637] RAX: 0000000000000000 RBX: ff31d086b9a90000 RCX: 0000000000000000
2024-08-13T10:41:55.734743+02:00 vmh005 kernel: [1806928.915974] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
2024-08-13T10:41:55.734744+02:00 vmh005 kernel: [1806928.916302] RBP: ff4884a2f5653a40 R08: 0000000000000000 R09: 0000000000000000
2024-08-13T10:41:55.734744+02:00 vmh005 kernel: [1806928.916630] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
2024-08-13T10:41:55.734745+02:00 vmh005 kernel: [1806928.916947] R13: 0000000000000000 R14: ff31d08690fcaf40 R15: ff31d086b9a90038
2024-08-13T10:41:55.734746+02:00 vmh005 kernel: [1806928.917258] FS: 000074c8320006c0(0000) GS:ff31d1857c100000(0000) knlGS:0000000000000000
2024-08-13T10:41:55.734746+02:00 vmh005 kernel: [1806928.917559] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2024-08-13T10:41:55.734748+02:00 vmh005 kernel: [1806928.917886] CR2: ffffffff84172444 CR3: 00000002bbd8a006 CR4: 0000000000f71ef0
2024-08-13T10:41:55.734748+02:00 vmh005 kernel: [1806928.918213] PKRU: 55555554
2024-08-13T10:41:55.734748+02:00 vmh005 kernel: [1806928.918532] Call Trace:
2024-08-13T10:41:55.734749+02:00 vmh005 kernel: [1806928.918850] <TASK>
2024-08-13T10:41:55.734749+02:00 vmh005 kernel: [1806928.919148] ? show_regs+0x6d/0x80
2024-08-13T10:41:55.734750+02:00 vmh005 kernel: [1806928.919431] ? __die+0x24/0x80
2024-08-13T10:41:55.734750+02:00 vmh005 kernel: [1806928.919664] ? page_fault_oops+0x176/0x500
2024-08-13T10:41:55.734750+02:00 vmh005 kernel: [1806928.919880] ? srso_alias_return_thunk+0x5/0xfbef5
2024-08-13T10:41:55.734750+02:00 vmh005 kernel: [1806928.920105] ? srso_alias_return_thunk+0x5/0xfbef5
2024-08-13T10:41:55.734750+02:00 vmh005 kernel: [1806928.920320] ? kvm_arch_vcpu_ioctl_run+0xc20/0x1760 [kvm]
2024-08-13T10:41:55.734751+02:00 vmh005 kernel: [1806928.920568] ? kernelmode_fixup_or_oops+0xb2/0x140
2024-08-13T10:41:55.734751+02:00 vmh005 kernel: [1806928.920782] ? __bad_area_nosemaphore+0x1a5/0x270
2024-08-13T10:41:55.734751+02:00 vmh005 kernel: [1806928.920995] ? bad_area_nosemaphore+0x16/0x30
2024-08-13T10:41:55.734752+02:00 vmh005 kernel: [1806928.921208] ? do_kern_addr_fault+0x7b/0xa0
2024-08-13T10:41:55.734752+02:00 vmh005 kernel: [1806928.921418] ? exc_page_fault+0x10d/0x1b0
2024-08-13T10:41:55.734752+02:00 vmh005 kernel: [1806928.921629] ? asm_exc_page_fault+0x27/0x30
2024-08-13T10:41:55.734753+02:00 vmh005 kernel: [1806928.921838] ? kvm_arch_vcpu_ioctl_run+0xc20/0x1760 [kvm]
2024-08-13T10:41:55.734753+02:00 vmh005 kernel: [1806928.922086] ? srso_alias_return_thunk+0x5/0xfbef5
2024-08-13T10:41:55.734753+02:00 vmh005 kernel: [1806928.922296] ? kvm_io_bus_get_first_dev+0x57/0xe0 [kvm]
2024-08-13T10:41:55.734754+02:00 vmh005 kernel: [1806928.922534] ? srso_alias_return_thunk+0x5/0xfbef5
2024-08-13T10:41:55.734754+02:00 vmh005 kernel: [1806928.922744] ? svm_get_segment+0x1e/0x130 [kvm_amd]
2024-08-13T10:41:55.734754+02:00 vmh005 kernel: [1806928.922959] ? svm_get_cs_db_l_bits+0x33/0x70 [kvm_amd]
2024-08-13T10:41:55.734754+02:00 vmh005 kernel: [1806928.923173] kvm_vcpu_ioctl+0x297/0x800 [kvm]
2024-08-13T10:41:55.734754+02:00 vmh005 kernel: [1806928.923409] ? kvm_get_linear_rip+0xa5/0x120 [kvm]
2024-08-13T10:41:55.734754+02:00 vmh005 kernel: [1806928.923644] ? srso_alias_return_thunk+0x5/0xfbef5
2024-08-13T10:41:55.734755+02:00 vmh005 kernel: [1806928.923844] ? kvm_fast_pio+0x71/0x270 [kvm]
2024-08-13T10:41:55.734755+02:00 vmh005 kernel: [1806928.924070] ? srso_alias_return_thunk+0x5/0xfbef5
2024-08-13T10:41:55.734755+02:00 vmh005 kernel: [1806928.924271] __x64_sys_ioctl+0xa0/0xf0
2024-08-13T10:41:55.734766+02:00 vmh005 kernel: [1806928.924462] x64_sys_call+0xa68/0x24b0
2024-08-13T10:41:55.734767+02:00 vmh005 kernel: [1806928.924650] do_syscall_64+0x81/0x170
2024-08-13T10:41:55.734767+02:00 vmh005 kernel: [1806928.924835] ? vcpu_put+0x22/0x60 [kvm]
2024-08-13T10:41:55.734767+02:00 vmh005 kernel: [1806928.925047] ? srso_alias_return_thunk+0x5/0xfbef5
2024-08-13T10:41:55.734767+02:00 vmh005 kernel: [1806928.925232] ? kvm_arch_vcpu_ioctl_run+0x471/0x1760 [kvm]
2024-08-13T10:41:55.734767+02:00 vmh005 kernel: [1806928.925442] ? srso_alias_return_thunk+0x5/0xfbef5
2024-08-13T10:41:55.734767+02:00 vmh005 kernel: [1806928.925624] ? __x64_sys_ioctl+0xbb/0xf0
2024-08-13T10:41:55.734770+02:00 vmh005 kernel: [1806928.925804] ? srso_alias_return_thunk+0x5/0xfbef5
2024-08-13T10:41:55.734771+02:00 vmh005 kernel: [1806928.925981] ? kvm_vcpu_ioctl+0x30e/0x800 [kvm]
2024-08-13T10:41:55.734771+02:00 vmh005 kernel: [1806928.926188] ? srso_alias_return_thunk+0x5/0xfbef5
2024-08-13T10:41:55.734771+02:00 vmh005 kernel: [1806928.926358] ? kvm_vcpu_ioctl+0x30e/0x800 [kvm]
2024-08-13T10:41:55.734771+02:00 vmh005 kernel: [1806928.926546] ? do_syscall_64+0x8d/0x170
2024-08-13T10:41:55.734772+02:00 vmh005 kernel: [1806928.926704] ? srso_alias_return_thunk+0x5/0xfbef5
2024-08-13T10:41:55.734772+02:00 vmh005 kernel: [1806928.926860] ? kvm_on_user_return+0x78/0xd0 [kvm]
2024-08-13T10:41:55.734772+02:00 vmh005 kernel: [1806928.927047] ? srso_alias_return_thunk+0x5/0xfbef5
2024-08-13T10:41:55.734772+02:00 vmh005 kernel: [1806928.927204] ? fire_user_return_notifiers+0x37/0x80
2024-08-13T10:41:55.734772+02:00 vmh005 kernel: [1806928.927363] ? srso_alias_return_thunk+0x5/0xfbef5
2024-08-13T10:41:55.734773+02:00 vmh005 kernel: [1806928.927519] ? syscall_exit_to_user_mode+0x89/0x260
2024-08-13T10:41:55.734773+02:00 vmh005 kernel: [1806928.927673] ? srso_alias_return_thunk+0x5/0xfbef5
2024-08-13T10:41:55.734773+02:00 vmh005 kernel: [1806928.927825] ? do_syscall_64+0x8d/0x170
2024-08-13T10:41:55.734774+02:00 vmh005 kernel: [1806928.927976] ? srso_alias_return_thunk+0x5/0xfbef5
2024-08-13T10:41:55.734774+02:00 vmh005 kernel: [1806928.928139] ? syscall_exit_to_user_mode+0x89/0x260
2024-08-13T10:41:55.734774+02:00 vmh005 kernel: [1806928.928289] ? srso_alias_return_thunk+0x5/0xfbef5
2024-08-13T10:41:55.734776+02:00 vmh005 kernel: [1806928.928441] ? kvm_on_user_return+0x78/0xd0 [kvm]
2024-08-13T10:41:55.734776+02:00 vmh005 kernel: [1806928.928620] ? srso_alias_return_thunk+0x5/0xfbef5
2024-08-13T10:41:55.734776+02:00 vmh005 kernel: [1806928.928772] ? fire_user_return_notifiers+0x37/0x80
2024-08-13T10:41:55.734776+02:00 vmh005 kernel: [1806928.928925] ? srso_alias_return_thunk+0x5/0xfbef5
2024-08-13T10:41:55.734777+02:00 vmh005 kernel: [1806928.929081] ? syscall_exit_to_user_mode+0x89/0x260
2024-08-13T10:41:55.734777+02:00 vmh005 kernel: [1806928.929237] ? srso_alias_return_thunk+0x5/0xfbef5
2024-08-13T10:41:55.734777+02:00 vmh005 kernel: [1806928.929393] ? __x64_sys_ioctl+0xbb/0xf0
2024-08-13T10:41:55.734777+02:00 vmh005 kernel: [1806928.929547] ? srso_alias_return_thunk+0x5/0xfbef5
2024-08-13T10:41:55.734777+02:00 vmh005 kernel: [1806928.929703] ? syscall_exit_to_user_mode+0x89/0x260
2024-08-13T10:41:55.734777+02:00 vmh005 kernel: [1806928.929858] ? srso_alias_return_thunk+0x5/0xfbef5
2024-08-13T10:41:55.734778+02:00 vmh005 kernel: [1806928.930016] ? do_syscall_64+0x8d/0x170
2024-08-13T10:41:55.734778+02:00 vmh005 kernel: [1806928.930170] ? srso_alias_return_thunk+0x5/0xfbef5
2024-08-13T10:41:55.734779+02:00 vmh005 kernel: [1806928.930327] ? syscall_exit_to_user_mode+0x89/0x260
2024-08-13T10:41:55.734780+02:00 vmh005 kernel: [1806928.930484] ? srso_alias_return_thunk+0x5/0xfbef5
2024-08-13T10:41:55.734781+02:00 vmh005 kernel: [1806928.930639] ? do_syscall_64+0x8d/0x170
2024-08-13T10:41:55.734781+02:00 vmh005 kernel: [1806928.930794] ? kvm_on_user_return+0x78/0xd0 [kvm]
2024-08-13T10:41:55.734781+02:00 vmh005 kernel: [1806928.930976] ? srso_alias_return_thunk+0x5/0xfbef5
2024-08-13T10:41:55.734781+02:00 vmh005 kernel: [1806928.931133] ? srso_alias_return_thunk+0x5/0xfbef5
2024-08-13T10:41:55.734781+02:00 vmh005 kernel: [1806928.931287] ? __x64_sys_ioctl+0xbb/0xf0
2024-08-13T10:41:55.734783+02:00 vmh005 kernel: [1806928.931439] ? srso_alias_return_thunk+0x5/0xfbef5
2024-08-13T10:41:55.734783+02:00 vmh005 kernel: [1806928.931592] ? syscall_exit_to_user_mode+0x89/0x260
2024-08-13T10:41:55.734784+02:00 vmh005 kernel: [1806928.931745] ? srso_alias_return_thunk+0x5/0xfbef5
2024-08-13T10:41:55.734784+02:00 vmh005 kernel: [1806928.931899] ? do_syscall_64+0x8d/0x170
2024-08-13T10:41:55.734784+02:00 vmh005 kernel: [1806928.932053] ? srso_alias_return_thunk+0x5/0xfbef5
2024-08-13T10:41:55.734784+02:00 vmh005 kernel: [1806928.932216] ? do_syscall_64+0x8d/0x170
2024-08-13T10:41:55.734784+02:00 vmh005 kernel: [1806928.932369] entry_SYSCALL_64_after_hwframe+0x78/0x80
2024-08-13T10:41:55.734784+02:00 vmh005 kernel: [1806928.932523] RIP: 0033:0x74c837371c5b
2024-08-13T10:41:55.734785+02:00 vmh005 kernel: [1806928.932713] Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1c 48 8b 44 24 18 64 48 2b 04 25 28 00 00
2024-08-13T10:41:55.734785+02:00 vmh005 kernel: [1806928.933045] RSP: 002b:000074c831ffaf30 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
2024-08-13T10:41:55.734785+02:00 vmh005 kernel: [1806928.933218] RAX: ffffffffffffffda RBX: 00005e0cc5ce5d10 RCX: 000074c837371c5b
2024-08-13T10:41:55.734786+02:00 vmh005 kernel: [1806928.933390] RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 0000000000000023
2024-08-13T10:41:55.734786+02:00 vmh005 kernel: [1806928.933569] RBP: 000000000000ae80 R08: 00005e0cc44f0c90 R09: 0000000000000000
2024-08-13T10:41:55.734786+02:00 vmh005 kernel: [1806928.933743] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
2024-08-13T10:41:55.734786+02:00 vmh005 kernel: [1806928.933917] R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000000
2024-08-13T10:41:55.734787+02:00 vmh005 kernel: [1806928.934093] </TASK>
2024-08-13T10:41:55.734787+02:00 vmh005 kernel: [1806928.934264] Modules linked in: nfsv3 nfs_acl nfs lockd grace netfs ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter sctp ip6_udp_tunnel udp_tunnel nf_tables 8021q garp mrp bonding tls softdog sunrpc binfmt_misc nfnetlink_log nfnetlink intel_rapl_msr intel_rapl_common amd64_edac edac_mce_amd kvm_amd kvm irqbypass crct10dif_pclmul ipmi_ssif polyval_clmulni polyval_generic ghash_clmulni_intel sha256_ssse3 sha1_ssse3 aesni_intel crypto_simd cryptd dax_hmem cxl_acpi rapl cxl_core pcspkr irdma i40e ib_uverbs acpi_ipmi ast i2c_algo_bit ipmi_si ib_core ipmi_devintf k10temp ccp ipmi_msghandler joydev input_leds mac_hid vhost_net vhost vhost_iotlb tap efi_pstore dmi_sysfs ip_tables x_tables autofs4 zfs(PO) spl(O) btrfs blake2b_generic raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq raid1 raid0 rndis_host cdc_ether usbnet mii hid_generic usbmouse usbhid hid dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c raid10 xhci_pci crc32_pclmul
2024-08-13T10:41:55.734789+02:00 vmh005 kernel: [1806928.934353] nvme xhci_pci_renesas ice nvme_core ahci nvme_auth xhci_hcd gnss libahci i2c_piix4
2024-08-13T10:41:55.734789+02:00 vmh005 kernel: [1806928.936407] CR2: ffffffff84172444
2024-08-13T10:41:55.734789+02:00 vmh005 kernel: [1806928.936632] ---[ end trace 0000000000000000 ]---
2024-08-13T10:41:55.734789+02:00 vmh005 kernel: [1806929.085377] RIP: 0010:kvm_arch_vcpu_ioctl_run+0xc20/0x1760 [kvm]
2024-08-13T10:41:55.734806+02:00 vmh005 kernel: [1806929.085780] Code: 00 00 0c 00 00 74 02 0f 0b 48 89 df e8 49 3d 18 00 41 89 c4 83 f8 01 0f 84 8f 06 00 00 f6 83 68 0b 00 00 02 0f 85 aa 07 00 00 <65> 48 8b 05 c8 27 a9 3e a8 aa 0f 85 3a 04 00 00 8b 43 20 89 83 00
2024-08-13T10:41:55.734806+02:00 vmh005 kernel: [1806929.086280] RSP: 0018:ff4884a2f56539a0 EFLAGS: 00010046
2024-08-13T10:41:55.734806+02:00 vmh005 kernel: [1806929.086535] RAX: 0000000000000000 RBX: ff31d086b9a90000 RCX: 0000000000000000
2024-08-13T10:41:55.734807+02:00 vmh005 kernel: [1806929.086791] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
2024-08-13T10:41:55.734807+02:00 vmh005 kernel: [1806929.087045] RBP: ff4884a2f5653a40 R08: 0000000000000000 R09: 0000000000000000
2024-08-13T10:41:55.734807+02:00 vmh005 kernel: [1806929.087296] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
2024-08-13T10:41:55.734809+02:00 vmh005 kernel: [1806929.087547] R13: 0000000000000000 R14: ff31d08690fcaf40 R15: ff31d086b9a90038
2024-08-13T10:41:55.734809+02:00 vmh005 kernel: [1806929.087798] FS: 000074c8320006c0(0000) GS:ff31d1857c100000(0000) knlGS:0000000000000000
2024-08-13T10:41:55.734809+02:00 vmh005 kernel: [1806929.088053] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2024-08-13T10:41:55.734809+02:00 vmh005 kernel: [1806929.088309] CR2: ffffffff84172444 CR3: 00000002bbd8a006 CR4: 0000000000f71ef0
2024-08-13T10:41:55.734809+02:00 vmh005 kernel: [1806929.088566] PKRU: 55555554
2024-08-13T10:41:55.734810+02:00 vmh005 kernel: [1806929.088821] note: CPU 1/KVM[5414] exited with irqs disabled
2024-08-13T10:41:55.734810+02:00 vmh005 kernel: [1806929.089103] note: CPU 1/KVM[5414] exited with preempt_count 1
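To compare the two occurrences, it may help to reduce each oops to a few identifying fields (faulting address, kernel RIP symbol, task name, kernel version). Below is a minimal sketch, assuming the log format shown above; `oops_signature` and its field names are hypothetical and not part of any Proxmox or kernel tooling.

```python
import re

# Patterns written against the log lines shown above (an assumption: other
# kernels or log formats may need different expressions).
FAULT_RE = re.compile(r"BUG: unable to handle page fault for address: ([0-9a-f]+)")
# Segment 0010 is the kernel code segment, so this skips the user-space RIP line.
RIP_RE = re.compile(r"RIP: 0010:(\S+)")
# Captures the task name and the kernel release from the "CPU: ... Comm: ..." line.
COMM_RE = re.compile(r"Comm: (.+?) (?:Tainted:|Not tainted).*? (\S+) #\d+")


def oops_signature(log_text):
    """Return a dict with the first match of each identifying field, or None."""
    fault = FAULT_RE.search(log_text)
    rip = RIP_RE.search(log_text)
    comm = COMM_RE.search(log_text)
    return {
        "address": fault.group(1) if fault else None,
        "rip": rip.group(1) if rip else None,
        "comm": comm.group(1) if comm else None,
        "kernel": comm.group(2) if comm else None,
    }


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python oops_sig.py < kernel.log
    print(oops_signature(sys.stdin.read()))
```

If both incidents yield the same RIP symbol and offset on the same kernel version, that points towards a reproducible kernel-side issue rather than a random hardware fault, though it does not prove it.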
We are running:
- Supermicro AS-1125HS-TNR/H13DSH
- AMD EPYC 9374F
- 1TB DDR5 ECC RAM
- Enterprise NVMe disks
Does anyone have an idea of what could be the cause and how we can resolve it?
Thank you!