This is very strange. We have a 6-node cluster. The newest node is the only one with a Xeon Platinum 8358; the others are mostly Xeon Gold 6154. The newest node was running over 100 VMs.

Out of the blue, a few VMs on that node started to malfunction. They weren't frozen: we could still log in and ping them, but services weren't working and wouldn't restart, and service restarts took forever and never finished. We then tried to reboot the VMs, but they began shutting down services and never completed. We tried resetting the VMs, but they wouldn't boot correctly either. So we ended up stopping those few VMs, migrating them in the stopped state to another node, and starting them there, where they booted correctly and worked just fine. The newest node still has over 100 VMs working fine, but something happened to those few.

Afterwards, we migrated one of the affected VMs back to the newest node to see what would happen. It seemed to work fine until the next day, when it started malfunctioning again and we had to stop it and migrate it to another node once more.
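For reference, the recovery that worked is roughly the following (written here as a dry run that only echoes the commands; VMID 183 is one of the affected guests from the journal below, and the target node name is a made-up example):

```shell
#!/bin/sh
# Dry-run sketch of the offline-migration recovery using Proxmox's qm CLI.
# VMID 183 is one of the affected guests from the journal below; the
# target node name is a made-up example. Remove "echo" to run for real.
VMID=183
TARGET=other-node

echo qm stop "$VMID"                # force the hung guest off
echo qm migrate "$VMID" "$TARGET"   # offline migration (VM is stopped)
echo ssh "$TARGET" qm start "$VMID" # start it on the other node
```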
Below are the output of pveversion -v and lscpu, plus an extract of journalctl -b from the affected node.
pveversion -v
Code:
root@tcn-05-lon-vh27:~# pveversion -v
proxmox-ve: 7.2-1 (running kernel: 5.15.35-2-pve)
pve-manager: 7.2-4 (running version: 7.2-4/ca9d43cc)
pve-kernel-5.15: 7.2-4
pve-kernel-helper: 7.2-4
pve-kernel-5.15.35-2-pve: 5.15.35-5
pve-kernel-5.15.30-2-pve: 5.15.30-3
ceph-fuse: 15.2.16-pve1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve1
libproxmox-acme-perl: 1.4.2
libproxmox-backup-qemu0: 1.3.1-1
libpve-access-control: 7.2-2
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.2-2
libpve-guest-common-perl: 4.1-2
libpve-http-server-perl: 4.1-2
libpve-storage-perl: 7.2-4
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.12-1
lxcfs: 4.0.12-pve1
novnc-pve: 1.3.0-3
proxmox-backup-client: 2.2.3-1
proxmox-backup-file-restore: 2.2.3-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.5.1
pve-cluster: 7.2-1
pve-container: 4.2-1
pve-docs: 7.2-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.4-2
pve-ha-manager: 3.3-4
pve-i18n: 2.7-2
pve-qemu-kvm: 6.2.0-10
pve-xtermjs: 4.16.0-1
qemu-server: 7.2-3
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.7.1~bpo11+1
vncterm: 1.7-1
zfsutils-linux: 2.1.4-pve1
root@tcn-05-lon-vh27:~#
lscpu
Code:
root@tcn-05-lon-vh27:~# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 46 bits physical, 57 bits virtual
CPU(s): 128
On-line CPU(s) list: 0-127
Thread(s) per core: 2
Core(s) per socket: 32
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 106
Model name: Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz
Stepping: 6
CPU MHz: 2600.000
CPU max MHz: 3400.0000
CPU min MHz: 800.0000
BogoMIPS: 5200.00
Virtualization: VT-x
L1d cache: 3 MiB
L1i cache: 2 MiB
L2 cache: 80 MiB
L3 cache: 96 MiB
NUMA node0 CPU(s): 0-31,64-95
NUMA node1 CPU(s): 32-63,96-127
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Enhanced IBRS, IBPB conditional, RSB filling
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 invpcid_single intel_ppin ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect wbnoinvd dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq la57 rdpid fsrm md_clear pconfig flush_l1d arch_capabilities
root@tcn-05-lon-vh27:~#
extract of journalctl -b:
Code:
Jul 29 02:05:01 tcn-05-lon-vh27 pvestatd[194998]: status update time (13.378 seconds)
Jul 29 02:05:11 tcn-05-lon-vh27 pvestatd[194998]: VM 183 qmp command failed - VM 183 qmp command 'query-proxmox-support' failed - unable to connect to VM 183 qmp socket - timeout after 31 retries
Jul 29 02:05:14 tcn-05-lon-vh27 pvestatd[194998]: VM 146 qmp command failed - VM 146 qmp command 'query-proxmox-support' failed - unable to connect to VM 146 qmp socket - timeout after 31 retries
Jul 29 02:05:15 tcn-05-lon-vh27 pvestatd[194998]: status update time (13.412 seconds)
Jul 29 02:05:24 tcn-05-lon-vh27 pvestatd[194998]: VM 146 qmp command failed - VM 146 qmp command 'query-proxmox-support' failed - unable to connect to VM 146 qmp socket - timeout after 31 retries
Jul 29 02:05:27 tcn-05-lon-vh27 pvestatd[194998]: VM 183 qmp command failed - VM 183 qmp command 'query-proxmox-support' failed - unable to connect to VM 183 qmp socket - timeout after 31 retries
Jul 29 02:05:28 tcn-05-lon-vh27 pvestatd[194998]: status update time (12.710 seconds)
Jul 29 02:05:37 tcn-05-lon-vh27 pvestatd[194998]: VM 183 qmp command failed - VM 183 qmp command 'query-proxmox-support' failed - unable to connect to VM 183 qmp socket - timeout after 31 retries
Jul 29 02:05:40 tcn-05-lon-vh27 pvestatd[194998]: VM 146 qmp command failed - VM 146 qmp command 'query-proxmox-support' failed - unable to connect to VM 146 qmp socket - timeout after 31 retries
Jul 29 02:05:40 tcn-05-lon-vh27 pvestatd[194998]: storage 'SN4-NOREPLICA' is not online
Jul 29 02:05:40 tcn-05-lon-vh27 pvestatd[194998]: status update time (12.929 seconds)
Jul 29 02:05:50 tcn-05-lon-vh27 pvestatd[194998]: VM 146 qmp command failed - VM 146 qmp command 'query-proxmox-support' failed - unable to connect to VM 146 qmp socket - timeout after 31 retries
Jul 29 02:05:53 tcn-05-lon-vh27 pvestatd[194998]: VM 183 qmp command failed - VM 183 qmp command 'query-proxmox-support' failed - unable to connect to VM 183 qmp socket - timeout after 31 retries
Jul 29 02:05:54 tcn-05-lon-vh27 pvestatd[194998]: status update time (13.178 seconds)
Jul 29 02:05:58 tcn-05-lon-vh27 kernel: INFO: task iou-wrk-210670:2613106 blocked for more than 241 seconds.
Jul 29 02:05:58 tcn-05-lon-vh27 kernel: Tainted: P O 5.15.35-2-pve #1
Jul 29 02:05:58 tcn-05-lon-vh27 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 29 02:05:58 tcn-05-lon-vh27 kernel: task:iou-wrk-210670 state:D stack: 0 pid:2613106 ppid: 1 flags:0x00004000
Jul 29 02:05:58 tcn-05-lon-vh27 kernel: Call Trace:
Jul 29 02:05:58 tcn-05-lon-vh27 kernel: <TASK>
Jul 29 02:05:58 tcn-05-lon-vh27 kernel: __schedule+0x33d/0x1750
Jul 29 02:05:58 tcn-05-lon-vh27 kernel: ? nfs_generic_pg_pgios+0xa5/0xc0 [nfs]
Jul 29 02:05:58 tcn-05-lon-vh27 kernel: schedule+0x4e/0xb0
Jul 29 02:05:58 tcn-05-lon-vh27 kernel: io_schedule+0x46/0x70
Jul 29 02:05:58 tcn-05-lon-vh27 kernel: wait_on_page_bit_common+0x114/0x3e0
Jul 29 02:05:58 tcn-05-lon-vh27 kernel: ? filemap_invalidate_unlock_two+0x40/0x40
Jul 29 02:05:58 tcn-05-lon-vh27 kernel: wait_on_page_bit+0x3f/0x50
Jul 29 02:05:58 tcn-05-lon-vh27 kernel: wait_on_page_writeback+0x26/0x80
Jul 29 02:05:58 tcn-05-lon-vh27 kernel: __filemap_fdatawait_range+0x97/0x110
Jul 29 02:05:58 tcn-05-lon-vh27 kernel: file_write_and_wait_range+0xcc/0xf0
Jul 29 02:05:58 tcn-05-lon-vh27 kernel: nfs_file_fsync+0x9f/0x190 [nfs]
Jul 29 02:05:58 tcn-05-lon-vh27 kernel: vfs_fsync_range+0x46/0x80
Jul 29 02:05:58 tcn-05-lon-vh27 kernel: io_issue_sqe+0x1098/0x1fb0
Jul 29 02:05:58 tcn-05-lon-vh27 kernel: ? lock_timer_base+0x3b/0xd0
Jul 29 02:05:58 tcn-05-lon-vh27 kernel: io_wq_submit_work+0x68/0xb0
Jul 29 02:05:58 tcn-05-lon-vh27 kernel: io_worker_handle_work+0x1a7/0x5f0
Jul 29 02:05:58 tcn-05-lon-vh27 kernel: io_wqe_worker+0x2c0/0x360
Jul 29 02:05:58 tcn-05-lon-vh27 kernel: ? finish_task_switch.isra.0+0xa6/0x2a0
Jul 29 02:05:58 tcn-05-lon-vh27 kernel: ? io_worker_handle_work+0x5f0/0x5f0
Jul 29 02:05:58 tcn-05-lon-vh27 kernel: ? io_worker_handle_work+0x5f0/0x5f0
Jul 29 02:05:58 tcn-05-lon-vh27 kernel: ret_from_fork+0x1f/0x30
Jul 29 02:05:58 tcn-05-lon-vh27 kernel: RIP: 0033:0x0
Jul 29 02:05:58 tcn-05-lon-vh27 kernel: RSP: 002b:0000000000000000 EFLAGS: 00000212 ORIG_RAX: 00000000000001aa
Jul 29 02:05:58 tcn-05-lon-vh27 kernel: RAX: 0000000000000000 RBX: 00007fe31abeb860 RCX: 00007fe54489b9b9
Jul 29 02:05:58 tcn-05-lon-vh27 kernel: RDX: 0000000000000000 RSI: 0000000000000001 RDI: 000000000000000f
Jul 29 02:05:58 tcn-05-lon-vh27 kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000008
Jul 29 02:05:58 tcn-05-lon-vh27 kernel: R10: 0000000000000000 R11: 0000000000000212 R12: 000056547dccdc78
Jul 29 02:05:58 tcn-05-lon-vh27 kernel: R13: 000056547dccdd30 R14: 000056547dccdc70 R15: 00007fe31abeb860
Jul 29 02:05:58 tcn-05-lon-vh27 kernel: </TASK>
Jul 29 02:05:58 tcn-05-lon-vh27 kernel: INFO: task kvm:1610434 blocked for more than 120 seconds.
Jul 29 02:05:58 tcn-05-lon-vh27 kernel: Tainted: P O 5.15.35-2-pve #1
Jul 29 02:05:58 tcn-05-lon-vh27 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 29 02:05:58 tcn-05-lon-vh27 kernel: task:kvm state:D stack: 0 pid:1610434 ppid: 1 flags:0x00000000
Jul 29 02:05:58 tcn-05-lon-vh27 kernel: Call Trace:
Jul 29 02:05:58 tcn-05-lon-vh27 kernel: <TASK>
Jul 29 02:05:58 tcn-05-lon-vh27 kernel: __schedule+0x33d/0x1750
Jul 29 02:05:58 tcn-05-lon-vh27 kernel: schedule+0x4e/0xb0
Jul 29 02:05:58 tcn-05-lon-vh27 kernel: rwsem_down_read_slowpath+0x318/0x370
Jul 29 02:05:58 tcn-05-lon-vh27 kernel: down_read+0x43/0x90
Jul 29 02:05:58 tcn-05-lon-vh27 kernel: nfs_start_io_read+0x1f/0x80 [nfs]
Jul 29 02:05:58 tcn-05-lon-vh27 kernel: nfs_file_read+0x39/0xb0 [nfs]
Jul 29 02:05:58 tcn-05-lon-vh27 kernel: io_read+0xe9/0x4c0
Jul 29 02:05:58 tcn-05-lon-vh27 kernel: ? wake_up_process+0x15/0x20
Jul 29 02:05:58 tcn-05-lon-vh27 kernel: ? io_wqe_activate_free_worker+0xc0/0xd0
Jul 29 02:05:58 tcn-05-lon-vh27 kernel: io_issue_sqe+0xf57/0x1fb0
Jul 29 02:05:58 tcn-05-lon-vh27 kernel: ? io_wq_enqueue+0x1c/0x20
Jul 29 02:05:58 tcn-05-lon-vh27 kernel: __io_queue_sqe+0x35/0x310
Jul 29 02:05:58 tcn-05-lon-vh27 kernel: ? fget+0x2a/0x30
Jul 29 02:05:58 tcn-05-lon-vh27 kernel: io_submit_sqes+0xfb5/0x1b50
Jul 29 02:05:58 tcn-05-lon-vh27 kernel: ? wake_up_q+0x90/0x90
Jul 29 02:05:58 tcn-05-lon-vh27 kernel: ? __fget_files+0x86/0xc0
Jul 29 02:05:58 tcn-05-lon-vh27 kernel: __do_sys_io_uring_enter+0x520/0x9a0
Jul 29 02:05:58 tcn-05-lon-vh27 kernel: ? __do_sys_io_uring_enter+0x520/0x9a0
Jul 29 02:05:58 tcn-05-lon-vh27 kernel: ? vfs_read+0x100/0x1a0
Jul 29 02:05:58 tcn-05-lon-vh27 kernel: __x64_sys_io_uring_enter+0x29/0x30
Jul 29 02:05:58 tcn-05-lon-vh27 kernel: do_syscall_64+0x59/0xc0
Jul 29 02:05:58 tcn-05-lon-vh27 kernel: ? do_syscall_64+0x69/0xc0
Jul 29 02:05:58 tcn-05-lon-vh27 kernel: ? do_syscall_64+0x69/0xc0
Jul 29 02:05:58 tcn-05-lon-vh27 kernel: ? sysvec_apic_timer_interrupt+0x4e/0x90
Jul 29 02:05:58 tcn-05-lon-vh27 kernel: ? asm_sysvec_apic_timer_interrupt+0xa/0x20
Jul 29 02:05:58 tcn-05-lon-vh27 kernel: entry_SYSCALL_64_after_hwframe+0x44/0xae
Jul 29 02:05:58 tcn-05-lon-vh27 kernel: RIP: 0033:0x7f25472619b9
Jul 29 02:05:58 tcn-05-lon-vh27 kernel: RSP: 002b:00007ffdc0fc17a8 EFLAGS: 00000216 ORIG_RAX: 00000000000001aa
Jul 29 02:05:58 tcn-05-lon-vh27 kernel: RAX: ffffffffffffffda RBX: 00007f2277a7a630 RCX: 00007f25472619b9
Jul 29 02:05:58 tcn-05-lon-vh27 kernel: RDX: 0000000000000000 RSI: 0000000000000004 RDI: 000000000000000f
Jul 29 02:05:58 tcn-05-lon-vh27 kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000008
Jul 29 02:05:58 tcn-05-lon-vh27 kernel: R10: 0000000000000000 R11: 0000000000000216 R12: 0000557587985a28
Jul 29 02:05:58 tcn-05-lon-vh27 kernel: R13: 0000557587985ae0 R14: 0000557587985a20 R15: 0000000000000001
Jul 29 02:05:58 tcn-05-lon-vh27 kernel: </TASK>
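Worth noting: both stuck tasks are blocked inside the NFS client (nfs_file_fsync / nfs_file_read) on I/O submitted through io_uring (an iou-wrk worker and an io_uring_enter syscall), and pvestatd also reports storage 'SN4-NOREPLICA' as not online. When it happens again, a quick check is whether the host has tasks stuck in D state and whether the NFS mount still answers. The snippet below assumes the mount point is /mnt/pve/SN4-NOREPLICA based on the storage name in the log; adjust it to what /etc/pve/storage.cfg actually says:

```shell
#!/bin/sh
# List tasks in uninterruptible sleep (state D) plus the kernel function
# they are blocked in; hung iou-wrk/kvm threads show up here.
ps -eo pid,stat,comm,wchan:32 | awk 'NR == 1 || $2 ~ /^D/'

# Probe the NFS-backed storage with a timeout so the check itself cannot
# hang. /mnt/pve/SN4-NOREPLICA is an assumption based on the storage name
# in the log; check /etc/pve/storage.cfg for the real path.
if timeout 5 ls /mnt/pve/SN4-NOREPLICA >/dev/null 2>&1; then
    echo "storage responsive"
else
    echo "storage stalled or missing"
fi
```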
I searched forums and bug reports and found similar problems, but not this exact one. I hope to get some help here. Thank you.