4-Node Cluster problem - pvestatd $max_cpuid

Tdreissi

Hello,

I run a 4-node cluster.
Proxmox version 7.1-6

I see the same message on all 4 nodes, both in syslog and in "systemctl status pvestatd".
The cluster and the VMs are running, but tonight one node simply rebooted for no apparent reason.
My single-node systems do not show this error.

pvestatd[6203]: Use of uninitialized value $max_cpuid in addition (+) at /usr/share/perl5/PVE/Service/pvestatd.pm line 277.

I need your help - thanks...
 
Hi,

did you boot into the older cgroup version (legacy cgroup v1) or the like?

Else, what's the output of:
find /sys/fs/cgroup/lxc/ -name cpuset.cpus -print -exec cat {} \;

And the config of an affected container would also be interesting: pct config VMID
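
Just for context, and only as an illustration (this is not the actual pvestatd code): the warning presumably shows up when the cpuset range being parsed comes back empty or missing, so no highest CPU id can be derived from it. A rough shell equivalent:

Code:
# illustration only: derive the highest CPU id from a cpuset range
# string such as "0-39" or "0-3,8-11"
cpuset="$(cat /sys/fs/cgroup/cpuset.cpus.effective)"
# take the upper bound of the last range; an empty string yields
# nothing, the shell analogue of Perl's "uninitialized value"
max_cpuid="$(echo "$cpuset" | tr ',' '\n' | tail -n1 | awk -F- '{print $NF}')"
echo "max cpu id: ${max_cpuid:-<undefined>}"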
 
Hi,

I only have VMs, no containers.


/sys/fs/cgroup/lxc - not available

Code:
/sys/fs/cgroup# find ./ -name cpuset.cpus -print -exec cat {} \;
./sys-fs-fuse-connections.mount/cpuset.cpus
./sys-kernel-config.mount/cpuset.cpus
./sys-kernel-debug.mount/cpuset.cpus
./qemu.slice/cpuset.cpus
./dev-mqueue.mount/cpuset.cpus
./user.slice/cpuset.cpus
./sys-kernel-tracing.mount/cpuset.cpus
./init.scope/cpuset.cpus
./proc-fs-nfsd.mount/cpuset.cpus
./system.slice/pvescheduler.service/cpuset.cpus
./system.slice/systemd-udevd.service/cpuset.cpus
./system.slice/cron.service/cpuset.cpus
./system.slice/nfs-mountd.service/cpuset.cpus
./system.slice/rpcbind.socket/cpuset.cpus
./system.slice/pve-firewall.service/cpuset.cpus
./system.slice/pve-lxc-syscalld.service/cpuset.cpus
./system.slice/spiceproxy.service/cpuset.cpus
./system.slice/pve-ha-crm.service/cpuset.cpus
./system.slice/system-modprobe.slice/cpuset.cpus
./system.slice/pvedaemon.service/cpuset.cpus
./system.slice/systemd-journald.service/cpuset.cpus
./system.slice/frr.service/cpuset.cpus
./system.slice/ssh.service/cpuset.cpus
./system.slice/qmeventd.service/cpuset.cpus
./system.slice/rrdcached.service/cpuset.cpus
./system.slice/watchdog-mux.service/cpuset.cpus
./system.slice/rpc-statd.service/cpuset.cpus
./system.slice/pvefw-logger.service/cpuset.cpus
./system.slice/nfs-blkmap.service/cpuset.cpus
./system.slice/rsyslog.service/cpuset.cpus
./system.slice/pveproxy.service/cpuset.cpus
./system.slice/zabbix-agent2.service/cpuset.cpus
./system.slice/ksmtuned.service/cpuset.cpus
./system.slice/lxc-monitord.service/cpuset.cpus
./system.slice/rpcbind.service/cpuset.cpus
./system.slice/lxcfs.service/cpuset.cpus
./system.slice/nfs-idmapd.service/cpuset.cpus
./system.slice/corosync.service/cpuset.cpus
./system.slice/run-rpc_pipefs.mount/cpuset.cpus
./system.slice/system-postfix.slice/cpuset.cpus
./system.slice/smartmontools.service/cpuset.cpus
./system.slice/xagt.service/cpuset.cpus
./system.slice/system-lvm2\x2dpvscan.slice/cpuset.cpus
./system.slice/iscsid.service/cpuset.cpus
./system.slice/zfs-zed.service/cpuset.cpus
./system.slice/pve-cluster.service/cpuset.cpus
./system.slice/acpid.service/cpuset.cpus
./system.slice/ntp.service/cpuset.cpus
./system.slice/cpuset.cpus
./system.slice/dbus.service/cpuset.cpus
./system.slice/pve-ha-lrm.service/cpuset.cpus
./system.slice/system-getty.slice/cpuset.cpus
./system.slice/pvestatd.service/cpuset.cpus
./system.slice/systemd-logind.service/cpuset.cpus
./dev-hugepages.mount/cpuset.cpus

/sys/fs/cgroup# cat cpuset.cpus.effective
0-39
/sys/fs/cgroup# cat cpuset.mems.effective
0-1
 
HA is enabled.
I have two HA groups.
Group A: Node 1 + 2
Group B: Node 3 + 4

All VMs are HA-managed and have replication configured.
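
For reference, a setup like that usually boils down to something along these lines in /etc/pve/ha/groups.cfg (group and node names below are just placeholders, and options such as restricted or nofailback depend on your configuration):

Code:
group: groupA
        nodes node1,node2

group: groupB
        nodes node3,node4

You can dump the actual configuration with cat /etc/pve/ha/groups.cfg on any node.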
 
Hi,

syslog events from 22:34 - 22:46

Code:
Nov 24 22:34:02 lpnode1 pmxcfs[2982123]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/lpnode1/nas: -1
Nov 24 22:34:02 lpnode1 pmxcfs[2982123]: [status] notice: RRD update error /var/lib/rrdcached/db/pve2-storage/lpnode1/nas: /var/lib/rrdcached/db/pve2-storage/lpnode1/nas: illegal attempt to update using time 1637789642 when last update time is 1637789642 (minimum one second step)
Nov 24 22:34:02 lpnode1 pmxcfs[2982123]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/lpnode1/local: -1
Nov 24 22:34:02 lpnode1 pmxcfs[2982123]: [status] notice: RRD update error /var/lib/rrdcached/db/pve2-storage/lpnode1/local: /var/lib/rrdcached/db/pve2-storage/lpnode1/local: illegal attempt to update using time 1637789642 when last update time is 1637789642 (minimum one second step)
Nov 24 22:34:02 lpnode1 pmxcfs[2982123]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/lpnode1/local-zfs: -1
Nov 24 22:34:02 lpnode1 pmxcfs[2982123]: [status] notice: RRD update error /var/lib/rrdcached/db/pve2-storage/lpnode1/local-zfs: /var/lib/rrdcached/db/pve2-storage/lpnode1/local-zfs: illegal attempt to update using time 1637789642 when last update time is 1637789642 (minimum one second step)
Nov 24 22:46:28 lpnode1 systemd-modules-load[10666]: Inserted module 'vfio'
 
While those messages by themselves are not a direct crash of the cluster filesystem and do not drop quorum (which would then let HA trigger the watchdog), they could be an indication of the actual issue:

Is time synchronization configured and working on those servers? In other words, is ntp, chrony or the like installed and running?

Also, what happened around the crash time on the other nodes (corosync/kronosnet/pmxcfs and HA messages would be most interesting)? Further, how are those four nodes connected? Is there a separate, or at least not-so-chatty, network for the cluster communication?
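
To check both points, something along these lines can be run on each node (assuming chrony, systemd-timesyncd or classic ntpd as the time source):

Code:
# time synchronization status
timedatectl status
chronyc tracking              # if chrony is in use; for ntpd: ntpq -p

# cluster / corosync link health
pvecm status
corosync-cfgtool -s           # per-link status as seen by kronosnet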
 
Hi,

Proxmox GUI syslog:

Code:
Nov 24 22:30:06 lpnode1 pvestatd[12573]: Use of uninitialized value $max_cpuid in addition (+) at /usr/share/perl5/PVE/Service/pvestatd.pm line 277.
Nov 24 22:30:16 lpnode1 pvestatd[12573]: Use of uninitialized value $max_cpuid in addition (+) at /usr/share/perl5/PVE/Service/pvestatd.pm line 277.
Nov 24 22:30:26 lpnode1 pvestatd[12573]: Use of uninitialized value $max_cpuid in addition (+) at /usr/share/perl5/PVE/Service/pvestatd.pm line 277.
Nov 24 22:30:36 lpnode1 pvestatd[12573]: Use of uninitialized value $max_cpuid in addition (+) at /usr/share/perl5/PVE/Service/pvestatd.pm line 277.
Nov 24 22:30:46 lpnode1 pvestatd[12573]: Use of uninitialized value $max_cpuid in addition (+) at /usr/share/perl5/PVE/Service/pvestatd.pm line 277.
Nov 24 22:30:56 lpnode1 pvestatd[12573]: Use of uninitialized value $max_cpuid in addition (+) at /usr/share/perl5/PVE/Service/pvestatd.pm line 277.
Nov 24 22:31:06 lpnode1 pvestatd[12573]: Use of uninitialized value $max_cpuid in addition (+) at /usr/share/perl5/PVE/Service/pvestatd.pm line 277.
Nov 24 22:31:16 lpnode1 pvestatd[12573]: Use of uninitialized value $max_cpuid in addition (+) at /usr/share/perl5/PVE/Service/pvestatd.pm line 277.
Nov 24 22:31:26 lpnode1 pvestatd[12573]: Use of uninitialized value $max_cpuid in addition (+) at /usr/share/perl5/PVE/Service/pvestatd.pm line 277.
Nov 24 22:31:36 lpnode1 pvestatd[12573]: Use of uninitialized value $max_cpuid in addition (+) at /usr/share/perl5/PVE/Service/pvestatd.pm line 277.
Nov 24 22:31:46 lpnode1 pvestatd[12573]: Use of uninitialized value $max_cpuid in addition (+) at /usr/share/perl5/PVE/Service/pvestatd.pm line 277.
Nov 24 22:31:56 lpnode1 pvestatd[12573]: Use of uninitialized value $max_cpuid in addition (+) at /usr/share/perl5/PVE/Service/pvestatd.pm line 277.
Nov 24 22:32:06 lpnode1 pvestatd[12573]: Use of uninitialized value $max_cpuid in addition (+) at /usr/share/perl5/PVE/Service/pvestatd.pm line 277.
Nov 24 22:32:16 lpnode1 pvestatd[12573]: Use of uninitialized value $max_cpuid in addition (+) at /usr/share/perl5/PVE/Service/pvestatd.pm line 277.
Nov 24 22:32:26 lpnode1 pvestatd[12573]: Use of uninitialized value $max_cpuid in addition (+) at /usr/share/perl5/PVE/Service/pvestatd.pm line 277.
Nov 24 22:32:36 lpnode1 pvestatd[12573]: Use of uninitialized value $max_cpuid in addition (+) at /usr/share/perl5/PVE/Service/pvestatd.pm line 277.
Nov 24 22:32:46 lpnode1 pvestatd[12573]: Use of uninitialized value $max_cpuid in addition (+) at /usr/share/perl5/PVE/Service/pvestatd.pm line 277.
Nov 24 22:32:56 lpnode1 pvestatd[12573]: Use of uninitialized value $max_cpuid in addition (+) at /usr/share/perl5/PVE/Service/pvestatd.pm line 277.
Nov 24 22:33:06 lpnode1 pvestatd[12573]: Use of uninitialized value $max_cpuid in addition (+) at /usr/share/perl5/PVE/Service/pvestatd.pm line 277.
Nov 24 22:34:00 lpnode1 watchdog-mux[10576]: client watchdog expired - disable watchdog updates
Nov 24 22:34:01 lpnode1 kernel: sd 0:0:0:0: attempting task abort!scmd(0x00000000b601dfd7), outstanding for 60812 ms & timeout 60000 ms
Nov 24 22:34:01 lpnode1 kernel: sd 0:0:0:0: [sda] tag#2484 CDB: ATA command pass through(16) 85 08 0e 00 d0 00 01 00 00 00 4f 00 c2 00 b0 00
Nov 24 22:34:01 lpnode1 kernel: scsi target0:0:0: handle(0x000b), sas_address(0x4433221102000000), phy(2)
Nov 24 22:34:01 lpnode1 kernel: scsi target0:0:0: enclosure logical id(0x500605b00e9c0400), slot(0)
Nov 24 22:34:01 lpnode1 kernel: scsi target0:0:0: enclosure level(0x0000), connector name(     )
Nov 24 22:34:01 lpnode1 smartd[10526]: Device: /dev/sda [SAT], failed to read SMART Attribute Data
Nov 24 22:34:01 lpnode1 smartd[10526]: Sending warning via /usr/share/smartmontools/smartd-runner to root ...
Nov 24 22:34:01 lpnode1 kernel: sd 0:0:0:0: task abort: SUCCESS scmd(0x00000000b601dfd7)
Nov 24 22:34:01 lpnode1 kernel: sd 0:0:0:0: attempting task abort!scmd(0x000000004c69ba9e), outstanding for 60848 ms & timeout 60000 ms
Nov 24 22:34:01 lpnode1 kernel: sd 0:0:0:0: [sda] tag#8408 CDB: Synchronize Cache(10) 35 00 00 00 00 00 00 00 00 00
Nov 24 22:34:01 lpnode1 kernel: scsi target0:0:0: handle(0x000b), sas_address(0x4433221102000000), phy(2)
Nov 24 22:34:01 lpnode1 kernel: scsi target0:0:0: enclosure logical id(0x500605b00e9c0400), slot(0)
Nov 24 22:34:01 lpnode1 kernel: scsi target0:0:0: enclosure level(0x0000), connector name(     )
Nov 24 22:34:01 lpnode1 kernel: sd 0:0:0:0: No reference found at driver, assuming scmd(0x000000004c69ba9e) might have completed
Nov 24 22:34:01 lpnode1 kernel: sd 0:0:0:0: task abort: SUCCESS scmd(0x000000004c69ba9e)
Nov 24 22:34:02 lpnode1 kernel: sd 0:0:0:0: Power-on or device reset occurred
Nov 24 22:34:02 lpnode1 zed[931767]: eid=324528 class=delay pool='rpool' vdev=sda3 size=20480 offset=546046148608 priority=1 err=0 flags=0x180880 delay=56063ms bookmark=771:0:-2:597210
Nov 24 22:34:02 lpnode1 zed[931762]: eid=324529 class=delay pool='rpool' vdev=sda3 size=20480 offset=520386023424 priority=1 err=0 flags=0x180880 delay=52467ms bookmark=643:0:-2:170130
Nov 24 22:34:02 lpnode1 zed[931784]: eid=324531 class=delay pool='rpool' vdev=sda3 size=20480 offset=520387584000 priority=1 err=0 flags=0x180880 delay=52467ms bookmark=643:0:-2:170132
Nov 24 22:34:02 lpnode1 zed[931806]: eid=324532 class=delay pool='rpool' vdev=sda3 size=20480 offset=520387633152 priority=1 err=0 flags=0x180880 delay=52467ms bookmark=643:0:-2:170133
Nov 24 22:34:02 lpnode1 smartd[10526]: Warning via /usr/share/smartmontools/smartd-runner to root: successful
Nov 24 22:34:02 lpnode1 zed[931825]: eid=324530 class=delay pool='rpool' vdev=sda3 size=20480 offset=520387248128 priority=1 err=0 flags=0x180880 delay=52467ms bookmark=643:0:-2:170131
Nov 24 22:34:02 lpnode1 zed[931844]: eid=324533 class=delay pool='rpool' vdev=sda3 size=20480 offset=520387686400 priority=1 err=0 flags=0x180880 delay=52466ms bookmark=643:0:-2:170134
Nov 24 22:34:02 lpnode1 zed[931832]: eid=324534 class=delay pool='rpool' vdev=sda3 size=20480 offset=545813991424 priority=1 err=0 flags=0x180880 delay=61673ms bookmark=104792:0:-2:354197092
Nov 24 22:34:02 lpnode1 smartd[10526]: Device: /dev/sdb [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 78 to 79
Nov 24 22:34:02 lpnode1 postfix/pickup[761052]: 060823BC79: uid=0 from=<root>
Nov 24 22:34:02 lpnode1 smartd[10526]: Device: /dev/sde [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 78 to 79
Nov 24 22:34:02 lpnode1 smartd[10526]: Device: /dev/sdf [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 78 to 79
Nov 24 22:34:02 lpnode1 smartd[10526]: Device: /dev/sdg [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 78 to 79
Nov 24 22:34:02 lpnode1 smartd[10526]: Device: /dev/sdh [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 78 to 79
Nov 24 22:34:02 lpnode1 postfix/cleanup[931713]: 060823BC79: message-id=<20211124213402.060823BC79@lpnode1.group.net>
Nov 24 22:34:02 lpnode1 postfix/qmgr[11917]: 060823BC79: from=<root@lpnode1.group.net>, size=1004, nrcpt=1 (queue active)
Nov 24 22:34:02 lpnode1 watchdog-mux[10576]: exit watchdog-mux with active connections
Nov 24 22:34:02 lpnode1 pve-ha-crm[12630]: loop take too long (55 seconds)
Nov 24 22:34:02 lpnode1 systemd[1]: watchdog-mux.service: Succeeded.
Nov 24 22:34:02 lpnode1 kernel: watchdog: watchdog0: watchdog did not stop!
-- Reboot --
 
Proxmox Node 2 Syslog

Code:
Nov 24 22:33:54 lpnode2 pvestatd[6203]: Use of uninitialized value $max_cpuid in addition (+) at /usr/share/perl5/PVE/Service/pvestatd.pm line 277.
Nov 24 22:34:02 lpnode2 pmxcfs[320485]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-node/lpnode1: -1
Nov 24 22:34:02 lpnode2 pmxcfs[320485]: [status] notice: RRD update error /var/lib/rrdcached/db/pve2-node/lpnode1: /var/lib/rrdcached/db/pve2-node/lpnode1: illegal attempt to update using time 1637789642 when last update time is 1637789642 (minimum one second step)
Nov 24 22:34:02 lpnode2 pmxcfs[320485]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/101: -1
Nov 24 22:34:02 lpnode2 pmxcfs[320485]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/500: -1
Nov 24 22:34:02 lpnode2 pmxcfs[320485]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/501: -1
Nov 24 22:34:02 lpnode2 pmxcfs[320485]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/510: -1
Nov 24 22:34:02 lpnode2 pmxcfs[320485]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/502: -1
Nov 24 22:34:02 lpnode2 pmxcfs[320485]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/590: -1
Nov 24 22:34:02 lpnode2 pmxcfs[320485]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/lpnode1/nas: -1
Nov 24 22:34:02 lpnode2 pmxcfs[320485]: [status] notice: RRD update error /var/lib/rrdcached/db/pve2-storage/lpnode1/nas: /var/lib/rrdcached/db/pve2-storage/lpnode1/nas: illegal attempt to update using time 1637789642 when last update time is 1637789642 (minimum one second step)
Nov 24 22:34:02 lpnode2 pmxcfs[320485]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/lpnode1/local: -1
Nov 24 22:34:02 lpnode2 pmxcfs[320485]: [status] notice: RRD update error /var/lib/rrdcached/db/pve2-storage/lpnode1/local: /var/lib/rrdcached/db/pve2-storage/lpnode1/local: illegal attempt to update using time 1637789642 when last update time is 1637789642 (minimum one second step)
Nov 24 22:34:02 lpnode2 pmxcfs[320485]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/lpnode1/local-zfs: -1
Nov 24 22:34:02 lpnode2 pmxcfs[320485]: [status] notice: RRD update error /var/lib/rrdcached/db/pve2-storage/lpnode1/local-zfs: /var/lib/rrdcached/db/pve2-storage/lpnode1/local-zfs: illegal attempt to update using time 1637789642 when last update time is 1637789642 (minimum one second step)
Nov 24 22:34:04 lpnode2 pvestatd[6203]: Use of uninitialized value $max_cpuid in addition (+) at /usr/share/perl5/PVE/Service/pvestatd.pm line 277.
Nov 24 22:34:13 lpnode2 corosync[6123]:   [KNET  ] link: host: 3 link: 0 is down
Nov 24 22:34:13 lpnode2 corosync[6123]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Nov 24 22:34:13 lpnode2 corosync[6123]:   [KNET  ] host: host: 3 has no active links
Nov 24 22:34:14 lpnode2 pvestatd[6203]: Use of uninitialized value $max_cpuid in addition (+) at /usr/share/perl5/PVE/Service/pvestatd.pm line 277.
Nov 24 22:34:14 lpnode2 corosync[6123]:   [TOTEM ] Token has not been received in 3225 ms
Nov 24 22:34:15 lpnode2 corosync[6123]:   [TOTEM ] A processor failed, forming new configuration: token timed out (4300ms), waiting 5160ms for consensus.
Nov 24 22:34:20 lpnode2 corosync[6123]:   [QUORUM] Sync members[3]: 1 2 4
Nov 24 22:34:20 lpnode2 corosync[6123]:   [QUORUM] Sync left[1]: 3
Nov 24 22:34:20 lpnode2 corosync[6123]:   [TOTEM ] A new membership (1.636) was formed. Members left: 3
Nov 24 22:34:20 lpnode2 corosync[6123]:   [TOTEM ] Failed to receive the leave message. failed: 3
Nov 24 22:34:20 lpnode2 pmxcfs[320485]: [dcdb] notice: members: 1/382408, 2/1981838, 4/320485
Nov 24 22:34:20 lpnode2 pmxcfs[320485]: [dcdb] notice: starting data syncronisation
Nov 24 22:34:20 lpnode2 pmxcfs[320485]: [status] notice: members: 1/382408, 2/1981838, 4/320485
Nov 24 22:34:20 lpnode2 pmxcfs[320485]: [status] notice: starting data syncronisation
Nov 24 22:34:20 lpnode2 corosync[6123]:   [QUORUM] Members[3]: 1 2 4
Nov 24 22:34:20 lpnode2 corosync[6123]:   [MAIN  ] Completed service synchronization, ready to provide service.
Nov 24 22:34:20 lpnode2 pmxcfs[320485]: [dcdb] notice: received sync request (epoch 1/382408/00000004)
Nov 24 22:34:20 lpnode2 pmxcfs[320485]: [status] notice: received sync request (epoch 1/382408/00000004)
Nov 24 22:34:20 lpnode2 pmxcfs[320485]: [dcdb] notice: received all states
Nov 24 22:34:20 lpnode2 pmxcfs[320485]: [dcdb] notice: leader is 1/382408
Nov 24 22:34:20 lpnode2 pmxcfs[320485]: [dcdb] notice: synced members: 1/382408, 2/1981838, 4/320485
Nov 24 22:34:20 lpnode2 pmxcfs[320485]: [dcdb] notice: all data is up to date
Nov 24 22:34:20 lpnode2 pmxcfs[320485]: [dcdb] notice: dfsm_deliver_queue: queue length 8
Nov 24 22:34:20 lpnode2 pmxcfs[320485]: [status] notice: received all states
Nov 24 22:34:20 lpnode2 pmxcfs[320485]: [status] notice: all data is up to date
Nov 24 22:34:20 lpnode2 pmxcfs[320485]: [status] notice: dfsm_deliver_queue: queue length 40
Nov 24 22:35:34 lpnode2 pvestatd[6203]: Use of uninitialized value $max_cpuid in addition (+) at /usr/share/perl5/PVE/Service/pvestatd.pm line 277.
Nov 24 22:35:44 lpnode2 pvestatd[6203]: Use of uninitialized value $max_cpuid in addition (+) at /usr/share/perl5/PVE/Service/pvestatd.pm line 277.
Nov 24 22:35:54 lpnode2 pvestatd[6203]: Use of uninitialized value $max_cpuid in addition (+) at /usr/share/perl5/PVE/Service/pvestatd.pm line 277.
Nov 24 22:36:04 lpnode2 pvestatd[6203]: Use of uninitialized value $max_cpuid in addition (+) at /usr/share/perl5/PVE/Service/pvestatd.pm line 277.
Nov 24 22:36:14 lpnode2 pvestatd[6203]: Use of uninitialized value $max_cpuid in addition (+) at /usr/share/perl5/PVE/Service/pvestatd.pm line 277.
Nov 24 22:36:20 lpnode2 pve-ha-lrm[1770059]: starting service vm:500
Nov 24 22:36:20 lpnode2 pve-ha-lrm[1770058]: starting service vm:101
Nov 24 22:36:20 lpnode2 pve-ha-lrm[1770060]: starting service vm:510
Nov 24 22:36:20 lpnode2 pve-ha-lrm[1770064]: start VM 510: UPID:lpnode2:001B0250:066FD491:619EB054:qmstart:510:root@pam:
Nov 24 22:36:20 lpnode2 pve-ha-lrm[1770065]: start VM 101: UPID:lpnode2:001B0251:066FD491:619EB054:qmstart:101:root@pam:
Nov 24 22:36:20 lpnode2 pve-ha-lrm[1770066]: start VM 500: UPID:lpnode2:001B0252:066FD492:619EB054:qmstart:500:root@pam:
Nov 24 22:36:20 lpnode2 pve-ha-lrm[1770060]: <root@pam> starting task UPID:lpnode2:001B0250:066FD491:619EB054:qmstart:510:root@pam:
Nov 24 22:36:20 lpnode2 pve-ha-lrm[1770059]: <root@pam> starting task UPID:lpnode2:001B0252:066FD492:619EB054:qmstart:500:root@pam:
Nov 24 22:36:20 lpnode2 pve-ha-lrm[1770058]: <root@pam> starting task UPID:lpnode2:001B0251:066FD491:619EB054:qmstart:101:root@pam:
Nov 24 22:36:20 lpnode2 systemd[1]: Started 101.scope.
Nov 24 22:36:20 lpnode2 systemd[1]: Started 500.scope.
Nov 24 22:36:20 lpnode2 systemd[1]: Started 510.scope.
Nov 24 22:36:22 lpnode2 systemd-udevd[1770530]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Nov 24 22:36:22 lpnode2 pve-ha-lrm[1770059]: <root@pam> end task UPID:lpnode2:001B0252:066FD492:619EB054:qmstart:500:root@pam: OK
Nov 24 22:36:22 lpnode2 pve-ha-lrm[1770059]: service status vm:500 started
Nov 24 22:36:22 lpnode2 pve-ha-lrm[1770058]: <root@pam> end task UPID:lpnode2:001B0251:066FD491:619EB054:qmstart:101:root@pam: OK
Nov 24 22:36:22 lpnode2 pve-ha-lrm[1770058]: service status vm:101 started
Nov 24 22:36:23 lpnode2 pve-ha-lrm[1770060]: <root@pam> end task UPID:lpnode2:001B0250:066FD491:619EB054:qmstart:510:root@pam: OK
Nov 24 22:36:23 lpnode2 pve-ha-lrm[1770060]: service status vm:510 started
Nov 24 22:36:24 lpnode2 pvestatd[6203]: Use of uninitialized value $max_cpuid in addition (+) at /usr/share/perl5/PVE/Service/pvestatd.pm line 277.
Nov 24 22:36:34 lpnode2 pvestatd[6203]: Use of uninitialized value $max_cpuid in addition (+) at /usr/share/perl5/PVE/Service/pvestatd.pm line 277.
Nov 24 22:37:07 lpnode2 pvescheduler[1804965]: VM 101 qmp command failed - VM 101 qmp command 'guest-ping' failed - got timeout
Nov 24 22:37:07 lpnode2 pvescheduler[1804965]: Qemu Guest Agent is not running - VM 101 qmp command 'guest-ping' failed - got timeout
Nov 24 22:37:11 lpnode2 pvescheduler[1804965]: 101-0: got unexpected replication job error - command '/usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=lpnode1' root@yyy.yyy.yyy.yyy pvecm mtunnel -migration_network 10.101.53.160/27 -get_migration_ip' failed: exit code 255
 
Nov 24 22:34:00 lpnode1 watchdog-mux[10576]: client watchdog expired - disable watchdog updates
Nov 24 22:34:01 lpnode1 kernel: sd 0:0:0:0: attempting task abort!scmd(0x00000000b601dfd7), outstanding for 60812 ms & timeout 60000 ms
Nov 24 22:34:01 lpnode1 kernel: sd 0:0:0:0: [sda] tag#2484 CDB: ATA command pass through(16) 85 08 0e 00 d0 00 01 00 00 00 4f 00 c2 00 b0 00
So the kernel reports a command timeout on the sda device, which is very likely the actual cause, or at least closely related to it.

That left the CRM hanging, so the watchdog did not get updated anymore and expired -> the node got fenced.

In other words, it seems there was an (IO) issue that made the node (close to) completely unresponsive, and thus HA did its job and fenced it.

The amount of "Use of uninitialized value $max_cpuid in addition" messages is odd, but not problematic in itself.

Older logs on that node may be interesting, if only to make sure the causality chain really ends at sda being in a bogus, hanging state, triggering the further hang and thus the self-fence.
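
A possible starting point for going through the older logs in that window (timestamps taken from the excerpts above):

Code:
# all messages around the incident on the rebooted node
journalctl --since "2021-11-24 22:25" --until "2021-11-24 22:40" -o short-precise
# kernel messages only (disk/controller errors, task aborts, hung tasks)
journalctl -k --since "2021-11-24 22:25" --until "2021-11-24 22:40"
# or jump straight to the end of the previous boot's journal
# (requires a persistent journal)
journalctl -b -1 -e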
 
Hi Thomas,

I have gone through the old logs.
On 09/27/21 I did the Proxmox 6 to 7 upgrade.
The errors have appeared ever since.

Could it have something to do with CVE-2018-3646?

Code:
Sep 27 11:17:23 lpnode01 kernel: L1TF CPU bug present and SMT on, data leak possible. See CVE-2018-3646 and https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/l1tf.html for details.
Sep 27 11:17:23 lpnode01 pve-ha-crm[6409]: status change wait_for_quorum => slave
Sep 27 11:17:23 lpnode01 zebra[4470]: if_zebra_speed_update: vxlan_vpl01 old speed: 0 new speed: 4294967295
Sep 27 11:17:23 lpnode01 zebra[4470]: if_zebra_speed_update: vpl01 old speed: 0 new speed: 4294967295
Sep 27 11:17:23 lpnode01 zebra[4470]: if_zebra_speed_update: vxlan_vpl02 old speed: 0 new speed: 4294967295
Sep 27 11:17:23 lpnode01 zebra[4470]: if_zebra_speed_update: vpl02 old speed: 0 new speed: 4294967295
Sep 27 11:17:24 lpnode01 pvestatd[6358]: Use of uninitialized value $max_cpuid in addition (+) at /usr/share/perl5/PVE/Service/pvestatd.pm line 276.
Sep 27 11:17:24 lpnode01 kernel: FS-Cache: Loaded
Sep 27 11:17:24 lpnode01 kernel: FS-Cache: Netfs 'nfs' registered for caching
Sep 27 11:17:24 lpnode01 kernel: NFS: Registering the id_resolver key type
Sep 27 11:17:24 lpnode01 kernel: Key type id_resolver registered
Sep 27 11:17:24 lpnode01 kernel: Key type id_legacy registered
Sep 27 11:17:24 lpnode01 systemd[1]: Reached target Host and Network Name Lookups.
Sep 27 11:17:24 lpnode01 systemd[1]: Starting Preprocess NFS configuration...
Sep 27 11:17:24 lpnode01 systemd[1]: nfs-config.service: Succeeded.
Sep 27 11:17:24 lpnode01 systemd[1]: Finished Preprocess NFS configuration.
Sep 27 11:17:24 lpnode01 systemd[1]: Starting Notify NFS peers of a restart...
Sep 27 11:17:24 lpnode01 systemd[1]: Starting NFS status monitor for NFSv2/3 locking....
Sep 27 11:17:24 lpnode01 sm-notify[7040]: Version 1.3.3 starting
Sep 27 11:17:24 lpnode01 systemd[1]: rpc-statd-notify.service: Succeeded.
Sep 27 11:17:24 lpnode01 systemd[1]: Started Notify NFS peers of a restart.
Sep 27 11:17:24 lpnode01 rpc.statd[7043]: Version 1.3.3 starting
Sep 27 11:17:24 lpnode01 rpc.statd[7043]: Flags: TI-RPC
Sep 27 11:17:24 lpnode01 rpc.statd[7043]: Failed to read /var/lib/nfs/state: Success
Sep 27 11:17:24 lpnode01 rpc.statd[7043]: Initializing NSM state
Sep 27 11:17:24 lpnode01 systemd[1]: Started NFS status monitor for NFSv2/3 locking..
Sep 27 11:17:32 lpnode01 pvestatd[6358]: Use of uninitialized value $max_cpuid in addition (+) at /usr/share/perl5/PVE/Service/pvestatd.pm line 276.
Sep 27 11:17:42 lpnode01 pvestatd[6358]: Use of uninitialized value $max_cpuid in addition (+) at /usr/share/perl5/PVE/Service/pvestatd.pm line 276.
Sep 27 11:17:52 lpnode01 pvestatd[6358]: Use of uninitialized value $max_cpuid in addition (+) at /usr/share/perl5/PVE/Service/pvestatd.pm line 276.
 
Could it have something to do with CVE-2018-3646?
Not very likely. L1TF is rather old, and this is just an informational log from the kernel about a CPU issue (ensure the latest firmware and microcode are installed), but that bug won't trigger such behavior.
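
To see how the kernel currently rates L1TF on those hosts and which microcode revision is loaded, roughly:

Code:
# mitigation status as reported by the kernel
cat /sys/devices/system/cpu/vulnerabilities/l1tf
# currently loaded microcode revision
grep -m1 microcode /proc/cpuinfo
# on Intel systems the updated microcode package is typically
# "intel-microcode" (from the non-free component), installable via apt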

On 09/27/21 I did the Proxmox 6 to 7 upgrade.
That's a few months back, so it's unlikely to be related to this incident now.

What's IMO more likely is a HW issue (check the disks/disk controller/...), a service or VM going haywire and overloading the system, or maybe a kernel bug with your specific HW - mentioning that last one just because it's not impossible.
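
A few quick checks in that direction (assuming ZFS on local disks, as the zed/rpool messages suggest; smartmontools and sysstat provide the tools used here):

Code:
# SMART health of the disk that reported the command timeouts
smartctl -a /dev/sda
# pool state and any pending/known errors
zpool status -v rpool
# per-device latency and utilisation, 3 samples at 5 second intervals
iostat -x 5 3
# controller / disk related kernel messages since boot
dmesg -T | grep -iE 'task abort|timeout|reset|hung'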

But for a closer evaluation, a more active investigation would be required - gathering all hardware details, VM configs, storage and disk layout, and reading the full logs. That's rather out of scope for the community forum, as it would mean posting lots of possibly private details and needing someone willing to spend time on the investigation. For such cases our enterprise support exists: https://www.proxmox.com/en/proxmox-ve/pricing
 
