4-Node Cluster problem - pvestatd $max_cpuid

Tdreissi

Hello,

I run a 4-node cluster.
Proxmox version 7.1-6

I see the same message on all 4 nodes, both in syslog and in "systemctl status pvestatd".
The cluster and the VMs are running, but tonight one node simply rebooted for no apparent reason.
My single-node systems do not show this error.

pvestatd[6203]: Use of uninitialized value $max_cpuid in addition (+) at /usr/share/perl5/PVE/Service/pvestatd.pm line 277.

I need your help - thanks...
 
Hi,

did you boot into the older cgroup version (legacy cgroup v1) or the like?

Else, what's the output of:
find /sys/fs/cgroup/lxc/ -name cpuset.cpus -print -exec cat {} \;

And the config of an affected container would also be interesting: pct config VMID
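
Just for context, and only as an illustration (this is not the actual pvestatd code): the warning presumably shows up when the cpuset range being parsed comes back empty or missing, so no highest CPU id can be derived from it. A rough shell equivalent:

Code:
# illustration only: derive the highest CPU id from a cpuset range
# string such as "0-39" or "0-3,8-11"
cpuset="$(cat /sys/fs/cgroup/cpuset.cpus.effective)"
# take the upper bound of the last range; an empty string yields
# nothing, the shell analogue of Perl's "uninitialized value"
max_cpuid="$(echo "$cpuset" | tr ',' '\n' | tail -n1 | awk -F- '{print $NF}')"
echo "max cpu id: ${max_cpuid:-<undefined>}"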
 
Hi,

I only have VMs, no containers.


/sys/fs/cgroup/lxc - not available

Code:
/sys/fs/cgroup# find ./ -name cpuset.cpus -print -exec cat {} \;
./sys-fs-fuse-connections.mount/cpuset.cpus
./sys-kernel-config.mount/cpuset.cpus
./sys-kernel-debug.mount/cpuset.cpus
./qemu.slice/cpuset.cpus
./dev-mqueue.mount/cpuset.cpus
./user.slice/cpuset.cpus
./sys-kernel-tracing.mount/cpuset.cpus
./init.scope/cpuset.cpus
./proc-fs-nfsd.mount/cpuset.cpus
./system.slice/pvescheduler.service/cpuset.cpus
./system.slice/systemd-udevd.service/cpuset.cpus
./system.slice/cron.service/cpuset.cpus
./system.slice/nfs-mountd.service/cpuset.cpus
./system.slice/rpcbind.socket/cpuset.cpus
./system.slice/pve-firewall.service/cpuset.cpus
./system.slice/pve-lxc-syscalld.service/cpuset.cpus
./system.slice/spiceproxy.service/cpuset.cpus
./system.slice/pve-ha-crm.service/cpuset.cpus
./system.slice/system-modprobe.slice/cpuset.cpus
./system.slice/pvedaemon.service/cpuset.cpus
./system.slice/systemd-journald.service/cpuset.cpus
./system.slice/frr.service/cpuset.cpus
./system.slice/ssh.service/cpuset.cpus
./system.slice/qmeventd.service/cpuset.cpus
./system.slice/rrdcached.service/cpuset.cpus
./system.slice/watchdog-mux.service/cpuset.cpus
./system.slice/rpc-statd.service/cpuset.cpus
./system.slice/pvefw-logger.service/cpuset.cpus
./system.slice/nfs-blkmap.service/cpuset.cpus
./system.slice/rsyslog.service/cpuset.cpus
./system.slice/pveproxy.service/cpuset.cpus
./system.slice/zabbix-agent2.service/cpuset.cpus
./system.slice/ksmtuned.service/cpuset.cpus
./system.slice/lxc-monitord.service/cpuset.cpus
./system.slice/rpcbind.service/cpuset.cpus
./system.slice/lxcfs.service/cpuset.cpus
./system.slice/nfs-idmapd.service/cpuset.cpus
./system.slice/corosync.service/cpuset.cpus
./system.slice/run-rpc_pipefs.mount/cpuset.cpus
./system.slice/system-postfix.slice/cpuset.cpus
./system.slice/smartmontools.service/cpuset.cpus
./system.slice/xagt.service/cpuset.cpus
./system.slice/system-lvm2\x2dpvscan.slice/cpuset.cpus
./system.slice/iscsid.service/cpuset.cpus
./system.slice/zfs-zed.service/cpuset.cpus
./system.slice/pve-cluster.service/cpuset.cpus
./system.slice/acpid.service/cpuset.cpus
./system.slice/ntp.service/cpuset.cpus
./system.slice/cpuset.cpus
./system.slice/dbus.service/cpuset.cpus
./system.slice/pve-ha-lrm.service/cpuset.cpus
./system.slice/system-getty.slice/cpuset.cpus
./system.slice/pvestatd.service/cpuset.cpus
./system.slice/systemd-logind.service/cpuset.cpus
./dev-hugepages.mount/cpuset.cpus

/sys/fs/cgroup# cat cpuset.cpus.effective
0-39
/sys/fs/cgroup# cat cpuset.mems.effective
0-1
 
HA is enabled.
I have two HA groups.
Group A: Node 1 + 2
Group B: Node 3 + 4

All VMs are HA-managed and have replication configured.
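
For reference, a setup like that usually boils down to something along these lines in /etc/pve/ha/groups.cfg (group and node names below are just placeholders, and options such as restricted or nofailback depend on your configuration):

Code:
group: groupA
        nodes node1,node2

group: groupB
        nodes node3,node4

You can dump the actual configuration with cat /etc/pve/ha/groups.cfg on any node.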
 
Hi,

syslog events from 22:34 - 22:46

Code:
Nov 24 22:34:02 lpnode1 pmxcfs[2982123]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/lpnode1/nas: -1
Nov 24 22:34:02 lpnode1 pmxcfs[2982123]: [status] notice: RRD update error /var/lib/rrdcached/db/pve2-storage/lpnode1/nas: /var/lib/rrdcached/db/pve2-storage/lpnode1/nas: illegal attempt to update using time 1637789642 when last update time is 1637789642 (minimum one second step)
Nov 24 22:34:02 lpnode1 pmxcfs[2982123]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/lpnode1/local: -1
Nov 24 22:34:02 lpnode1 pmxcfs[2982123]: [status] notice: RRD update error /var/lib/rrdcached/db/pve2-storage/lpnode1/local: /var/lib/rrdcached/db/pve2-storage/lpnode1/local: illegal attempt to update using time 1637789642 when last update time is 1637789642 (minimum one second step)
Nov 24 22:34:02 lpnode1 pmxcfs[2982123]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/lpnode1/local-zfs: -1
Nov 24 22:34:02 lpnode1 pmxcfs[2982123]: [status] notice: RRD update error /var/lib/rrdcached/db/pve2-storage/lpnode1/local-zfs: /var/lib/rrdcached/db/pve2-storage/lpnode1/local-zfs: illegal attempt to update using time 1637789642 when last update time is 1637789642 (minimum one second step)
Nov 24 22:46:28 lpnode1 systemd-modules-load[10666]: Inserted module 'vfio'
 
While those messages by themselves are not a direct crash of the cluster filesystem and do not drop quorum (which would then let HA trigger the watchdog), they could be an indication of the actual issue:

Is time synchronization configured and working on those servers? In other words, is ntp, chrony or the like installed and running?

Also, what happened around the crash time on the other nodes (corosync/kronosnet/pmxcfs and HA messages would be most interesting)? Further, how are those four nodes connected? Is there a separate, or at least not-so-chatty, network for the cluster communication?
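
To check both points, something along these lines can be run on each node (assuming chrony, systemd-timesyncd or classic ntpd as the time source):

Code:
# time synchronization status
timedatectl status
chronyc tracking              # if chrony is in use; for ntpd: ntpq -p

# cluster / corosync link health
pvecm status
corosync-cfgtool -s           # per-link status as seen by kronosnet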
 
Hi,

Proxmox GUI syslog:

Code:
Nov 24 22:30:06 lpnode1 pvestatd[12573]: Use of uninitialized value $max_cpuid in addition (+) at /usr/share/perl5/PVE/Service/pvestatd.pm line 277.
Nov 24 22:30:16 lpnode1 pvestatd[12573]: Use of uninitialized value $max_cpuid in addition (+) at /usr/share/perl5/PVE/Service/pvestatd.pm line 277.
Nov 24 22:30:26 lpnode1 pvestatd[12573]: Use of uninitialized value $max_cpuid in addition (+) at /usr/share/perl5/PVE/Service/pvestatd.pm line 277.
Nov 24 22:30:36 lpnode1 pvestatd[12573]: Use of uninitialized value $max_cpuid in addition (+) at /usr/share/perl5/PVE/Service/pvestatd.pm line 277.
Nov 24 22:30:46 lpnode1 pvestatd[12573]: Use of uninitialized value $max_cpuid in addition (+) at /usr/share/perl5/PVE/Service/pvestatd.pm line 277.
Nov 24 22:30:56 lpnode1 pvestatd[12573]: Use of uninitialized value $max_cpuid in addition (+) at /usr/share/perl5/PVE/Service/pvestatd.pm line 277.
Nov 24 22:31:06 lpnode1 pvestatd[12573]: Use of uninitialized value $max_cpuid in addition (+) at /usr/share/perl5/PVE/Service/pvestatd.pm line 277.
Nov 24 22:31:16 lpnode1 pvestatd[12573]: Use of uninitialized value $max_cpuid in addition (+) at /usr/share/perl5/PVE/Service/pvestatd.pm line 277.
Nov 24 22:31:26 lpnode1 pvestatd[12573]: Use of uninitialized value $max_cpuid in addition (+) at /usr/share/perl5/PVE/Service/pvestatd.pm line 277.
Nov 24 22:31:36 lpnode1 pvestatd[12573]: Use of uninitialized value $max_cpuid in addition (+) at /usr/share/perl5/PVE/Service/pvestatd.pm line 277.
Nov 24 22:31:46 lpnode1 pvestatd[12573]: Use of uninitialized value $max_cpuid in addition (+) at /usr/share/perl5/PVE/Service/pvestatd.pm line 277.
Nov 24 22:31:56 lpnode1 pvestatd[12573]: Use of uninitialized value $max_cpuid in addition (+) at /usr/share/perl5/PVE/Service/pvestatd.pm line 277.
Nov 24 22:32:06 lpnode1 pvestatd[12573]: Use of uninitialized value $max_cpuid in addition (+) at /usr/share/perl5/PVE/Service/pvestatd.pm line 277.
Nov 24 22:32:16 lpnode1 pvestatd[12573]: Use of uninitialized value $max_cpuid in addition (+) at /usr/share/perl5/PVE/Service/pvestatd.pm line 277.
Nov 24 22:32:26 lpnode1 pvestatd[12573]: Use of uninitialized value $max_cpuid in addition (+) at /usr/share/perl5/PVE/Service/pvestatd.pm line 277.
Nov 24 22:32:36 lpnode1 pvestatd[12573]: Use of uninitialized value $max_cpuid in addition (+) at /usr/share/perl5/PVE/Service/pvestatd.pm line 277.
Nov 24 22:32:46 lpnode1 pvestatd[12573]: Use of uninitialized value $max_cpuid in addition (+) at /usr/share/perl5/PVE/Service/pvestatd.pm line 277.
Nov 24 22:32:56 lpnode1 pvestatd[12573]: Use of uninitialized value $max_cpuid in addition (+) at /usr/share/perl5/PVE/Service/pvestatd.pm line 277.
Nov 24 22:33:06 lpnode1 pvestatd[12573]: Use of uninitialized value $max_cpuid in addition (+) at /usr/share/perl5/PVE/Service/pvestatd.pm line 277.
Nov 24 22:34:00 lpnode1 watchdog-mux[10576]: client watchdog expired - disable watchdog updates
Nov 24 22:34:01 lpnode1 kernel: sd 0:0:0:0: attempting task abort!scmd(0x00000000b601dfd7), outstanding for 60812 ms & timeout 60000 ms
Nov 24 22:34:01 lpnode1 kernel: sd 0:0:0:0: [sda] tag#2484 CDB: ATA command pass through(16) 85 08 0e 00 d0 00 01 00 00 00 4f 00 c2 00 b0 00
Nov 24 22:34:01 lpnode1 kernel: scsi target0:0:0: handle(0x000b), sas_address(0x4433221102000000), phy(2)
Nov 24 22:34:01 lpnode1 kernel: scsi target0:0:0: enclosure logical id(0x500605b00e9c0400), slot(0)
Nov 24 22:34:01 lpnode1 kernel: scsi target0:0:0: enclosure level(0x0000), connector name(     )
Nov 24 22:34:01 lpnode1 smartd[10526]: Device: /dev/sda [SAT], failed to read SMART Attribute Data
Nov 24 22:34:01 lpnode1 smartd[10526]: Sending warning via /usr/share/smartmontools/smartd-runner to root ...
Nov 24 22:34:01 lpnode1 kernel: sd 0:0:0:0: task abort: SUCCESS scmd(0x00000000b601dfd7)
Nov 24 22:34:01 lpnode1 kernel: sd 0:0:0:0: attempting task abort!scmd(0x000000004c69ba9e), outstanding for 60848 ms & timeout 60000 ms
Nov 24 22:34:01 lpnode1 kernel: sd 0:0:0:0: [sda] tag#8408 CDB: Synchronize Cache(10) 35 00 00 00 00 00 00 00 00 00
Nov 24 22:34:01 lpnode1 kernel: scsi target0:0:0: handle(0x000b), sas_address(0x4433221102000000), phy(2)
Nov 24 22:34:01 lpnode1 kernel: scsi target0:0:0: enclosure logical id(0x500605b00e9c0400), slot(0)
Nov 24 22:34:01 lpnode1 kernel: scsi target0:0:0: enclosure level(0x0000), connector name(     )
Nov 24 22:34:01 lpnode1 kernel: sd 0:0:0:0: No reference found at driver, assuming scmd(0x000000004c69ba9e) might have completed
Nov 24 22:34:01 lpnode1 kernel: sd 0:0:0:0: task abort: SUCCESS scmd(0x000000004c69ba9e)
Nov 24 22:34:02 lpnode1 kernel: sd 0:0:0:0: Power-on or device reset occurred
Nov 24 22:34:02 lpnode1 zed[931767]: eid=324528 class=delay pool='rpool' vdev=sda3 size=20480 offset=546046148608 priority=1 err=0 flags=0x180880 delay=56063ms bookmark=771:0:-2:597210
Nov 24 22:34:02 lpnode1 zed[931762]: eid=324529 class=delay pool='rpool' vdev=sda3 size=20480 offset=520386023424 priority=1 err=0 flags=0x180880 delay=52467ms bookmark=643:0:-2:170130
Nov 24 22:34:02 lpnode1 zed[931784]: eid=324531 class=delay pool='rpool' vdev=sda3 size=20480 offset=520387584000 priority=1 err=0 flags=0x180880 delay=52467ms bookmark=643:0:-2:170132
Nov 24 22:34:02 lpnode1 zed[931806]: eid=324532 class=delay pool='rpool' vdev=sda3 size=20480 offset=520387633152 priority=1 err=0 flags=0x180880 delay=52467ms bookmark=643:0:-2:170133
Nov 24 22:34:02 lpnode1 smartd[10526]: Warning via /usr/share/smartmontools/smartd-runner to root: successful
Nov 24 22:34:02 lpnode1 zed[931825]: eid=324530 class=delay pool='rpool' vdev=sda3 size=20480 offset=520387248128 priority=1 err=0 flags=0x180880 delay=52467ms bookmark=643:0:-2:170131
Nov 24 22:34:02 lpnode1 zed[931844]: eid=324533 class=delay pool='rpool' vdev=sda3 size=20480 offset=520387686400 priority=1 err=0 flags=0x180880 delay=52466ms bookmark=643:0:-2:170134
Nov 24 22:34:02 lpnode1 zed[931832]: eid=324534 class=delay pool='rpool' vdev=sda3 size=20480 offset=545813991424 priority=1 err=0 flags=0x180880 delay=61673ms bookmark=104792:0:-2:354197092
Nov 24 22:34:02 lpnode1 smartd[10526]: Device: /dev/sdb [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 78 to 79
Nov 24 22:34:02 lpnode1 postfix/pickup[761052]: 060823BC79: uid=0 from=<root>
Nov 24 22:34:02 lpnode1 smartd[10526]: Device: /dev/sde [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 78 to 79
Nov 24 22:34:02 lpnode1 smartd[10526]: Device: /dev/sdf [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 78 to 79
Nov 24 22:34:02 lpnode1 smartd[10526]: Device: /dev/sdg [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 78 to 79
Nov 24 22:34:02 lpnode1 smartd[10526]: Device: /dev/sdh [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 78 to 79
Nov 24 22:34:02 lpnode1 postfix/cleanup[931713]: 060823BC79: message-id=<20211124213402.060823BC79@lpnode1.group.net>
Nov 24 22:34:02 lpnode1 postfix/qmgr[11917]: 060823BC79: from=<root@lpnode1.group.net>, size=1004, nrcpt=1 (queue active)
Nov 24 22:34:02 lpnode1 watchdog-mux[10576]: exit watchdog-mux with active connections
Nov 24 22:34:02 lpnode1 pve-ha-crm[12630]: loop take too long (55 seconds)
Nov 24 22:34:02 lpnode1 systemd[1]: watchdog-mux.service: Succeeded.
Nov 24 22:34:02 lpnode1 kernel: watchdog: watchdog0: watchdog did not stop!
-- Reboot --
 
Proxmox Node 2 Syslog

Code:
Nov 24 22:33:54 lpnode2 pvestatd[6203]: Use of uninitialized value $max_cpuid in addition (+) at /usr/share/perl5/PVE/Service/pvestatd.pm line 277.
Nov 24 22:34:02 lpnode2 pmxcfs[320485]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-node/lpnode1: -1
Nov 24 22:34:02 lpnode2 pmxcfs[320485]: [status] notice: RRD update error /var/lib/rrdcached/db/pve2-node/lpnode1: /var/lib/rrdcached/db/pve2-node/lpnode1: illegal attempt to update using time 1637789642 when last update time is 1637789642 (minimum one second step)
Nov 24 22:34:02 lpnode2 pmxcfs[320485]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/101: -1
Nov 24 22:34:02 lpnode2 pmxcfs[320485]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/500: -1
Nov 24 22:34:02 lpnode2 pmxcfs[320485]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/501: -1
Nov 24 22:34:02 lpnode2 pmxcfs[320485]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/510: -1
Nov 24 22:34:02 lpnode2 pmxcfs[320485]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/502: -1
Nov 24 22:34:02 lpnode2 pmxcfs[320485]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/590: -1
Nov 24 22:34:02 lpnode2 pmxcfs[320485]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/lpnode1/nas: -1
Nov 24 22:34:02 lpnode2 pmxcfs[320485]: [status] notice: RRD update error /var/lib/rrdcached/db/pve2-storage/lpnode1/nas: /var/lib/rrdcached/db/pve2-storage/lpnode1/nas: illegal attempt to update using time 1637789642 when last update time is 1637789642 (minimum one second step)
Nov 24 22:34:02 lpnode2 pmxcfs[320485]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/lpnode1/local: -1
Nov 24 22:34:02 lpnode2 pmxcfs[320485]: [status] notice: RRD update error /var/lib/rrdcached/db/pve2-storage/lpnode1/local: /var/lib/rrdcached/db/pve2-storage/lpnode1/local: illegal attempt to update using time 1637789642 when last update time is 1637789642 (minimum one second step)
Nov 24 22:34:02 lpnode2 pmxcfs[320485]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/lpnode1/local-zfs: -1
Nov 24 22:34:02 lpnode2 pmxcfs[320485]: [status] notice: RRD update error /var/lib/rrdcached/db/pve2-storage/lpnode1/local-zfs: /var/lib/rrdcached/db/pve2-storage/lpnode1/local-zfs: illegal attempt to update using time 1637789642 when last update time is 1637789642 (minimum one second step)
Nov 24 22:34:04 lpnode2 pvestatd[6203]: Use of uninitialized value $max_cpuid in addition (+) at /usr/share/perl5/PVE/Service/pvestatd.pm line 277.
Nov 24 22:34:13 lpnode2 corosync[6123]:   [KNET  ] link: host: 3 link: 0 is down
Nov 24 22:34:13 lpnode2 corosync[6123]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Nov 24 22:34:13 lpnode2 corosync[6123]:   [KNET  ] host: host: 3 has no active links
Nov 24 22:34:14 lpnode2 pvestatd[6203]: Use of uninitialized value $max_cpuid in addition (+) at /usr/share/perl5/PVE/Service/pvestatd.pm line 277.
Nov 24 22:34:14 lpnode2 corosync[6123]:   [TOTEM ] Token has not been received in 3225 ms
Nov 24 22:34:15 lpnode2 corosync[6123]:   [TOTEM ] A processor failed, forming new configuration: token timed out (4300ms), waiting 5160ms for consensus.
Nov 24 22:34:20 lpnode2 corosync[6123]:   [QUORUM] Sync members[3]: 1 2 4
Nov 24 22:34:20 lpnode2 corosync[6123]:   [QUORUM] Sync left[1]: 3
Nov 24 22:34:20 lpnode2 corosync[6123]:   [TOTEM ] A new membership (1.636) was formed. Members left: 3
Nov 24 22:34:20 lpnode2 corosync[6123]:   [TOTEM ] Failed to receive the leave message. failed: 3
Nov 24 22:34:20 lpnode2 pmxcfs[320485]: [dcdb] notice: members: 1/382408, 2/1981838, 4/320485
Nov 24 22:34:20 lpnode2 pmxcfs[320485]: [dcdb] notice: starting data syncronisation
Nov 24 22:34:20 lpnode2 pmxcfs[320485]: [status] notice: members: 1/382408, 2/1981838, 4/320485
Nov 24 22:34:20 lpnode2 pmxcfs[320485]: [status] notice: starting data syncronisation
Nov 24 22:34:20 lpnode2 corosync[6123]:   [QUORUM] Members[3]: 1 2 4
Nov 24 22:34:20 lpnode2 corosync[6123]:   [MAIN  ] Completed service synchronization, ready to provide service.
Nov 24 22:34:20 lpnode2 pmxcfs[320485]: [dcdb] notice: received sync request (epoch 1/382408/00000004)
Nov 24 22:34:20 lpnode2 pmxcfs[320485]: [status] notice: received sync request (epoch 1/382408/00000004)
Nov 24 22:34:20 lpnode2 pmxcfs[320485]: [dcdb] notice: received all states
Nov 24 22:34:20 lpnode2 pmxcfs[320485]: [dcdb] notice: leader is 1/382408
Nov 24 22:34:20 lpnode2 pmxcfs[320485]: [dcdb] notice: synced members: 1/382408, 2/1981838, 4/320485
Nov 24 22:34:20 lpnode2 pmxcfs[320485]: [dcdb] notice: all data is up to date
Nov 24 22:34:20 lpnode2 pmxcfs[320485]: [dcdb] notice: dfsm_deliver_queue: queue length 8
Nov 24 22:34:20 lpnode2 pmxcfs[320485]: [status] notice: received all states
Nov 24 22:34:20 lpnode2 pmxcfs[320485]: [status] notice: all data is up to date
Nov 24 22:34:20 lpnode2 pmxcfs[320485]: [status] notice: dfsm_deliver_queue: queue length 40
Nov 24 22:35:34 lpnode2 pvestatd[6203]: Use of uninitialized value $max_cpuid in addition (+) at /usr/share/perl5/PVE/Service/pvestatd.pm line 277.
Nov 24 22:35:44 lpnode2 pvestatd[6203]: Use of uninitialized value $max_cpuid in addition (+) at /usr/share/perl5/PVE/Service/pvestatd.pm line 277.
Nov 24 22:35:54 lpnode2 pvestatd[6203]: Use of uninitialized value $max_cpuid in addition (+) at /usr/share/perl5/PVE/Service/pvestatd.pm line 277.
Nov 24 22:36:04 lpnode2 pvestatd[6203]: Use of uninitialized value $max_cpuid in addition (+) at /usr/share/perl5/PVE/Service/pvestatd.pm line 277.
Nov 24 22:36:14 lpnode2 pvestatd[6203]: Use of uninitialized value $max_cpuid in addition (+) at /usr/share/perl5/PVE/Service/pvestatd.pm line 277.
Nov 24 22:36:20 lpnode2 pve-ha-lrm[1770059]: starting service vm:500
Nov 24 22:36:20 lpnode2 pve-ha-lrm[1770058]: starting service vm:101
Nov 24 22:36:20 lpnode2 pve-ha-lrm[1770060]: starting service vm:510
Nov 24 22:36:20 lpnode2 pve-ha-lrm[1770064]: start VM 510: UPID:lpnode2:001B0250:066FD491:619EB054:qmstart:510:root@pam:
Nov 24 22:36:20 lpnode2 pve-ha-lrm[1770065]: start VM 101: UPID:lpnode2:001B0251:066FD491:619EB054:qmstart:101:root@pam:
Nov 24 22:36:20 lpnode2 pve-ha-lrm[1770066]: start VM 500: UPID:lpnode2:001B0252:066FD492:619EB054:qmstart:500:root@pam:
Nov 24 22:36:20 lpnode2 pve-ha-lrm[1770060]: <root@pam> starting task UPID:lpnode2:001B0250:066FD491:619EB054:qmstart:510:root@pam:
Nov 24 22:36:20 lpnode2 pve-ha-lrm[1770059]: <root@pam> starting task UPID:lpnode2:001B0252:066FD492:619EB054:qmstart:500:root@pam:
Nov 24 22:36:20 lpnode2 pve-ha-lrm[1770058]: <root@pam> starting task UPID:lpnode2:001B0251:066FD491:619EB054:qmstart:101:root@pam:
Nov 24 22:36:20 lpnode2 systemd[1]: Started 101.scope.
Nov 24 22:36:20 lpnode2 systemd[1]: Started 500.scope.
Nov 24 22:36:20 lpnode2 systemd[1]: Started 510.scope.
Nov 24 22:36:22 lpnode2 systemd-udevd[1770530]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Nov 24 22:36:22 lpnode2 pve-ha-lrm[1770059]: <root@pam> end task UPID:lpnode2:001B0252:066FD492:619EB054:qmstart:500:root@pam: OK
Nov 24 22:36:22 lpnode2 pve-ha-lrm[1770059]: service status vm:500 started
Nov 24 22:36:22 lpnode2 pve-ha-lrm[1770058]: <root@pam> end task UPID:lpnode2:001B0251:066FD491:619EB054:qmstart:101:root@pam: OK
Nov 24 22:36:22 lpnode2 pve-ha-lrm[1770058]: service status vm:101 started
Nov 24 22:36:23 lpnode2 pve-ha-lrm[1770060]: <root@pam> end task UPID:lpnode2:001B0250:066FD491:619EB054:qmstart:510:root@pam: OK
Nov 24 22:36:23 lpnode2 pve-ha-lrm[1770060]: service status vm:510 started
Nov 24 22:36:24 lpnode2 pvestatd[6203]: Use of uninitialized value $max_cpuid in addition (+) at /usr/share/perl5/PVE/Service/pvestatd.pm line 277.
Nov 24 22:36:34 lpnode2 pvestatd[6203]: Use of uninitialized value $max_cpuid in addition (+) at /usr/share/perl5/PVE/Service/pvestatd.pm line 277.
Nov 24 22:37:07 lpnode2 pvescheduler[1804965]: VM 101 qmp command failed - VM 101 qmp command 'guest-ping' failed - got timeout
Nov 24 22:37:07 lpnode2 pvescheduler[1804965]: Qemu Guest Agent is not running - VM 101 qmp command 'guest-ping' failed - got timeout
Nov 24 22:37:11 lpnode2 pvescheduler[1804965]: 101-0: got unexpected replication job error - command '/usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=lpnode1' root@yyy.yyy.yyy.yyy pvecm mtunnel -migration_network 10.101.53.160/27 -get_migration_ip' failed: exit code 255
 
Nov 24 22:34:00 lpnode1 watchdog-mux[10576]: client watchdog expired - disable watchdog updates
Nov 24 22:34:01 lpnode1 kernel: sd 0:0:0:0: attempting task abort!scmd(0x00000000b601dfd7), outstanding for 60812 ms & timeout 60000 ms
Nov 24 22:34:01 lpnode1 kernel: sd 0:0:0:0: [sda] tag#2484 CDB: ATA command pass through(16) 85 08 0e 00 d0 00 01 00 00 00 4f 00 c2 00 b0 00
So the kernel reports a command timeout on the sda device, which is very likely the actual cause, or at least closely related to it.

That left the CRM hanging, so the watchdog did not get updated anymore and expired -> the node got fenced.

In other words, it seems there was an (IO) issue that made the node (close to) completely unresponsive, and thus HA did its job and fenced it.

The amount of "Use of uninitialized value $max_cpuid in addition" messages is odd, but not problematic in itself.

Older logs on that node may be interesting, if only to make sure the causality chain really ends at sda being in a bogus, hanging state, triggering the further hang and thus the self-fence.
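
A possible starting point for going through the older logs in that window (timestamps taken from the excerpts above):

Code:
# all messages around the incident on the rebooted node
journalctl --since "2021-11-24 22:25" --until "2021-11-24 22:40" -o short-precise
# kernel messages only (disk/controller errors, task aborts, hung tasks)
journalctl -k --since "2021-11-24 22:25" --until "2021-11-24 22:40"
# or jump straight to the end of the previous boot's journal
# (requires a persistent journal)
journalctl -b -1 -e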
 
Hi Thomas,

I have gone through the old logs.
On 09/27/21 I did the Proxmox 6 to 7 upgrade.
The errors have appeared ever since.

Could it have something to do with CVE-2018-3646?

Code:
Sep 27 11:17:23 lpnode01 kernel: L1TF CPU bug present and SMT on, data leak possible. See CVE-2018-3646 and https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/l1tf.html for details.
Sep 27 11:17:23 lpnode01 pve-ha-crm[6409]: status change wait_for_quorum => slave
Sep 27 11:17:23 lpnode01 zebra[4470]: if_zebra_speed_update: vxlan_vpl01 old speed: 0 new speed: 4294967295
Sep 27 11:17:23 lpnode01 zebra[4470]: if_zebra_speed_update: vpl01 old speed: 0 new speed: 4294967295
Sep 27 11:17:23 lpnode01 zebra[4470]: if_zebra_speed_update: vxlan_vpl02 old speed: 0 new speed: 4294967295
Sep 27 11:17:23 lpnode01 zebra[4470]: if_zebra_speed_update: vpl02 old speed: 0 new speed: 4294967295
Sep 27 11:17:24 lpnode01 pvestatd[6358]: Use of uninitialized value $max_cpuid in addition (+) at /usr/share/perl5/PVE/Service/pvestatd.pm line 276.
Sep 27 11:17:24 lpnode01 kernel: FS-Cache: Loaded
Sep 27 11:17:24 lpnode01 kernel: FS-Cache: Netfs 'nfs' registered for caching
Sep 27 11:17:24 lpnode01 kernel: NFS: Registering the id_resolver key type
Sep 27 11:17:24 lpnode01 kernel: Key type id_resolver registered
Sep 27 11:17:24 lpnode01 kernel: Key type id_legacy registered
Sep 27 11:17:24 lpnode01 systemd[1]: Reached target Host and Network Name Lookups.
Sep 27 11:17:24 lpnode01 systemd[1]: Starting Preprocess NFS configuration...
Sep 27 11:17:24 lpnode01 systemd[1]: nfs-config.service: Succeeded.
Sep 27 11:17:24 lpnode01 systemd[1]: Finished Preprocess NFS configuration.
Sep 27 11:17:24 lpnode01 systemd[1]: Starting Notify NFS peers of a restart...
Sep 27 11:17:24 lpnode01 systemd[1]: Starting NFS status monitor for NFSv2/3 locking....
Sep 27 11:17:24 lpnode01 sm-notify[7040]: Version 1.3.3 starting
Sep 27 11:17:24 lpnode01 systemd[1]: rpc-statd-notify.service: Succeeded.
Sep 27 11:17:24 lpnode01 systemd[1]: Started Notify NFS peers of a restart.
Sep 27 11:17:24 lpnode01 rpc.statd[7043]: Version 1.3.3 starting
Sep 27 11:17:24 lpnode01 rpc.statd[7043]: Flags: TI-RPC
Sep 27 11:17:24 lpnode01 rpc.statd[7043]: Failed to read /var/lib/nfs/state: Success
Sep 27 11:17:24 lpnode01 rpc.statd[7043]: Initializing NSM state
Sep 27 11:17:24 lpnode01 systemd[1]: Started NFS status monitor for NFSv2/3 locking..
Sep 27 11:17:32 lpnode01 pvestatd[6358]: Use of uninitialized value $max_cpuid in addition (+) at /usr/share/perl5/PVE/Service/pvestatd.pm line 276.
Sep 27 11:17:42 lpnode01 pvestatd[6358]: Use of uninitialized value $max_cpuid in addition (+) at /usr/share/perl5/PVE/Service/pvestatd.pm line 276.
Sep 27 11:17:52 lpnode01 pvestatd[6358]: Use of uninitialized value $max_cpuid in addition (+) at /usr/share/perl5/PVE/Service/pvestatd.pm line 276.
 
Could it have something to do with CVE-2018-3646?
Not very likely. L1TF is rather old, and this is just an informational log from the kernel about a CPU issue (ensure the latest firmware and microcode are installed), but that bug won't trigger such behavior.
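
To see how the kernel currently rates L1TF on those hosts and which microcode revision is loaded, roughly:

Code:
# mitigation status as reported by the kernel
cat /sys/devices/system/cpu/vulnerabilities/l1tf
# currently loaded microcode revision
grep -m1 microcode /proc/cpuinfo
# on Intel systems the updated microcode package is typically
# "intel-microcode" (from the non-free component), installable via apt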

On 09/27/21 I did the Proxmox 6 to 7 upgrade.
That's a few months back, so it's unlikely to be related to this incident now.

What's IMO more likely is a HW issue (check the disks/disk controller/...), a service or VM going haywire and overloading the system, or maybe a kernel bug with your specific HW - mentioning that last one just because it's not impossible.
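
A few quick checks in that direction (assuming ZFS on local disks, as the zed/rpool messages suggest; smartmontools and sysstat provide the tools used here):

Code:
# SMART health of the disk that reported the command timeouts
smartctl -a /dev/sda
# pool state and any pending/known errors
zpool status -v rpool
# per-device latency and utilisation, 3 samples at 5 second intervals
iostat -x 5 3
# controller / disk related kernel messages since boot
dmesg -T | grep -iE 'task abort|timeout|reset|hung'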

But for a closer evaluation, a more active investigation would be required - gathering all hardware details, VM configs, storage and disk layout, and reading the full logs. That's rather out of scope for the community forum, as it would mean posting lots of possibly private details and needing someone willing to spend time on the investigation. For such cases our enterprise support exists: https://www.proxmox.com/en/proxmox-ve/pricing
 
