Sudden reboot of multiple nodes while adding a new node

I'm trying to figure out what happened on one of the "existing" nodes...

I can clearly see that the node changed the corosync config to accommodate the new node here:

Code:
Oct 16 19:01:58 pve42-cluster01 pmxcfs[3299]: [status] notice: received log
Oct 16 19:01:58 pve42-cluster01 pmxcfs[3299]: [dcdb] notice: wrote new corosync config '/etc/corosync/corosync.conf' (version = 10)
Oct 16 19:01:59 pve42-cluster01 corosync[3579]:   [CFG   ] Config reload requested by node 1
Oct 16 19:01:59 pve42-cluster01 corosync[3579]:   [TOTEM ] Configuring link 0
Oct 16 19:01:59 pve42-cluster01 corosync[3579]:   [TOTEM ] Configured link number 0: local addr: 172.16.7.42, port=5405
Oct 16 19:01:59 pve42-cluster01 corosync[3579]:   [TOTEM ] Configuring link 1
Oct 16 19:01:59 pve42-cluster01 corosync[3579]:   [TOTEM ] Configured link number 1: local addr: 172.16.8.42, port=5406
Oct 16 19:01:59 pve42-cluster01 corosync[3579]:   [KNET  ] host: host: 8 (passive) best link: 0 (pri: 1)
Oct 16 19:01:59 pve42-cluster01 corosync[3579]:   [KNET  ] host: host: 8 has no active links
Oct 16 19:01:59 pve42-cluster01 corosync[3579]:   [KNET  ] host: host: 8 (passive) best link: 0 (pri: 1)
Oct 16 19:01:59 pve42-cluster01 corosync[3579]:   [KNET  ] host: host: 8 has no active links
Oct 16 19:01:59 pve42-cluster01 corosync[3579]:   [KNET  ] host: host: 8 (passive) best link: 0 (pri: 1)
Oct 16 19:01:59 pve42-cluster01 corosync[3579]:   [KNET  ] host: host: 8 has no active links
Oct 16 19:01:59 pve42-cluster01 corosync[3579]:   [KNET  ] host: host: 8 (passive) best link: 0 (pri: 1)
Oct 16 19:01:59 pve42-cluster01 corosync[3579]:   [KNET  ] host: host: 8 has no active links
Oct 16 19:01:59 pve42-cluster01 corosync[3579]:   [KNET  ] host: host: 8 (passive) best link: 0 (pri: 1)
Oct 16 19:01:59 pve42-cluster01 corosync[3579]:   [KNET  ] host: host: 8 has no active links
Oct 16 19:01:59 pve42-cluster01 corosync[3579]:   [KNET  ] host: host: 8 (passive) best link: 0 (pri: 1)
Oct 16 19:01:59 pve42-cluster01 corosync[3579]:   [KNET  ] host: host: 8 has no active links
Oct 16 19:01:59 pve42-cluster01 pmxcfs[3299]: [status] notice: update cluster info (cluster name  sax-pve-cl01, version = 10)
Oct 16 19:02:05 pve42-cluster01 corosync[3579]:   [KNET  ] rx: host: 8 link: 0 is up
Oct 16 19:02:05 pve42-cluster01 corosync[3579]:   [KNET  ] rx: host: 8 link: 1 is up
Oct 16 19:02:05 pve42-cluster01 corosync[3579]:   [KNET  ] host: host: 8 (passive) best link: 0 (pri: 1)
Oct 16 19:02:05 pve42-cluster01 corosync[3579]:   [KNET  ] host: host: 8 (passive) best link: 0 (pri: 1)

And node 8 actually got added to the cluster at 19:02:05:

Code:
Oct 16 19:02:05 pve42-cluster01 corosync[3579]:   [QUORUM] Sync members[8]: 1 2 3 4 5 6 7 8
Oct 16 19:02:05 pve42-cluster01 corosync[3579]:   [QUORUM] Sync joined[1]: 8
Oct 16 19:02:05 pve42-cluster01 corosync[3579]:   [TOTEM ] A new membership (1.102) was formed. Members joined: 8

Then at this point there are several cpg_send_message calls that are failing (probably due to the STP loop caused by the joining node):

Code:
Oct 16 19:02:10 pve42-cluster01 pmxcfs[3299]: [status] notice: cpg_send_message retry 10
Oct 16 19:02:11 pve42-cluster01 pmxcfs[3299]: [status] notice: cpg_send_message retry 20
Oct 16 19:02:12 pve42-cluster01 pmxcfs[3299]: [status] notice: cpg_send_message retry 30
Oct 16 19:02:13 pve42-cluster01 pmxcfs[3299]: [status] notice: cpg_send_message retry 40
Oct 16 19:02:14 pve42-cluster01 pmxcfs[3299]: [status] notice: cpg_send_message retry 50
Oct 16 19:02:15 pve42-cluster01 pmxcfs[3299]: [status] notice: cpg_send_message retry 60
Oct 16 19:02:16 pve42-cluster01 pmxcfs[3299]: [status] notice: cpg_send_message retry 70
Oct 16 19:02:17 pve42-cluster01 pmxcfs[3299]: [status] notice: cpg_send_message retry 80
Oct 16 19:02:18 pve42-cluster01 pmxcfs[3299]: [status] notice: cpg_send_message retry 90
Oct 16 19:02:19 pve42-cluster01 pmxcfs[3299]: [status] notice: cpg_send_message retry 100
Oct 16 19:02:19 pve42-cluster01 pmxcfs[3299]: [status] notice: cpg_send_message retried 100 times
Oct 16 19:02:19 pve42-cluster01 pmxcfs[3299]: [status] crit: cpg_send_message failed: 6
Oct 16 19:02:20 pve42-cluster01 pmxcfs[3299]: [dcdb] notice: cpg_send_message retry 10
Oct 16 19:02:20 pve42-cluster01 pmxcfs[3299]: [status] notice: cpg_send_message retry 10
Oct 16 19:02:21 pve42-cluster01 pmxcfs[3299]: [dcdb] notice: cpg_send_message retry 20
Oct 16 19:02:21 pve42-cluster01 pmxcfs[3299]: [status] notice: cpg_send_message retry 20
Oct 16 19:02:22 pve42-cluster01 pmxcfs[3299]: [dcdb] notice: cpg_send_message retry 30
Oct 16 19:02:22 pve42-cluster01 pmxcfs[3299]: [status] notice: cpg_send_message retry 30
Oct 16 19:02:23 pve42-cluster01 pmxcfs[3299]: [dcdb] notice: cpg_send_message retry 40
Oct 16 19:02:23 pve42-cluster01 pmxcfs[3299]: [status] notice: cpg_send_message retry 40
Oct 16 19:02:24 pve42-cluster01 pmxcfs[3299]: [dcdb] notice: cpg_send_message retry 50
Oct 16 19:02:24 pve42-cluster01 pmxcfs[3299]: [status] notice: cpg_send_message retry 50
Oct 16 19:02:25 pve42-cluster01 pmxcfs[3299]: [dcdb] notice: cpg_send_message retry 60
Oct 16 19:02:25 pve42-cluster01 pmxcfs[3299]: [status] notice: cpg_send_message retry 60
Oct 16 19:02:26 pve42-cluster01 pmxcfs[3299]: [dcdb] notice: cpg_send_message retry 70
Oct 16 19:02:26 pve42-cluster01 pmxcfs[3299]: [status] notice: cpg_send_message retry 70
Oct 16 19:02:27 pve42-cluster01 pmxcfs[3299]: [dcdb] notice: cpg_send_message retry 80
Oct 16 19:02:27 pve42-cluster01 pmxcfs[3299]: [status] notice: cpg_send_message retry 80
Oct 16 19:02:28 pve42-cluster01 pmxcfs[3299]: [dcdb] notice: cpg_send_message retry 90
Oct 16 19:02:28 pve42-cluster01 pmxcfs[3299]: [status] notice: cpg_send_message retry 90
Oct 16 19:02:29 pve42-cluster01 pmxcfs[3299]: [dcdb] notice: cpg_send_message retry 100
Oct 16 19:02:29 pve42-cluster01 pmxcfs[3299]: [dcdb] notice: cpg_send_message retried 100 times
Oct 16 19:02:29 pve42-cluster01 pmxcfs[3299]: [dcdb] crit: cpg_send_message failed: 6

And here is the problem with HA, again probably caused by the fact that existing node 1 was no longer able to communicate with existing node 2:

Code:
Oct 16 19:02:29 pve42-cluster01 pve-ha-lrm[4358]: lost lock 'ha_agent_pve42-cluster01_lock - cfs lock update failed - Device or resource busy
Oct 16 19:02:29 pve42-cluster01 pve-ha-lrm[4358]: status change active => lost_agent_lock
Oct 16 19:02:29 pve42-cluster01 pmxcfs[3299]: [status] notice: cpg_send_message retry 100
Oct 16 19:02:29 pve42-cluster01 pmxcfs[3299]: [status] notice: cpg_send_message retried 100 times
Oct 16 19:02:29 pve42-cluster01 pmxcfs[3299]: [status] crit: cpg_send_message failed: 6
...... SNIP .....

And here the watchdog expires

Code:
Oct 16 19:02:59 pve42-cluster01 pmxcfs[3299]: [status] notice: cpg_send_message retried 100 times
Oct 16 19:02:59 pve42-cluster01 pmxcfs[3299]: [status] crit: cpg_send_message failed: 6
Oct 16 19:02:59 pve42-cluster01 pve-firewall[3705]: firewall update time (20.061 seconds)
Oct 16 19:03:00 pve42-cluster01 pmxcfs[3299]: [status] notice: cpg_send_message retry 10
Oct 16 19:03:00 pve42-cluster01 pmxcfs[3299]: [dcdb] notice: cpg_send_message retry 10
Oct 16 19:03:01 pve42-cluster01 watchdog-mux[2515]: client watchdog expired - disable watchdog updates
Oct 16 19:03:01 pve42-cluster01 pmxcfs[3299]: [status] notice: cpg_send_message retry 20
Oct 16 19:03:01 pve42-cluster01 pmxcfs[3299]: [dcdb] notice: cpg_send_message retry 20
Oct 16 19:03:02 pve42-cluster01 pmxcfs[3299]: [status] notice: cpg_send_message retry 30
Oct 16 19:03:02 pve42-cluster01 pmxcfs[3299]: [dcdb] notice: cpg_send_message retry 30
Oct 16 19:03:03 pve42-cluster01 pmxcfs[3299]: [status] notice: cpg_send_message retry 40
Oct 16 19:03:03 pve42-cluster01 pmxcfs[3299]: [dcdb] notice: cpg_send_message retry 40
Oct 16 19:03:04 pve42-cluster01 pmxcfs[3299]: [status] notice: cpg_send_message retry 50
Oct 16 19:03:04 pve42-cluster01 pmxcfs[3299]: [dcdb] notice: cpg_send_message retry 50
Oct 16 19:03:05 pve42-cluster01 pmxcfs[3299]: [status] notice: cpg_send_message retry 60
Oct 16 19:03:05 pve42-cluster01 pmxcfs[3299]: [dcdb] notice: cpg_send_message retry 60
Oct 16 19:03:06 pve42-cluster01 pmxcfs[3299]: [status] notice: cpg_send_message retry 70
Oct 16 19:03:06 pve42-cluster01 pmxcfs[3299]: [dcdb] notice: cpg_send_message retry 70
Oct 16 19:03:07 pve42-cluster01 pmxcfs[3299]: [status] notice: cpg_send_message retry 80
Oct 16 19:03:07 pve42-cluster01 pmxcfs[3299]: [dcdb] notice: cpg_send_message retry 80
Oct 16 19:03:08 pve42-cluster01 pmxcfs[3299]: [dcdb] notice: cpg_send_message retry 90

And now something quite strange: the existing node lost its link to the network.
Please bear in mind that the switch ports are configured as edge ports...
Code:
Oct 16 19:03:09 pve42-cluster01 corosync[3579]:   [KNET  ] link: host: 1 link: 0 is down
Oct 16 19:03:09 pve42-cluster01 corosync[3579]:   [KNET  ] link: host: 1 link: 1 is down
Oct 16 19:03:09 pve42-cluster01 corosync[3579]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Oct 16 19:03:09 pve42-cluster01 corosync[3579]:   [KNET  ] host: host: 1 has no active links
Oct 16 19:03:09 pve42-cluster01 corosync[3579]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Oct 16 19:03:09 pve42-cluster01 corosync[3579]:   [KNET  ] host: host: 1 has no active links
Oct 16 19:03:09 pve42-cluster01 pmxcfs[3299]: [dcdb] notice: cpg_send_message retry 100
Oct 16 19:03:09 pve42-cluster01 pmxcfs[3299]: [dcdb] notice: cpg_send_message retried 100 times
Oct 16 19:03:09 pve42-cluster01 pmxcfs[3299]: [status] notice: cpg_send_message retry 100
Oct 16 19:03:09 pve42-cluster01 pmxcfs[3299]: [status] notice: cpg_send_message retried 100 times
Oct 16 19:03:09 pve42-cluster01 pmxcfs[3299]: [status] crit: cpg_send_message failed: 6

At this point all the logs stop, and I cannot find anything else that would indicate a reboot. I'd expect that if any process triggered a reboot I would find some message in syslog.
How is it possible that this node started a reboot without leaving any trace in the logs?

And here is the reboot again...

Code:
Oct 16 19:05:07 pve42-cluster01 systemd-modules-load[1799]: Inserted module 'iscsi_tcp'
Oct 16 19:05:07 pve42-cluster01 systemd-modules-load[1799]: Inserted module 'ib_iser'
Oct 16 19:05:07 pve42-cluster01 systemd-modules-load[1799]: Inserted module 'vhost_net'
Oct 16 19:05:07 pve42-cluster01 lvm[1792]:   1 logical volume(s) in volume group "ceph-a1141880-818f-450b-b484-826efed83f84" monitored
Oct 16 19:05:07 pve42-cluster01 lvm[1792]:   1 logical volume(s) in volume group "ceph-346f9c7a-dbfc-4d0b-8e8f-c17b6ef087d9" monitored
Oct 16 19:05:07 pve42-cluster01 kernel: [    0.000000] Linux version 5.15.60-2-pve (build@proxmox) (gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2) #1 SMP PVE 5.15.60-2 (Tue, 04 Oct 2022 16:52:28 +0200) ()
Oct 16 19:05:07 pve42-cluster01 systemd[1]: Starting Flush Journal to Persistent Storage...
Oct 16 19:05:07 pve42-cluster01 systemd[1]: Finished Coldplug All udev Devices.
Oct 16 19:05:07 pve42-cluster01 kernel: [    0.000000] Command line: BOOT_IMAGE=/vmlinuz-5.15.60-2-pve root=ZFS=rpool/ROOT/pve-1 ro ixgbe.allow_unsupported_sfp=1 root=ZFS=rpool/ROOT/pve-1 boot=zfs quiet
Oct 16 19:05:07 pve42-cluster01 kernel: [    0.000000] KERNEL supported cpus:
Oct 16 19:05:07 pve42-cluster01 kernel: [    0.000000]   Intel GenuineIntel
Oct 16 19:05:07 pve42-cluster01 kernel: [    0.000000]   AMD AuthenticAMD
Oct 16 19:05:07 pve42-cluster01 kernel: [    0.000000]   Hygon HygonGenuine
Oct 16 19:05:07 pve42-cluster01 kernel: [    0.000000]   Centaur CentaurHauls
Oct 16 19:05:07 pve42-cluster01 systemd[1]: Starting Helper to synchronize boot up for ifupdown...
Oct 16 19:05:07 pve42-cluster01 kernel: [    0.000000]   zhaoxin   Shanghai 
Oct 16 19:05:07 pve42-cluster01 kernel: [    0.000000] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
Oct 16 19:05:07 pve42-cluster01 kernel: [    0.000000] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
Oct 16 19:05:07 pve42-cluster01 systemd[1]: Starting Wait for udev To Complete Device Initialization...

I appreciate that something happened, and I appreciate the need to avoid producing unnecessary logs.
But I also believe that if any element of Proxmox decides to trigger a reboot, there should be at least a record in syslog, with an explanation to help understand what happened.

To be 100% honest, if I had to read the logs by myself without your help, I would suspect a power failure (and I'm 100% sure this is not the case).
 
I'm trying to figure out what happened on one of the "existing" nodes...

I can clearly see that the node changed the corosync config to accommodate the new node here:

Code:
Oct 16 19:01:58 pve42-cluster01 pmxcfs[3299]: [status] notice: received log
Oct 16 19:01:58 pve42-cluster01 pmxcfs[3299]: [dcdb] notice: wrote new corosync config '/etc/corosync/corosync.conf' (version = 10)
Oct 16 19:01:59 pve42-cluster01 corosync[3579]:   [CFG   ] Config reload requested by node 1
Oct 16 19:01:59 pve42-cluster01 corosync[3579]:   [TOTEM ] Configuring link 0
Oct 16 19:01:59 pve42-cluster01 corosync[3579]:   [TOTEM ] Configured link number 0: local addr: 172.16.7.42, port=5405
Oct 16 19:01:59 pve42-cluster01 corosync[3579]:   [TOTEM ] Configuring link 1
Oct 16 19:01:59 pve42-cluster01 corosync[3579]:   [TOTEM ] Configured link number 1: local addr: 172.16.8.42, port=5406
Oct 16 19:01:59 pve42-cluster01 corosync[3579]:   [KNET  ] host: host: 8 (passive) best link: 0 (pri: 1)
Oct 16 19:01:59 pve42-cluster01 corosync[3579]:   [KNET  ] host: host: 8 has no active links
Oct 16 19:01:59 pve42-cluster01 corosync[3579]:   [KNET  ] host: host: 8 (passive) best link: 0 (pri: 1)
Oct 16 19:01:59 pve42-cluster01 corosync[3579]:   [KNET  ] host: host: 8 has no active links
Oct 16 19:01:59 pve42-cluster01 corosync[3579]:   [KNET  ] host: host: 8 (passive) best link: 0 (pri: 1)
Oct 16 19:01:59 pve42-cluster01 corosync[3579]:   [KNET  ] host: host: 8 has no active links
Oct 16 19:01:59 pve42-cluster01 corosync[3579]:   [KNET  ] host: host: 8 (passive) best link: 0 (pri: 1)
Oct 16 19:01:59 pve42-cluster01 corosync[3579]:   [KNET  ] host: host: 8 has no active links
Oct 16 19:01:59 pve42-cluster01 corosync[3579]:   [KNET  ] host: host: 8 (passive) best link: 0 (pri: 1)
Oct 16 19:01:59 pve42-cluster01 corosync[3579]:   [KNET  ] host: host: 8 has no active links
Oct 16 19:01:59 pve42-cluster01 corosync[3579]:   [KNET  ] host: host: 8 (passive) best link: 0 (pri: 1)
Oct 16 19:01:59 pve42-cluster01 corosync[3579]:   [KNET  ] host: host: 8 has no active links
Oct 16 19:01:59 pve42-cluster01 pmxcfs[3299]: [status] notice: update cluster info (cluster name  sax-pve-cl01, version = 10)
Oct 16 19:02:05 pve42-cluster01 corosync[3579]:   [KNET  ] rx: host: 8 link: 0 is up
Oct 16 19:02:05 pve42-cluster01 corosync[3579]:   [KNET  ] rx: host: 8 link: 1 is up
Oct 16 19:02:05 pve42-cluster01 corosync[3579]:   [KNET  ] host: host: 8 (passive) best link: 0 (pri: 1)
Oct 16 19:02:05 pve42-cluster01 corosync[3579]:   [KNET  ] host: host: 8 (passive) best link: 0 (pri: 1)

And node 8 actually got added to the cluster at 19:02:05:

Well, at that point the first two heartbeat ping/pongs went through, so the network part of corosync says: this host is reachable over these two links. The issue is that the expected MTU at this point is >9000 (based on the links to the other nodes), and your broken network config means that discovering that these new links actually only support MTU 1500 takes a long time (since it seems like packets over that size were basically blackholed).
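
If you want to verify the path MTU yourself, a quick check along these lines usually exposes the mismatch. This is only a sketch, assuming standard iputils ping and corosync 3.x; the target address is a placeholder for the new node's ring0 IP:

Code:
# Send a 9000-byte frame (8972 bytes payload + 28 bytes IP/ICMP header) with
# fragmentation prohibited; on a 1500-MTU path this either errors out with
# "message too long" or the replies simply never come back (blackholed).
ping -M do -s 8972 -c 3 <new-node-ring0-ip>

# For comparison, a frame that fits into MTU 1500 should go through fine:
ping -M do -s 1472 -c 3 <new-node-ring0-ip>

# corosync's own per-node/per-link view, including the negotiated MTU
# (available in recent corosync 3.x):
corosync-cfgtool -n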

Code:
Oct 16 19:02:05 pve42-cluster01 corosync[3579]:   [QUORUM] Sync members[8]: 1 2 3 4 5 6 7 8
Oct 16 19:02:05 pve42-cluster01 corosync[3579]:   [QUORUM] Sync joined[1]: 8
Oct 16 19:02:05 pve42-cluster01 corosync[3579]:   [TOTEM ] A new membership (1.102) was formed. Members joined: 8

Then at this point there are several cpg_send_message calls that are failing (probably due to the STP loop caused by the joining node):

No, CPG is not functional because corosync cannot finish synchronizing the nodes (the first node has a retransmit list where it logs that it tries to send the sync messages but fails, because the last, newly joined node never acks them: those messages end up in nirvana).
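
To see this from the corosync side, the retransmits and the local link state are visible with the stock tooling. A sketch, assuming corosync is managed by systemd as on a default PVE install:

Code:
# The node that cannot deliver the sync messages logs TOTEM retransmit lists:
journalctl -u corosync | grep -i retransmit

# Link/ring state as corosync currently sees it on this node:
corosync-cfgtool -s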

Code:
Oct 16 19:02:10 pve42-cluster01 pmxcfs[3299]: [status] notice: cpg_send_message retry 10
Oct 16 19:02:11 pve42-cluster01 pmxcfs[3299]: [status] notice: cpg_send_message retry 20
Oct 16 19:02:12 pve42-cluster01 pmxcfs[3299]: [status] notice: cpg_send_message retry 30
Oct 16 19:02:13 pve42-cluster01 pmxcfs[3299]: [status] notice: cpg_send_message retry 40
Oct 16 19:02:14 pve42-cluster01 pmxcfs[3299]: [status] notice: cpg_send_message retry 50
Oct 16 19:02:15 pve42-cluster01 pmxcfs[3299]: [status] notice: cpg_send_message retry 60
Oct 16 19:02:16 pve42-cluster01 pmxcfs[3299]: [status] notice: cpg_send_message retry 70
Oct 16 19:02:17 pve42-cluster01 pmxcfs[3299]: [status] notice: cpg_send_message retry 80
Oct 16 19:02:18 pve42-cluster01 pmxcfs[3299]: [status] notice: cpg_send_message retry 90
Oct 16 19:02:19 pve42-cluster01 pmxcfs[3299]: [status] notice: cpg_send_message retry 100
Oct 16 19:02:19 pve42-cluster01 pmxcfs[3299]: [status] notice: cpg_send_message retried 100 times
Oct 16 19:02:19 pve42-cluster01 pmxcfs[3299]: [status] crit: cpg_send_message failed: 6
Oct 16 19:02:20 pve42-cluster01 pmxcfs[3299]: [dcdb] notice: cpg_send_message retry 10
Oct 16 19:02:20 pve42-cluster01 pmxcfs[3299]: [status] notice: cpg_send_message retry 10
Oct 16 19:02:21 pve42-cluster01 pmxcfs[3299]: [dcdb] notice: cpg_send_message retry 20
Oct 16 19:02:21 pve42-cluster01 pmxcfs[3299]: [status] notice: cpg_send_message retry 20
Oct 16 19:02:22 pve42-cluster01 pmxcfs[3299]: [dcdb] notice: cpg_send_message retry 30
Oct 16 19:02:22 pve42-cluster01 pmxcfs[3299]: [status] notice: cpg_send_message retry 30
Oct 16 19:02:23 pve42-cluster01 pmxcfs[3299]: [dcdb] notice: cpg_send_message retry 40
Oct 16 19:02:23 pve42-cluster01 pmxcfs[3299]: [status] notice: cpg_send_message retry 40
Oct 16 19:02:24 pve42-cluster01 pmxcfs[3299]: [dcdb] notice: cpg_send_message retry 50
Oct 16 19:02:24 pve42-cluster01 pmxcfs[3299]: [status] notice: cpg_send_message retry 50
Oct 16 19:02:25 pve42-cluster01 pmxcfs[3299]: [dcdb] notice: cpg_send_message retry 60
Oct 16 19:02:25 pve42-cluster01 pmxcfs[3299]: [status] notice: cpg_send_message retry 60
Oct 16 19:02:26 pve42-cluster01 pmxcfs[3299]: [dcdb] notice: cpg_send_message retry 70
Oct 16 19:02:26 pve42-cluster01 pmxcfs[3299]: [status] notice: cpg_send_message retry 70
Oct 16 19:02:27 pve42-cluster01 pmxcfs[3299]: [dcdb] notice: cpg_send_message retry 80
Oct 16 19:02:27 pve42-cluster01 pmxcfs[3299]: [status] notice: cpg_send_message retry 80
Oct 16 19:02:28 pve42-cluster01 pmxcfs[3299]: [dcdb] notice: cpg_send_message retry 90
Oct 16 19:02:28 pve42-cluster01 pmxcfs[3299]: [status] notice: cpg_send_message retry 90
Oct 16 19:02:29 pve42-cluster01 pmxcfs[3299]: [dcdb] notice: cpg_send_message retry 100
Oct 16 19:02:29 pve42-cluster01 pmxcfs[3299]: [dcdb] notice: cpg_send_message retried 100 times
Oct 16 19:02:29 pve42-cluster01 pmxcfs[3299]: [dcdb] crit: cpg_send_message failed: 6

And here is the problem with HA, again probably caused by the fact that existing node 1 was no longer able to communicate with existing node 2:

No, the issue with HA is that CPG is not functional, thus the HA services cannot write their "we are still alive and well" timestamp.
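
For reference, the CRM/LRM state, including the timestamps they last managed to write, can be checked with the HA manager CLI (a sketch; this is the stock PVE command):

Code:
# Shows quorum state, the current CRM master, each node's LRM status
# (active/idle/old timestamp) and the state of every HA-managed resource.
ha-manager status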

Code:
Oct 16 19:02:29 pve42-cluster01 pve-ha-lrm[4358]: lost lock 'ha_agent_pve42-cluster01_lock - cfs lock update failed - Device or resource busy
Oct 16 19:02:29 pve42-cluster01 pve-ha-lrm[4358]: status change active => lost_agent_lock
Oct 16 19:02:29 pve42-cluster01 pmxcfs[3299]: [status] notice: cpg_send_message retry 100
Oct 16 19:02:29 pve42-cluster01 pmxcfs[3299]: [status] notice: cpg_send_message retried 100 times
Oct 16 19:02:29 pve42-cluster01 pmxcfs[3299]: [status] crit: cpg_send_message failed: 6
...... SNIP .....

And here the watchdog expires

Yes, because the watchdog is only pulled up again if the HA services write their timestamps.

Code:
Oct 16 19:02:59 pve42-cluster01 pmxcfs[3299]: [status] notice: cpg_send_message retried 100 times
Oct 16 19:02:59 pve42-cluster01 pmxcfs[3299]: [status] crit: cpg_send_message failed: 6
Oct 16 19:02:59 pve42-cluster01 pve-firewall[3705]: firewall update time (20.061 seconds)
Oct 16 19:03:00 pve42-cluster01 pmxcfs[3299]: [status] notice: cpg_send_message retry 10
Oct 16 19:03:00 pve42-cluster01 pmxcfs[3299]: [dcdb] notice: cpg_send_message retry 10
Oct 16 19:03:01 pve42-cluster01 watchdog-mux[2515]: client watchdog expired - disable watchdog updates
Oct 16 19:03:01 pve42-cluster01 pmxcfs[3299]: [status] notice: cpg_send_message retry 20
Oct 16 19:03:01 pve42-cluster01 pmxcfs[3299]: [dcdb] notice: cpg_send_message retry 20
Oct 16 19:03:02 pve42-cluster01 pmxcfs[3299]: [status] notice: cpg_send_message retry 30
Oct 16 19:03:02 pve42-cluster01 pmxcfs[3299]: [dcdb] notice: cpg_send_message retry 30
Oct 16 19:03:03 pve42-cluster01 pmxcfs[3299]: [status] notice: cpg_send_message retry 40
Oct 16 19:03:03 pve42-cluster01 pmxcfs[3299]: [dcdb] notice: cpg_send_message retry 40
Oct 16 19:03:04 pve42-cluster01 pmxcfs[3299]: [status] notice: cpg_send_message retry 50
Oct 16 19:03:04 pve42-cluster01 pmxcfs[3299]: [dcdb] notice: cpg_send_message retry 50
Oct 16 19:03:05 pve42-cluster01 pmxcfs[3299]: [status] notice: cpg_send_message retry 60
Oct 16 19:03:05 pve42-cluster01 pmxcfs[3299]: [dcdb] notice: cpg_send_message retry 60
Oct 16 19:03:06 pve42-cluster01 pmxcfs[3299]: [status] notice: cpg_send_message retry 70
Oct 16 19:03:06 pve42-cluster01 pmxcfs[3299]: [dcdb] notice: cpg_send_message retry 70
Oct 16 19:03:07 pve42-cluster01 pmxcfs[3299]: [status] notice: cpg_send_message retry 80
Oct 16 19:03:07 pve42-cluster01 pmxcfs[3299]: [dcdb] notice: cpg_send_message retry 80
Oct 16 19:03:08 pve42-cluster01 pmxcfs[3299]: [dcdb] notice: cpg_send_message retry 90

And now something quite strange: the existing node lost its link to the network.
Please bear in mind that the switch ports are configured as edge ports...

No, what happens is that the first node fenced itself and the other nodes notice that (rinse and repeat for the second and third nodes).

Code:
Oct 16 19:03:09 pve42-cluster01 corosync[3579]:   [KNET  ] link: host: 1 link: 0 is down
Oct 16 19:03:09 pve42-cluster01 corosync[3579]:   [KNET  ] link: host: 1 link: 1 is down
Oct 16 19:03:09 pve42-cluster01 corosync[3579]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Oct 16 19:03:09 pve42-cluster01 corosync[3579]:   [KNET  ] host: host: 1 has no active links
Oct 16 19:03:09 pve42-cluster01 corosync[3579]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Oct 16 19:03:09 pve42-cluster01 corosync[3579]:   [KNET  ] host: host: 1 has no active links
Oct 16 19:03:09 pve42-cluster01 pmxcfs[3299]: [dcdb] notice: cpg_send_message retry 100
Oct 16 19:03:09 pve42-cluster01 pmxcfs[3299]: [dcdb] notice: cpg_send_message retried 100 times
Oct 16 19:03:09 pve42-cluster01 pmxcfs[3299]: [status] notice: cpg_send_message retry 100
Oct 16 19:03:09 pve42-cluster01 pmxcfs[3299]: [status] notice: cpg_send_message retried 100 times
Oct 16 19:03:09 pve42-cluster01 pmxcfs[3299]: [status] crit: cpg_send_message failed: 6

At this point all the logs stop, and I cannot find anything else that would indicate a reboot. I'd expect that if any process triggered a reboot I would find some message in syslog.
How is it possible that this node started a reboot without leaving any trace in the logs?

It fenced itself and probably didn't write out anything. Usually the other nodes would, but their HA services also became non-operational at the same time.

And here is the reboot again...

Code:
Oct 16 19:05:07 pve42-cluster01 systemd-modules-load[1799]: Inserted module 'iscsi_tcp'
Oct 16 19:05:07 pve42-cluster01 systemd-modules-load[1799]: Inserted module 'ib_iser'
Oct 16 19:05:07 pve42-cluster01 systemd-modules-load[1799]: Inserted module 'vhost_net'
Oct 16 19:05:07 pve42-cluster01 lvm[1792]:   1 logical volume(s) in volume group "ceph-a1141880-818f-450b-b484-826efed83f84" monitored
Oct 16 19:05:07 pve42-cluster01 lvm[1792]:   1 logical volume(s) in volume group "ceph-346f9c7a-dbfc-4d0b-8e8f-c17b6ef087d9" monitored
Oct 16 19:05:07 pve42-cluster01 kernel: [    0.000000] Linux version 5.15.60-2-pve (build@proxmox) (gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2) #1 SMP PVE 5.15.60-2 (Tue, 04 Oct 2022 16:52:28 +0200) ()
Oct 16 19:05:07 pve42-cluster01 systemd[1]: Starting Flush Journal to Persistent Storage...
Oct 16 19:05:07 pve42-cluster01 systemd[1]: Finished Coldplug All udev Devices.
Oct 16 19:05:07 pve42-cluster01 kernel: [    0.000000] Command line: BOOT_IMAGE=/vmlinuz-5.15.60-2-pve root=ZFS=rpool/ROOT/pve-1 ro ixgbe.allow_unsupported_sfp=1 root=ZFS=rpool/ROOT/pve-1 boot=zfs quiet
Oct 16 19:05:07 pve42-cluster01 kernel: [    0.000000] KERNEL supported cpus:
Oct 16 19:05:07 pve42-cluster01 kernel: [    0.000000]   Intel GenuineIntel
Oct 16 19:05:07 pve42-cluster01 kernel: [    0.000000]   AMD AuthenticAMD
Oct 16 19:05:07 pve42-cluster01 kernel: [    0.000000]   Hygon HygonGenuine
Oct 16 19:05:07 pve42-cluster01 kernel: [    0.000000]   Centaur CentaurHauls
Oct 16 19:05:07 pve42-cluster01 systemd[1]: Starting Helper to synchronize boot up for ifupdown...
Oct 16 19:05:07 pve42-cluster01 kernel: [    0.000000]   zhaoxin   Shanghai
Oct 16 19:05:07 pve42-cluster01 kernel: [    0.000000] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
Oct 16 19:05:07 pve42-cluster01 kernel: [    0.000000] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
Oct 16 19:05:07 pve42-cluster01 systemd[1]: Starting Wait for udev To Complete Device Initialization...

I appreciate that something happened, and I appreciate the need to avoid producing unnecessary logs.
But I also believe that if any element of Proxmox decides to trigger a reboot, there should be at least a record in syslog, with an explanation to help understand what happened.

To be 100% honest, if I had to read the logs by myself without your help, I would suspect a power failure (and I'm 100% sure this is not the case).

The way I see it there are basically two solutions (until upstream and we have decided whether and how this is fixable in corosync/knet):
- fix the network situation (either MTU 9000 all around, or 1500 all around)
- or force corosync to use MTU 1500 (via corosync.conf / netmtu) regardless of what the network says it is capable of (see the sketch below)
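
A minimal sketch of the second option, assuming the usual Proxmox workflow of editing /etc/pve/corosync.conf and bumping config_version so the change is distributed and reloaded cluster-wide; only the values visible in the logs above are real, the rest is illustrative:

Code:
# excerpt of /etc/pve/corosync.conf
totem {
  cluster_name: sax-pve-cl01
  config_version: 11   # bump from the current version (10 in the logs above) so the change propagates
  netmtu: 1500         # force the totem MTU instead of relying on path MTU discovery
  # ... keep the existing version/interface/crypto settings unchanged
}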

Disarming HA would obviously still be prudent to avoid further fencing events.
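
One way to do that, sketched with the stock ha-manager CLI (vm:100 is just an example service ID): with no active HA resources left, the LRM should go idle and stop arming the watchdog.

Code:
# per HA resource, either take it out of HA management entirely ...
ha-manager remove vm:100
# ... or leave it configured but tell the HA stack to ignore it:
ha-manager set vm:100 --state ignored

# confirm the LRMs have gone idle afterwards:
ha-manager status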
 
I can easily change the MTU, or even install a 10G card identical to the ones already installed in the other servers, so we can be sure there is no MTU mismatch.

But my main question remains: is it normal that whatever triggered the reboot of the machines didn't leave any evident trace in their syslog?

I believe an event and condition like this is worth logging, more than anything else.
 
Normally the HA services would log something - in this case /etc/pve just completely blocked, which likely also blocked them, so they didn't log anything. Since that affected the whole cluster, the other nodes also didn't log anything. The watchdog expiring is entirely automatic and is logged, but it's basically the last thing that happens before the node is killed, so that message might not make it to disk anymore.
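
To improve the odds that such last-second messages survive a fence, one option is to make the systemd journal persistent and inspect the previous boot afterwards. A sketch, assuming the Debian/PVE default of Storage=auto in journald (persistent only if /var/log/journal exists); even then, the very last writes can still be lost:

Code:
# enable persistent journald storage
mkdir -p /var/log/journal
systemctl restart systemd-journald

# after the next unexpected reboot, inspect the tail end of the previous boot:
journalctl -b -1 -e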
 
