I'm trying to figure out what happened on one of the "existing" nodes...
I can clearly see that the node rewrote the corosync config to accommodate the new node here:
Code:
Oct 16 19:01:58 pve42-cluster01 pmxcfs[3299]: [status] notice: received log
Oct 16 19:01:58 pve42-cluster01 pmxcfs[3299]: [dcdb] notice: wrote new corosync config '/etc/corosync/corosync.conf' (version = 10)
Oct 16 19:01:59 pve42-cluster01 corosync[3579]: [CFG ] Config reload requested by node 1
Oct 16 19:01:59 pve42-cluster01 corosync[3579]: [TOTEM ] Configuring link 0
Oct 16 19:01:59 pve42-cluster01 corosync[3579]: [TOTEM ] Configured link number 0: local addr: 172.16.7.42, port=5405
Oct 16 19:01:59 pve42-cluster01 corosync[3579]: [TOTEM ] Configuring link 1
Oct 16 19:01:59 pve42-cluster01 corosync[3579]: [TOTEM ] Configured link number 1: local addr: 172.16.8.42, port=5406
Oct 16 19:01:59 pve42-cluster01 corosync[3579]: [KNET ] host: host: 8 (passive) best link: 0 (pri: 1)
Oct 16 19:01:59 pve42-cluster01 corosync[3579]: [KNET ] host: host: 8 has no active links
Oct 16 19:01:59 pve42-cluster01 corosync[3579]: [KNET ] host: host: 8 (passive) best link: 0 (pri: 1)
Oct 16 19:01:59 pve42-cluster01 corosync[3579]: [KNET ] host: host: 8 has no active links
Oct 16 19:01:59 pve42-cluster01 corosync[3579]: [KNET ] host: host: 8 (passive) best link: 0 (pri: 1)
Oct 16 19:01:59 pve42-cluster01 corosync[3579]: [KNET ] host: host: 8 has no active links
Oct 16 19:01:59 pve42-cluster01 corosync[3579]: [KNET ] host: host: 8 (passive) best link: 0 (pri: 1)
Oct 16 19:01:59 pve42-cluster01 corosync[3579]: [KNET ] host: host: 8 has no active links
Oct 16 19:01:59 pve42-cluster01 corosync[3579]: [KNET ] host: host: 8 (passive) best link: 0 (pri: 1)
Oct 16 19:01:59 pve42-cluster01 corosync[3579]: [KNET ] host: host: 8 has no active links
Oct 16 19:01:59 pve42-cluster01 corosync[3579]: [KNET ] host: host: 8 (passive) best link: 0 (pri: 1)
Oct 16 19:01:59 pve42-cluster01 corosync[3579]: [KNET ] host: host: 8 has no active links
Oct 16 19:01:59 pve42-cluster01 pmxcfs[3299]: [status] notice: update cluster info (cluster name sax-pve-cl01, version = 10)
Oct 16 19:02:05 pve42-cluster01 corosync[3579]: [KNET ] rx: host: 8 link: 0 is up
Oct 16 19:02:05 pve42-cluster01 corosync[3579]: [KNET ] rx: host: 8 link: 1 is up
Oct 16 19:02:05 pve42-cluster01 corosync[3579]: [KNET ] host: host: 8 (passive) best link: 0 (pri: 1)
Oct 16 19:02:05 pve42-cluster01 corosync[3579]: [KNET ] host: host: 8 (passive) best link: 0 (pri: 1)
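For context, the reloaded config (version 10) should have gained a new node {} stanza roughly like the sketch below; the hostname and ring addresses for node 8 are my assumptions, just following the 172.16.7.x / 172.16.8.x scheme visible in the link lines above:
Code:
node {
    # hypothetical entry for the joining node: name and addresses assumed,
    # mirroring the two-link addressing this host uses (.42)
    name: pve48-cluster01
    nodeid: 8
    quorum_votes: 1
    ring0_addr: 172.16.7.48
    ring1_addr: 172.16.8.48
}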
And node 8 actually got added to the cluster at 19:02:05:
Code:
Oct 16 19:02:05 pve42-cluster01 corosync[3579]: [QUORUM] Sync members[8]: 1 2 3 4 5 6 7 8
Oct 16 19:02:05 pve42-cluster01 corosync[3579]: [QUORUM] Sync joined[1]: 8
Oct 16 19:02:05 pve42-cluster01 corosync[3579]: [TOTEM ] A new membership (1.102) was formed. Members joined: 8
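For the record, this is how one can double-check the join from any existing node (both are standard commands; I'm describing expected behaviour, not pasting real output):
Code:
corosync-cfgtool -s    # knet link state per host; host 8 should show both links up
pvecm status           # quorum/membership info; should now list all 8 nodes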
Then at this point there are several cpg_send_message calls that fail (probably due to the STP loop caused by the joining node):
Code:
Oct 16 19:02:10 pve42-cluster01 pmxcfs[3299]: [status] notice: cpg_send_message retry 10
Oct 16 19:02:11 pve42-cluster01 pmxcfs[3299]: [status] notice: cpg_send_message retry 20
Oct 16 19:02:12 pve42-cluster01 pmxcfs[3299]: [status] notice: cpg_send_message retry 30
Oct 16 19:02:13 pve42-cluster01 pmxcfs[3299]: [status] notice: cpg_send_message retry 40
Oct 16 19:02:14 pve42-cluster01 pmxcfs[3299]: [status] notice: cpg_send_message retry 50
Oct 16 19:02:15 pve42-cluster01 pmxcfs[3299]: [status] notice: cpg_send_message retry 60
Oct 16 19:02:16 pve42-cluster01 pmxcfs[3299]: [status] notice: cpg_send_message retry 70
Oct 16 19:02:17 pve42-cluster01 pmxcfs[3299]: [status] notice: cpg_send_message retry 80
Oct 16 19:02:18 pve42-cluster01 pmxcfs[3299]: [status] notice: cpg_send_message retry 90
Oct 16 19:02:19 pve42-cluster01 pmxcfs[3299]: [status] notice: cpg_send_message retry 100
Oct 16 19:02:19 pve42-cluster01 pmxcfs[3299]: [status] notice: cpg_send_message retried 100 times
Oct 16 19:02:19 pve42-cluster01 pmxcfs[3299]: [status] crit: cpg_send_message failed: 6
Oct 16 19:02:20 pve42-cluster01 pmxcfs[3299]: [dcdb] notice: cpg_send_message retry 10
Oct 16 19:02:20 pve42-cluster01 pmxcfs[3299]: [status] notice: cpg_send_message retry 10
Oct 16 19:02:21 pve42-cluster01 pmxcfs[3299]: [dcdb] notice: cpg_send_message retry 20
Oct 16 19:02:21 pve42-cluster01 pmxcfs[3299]: [status] notice: cpg_send_message retry 20
Oct 16 19:02:22 pve42-cluster01 pmxcfs[3299]: [dcdb] notice: cpg_send_message retry 30
Oct 16 19:02:22 pve42-cluster01 pmxcfs[3299]: [status] notice: cpg_send_message retry 30
Oct 16 19:02:23 pve42-cluster01 pmxcfs[3299]: [dcdb] notice: cpg_send_message retry 40
Oct 16 19:02:23 pve42-cluster01 pmxcfs[3299]: [status] notice: cpg_send_message retry 40
Oct 16 19:02:24 pve42-cluster01 pmxcfs[3299]: [dcdb] notice: cpg_send_message retry 50
Oct 16 19:02:24 pve42-cluster01 pmxcfs[3299]: [status] notice: cpg_send_message retry 50
Oct 16 19:02:25 pve42-cluster01 pmxcfs[3299]: [dcdb] notice: cpg_send_message retry 60
Oct 16 19:02:25 pve42-cluster01 pmxcfs[3299]: [status] notice: cpg_send_message retry 60
Oct 16 19:02:26 pve42-cluster01 pmxcfs[3299]: [dcdb] notice: cpg_send_message retry 70
Oct 16 19:02:26 pve42-cluster01 pmxcfs[3299]: [status] notice: cpg_send_message retry 70
Oct 16 19:02:27 pve42-cluster01 pmxcfs[3299]: [dcdb] notice: cpg_send_message retry 80
Oct 16 19:02:27 pve42-cluster01 pmxcfs[3299]: [status] notice: cpg_send_message retry 80
Oct 16 19:02:28 pve42-cluster01 pmxcfs[3299]: [dcdb] notice: cpg_send_message retry 90
Oct 16 19:02:28 pve42-cluster01 pmxcfs[3299]: [status] notice: cpg_send_message retry 90
Oct 16 19:02:29 pve42-cluster01 pmxcfs[3299]: [dcdb] notice: cpg_send_message retry 100
Oct 16 19:02:29 pve42-cluster01 pmxcfs[3299]: [dcdb] notice: cpg_send_message retried 100 times
Oct 16 19:02:29 pve42-cluster01 pmxcfs[3299]: [dcdb] crit: cpg_send_message failed: 6
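A note on the error code: 6 is CS_ERR_TRY_AGAIN in <corosync/corotypes.h>, i.e. corosync's flow control kept rejecting the send until pmxcfs gave up after 100 retries. The sketch below is just my reading of the pattern behind these messages, not the actual pmxcfs source; the 100 ms back-off is inferred from the ~10 retries per second in the timestamps:
Code:
/* cpg_retry_sketch.c - my reading of the retry pattern behind the log
 * lines above; NOT the actual pmxcfs source.
 * Build (assumption): gcc cpg_retry_sketch.c -o cpg_retry_sketch -lcpg */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/uio.h>
#include <corosync/cpg.h>

static cs_error_t send_with_retry(cpg_handle_t h, const void *buf, size_t len)
{
    struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
    cs_error_t rc;
    int retry = 0;

    /* CS_ERR_TRY_AGAIN (== 6) means corosync's flow control rejected the
     * send; retry and log every 10th attempt, giving up after 100, which
     * matches "retried 100 times" / "failed: 6" in syslog. */
    while ((rc = cpg_mcast_joined(h, CPG_TYPE_AGREED, &iov, 1)) == CS_ERR_TRY_AGAIN) {
        if (++retry % 10 == 0)
            fprintf(stderr, "cpg_send_message retry %d\n", retry);
        if (retry >= 100) {
            fprintf(stderr, "cpg_send_message retried %d times\n", retry);
            break;
        }
        usleep(100 * 1000);   /* ~10 retries/s, matching the timestamps */
    }
    if (rc != CS_OK)
        fprintf(stderr, "cpg_send_message failed: %d\n", rc);
    return rc;
}

int main(void)
{
    cpg_handle_t h;
    cpg_callbacks_t cb = { 0 };           /* no delivery callbacks needed here */
    struct cpg_name grp = { .length = 4 };

    memcpy(grp.value, "demo", 4);
    if (cpg_initialize(&h, &cb) != CS_OK) /* requires a running corosync */
        return 1;
    if (cpg_join(h, &grp) != CS_OK)
        return 1;
    send_with_retry(h, "hello", 5);
    cpg_finalize(h);
    return 0;
}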
And here now the problem with HA, again probably caused by the fact that existing node 1 was no longer able to communicate with existing node 2:
Code:
Oct 16 19:02:29 pve42-cluster01 pve-ha-lrm[4358]: lost lock 'ha_agent_pve42-cluster01_lock - cfs lock update failed - Device or resource busy
Oct 16 19:02:29 pve42-cluster01 pve-ha-lrm[4358]: status change active => lost_agent_lock
Oct 16 19:02:29 pve42-cluster01 pmxcfs[3299]: [status] notice: cpg_send_message retry 100
Oct 16 19:02:29 pve42-cluster01 pmxcfs[3299]: [status] notice: cpg_send_message retried 100 times
Oct 16 19:02:29 pve42-cluster01 pmxcfs[3299]: [status] crit: cpg_send_message failed: 6
...... SNIP .....
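As far as I understand, the ha_agent lock is itself a pmxcfs cluster lock, so the LRM simply cannot renew it while cpg traffic is failing; the locks should be visible on the clustered filesystem (path per my understanding of pmxcfs):
Code:
ls /etc/pve/priv/lock/
# expected to contain ha_agent_<node>_lock, ha_manager_lock, ...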
And here the watchdog expires:
Code:
Oct 16 19:02:59 pve42-cluster01 pmxcfs[3299]: [status] notice: cpg_send_message retried 100 times
Oct 16 19:02:59 pve42-cluster01 pmxcfs[3299]: [status] crit: cpg_send_message failed: 6
Oct 16 19:02:59 pve42-cluster01 pve-firewall[3705]: firewall update time (20.061 seconds)
Oct 16 19:03:00 pve42-cluster01 pmxcfs[3299]: [status] notice: cpg_send_message retry 10
Oct 16 19:03:00 pve42-cluster01 pmxcfs[3299]: [dcdb] notice: cpg_send_message retry 10
Oct 16 19:03:01 pve42-cluster01 watchdog-mux[2515]: client watchdog expired - disable watchdog updates
Oct 16 19:03:01 pve42-cluster01 pmxcfs[3299]: [status] notice: cpg_send_message retry 20
Oct 16 19:03:01 pve42-cluster01 pmxcfs[3299]: [dcdb] notice: cpg_send_message retry 20
Oct 16 19:03:02 pve42-cluster01 pmxcfs[3299]: [status] notice: cpg_send_message retry 30
Oct 16 19:03:02 pve42-cluster01 pmxcfs[3299]: [dcdb] notice: cpg_send_message retry 30
Oct 16 19:03:03 pve42-cluster01 pmxcfs[3299]: [status] notice: cpg_send_message retry 40
Oct 16 19:03:03 pve42-cluster01 pmxcfs[3299]: [dcdb] notice: cpg_send_message retry 40
Oct 16 19:03:04 pve42-cluster01 pmxcfs[3299]: [status] notice: cpg_send_message retry 50
Oct 16 19:03:04 pve42-cluster01 pmxcfs[3299]: [dcdb] notice: cpg_send_message retry 50
Oct 16 19:03:05 pve42-cluster01 pmxcfs[3299]: [status] notice: cpg_send_message retry 60
Oct 16 19:03:05 pve42-cluster01 pmxcfs[3299]: [dcdb] notice: cpg_send_message retry 60
Oct 16 19:03:06 pve42-cluster01 pmxcfs[3299]: [status] notice: cpg_send_message retry 70
Oct 16 19:03:06 pve42-cluster01 pmxcfs[3299]: [dcdb] notice: cpg_send_message retry 70
Oct 16 19:03:07 pve42-cluster01 pmxcfs[3299]: [status] notice: cpg_send_message retry 80
Oct 16 19:03:07 pve42-cluster01 pmxcfs[3299]: [dcdb] notice: cpg_send_message retry 80
Oct 16 19:03:08 pve42-cluster01 pmxcfs[3299]: [dcdb] notice: cpg_send_message retry 90
And now something quite strange: the existing node lost its link to the network.
Please bear in mind that the ports on the switches are configured as edge ports...
Code:
Oct 16 19:03:09 pve42-cluster01 corosync[3579]: [KNET ] link: host: 1 link: 0 is down
Oct 16 19:03:09 pve42-cluster01 corosync[3579]: [KNET ] link: host: 1 link: 1 is down
Oct 16 19:03:09 pve42-cluster01 corosync[3579]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Oct 16 19:03:09 pve42-cluster01 corosync[3579]: [KNET ] host: host: 1 has no active links
Oct 16 19:03:09 pve42-cluster01 corosync[3579]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Oct 16 19:03:09 pve42-cluster01 corosync[3579]: [KNET ] host: host: 1 has no active links
Oct 16 19:03:09 pve42-cluster01 pmxcfs[3299]: [dcdb] notice: cpg_send_message retry 100
Oct 16 19:03:09 pve42-cluster01 pmxcfs[3299]: [dcdb] notice: cpg_send_message retried 100 times
Oct 16 19:03:09 pve42-cluster01 pmxcfs[3299]: [status] notice: cpg_send_message retry 100
Oct 16 19:03:09 pve42-cluster01 pmxcfs[3299]: [status] notice: cpg_send_message retried 100 times
Oct 16 19:03:09 pve42-cluster01 pmxcfs[3299]: [status] crit: cpg_send_message failed: 6
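Since my working theory is an STP event, one thing worth ruling out is the Linux side: the PVE bridges themselves normally run with STP off, so any topology change should originate on the physical switches. Quick check via the standard bridge sysfs:
Code:
# on each node: 0 means STP is off on the Linux bridge itself,
# so any spanning-tree reconvergence must come from the switches
grep . /sys/class/net/vmbr*/bridge/stp_state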
At this point all the logs stop, and I cannot find anything else that would indicate a reboot. I'd expect that if any process triggered a reboot, I would find some message in syslog.
How is it possible that this node started a reboot without leaving any trace in the logs?
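From what I've read, my best guess at the mechanism is this: once watchdog-mux stops refreshing /dev/watchdog (the "disable watchdog updates" line above), the reset comes from the watchdog hardware (or the softdog kernel module), not from any process, so there is no userspace left to write a log line. The pattern is roughly the standard Linux watchdog API; this is a minimal sketch, not watchdog-mux itself:
Code:
/* watchdog_sketch.c - minimal sketch of the standard Linux watchdog
 * pattern; NOT the watchdog-mux source. Once the process stops writing
 * keepalives, the timer expires and the machine is reset by the watchdog
 * driver itself: no shutdown path, no sync, no syslog entry. */
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/watchdog.h>

int main(void)
{
    int fd = open("/dev/watchdog", O_WRONLY);
    int timeout = 10;                       /* seconds until reset */

    if (fd < 0)
        return 1;
    ioctl(fd, WDIOC_SETTIMEOUT, &timeout);  /* arm the timer */
    for (;;) {
        ioctl(fd, WDIOC_KEEPALIVE, 0);      /* "pet" the dog */
        sleep(1);                           /* stop petting -> silent reset */
    }
}
If that is what happened here, it would also fit the gap between the last log line at 19:03:09 and the first boot message at 19:05:07: a hard reset plus POST, with nothing written in between.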
And here is the reboot...
Code:
Oct 16 19:05:07 pve42-cluster01 systemd-modules-load[1799]: Inserted module 'iscsi_tcp'
Oct 16 19:05:07 pve42-cluster01 systemd-modules-load[1799]: Inserted module 'ib_iser'
Oct 16 19:05:07 pve42-cluster01 systemd-modules-load[1799]: Inserted module 'vhost_net'
Oct 16 19:05:07 pve42-cluster01 lvm[1792]: 1 logical volume(s) in volume group "ceph-a1141880-818f-450b-b484-826efed83f84" monitored
Oct 16 19:05:07 pve42-cluster01 lvm[1792]: 1 logical volume(s) in volume group "ceph-346f9c7a-dbfc-4d0b-8e8f-c17b6ef087d9" monitored
Oct 16 19:05:07 pve42-cluster01 kernel: [ 0.000000] Linux version 5.15.60-2-pve (build@proxmox) (gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2) #1 SMP PVE 5.15.60-2 (Tue, 04 Oct 2022 16:52:28 +0200) ()
Oct 16 19:05:07 pve42-cluster01 systemd[1]: Starting Flush Journal to Persistent Storage...
Oct 16 19:05:07 pve42-cluster01 systemd[1]: Finished Coldplug All udev Devices.
Oct 16 19:05:07 pve42-cluster01 kernel: [ 0.000000] Command line: BOOT_IMAGE=/vmlinuz-5.15.60-2-pve root=ZFS=rpool/ROOT/pve-1 ro ixgbe.allow_unsupported_sfp=1 root=ZFS=rpool/ROOT/pve-1 boot=zfs quiet
Oct 16 19:05:07 pve42-cluster01 kernel: [ 0.000000] KERNEL supported cpus:
Oct 16 19:05:07 pve42-cluster01 kernel: [ 0.000000] Intel GenuineIntel
Oct 16 19:05:07 pve42-cluster01 kernel: [ 0.000000] AMD AuthenticAMD
Oct 16 19:05:07 pve42-cluster01 kernel: [ 0.000000] Hygon HygonGenuine
Oct 16 19:05:07 pve42-cluster01 kernel: [ 0.000000] Centaur CentaurHauls
Oct 16 19:05:07 pve42-cluster01 systemd[1]: Starting Helper to synchronize boot up for ifupdown...
Oct 16 19:05:07 pve42-cluster01 kernel: [ 0.000000] zhaoxin Shanghai
Oct 16 19:05:07 pve42-cluster01 kernel: [ 0.000000] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
Oct 16 19:05:07 pve42-cluster01 kernel: [ 0.000000] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
Oct 16 19:05:07 pve42-cluster01 systemd[1]: Starting Wait for udev To Complete Device Initialization...
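One more thing worth checking, assuming the journal is persistent on these nodes (Storage=persistent in /etc/systemd/journald.conf): the previous boot's journal should end abruptly, with no clean shutdown record:
Code:
journalctl --list-boots    # one entry per recorded boot
journalctl -b -1 -e        # jump to the tail of the previous boot's log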
I appreciate that something happened, and I appreciate the need to avoid producing unnecessary logs.
But I also believe that if any element of Proxmox decides to trigger a reboot, there should be at least a record in syslog, with an explanation to help understand what happened.
To be 100% honest, if I had to read these logs by myself without your help, I would suspect a power failure (and I'm 100% sure that was not the case).