cluster nodes unstable

Binary Bandit

Hi All,

We've been running the latest Proxmox for about six months. It's been very solid and stable until this last Sunday ... six days ago. We're now having issues.

We're running a three-node cluster on Debian 9 with Ceph.

What's happening?
One of the three nodes reboots about every 24 hours. This seems to happen at about 6 AM PST, though not always.

What's changed?
This last Saturday (the day before the reboots started) each cluster node was updated with apt-get update and then apt-get upgrade. There was a kernel patch and a reboot was required.

Other observations?
We remotely monitor VMs on the cluster and have seen VMs stop functioning before the reboot. I believe that the reboot is triggered by the IPMI watchdog but am unsure of how to confirm. After the reboot the watchdog sometimes loses its configuration, showing a 15 second countdown rather than a 10 second one. If the node is rebooted once more the watchdog is then properly configured.
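(In case it helps: here's a rough sketch of the commands I'm planning to run to confirm whether the IPMI watchdog fired - assuming ipmitool is installed and that the SEL actually records the event; corrections welcome.)

Code:
# current IPMI watchdog timer state (countdown, action, whether it's running)
ipmitool mc watchdog get
# IPMI system event log - look for watchdog / reset events around the reboot time
ipmitool sel list | tail -n 20
# watchdog-related kernel messages since boot
dmesg | grep -i watchdog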

I'm new to troubleshooting this so don't be shy about checking the basics with me.

best,

James
 
@udo , thanks.

I just Googled / read about dist-upgrade and full-upgrade. While I was at it I also looked into apt vs apt-get. I'm glad to know the differences.

Unfortunately there is nothing for full-upgrade / dist-upgrade to do on the nodes. There are no packages requiring removal. Thinking back, I may have used the GUI to do the upgrade. Perhaps it uses the full-upgrade command?

James
 
Hi,
if you used the gui, the "right" packages are updated!

What does your config look like?
Can you post the output of following commands?
Code:
vgs
lvs
zfs list
Udo
 
Here you are:

Code:
root@ait1:~# vgs
  VG  #PV #LV #SN Attr   VSize VFree
  pve   1   4   0 wz--n- 1.82t 17.23g
root@ait1:~# lvs
  LV     VG  Attr       LSize  Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  aitvms pve Vwi-aotz--  1.70t data        4.11
  data   pve twi-aotz--  1.70t             4.11   2.41
  root   pve -wi-ao---- 96.00g
  swap   pve -wi-ao----  8.00g
root@ait1:~# zfs list
no datasets available

-----------------

root@ait2:~# vgs
  VG  #PV #LV #SN Attr   VSize VFree
  pve   1   4   0 wz--n- 1.82t 17.23g
root@ait2:~# lvs
  LV     VG  Attr       LSize  Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  aitvms pve Vwi-aotz--  1.70t data        6.27
  data   pve twi-aotz--  1.70t             6.27   3.42
  root   pve -wi-ao---- 96.00g
  swap   pve -wi-ao----  8.00g
root@ait2:~# zfs list
no datasets available

-------------------

root@ait3:~# vgs
  VG  #PV #LV #SN Attr   VSize VFree
  pve   1   4   0 wz--n- 1.82t 17.23g
root@ait3:~# lvs
  LV     VG  Attr       LSize  Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  aitvms pve Vwi-aotz--  1.70t data        2.32
  data   pve twi-aotz--  1.70t             2.32   1.50
  root   pve -wi-ao---- 96.00g
  swap   pve -wi-ao----  8.00g
root@ait3:~# zfs list
no datasets available
 
Hi All,

Our three cluster nodes rebooted at random for about a week. This last weekend I booted one cluster node (selected in GRUB) into the last kernel version prior to the random reboot problem. Since then (three days) the cluster hasn't had a node (old or new kernel) reboot.

I've moved all of the most critical VMs to the node running the old kernel version (it's called AIT3), left one of the other nodes with no VMs (AIT2), and put everything else on AIT1. This also means that AIT3 is experiencing the most load (mostly network IO).

node name - version (taken from the summary of each node)
AIT1 - pve-kernel-4.15.18-12-pve: 4.15.18-35
AIT2 - pve-kernel-4.15.18-12-pve: 4.15.18-35
AIT3 - pve-kernel-4.15.18-11-pve: 4.15.18-34
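(For anyone double-checking along with me: this is roughly how I'm confirming what each node is actually running versus what's installed - just standard Debian / Proxmox commands.)

Code:
# kernel currently running on this node
uname -r
# pve kernel packages installed on this node
dpkg -l 'pve-kernel-*' | grep '^ii'
# kernel package versions as reported by Proxmox (same info as the GUI summary)
pveversion -v | grep kernel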

Does this give anyone any ideas?

I'd really like to get to the bottom of this and restore our confidence in this Proxmox cluster. We have other Proxmox hosts with VMs waiting to migrate (another data center) but have put that on hold for now.

best,

James
 
* Do you have HA enabled in the cluster?
* check the journal entries before the node reboots (take a look especially for messages from corosync and pmxcfs)
 
I guess the reboots are due to the nodes losing quorum and fencing themselves - see https://pve.proxmox.com/pve-docs/pve-admin-guide.html#ha_manager_fencing

the output of `journalctl` should yield the journal - you can optionally provide some parameters to narrow down how much of the logs is shown:
* `journalctl --since '2019-04-11'` should give the complete log since today at 00:00
* `journalctl -b -1` - the journal of the second-to-last boot (that's probably the most helpful since you know how often the box got restarted)
* you can also reverse the order (so that the messages before the fence come first) with the `-r` option: `journalctl -r -b -1`

Please keep in mind that you need to have persistent journaling enabled - so if this does not provide any information, make sure that '/var/log/journal' exists (as a directory) and restart `systemd-journald` afterwards

Hope this helps!
 
Thanks Stoiko.

What you just wrote confirms what a lot of Googling just taught me. I didn't expect you to write back so soon. Too bad I didn't refresh this post while Google-ing. LOL

I didn't have persistent journaling enabled but do now. Here's what I've done:

- edited /etc/systemd/journald.conf (I used nano)
- un-commented the "Storage=" line and set it to "persistent"
- restarted journaling with "systemctl restart systemd-journald"

Now I just need one of the cluster nodes to reboot. This hasn't happened since I booted the AIT3 node into the old kernel version. I'll switch it back to the new kernel this Friday evening PST. Hopefully a node will reboot over the weekend, I'll capture it in the logs and then move things back to the stable old kernel. ... makes me wish that we had a test environment ...
 
- un-commented the "Storage=" line and set it to "persistent"
can be done - but do make sure that '/var/log/journal' exists - else I think persistent journaling does not work.
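For reference, a minimal sketch of the whole sequence (stock Debian paths; the sed line is just one way to set Storage=persistent, editing the file by hand works just as well):

Code:
# create the persistent journal directory
mkdir -p /var/log/journal
# set Storage=persistent in /etc/systemd/journald.conf
sed -i 's/^#\?Storage=.*/Storage=persistent/' /etc/systemd/journald.conf
# restart journald so it picks up the change
systemctl restart systemd-journald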

* you can also find most info in '/var/log/syslog*' (and the other files therein - e.g. `zgrep corosync *` should tell you which logs contain messages from corosync - do the same for pmxcfs).
* sadly, when a node gets fenced it usually does not flush the last lines of syslog (which can contain the important information) to disk - but checking can still help rule out a few cases.
 
make sure that '/var/log/journal' exists
I checked for '/var/log/journal'; it's there on all three nodes. I'm not sure, but I think it's created when journaling is restarted.

AIT3 is the only node with a reboot that is still visible in the syslogs ... the syslogs containing reboots for AIT1 and AIT2 have already been deleted. Unfortunately AIT3's log doesn't show anything interesting prior to the reboot. That lack of flushing is definitely sad.

I ran the following on all cluster nodes ... with times and dates adjusted to look for log entries just before the reboots. Our external monitoring software makes this easy as it records the reboot down to the second.

Code:
cd /var/log
zgrep pmxcfs * | grep " 2 " | grep " 05:"
zgrep corosync * | grep " 2 " | grep " 05:"

Above I'm looking for log entries containing pmxcfs or corosync that happened on the 2nd where the hour is 5AM and any minutes. There's probably a cleaner way to do this without all of the piping but it works.

but ... there's nothing abnormal to be seen. Everything looks good and then the node can be seen booting up.

Hopefully this information helps someone else troubleshoot. I'm stuck waiting for the weekend to cause a reboot.
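(Once a node does reboot with persistent journaling enabled, I'm thinking something like the following might pull the relevant window straight from the journal - assuming I have the unit names right; pmxcfs should be the pve-cluster unit.)

Code:
# corosync and pmxcfs (pve-cluster) messages in a one-hour window around a reboot
journalctl -u corosync -u pve-cluster --since "2019-04-02 05:00" --until "2019-04-02 06:00"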
 
I checked for '/var/log/journal' it's there on all three nodes. I'm not sure but I think that it's created when journaling is restarted.
nice - did not know that - Thanks!

but ... there's nothing abnormal to be seen. Everything looks good and then the node can be seen booting up.
could you share parts of the logs? Just to rule out that you have messages which are normal in your environment but could still indicate a bottleneck in your cluster network (e.g. retransmits can happen in regular networks as well, but they can also indicate that the network is at its limit).

Since fencing only happens if:
* ha is active
* at least one ha-service ran on the node since its last boot
* the node is not in the quorate partition of the corosync cluster
the node did lose quorum if it got fenced.

To rule out another cause of the reboot (unlikely, but still) - please check the logs for the same timeframe on the other nodes. The pve-ha-crm service logs when a node gets fenced (and where its services are recovered to) - usually grepping for 'fence' in the logs should show that (but do also read the logs for the timeframe).

Also check the fenced node's log in the timeframe (not with grep but by reading it) for any messages from the kernel - maybe it's a bug in the driver of the NIC that corosync runs on.

Last, it could also be a bug in the watchdog module you're using - which one do you have configured (softdog is the default), and are there any other particularities of your setup?
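If I remember correctly, you can check which watchdog is in use roughly like this (the module is set in /etc/default/pve-ha-manager; if nothing is set there, softdog is used):

Code:
# watchdog module configured for the HA stack (no WATCHDOG_MODULE line usually means softdog)
grep -v '^#' /etc/default/pve-ha-manager
# watchdog modules currently loaded
lsmod | grep -E 'softdog|ipmi_watchdog'
# watchdog device(s) present
ls -l /dev/watchdog*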

zgrep corosync * | grep " 2 " | grep " 05:"
I probably would have used `zgrep '2 05:.*corosync' *` - but whatever works is ok :)

Hope this helps!
 
could you share parts of the logs?

Definitely ... here are some logs from AIT3, the last node to reboot and the one that I have syslog entries for. The reboot happened just after Apr 5 05:19:30. The node's address is 172.20.64.14. 172.20.64.253 is one of our monitoring servers.
Code:
Apr  5 05:00:00 ait3 systemd[1]: Starting Proxmox VE replication runner...
Apr  5 05:00:00 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:00:00 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:00:00 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:00:00 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:00:00 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:00:01 ait3 systemd[1]: Started Proxmox VE replication runner.
Apr  5 05:00:30 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:00:30 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:00:30 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:00:30 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:00:30 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:01:00 ait3 systemd[1]: Starting Proxmox VE replication runner...
Apr  5 05:01:00 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:01:00 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:01:00 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:01:00 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:01:00 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:01:01 ait3 systemd[1]: Started Proxmox VE replication runner.
Apr  5 05:01:30 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:01:30 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:01:30 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:01:30 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:01:30 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:02:00 ait3 systemd[1]: Starting Proxmox VE replication runner...
Apr  5 05:02:00 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:02:00 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:02:00 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:02:00 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:02:00 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:02:01 ait3 systemd[1]: Started Proxmox VE replication runner.
Apr  5 05:02:30 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:02:30 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:02:30 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:02:30 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:02:30 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:03:00 ait3 systemd[1]: Starting Proxmox VE replication runner...
Apr  5 05:03:00 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:03:00 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:03:00 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:03:00 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:03:00 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:03:01 ait3 systemd[1]: Started Proxmox VE replication runner.
Apr  5 05:03:30 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:03:30 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:03:30 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:03:30 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:03:30 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:04:00 ait3 systemd[1]: Starting Proxmox VE replication runner...
Apr  5 05:04:00 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:04:00 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:04:00 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:04:00 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:04:00 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:04:01 ait3 systemd[1]: Started Proxmox VE replication runner.
Apr  5 05:04:30 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:04:30 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:04:30 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:04:30 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:04:30 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:05:00 ait3 systemd[1]: Starting Proxmox VE replication runner...
Apr  5 05:05:00 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:05:00 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:05:00 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:05:00 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:05:00 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:05:01 ait3 systemd[1]: Started Proxmox VE replication runner.
Apr  5 05:05:18 ait3 snmpd[4329]: Connection from UDP: [172.20.64.253]:64390->[172.20.64.14]:161
Apr  5 05:05:30 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:05:30 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:05:30 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:05:30 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:05:30 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:06:00 ait3 systemd[1]: Starting Proxmox VE replication runner...
Apr  5 05:06:00 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:06:00 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:06:00 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:06:00 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:06:00 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:06:01 ait3 systemd[1]: Started Proxmox VE replication runner.
Apr  5 05:06:18 ait3 pmxcfs[4785]: [dcdb] notice: data verification successful
Apr  5 05:06:30 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:06:30 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:06:30 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:06:30 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:06:30 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:07:00 ait3 systemd[1]: Starting Proxmox VE replication runner...
Apr  5 05:07:00 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:07:00 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:07:00 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:07:00 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:07:00 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:07:01 ait3 systemd[1]: Started Proxmox VE replication runner.
Apr  5 05:07:30 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:07:30 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:07:30 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:07:30 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:07:30 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:08:00 ait3 systemd[1]: Starting Proxmox VE replication runner...
Apr  5 05:08:00 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:08:00 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:08:00 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:08:00 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:08:00 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:08:01 ait3 systemd[1]: Started Proxmox VE replication runner.
Apr  5 05:08:30 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:08:30 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:08:30 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:08:30 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:08:30 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:09:00 ait3 systemd[1]: Starting Proxmox VE replication runner...
Apr  5 05:09:00 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:09:00 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:09:00 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:09:00 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:09:00 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:09:01 ait3 systemd[1]: Started Proxmox VE replication runner.
Apr  5 05:09:30 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:09:30 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:09:30 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:09:30 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:09:30 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:10:00 ait3 systemd[1]: Starting Proxmox VE replication runner...
Apr  5 05:10:00 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:10:00 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:10:00 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:10:00 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:10:00 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:10:01 ait3 systemd[1]: Started Proxmox VE replication runner.
Apr  5 05:10:30 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:10:30 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:10:30 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:10:30 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:10:30 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:11:00 ait3 systemd[1]: Starting Proxmox VE replication runner...
Apr  5 05:11:00 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:11:00 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:11:00 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:11:00 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:11:00 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:11:01 ait3 systemd[1]: Started Proxmox VE replication runner.
Apr  5 05:11:30 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:11:30 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:11:30 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:11:30 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:11:30 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:12:00 ait3 systemd[1]: Starting Proxmox VE replication runner...
Apr  5 05:12:00 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:12:00 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:12:00 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:12:00 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:12:00 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:12:01 ait3 systemd[1]: Started Proxmox VE replication runner.
Apr  5 05:12:30 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:12:30 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:12:30 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:12:30 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:12:30 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:13:00 ait3 systemd[1]: Starting Proxmox VE replication runner...
Apr  5 05:13:00 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:13:00 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:13:00 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:13:00 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:13:00 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:13:01 ait3 systemd[1]: Started Proxmox VE replication runner.
Apr  5 05:13:30 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:13:30 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:13:30 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:13:30 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:13:30 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:14:00 ait3 systemd[1]: Starting Proxmox VE replication runner...
Apr  5 05:14:00 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:14:00 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:14:00 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:14:00 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:14:00 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:14:01 ait3 systemd[1]: Started Proxmox VE replication runner.
Apr  5 05:14:30 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:14:30 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:14:30 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:14:30 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:14:30 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:14:35 ait3 pmxcfs[4785]: [status] notice: received log
Apr  5 05:15:00 ait3 systemd[1]: Starting Proxmox VE replication runner...
Apr  5 05:15:00 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:15:00 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:15:00 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:15:00 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:15:00 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:15:01 ait3 systemd[1]: Started Proxmox VE replication runner.
Apr  5 05:15:30 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:15:30 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:15:30 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:15:30 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:15:30 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:16:00 ait3 systemd[1]: Starting Proxmox VE replication runner...
Apr  5 05:16:00 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:16:00 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:16:00 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:16:00 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:16:00 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:16:01 ait3 systemd[1]: Started Proxmox VE replication runner.
Apr  5 05:16:30 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:16:30 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:16:30 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:16:30 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:16:30 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:17:00 ait3 systemd[1]: Starting Proxmox VE replication runner...
Apr  5 05:17:00 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:17:00 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:17:00 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:17:00 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:17:00 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:17:01 ait3 systemd[1]: Started Proxmox VE replication runner.
Apr  5 05:17:01 ait3 CRON[2898136]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Apr  5 05:17:30 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:17:30 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:17:30 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:17:30 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:17:30 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:18:00 ait3 systemd[1]: Starting Proxmox VE replication runner...
Apr  5 05:18:00 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:18:00 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:18:00 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:18:00 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:18:00 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:18:01 ait3 systemd[1]: Started Proxmox VE replication runner.
Apr  5 05:18:30 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:18:30 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:18:30 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:18:30 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:18:30 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:19:00 ait3 systemd[1]: Starting Proxmox VE replication runner...
Apr  5 05:19:00 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:19:00 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:19:00 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:19:00 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:19:00 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:19:01 ait3 systemd[1]: Started Proxmox VE replication runner.
Apr  5 05:19:30 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:19:30 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:19:30 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:19:30 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:19:30 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)

'zgrep corosync * | grep " 5 " | grep " 05:"' didn't return anything. (Sticking with the command that I'm familiar with for now.)
'zgrep corosync * | grep " 5 "' doesn't show any entries until 09:15 when I reboot a cluster node to troubleshoot.

'zgrep pmxcfs * | grep " 5 " | grep " 05:"' returns nothing as well.
Here are 4 lines before and after the reboot time using 'zgrep pmxcfs * | grep " 5 "'.

Code:
daemon.log.1:Apr  5 04:06:18 ait3 pmxcfs[4785]: [dcdb] notice: data verification successful
daemon.log.1:Apr  5 04:14:33 ait3 pmxcfs[4785]: [status] notice: received log
daemon.log.1:Apr  5 04:29:33 ait3 pmxcfs[4785]: [status] notice: received log
daemon.log.1:Apr  5 04:44:34 ait3 pmxcfs[4785]: [status] notice: received log
syslog.7.gz:Apr  5 06:29:39 ait3 pmxcfs[4793]: [status] notice: received log
syslog.7.gz:Apr  5 06:44:39 ait3 pmxcfs[4793]: [status] notice: received log
syslog.7.gz:Apr  5 06:59:40 ait3 pmxcfs[4793]: [status] notice: received log
syslog.7.gz:Apr  5 07:06:18 ait3 pmxcfs[4793]: [dcdb] notice: data verification successful

please check the logs for the same timeframe on the other nodes

On AIT2, using 'zgrep fence *' provides a page of entries with pve-ha-crm.
'zgrep pve-ha-crm * | grep " 5 05:"' shows:
Note that there aren't any entries before 5AM on the 5th ... checked by removing the '05:'.

Code:
daemon.log.1:Apr  5 05:22:27 ait2 pve-ha-crm[5642]: successfully acquired lock 'ha_manager_lock'
daemon.log.1:Apr  5 05:22:27 ait2 pve-ha-crm[5642]: watchdog active
daemon.log.1:Apr  5 05:22:27 ait2 pve-ha-crm[5642]: status change slave => master
daemon.log.1:Apr  5 05:22:27 ait2 pve-ha-crm[5642]: node 'ait3': state changed from 'online' => 'unknown'
daemon.log.1:Apr  5 05:23:27 ait2 pve-ha-crm[5642]: service 'vm:100': state changed from 'started' to 'fence'
daemon.log.1:Apr  5 05:23:27 ait2 pve-ha-crm[5642]: service 'vm:106': state changed from 'started' to 'fence'
daemon.log.1:Apr  5 05:23:27 ait2 pve-ha-crm[5642]: service 'vm:109': state changed from 'started' to 'fence'
daemon.log.1:Apr  5 05:23:27 ait2 pve-ha-crm[5642]: service 'vm:110': state changed from 'started' to 'fence'
daemon.log.1:Apr  5 05:23:27 ait2 pve-ha-crm[5642]: service 'vm:111': state changed from 'started' to 'fence'
daemon.log.1:Apr  5 05:23:27 ait2 pve-ha-crm[5642]: service 'vm:112': state changed from 'started' to 'fence'
daemon.log.1:Apr  5 05:23:27 ait2 pve-ha-crm[5642]: service 'vm:113': state changed from 'started' to 'fence'
daemon.log.1:Apr  5 05:23:27 ait2 pve-ha-crm[5642]: service 'vm:115': state changed from 'started' to 'fence'
daemon.log.1:Apr  5 05:23:27 ait2 pve-ha-crm[5642]: node 'ait3': state changed from 'unknown' => 'fence'
daemon.log.1:Apr  5 05:23:37 ait2 pve-ha-crm[5642]: successfully acquired lock 'ha_agent_ait3_lock'
daemon.log.1:Apr  5 05:23:37 ait2 pve-ha-crm[5642]: fencing: acknowledged - got agent lock for node 'ait3'
daemon.log.1:Apr  5 05:23:37 ait2 pve-ha-crm[5642]: node 'ait3': state changed from 'fence' => 'unknown'
daemon.log.1:Apr  5 05:23:37 ait2 pve-ha-crm[5642]: recover service 'vm:100' from fenced node 'ait3' to node 'ait2'
daemon.log.1:Apr  5 05:23:37 ait2 pve-ha-crm[5642]: service 'vm:100': state changed from 'fence' to 'started'  (node = ait2)
daemon.log.1:Apr  5 05:23:37 ait2 pve-ha-crm[5642]: recover service 'vm:106' from fenced node 'ait3' to node 'ait2'
daemon.log.1:Apr  5 05:23:37 ait2 pve-ha-crm[5642]: service 'vm:106': state changed from 'fence' to 'started'  (node = ait2)
daemon.log.1:Apr  5 05:23:37 ait2 pve-ha-crm[5642]: recover service 'vm:109' from fenced node 'ait3' to node 'ait1'
daemon.log.1:Apr  5 05:23:37 ait2 pve-ha-crm[5642]: service 'vm:109': state changed from 'fence' to 'started'  (node = ait1)
daemon.log.1:Apr  5 05:23:37 ait2 pve-ha-crm[5642]: recover service 'vm:110' from fenced node 'ait3' to node 'ait2'
daemon.log.1:Apr  5 05:23:37 ait2 pve-ha-crm[5642]: service 'vm:110': state changed from 'fence' to 'started'  (node = ait2)
daemon.log.1:Apr  5 05:23:37 ait2 pve-ha-crm[5642]: recover service 'vm:111' from fenced node 'ait3' to node 'ait1'
daemon.log.1:Apr  5 05:23:37 ait2 pve-ha-crm[5642]: service 'vm:111': state changed from 'fence' to 'started'  (node = ait1)
daemon.log.1:Apr  5 05:23:37 ait2 pve-ha-crm[5642]: recover service 'vm:112' from fenced node 'ait3' to node 'ait2'
daemon.log.1:Apr  5 05:23:37 ait2 pve-ha-crm[5642]: service 'vm:112': state changed from 'fence' to 'started'  (node = ait2)
daemon.log.1:Apr  5 05:23:37 ait2 pve-ha-crm[5642]: recover service 'vm:113' from fenced node 'ait3' to node 'ait1'
daemon.log.1:Apr  5 05:23:37 ait2 pve-ha-crm[5642]: service 'vm:113': state changed from 'fence' to 'started'  (node = ait1)
daemon.log.1:Apr  5 05:23:37 ait2 pve-ha-crm[5642]: recover service 'vm:115' from fenced node 'ait3' to node 'ait2'
daemon.log.1:Apr  5 05:23:37 ait2 pve-ha-crm[5642]: service 'vm:115': state changed from 'fence' to 'started'  (node = ait2)
daemon.log.1:Apr  5 05:24:57 ait2 pve-ha-crm[5642]: node 'ait3': state changed from 'unknown' => 'online'

Looking in more detail, here's a bit from 'cat daemon.log.1 | grep "Apr 5 05:"'
Code:
Apr  5 05:19:00 ait2 systemd[1]: Starting Proxmox VE replication runner...
Apr  5 05:19:01 ait2 systemd[1]: Started Proxmox VE replication runner.
Apr  5 05:19:21 ait2 snmpd[4426]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:19:21 ait2 snmpd[4426]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:19:21 ait2 snmpd[4426]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:19:21 ait2 snmpd[4426]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:19:21 ait2 snmpd[4426]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:19:51 ait2 snmpd[4426]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:19:51 ait2 snmpd[4426]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:19:51 ait2 snmpd[4426]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:19:51 ait2 snmpd[4426]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:19:51 ait2 snmpd[4426]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:20:00 ait2 systemd[1]: Starting Proxmox VE replication runner...
Apr  5 05:20:01 ait2 systemd[1]: Started Proxmox VE replication runner.
Apr  5 05:20:18 ait2 snmpd[4426]: Connection from UDP: [172.20.64.253]:58590->[172.20.64.13]:161
Apr  5 05:20:21 ait2 snmpd[4426]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:20:21 ait2 snmpd[4426]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:20:21 ait2 snmpd[4426]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:20:21 ait2 snmpd[4426]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:20:21 ait2 snmpd[4426]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:20:31 ait2 corosync[5025]: notice  [TOTEM ] A new membership (172.20.128.12:5452) was formed. Members left: 3
Apr  5 05:20:31 ait2 corosync[5025]: notice  [TOTEM ] Failed to receive the leave message. failed: 3
Apr  5 05:20:31 ait2 corosync[5025]:  [TOTEM ] A new membership (172.20.128.12:5452) was formed. Members left: 3
Apr  5 05:20:31 ait2 corosync[5025]:  [TOTEM ] Failed to receive the leave message. failed: 3
Apr  5 05:20:31 ait2 corosync[5025]: warning [CPG   ] downlist left_list: 1 received
Apr  5 05:20:31 ait2 corosync[5025]:  [CPG   ] downlist left_list: 1 received
Apr  5 05:20:31 ait2 corosync[5025]:  [CPG   ] downlist left_list: 1 received
Apr  5 05:20:31 ait2 corosync[5025]: warning [CPG   ] downlist left_list: 1 received
Apr  5 05:20:31 ait2 pmxcfs[4870]: [dcdb] notice: members: 1/4736, 2/4870
Apr  5 05:20:31 ait2 corosync[5025]: notice  [QUORUM] Members[2]: 1 2
Apr  5 05:20:31 ait2 corosync[5025]: notice  [MAIN  ] Completed service synchronization, ready to provide service.
Apr  5 05:20:31 ait2 pmxcfs[4870]: [dcdb] notice: starting data syncronisation
Apr  5 05:20:31 ait2 pmxcfs[4870]: [status] notice: members: 1/4736, 2/4870
Apr  5 05:20:31 ait2 pmxcfs[4870]: [status] notice: starting data syncronisation
Apr  5 05:20:31 ait2 corosync[5025]:  [QUORUM] Members[2]: 1 2
Apr  5 05:20:31 ait2 corosync[5025]:  [MAIN  ] Completed service synchronization, ready to provide service.

On AIT1, using 'zgrep fence *' doesn't show anything.
There's something in daemon.log.1 that isn't text, as I had to use 'grep "Apr 5 05:" daemon.log.1 -a' to pull this:
Code:
Apr  5 05:19:01 ait1 systemd[1]: Started Proxmox VE replication runner.
Apr  5 05:19:02 ait1 snmpd[4323]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:19:02 ait1 snmpd[4323]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:19:02 ait1 snmpd[4323]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:19:02 ait1 snmpd[4323]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:19:02 ait1 snmpd[4323]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:19:32 ait1 snmpd[4323]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:19:32 ait1 snmpd[4323]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:19:32 ait1 snmpd[4323]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:19:32 ait1 snmpd[4323]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:19:32 ait1 snmpd[4323]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:20:00 ait1 systemd[1]: Starting Proxmox VE replication runner...
Apr  5 05:20:01 ait1 systemd[1]: Started Proxmox VE replication runner.
Apr  5 05:20:02 ait1 snmpd[4323]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:20:02 ait1 snmpd[4323]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:20:02 ait1 snmpd[4323]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:20:02 ait1 snmpd[4323]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:20:02 ait1 snmpd[4323]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:20:15 ait1 pveproxy[2190176]: Clearing outdated entries from certificate cache
Apr  5 05:20:18 ait1 snmpd[4323]: Connection from UDP: [172.20.64.253]:58593->[172.20.64.12]:161
Apr  5 05:20:29 ait1 corosync[4983]: notice  [TOTEM ] A processor failed, forming new configuration.
Apr  5 05:20:29 ait1 corosync[4983]:  [TOTEM ] A processor failed, forming new configuration.
Apr  5 05:20:31 ait1 corosync[4983]: notice  [TOTEM ] A new membership (172.20.128.12:5452) was formed. Members left: 3
Apr  5 05:20:31 ait1 corosync[4983]: notice  [TOTEM ] Failed to receive the leave message. failed: 3
Apr  5 05:20:31 ait1 corosync[4983]:  [TOTEM ] A new membership (172.20.128.12:5452) was formed. Members left: 3
Apr  5 05:20:31 ait1 corosync[4983]:  [TOTEM ] Failed to receive the leave message. failed: 3
Apr  5 05:20:31 ait1 corosync[4983]: warning [CPG   ] downlist left_list: 1 received
Apr  5 05:20:31 ait1 corosync[4983]:  [CPG   ] downlist left_list: 1 received
Apr  5 05:20:31 ait1 corosync[4983]: warning [CPG   ] downlist left_list: 1 received
Apr  5 05:20:31 ait1 corosync[4983]:  [CPG   ] downlist left_list: 1 received
Apr  5 05:20:31 ait1 pmxcfs[4736]: [dcdb] notice: members: 1/4736, 2/4870
Apr  5 05:20:31 ait1 corosync[4983]: notice  [QUORUM] Members[2]: 1 2
Apr  5 05:20:31 ait1 corosync[4983]: notice  [MAIN  ] Completed service synchronization, ready to provide service.
Apr  5 05:20:31 ait1 pmxcfs[4736]: [dcdb] notice: starting data syncronisation
Apr  5 05:20:31 ait1 corosync[4983]:  [QUORUM] Members[2]: 1 2
Apr  5 05:20:31 ait1 corosync[4983]:  [MAIN  ] Completed service synchronization, ready to provide service.
Apr  5 05:20:31 ait1 pmxcfs[4736]: [dcdb] notice: cpg_send_message retried 1 times
Apr  5 05:20:31 ait1 pmxcfs[4736]: [status] notice: members: 1/4736, 2/4870
Apr  5 05:20:31 ait1 pmxcfs[4736]: [status] notice: starting data syncronisation
Apr  5 05:20:31 ait1 pmxcfs[4736]: [dcdb] notice: received sync request (epoch 1/4736/0000000A)
Apr  5 05:20:31 ait1 pmxcfs[4736]: [status] notice: received sync request (epoch 1/4736/0000000A)
Apr  5 05:20:31 ait1 pmxcfs[4736]: [dcdb] notice: received all states
Apr  5 05:20:31 ait1 pmxcfs[4736]: [dcdb] notice: leader is 1/4736
Apr  5 05:20:31 ait1 pmxcfs[4736]: [dcdb] notice: synced members: 1/4736, 2/4870
Apr  5 05:20:31 ait1 pmxcfs[4736]: [dcdb] notice: start sending inode updates
Apr  5 05:20:31 ait1 pmxcfs[4736]: [dcdb] notice: sent all (0) updates
Apr  5 05:20:31 ait1 pmxcfs[4736]: [dcdb] notice: all data is up to date
Apr  5 05:20:31 ait1 pmxcfs[4736]: [dcdb] notice: dfsm_deliver_queue: queue length 5
Apr  5 05:20:31 ait1 pmxcfs[4736]: [status] notice: received all states
Apr  5 05:20:31 ait1 pmxcfs[4736]: [status] notice: all data is up to date
Apr  5 05:20:31 ait1 pmxcfs[4736]: [status] notice: dfsm_deliver_queue: queue length 7
Apr  5 05:20:32 ait1 snmpd[4323]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:20:32 ait1 snmpd[4323]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:20:32 ait1 snmpd[4323]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:20:32 ait1 snmpd[4323]: error on subcontainer 'ia_addr' insert (-1)
Apr  5 05:20:32 ait1 snmpd[4323]: error on subcontainer 'ia_addr' insert (-1)

check the fenced node's log ... for any messages from the kernel
OK ... using 'zgrep ' 5 05:.*kernel' *' on AIT3 shows that I should look in kern.log.1 and messages.1.

'zgrep ' 5 05:' kern.log.1' only shows the system booting ... there's nothing before it. The only earlier entries are from pveupdate, 2 hours before:
Code:
Apr  5 03:38:35 ait3 pveupdate[2862945]: <root@pam> starting task UPID:ait3:002BAF91:02EB5BA9:5CA7302B:aptupdate::root@pam:
Apr  5 03:38:40 ait3 pveupdate[2862945]: <root@pam> end task UPID:ait3:002BAF91:02EB5BA9:5CA7302B:aptupdate::root@pam: OK

messages.1 looks to contain the same information.

it could also be a bug in the watchdog module you're using - which one do you have configured
We're using the IPMI watchdog.
Perhaps it's worth a few questions here from you, given that any node which reboots (in an unplanned way) sometimes loses its watchdog config. Another manual reboot fixes this.

any other particularities of your setup
Hmm ... I don't think so but I'm filtering based on what I know. The setup has worked well up to this point ... no other strange issues.
 
At first look it seems that the node just lost connectivity - i.e. no retransmits, no overloaded network; it just vanished.

* Is the time on the nodes synchronized?
* Otherwise it would help if you could set up remote syslog - that way you increase the chances of getting the last messages before the fence (a minimal sketch follows below this list).
* Maybe also try to disable HA (or just leave the node running without any resources configured - then it won't get fenced) - maybe you'll get some more info on the reason for the crash
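A minimal sketch of the remote syslog forwarding with rsyslog (using your monitoring host's address purely as a placeholder; '@' forwards via UDP, '@@' would be TCP):

Code:
# on each cluster node, e.g. in /etc/rsyslog.d/90-forward.conf
*.*  @172.20.64.253:514
# then restart rsyslog:
systemctl restart rsyslog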
 
Thanks Stoiko.

The time on the nodes is definitely synchronized.

After reading about the changes / fixes in 5.4 I decided to upgrade. The cluster hasn't rebooted since. Everything seems to point to something in 5.3 that didn't work well with my config.

I'm guessing it's something in my hardware config, as I don't think that our 3 node cluster is that unique. Well, maybe the active-backup bonds on the NICs for every network except for one of the two corosync LANs. The three nodes are identical Dell R510s: 64GB RAM, 2 x 6-core CPUs, 2TB Dell SATA drives, PERC H700 RAID card ... throwing this out there in case someone knows of a known issue.

For now, I'm going to watch and wait.
 