cluster nodes unstable

Discussion in 'Proxmox VE: Installation and configuration' started by Binary Bandit, Apr 6, 2019.

  1. Binary Bandit

    Binary Bandit New Member
    Proxmox Subscriber

    Joined:
    Dec 13, 2018
    Messages:
    27
    Likes Received:
    1
    Hi All,

    We've been running the latest Proxmox for about six months. It was very solid and stable until this last Sunday, six days ago. We're now having issues.

    We're running a three node cluster, Debian 9 and Ceph.

    What's happening?
    One of the three nodes reboots about every 24 hours, usually around 6 AM PST, though not always.

    What's changed?
    This last Saturday (the day before the reboots started), each cluster node was updated with apt-get update and then apt-get upgrade. There was a kernel patch, and a reboot was required.

    Other observations?
    We remotely monitor VMs on the cluster and have seen VMs stop functioning before the reboot. I believe the reboot is triggered by the IPMI watchdog, but I'm unsure how best to confirm that. After the reboot the watchdog sometimes loses its configuration, showing a 15-second countdown rather than a 10-second one. If the node is rebooted once more, the watchdog is then properly configured.
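
    The closest probe I've found so far is below - a sketch only, assuming ipmitool is installed and can actually reach our BMCs (I haven't verified this is the right way to read the watchdog):

    ```shell
    # Query the BMC watchdog state via IPMI; 'mc watchdog get' prints the
    # timer use, action, and current countdown. Falls back to a note when
    # ipmitool is missing or the BMC is not reachable from this host.
    ipmitool mc watchdog get 2>/dev/null || echo "ipmitool unavailable or BMC not reachable"
    ```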

    I'm new to troubleshooting this, so don't be shy about checking the basics with me.

    best,

    James
     
  2. udo

    udo Well-Known Member
    Proxmox Subscriber

    Joined:
    Apr 22, 2009
    Messages:
    5,831
    Likes Received:
    158
    Hi,
    Proxmox VE is a rolling release. It's important to use "apt dist-upgrade" (or full-upgrade)!
    "apt upgrade" (or apt-get upgrade) isn't enough!

    Perhaps this solves the issue?!

    Udo
     
  3. Binary Bandit

    Binary Bandit New Member
    Proxmox Subscriber

    Joined:
    Dec 13, 2018
    Messages:
    27
    Likes Received:
    1
    @udo , thanks.

    I just Googled / read about dist-upgrade and full-upgrade. While I was at it, I checked into apt vs apt-get. I'm glad to know the differences.

    Unfortunately there is nothing for full-upgrade / dist-upgrade to do on the nodes. There are no packages requiring removal. Thinking back, I may have used the GUI to do the upgrade. Perhaps it uses the full-upgrade command?

    James
     
  4. udo

    udo Well-Known Member
    Proxmox Subscriber

    Joined:
    Apr 22, 2009
    Messages:
    5,831
    Likes Received:
    158
    Hi,
    if you used the GUI, the "right" packages were updated!

    What does your config look like?
    Can you post the output of the following commands?
    Code:
    vgs
    lvs
    zfs list
    
    Udo
     
  5. Binary Bandit

    Binary Bandit New Member
    Proxmox Subscriber

    Joined:
    Dec 13, 2018
    Messages:
    27
    Likes Received:
    1
    Here you are:

    Code:
    root@ait1:~# vgs
      VG  #PV #LV #SN Attr   VSize VFree
      pve   1   4   0 wz--n- 1.82t 17.23g
    root@ait1:~# lvs
      LV     VG  Attr       LSize  Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
      aitvms pve Vwi-aotz--  1.70t data        4.11
      data   pve twi-aotz--  1.70t             4.11   2.41
      root   pve -wi-ao---- 96.00g
      swap   pve -wi-ao----  8.00g
    root@ait1:~# zfs list
    no datasets available
    
    -----------------
    
    root@ait2:~# vgs
      VG  #PV #LV #SN Attr   VSize VFree
      pve   1   4   0 wz--n- 1.82t 17.23g
    root@ait2:~# lvs
      LV     VG  Attr       LSize  Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
      aitvms pve Vwi-aotz--  1.70t data        6.27
      data   pve twi-aotz--  1.70t             6.27   3.42
      root   pve -wi-ao---- 96.00g
      swap   pve -wi-ao----  8.00g
    root@ait2:~# zfs list
    no datasets available
    
    -------------------
    
    root@ait3:~# vgs
      VG  #PV #LV #SN Attr   VSize VFree
      pve   1   4   0 wz--n- 1.82t 17.23g
    root@ait3:~# lvs
      LV     VG  Attr       LSize  Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
      aitvms pve Vwi-aotz--  1.70t data        2.32
      data   pve twi-aotz--  1.70t             2.32   1.50
      root   pve -wi-ao---- 96.00g
      swap   pve -wi-ao----  8.00g
    root@ait3:~# zfs list
    no datasets available
     
  6. Binary Bandit

    Binary Bandit New Member
    Proxmox Subscriber

    Joined:
    Dec 13, 2018
    Messages:
    27
    Likes Received:
    1
    Hi All,

    Our three cluster nodes rebooted at random for about a week. This last weekend I started one cluster node (selecting it in GRUB) on the last kernel version prior to the random-reboot problem. Since then (three days), the cluster hasn't had a node (old or new kernel) reboot.

    I've moved all of the most critical VMs to the node running the old kernel (it's called AIT3), left one of the other nodes with no VMs (AIT2), and put everything else on AIT1. This also means that AIT3 is carrying the most (mostly network I/O) load.

    node name - version (taken from the summary of each node)
    AIT1 - pve-kernel-4.15.18-12-pve: 4.15.18-35
    AIT2 - pve-kernel-4.15.18-12-pve: 4.15.18-35
    AIT3 - pve-kernel-4.15.18-11-pve: 4.15.18-34

    Does this give anyone any ideas?

    I'd really like to get to the bottom of this and restore our confidence in this Proxmox cluster. We have other Proxmox hosts with VMs waiting to migrate (in another data center) but have put that on hold for now.

    best,

    James
     
  7. Stoiko Ivanov

    Stoiko Ivanov Proxmox Staff Member
    Staff Member

    Joined:
    May 2, 2018
    Messages:
    1,198
    Likes Received:
    102
    * Do you have HA enabled in the cluster?
    * check the journal entries before the node reboots (take a look especially for messages from corosync and pmxcfs)
     
  8. Binary Bandit

    Binary Bandit New Member
    Proxmox Subscriber

    Joined:
    Dec 13, 2018
    Messages:
    27
    Likes Received:
    1
    Yes, HA is enabled in the cluster.

    How is checking the journal best done? I'm not sure where to look for these entries.

    --- edit ---
    investigating journalctl now ...
     
    #8 Binary Bandit, Apr 11, 2019
    Last edited: Apr 11, 2019
  9. Stoiko Ivanov

    Stoiko Ivanov Proxmox Staff Member
    Staff Member

    Joined:
    May 2, 2018
    Messages:
    1,198
    Likes Received:
    102
    I guess the reboots are due to the nodes losing quorum and fencing themselves (see https://pve.proxmox.com/pve-docs/pve-admin-guide.html#ha_manager_fencing).

    the output of `journalctl` should yield the journal - you can optionally provide some parameters to narrow down how much of the log is shown:
    * `journalctl --since '2019-04-11'` should give the complete log since today at 00:00
    * `journalctl -b -1` - the journal of the previous boot (that's probably the most helpful, since you know how often the box got restarted)
    * you can also reverse the order (so that the messages before the fence come on top) with the `-r` option: `journalctl -r -b -1`

    Please keep in mind that you need to have persistent journaling enabled - so if this does not provide any information, make sure that '/var/log/journal' exists (as a directory) and restart `systemd-journald` afterwards

    Hope this helps!
     
  10. Binary Bandit

    Binary Bandit New Member
    Proxmox Subscriber

    Joined:
    Dec 13, 2018
    Messages:
    27
    Likes Received:
    1
    Thanks Stoiko.

    What you just wrote confirms what a lot of Googling just taught me. I didn't expect you to write back so soon. Too bad I didn't refresh this post while Googling. LOL

    I didn't have persistent journaling enabled but do now. Here's what I've done:

    - edited /etc/systemd/journald.conf (I used nano)
    - un-commented the "Storage=" line and set it to "persistent"
    - restarted journaling with "systemctl restart systemd-journald"

    Now I just need one of the cluster nodes to reboot. This hasn't happened since I booted the AIT3 node with the old kernel version. I'll change back to the new kernel this Friday evening (PST). Hopefully a node will reboot over the weekend, I'll capture it in the logs, and then I'll move things back to the stable old kernel. ... makes me wish that we had a test environment ...
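
    In script form, the same change looks like this - demonstrated on a throwaway copy rather than the real /etc/systemd/journald.conf:

    ```shell
    # Make the journald storage change on a temporary stand-in file.
    conf=$(mktemp)
    printf '[Journal]\n#Storage=auto\n' > "$conf"          # stand-in for /etc/systemd/journald.conf
    sed -i 's/^#*Storage=.*/Storage=persistent/' "$conf"   # un-comment the line and set persistent
    grep '^Storage=persistent' "$conf"                     # prints: Storage=persistent
    rm -f "$conf"
    # on a real node, follow up with: systemctl restart systemd-journald
    ```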
     
  11. Stoiko Ivanov

    Stoiko Ivanov Proxmox Staff Member
    Staff Member

    Joined:
    May 2, 2018
    Messages:
    1,198
    Likes Received:
    102
    can be done - but do make sure that '/var/log/journal' exists - else I think persistent journaling does not work.

    * you can also find most of the information in '/var/log/syslog*' (and the other files in /var/log - e.g. `zgrep corosync *` run there should tell you which logs contain messages from corosync - do the same for pmxcfs).
    * sadly, when fencing, the node usually does not flush the last lines of syslog (which can contain the important information) to disk - but the logs can still help rule out a few cases.
     
  12. Binary Bandit

    Binary Bandit New Member
    Proxmox Subscriber

    Joined:
    Dec 13, 2018
    Messages:
    27
    Likes Received:
    1
    I checked for '/var/log/journal'; it's there on all three nodes. I'm not sure, but I think it's created when journaling is restarted.

    AIT3 is the only node with a reboot that is still visible in the syslogs; the syslogs containing the reboots for AIT1 and AIT2 have already been deleted. Unfortunately, AIT3's log doesn't show anything interesting prior to the reboot. That lack of flushing is definitely sad.

    I ran the following on all cluster nodes ... with times and dates adjusted to look for log entries just before the reboots. Our external monitoring software makes this easy as it records the reboot down to the second.

    Code:
    cd /var/log
    zgrep pmxcfs * | grep " 2 " | grep " 05:"
    zgrep corosync * | grep " 2 " | grep " 05:"
    
    Above I'm looking for log entries containing pmxcfs or corosync that happened on the 2nd, where the hour is 5 AM with any minutes. There's probably a cleaner way to do this without all of the piping, but it works.
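
    To sanity-check the filter logic, I ran it against a few made-up lines (sample data only, not real log output); only the entry from the 2nd in the 5 AM hour survives:

    ```shell
    # Three hypothetical syslog-style lines; the chained greps should keep
    # only the pmxcfs/corosync entry from the 2nd in the 5 AM hour.
    printf '%s\n' \
      'Apr  2 05:14:35 ait3 pmxcfs[4785]: [status] notice: received log' \
      'Apr  2 09:15:01 ait3 corosync[4800]: [TOTEM ] sample entry' \
      'Apr  3 05:06:18 ait3 pmxcfs[4785]: [dcdb] notice: data verification successful' \
      | grep 'pmxcfs\|corosync' | grep ' 2 ' | grep ' 05:'
    ```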

    but ... there's nothing abnormal to be seen. Everything looks good and then the node can be seen booting up.

    Hopefully this information helps someone else troubleshoot. I'm stuck waiting for the weekend to cause a reboot.
     
  13. Stoiko Ivanov

    Stoiko Ivanov Proxmox Staff Member
    Staff Member

    Joined:
    May 2, 2018
    Messages:
    1,198
    Likes Received:
    102
    nice - did not know that - Thanks!

    could you share parts of the logs? Just to rule out that you have messages which are normal in your environment but could still indicate a bottleneck in the cluster network (e.g. retransmits can happen in regular networks as well, but they can also indicate that the network is at its limit).

    Since fencing only happens if:
    * HA is active
    * at least one HA service ran on the node since its last boot
    * the node is not in the quorate partition of the corosync cluster
    the node must have lost quorum if it got fenced.

    To rule out another cause of the reboot (unlikely, but still) - please check the logs for the same timeframe on the other nodes - the pve-ha-crm service logs when a node gets fenced (and where its services are recovered to) - usually grepping for 'fence' in the logs should show that (but do also read the logs for the timeframe).

    Also check the fenced node's log in the timeframe (not with grep, but by reading it) for any messages from the kernel - maybe it's a bug in the NIC('s driver) that corosync runs on.

    Lastly, it could also be a bug in the watchdog module you're using - which one do you have configured (softdog is the default), and are there any other particularities of your setup?
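
    A quick sketch for checking which watchdog module is actually loaded (the module names below are just the common candidates; adjust for your hardware):

    ```shell
    # List loaded watchdog-related kernel modules; print a note instead of
    # failing when none is loaded (or lsmod is unavailable).
    lsmod | grep -E 'softdog|ipmi_watchdog|iTCO_wdt' || echo "no watchdog module currently loaded"
    ```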

    I probably would have used `zgrep '2 05:.*corosync' *` - but whatever works is ok :)
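
    For example, on two made-up lines (hypothetical sample data), the single pattern keeps just the entry from the 2nd in the 5 AM hour:

    ```shell
    # The combined regex does the date, hour and daemon filtering in one pass.
    printf '%s\n' \
      'syslog.1:Apr  2 05:06:18 ait3 corosync[4800]: [TOTEM ] sample entry' \
      'syslog.1:Apr  3 06:00:00 ait3 corosync[4800]: [TOTEM ] other entry' \
      | grep '2 05:.*corosync'
    ```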

    Hope this helps!
     
  14. Binary Bandit

    Binary Bandit New Member
    Proxmox Subscriber

    Joined:
    Dec 13, 2018
    Messages:
    27
    Likes Received:
    1
    Definitely ... here are some logs from AIT3, the last node to reboot and the one I still have syslog entries for. The reboot happened just after Apr 5 05:19:30. The node's address is 172.20.64.14; 172.20.64.253 is one of our monitoring servers.
    Code:
    Apr  5 05:00:00 ait3 systemd[1]: Starting Proxmox VE replication runner...
    Apr  5 05:00:00 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
        [the snmpd 'ia_addr' line above repeats in groups of five every 30 seconds throughout; repeats trimmed]
    Apr  5 05:00:01 ait3 systemd[1]: Started Proxmox VE replication runner.
        [the replication runner start/finish pair repeats every minute through 05:19; repeats trimmed]
    Apr  5 05:05:18 ait3 snmpd[4329]: Connection from UDP: [172.20.64.253]:64390->[172.20.64.14]:161
    Apr  5 05:06:18 ait3 pmxcfs[4785]: [dcdb] notice: data verification successful
    Apr  5 05:14:35 ait3 pmxcfs[4785]: [status] notice: received log
    Apr  5 05:17:01 ait3 CRON[2898136]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
    Apr  5 05:19:00 ait3 systemd[1]: Starting Proxmox VE replication runner...
    Apr  5 05:19:01 ait3 systemd[1]: Started Proxmox VE replication runner.
    Apr  5 05:19:30 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
    'zgrep corosync * | grep " 5 " | grep " 05:"' didn't return anything. (I'm sticking with the commands I'm familiar with for now.)
    'zgrep corosync * | grep " 5 "' doesn't show any entries until 09:15, when I rebooted a cluster node to troubleshoot.

    'zgrep pmxcfs * | grep " 5 " | grep " 05:"' returns nothing as well.
    Here are four lines before and after the reboot time, using 'zgrep pmxcfs * | grep " 5 "'.

    Code:
    daemon.log.1:Apr  5 04:06:18 ait3 pmxcfs[4785]: [dcdb] notice: data verification successful
    daemon.log.1:Apr  5 04:14:33 ait3 pmxcfs[4785]: [status] notice: received log
    daemon.log.1:Apr  5 04:29:33 ait3 pmxcfs[4785]: [status] notice: received log
    daemon.log.1:Apr  5 04:44:34 ait3 pmxcfs[4785]: [status] notice: received log
    syslog.7.gz:Apr  5 06:29:39 ait3 pmxcfs[4793]: [status] notice: received log
    syslog.7.gz:Apr  5 06:44:39 ait3 pmxcfs[4793]: [status] notice: received log
    syslog.7.gz:Apr  5 06:59:40 ait3 pmxcfs[4793]: [status] notice: received log
    syslog.7.gz:Apr  5 07:06:18 ait3 pmxcfs[4793]: [dcdb] notice: data verification successful
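    As an aside on these grep patterns: syslog pads single-digit days with an extra space, so anchoring on the date field as 'Apr  5' (two spaces) is less likely to false-match PIDs or other stray numbers than a bare ' 5 '. A minimal sketch with synthetic log lines (illustrative, not captured from this cluster):

    ```shell
    # Two synthetic syslog lines: one from Apr 5 (padded day), one from Apr 15.
    cat > /tmp/sample.log <<'EOF'
    Apr  5 05:20:29 ait1 corosync[4983]: notice  [TOTEM ] A processor failed, forming new configuration.
    Apr 15 05:20:29 ait1 corosync[4983]: unrelated later entry
    EOF

    # Two spaces between "Apr" and "5" match only the padded single-digit day,
    # so the Apr 15 line is excluded.
    grep 'Apr  5 05:' /tmp/sample.log
    ```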
    On AIT2, using 'zgrep fence *' provides a page of entries with pve-ha-crm.
    'zgrep pve-ha-crm * | grep " 5 05:"' shows:
    Note that there aren't any entries before 5 AM on the 5th ... I checked by removing the '05:' from the pattern.

    Code:
    daemon.log.1:Apr  5 05:22:27 ait2 pve-ha-crm[5642]: successfully acquired lock 'ha_manager_lock'
    daemon.log.1:Apr  5 05:22:27 ait2 pve-ha-crm[5642]: watchdog active
    daemon.log.1:Apr  5 05:22:27 ait2 pve-ha-crm[5642]: status change slave => master
    daemon.log.1:Apr  5 05:22:27 ait2 pve-ha-crm[5642]: node 'ait3': state changed from 'online' => 'unknown'
    daemon.log.1:Apr  5 05:23:27 ait2 pve-ha-crm[5642]: service 'vm:100': state changed from 'started' to 'fence'
    daemon.log.1:Apr  5 05:23:27 ait2 pve-ha-crm[5642]: service 'vm:106': state changed from 'started' to 'fence'
    daemon.log.1:Apr  5 05:23:27 ait2 pve-ha-crm[5642]: service 'vm:109': state changed from 'started' to 'fence'
    daemon.log.1:Apr  5 05:23:27 ait2 pve-ha-crm[5642]: service 'vm:110': state changed from 'started' to 'fence'
    daemon.log.1:Apr  5 05:23:27 ait2 pve-ha-crm[5642]: service 'vm:111': state changed from 'started' to 'fence'
    daemon.log.1:Apr  5 05:23:27 ait2 pve-ha-crm[5642]: service 'vm:112': state changed from 'started' to 'fence'
    daemon.log.1:Apr  5 05:23:27 ait2 pve-ha-crm[5642]: service 'vm:113': state changed from 'started' to 'fence'
    daemon.log.1:Apr  5 05:23:27 ait2 pve-ha-crm[5642]: service 'vm:115': state changed from 'started' to 'fence'
    daemon.log.1:Apr  5 05:23:27 ait2 pve-ha-crm[5642]: node 'ait3': state changed from 'unknown' => 'fence'
    daemon.log.1:Apr  5 05:23:37 ait2 pve-ha-crm[5642]: successfully acquired lock 'ha_agent_ait3_lock'
    daemon.log.1:Apr  5 05:23:37 ait2 pve-ha-crm[5642]: fencing: acknowledged - got agent lock for node 'ait3'
    daemon.log.1:Apr  5 05:23:37 ait2 pve-ha-crm[5642]: node 'ait3': state changed from 'fence' => 'unknown'
    daemon.log.1:Apr  5 05:23:37 ait2 pve-ha-crm[5642]: recover service 'vm:100' from fenced node 'ait3' to node 'ait2'
    daemon.log.1:Apr  5 05:23:37 ait2 pve-ha-crm[5642]: service 'vm:100': state changed from 'fence' to 'started'  (node = ait2)
    daemon.log.1:Apr  5 05:23:37 ait2 pve-ha-crm[5642]: recover service 'vm:106' from fenced node 'ait3' to node 'ait2'
    daemon.log.1:Apr  5 05:23:37 ait2 pve-ha-crm[5642]: service 'vm:106': state changed from 'fence' to 'started'  (node = ait2)
    daemon.log.1:Apr  5 05:23:37 ait2 pve-ha-crm[5642]: recover service 'vm:109' from fenced node 'ait3' to node 'ait1'
    daemon.log.1:Apr  5 05:23:37 ait2 pve-ha-crm[5642]: service 'vm:109': state changed from 'fence' to 'started'  (node = ait1)
    daemon.log.1:Apr  5 05:23:37 ait2 pve-ha-crm[5642]: recover service 'vm:110' from fenced node 'ait3' to node 'ait2'
    daemon.log.1:Apr  5 05:23:37 ait2 pve-ha-crm[5642]: service 'vm:110': state changed from 'fence' to 'started'  (node = ait2)
    daemon.log.1:Apr  5 05:23:37 ait2 pve-ha-crm[5642]: recover service 'vm:111' from fenced node 'ait3' to node 'ait1'
    daemon.log.1:Apr  5 05:23:37 ait2 pve-ha-crm[5642]: service 'vm:111': state changed from 'fence' to 'started'  (node = ait1)
    daemon.log.1:Apr  5 05:23:37 ait2 pve-ha-crm[5642]: recover service 'vm:112' from fenced node 'ait3' to node 'ait2'
    daemon.log.1:Apr  5 05:23:37 ait2 pve-ha-crm[5642]: service 'vm:112': state changed from 'fence' to 'started'  (node = ait2)
    daemon.log.1:Apr  5 05:23:37 ait2 pve-ha-crm[5642]: recover service 'vm:113' from fenced node 'ait3' to node 'ait1'
    daemon.log.1:Apr  5 05:23:37 ait2 pve-ha-crm[5642]: service 'vm:113': state changed from 'fence' to 'started'  (node = ait1)
    daemon.log.1:Apr  5 05:23:37 ait2 pve-ha-crm[5642]: recover service 'vm:115' from fenced node 'ait3' to node 'ait2'
    daemon.log.1:Apr  5 05:23:37 ait2 pve-ha-crm[5642]: service 'vm:115': state changed from 'fence' to 'started'  (node = ait2)
    daemon.log.1:Apr  5 05:24:57 ait2 pve-ha-crm[5642]: node 'ait3': state changed from 'unknown' => 'online'
    Looking in more detail, here's an excerpt from 'grep "Apr  5 05:" daemon.log.1':
    Code:
    Apr  5 05:19:00 ait2 systemd[1]: Starting Proxmox VE replication runner...
    Apr  5 05:19:01 ait2 systemd[1]: Started Proxmox VE replication runner.
    Apr  5 05:19:21 ait2 snmpd[4426]: error on subcontainer 'ia_addr' insert (-1)
    Apr  5 05:19:21 ait2 snmpd[4426]: error on subcontainer 'ia_addr' insert (-1)
    Apr  5 05:19:21 ait2 snmpd[4426]: error on subcontainer 'ia_addr' insert (-1)
    Apr  5 05:19:21 ait2 snmpd[4426]: error on subcontainer 'ia_addr' insert (-1)
    Apr  5 05:19:21 ait2 snmpd[4426]: error on subcontainer 'ia_addr' insert (-1)
    Apr  5 05:19:51 ait2 snmpd[4426]: error on subcontainer 'ia_addr' insert (-1)
    Apr  5 05:19:51 ait2 snmpd[4426]: error on subcontainer 'ia_addr' insert (-1)
    Apr  5 05:19:51 ait2 snmpd[4426]: error on subcontainer 'ia_addr' insert (-1)
    Apr  5 05:19:51 ait2 snmpd[4426]: error on subcontainer 'ia_addr' insert (-1)
    Apr  5 05:19:51 ait2 snmpd[4426]: error on subcontainer 'ia_addr' insert (-1)
    Apr  5 05:20:00 ait2 systemd[1]: Starting Proxmox VE replication runner...
    Apr  5 05:20:01 ait2 systemd[1]: Started Proxmox VE replication runner.
    Apr  5 05:20:18 ait2 snmpd[4426]: Connection from UDP: [172.20.64.253]:58590->[172.20.64.13]:161
    Apr  5 05:20:21 ait2 snmpd[4426]: error on subcontainer 'ia_addr' insert (-1)
    Apr  5 05:20:21 ait2 snmpd[4426]: error on subcontainer 'ia_addr' insert (-1)
    Apr  5 05:20:21 ait2 snmpd[4426]: error on subcontainer 'ia_addr' insert (-1)
    Apr  5 05:20:21 ait2 snmpd[4426]: error on subcontainer 'ia_addr' insert (-1)
    Apr  5 05:20:21 ait2 snmpd[4426]: error on subcontainer 'ia_addr' insert (-1)
    Apr  5 05:20:31 ait2 corosync[5025]: notice  [TOTEM ] A new membership (172.20.128.12:5452) was formed. Members left: 3
    Apr  5 05:20:31 ait2 corosync[5025]: notice  [TOTEM ] Failed to receive the leave message. failed: 3
    Apr  5 05:20:31 ait2 corosync[5025]:  [TOTEM ] A new membership (172.20.128.12:5452) was formed. Members left: 3
    Apr  5 05:20:31 ait2 corosync[5025]:  [TOTEM ] Failed to receive the leave message. failed: 3
    Apr  5 05:20:31 ait2 corosync[5025]: warning [CPG   ] downlist left_list: 1 received
    Apr  5 05:20:31 ait2 corosync[5025]:  [CPG   ] downlist left_list: 1 received
    Apr  5 05:20:31 ait2 corosync[5025]:  [CPG   ] downlist left_list: 1 received
    Apr  5 05:20:31 ait2 corosync[5025]: warning [CPG   ] downlist left_list: 1 received
    Apr  5 05:20:31 ait2 pmxcfs[4870]: [dcdb] notice: members: 1/4736, 2/4870
    Apr  5 05:20:31 ait2 corosync[5025]: notice  [QUORUM] Members[2]: 1 2
    Apr  5 05:20:31 ait2 corosync[5025]: notice  [MAIN  ] Completed service synchronization, ready to provide service.
    Apr  5 05:20:31 ait2 pmxcfs[4870]: [dcdb] notice: starting data syncronisation
    Apr  5 05:20:31 ait2 pmxcfs[4870]: [status] notice: members: 1/4736, 2/4870
    Apr  5 05:20:31 ait2 pmxcfs[4870]: [status] notice: starting data syncronisation
    Apr  5 05:20:31 ait2 corosync[5025]:  [QUORUM] Members[2]: 1 2
    Apr  5 05:20:31 ait2 corosync[5025]:  [MAIN  ] Completed service synchronization, ready to provide service.
    On AIT1, using 'zgrep fence *' doesn't show anything.
    There's some binary (non-text) data in daemon.log.1, so I had to use 'grep -a "Apr  5 05:" daemon.log.1' to pull this:
    Code:
    Apr  5 05:19:01 ait1 systemd[1]: Started Proxmox VE replication runner.
    Apr  5 05:19:02 ait1 snmpd[4323]: error on subcontainer 'ia_addr' insert (-1)
    Apr  5 05:19:02 ait1 snmpd[4323]: error on subcontainer 'ia_addr' insert (-1)
    Apr  5 05:19:02 ait1 snmpd[4323]: error on subcontainer 'ia_addr' insert (-1)
    Apr  5 05:19:02 ait1 snmpd[4323]: error on subcontainer 'ia_addr' insert (-1)
    Apr  5 05:19:02 ait1 snmpd[4323]: error on subcontainer 'ia_addr' insert (-1)
    Apr  5 05:19:32 ait1 snmpd[4323]: error on subcontainer 'ia_addr' insert (-1)
    Apr  5 05:19:32 ait1 snmpd[4323]: error on subcontainer 'ia_addr' insert (-1)
    Apr  5 05:19:32 ait1 snmpd[4323]: error on subcontainer 'ia_addr' insert (-1)
    Apr  5 05:19:32 ait1 snmpd[4323]: error on subcontainer 'ia_addr' insert (-1)
    Apr  5 05:19:32 ait1 snmpd[4323]: error on subcontainer 'ia_addr' insert (-1)
    Apr  5 05:20:00 ait1 systemd[1]: Starting Proxmox VE replication runner...
    Apr  5 05:20:01 ait1 systemd[1]: Started Proxmox VE replication runner.
    Apr  5 05:20:02 ait1 snmpd[4323]: error on subcontainer 'ia_addr' insert (-1)
    Apr  5 05:20:02 ait1 snmpd[4323]: error on subcontainer 'ia_addr' insert (-1)
    Apr  5 05:20:02 ait1 snmpd[4323]: error on subcontainer 'ia_addr' insert (-1)
    Apr  5 05:20:02 ait1 snmpd[4323]: error on subcontainer 'ia_addr' insert (-1)
    Apr  5 05:20:02 ait1 snmpd[4323]: error on subcontainer 'ia_addr' insert (-1)
    Apr  5 05:20:15 ait1 pveproxy[2190176]: Clearing outdated entries from certificate cache
    Apr  5 05:20:18 ait1 snmpd[4323]: Connection from UDP: [172.20.64.253]:58593->[172.20.64.12]:161
    Apr  5 05:20:29 ait1 corosync[4983]: notice  [TOTEM ] A processor failed, forming new configuration.
    Apr  5 05:20:29 ait1 corosync[4983]:  [TOTEM ] A processor failed, forming new configuration.
    Apr  5 05:20:31 ait1 corosync[4983]: notice  [TOTEM ] A new membership (172.20.128.12:5452) was formed. Members left: 3
    Apr  5 05:20:31 ait1 corosync[4983]: notice  [TOTEM ] Failed to receive the leave message. failed: 3
    Apr  5 05:20:31 ait1 corosync[4983]:  [TOTEM ] A new membership (172.20.128.12:5452) was formed. Members left: 3
    Apr  5 05:20:31 ait1 corosync[4983]:  [TOTEM ] Failed to receive the leave message. failed: 3
    Apr  5 05:20:31 ait1 corosync[4983]: warning [CPG   ] downlist left_list: 1 received
    Apr  5 05:20:31 ait1 corosync[4983]:  [CPG   ] downlist left_list: 1 received
    Apr  5 05:20:31 ait1 corosync[4983]: warning [CPG   ] downlist left_list: 1 received
    Apr  5 05:20:31 ait1 corosync[4983]:  [CPG   ] downlist left_list: 1 received
    Apr  5 05:20:31 ait1 pmxcfs[4736]: [dcdb] notice: members: 1/4736, 2/4870
    Apr  5 05:20:31 ait1 corosync[4983]: notice  [QUORUM] Members[2]: 1 2
    Apr  5 05:20:31 ait1 corosync[4983]: notice  [MAIN  ] Completed service synchronization, ready to provide service.
    Apr  5 05:20:31 ait1 pmxcfs[4736]: [dcdb] notice: starting data syncronisation
    Apr  5 05:20:31 ait1 corosync[4983]:  [QUORUM] Members[2]: 1 2
    Apr  5 05:20:31 ait1 corosync[4983]:  [MAIN  ] Completed service synchronization, ready to provide service.
    Apr  5 05:20:31 ait1 pmxcfs[4736]: [dcdb] notice: cpg_send_message retried 1 times
    Apr  5 05:20:31 ait1 pmxcfs[4736]: [status] notice: members: 1/4736, 2/4870
    Apr  5 05:20:31 ait1 pmxcfs[4736]: [status] notice: starting data syncronisation
    Apr  5 05:20:31 ait1 pmxcfs[4736]: [dcdb] notice: received sync request (epoch 1/4736/0000000A)
    Apr  5 05:20:31 ait1 pmxcfs[4736]: [status] notice: received sync request (epoch 1/4736/0000000A)
    Apr  5 05:20:31 ait1 pmxcfs[4736]: [dcdb] notice: received all states
    Apr  5 05:20:31 ait1 pmxcfs[4736]: [dcdb] notice: leader is 1/4736
    Apr  5 05:20:31 ait1 pmxcfs[4736]: [dcdb] notice: synced members: 1/4736, 2/4870
    Apr  5 05:20:31 ait1 pmxcfs[4736]: [dcdb] notice: start sending inode updates
    Apr  5 05:20:31 ait1 pmxcfs[4736]: [dcdb] notice: sent all (0) updates
    Apr  5 05:20:31 ait1 pmxcfs[4736]: [dcdb] notice: all data is up to date
    Apr  5 05:20:31 ait1 pmxcfs[4736]: [dcdb] notice: dfsm_deliver_queue: queue length 5
    Apr  5 05:20:31 ait1 pmxcfs[4736]: [status] notice: received all states
    Apr  5 05:20:31 ait1 pmxcfs[4736]: [status] notice: all data is up to date
    Apr  5 05:20:31 ait1 pmxcfs[4736]: [status] notice: dfsm_deliver_queue: queue length 7
    Apr  5 05:20:32 ait1 snmpd[4323]: error on subcontainer 'ia_addr' insert (-1)
    Apr  5 05:20:32 ait1 snmpd[4323]: error on subcontainer 'ia_addr' insert (-1)
    Apr  5 05:20:32 ait1 snmpd[4323]: error on subcontainer 'ia_addr' insert (-1)
    Apr  5 05:20:32 ait1 snmpd[4323]: error on subcontainer 'ia_addr' insert (-1)
    Apr  5 05:20:32 ait1 snmpd[4323]: error on subcontainer 'ia_addr' insert (-1)
    OK ... using 'zgrep " 5 05:.*kernel" *' on AIT3 shows that I should look in kern.log.1 and messages.1.

    'zgrep " 5 05:" kern.log.1' only shows the system booting ... there's nothing before that. The only earlier entries, from about two hours before, are from pveupdate.
    Code:
    Apr  5 03:38:35 ait3 pveupdate[2862945]: <root@pam> starting task UPID:ait3:002BAF91:02EB5BA9:5CA7302B:aptupdate::root@pam:
    Apr  5 03:38:40 ait3 pveupdate[2862945]: <root@pam> end task UPID:ait3:002BAF91:02EB5BA9:5CA7302B:aptupdate::root@pam: OK
    messages.1 looks to contain the same information.

    We're using the IPMI watchdog.
    Perhaps it's worth asking a few questions here, given that any node that reboots in an unplanned way sometimes loses its watchdog config. Another manual reboot fixes this.
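    One way to confirm the BMC timer state is 'ipmitool mc watchdog get'. As a hedged sketch, here's a parse of its output to pull out the configured countdown; the sample text below is illustrative only (the exact field layout varies by BMC firmware), not captured from these nodes:

    ```shell
    # Illustrative sample of `ipmitool mc watchdog get` output (assumed format):
    sample='Watchdog Timer Use:     SMS/OS (0x44)
    Watchdog Timer Is:      Started/Running
    Watchdog Timer Actions: Hard Reset (0x01)
    Pre-timeout interval:   0 seconds
    Timer Expiration Flags: 0x00
    Initial Countdown:      10 sec
    Present Countdown:      9 sec'

    # Extract the configured countdown; on a healthy node this should match the
    # expected 10 seconds rather than the 15-second default seen after a fence.
    countdown=$(printf '%s\n' "$sample" | awk '/Initial Countdown/ {print $3}')
    echo "configured countdown: ${countdown}s"
    ```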

    Hmm ... I don't think so but I'm filtering based on what I know. The setup has worked well up to this point ... no other strange issues.
     
  15. Stoiko Ivanov

    Stoiko Ivanov Proxmox Staff Member
    Staff Member

    Joined:
    May 2, 2018
    Messages:
    1,198
    Likes Received:
    102
    At first glance it seems that the node just lost connectivity - i.e., no retransmits, no overloaded network; it simply vanished.

    * Is the time on the nodes synchronized?
    * Else it would help if you could set up some remote syslog - that way you increase the chances of getting the last messages before the fence.
    * Maybe also try to disable HA (or just leave the node running without any resources configured - then it won't get fenced) - maybe you'll get some more info on the reason for the crash
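    For the remote-syslog suggestion, a minimal rsyslog forwarding rule would look something like the sketch below, dropped onto each node and followed by an rsyslog restart. 192.0.2.10 is a placeholder collector address; '@@' selects TCP, a single '@' would be UDP:

    ```
    # /etc/rsyslog.d/90-remote.conf  (placeholder collector address)
    *.* @@192.0.2.10:514
    ```

    For the HA part, 'ha-manager status' lists the configured resources, and removing them with 'ha-manager remove vm:<id>' should keep a node from being fenced while testing.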
     
  16. Binary Bandit

    Binary Bandit New Member
    Proxmox Subscriber

    Joined:
    Dec 13, 2018
    Messages:
    27
    Likes Received:
    1
    Thanks Stoiko.

    The time on the nodes is definitely synchronized.

    After reading about the changes / fixes in 5.4 I decided to upgrade. The cluster hasn't rebooted since. Everything seems to point to something in 5.3 that didn't work well with my config.

    I'm guessing it's something in my hardware config, as I don't think our three-node cluster is that unique. Well, maybe the active-backup bonds on the NICs for every network except one of the two corosync LANs. The three nodes are identical Dell R510s: 64GB RAM, 2 x 6-core CPUs, 2TB Dell SATA drives, and a PERC H700 RAID card ... throwing this out there in case someone knows of a known issue.

    For now, I'm going to watch and wait.
     
    Stoiko Ivanov likes this.