cluster nodes unstable

Discussion in 'Proxmox VE: Installation and configuration' started by Binary Bandit, Apr 6, 2019.

  1. Binary Bandit

    Binary Bandit New Member
    Proxmox Subscriber

    Joined:
    Dec 13, 2018
    Messages:
    27
    Likes Received:
    1
    Hi All,

    We've been running the latest Proxmox for about six months. It was very solid and stable until this last Sunday, six days ago. We're now having issues.

    We're running a three node cluster, Debian 9 and Ceph.

    What's happening?
    One of the three nodes reboots about every 24 hours, usually around 6 AM PST, though not always.

    What's changed?
    This last Saturday (the day before the reboots started), each cluster node was updated with apt-get update and then apt-get upgrade. There was a kernel patch, and a reboot was required.

    Other observations?
    We remotely monitor VMs on the cluster and have seen VMs stop functioning before the reboot. I believe the reboot is triggered by the IPMI watchdog, but I'm unsure how best to confirm that. After the reboot the watchdog sometimes loses its configuration, showing a 15-second countdown rather than a 10-second one. If the node is rebooted once more, the watchdog is then properly configured.
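
    The closest probe I've found so far is below - a sketch only, assuming ipmitool is installed and can actually reach our BMCs (I haven't verified this is the right way to read the watchdog):

    ```shell
    # Query the BMC watchdog state via IPMI; 'mc watchdog get' prints the
    # timer use, action, and current countdown. Falls back to a note when
    # ipmitool is missing or the BMC is not reachable from this host.
    ipmitool mc watchdog get 2>/dev/null || echo "ipmitool unavailable or BMC not reachable"
    ```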

    I'm new to troubleshooting this, so don't be shy about checking the basics with me.

    best,

    James
     
  2. udo

    udo Well-Known Member
    Proxmox Subscriber

    Joined:
    Apr 22, 2009
    Messages:
    5,831
    Likes Received:
    158
    Hi,
    Proxmox VE is a rolling release. It's important to use "apt dist-upgrade" (or full-upgrade)!
    "apt upgrade" (or apt-get upgrade) isn't enough!

    Perhaps this solves the issue?!

    Udo
     
  3. Binary Bandit

    Binary Bandit New Member
    Proxmox Subscriber

    Joined:
    Dec 13, 2018
    Messages:
    27
    Likes Received:
    1
    @udo , thanks.

    I just Googled / read about dist-upgrade and full-upgrade. While I was at it, I checked into apt vs apt-get. I'm glad to know the differences.

    Unfortunately there is nothing for full-upgrade / dist-upgrade to do on the nodes. There are no packages requiring removal. Thinking back, I may have used the GUI to do the upgrade. Perhaps it uses the full-upgrade command?

    James
     
  4. udo

    udo Well-Known Member
    Proxmox Subscriber

    Joined:
    Apr 22, 2009
    Messages:
    5,831
    Likes Received:
    158
    Hi,
    if you used the GUI, the "right" packages were updated!

    What does your config look like?
    Can you post the output of the following commands?
    Code:
    vgs
    lvs
    zfs list
    
    Udo
     
  5. Binary Bandit

    Binary Bandit New Member
    Proxmox Subscriber

    Joined:
    Dec 13, 2018
    Messages:
    27
    Likes Received:
    1
    Here you are:

    Code:
    root@ait1:~# vgs
      VG  #PV #LV #SN Attr   VSize VFree
      pve   1   4   0 wz--n- 1.82t 17.23g
    root@ait1:~# lvs
      LV     VG  Attr       LSize  Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
      aitvms pve Vwi-aotz--  1.70t data        4.11
      data   pve twi-aotz--  1.70t             4.11   2.41
      root   pve -wi-ao---- 96.00g
      swap   pve -wi-ao----  8.00g
    root@ait1:~# zfs list
    no datasets available
    
    -----------------
    
    root@ait2:~# vgs
      VG  #PV #LV #SN Attr   VSize VFree
      pve   1   4   0 wz--n- 1.82t 17.23g
    root@ait2:~# lvs
      LV     VG  Attr       LSize  Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
      aitvms pve Vwi-aotz--  1.70t data        6.27
      data   pve twi-aotz--  1.70t             6.27   3.42
      root   pve -wi-ao---- 96.00g
      swap   pve -wi-ao----  8.00g
    root@ait2:~# zfs list
    no datasets available
    
    -------------------
    
    root@ait3:~# vgs
      VG  #PV #LV #SN Attr   VSize VFree
      pve   1   4   0 wz--n- 1.82t 17.23g
    root@ait3:~# lvs
      LV     VG  Attr       LSize  Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
      aitvms pve Vwi-aotz--  1.70t data        2.32
      data   pve twi-aotz--  1.70t             2.32   1.50
      root   pve -wi-ao---- 96.00g
      swap   pve -wi-ao----  8.00g
    root@ait3:~# zfs list
    no datasets available
     
  6. Binary Bandit

    Binary Bandit New Member
    Proxmox Subscriber

    Joined:
    Dec 13, 2018
    Messages:
    27
    Likes Received:
    1
    Hi All,

    Our three cluster nodes rebooted at random for about a week. This last weekend I started one cluster node (selecting it in GRUB) on the last kernel version prior to the random-reboot problem. Since then (three days), the cluster hasn't had a node (old or new kernel) reboot.

    I've moved all of the most critical VMs to the node running the old kernel (it's called AIT3), left one of the other nodes with no VMs (AIT2), and put everything else on AIT1. This also means that AIT3 is carrying the most (mostly network I/O) load.

    node name - version (taken from the summary of each node)
    AIT1 - pve-kernel-4.15.18-12-pve: 4.15.18-35
    AIT2 - pve-kernel-4.15.18-12-pve: 4.15.18-35
    AIT3 - pve-kernel-4.15.18-11-pve: 4.15.18-34

    Does this give anyone any ideas?

    I'd really like to get to the bottom of this and restore our confidence in this Proxmox cluster. We have other Proxmox hosts with VMs waiting to migrate (in another data center) but have put that on hold for now.

    best,

    James
     
  7. Stoiko Ivanov

    Stoiko Ivanov Proxmox Staff Member
    Staff Member

    Joined:
    May 2, 2018
    Messages:
    1,198
    Likes Received:
    102
    * Do you have HA enabled in the cluster?
    * check the journal entries before the node reboots (take a look especially for messages from corosync and pmxcfs)
     
  8. Binary Bandit

    Binary Bandit New Member
    Proxmox Subscriber

    Joined:
    Dec 13, 2018
    Messages:
    27
    Likes Received:
    1
    Yes, HA is enabled in the cluster.

    How is checking the journal best done? I'm not sure where to look for these entries.

    --- edit ---
    investigating journalctl now ...
     
    #8 Binary Bandit, Apr 11, 2019
    Last edited: Apr 11, 2019
  9. Stoiko Ivanov

    Stoiko Ivanov Proxmox Staff Member
    Staff Member

    Joined:
    May 2, 2018
    Messages:
    1,198
    Likes Received:
    102
    I guess the reboots are due to the nodes losing quorum and fencing themselves (see https://pve.proxmox.com/pve-docs/pve-admin-guide.html#ha_manager_fencing).

    the output of `journalctl` should yield the journal - you can optionally provide some parameters to narrow down how much of the log is shown:
    * `journalctl --since '2019-04-11'` should give the complete log since today at 00:00
    * `journalctl -b -1` - the journal of the previous boot (that's probably the most helpful, since you know how often the box got restarted)
    * you can also reverse the order (so that the messages before the fence come on top) with the `-r` option: `journalctl -r -b -1`

    Please keep in mind that you need to have persistent journaling enabled - so if this does not provide any information, make sure that '/var/log/journal' exists (as a directory) and restart `systemd-journald` afterwards

    Hope this helps!
     
  10. Binary Bandit

    Binary Bandit New Member
    Proxmox Subscriber

    Joined:
    Dec 13, 2018
    Messages:
    27
    Likes Received:
    1
    Thanks Stoiko.

    What you just wrote confirms what a lot of Googling just taught me. I didn't expect you to write back so soon. Too bad I didn't refresh this post while Googling. LOL

    I didn't have persistent journaling enabled but do now. Here's what I've done:

    - edited /etc/systemd/journald.conf (I used nano)
    - un-commented the "Storage=" line and set it to "persistent"
    - restarted journaling with "systemctl restart systemd-journald"

    Now I just need one of the cluster nodes to reboot. This hasn't happened since I booted the AIT3 node with the old kernel version. I'll change back to the new kernel this Friday evening (PST). Hopefully a node will reboot over the weekend, I'll capture it in the logs, and then I'll move things back to the stable old kernel. ... makes me wish that we had a test environment ...
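
    In script form, the same change looks like this - demonstrated on a throwaway copy rather than the real /etc/systemd/journald.conf:

    ```shell
    # Make the journald storage change on a temporary stand-in file.
    conf=$(mktemp)
    printf '[Journal]\n#Storage=auto\n' > "$conf"          # stand-in for /etc/systemd/journald.conf
    sed -i 's/^#*Storage=.*/Storage=persistent/' "$conf"   # un-comment the line and set persistent
    grep '^Storage=persistent' "$conf"                     # prints: Storage=persistent
    rm -f "$conf"
    # on a real node, follow up with: systemctl restart systemd-journald
    ```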
     
  11. Stoiko Ivanov

    Stoiko Ivanov Proxmox Staff Member
    Staff Member

    Joined:
    May 2, 2018
    Messages:
    1,198
    Likes Received:
    102
    can be done - but do make sure that '/var/log/journal' exists - else I think persistent journaling does not work.

    * you can also find most of the information in '/var/log/syslog*' (and the other files in /var/log - e.g. `zgrep corosync *` run there should tell you which logs contain messages from corosync - do the same for pmxcfs).
    * sadly, when fencing, the node usually does not flush the last lines of syslog (which can contain the important information) to disk - but the logs can still help rule out a few cases.
     
  12. Binary Bandit

    Binary Bandit New Member
    Proxmox Subscriber

    Joined:
    Dec 13, 2018
    Messages:
    27
    Likes Received:
    1
    I checked for '/var/log/journal'; it's there on all three nodes. I'm not sure, but I think it's created when journaling is restarted.

    AIT3 is the only node with a reboot that is still visible in the syslogs; the syslogs containing the reboots for AIT1 and AIT2 have already been deleted. Unfortunately, AIT3's log doesn't show anything interesting prior to the reboot. That lack of flushing is definitely sad.

    I ran the following on all cluster nodes ... with times and dates adjusted to look for log entries just before the reboots. Our external monitoring software makes this easy as it records the reboot down to the second.

    Code:
    cd /var/log
    zgrep pmxcfs * | grep " 2 " | grep " 05:"
    zgrep corosync * | grep " 2 " | grep " 05:"
    
    Above I'm looking for log entries containing pmxcfs or corosync that happened on the 2nd, where the hour is 5 AM with any minutes. There's probably a cleaner way to do this without all of the piping, but it works.
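
    To sanity-check the filter logic, I ran it against a few made-up lines (sample data only, not real log output); only the entry from the 2nd in the 5 AM hour survives:

    ```shell
    # Three hypothetical syslog-style lines; the chained greps should keep
    # only the pmxcfs/corosync entry from the 2nd in the 5 AM hour.
    printf '%s\n' \
      'Apr  2 05:14:35 ait3 pmxcfs[4785]: [status] notice: received log' \
      'Apr  2 09:15:01 ait3 corosync[4800]: [TOTEM ] sample entry' \
      'Apr  3 05:06:18 ait3 pmxcfs[4785]: [dcdb] notice: data verification successful' \
      | grep 'pmxcfs\|corosync' | grep ' 2 ' | grep ' 05:'
    ```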

    but ... there's nothing abnormal to be seen. Everything looks good and then the node can be seen booting up.

    Hopefully this information helps someone else troubleshoot. I'm stuck waiting for the weekend to cause a reboot.
     
  13. Stoiko Ivanov

    Stoiko Ivanov Proxmox Staff Member
    Staff Member

    Joined:
    May 2, 2018
    Messages:
    1,198
    Likes Received:
    102
    nice - did not know that - Thanks!

    could you share parts of the logs? Just to rule out that you have messages which are normal in your environment but could still indicate a bottleneck in the cluster network (e.g. retransmits can happen in regular networks as well, but they can also indicate that the network is at its limit).

    Since fencing only happens if:
    * HA is active
    * at least one HA service ran on the node since its last boot
    * the node is not in the quorate partition of the corosync cluster
    the node must have lost quorum if it got fenced.

    To rule out another cause of the reboot (unlikely, but still) - please check the logs for the same timeframe on the other nodes - the pve-ha-crm service logs when a node gets fenced (and where its services are recovered to) - usually grepping for 'fence' in the logs should show that (but do also read the logs for the timeframe).

    Also check the fenced node's log in the timeframe (not with grep, but by reading it) for any messages from the kernel - maybe it's a bug in the NIC('s driver) that corosync runs on.

    Lastly, it could also be a bug in the watchdog module you're using - which one do you have configured (softdog is the default), and are there any other particularities of your setup?
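
    A quick sketch for checking which watchdog module is actually loaded (the module names below are just the common candidates; adjust for your hardware):

    ```shell
    # List loaded watchdog-related kernel modules; print a note instead of
    # failing when none is loaded (or lsmod is unavailable).
    lsmod | grep -E 'softdog|ipmi_watchdog|iTCO_wdt' || echo "no watchdog module currently loaded"
    ```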

    I probably would have used `zgrep '2 05:.*corosync' *` - but whatever works is ok :)
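
    For example, on two made-up lines (hypothetical sample data), the single pattern keeps just the entry from the 2nd in the 5 AM hour:

    ```shell
    # The combined regex does the date, hour and daemon filtering in one pass.
    printf '%s\n' \
      'syslog.1:Apr  2 05:06:18 ait3 corosync[4800]: [TOTEM ] sample entry' \
      'syslog.1:Apr  3 06:00:00 ait3 corosync[4800]: [TOTEM ] other entry' \
      | grep '2 05:.*corosync'
    ```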

    Hope this helps!
     
  14. Binary Bandit

    Binary Bandit New Member
    Proxmox Subscriber

    Joined:
    Dec 13, 2018
    Messages:
    27
    Likes Received:
    1
    Definitely ... here are some logs from AIT3, the last node to reboot and the one I still have syslog entries for. The reboot happened just after Apr 5 05:19:30. The node's address is 172.20.64.14; 172.20.64.253 is one of our monitoring servers.
    Code:
    Apr  5 05:00:00 ait3 systemd[1]: Starting Proxmox VE replication runner...
    Apr  5 05:00:00 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
        [the snmpd 'ia_addr' line above repeats in groups of five every 30 seconds throughout; repeats trimmed]
    Apr  5 05:00:01 ait3 systemd[1]: Started Proxmox VE replication runner.
        [the replication runner start/finish pair repeats every minute through 05:19; repeats trimmed]
    Apr  5 05:05:18 ait3 snmpd[4329]: Connection from UDP: [172.20.64.253]:64390->[172.20.64.14]:161
    Apr  5 05:06:18 ait3 pmxcfs[4785]: [dcdb] notice: data verification successful
    Apr  5 05:14:35 ait3 pmxcfs[4785]: [status] notice: received log
    Apr  5 05:17:01 ait3 CRON[2898136]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
    Apr  5 05:19:00 ait3 systemd[1]: Starting Proxmox VE replication runner...
    Apr  5 05:19:01 ait3 systemd[1]: Started Proxmox VE replication runner.
    Apr  5 05:19:30 ait3 snmpd[4329]: error on subcontainer 'ia_addr' insert (-1)
    'zgrep corosync * | grep " 5 " | grep " 05:"' didn't return anything. (I'm sticking with the commands I'm familiar with for now.)
    'zgrep corosync * | grep " 5 "' doesn't show any entries until 09:15, when I rebooted a cluster node to troubleshoot.

    'zgrep pmxcfs * | grep " 5 " | grep " 05:"' returns nothing as well.
    Here are four lines before and after the reboot time, using 'zgrep pmxcfs * | grep " 5 "'.

    Code:
    daemon.log.1:Apr  5 04:06:18 ait3 pmxcfs[4785]: [dcdb] notice: data verification successful
    daemon.log.1:Apr  5 04:14:33 ait3 pmxcfs[4785]: [status] notice: received log
    daemon.log.1:Apr  5 04:29:33 ait3 pmxcfs[4785]: [status] notice: received log
    daemon.log.1:Apr  5 04:44:34 ait3 pmxcfs[4785]: [status] notice: received log
    syslog.7.gz:Apr  5 06:29:39 ait3 pmxcfs[4793]: [status] notice: received log
    syslog.7.gz:Apr  5 06:44:39 ait3 pmxcfs[4793]: [status] notice: received log
    syslog.7.gz:Apr  5 06:59:40 ait3 pmxcfs[4793]: [status] notice: received log
    syslog.7.gz:Apr  5 07:06:18 ait3 pmxcfs[4793]: [dcdb] notice: data verification successful
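    As an aside on these grep patterns: syslog pads single-digit days with an extra space, so anchoring on the date field as 'Apr  5' (two spaces) is less likely to false-match PIDs or other stray numbers than a bare ' 5 '. A minimal sketch with synthetic log lines (illustrative, not captured from this cluster):

    ```shell
    # Two synthetic syslog lines: one from Apr 5 (padded day), one from Apr 15.
    cat > /tmp/sample.log <<'EOF'
    Apr  5 05:20:29 ait1 corosync[4983]: notice  [TOTEM ] A processor failed, forming new configuration.
    Apr 15 05:20:29 ait1 corosync[4983]: unrelated later entry
    EOF

    # Two spaces between "Apr" and "5" match only the padded single-digit day,
    # so the Apr 15 line is excluded.
    grep 'Apr  5 05:' /tmp/sample.log
    ```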
    On AIT2, using 'zgrep fence *' provides a page of entries with pve-ha-crm.
    'zgrep pve-ha-crm * | grep " 5 05:"' shows:
    Note that there aren't any entries before 5 AM on the 5th ... I checked by removing the '05:' from the pattern.

    Code:
    daemon.log.1:Apr  5 05:22:27 ait2 pve-ha-crm[5642]: successfully acquired lock 'ha_manager_lock'
    daemon.log.1:Apr  5 05:22:27 ait2 pve-ha-crm[5642]: watchdog active
    daemon.log.1:Apr  5 05:22:27 ait2 pve-ha-crm[5642]: status change slave => master
    daemon.log.1:Apr  5 05:22:27 ait2 pve-ha-crm[5642]: node 'ait3': state changed from 'online' => 'unknown'
    daemon.log.1:Apr  5 05:23:27 ait2 pve-ha-crm[5642]: service 'vm:100': state changed from 'started' to 'fence'
    daemon.log.1:Apr  5 05:23:27 ait2 pve-ha-crm[5642]: service 'vm:106': state changed from 'started' to 'fence'
    daemon.log.1:Apr  5 05:23:27 ait2 pve-ha-crm[5642]: service 'vm:109': state changed from 'started' to 'fence'
    daemon.log.1:Apr  5 05:23:27 ait2 pve-ha-crm[5642]: service 'vm:110': state changed from 'started' to 'fence'
    daemon.log.1:Apr  5 05:23:27 ait2 pve-ha-crm[5642]: service 'vm:111': state changed from 'started' to 'fence'
    daemon.log.1:Apr  5 05:23:27 ait2 pve-ha-crm[5642]: service 'vm:112': state changed from 'started' to 'fence'
    daemon.log.1:Apr  5 05:23:27 ait2 pve-ha-crm[5642]: service 'vm:113': state changed from 'started' to 'fence'
    daemon.log.1:Apr  5 05:23:27 ait2 pve-ha-crm[5642]: service 'vm:115': state changed from 'started' to 'fence'
    daemon.log.1:Apr  5 05:23:27 ait2 pve-ha-crm[5642]: node 'ait3': state changed from 'unknown' => 'fence'
    daemon.log.1:Apr  5 05:23:37 ait2 pve-ha-crm[5642]: successfully acquired lock 'ha_agent_ait3_lock'
    daemon.log.1:Apr  5 05:23:37 ait2 pve-ha-crm[5642]: fencing: acknowledged - got agent lock for node 'ait3'
    daemon.log.1:Apr  5 05:23:37 ait2 pve-ha-crm[5642]: node 'ait3': state changed from 'fence' => 'unknown'
    daemon.log.1:Apr  5 05:23:37 ait2 pve-ha-crm[5642]: recover service 'vm:100' from fenced node 'ait3' to node 'ait2'
    daemon.log.1:Apr  5 05:23:37 ait2 pve-ha-crm[5642]: service 'vm:100': state changed from 'fence' to 'started'  (node = ait2)
    daemon.log.1:Apr  5 05:23:37 ait2 pve-ha-crm[5642]: recover service 'vm:106' from fenced node 'ait3' to node 'ait2'
    daemon.log.1:Apr  5 05:23:37 ait2 pve-ha-crm[5642]: service 'vm:106': state changed from 'fence' to 'started'  (node = ait2)
    daemon.log.1:Apr  5 05:23:37 ait2 pve-ha-crm[5642]: recover service 'vm:109' from fenced node 'ait3' to node 'ait1'
    daemon.log.1:Apr  5 05:23:37 ait2 pve-ha-crm[5642]: service 'vm:109': state changed from 'fence' to 'started'  (node = ait1)
    daemon.log.1:Apr  5 05:23:37 ait2 pve-ha-crm[5642]: recover service 'vm:110' from fenced node 'ait3' to node 'ait2'
    daemon.log.1:Apr  5 05:23:37 ait2 pve-ha-crm[5642]: service 'vm:110': state changed from 'fence' to 'started'  (node = ait2)
    daemon.log.1:Apr  5 05:23:37 ait2 pve-ha-crm[5642]: recover service 'vm:111' from fenced node 'ait3' to node 'ait1'
    daemon.log.1:Apr  5 05:23:37 ait2 pve-ha-crm[5642]: service 'vm:111': state changed from 'fence' to 'started'  (node = ait1)
    daemon.log.1:Apr  5 05:23:37 ait2 pve-ha-crm[5642]: recover service 'vm:112' from fenced node 'ait3' to node 'ait2'
    daemon.log.1:Apr  5 05:23:37 ait2 pve-ha-crm[5642]: service 'vm:112': state changed from 'fence' to 'started'  (node = ait2)
    daemon.log.1:Apr  5 05:23:37 ait2 pve-ha-crm[5642]: recover service 'vm:113' from fenced node 'ait3' to node 'ait1'
    daemon.log.1:Apr  5 05:23:37 ait2 pve-ha-crm[5642]: service 'vm:113': state changed from 'fence' to 'started'  (node = ait1)
    daemon.log.1:Apr  5 05:23:37 ait2 pve-ha-crm[5642]: recover service 'vm:115' from fenced node 'ait3' to node 'ait2'
    daemon.log.1:Apr  5 05:23:37 ait2 pve-ha-crm[5642]: service 'vm:115': state changed from 'fence' to 'started'  (node = ait2)
    daemon.log.1:Apr  5 05:24:57 ait2 pve-ha-crm[5642]: node 'ait3': state changed from 'unknown' => 'online'
    Looking in more detail, here's an excerpt from 'grep "Apr  5 05:" daemon.log.1':
    Code:
    Apr  5 05:19:00 ait2 systemd[1]: Starting Proxmox VE replication runner...
    Apr  5 05:19:01 ait2 systemd[1]: Started Proxmox VE replication runner.
    Apr  5 05:19:21 ait2 snmpd[4426]: error on subcontainer 'ia_addr' insert (-1)
    Apr  5 05:19:21 ait2 snmpd[4426]: error on subcontainer 'ia_addr' insert (-1)
    Apr  5 05:19:21 ait2 snmpd[4426]: error on subcontainer 'ia_addr' insert (-1)
    Apr  5 05:19:21 ait2 snmpd[4426]: error on subcontainer 'ia_addr' insert (-1)
    Apr  5 05:19:21 ait2 snmpd[4426]: error on subcontainer 'ia_addr' insert (-1)
    Apr  5 05:19:51 ait2 snmpd[4426]: error on subcontainer 'ia_addr' insert (-1)
    Apr  5 05:19:51 ait2 snmpd[4426]: error on subcontainer 'ia_addr' insert (-1)
    Apr  5 05:19:51 ait2 snmpd[4426]: error on subcontainer 'ia_addr' insert (-1)
    Apr  5 05:19:51 ait2 snmpd[4426]: error on subcontainer 'ia_addr' insert (-1)
    Apr  5 05:19:51 ait2 snmpd[4426]: error on subcontainer 'ia_addr' insert (-1)
    Apr  5 05:20:00 ait2 systemd[1]: Starting Proxmox VE replication runner...
    Apr  5 05:20:01 ait2 systemd[1]: Started Proxmox VE replication runner.
    Apr  5 05:20:18 ait2 snmpd[4426]: Connection from UDP: [172.20.64.253]:58590->[172.20.64.13]:161
    Apr  5 05:20:21 ait2 snmpd[4426]: error on subcontainer 'ia_addr' insert (-1)
    Apr  5 05:20:21 ait2 snmpd[4426]: error on subcontainer 'ia_addr' insert (-1)
    Apr  5 05:20:21 ait2 snmpd[4426]: error on subcontainer 'ia_addr' insert (-1)
    Apr  5 05:20:21 ait2 snmpd[4426]: error on subcontainer 'ia_addr' insert (-1)
    Apr  5 05:20:21 ait2 snmpd[4426]: error on subcontainer 'ia_addr' insert (-1)
    Apr  5 05:20:31 ait2 corosync[5025]: notice  [TOTEM ] A new membership (172.20.128.12:5452) was formed. Members left: 3
    Apr  5 05:20:31 ait2 corosync[5025]: notice  [TOTEM ] Failed to receive the leave message. failed: 3
    Apr  5 05:20:31 ait2 corosync[5025]:  [TOTEM ] A new membership (172.20.128.12:5452) was formed. Members left: 3
    Apr  5 05:20:31 ait2 corosync[5025]:  [TOTEM ] Failed to receive the leave message. failed: 3
    Apr  5 05:20:31 ait2 corosync[5025]: warning [CPG   ] downlist left_list: 1 received
    Apr  5 05:20:31 ait2 corosync[5025]:  [CPG   ] downlist left_list: 1 received
    Apr  5 05:20:31 ait2 corosync[5025]:  [CPG   ] downlist left_list: 1 received
    Apr  5 05:20:31 ait2 corosync[5025]: warning [CPG   ] downlist left_list: 1 received
    Apr  5 05:20:31 ait2 pmxcfs[4870]: [dcdb] notice: members: 1/4736, 2/4870
    Apr  5 05:20:31 ait2 corosync[5025]: notice  [QUORUM] Members[2]: 1 2
    Apr  5 05:20:31 ait2 corosync[5025]: notice  [MAIN  ] Completed service synchronization, ready to provide service.
    Apr  5 05:20:31 ait2 pmxcfs[4870]: [dcdb] notice: starting data syncronisation
    Apr  5 05:20:31 ait2 pmxcfs[4870]: [status] notice: members: 1/4736, 2/4870
    Apr  5 05:20:31 ait2 pmxcfs[4870]: [status] notice: starting data syncronisation
    Apr  5 05:20:31 ait2 corosync[5025]:  [QUORUM] Members[2]: 1 2
    Apr  5 05:20:31 ait2 corosync[5025]:  [MAIN  ] Completed service synchronization, ready to provide service.
    On AIT1, using 'zgrep fence *' doesn't show anything.
    There's some binary (non-text) data in daemon.log.1, so I had to use 'grep -a "Apr  5 05:" daemon.log.1' to pull this:
    Code:
    Apr  5 05:19:01 ait1 systemd[1]: Started Proxmox VE replication runner.
    Apr  5 05:19:02 ait1 snmpd[4323]: error on subcontainer 'ia_addr' insert (-1)
    Apr  5 05:19:02 ait1 snmpd[4323]: error on subcontainer 'ia_addr' insert (-1)
    Apr  5 05:19:02 ait1 snmpd[4323]: error on subcontainer 'ia_addr' insert (-1)
    Apr  5 05:19:02 ait1 snmpd[4323]: error on subcontainer 'ia_addr' insert (-1)
    Apr  5 05:19:02 ait1 snmpd[4323]: error on subcontainer 'ia_addr' insert (-1)
    Apr  5 05:19:32 ait1 snmpd[4323]: error on subcontainer 'ia_addr' insert (-1)
    Apr  5 05:19:32 ait1 snmpd[4323]: error on subcontainer 'ia_addr' insert (-1)
    Apr  5 05:19:32 ait1 snmpd[4323]: error on subcontainer 'ia_addr' insert (-1)
    Apr  5 05:19:32 ait1 snmpd[4323]: error on subcontainer 'ia_addr' insert (-1)
    Apr  5 05:19:32 ait1 snmpd[4323]: error on subcontainer 'ia_addr' insert (-1)
    Apr  5 05:20:00 ait1 systemd[1]: Starting Proxmox VE replication runner...
    Apr  5 05:20:01 ait1 systemd[1]: Started Proxmox VE replication runner.
    Apr  5 05:20:02 ait1 snmpd[4323]: error on subcontainer 'ia_addr' insert (-1)
    Apr  5 05:20:02 ait1 snmpd[4323]: error on subcontainer 'ia_addr' insert (-1)
    Apr  5 05:20:02 ait1 snmpd[4323]: error on subcontainer 'ia_addr' insert (-1)
    Apr  5 05:20:02 ait1 snmpd[4323]: error on subcontainer 'ia_addr' insert (-1)
    Apr  5 05:20:02 ait1 snmpd[4323]: error on subcontainer 'ia_addr' insert (-1)
    Apr  5 05:20:15 ait1 pveproxy[2190176]: Clearing outdated entries from certificate cache
    Apr  5 05:20:18 ait1 snmpd[4323]: Connection from UDP: [172.20.64.253]:58593->[172.20.64.12]:161
    Apr  5 05:20:29 ait1 corosync[4983]: notice  [TOTEM ] A processor failed, forming new configuration.
    Apr  5 05:20:29 ait1 corosync[4983]:  [TOTEM ] A processor failed, forming new configuration.
    Apr  5 05:20:31 ait1 corosync[4983]: notice  [TOTEM ] A new membership (172.20.128.12:5452) was formed. Members left: 3
    Apr  5 05:20:31 ait1 corosync[4983]: notice  [TOTEM ] Failed to receive the leave message. failed: 3
    Apr  5 05:20:31 ait1 corosync[4983]:  [TOTEM ] A new membership (172.20.128.12:5452) was formed. Members left: 3
    Apr  5 05:20:31 ait1 corosync[4983]:  [TOTEM ] Failed to receive the leave message. failed: 3
    Apr  5 05:20:31 ait1 corosync[4983]: warning [CPG   ] downlist left_list: 1 received
    Apr  5 05:20:31 ait1 corosync[4983]:  [CPG   ] downlist left_list: 1 received
    Apr  5 05:20:31 ait1 corosync[4983]: warning [CPG   ] downlist left_list: 1 received
    Apr  5 05:20:31 ait1 corosync[4983]:  [CPG   ] downlist left_list: 1 received
    Apr  5 05:20:31 ait1 pmxcfs[4736]: [dcdb] notice: members: 1/4736, 2/4870
    Apr  5 05:20:31 ait1 corosync[4983]: notice  [QUORUM] Members[2]: 1 2
    Apr  5 05:20:31 ait1 corosync[4983]: notice  [MAIN  ] Completed service synchronization, ready to provide service.
    Apr  5 05:20:31 ait1 pmxcfs[4736]: [dcdb] notice: starting data syncronisation
    Apr  5 05:20:31 ait1 corosync[4983]:  [QUORUM] Members[2]: 1 2
    Apr  5 05:20:31 ait1 corosync[4983]:  [MAIN  ] Completed service synchronization, ready to provide service.
    Apr  5 05:20:31 ait1 pmxcfs[4736]: [dcdb] notice: cpg_send_message retried 1 times
    Apr  5 05:20:31 ait1 pmxcfs[4736]: [status] notice: members: 1/4736, 2/4870
    Apr  5 05:20:31 ait1 pmxcfs[4736]: [status] notice: starting data syncronisation
    Apr  5 05:20:31 ait1 pmxcfs[4736]: [dcdb] notice: received sync request (epoch 1/4736/0000000A)
    Apr  5 05:20:31 ait1 pmxcfs[4736]: [status] notice: received sync request (epoch 1/4736/0000000A)
    Apr  5 05:20:31 ait1 pmxcfs[4736]: [dcdb] notice: received all states
    Apr  5 05:20:31 ait1 pmxcfs[4736]: [dcdb] notice: leader is 1/4736
    Apr  5 05:20:31 ait1 pmxcfs[4736]: [dcdb] notice: synced members: 1/4736, 2/4870
    Apr  5 05:20:31 ait1 pmxcfs[4736]: [dcdb] notice: start sending inode updates
    Apr  5 05:20:31 ait1 pmxcfs[4736]: [dcdb] notice: sent all (0) updates
    Apr  5 05:20:31 ait1 pmxcfs[4736]: [dcdb] notice: all data is up to date
    Apr  5 05:20:31 ait1 pmxcfs[4736]: [dcdb] notice: dfsm_deliver_queue: queue length 5
    Apr  5 05:20:31 ait1 pmxcfs[4736]: [status] notice: received all states
    Apr  5 05:20:31 ait1 pmxcfs[4736]: [status] notice: all data is up to date
    Apr  5 05:20:31 ait1 pmxcfs[4736]: [status] notice: dfsm_deliver_queue: queue length 7
    Apr  5 05:20:32 ait1 snmpd[4323]: error on subcontainer 'ia_addr' insert (-1)
    Apr  5 05:20:32 ait1 snmpd[4323]: error on subcontainer 'ia_addr' insert (-1)
    Apr  5 05:20:32 ait1 snmpd[4323]: error on subcontainer 'ia_addr' insert (-1)
    Apr  5 05:20:32 ait1 snmpd[4323]: error on subcontainer 'ia_addr' insert (-1)
    Apr  5 05:20:32 ait1 snmpd[4323]: error on subcontainer 'ia_addr' insert (-1)
    OK ... using 'zgrep " 5 05:.*kernel" *' on AIT3 shows that I should look in kern.log.1 and messages.1.

    'zgrep " 5 05:" kern.log.1' only shows the system booting ... there's nothing before that. The only earlier entries, from about two hours before, are from pveupdate.
    Code:
    Apr  5 03:38:35 ait3 pveupdate[2862945]: <root@pam> starting task UPID:ait3:002BAF91:02EB5BA9:5CA7302B:aptupdate::root@pam:
    Apr  5 03:38:40 ait3 pveupdate[2862945]: <root@pam> end task UPID:ait3:002BAF91:02EB5BA9:5CA7302B:aptupdate::root@pam: OK
    messages.1 looks to contain the same information.

    We're using the IPMI watchdog.
    Perhaps it's worth asking a few questions here, given that any node that reboots in an unplanned way sometimes loses its watchdog config. Another manual reboot fixes this.
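    One way to confirm the BMC timer state is 'ipmitool mc watchdog get'. As a hedged sketch, here's a parse of its output to pull out the configured countdown; the sample text below is illustrative only (the exact field layout varies by BMC firmware), not captured from these nodes:

    ```shell
    # Illustrative sample of `ipmitool mc watchdog get` output (assumed format):
    sample='Watchdog Timer Use:     SMS/OS (0x44)
    Watchdog Timer Is:      Started/Running
    Watchdog Timer Actions: Hard Reset (0x01)
    Pre-timeout interval:   0 seconds
    Timer Expiration Flags: 0x00
    Initial Countdown:      10 sec
    Present Countdown:      9 sec'

    # Extract the configured countdown; on a healthy node this should match the
    # expected 10 seconds rather than the 15-second default seen after a fence.
    countdown=$(printf '%s\n' "$sample" | awk '/Initial Countdown/ {print $3}')
    echo "configured countdown: ${countdown}s"
    ```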

    Hmm ... I don't think so but I'm filtering based on what I know. The setup has worked well up to this point ... no other strange issues.
     
  15. Stoiko Ivanov

    Stoiko Ivanov Proxmox Staff Member
    Staff Member

    Joined:
    May 2, 2018
    Messages:
    1,198
    Likes Received:
    102
    At first glance it seems that the node just lost connectivity - i.e., no retransmits, no overloaded network; it simply vanished.

    * Is the time on the nodes synchronized?
    * Else it would help if you could set up some remote syslog - that way you increase the chances of getting the last messages before the fence.
    * Maybe also try to disable HA (or just leave the node running without any resources configured - then it won't get fenced) - maybe you'll get some more info on the reason for the crash
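    For the remote-syslog suggestion, a minimal rsyslog forwarding rule would look something like the sketch below, dropped onto each node and followed by an rsyslog restart. 192.0.2.10 is a placeholder collector address; '@@' selects TCP, a single '@' would be UDP:

    ```
    # /etc/rsyslog.d/90-remote.conf  (placeholder collector address)
    *.* @@192.0.2.10:514
    ```

    For the HA part, 'ha-manager status' lists the configured resources, and removing them with 'ha-manager remove vm:<id>' should keep a node from being fenced while testing.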
     
  16. Binary Bandit

    Binary Bandit New Member
    Proxmox Subscriber

    Joined:
    Dec 13, 2018
    Messages:
    27
    Likes Received:
    1
    Thanks Stoiko.

    The time on the nodes is definitely synchronized.

    After reading about the changes / fixes in 5.4 I decided to upgrade. The cluster hasn't rebooted since. Everything seems to point to something in 5.3 that didn't work well with my config.

    I'm guessing it's something in my hardware config, as I don't think our three-node cluster is that unique. Well, maybe the active-backup bonds on the NICs for every network except one of the two corosync LANs. The three nodes are identical Dell R510s: 64GB RAM, 2 x 6-core CPUs, 2TB Dell SATA drives, and a PERC H700 RAID card ... throwing this out there in case someone knows of a known issue.

    For now, I'm going to watch and wait.
     
    Stoiko Ivanov likes this.