Random cluster breakup

Nhoague

Renowned Member
Sep 29, 2012
Hey there,

So I have a small cluster using Proxmox 5.1-52. Every 20 minutes or so the cluster drops; I see this in the cluster logs. Ideas? Both servers have bonded NICs in round-robin (RR) mode connected to meshed network switches. This problem never happened with PVE3, but now I'm seeing it. Is there some other requirement or switching configuration I need to be aware of?

Thank you!

May 04 17:09:55 SUNFIRE corosync[2373]: warning [CPG ] downlist left_list: 0 received
May 04 17:09:55 SUNFIRE corosync[2373]: [CPG ] downlist left_list: 0 received
May 04 17:09:55 SUNFIRE corosync[2373]: warning [CPG ] downlist left_list: 0 received
May 04 17:09:55 SUNFIRE corosync[2373]: [CPG ] downlist left_list: 0 received
May 04 17:09:55 SUNFIRE corosync[2373]: notice [QUORUM] This node is within the primary component and will provide service.
May 04 17:09:55 SUNFIRE corosync[2373]: notice [QUORUM] Members[2]: 2 1
May 04 17:09:55 SUNFIRE corosync[2373]: notice [MAIN ] Completed service synchronization, ready to provide service.
May 04 17:09:55 SUNFIRE corosync[2373]: [QUORUM] This node is within the primary component and will provide service.
May 04 17:09:55 SUNFIRE corosync[2373]: [QUORUM] Members[2]: 2 1
May 04 17:09:55 SUNFIRE corosync[2373]: [MAIN ] Completed service synchronization, ready to provide service.
May 04 17:29:24 SUNFIRE corosync[2373]: notice [TOTEM ] Retransmit List: be9 bea
May 04 17:29:24 SUNFIRE corosync[2373]: [TOTEM ] Retransmit List: be9 bea
May 04 17:29:24 SUNFIRE corosync[2373]: notice [TOTEM ] Retransmit List: be9 bea
May 04 17:29:24 SUNFIRE corosync[2373]: [TOTEM ] Retransmit List: be9 bea
May 04 17:29:24 SUNFIRE corosync[2373]: notice [TOTEM ] Retransmit List: be9 bea
May 04 17:29:24 SUNFIRE corosync[2373]: [TOTEM ] Retransmit List: be9 bea

When it comes back online this is what I see in the logs:

May 04 17:29:24 SUNFIRE corosync[2373]: [TOTEM ] Retransmit List: be9 bea
May 04 17:29:25 SUNFIRE corosync[2373]: [CPG ] downlist left_list: 1 received
May 04 17:29:26 SUNFIRE corosync[2373]: [CPG ] downlist left_list: 0 received
May 04 17:29:27 SUNFIRE corosync[2373]: [CPG ] downlist left_list: 0 received
May 04 17:30:48 SUNFIRE corosync[2373]: notice [TOTEM ] A new membership (192.168.80.100:1272) was formed. Members joined: 1
May 04 17:30:48 SUNFIRE corosync[2373]: [TOTEM ] A new membership (192.168.80.100:1272) was formed. Members joined: 1
May 04 17:30:48 SUNFIRE corosync[2373]: warning [CPG ] downlist left_list: 0 received
May 04 17:30:48 SUNFIRE corosync[2373]: [CPG ] downlist left_list: 0 received
May 04 17:30:48 SUNFIRE corosync[2373]: [CPG ] downlist left_list: 0 received
May 04 17:30:48 SUNFIRE corosync[2373]: warning [CPG ] downlist left_list: 0 received
May 04 17:30:48 SUNFIRE corosync[2373]: notice [QUORUM] This node is within the primary component and will provide service.
May 04 17:30:48 SUNFIRE corosync[2373]: notice [QUORUM] Members[2]: 2 1
May 04 17:30:48 SUNFIRE corosync[2373]: notice [MAIN ] Completed service synchronization, ready to provide service.
May 04 17:30:48 SUNFIRE corosync[2373]: [QUORUM] This node is within the primary component and will provide service.
 
I cannot see that the cluster is offline - you always have quorum with 2 members.

But yes, it seems member 1 leaves, then joins again very quickly. Is there any hint in the syslog of that node?
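
On that node, you can pull the corosync and pmxcfs entries around one of the drops with journalctl - something like this, adjusting the time window to one of the outages:

journalctl -u corosync -u pve-cluster --since "2018-05-04 17:25" --until "2018-05-04 17:35"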
 
Here's some more info from syslog:

May 05 07:35:26 MAGNETO pmxcfs[4114]: [dcdb] notice: members: 1/4114
May 05 07:35:26 MAGNETO pmxcfs[4114]: [status] notice: members: 1/4114
May 05 07:35:26 MAGNETO corosync[4133]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
May 05 07:35:26 MAGNETO corosync[4133]: [QUORUM] Members[1]: 1
May 05 07:35:26 MAGNETO corosync[4133]: [MAIN ] Completed service synchronization, ready to provide service.
May 05 07:35:26 MAGNETO pmxcfs[4114]: [status] notice: node lost quorum
May 05 07:36:00 MAGNETO systemd[1]: Starting Proxmox VE replication runner...
May 05 07:36:00 MAGNETO systemd[1]: Started Proxmox VE replication runner.
May 05 07:36:42 MAGNETO corosync[4133]: notice [TOTEM ] A new membership (192.168.80.100:6552) was formed. Members joined: 2
May 05 07:36:42 MAGNETO corosync[4133]: [TOTEM ] A new membership (192.168.80.100:6552) was formed. Members joined: 2
May 05 07:36:42 MAGNETO corosync[4133]: warning [CPG ] downlist left_list: 0 received
May 05 07:36:42 MAGNETO corosync[4133]: [CPG ] downlist left_list: 0 received
May 05 07:36:42 MAGNETO corosync[4133]: [CPG ] downlist left_list: 0 received
May 05 07:36:42 MAGNETO corosync[4133]: warning [CPG ] downlist left_list: 0 received
May 05 07:36:42 MAGNETO pmxcfs[4114]: [dcdb] notice: members: 1/4114, 2/2393
May 05 07:36:42 MAGNETO corosync[4133]: notice [QUORUM] This node is within the primary component and will provide service.
May 05 07:36:42 MAGNETO corosync[4133]: notice [QUORUM] Members[2]: 2 1
May 05 07:36:42 MAGNETO corosync[4133]: notice [MAIN ] Completed service synchronization, ready to provide service.
May 05 07:36:42 MAGNETO pmxcfs[4114]: [dcdb] notice: starting data syncronisation
May 05 07:36:42 MAGNETO corosync[4133]: [QUORUM] This node is within the primary component and will provide service.
May 05 07:36:42 MAGNETO corosync[4133]: [QUORUM] Members[2]: 2 1
May 05 07:36:42 MAGNETO corosync[4133]: [MAIN ] Completed service synchronization, ready to provide service.
May 05 07:36:42 MAGNETO pmxcfs[4114]: [dcdb] notice: cpg_send_message retried 1 times
May 05 07:36:42 MAGNETO pmxcfs[4114]: [status] notice: node has quorum
May 05 07:36:42 MAGNETO pmxcfs[4114]: [status] notice: members: 1/4114, 2/2393
May 05 07:36:42 MAGNETO pmxcfs[4114]: [status] notice: starting data syncronisation
May 05 07:36:42 MAGNETO pmxcfs[4114]: [dcdb] notice: received sync request (epoch 1/4114/00000070)
May 05 07:36:42 MAGNETO pmxcfs[4114]: [status] notice: received sync request (epoch 1/4114/0000005A)
May 05 07:36:42 MAGNETO pmxcfs[4114]: [dcdb] notice: received all states
May 05 07:36:42 MAGNETO pmxcfs[4114]: [dcdb] notice: leader is 2/2393
May 05 07:36:42 MAGNETO pmxcfs[4114]: [dcdb] notice: synced members: 2/2393
May 05 07:36:42 MAGNETO pmxcfs[4114]: [dcdb] notice: waiting for updates from leader
May 05 07:36:42 MAGNETO pmxcfs[4114]: [dcdb] notice: update complete - trying to commit (got 2 inode updates)
May 05 07:36:42 MAGNETO pmxcfs[4114]: [dcdb] notice: all data is up to date
May 05 07:36:42 MAGNETO pmxcfs[4114]: [status] notice: received all states
May 05 07:36:42 MAGNETO pmxcfs[4114]: [status] notice: all data is up to date
May 05 07:37:00 MAGNETO systemd[1]: Starting Proxmox VE replication runner...
May 05 07:37:00 MAGNETO systemd[1]: Started Proxmox VE replication runner.
May 05 07:38:00 MAGNETO systemd[1]: Starting Proxmox VE replication runner...
May 05 07:38:00 MAGNETO systemd[1]: Started Proxmox VE replication runner.
May 05 07:39:00 MAGNETO systemd[1]: Starting Proxmox VE replication runner...
May 05 07:39:00 MAGNETO systemd[1]: Started Proxmox VE replication runner.
May 05 07:40:00 MAGNETO systemd[1]: Starting Proxmox VE replication runner...
May 05 07:40:00 MAGNETO systemd[1]: Started Proxmox VE replication runner.
May 05 07:40:53 MAGNETO pvedaemon[2204]: <root@pam> successful auth for user 'root@pam'
May 05 07:41:00 MAGNETO systemd[1]: Starting Proxmox VE replication runner...
May 05 07:41:00 MAGNETO systemd[1]: Started Proxmox VE replication runner.
May 05 07:41:41 MAGNETO corosync[4133]: error [TOTEM ] FAILED TO RECEIVE
May 05 07:41:41 MAGNETO corosync[4133]: [TOTEM ] FAILED TO RECEIVE
May 05 07:41:42 MAGNETO corosync[4133]: notice [TOTEM ] A new membership (192.168.80.104:6556) was formed. Members left: 2
May 05 07:41:42 MAGNETO corosync[4133]: notice [TOTEM ] Failed to receive the leave message. failed: 2
May 05 07:41:42 MAGNETO corosync[4133]: warning [CPG ] downlist left_list: 1 received
May 05 07:41:42 MAGNETO corosync[4133]: notice [QUORUM] This node is within the non-primary component and will NOT provide any services.
May 05 07:41:42 MAGNETO corosync[4133]: notice [QUORUM] Members[1]: 1
May 05 07:41:42 MAGNETO corosync[4133]: notice [MAIN ] Completed service synchronization, ready to provide service.
May 05 07:41:42 MAGNETO corosync[4133]: [TOTEM ] A new membership (192.168.80.104:6556) was formed. Members left: 2
May 05 07:41:42 MAGNETO corosync[4133]: [TOTEM ] Failed to receive the leave message. failed: 2
May 05 07:41:42 MAGNETO corosync[4133]: [CPG ] downlist left_list: 1 received
May 05 07:41:42 MAGNETO pmxcfs[4114]: [dcdb] notice: members: 1/4114
May 05 07:41:42 MAGNETO pmxcfs[4114]: [status] notice: members: 1/4114
May 05 07:41:42 MAGNETO pmxcfs[4114]: [status] notice: node lost quorum
May 05 07:41:42 MAGNETO corosync[4133]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
May 05 07:41:42 MAGNETO corosync[4133]: [QUORUM] Members[1]: 1
May 05 07:41:42 MAGNETO corosync[4133]: [MAIN ] Completed service synchronization, ready to provide service.
May 05 07:41:56 MAGNETO pveproxy[19118]: Clearing outdated entries from certificate cache
May 05 07:41:59 MAGNETO pveproxy[17392]: Clearing outdated entries from certificate cache
May 05 07:42:00 MAGNETO systemd[1]: Starting Proxmox VE replication runner...
May 05 07:42:00 MAGNETO systemd[1]: Started Proxmox VE replication runner.
May 05 07:42:50 MAGNETO corosync[4133]: notice [TOTEM ] A new membership (192.168.80.104:6560) was formed. Members
May 05 07:42:50 MAGNETO corosync[4133]: warning [CPG ] downlist left_list: 0 received
May 05 07:42:50 MAGNETO corosync[4133]: notice [QUORUM] Members[1]: 1
May 05 07:42:50 MAGNETO corosync[4133]: notice [MAIN ] Completed service synchronization, ready to provide service.
May 05 07:42:50 MAGNETO corosync[4133]: [TOTEM ] A new membership (192.168.80.104:6560) was formed. Members
May 05 07:42:50 MAGNETO corosync[4133]: [CPG ] downlist left_list: 0 received
May 05 07:42:50 MAGNETO corosync[4133]: [QUORUM] Members[1]: 1
May 05 07:42:50 MAGNETO corosync[4133]: [MAIN ] Completed service synchronization, ready to provide service.
May 05 07:42:51 MAGNETO corosync[4133]: notice [TOTEM ] A new membership (192.168.80.104:6564) was formed. Members
May 05 07:42:51 MAGNETO corosync[4133]: warning [CPG ] downlist left_list: 0 received
May 05 07:42:51 MAGNETO corosync[4133]: [TOTEM ] A new membership (192.168.80.104:6564) was formed. Members
May 05 07:42:51 MAGNETO corosync[4133]: notice [QUORUM] Members[1]: 1
May 05 07:42:51 MAGNETO corosync[4133]: notice [MAIN ] Completed service synchronization, ready to provide service.
May 05 07:42:51 MAGNETO corosync[4133]: [CPG ] downlist left_list: 0 received
May 05 07:42:51 MAGNETO corosync[4133]: [QUORUM] Members[1]: 1
May 05 07:42:51 MAGNETO corosync[4133]: [MAIN ] Completed service synchronization, ready to provide service.
May 05 07:42:53 MAGNETO corosync[4133]: notice [TOTEM ] A new membership (192.168.80.104:6568) was formed. Members
May 05 07:42:53 MAGNETO corosync[4133]: warning [CPG ] downlist left_list: 0 received
May 05 07:42:53 MAGNETO corosync[4133]: [TOTEM ] A new membership (192.168.80.104:6568) was formed. Members
May 05 07:42:53 MAGNETO corosync[4133]: notice [QUORUM] Members[1]: 1
May 05 07:42:53 MAGNETO corosync[4133]: notice [MAIN ] Completed service synchronization, ready to provide service.
May 05 07:42:53 MAGNETO corosync[4133]: [CPG ] downlist left_list: 0 received
May 05 07:42:53 MAGNETO corosync[4133]: [QUORUM] Members[1]: 1
May 05 07:42:53 MAGNETO corosync[4133]: [MAIN ] Completed service synchronization, ready to provide service.
May 05 07:42:54 MAGNETO corosync[4133]: notice [TOTEM ] A new membership (192.168.80.104:6572) was formed. Members
May 05 07:42:54 MAGNETO corosync[4133]: warning [CPG ] downlist left_list: 0 received
May 05 07:42:54 MAGNETO corosync[4133]: notice [QUORUM] Members[1]: 1
May 05 07:42:54 MAGNETO corosync[4133]: [TOTEM ] A new membership (192.168.80.104:6572) was formed. Members
May 05 07:42:54 MAGNETO corosync[4133]: notice [MAIN ] Completed service synchronization, ready to provide service.
May 05 07:42:54 MAGNETO corosync[4133]: [CPG ] downlist left_list: 0 received
May 05 07:42:54 MAGNETO corosync[4133]: [QUORUM] Members[1]: 1
May 05 07:42:54 MAGNETO corosync[4133]: [MAIN ] Completed service synchronization, ready to provide service.
May 05 07:42:55 MAGNETO corosync[4133]: notice [TOTEM ] A new membership (192.168.80.100:6576) was formed. Members joined: 2
May 05 07:42:55 MAGNETO corosync[4133]: [TOTEM ] A new membership (192.168.80.100:6576) was formed. Members joined: 2
May 05 07:42:55 MAGNETO corosync[4133]: warning [CPG ] downlist left_list: 0 received
May 05 07:42:55 MAGNETO corosync[4133]: [CPG ] downlist left_list: 0 received
May 05 07:42:55 MAGNETO corosync[4133]: [CPG ] downlist left_list: 0 received
May 05 07:42:55 MAGNETO corosync[4133]: warning [CPG ] downlist left_list: 0 received
May 05 07:42:55 MAGNETO pmxcfs[4114]: [dcdb] notice: members: 1/4114, 2/2393
May 05 07:42:55 MAGNETO corosync[4133]: notice [QUORUM] This node is within the primary component and will provide service.
May 05 07:42:55 MAGNETO corosync[4133]: notice [QUORUM] Members[2]: 2 1
May 05 07:42:55 MAGNETO corosync[4133]: notice [MAIN ] Completed service synchronization, ready to provide service.
May 05 07:42:55 MAGNETO pmxcfs[4114]: [dcdb] notice: starting data syncronisation
May 05 07:42:55 MAGNETO corosync[4133]: [QUORUM] This node is within the primary component and will provide service.
May 05 07:42:55 MAGNETO corosync[4133]: [QUORUM] Members[2]: 2 1
May 05 07:42:55 MAGNETO corosync[4133]: [MAIN ] Completed service synchronization, ready to provide service.
May 05 07:42:56 MAGNETO pmxcfs[4114]: [dcdb] notice: cpg_send_message retried 1 times
May 05 07:42:56 MAGNETO pmxcfs[4114]: [status] notice: node has quorum
May 05 07:42:56 MAGNETO pmxcfs[4114]: [status] notice: members: 1/4114, 2/2393
May 05 07:42:56 MAGNETO pmxcfs[4114]: [status] notice: starting data syncronisation
May 05 07:42:56 MAGNETO pmxcfs[4114]: [dcdb] notice: received sync request (epoch 1/4114/00000072)
May 05 07:42:56 MAGNETO pmxcfs[4114]: [status] notice: received sync request (epoch 1/4114/0000005C)
May 05 07:42:56 MAGNETO pmxcfs[4114]: [dcdb] notice: received all states
May 05 07:42:56 MAGNETO pmxcfs[4114]: [dcdb] notice: leader is 2/2393
May 05 07:42:56 MAGNETO pmxcfs[4114]: [dcdb] notice: synced members: 2/2393
May 05 07:42:56 MAGNETO pmxcfs[4114]: [dcdb] notice: waiting for updates from leader
May 05 07:42:56 MAGNETO pmxcfs[4114]: [status] notice: received all states
May 05 07:42:56 MAGNETO pmxcfs[4114]: [status] notice: all data is up to date
May 05 07:42:56 MAGNETO pmxcfs[4114]: [dcdb] notice: update complete - trying to commit (got 2 inode updates)
May 05 07:42:56 MAGNETO pmxcfs[4114]: [dcdb] notice: all data is up to date
May 05 07:43:00 MAGNETO systemd[1]: Starting Proxmox VE replication runner...
May 05 07:43:00 MAGNETO systemd[1]: Started Proxmox VE replication runner.
May 05 07:44:00 MAGNETO systemd[1]: Starting Proxmox VE replication runner...
May 05 07:44:00 MAGNETO systemd[1]: Started Proxmox VE replication runner.
May 05 07:44:03 MAGNETO pmxcfs[4114]: [dcdb] notice: data verification successful
May 05 07:45:00 MAGNETO systemd[1]: Starting Proxmox VE replication runner...
May 05 07:45:00 MAGNETO systemd[1]: Started Proxmox VE replication runner.
May 05 07:46:00 MAGNETO systemd[1]: Starting Proxmox VE replication runner...
May 05 07:46:00 MAGNETO systemd[1]: Started Proxmox VE replication runner.
May 05 07:47:00 MAGNETO systemd[1]: Starting Proxmox VE replication runner...
May 05 07:47:00 MAGNETO systemd[1]: Started Proxmox VE replication runner.

And check out the attached screenshot. Definitely offline.

I'm going to try disabling one of the interfaces on bond0 (vmbr0), or turning it into active/passive. Right now it is running balance-rr, which is how I set it up on PVE3, but maybe PVE5 is more sensitive?
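
For reference, here is roughly what my bond looks like in /etc/network/interfaces (NIC names and addresses are from memory, so treat this as a sketch); for active/passive I would change bond-mode and pick a primary:

auto bond0
iface bond0 inet manual
        bond-slaves eno1 eno2
        bond-miimon 100
        # was: bond-mode balance-rr
        bond-mode active-backup
        bond-primary eno1

auto vmbr0
iface vmbr0 inet static
        address 192.168.80.100
        netmask 255.255.255.0
        gateway 192.168.80.1
        bridge-ports bond0
        bridge-stp off
        bridge-fd 0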

BTW, the web interface in 5.1-52 is a great improvement!
 

Attachments

  • Screen Shot 2018-05-05 at 7.42.13 AM.png
Then the second it drops, I get a lot of this in the logs:

May 05 08:06:34 SUNFIRE corosync[1989]: notice [TOTEM ] Retransmit List: 7bf
May 05 08:06:34 SUNFIRE corosync[1989]: [TOTEM ] Retransmit List: 7bf
May 05 08:06:34 SUNFIRE corosync[1989]: notice [TOTEM ] Retransmit List: 7bf
May 05 08:06:34 SUNFIRE corosync[1989]: [TOTEM ] Retransmit List: 7bf
May 05 08:06:34 SUNFIRE corosync[1989]: notice [TOTEM ] Retransmit List: 7bf
May 05 08:06:34 SUNFIRE corosync[1989]: [TOTEM ] Retransmit List: 7bf
May 05 08:06:34 SUNFIRE corosync[1989]: notice [TOTEM ] Retransmit List: 7bf
May 05 08:06:34 SUNFIRE corosync[1989]: [TOTEM ] Retransmit List: 7bf
May 05 08:06:34 SUNFIRE corosync[1989]: notice [TOTEM ] Retransmit List: 7bf
May 05 08:06:34 SUNFIRE corosync[1989]: [TOTEM ] Retransmit List: 7bf
May 05 08:06:34 SUNFIRE corosync[1989]: notice [TOTEM ] Retransmit List: 7bf
May 05 08:06:34 SUNFIRE corosync[1989]: [TOTEM ] Retransmit List: 7bf
May 05 08:06:34 SUNFIRE corosync[1989]: notice [TOTEM ] Retransmit List: 7bf
May 05 08:06:34 SUNFIRE corosync[1989]: [TOTEM ] Retransmit List: 7bf
May 05 08:06:34 SUNFIRE corosync[1989]: notice [TOTEM ] Retransmit List: 7bf
May 05 08:06:34 SUNFIRE corosync[1989]: [TOTEM ] Retransmit List: 7bf
May 05 08:06:34 SUNFIRE corosync[1989]: notice [TOTEM ] Retransmit List: 7bf
May 05 08:06:34 SUNFIRE corosync[1989]: [TOTEM ] Retransmit List: 7bf
May 05 08:06:34 SUNFIRE corosync[1989]: notice [TOTEM ] Retransmit List: 7bf
May 05 08:06:34 SUNFIRE corosync[1989]: [TOTEM ] Retransmit List: 7bf
May 05 08:06:34 SUNFIRE corosync[1989]: notice [TOTEM ] Retransmit List: 7bf
May 05 08:06:34 SUNFIRE corosync[1989]: [TOTEM ] Retransmit List: 7bf
May 05 08:06:34 SUNFIRE corosync[1989]: notice [TOTEM ] Retransmit List: 7bf
May 05 08:06:34 SUNFIRE corosync[1989]: [TOTEM ] Retransmit List: 7bf

and then ...

May 05 08:06:36 SUNFIRE pve-ha-lrm[2130]: unable to write lrm status file - closing file '/etc/pve/nodes/SUNFIRE/lrm_status.tmp.2130' failed - Permission denied
May 05 08:07:00 SUNFIRE systemd[1]: Starting Proxmox VE replication runner...
May 05 08:07:00 SUNFIRE systemd[1]: Started Proxmox VE replication runner.
May 05 08:07:53 SUNFIRE systemd-journald[472]: Suppressed 1958 messages from /system.slice/corosync.service
May 05 08:07:53 SUNFIRE corosync[1989]: notice [TOTEM ] A new membership (192.168.80.100:7000) was formed. Members joined: 1
May 05 08:07:53 SUNFIRE corosync[1989]: [TOTEM ] A new membership (192.168.80.100:7000) was formed. Members joined: 1
May 05 08:07:53 SUNFIRE corosync[1989]: warning [CPG ] downlist left_list: 0 received
May 05 08:07:53 SUNFIRE corosync[1989]: [CPG ] downlist left_list: 0 received
May 05 08:07:53 SUNFIRE corosync[1989]: warning [CPG ] downlist left_list: 0 received
May 05 08:07:53 SUNFIRE corosync[1989]: [CPG ] downlist left_list: 0 received
May 05 08:07:53 SUNFIRE pmxcfs[1909]: [dcdb] notice: members: 1/4114, 2/1909
May 05 08:07:53 SUNFIRE corosync[1989]: notice [QUORUM] This node is within the primary component and will provide service.
May 05 08:07:53 SUNFIRE corosync[1989]: notice [QUORUM] Members[2]: 2 1
May 05 08:07:53 SUNFIRE corosync[1989]: notice [MAIN ] Completed service synchronization, ready to provide service.
May 05 08:07:53 SUNFIRE pmxcfs[1909]: [dcdb] notice: starting data syncronisation
May 05 08:07:53 SUNFIRE pmxcfs[1909]: [status] notice: members: 1/4114, 2/1909
May 05 08:07:53 SUNFIRE pmxcfs[1909]: [status] notice: starting data syncronisation
May 05 08:07:53 SUNFIRE corosync[1989]: [QUORUM] This node is within the primary component and will provide service.
May 05 08:07:53 SUNFIRE corosync[1989]: [QUORUM] Members[2]: 2 1
May 05 08:07:53 SUNFIRE corosync[1989]: [MAIN ] Completed service synchronization, ready to provide service.
May 05 08:07:53 SUNFIRE pmxcfs[1909]: [status] notice: node has quorum
May 05 08:07:53 SUNFIRE pmxcfs[1909]: [dcdb] notice: received sync request (epoch 1/4114/00000078)
May 05 08:07:53 SUNFIRE pmxcfs[1909]: [status] notice: received sync request (epoch 1/4114/00000062)
May 05 08:07:53 SUNFIRE pmxcfs[1909]: [dcdb] notice: received all states
May 05 08:07:53 SUNFIRE pmxcfs[1909]: [dcdb] notice: leader is 1/4114
May 05 08:07:53 SUNFIRE pmxcfs[1909]: [dcdb] notice: synced members: 1/4114
May 05 08:07:53 SUNFIRE pmxcfs[1909]: [dcdb] notice: waiting for updates from leader
May 05 08:07:53 SUNFIRE pmxcfs[1909]: [status] notice: received all states
May 05 08:07:53 SUNFIRE pmxcfs[1909]: [status] notice: all data is up to date
May 05 08:07:53 SUNFIRE pmxcfs[1909]: [dcdb] notice: update complete - trying to commit (got 2 inode updates)
May 05 08:07:53 SUNFIRE pmxcfs[1909]: [dcdb] notice: all data is up to date
May 05 08:08:00 SUNFIRE systemd[1]: Starting Proxmox VE replication runner...
May 05 08:08:00 SUNFIRE systemd[1]: Started Proxmox VE replication runner.
May 05 08:08:59 SUNFIRE systemd[1]: Starting Cleanup of Temporary Directories...
May 05 08:08:59 SUNFIRE systemd[1]: Started Cleanup of Temporary Directories.
May 05 08:09:00 SUNFIRE systemd[1]: Starting Proxmox VE replication runner...
May 05 08:09:00 SUNFIRE systemd[1]: Started Proxmox VE replication runner.

And it comes back up?
 
OK, hold on ... I just read this in the manual ...

"If you intend to run your cluster network on the bonding interfaces, then you have to use active-passive mode on the bonding interfaces, other modes are unsupported."

I am using the LAN network for the cluster as well, something I've done since PVE3 - however, is PVE5 more sensitive? If so, that's fine; I may redo my network config or just leave it as active-backup. But can you confirm, just so I know why it's dropping like this?
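
For what it's worth, I'm checking which mode the kernel actually has the bond in via the bonding status file, not just the config:

cat /proc/net/bonding/bond0 | grep "Bonding Mode"
# currently: Bonding Mode: load balancing (round-robin)
# after the change it should read: Bonding Mode: fault-tolerance (active-backup)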

Thank you!
 
So with active-backup it is still dropping. I'm going to rebuild today with a dedicated cluster network.

Just so I'm not crazy: I can have a two-node cluster, right? Even though my ultimate goal is 6 nodes, I need to start rebuilding with PVE 5. But a two-node cluster shouldn't drop consistently?
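
And while I rebuild, I'm watching quorum with pvecm. As I understand it, if the other node is down I can temporarily lower the expected votes so the surviving node stays writable (to be used with care):

pvecm status        # shows Expected votes / Total votes / Quorate
pvecm expected 1    # temporary override while the second node is down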
 
Thanks for clarifying! That’s what I was afraid of though.

It must be my switching gear. But something else must have changed in the corosync package - is it more chatty? PVE3 never had a problem.

However, I think a dedicated cluster network is probably the most reliable anyway.
 
Update: I rebuilt using a dedicated network for the cluster, and it works! So maybe the new version is more chatty; it just wouldn't work over the regular LAN as it used to. Too much cross-talk, maybe.

So anyway, that's good! Now a question: can I bond two NICs to be used as the cluster network?
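
In case it helps anyone, here is (trimmed) what my /etc/pve/corosync.conf looks like with the cluster on its own subnet - the node names are mine, but the 10.10.10.x addresses are just what I happened to pick:

nodelist {
  node {
    name: MAGNETO
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.10.1
  }
  node {
    name: SUNFIRE
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.10.10.2
  }
}

totem {
  version: 2
  interface {
    ringnumber: 0
    bindnetaddr: 10.10.10.0
  }
}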
 
We also ran fine with the corosync network on the same network as the Proxmox traffic for a while. Then, as load moderately increased, we started getting totem retransmit errors.

Once we separated the corosync network, we haven't had any issues whatsoever.
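
One thing worth doing before trusting any network for corosync is the multicast test from the Proxmox docs - run omping between all nodes at the same time, and any loss shows up as exactly the kind of retransmits you saw (hostnames are placeholders):

omping -c 10000 -i 0.001 -F -q node1 node2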
 
