Proxmox cluster broke during upgrade

May 10, 2019
Hi,

We have a cluster of 6 Proxmox nodes running a mix of Proxmox 5.2 and 5.3.

After upgrading one of the nodes to 5.4, we are seeing multiple problems across the entire cluster:

- VMs don't start correctly:
Code:
May 10 15:50:01 rwb070 pvedaemon[3101]: <root@pam> starting task UPID:rwb070:00002BD2:00076D70:5CD58189:qmstart:10030:root@pam:
May 10 15:50:01 rwb070 pvedaemon[11218]: start VM 10030: UPID:rwb070:00002BD2:00076D70:5CD58189:qmstart:10030:root@pam:
May 10 15:50:01 rwb070 pvedaemon[11218]: start failed: org.freedesktop.systemd1.UnitExists: Unit 10030.scope already exists.
May 10 15:50:01 rwb070 pvedaemon[3101]: <root@pam> end task UPID:rwb070:00002BD2:00076D70:5CD58189:qmstart:10030:root@pam: start failed: org.freedesktop.systemd1.UnitExists: Unit 10030.scope already exists.

- other commands fail as well, e.g. qm list just hangs

- the PVE cluster fails

PVE versions:

Upgraded Proxmox host:
Code:
[16:25:43][root@rwb069(4)]:~
(0)#: pveversion -v
proxmox-ve: 5.4-1 (running kernel: 4.15.18-13-pve)
pve-manager: not correctly installed (running version: 5.4-5/c6fdb264)
pve-kernel-4.15: 5.4-1
pve-kernel-4.13: 5.2-2
pve-kernel-4.15.18-13-pve: 4.15.18-37
pve-kernel-4.15.18-4-pve: 4.15.18-23
pve-kernel-4.13.16-4-pve: 4.13.16-51
pve-kernel-4.13.16-1-pve: 4.13.16-46
pve-kernel-4.10.17-3-pve: 4.10.17-23
pve-kernel-4.10.17-2-pve: 4.10.17-20
corosync: 2.4.4-pve1
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.1-8
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-51
libpve-guest-common-perl: 2.0-20
libpve-http-server-perl: 2.0-13
libpve-storage-perl: 5.0-41
libqb0: 1.0.3-1~bpo9
lvm2: 2.02.168-pve6
lxc-pve: 3.1.0-3
lxcfs: 3.0.3-pve1
novnc-pve: 1.0.0-3
openvswitch-switch: 2.7.0-3
proxmox-widget-toolkit: 1.0-26
pve-cluster: 5.0-36
pve-container: 2.0-37
pve-docs: 5.4-2
pve-edk2-firmware: 1.20190312-1
pve-firewall: 3.0-20
pve-firmware: 2.0-6
pve-ha-manager: not correctly installed
pve-i18n: 1.1-4
pve-libspice-server1: 0.14.1-2
pve-qemu-kvm: 2.12.1-3
pve-xtermjs: 3.12.0-1
qemu-server: not correctly installed
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.13-pve1~bpo2

Not yet upgraded Proxmox host:
Code:
[14:42:49][root@rwb071(4)]:/var/log/lxc
(1)#: pveversion -v
proxmox-ve: 5.2-2 (running kernel: 4.15.18-4-pve)
pve-manager: 5.2-8 (running version: 5.2-8/fdf39912)
pve-kernel-4.15: 5.2-7
pve-kernel-4.15.18-4-pve: 4.15.18-23
pve-kernel-4.13.13-2-pve: 4.13.13-33
corosync: 2.4.2-pve5
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.0-8
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-38
libpve-guest-common-perl: 2.0-17
libpve-http-server-perl: 2.0-10
libpve-storage-perl: 5.0-27
libqb0: 1.0.1-1
lvm2: 2.02.168-pve6
lxc-pve: 3.0.2+pve1-2
lxcfs: 3.0.0-1
novnc-pve: 1.0.0-2
openvswitch-switch: 2.7.0-3
proxmox-widget-toolkit: 1.0-19
pve-cluster: 5.0-30
pve-container: 2.0-26
pve-docs: 5.2-8
pve-firewall: 3.0-14
pve-firmware: 2.0-5
pve-ha-manager: 2.0-5
pve-i18n: 1.0-6
pve-libspice-server1: 0.12.8-3
pve-qemu-kvm: 2.11.2-1
pve-xtermjs: 1.0-5
qemu-server: 5.0-33
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.9-pve1~bpo9

Note that on the upgraded Proxmox host we have this:
Code:
pve-manager: not correctly installed (running version: 5.4-5/c6fdb264)

So we run:
Code:
[16:25:48][root@rwb069(4)]:~
(0)#: apt install pve-manager
E: dpkg was interrupted, you must manually run 'dpkg --configure -a' to correct the problem. 
[16:28:08][root@rwb069(4)]:~
(100)#: dpkg --configure -a
Setting up pve-ha-manager (2.0-9) ...


Which hangs at:
Code:
root     10762  0.0  0.0  95220  6840 ?        Ss   16:25   0:00  \_ sshd: root@pts/4
root     10788  0.0  0.0  21576  5504 pts/4    Ss   16:25   0:00  |   \_ -bash
root     11002  0.0  0.0  19744  5332 pts/4    S+   16:28   0:00  |       \_ dpkg --configure -a
root     11003  0.0  0.0   4280  1268 pts/4    S+   16:28   0:00  |           \_ /bin/sh /var/lib/dpkg/info/pve-ha-manager.postinst configur
root     11029  0.0  0.0  39596  4824 pts/4    S+   16:28   0:00  |               \_ /bin/systemctl try-restart pve-ha-lrm.service
root     11030  0.0  0.0  37808  2148 pts/4    S+   16:28   0:00  |                   \_ /bin/systemd-tty-ask-password-agent --watch


Code:
[16:33:49][root@rwb069(5)]:~
(0)#: systemctl list-jobs
JOB UNIT               TYPE    STATE  
337 pvesr.service      start   running
550 pve-ha-lrm.service restart running


2 jobs listed.

Code:
[16:37:22][root@rwb069(5)]:~
(0)#: journalctl -u pve-ha-lrm.service
-- Logs begin at Fri 2019-05-10 14:54:11 CEST, end at Fri 2019-05-10 16:37:41 CEST. --
May 10 14:54:23 rwb069 systemd[1]: Starting PVE Local HA Ressource Manager Daemon...
May 10 14:54:24 rwb069 pve-ha-lrm[3305]: starting server
May 10 14:54:24 rwb069 pve-ha-lrm[3305]: status change startup => wait_for_agent_lock
May 10 14:54:24 rwb069 systemd[1]: Started PVE Local HA Ressource Manager Daemon.
May 10 16:28:14 rwb069 systemd[1]: Stopping PVE Local HA Ressource Manager Daemon...
[16:37:45][root@rwb069(5)]:~
(0)#: journalctl -u pvesr.service
-- Logs begin at Fri 2019-05-10 14:54:11 CEST, end at Fri 2019-05-10 16:37:57 CEST. --
May 10 14:55:00 rwb069 systemd[1]: Starting Proxmox VE replication runner...
May 10 14:55:00 rwb069 pvesr[3398]: trying to acquire cfs lock 'file-replication_cfg' ...
May 10 14:55:01 rwb069 pvesr[3398]: trying to acquire cfs lock 'file-replication_cfg' ...
May 10 14:55:02 rwb069 pvesr[3398]: trying to acquire cfs lock 'file-replication_cfg' ...
May 10 14:55:03 rwb069 pvesr[3398]: trying to acquire cfs lock 'file-replication_cfg' ...
May 10 14:55:04 rwb069 pvesr[3398]: trying to acquire cfs lock 'file-replication_cfg' ...
May 10 14:55:05 rwb069 pvesr[3398]: trying to acquire cfs lock 'file-replication_cfg' ...
May 10 14:55:06 rwb069 pvesr[3398]: trying to acquire cfs lock 'file-replication_cfg' ...
May 10 14:55:07 rwb069 pvesr[3398]: trying to acquire cfs lock 'file-replication_cfg' ...
May 10 14:55:08 rwb069 pvesr[3398]: trying to acquire cfs lock 'file-replication_cfg' ...
May 10 14:55:09 rwb069 pvesr[3398]: error with cfs lock 'file-replication_cfg': no quorum!
May 10 14:55:09 rwb069 systemd[1]: pvesr.service: Main process exited, code=exited, status=13/n/a
May 10 14:55:09 rwb069 systemd[1]: Failed to start Proxmox VE replication runner.
May 10 14:55:09 rwb069 systemd[1]: pvesr.service: Unit entered failed state.
May 10 14:55:09 rwb069 systemd[1]: pvesr.service: Failed with result 'exit-code'.
May 10 14:56:00 rwb069 systemd[1]: Starting Proxmox VE replication runner...
May 10 14:56:00 rwb069 pvesr[3516]: trying to acquire cfs lock 'file-replication_cfg' ...
May 10 14:56:01 rwb069 pvesr[3516]: trying to acquire cfs lock 'file-replication_cfg' ...
May 10 14:56:02 rwb069 pvesr[3516]: trying to acquire cfs lock 'file-replication_cfg' ...
May 10 14:56:03 rwb069 pvesr[3516]: trying to acquire cfs lock 'file-replication_cfg' ...
May 10 14:56:04 rwb069 pvesr[3516]: trying to acquire cfs lock 'file-replication_cfg' ...
May 10 14:56:05 rwb069 pvesr[3516]: trying to acquire cfs lock 'file-replication_cfg' ...
May 10 14:56:06 rwb069 pvesr[3516]: trying to acquire cfs lock 'file-replication_cfg' ...
May 10 14:56:07 rwb069 pvesr[3516]: trying to acquire cfs lock 'file-replication_cfg' ...
May 10 14:56:08 rwb069 pvesr[3516]: trying to acquire cfs lock 'file-replication_cfg' ...
May 10 14:56:09 rwb069 pvesr[3516]: error with cfs lock 'file-replication_cfg': no quorum!
May 10 14:56:09 rwb069 systemd[1]: pvesr.service: Main process exited, code=exited, status=13/n/a
May 10 14:56:09 rwb069 systemd[1]: Failed to start Proxmox VE replication runner.
May 10 14:56:09 rwb069 systemd[1]: pvesr.service: Unit entered failed state.
May 10 14:56:09 rwb069 systemd[1]: pvesr.service: Failed with result 'exit-code'.
May 10 14:57:00 rwb069 systemd[1]: Starting Proxmox VE replication runner...


Please advise.
 
Hi,
it seems that you have lost quorum. What's the output of `pvecm status` and `systemctl status corosync.service`?
 
Code:
(0)#: pvecm status
Quorum information
------------------
Date:             Sat May 11 11:41:01 2019
Quorum provider:  corosync_votequorum
Nodes:            6
Node ID:          0x00000001
Ring ID:          1/78560
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   6
Highest expected: 6
Total votes:      6
Quorum:           4
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 10.2.10.69 (local)
0x00000002          1 10.2.10.70
0x00000003          1 10.2.10.71
0x00000006          1 10.2.10.230
0x00000005          1 10.2.10.231
0x00000004          1 10.2.10.237

Code:
● corosync.service - Corosync Cluster Engine
   Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
   Active: active (running) since Fri 2019-05-10 17:22:32 CEST; 18h ago
     Docs: man:corosync
           man:corosync.conf
           man:corosync_overview
 Main PID: 3052 (corosync)
    Tasks: 2 (limit: 8601)
   Memory: 728.8M
      CPU: 32min 9.408s
   CGroup: /system.slice/corosync.service
           └─3052 /usr/sbin/corosync -f

May 11 11:22:45 rwb069 corosync[3052]:  [CPG   ] downlist left_list: 0 received
May 11 11:22:45 rwb069 corosync[3052]: warning [CPG   ] downlist left_list: 0 received
May 11 11:22:45 rwb069 corosync[3052]:  [CPG   ] downlist left_list: 0 received
May 11 11:22:45 rwb069 corosync[3052]:  [CPG   ] downlist left_list: 0 received
May 11 11:22:45 rwb069 corosync[3052]:  [CPG   ] downlist left_list: 0 received
May 11 11:22:45 rwb069 corosync[3052]:  [CPG   ] downlist left_list: 0 received
May 11 11:22:45 rwb069 corosync[3052]: notice  [QUORUM] Members[6]: 1 2 3 6 5 4
May 11 11:22:45 rwb069 corosync[3052]: notice  [MAIN  ] Completed service synchronization, ready to provide service.
May 11 11:22:45 rwb069 corosync[3052]:  [QUORUM] Members[6]: 1 2 3 6 5 4
May 11 11:22:45 rwb069 corosync[3052]:  [MAIN  ] Completed service synchronization, ready to provide service.
 
OK, it seems you have quorum again. Do you still encounter the same issues when running `dpkg --configure -a`?
 
Update: In our cluster of 6 nodes, /etc/pve is not accessible on 3 of them (ls /etc/pve just hangs), which is why lots of commands simply hang.

node1: /etc/pve hangs
node2: /etc/pve works
node3: /etc/pve hangs
node4: /etc/pve hangs
node5: /etc/pve works
node6: /etc/pve works


All nodes say they have quorum.

I've fixed the dpkg error; I needed to start pmxcfs in local mode and then continue the update.

Code:
systemctl stop pve-cluster
systemctl stop corosync
pmxcfs -l

In local mode, /etc/pve is accessible and I'm able to run the VM backup commands.
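
For reference, a sketch of how local mode can be left again once we're done (assuming nothing else keeps the locally started pmxcfs running):
Code:
# stop the pmxcfs instance started with 'pmxcfs -l'
killall pmxcfs
# rejoin the cluster
systemctl start corosync
systemctl start pve-cluster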
 
Please check the file permissions of /var/lib/pve-cluster/config.db (should be 0600) and try to restart the pve-cluster service on the nodes where /etc/pve hangs. What's the status of `systemctl status pve-cluster`? Any errors in `journalctl -u pve-cluster`?
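
For example, on an affected node:
Code:
ls -l /var/lib/pve-cluster/config.db    # should show -rw------- (0600), owned by root
systemctl restart pve-cluster
systemctl status pve-cluster
journalctl -u pve-cluster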
 
I've restarted (stop/start) pve-cluster and corosync on the whole cluster.

Permissions on /var/lib/pve-cluster/config.db are 0600.
/etc/pve is accessible again and the cluster is operational.
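
Roughly the restart sequence per node (a sketch, one node at a time, standard service names assumed):
Code:
systemctl stop pve-cluster corosync
systemctl start corosync
systemctl start pve-cluster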

Since we're operational now, for consistency we've decided to do a fresh install of the entire cluster (one node at a time).

Solved.

Thx.
 
Happy to hear that!
 
On the node that hangs I see:
Code:
root@pve2:~# systemctl status pve-cluster
● pve-cluster.service - The Proxmox VE cluster filesystem
   Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled; vendor preset: enabled)
   Active: active (running) since Mon 2019-05-13 21:49:07 MSK; 16min ago
  Process: 6765 ExecStartPost=/usr/bin/pvecm updatecerts --silent (code=exited, status=0/SUCCESS)
  Process: 6737 ExecStart=/usr/bin/pmxcfs (code=exited, status=0/SUCCESS)
 Main PID: 6748 (pmxcfs)
    Tasks: 7 (limit: 9830)
   Memory: 49.7M
      CPU: 1.152s
   CGroup: /system.slice/pve-cluster.service
           └─6748 /usr/bin/pmxcfs

May 13 21:59:42 pve2 pmxcfs[6748]: [status] notice: received log
May 13 21:59:48 pve2 pmxcfs[6748]: [status] notice: received log
May 13 22:02:51 pve2 pmxcfs[6748]: [status] notice: node lost quorum
May 13 22:02:51 pve2 pmxcfs[6748]: [dcdb] notice: members: 2/6748
May 13 22:02:51 pve2 pmxcfs[6748]: [status] notice: members: 2/6748
May 13 22:02:51 pve2 pmxcfs[6748]: [dcdb] crit: received write while not quorate - trigger resync
May 13 22:02:51 pve2 pmxcfs[6748]: [dcdb] crit: leaving CPG group
May 13 22:02:52 pve2 pmxcfs[6748]: [dcdb] notice: start cluster connection
May 13 22:02:52 pve2 pmxcfs[6748]: [dcdb] notice: members: 2/6748
May 13 22:02:52 pve2 pmxcfs[6748]: [dcdb] notice: all data is up to date

Code:
root@pve2:~# pveversion -v
proxmox-ve: 5.4-1 (running kernel: 4.15.18-14-pve)
pve-manager: 5.4-5 (running version: 5.4-5/c6fdb264)
pve-kernel-4.15: 5.4-2
pve-kernel-4.13: 5.2-2
pve-kernel-4.15.18-14-pve: 4.15.18-38
pve-kernel-4.15.18-11-pve: 4.15.18-34
pve-kernel-4.15.18-8-pve: 4.15.18-28
pve-kernel-4.15.18-7-pve: 4.15.18-27
pve-kernel-4.15.18-4-pve: 4.15.18-23
pve-kernel-4.15.17-2-pve: 4.15.17-10
pve-kernel-4.15.17-1-pve: 4.15.17-9
pve-kernel-4.13.16-4-pve: 4.13.16-51
pve-kernel-4.13.16-3-pve: 4.13.16-50
pve-kernel-4.13.16-2-pve: 4.13.16-48
pve-kernel-4.13.16-1-pve: 4.13.16-46
pve-kernel-4.13.13-5-pve: 4.13.13-38
pve-kernel-4.13.13-2-pve: 4.13.13-33
corosync: 2.4.4-pve1
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.1-9
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-51
libpve-guest-common-perl: 2.0-20
libpve-http-server-perl: 2.0-13
libpve-storage-perl: 5.0-42
libqb0: 1.0.3-1~bpo9
lvm2: 2.02.168-pve6
lxc-pve: 3.1.0-3
lxcfs: 3.0.3-pve1
novnc-pve: 1.0.0-3
proxmox-widget-toolkit: 1.0-26
pve-cluster: 5.0-37
pve-container: 2.0-37
pve-docs: 5.4-2
pve-edk2-firmware: 1.20190312-1
pve-firewall: 3.0-20
pve-firmware: 2.0-6
pve-ha-manager: 2.0-9
pve-i18n: 1.1-4
pve-libspice-server1: 0.14.1-2
pve-qemu-kvm: 3.0.1-2
pve-xtermjs: 3.12.0-1
qemu-server: 5.0-51
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.13-pve1~bpo2

Code:
root@pve2:~# cat /etc/pve/corosync.conf 
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: pve1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.64.200.200
  }
  node {
    name: pve2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.64.200.202
  }
  node {
    name: pve3
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 10.64.200.203
  }
  node {
    name: pve4A
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.64.200.204
  }
  node {
    name: pve4B
    nodeid: 5
    quorum_votes: 1
    ring0_addr: 10.64.200.205
  }
  node {
    name: pve4C
    nodeid: 6
    quorum_votes: 1
    ring0_addr: 10.64.200.206
  }
  node {
    name: pve4D
    nodeid: 7
    quorum_votes: 1
    ring0_addr: 10.64.200.207
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: pve-cluster
  config_version: 9
  interface {
    bindnetaddr: 10.64.200.200
    ringnumber: 0
  }
  ip_version: ipv4
  secauth: on
  version: 2
}

Before the upgrade to 5.4 (i.e. while still on 5.3), everything worked fine for a while.
 
Here is a log from the node that became part of the cluster and then left:

Code:
May 14 10:26:08 pve4A corosync[20900]: notice  [MAIN  ] Completed service synchronization, ready to provide service.
May 14 10:26:08 pve4A corosync[20900]:  [CPG   ] downlist left_list: 0 received
May 14 10:26:08 pve4A corosync[20900]:  [QUORUM] Members[7]: 1 2 4 3 5 6 7
May 14 10:26:08 pve4A pvesr[23365]: trying to acquire cfs lock 'file-replication_cfg' ...
May 14 10:26:08 pve4A corosync[20900]:  [MAIN  ] Completed service synchronization, ready to provide service.
May 14 10:26:09 pve4A pmxcfs[21107]: [dcdb] notice: members: 1/1297, 2/3495, 3/21107, 4/11069, 5/27190, 6/12205, 7/21753
May 14 10:26:09 pve4A pmxcfs[21107]: [dcdb] notice: starting data syncronisation
May 14 10:26:09 pve4A pmxcfs[21107]: [dcdb] notice: received sync request (epoch 1/1297/0000000B)
May 14 10:26:09 pve4A pmxcfs[21107]: [status] notice: members: 1/1297, 2/3495, 3/21107, 4/11069, 5/27190, 6/12205, 7/21753
May 14 10:26:09 pve4A pmxcfs[21107]: [status] notice: starting data syncronisation
May 14 10:26:09 pve4A pmxcfs[21107]: [status] notice: received sync request (epoch 1/1297/00000009)
May 14 10:26:09 pve4A pmxcfs[21107]: [dcdb] notice: received all states
May 14 10:26:09 pve4A pmxcfs[21107]: [dcdb] notice: leader is 2/3495
May 14 10:26:09 pve4A pmxcfs[21107]: [dcdb] notice: synced members: 2/3495, 3/21107, 4/11069, 5/27190, 6/12205, 7/21753
May 14 10:26:09 pve4A pmxcfs[21107]: [dcdb] notice: all data is up to date
May 14 10:26:09 pve4A pmxcfs[21107]: [status] notice: received all states
May 14 10:26:09 pve4A pmxcfs[21107]: [status] notice: all data is up to date
May 14 10:26:09 pve4A pvesr[23365]: trying to acquire cfs lock 'file-replication_cfg' ...
May 14 10:26:10 pve4A systemd[1]: Started Proxmox VE replication runner.
May 14 10:26:43 pve4A pvedaemon[2253]: <root@pam> starting task UPID:pve4A:00005D0E:00446EA5:5CDA6DB3:vncshell::root@pam:
May 14 10:26:43 pve4A pvedaemon[23822]: starting termproxy UPID:pve4A:00005D0E:00446EA5:5CDA6DB3:vncshell::root@pam:
May 14 10:26:44 pve4A pvedaemon[2253]: <root@pam> successful auth for user 'root@pam'
May 14 10:26:48 pve4A pvedaemon[2253]: <root@pam> end task UPID:pve4A:00005D0E:00446EA5:5CDA6DB3:vncshell::root@pam: OK
May 14 10:26:48 pve4A pvedaemon[2253]: <root@pam> starting task UPID:pve4A:00005D44:004470CA:5CDA6DB8:vncshell::root@pam:
May 14 10:26:48 pve4A pvedaemon[23876]: starting termproxy UPID:pve4A:00005D44:004470CA:5CDA6DB8:vncshell::root@pam:
May 14 10:26:49 pve4A pvedaemon[2255]: <root@pam> successful auth for user 'root@pam'
May 14 10:26:54 pve4A pvedaemon[2253]: <root@pam> end task UPID:pve4A:00005D44:004470CA:5CDA6DB8:vncshell::root@pam: OK
May 14 10:26:54 pve4A pvedaemon[23924]: starting termproxy UPID:pve4A:00005D74:00447331:5CDA6DBE:vncshell::root@pam:
May 14 10:26:54 pve4A pvedaemon[2253]: <root@pam> starting task UPID:pve4A:00005D74:00447331:5CDA6DBE:vncshell::root@pam:
May 14 10:26:55 pve4A pvedaemon[2255]: <root@pam> successful auth for user 'root@pam'
May 14 10:26:59 pve4A pvedaemon[2253]: <root@pam> end task UPID:pve4A:00005D74:00447331:5CDA6DBE:vncshell::root@pam: OK
May 14 10:26:59 pve4A pvedaemon[2253]: <root@pam> starting task UPID:pve4A:00005DB9:004474E5:5CDA6DC3:vncshell::root@pam:
May 14 10:26:59 pve4A pvedaemon[23993]: starting termproxy UPID:pve4A:00005DB9:004474E5:5CDA6DC3:vncshell::root@pam:
May 14 10:26:59 pve4A pvedaemon[2255]: <root@pam> successful auth for user 'root@pam'
May 14 10:26:59 pve4A systemd[1]: Created slice User Slice of root.
May 14 10:26:59 pve4A systemd[1]: Starting User Manager for UID 0...
May 14 10:26:59 pve4A systemd[1]: Started Session 17 of user root.
May 14 10:26:59 pve4A systemd[24003]: Listening on GnuPG cryptographic agent (access for web browsers).
May 14 10:26:59 pve4A systemd[24003]: Reached target Paths.
May 14 10:26:59 pve4A systemd[24003]: Listening on GnuPG cryptographic agent (ssh-agent emulation).
May 14 10:26:59 pve4A systemd[24003]: Listening on GnuPG cryptographic agent and passphrase cache (restricted).
May 14 10:26:59 pve4A systemd[24003]: Reached target Timers.
May 14 10:26:59 pve4A systemd[24003]: Listening on GnuPG cryptographic agent and passphrase cache.
May 14 10:26:59 pve4A systemd[24003]: Reached target Sockets.
May 14 10:26:59 pve4A systemd[24003]: Reached target Basic System.
May 14 10:26:59 pve4A systemd[24003]: Reached target Default.
May 14 10:26:59 pve4A systemd[24003]: Startup finished in 29ms.
May 14 10:26:59 pve4A systemd[1]: Started User Manager for UID 0.
May 14 10:27:00 pve4A systemd[1]: Starting Proxmox VE replication runner...
May 14 10:27:05 pve4A systemd[1]: Stopping User Manager for UID 0...
May 14 10:27:05 pve4A systemd[24003]: Stopped target Default.
May 14 10:27:05 pve4A systemd[24003]: Stopped target Basic System.
May 14 10:27:05 pve4A systemd[24003]: Stopped target Timers.
May 14 10:27:05 pve4A systemd[24003]: Stopped target Sockets.
May 14 10:27:05 pve4A systemd[24003]: Closed GnuPG cryptographic agent and passphrase cache (restricted).
May 14 10:27:05 pve4A systemd[24003]: Closed GnuPG cryptographic agent (ssh-agent emulation).
May 14 10:27:05 pve4A systemd[24003]: Stopped target Paths.
May 14 10:27:05 pve4A systemd[24003]: Closed GnuPG cryptographic agent (access for web browsers).
May 14 10:27:05 pve4A systemd[24003]: Closed GnuPG cryptographic agent and passphrase cache.
May 14 10:27:05 pve4A systemd[24003]: Reached target Shutdown.
May 14 10:27:05 pve4A systemd[24003]: Starting Exit the Session...
May 14 10:27:05 pve4A systemd[24003]: Received SIGRTMIN+24 from PID 24093 (kill).
May 14 10:27:05 pve4A pvedaemon[2253]: <root@pam> end task UPID:pve4A:00005DB9:004474E5:5CDA6DC3:vncshell::root@pam: OK
May 14 10:27:05 pve4A systemd[1]: Stopped User Manager for UID 0.
May 14 10:27:05 pve4A systemd[1]: Removed slice User Slice of root.
May 14 10:27:05 pve4A pvedaemon[2255]: <root@pam> starting task UPID:pve4A:00005E2F:0044777D:5CDA6DC9:vncshell::root@pam:
May 14 10:27:05 pve4A pvedaemon[24111]: starting termproxy UPID:pve4A:00005E2F:0044777D:5CDA6DC9:vncshell::root@pam:
May 14 10:27:06 pve4A pvedaemon[2253]: <root@pam> successful auth for user 'root@pam'
May 14 10:27:10 pve4A corosync[20900]: error   [TOTEM ] FAILED TO RECEIVE
May 14 10:27:10 pve4A corosync[20900]:  [TOTEM ] FAILED TO RECEIVE
May 14 10:27:11 pve4A pvedaemon[2255]: <root@pam> end task UPID:pve4A:00005E2F:0044777D:5CDA6DC9:vncshell::root@pam: OK
May 14 10:27:11 pve4A pvedaemon[24175]: starting termproxy UPID:pve4A:00005E6F:00447990:5CDA6DCF:vncshell::root@pam:
May 14 10:27:11 pve4A pvedaemon[2255]: <root@pam> starting task UPID:pve4A:00005E6F:00447990:5CDA6DCF:vncshell::root@pam:
May 14 10:27:11 pve4A pvedaemon[2254]: <root@pam> successful auth for user 'root@pam'
May 14 10:27:14 pve4A pvedaemon[2255]: <root@pam> end task UPID:pve4A:00005E6F:00447990:5CDA6DCF:vncshell::root@pam: OK
May 14 10:27:14 pve4A pvedaemon[2255]: <root@pam> starting task UPID:pve4A:00005E94:00447B03:5CDA6DD2:vncshell::root@pam:
May 14 10:27:14 pve4A pvedaemon[24212]: starting termproxy UPID:pve4A:00005E94:00447B03:5CDA6DD2:vncshell::root@pam:
May 14 10:27:15 pve4A corosync[20900]: notice  [TOTEM ] A new membership (10.64.200.204:2856) was formed. Members left: 1 2 4 5 6 7
May 14 10:27:15 pve4A corosync[20900]: notice  [TOTEM ] Failed to receive the leave message. failed: 1 2 4 5 6 7
May 14 10:27:15 pve4A corosync[20900]:  [TOTEM ] A new membership (10.64.200.204:2856) was formed. Members left: 1 2 4 5 6 7
May 14 10:27:15 pve4A corosync[20900]: warning [CPG   ] downlist left_list: 6 received
May 14 10:27:15 pve4A corosync[20900]: notice  [QUORUM] This node is within the non-primary component and will NOT provide any services.
May 14 10:27:15 pve4A corosync[20900]: notice  [QUORUM] Members[1]: 3
May 14 10:27:15 pve4A corosync[20900]: notice  [MAIN  ] Completed service synchronization, ready to provide service.
May 14 10:27:15 pve4A corosync[20900]:  [TOTEM ] Failed to receive the leave message. failed: 1 2 4 5 6 7
May 14 10:27:15 pve4A corosync[20900]:  [CPG   ] downlist left_list: 6 received
May 14 10:27:15 pve4A pmxcfs[21107]: [dcdb] notice: members: 3/21107
May 14 10:27:15 pve4A pmxcfs[21107]: [status] notice: members: 3/21107
May 14 10:27:15 pve4A corosync[20900]:  [QUORUM] This node is within the non-primary component and will NOT provide any services.
May 14 10:27:15 pve4A corosync[20900]:  [QUORUM] Members[1]: 3
May 14 10:27:15 pve4A corosync[20900]:  [MAIN  ] Completed service synchronization, ready to provide service.
May 14 10:27:15 pve4A pmxcfs[21107]: [status] notice: node lost quorum
May 14 10:27:15 pve4A pvesr[24036]: trying to acquire cfs lock 'file-replication_cfg' ...
May 14 10:27:15 pve4A pmxcfs[21107]: [dcdb] notice: cpg_send_message retried 1 times
May 14 10:27:15 pve4A pmxcfs[21107]: [dcdb] crit: received write while not quorate - trigger resync
May 14 10:27:15 pve4A pmxcfs[21107]: [dcdb] crit: leaving CPG group
May 14 10:27:15 pve4A pve-ha-lrm[2318]: unable to write lrm status file - closing file '/etc/pve/nodes/pve4A/lrm_status.tmp.2318' failed - Operation not permitted
May 14 10:27:15 pve4A pvedaemon[2254]: <root@pam> successful auth for user 'root@pam'
May 14 10:27:15 pve4A pmxcfs[21107]: [dcdb] notice: start cluster connection
May 14 10:27:15 pve4A pmxcfs[21107]: [dcdb] notice: members: 3/21107
May 14 10:27:15 pve4A pmxcfs[21107]: [dcdb] notice: all data is up to date
May 14 10:27:16 pve4A pvesr[24036]: trying to acquire cfs lock 'file-replication_cfg' ...
May 14 10:27:17 pve4A pvesr[24036]: trying to acquire cfs lock 'file-replication_cfg' ...
May 14 10:27:18 pve4A pvesr[24036]: trying to acquire cfs lock 'file-replication_cfg' ...
May 14 10:27:19 pve4A pvesr[24036]: trying to acquire cfs lock 'file-replication_cfg' ...
May 14 10:27:20 pve4A pvesr[24036]: trying to acquire cfs lock 'file-replication_cfg' ...
May 14 10:27:21 pve4A pvesr[24036]: trying to acquire cfs lock 'file-replication_cfg' ...
May 14 10:27:22 pve4A pvesr[24036]: trying to acquire cfs lock 'file-replication_cfg' ...
May 14 10:27:23 pve4A pvesr[24036]: trying to acquire cfs lock 'file-replication_cfg' ...
May 14 10:27:24 pve4A pvesr[24036]: error with cfs lock 'file-replication_cfg': no quorum!
May 14 10:27:24 pve4A systemd[1]: pvesr.service: Main process exited, code=exited, status=13/n/a
May 14 10:27:24 pve4A systemd[1]: Failed to start Proxmox VE replication runner.
May 14 10:27:24 pve4A systemd[1]: pvesr.service: Unit entered failed state.
May 14 10:27:24 pve4A systemd[1]: pvesr.service: Failed with result 'exit-code'.
May 14 10:27:29 pve4A pveproxy[2292]: worker 30623 finished
May 14 10:27:29 pve4A pveproxy[2292]: starting 1 worker(s)
May 14 10:27:29 pve4A pveproxy[2292]: worker 24371 started
May 14 10:27:31 pve4A pveproxy[24370]: got inotify poll request in wrong process - disabling inotify
May 14 10:27:33 pve4A pvedaemon[2255]: <root@pam> end task UPID:pve4A:00005E94:00447B03:5CDA6DD2:vncshell::root@pam: OK
May 14 10:27:33 pve4A pvedaemon[2255]: <root@pam> starting task UPID:pve4A:00005F67:00448273:5CDA6DE5:vncshell::root@pam:
May 14 10:27:33 pve4A pvedaemon[24423]: starting termproxy UPID:pve4A:00005F67:00448273:5CDA6DE5:vncshell::root@pam:
May 14 10:27:34 pve4A pveproxy[24370]: worker exit
May 14 10:27:34 pve4A pvedaemon[2254]: <root@pam> successful auth for user 'root@pam'
May 14 10:27:34 pve4A systemd[1]: Created slice User Slice of root.
May 14 10:27:34 pve4A systemd[1]: Starting User Manager for UID 0...
May 14 10:27:34 pve4A systemd[1]: Started Session 19 of user root.
May 14 10:27:34 pve4A systemd[24433]: Listening on GnuPG cryptographic agent (access for web browsers).
May 14 10:27:34 pve4A systemd[24433]: Reached target Timers.
May 14 10:27:34 pve4A systemd[24433]: Listening on GnuPG cryptographic agent and passphrase cache (restricted).
May 14 10:27:34 pve4A systemd[24433]: Listening on GnuPG cryptographic agent (ssh-agent emulation).
May 14 10:27:34 pve4A systemd[24433]: Listening on GnuPG cryptographic agent and passphrase cache.
May 14 10:27:34 pve4A systemd[24433]: Reached target Sockets.
May 14 10:27:34 pve4A systemd[24433]: Reached target Paths.
May 14 10:27:34 pve4A systemd[24433]: Reached target Basic System.
May 14 10:27:34 pve4A systemd[24433]: Reached target Default.
May 14 10:27:34 pve4A systemd[24433]: Startup finished in 21ms.
May 14 10:27:34 pve4A systemd[1]: Started User Manager for UID 0.
May 14 10:27:35 pve4A systemd[1]: Stopping User Manager for UID 0...
May 14 10:27:35 pve4A systemd[24433]: Stopped target Default.
May 14 10:27:35 pve4A systemd[24433]: Stopped target Basic System.
May 14 10:27:35 pve4A systemd[24433]: Stopped target Timers.
May 14 10:27:35 pve4A systemd[24433]: Stopped target Sockets.
May 14 10:27:35 pve4A systemd[24433]: Closed GnuPG cryptographic agent and passphrase cache (restricted).
May 14 10:27:35 pve4A systemd[24433]: Closed GnuPG cryptographic agent (access for web browsers).
May 14 10:27:35 pve4A systemd[24433]: Stopped target Paths.
May 14 10:27:35 pve4A systemd[24433]: Closed GnuPG cryptographic agent (ssh-agent emulation).
May 14 10:27:35 pve4A systemd[24433]: Closed GnuPG cryptographic agent and passphrase cache.
May 14 10:27:35 pve4A systemd[24433]: Reached target Shutdown.
May 14 10:27:35 pve4A systemd[24433]: Starting Exit the Session...
May 14 10:27:35 pve4A systemd[24433]: Received SIGRTMIN+24 from PID 24460 (kill).
May 14 10:27:35 pve4A systemd[1]: Stopped User Manager for UID 0.
May 14 10:27:35 pve4A systemd[1]: Removed slice User Slice of root.
May 14 10:27:35 pve4A pvedaemon[2255]: <root@pam> end task UPID:pve4A:00005F67:00448273:5CDA6DE5:vncshell::root@pam: OK
May 14 10:27:35 pve4A pveproxy[1656]: worker exit
May 14 10:27:35 pve4A pveproxy[2292]: worker 1656 finished
May 14 10:27:35 pve4A pveproxy[2292]: starting 1 worker(s)
May 14 10:27:35 pve4A pveproxy[2292]: worker 24482 started
May 14 10:27:39 pve4A pveproxy[24371]: Clearing outdated entries from certificate cache
May 14 10:27:43 pve4A pvedaemon[24588]: starting termproxy UPID:pve4A:0000600C:0044864C:5CDA6DEF:vncshell::root@pam:
May 14 10:27:43 pve4A pvedaemon[2253]: <root@pam> starting task UPID:pve4A:0000600C:0044864C:5CDA6DEF:vncshell::root@pam:
May 14 10:27:44 pve4A pvedaemon[2255]: <root@pam> successful auth for user 'root@pam'
May 14 10:27:44 pve4A systemd[1]: Created slice User Slice of root.
May 14 10:27:44 pve4A systemd[1]: Starting User Manager for UID 0...
May 14 10:27:44 pve4A systemd[1]: Started Session 21 of user root.
May 14 10:27:44 pve4A systemd[24598]: Listening on GnuPG cryptographic agent and passphrase cache (restricted).
May 14 10:27:44 pve4A systemd[24598]: Reached target Timers.
May 14 10:27:44 pve4A systemd[24598]: Listening on GnuPG cryptographic agent and passphrase cache.
May 14 10:27:44 pve4A systemd[24598]: Listening on GnuPG cryptographic agent (access for web browsers).
May 14 10:27:44 pve4A systemd[24598]: Listening on GnuPG cryptographic agent (ssh-agent emulation).
May 14 10:27:44 pve4A systemd[24598]: Reached target Sockets.
May 14 10:27:44 pve4A systemd[24598]: Reached target Paths.
May 14 10:27:44 pve4A systemd[24598]: Reached target Basic System.
May 14 10:27:44 pve4A systemd[24598]: Reached target Default.
May 14 10:27:44 pve4A systemd[24598]: Startup finished in 21ms.
May 14 10:27:44 pve4A systemd[1]: Started User Manager for UID 0.
May 14 10:27:48 pve4A pveproxy[24482]: Clearing outdated entries from certificate cache
 
This morning I restarted corosync on all the nodes again. The cluster was working for a couple of minutes and then hung:

Code:
May 15 09:40:10 pve1 systemd[1]: Starting Corosync Cluster Engine...
May 15 09:40:10 pve1 corosync[24728]:  [MAIN  ] Corosync Cluster Engine ('2.4.4-dirty'): started and ready to provide service.
May 15 09:40:10 pve1 corosync[24728]:  [MAIN  ] Corosync built-in features: dbus rdma monitoring watchdog systemd xmlconf qdevices qnetd snmp pie relro bindnow
May 15 09:40:10 pve1 corosync[24728]: notice  [MAIN  ] Corosync Cluster Engine ('2.4.4-dirty'): started and ready to provide service.
May 15 09:40:10 pve1 corosync[24728]: info    [MAIN  ] Corosync built-in features: dbus rdma monitoring watchdog systemd xmlconf qdevices qnetd snmp pie relro bindnow
May 15 09:40:10 pve1 corosync[24728]:  [MAIN  ] interface section bindnetaddr is used together with nodelist. Nodelist one is going to be used.
May 15 09:40:10 pve1 corosync[24728]: warning [MAIN  ] interface section bindnetaddr is used together with nodelist. Nodelist one is going to be used.
May 15 09:40:10 pve1 corosync[24728]: warning [MAIN  ] Please migrate config file to nodelist.
May 15 09:40:10 pve1 corosync[24728]:  [MAIN  ] Please migrate config file to nodelist.
May 15 09:40:10 pve1 corosync[24728]: notice  [TOTEM ] Initializing transport (UDP/IP Multicast).
May 15 09:40:10 pve1 corosync[24728]: notice  [TOTEM ] Initializing transmit/receive security (NSS) crypto: aes256 hash: sha1
May 15 09:40:10 pve1 corosync[24728]:  [TOTEM ] Initializing transport (UDP/IP Multicast).
May 15 09:40:10 pve1 corosync[24728]:  [TOTEM ] Initializing transmit/receive security (NSS) crypto: aes256 hash: sha1
May 15 09:40:10 pve1 pmxcfs[27969]: [quorum] crit: quorum_initialize failed: 2
May 15 09:40:10 pve1 pmxcfs[27969]: [quorum] crit: can't initialize service
May 15 09:40:10 pve1 pmxcfs[27969]: [confdb] crit: cmap_initialize failed: 2
May 15 09:40:10 pve1 pmxcfs[27969]: [confdb] crit: can't initialize service
May 15 09:40:10 pve1 pmxcfs[27969]: [dcdb] notice: start cluster connection
May 15 09:40:10 pve1 pmxcfs[27969]: [dcdb] crit: cpg_initialize failed: 2
May 15 09:40:10 pve1 pmxcfs[27969]: [dcdb] crit: can't initialize service
May 15 09:40:10 pve1 pmxcfs[27969]: [status] notice: start cluster connection
May 15 09:40:10 pve1 pmxcfs[27969]: [status] crit: cpg_initialize failed: 2
May 15 09:40:10 pve1 pmxcfs[27969]: [status] crit: can't initialize service
May 15 09:40:10 pve1 corosync[24728]: notice  [TOTEM ] The network interface [10.64.200.200] is now up.
May 15 09:40:10 pve1 corosync[24728]: notice  [SERV  ] Service engine loaded: corosync configuration map access [0]
May 15 09:40:10 pve1 corosync[24728]: info    [QB    ] server name: cmap
May 15 09:40:10 pve1 corosync[24728]: notice  [SERV  ] Service engine loaded: corosync configuration service [1]
May 15 09:40:10 pve1 corosync[24728]: info    [QB    ] server name: cfg
May 15 09:40:10 pve1 corosync[24728]: notice  [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
May 15 09:40:10 pve1 corosync[24728]: info    [QB    ] server name: cpg
May 15 09:40:10 pve1 corosync[24728]: notice  [SERV  ] Service engine loaded: corosync profile loading service [4]
May 15 09:40:10 pve1 corosync[24728]:  [TOTEM ] The network interface [10.64.200.200] is now up.
May 15 09:40:10 pve1 corosync[24728]: notice  [SERV  ] Service engine loaded: corosync resource monitoring service [6]
May 15 09:40:10 pve1 corosync[24728]: warning [WD    ] Watchdog not enabled by configuration
May 15 09:40:10 pve1 corosync[24728]: warning [WD    ] resource load_15min missing a recovery key.
May 15 09:40:10 pve1 corosync[24728]: warning [WD    ] resource memory_used missing a recovery key.
May 15 09:40:10 pve1 corosync[24728]: info    [WD    ] no resources configured.
May 15 09:40:10 pve1 corosync[24728]: notice  [SERV  ] Service engine loaded: corosync watchdog service [7]
May 15 09:40:10 pve1 corosync[24728]: notice  [QUORUM] Using quorum provider corosync_votequorum
May 15 09:40:10 pve1 corosync[24728]: notice  [SERV  ] Service engine loaded: corosync vote quorum service v1.0 [5]
May 15 09:40:10 pve1 corosync[24728]: info    [QB    ] server name: votequorum
May 15 09:40:10 pve1 corosync[24728]: notice  [SERV  ] Service engine loaded: corosync cluster quorum service v0.1 [3]
May 15 09:40:10 pve1 corosync[24728]: info    [QB    ] server name: quorum
May 15 09:40:10 pve1 corosync[24728]: notice  [TOTEM ] A new membership (10.64.200.200:4624) was formed. Members joined: 1
May 15 09:40:10 pve1 corosync[24728]:  [SERV  ] Service engine loaded: corosync configuration map access [0]
May 15 09:40:10 pve1 systemd[1]: Started Corosync Cluster Engine.
May 15 09:40:10 pve1 corosync[24728]: warning [CPG   ] downlist left_list: 0 received
May 15 09:40:10 pve1 corosync[24728]: notice  [QUORUM] Members[1]: 1
May 15 09:40:10 pve1 corosync[24728]: notice  [MAIN  ] Completed service synchronization, ready to provide service.
May 15 09:40:10 pve1 corosync[24728]:  [QB    ] server name: cmap
May 15 09:40:10 pve1 corosync[24728]:  [SERV  ] Service engine loaded: corosync configuration service [1]
May 15 09:40:10 pve1 corosync[24728]:  [QB    ] server name: cfg
May 15 09:40:10 pve1 corosync[24728]:  [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
May 15 09:40:10 pve1 corosync[24728]:  [QB    ] server name: cpg
May 15 09:40:10 pve1 corosync[24728]:  [SERV  ] Service engine loaded: corosync profile loading service [4]
May 15 09:40:10 pve1 corosync[24728]:  [SERV  ] Service engine loaded: corosync resource monitoring service [6]
May 15 09:40:10 pve1 corosync[24728]:  [WD    ] Watchdog not enabled by configuration
May 15 09:40:10 pve1 corosync[24728]:  [WD    ] resource load_15min missing a recovery key.
May 15 09:40:10 pve1 corosync[24728]:  [WD    ] resource memory_used missing a recovery key.
May 15 09:40:10 pve1 corosync[24728]:  [WD    ] no resources configured.
May 15 09:40:10 pve1 corosync[24728]:  [SERV  ] Service engine loaded: corosync watchdog service [7]
May 15 09:40:10 pve1 corosync[24728]:  [QUORUM] Using quorum provider corosync_votequorum
May 15 09:40:10 pve1 corosync[24728]:  [SERV  ] Service engine loaded: corosync vote quorum service v1.0 [5]
May 15 09:40:10 pve1 corosync[24728]:  [QB    ] server name: votequorum
May 15 09:40:10 pve1 corosync[24728]:  [SERV  ] Service engine loaded: corosync cluster quorum service v0.1 [3]
May 15 09:40:10 pve1 corosync[24728]:  [QB    ] server name: quorum
May 15 09:40:10 pve1 corosync[24728]:  [TOTEM ] A new membership (10.64.200.200:4624) was formed. Members joined: 1
May 15 09:40:10 pve1 corosync[24728]:  [CPG   ] downlist left_list: 0 received
May 15 09:40:10 pve1 corosync[24728]:  [QUORUM] Members[1]: 1
May 15 09:40:10 pve1 corosync[24728]:  [MAIN  ] Completed service synchronization, ready to provide service.
May 15 09:40:10 pve1 pvedaemon[7140]: <root@pam> end task UPID:pve1:00006088:00C8414B:5CDBB44A:srvrestart:corosync:root@pam: OK
May 15 09:40:10 pve1 corosync[24728]: notice  [TOTEM ] A new membership (10.64.200.200:4628) was formed. Members joined: 2 4 3 5 6 7
May 15 09:40:10 pve1 corosync[24728]:  [TOTEM ] A new membership (10.64.200.200:4628) was formed. Members joined: 2 4 3 5 6 7
May 15 09:40:10 pve1 corosync[24728]: warning [CPG   ] downlist left_list: 0 received
May 15 09:40:10 pve1 corosync[24728]: warning [CPG   ] downlist left_list: 0 received
May 15 09:40:10 pve1 corosync[24728]:  [CPG   ] downlist left_list: 0 received
May 15 09:40:10 pve1 corosync[24728]: warning [CPG   ] downlist left_list: 0 received
May 15 09:40:10 pve1 corosync[24728]: warning [CPG   ] downlist left_list: 0 received
May 15 09:40:10 pve1 corosync[24728]:  [CPG   ] downlist left_list: 0 received
May 15 09:40:10 pve1 corosync[24728]: warning [CPG   ] downlist left_list: 0 received
May 15 09:40:10 pve1 corosync[24728]:  [CPG   ] downlist left_list: 0 received
May 15 09:40:10 pve1 corosync[24728]:  [CPG   ] downlist left_list: 0 received
May 15 09:40:10 pve1 corosync[24728]: warning [CPG   ] downlist left_list: 0 received
May 15 09:40:10 pve1 corosync[24728]:  [CPG   ] downlist left_list: 0 received
May 15 09:40:10 pve1 corosync[24728]:  [CPG   ] downlist left_list: 0 received
May 15 09:40:10 pve1 corosync[24728]: warning [CPG   ] downlist left_list: 0 received
May 15 09:40:10 pve1 corosync[24728]:  [CPG   ] downlist left_list: 0 received
May 15 09:40:10 pve1 corosync[24728]: notice  [QUORUM] This node is within the primary component and will provide service.
May 15 09:40:10 pve1 corosync[24728]: notice  [QUORUM] Members[7]: 1 2 4 3 5 6 7
May 15 09:40:10 pve1 corosync[24728]: notice  [MAIN  ] Completed service synchronization, ready to provide service.
May 15 09:40:10 pve1 corosync[24728]:  [QUORUM] This node is within the primary component and will provide service.
May 15 09:40:10 pve1 corosync[24728]:  [QUORUM] Members[7]: 1 2 4 3 5 6 7
May 15 09:40:10 pve1 corosync[24728]:  [MAIN  ] Completed service synchronization, ready to provide service.
May 15 09:40:14 pve1 pvedaemon[25070]: re-starting service postfix: UPID:pve1:000061EE:00C842DD:5CDBB44E:srvrestart:postfix:root@pam:
May 15 09:40:14 pve1 pvedaemon[5678]: <root@pam> starting task UPID:pve1:000061EE:00C842DD:5CDBB44E:srvrestart:postfix:root@pam:
May 15 09:40:14 pve1 systemd[1]: Stopped Postfix Mail Transport Agent.
May 15 09:40:14 pve1 systemd[1]: Stopping Postfix Mail Transport Agent...
May 15 09:40:14 pve1 systemd[1]: Stopping Postfix Mail Transport Agent (instance -)...
May 15 09:40:14 pve1 postfix[25073]: Postfix is running with backwards-compatible default settings
May 15 09:40:14 pve1 postfix[25073]: See http://www.postfix.org/COMPATIBILITY_README.html for details
May 15 09:40:14 pve1 postfix[25073]: To disable backwards compatibility use "postconf compatibility_level=2" and "postfix reload"
May 15 09:40:14 pve1 postfix/postfix-script[25079]: stopping the Postfix mail system
May 15 09:40:14 pve1 postfix/master[3506]: terminating on signal 15
May 15 09:40:14 pve1 systemd[1]: Stopped Postfix Mail Transport Agent (instance -).
May 15 09:40:14 pve1 systemd[1]: Starting Postfix Mail Transport Agent (instance -)...
May 15 09:40:14 pve1 postfix[25148]: Postfix is running with backwards-compatible default settings
May 15 09:40:14 pve1 postfix[25148]: See http://www.postfix.org/COMPATIBILITY_README.html for details
May 15 09:40:14 pve1 postfix[25148]: To disable backwards compatibility use "postconf compatibility_level=2" and "postfix reload"
May 15 09:40:15 pve1 postfix/postfix-script[25261]: starting the Postfix mail system
May 15 09:40:15 pve1 postfix/master[25263]: daemon started -- version 3.1.12, configuration /etc/postfix
May 15 09:40:15 pve1 systemd[1]: Started Postfix Mail Transport Agent (instance -).
May 15 09:40:15 pve1 systemd[1]: Starting Postfix Mail Transport Agent...
May 15 09:40:15 pve1 systemd[1]: Started Postfix Mail Transport Agent.
May 15 09:40:15 pve1 pvedaemon[5678]: <root@pam> end task UPID:pve1:000061EE:00C842DD:5CDBB44E:srvrestart:postfix:root@pam: OK
May 15 09:40:16 pve1 pmxcfs[27969]: [status] notice: update cluster info (cluster name  pve-cluster, version = 9)
May 15 09:40:16 pve1 pmxcfs[27969]: [status] notice: node has quorum
May 15 09:40:16 pve1 pmxcfs[27969]: [dcdb] notice: members: 1/27969, 2/17635, 3/25391, 4/11069, 5/27190, 6/12205, 7/21753
May 15 09:40:16 pve1 pmxcfs[27969]: [dcdb] notice: starting data syncronisation
May 15 09:40:16 pve1 pmxcfs[27969]: [dcdb] notice: received sync request (epoch 1/27969/0000000F)
May 15 09:40:16 pve1 pmxcfs[27969]: [status] notice: members: 1/27969, 2/17635, 3/25391, 4/11069, 5/27190, 6/12205, 7/21753
May 15 09:40:16 pve1 pmxcfs[27969]: [status] notice: starting data syncronisation
May 15 09:40:16 pve1 pmxcfs[27969]: [status] notice: received sync request (epoch 1/27969/0000000D)
May 15 09:40:16 pve1 pmxcfs[27969]: [dcdb] notice: received all states
May 15 09:40:16 pve1 pmxcfs[27969]: [dcdb] notice: leader is 2/17635
May 15 09:40:16 pve1 pmxcfs[27969]: [dcdb] notice: synced members: 2/17635, 3/25391, 4/11069, 5/27190, 6/12205, 7/21753
May 15 09:40:16 pve1 pmxcfs[27969]: [dcdb] notice: waiting for updates from leader
May 15 09:40:16 pve1 pmxcfs[27969]: [dcdb] notice: update complete - trying to commit (got 10 inode updates)
May 15 09:40:16 pve1 pmxcfs[27969]: [dcdb] notice: all data is up to date
May 15 09:40:16 pve1 pmxcfs[27969]: [status] notice: received all states
May 15 09:40:16 pve1 pmxcfs[27969]: [status] notice: all data is up to date
May 15 09:40:19 pve1 pvedaemon[22829]: <root@pam> starting task UPID:pve1:00006414:00C844AA:5CDBB452:srvrestart:pve-cluster:root@pam:
May 15 09:40:19 pve1 pvedaemon[25620]: re-starting service pve-cluster: UPID:pve1:00006414:00C844AA:5CDBB452:srvrestart:pve-cluster:root@pam:
May 15 09:40:19 pve1 systemd[1]: Stopping The Proxmox VE cluster filesystem...
May 15 09:40:19 pve1 pmxcfs[27969]: [main] notice: teardown filesystem
May 15 09:40:19 pve1 pmxcfs[27969]: [main] notice: exit proxmox configuration filesystem (0)
May 15 09:40:19 pve1 systemd[1]: Stopped The Proxmox VE cluster filesystem.
May 15 09:40:19 pve1 systemd[1]: Starting The Proxmox VE cluster filesystem...
May 15 09:40:19 pve1 pmxcfs[25670]: [status] notice: update cluster info (cluster name  pve-cluster, version = 9)
May 15 09:40:19 pve1 pmxcfs[25670]: [status] notice: node has quorum
May 15 09:40:19 pve1 pmxcfs[25670]: [dcdb] notice: members: 1/25670, 2/17635, 3/25391, 4/11069, 5/27190, 6/12205, 7/21753
May 15 09:40:19 pve1 pmxcfs[25670]: [dcdb] notice: starting data syncronisation
May 15 09:40:19 pve1 pmxcfs[25670]: [status] notice: members: 1/25670, 2/17635, 3/25391, 4/11069, 5/27190, 6/12205, 7/21753
May 15 09:40:19 pve1 pmxcfs[25670]: [status] notice: starting data syncronisation
May 15 09:40:19 pve1 pmxcfs[25670]: [dcdb] notice: received sync request (epoch 1/25670/00000001)
May 15 09:40:19 pve1 pmxcfs[25670]: [status] notice: received sync request (epoch 1/25670/00000001)
May 15 09:40:19 pve1 pmxcfs[25670]: [dcdb] notice: received all states
May 15 09:40:19 pve1 pmxcfs[25670]: [dcdb] notice: leader is 1/25670
May 15 09:40:19 pve1 pmxcfs[25670]: [dcdb] notice: synced members: 1/25670, 2/17635, 3/25391, 4/11069, 5/27190, 6/12205, 7/21753
May 15 09:40:19 pve1 pmxcfs[25670]: [dcdb] notice: start sending inode updates
May 15 09:40:19 pve1 pmxcfs[25670]: [dcdb] notice: sent all (0) updates
May 15 09:40:19 pve1 pmxcfs[25670]: [dcdb] notice: all data is up to date
May 15 09:40:19 pve1 pmxcfs[25670]: [status] notice: received all states
May 15 09:40:19 pve1 pmxcfs[25670]: [status] notice: all data is up to date
May 15 09:40:19 pve1 pmxcfs[25670]: [status] notice: dfsm_deliver_queue: queue length 12
May 15 09:40:21 pve1 systemd[1]: Started The Proxmox VE cluster filesystem.
May 15 09:40:21 pve1 pvedaemon[22829]: <root@pam> end task UPID:pve1:00006414:00C844AA:5CDBB452:srvrestart:pve-cluster:root@pam: OK
May 15 09:40:31 pve1 pmxcfs[25670]: [status] notice: received log
May 15 09:41:00 pve1 systemd[1]: Starting Proxmox VE replication runner...
May 15 09:41:06 pve1 pvedaemon[5678]: <root@pam> successful auth for user 'root@pam'
May 15 09:42:25 pve1 pveproxy[20332]: worker exit
May 15 09:42:25 pve1 pveproxy[3772]: worker 20332 finished
May 15 09:42:25 pve1 pveproxy[3772]: starting 1 worker(s)
May 15 09:42:25 pve1 pveproxy[3772]: worker 4842 started
May 15 09:42:50 pve1 corosync[24728]: notice  [TOTEM ] Retransmit List: ff1 ff2
May 15 09:42:50 pve1 corosync[24728]:  [TOTEM ] Retransmit List: ff1 ff2
May 15 09:42:50 pve1 corosync[24728]: notice  [TOTEM ] Retransmit List: fe8 ff0 ff1 ff2 ff3 ff4
May 15 09:42:50 pve1 corosync[24728]:  [TOTEM ] Retransmit List: fe8 ff0 ff1 ff2 ff3 ff4
May 15 09:42:50 pve1 corosync[24728]: notice  [TOTEM ] Retransmit List: fe7 fe8 fe9 ff0 ff1 ff2 ff3 ff4
May 15 09:42:50 pve1 corosync[24728]:  [TOTEM ] Retransmit List: fe7 fe8 fe9 ff0 ff1 ff2 ff3 ff4
May 15 09:42:50 pve1 corosync[24728]: notice  [TOTEM ] Retransmit List: ff0 ff2 ff3 ff4 ff5
May 15 09:42:50 pve1 corosync[24728]:  [TOTEM ] Retransmit List: ff0 ff2 ff3 ff4 ff5
May 15 09:42:50 pve1 corosync[24728]: notice  [TOTEM ] Retransmit List: fee fef fd3 ff2 ff0 ff1 ff3 ff4 ff5 ff6 ff7
May 15 09:42:50 pve1 corosync[24728]:  [TOTEM ] Retransmit List: fee fef fd3 ff2 ff0 ff1 ff3 ff4 ff5 ff6 ff7
May 15 09:42:50 pve1 corosync[24728]: notice  [TOTEM ] Retransmit List: fee fef ff0 ff1 ff2 ff3 ff4 ff5 ff6 ff7
May 15 09:42:50 pve1 corosync[24728]:  [TOTEM ] Retransmit List: fee fef ff0 ff1 ff2 ff3 ff4 ff5 ff6 ff7
May 15 09:42:50 pve1 corosync[24728]: notice  [TOTEM ] Retransmit List: fee fd3 fef ff0 fe7 ff1 ff2 ff3 ff4 ff5 ff6 ff7
May 15 09:42:50 pve1 corosync[24728]:  [TOTEM ] Retransmit List: fee fd3 fef ff0 fe7 ff1 ff2 ff3 ff4 ff5 ff6 ff7
May 15 09:42:50 pve1 corosync[24728]: notice  [TOTEM ] Retransmit List: fed fee fef ff0 ff1 ff2 ff3 ff4 ff5 ff6 ff7
May 15 09:42:50 pve1 corosync[24728]:  [TOTEM ] Retransmit List: fed fee fef ff0 ff1 ff2 ff3 ff4 ff5 ff6 ff7
May 15 09:42:50 pve1 corosync[24728]: notice  [TOTEM ] Retransmit List: fee fef ff0 ff1 ff2 ff3 ff4 ff5 ff6 ff7
May 15 09:42:50 pve1 corosync[24728]:  [TOTEM ] Retransmit List: fee fef ff0 ff1 ff2 ff3 ff4 ff5 ff6 ff7
May 15 09:42:50 pve1 corosync[24728]: notice  [TOTEM ] Retransmit List: fed fee fef ff0 ff1 ff2 ff3 ff4 ff5 ff6 ff7
May 15 09:42:50 pve1 corosync[24728]:  [TOTEM ] Retransmit List: fed fee fef ff0 ff1 ff2 ff3 ff4 ff5 ff6 ff7
May 15 09:42:50 pve1 corosync[24728]: notice  [TOTEM ] Retransmit List: fd1 ff0 ff1 ff2 ff3 ff4 ff5 ff6 ff7
May 15 09:42:50 pve1 corosync[24728]:  [TOTEM ] Retransmit List: fd1 ff0 ff1 ff2 ff3 ff4 ff5 ff6 ff7
May 15 09:42:50 pve1 corosync[24728]: notice  [TOTEM ] Retransmit List: fee fd2 fd3 fef ff0 ff1 ff2 ff3 ff4 ff5 ff6 ff7

...

May 15 09:42:52 pve1 corosync[24728]: notice  [TOTEM ] Retransmit List: fe0 fe1 fe2 fe3 fe4 fe5 fe6 fe7 fe8 fec fcc fcd fce fed fee ff8 fe9 fea fef ff0 ff1 ff2 ff9 ffa ffb ffc ffd ffe fff 1000
May 15 09:42:52 pve1 corosync[24728]:  [TOTEM ] Retransmit List: fe0 fe1 fe2 fe3 fe4 fe5 fe6 fe7 fe8 fec fcc fcd fce fed fee ff8 fe9 fea fef ff0 ff1 ff2 ff9 ffa ffb ffc ffd ffe fff 1000
May 15 09:42:52 pve1 corosync[24728]: notice  [TOTEM ] Retransmit List: fe0 fe1 fe2 fcc fcd fce feb ff8 ff0 ff1 ff2 ff3 ff4 ff5 ff6 ff9 ffa ffb ffc ffd ffe fff 1000
May 15 09:42:52 pve1 corosync[24728]:  [TOTEM ] Retransmit List: fe0 fe1 fe2 fcc fcd fce feb ff8 ff0 ff1 ff2 ff3 ff4 ff5 ff6 ff9 ffa ffb ffc ffd ffe fff 1000
May 15 09:42:52 pve1 corosync[24728]: notice  [TOTEM ] Retransmit List: ffd ffe fff 1000 fe4 fe5 fcc fcd fce fe6 fe7 fe8 fe9 fea feb fec fed fee ff8 ff9 ffa ffb ffc
May 15 09:42:52 pve1 corosync[24728]:  [TOTEM ] Retransmit List: ffd ffe fff 1000 fe4 fe5 fcc fcd fce fe6 fe7 fe8 fe9 fea feb fec fed fee ff8 ff9 ffa ffb ffc
May 15 09:42:52 pve1 corosync[24728]: notice  [TOTEM ] Retransmit List: fe3 fef ff0 ff1 fcc fcd fce ff8 ff2 ff3 ff4 ff5 ff6 ff7 ff9 ffa ffb ffc ffd ffe fff 1000
May 15 09:42:52 pve1 corosync[24728]:  [TOTEM ] Retransmit List: fe3 fef ff0 ff1 fcc fcd fce ff8 ff2 ff3 ff4 ff5 ff6 ff7 ff9 ffa ffb ffc ffd ffe fff 1000
May 15 09:42:52 pve1 corosync[24728]: notice  [TOTEM ] Retransmit List: fe2 fe3 fe4 fe5 fe6 fe7 fe8 fe9 fea fcc fcd fce feb fec fed fee fef ff0 ff1 ff2 ff3 ff8 ff9 ffa ffb ffc ffd ffe fff 1000
May 15 09:42:52 pve1 corosync[24728]:  [TOTEM ] Retransmit List: fe2 fe3 fe4 fe5 fe6 fe7 fe8 fe9 fea fcc fcd fce feb fec fed fee fef ff0 ff1 ff2 ff3 ff8 ff9 ffa ffb ffc ffd ffe fff 1000
May 15 09:42:52 pve1 corosync[24728]: notice  [TOTEM ] Retransmit List: fdf fe0 fe1 fe2 fcc fcd fce fe3 fe4 ff8 fe7 ff0 ff1 ff2 ff4 ff5 ff9 ffa ffb ffc ffd ffe fff 1000
May 15 09:42:52 pve1 corosync[24728]:  [TOTEM ] Retransmit List: fdf fe0 fe1 fe2 fcc fcd fce fe3 fe4 ff8 fe7 ff0 ff1 ff2 ff4 ff5 ff9 ffa ffb ffc ffd ffe fff 1000
May 15 09:42:52 pve1 corosync[24728]: notice  [TOTEM ] Retransmit List: ffe fff 1000 fe6 fe8 fe9 fcc fcd fce fea feb fec fed fee fef ff0 ff1 ff2 ff8 ff9 ffa ffb ffc ffd
May 15 09:42:52 pve1 corosync[24728]:  [TOTEM ] Retransmit List: ffe fff 1000 fe6 fe8 fe9 fcc fcd fce fea feb fec fed fee fef ff0 ff1 ff2 ff8 ff9 ffa ffb ffc ffd
May 15 09:42:52 pve1 corosync[24728]: notice  [TOTEM ] Retransmit List: fe2 fe3 fe4 fe5 fcc fcd fce fe7 ff8 ff9 fe8 fe9 ff0 ff1 ff2 ff3 ffa ffb ffc ffd ffe fff 1000
May 15 09:42:52 pve1 corosync[24728]:  [TOTEM ] Retransmit List: fe2 fe3 fe4 fe5 fcc fcd fce fe7 ff8 ff9 fe8 fe9 ff0 ff1 ff2 ff3 ffa ffb ffc ffd ffe fff 1000
May 15 09:42:52 pve1 corosync[24728]: notice  [TOTEM ] Retransmit List: fe1 fe2 fe3 fe4 fe6 fea feb fec fed fcc fcd fce fee fef ff8 fe7 fe8 ff0 ff1 ff2 ff4 ff9 ffa ffb ffc ffd ffe fff 1000
May 15 09:42:52 pve1 corosync[24728]:  [TOTEM ] Retransmit List: fe1 fe2 fe3 fe4 fe6 fea feb fec fed fcc fcd fce fee fef ff8 fe7 fe8 ff0 ff1 ff2 ff4 ff9 ffa ffb ffc ffd ffe fff 1000
May 15 09:42:52 pve1 corosync[24728]: notice  [TOTEM ] Retransmit List: fe0 fe1 fe2 fe3 fcc fcd fce fe5 fe9 ff8 fea ff0 ff1 ff2 ff3 ff5 ff9 ffa ffb ffc ffd ffe fff 1000
May 15 09:42:52 pve1 corosync[24728]:  [TOTEM ] Retransmit List: fe0 fe1 fe2 fe3 fcc fcd fce fe5 fe9 ff8 fea ff0 ff1 ff2 ff3 ff5 ff9 ffa ffb ffc ffd ffe fff 1000
May 15 09:42:52 pve1 corosync[24728]: notice  [TOTEM ] Retransmit List: ffe fff 1000 fe4 fe6 fe7 fcc fcd fce fe8 feb fec fe9 fed fee fef ff0 ff1 ff8 ff9 ffa ffb ffc ffd
May 15 09:42:52 pve1 corosync[24728]:  [TOTEM ] Retransmit List: ffe fff 1000 fe4 fe6 fe7 fcc fcd fce fe8 feb fec fe9 fed fee fef ff0 ff1 ff8 ff9 ffa ffb ffc ffd
May 15 09:42:52 pve1 corosync[24728]: notice  [TOTEM ] Retransmit List: fe2 fe3 fe5 fea fcc fcd fce ff8 ff9 ffa fe7 ff0 ff1 ff2 ff3 ff4 ffb ffc ffd ffe fff 1000
May 15 09:42:52 pve1 corosync[24728]:  [TOTEM ] Retransmit List: fe2 fe3 fe5 fea fcc fcd fce ff8 ff9 ffa fe7 ff0 ff1 ff2 ff3 ff4 ffb ffc ffd ffe fff 1000
May 15 09:42:52 pve1 corosync[24728]: notice  [TOTEM ] Retransmit List: fe2 fe3 fe4 fe5 fe6 fe8 fe9 feb fec fcc fcd fce fed fee fef fe7 fea ff0 ff1 ff2 ff3 ff8 ff9 ffa ffb ffc ffd ffe fff 1000
May 15 09:42:52 pve1 corosync[24728]:  [TOTEM ] Retransmit List: fe2 fe3 fe4 fe5 fe6 fe8 fe9 feb fec fcc fcd fce fed fee fef fe7 fea ff0 ff1 ff2 ff3 ff8 ff9 ffa ffb ffc ffd ffe fff 1000
May 15 09:42:52 pve1 corosync[24728]: notice  [TOTEM ] Retransmit List: fdf fe0 fe1 fe2 fcc fcd fce fe3 fe4 ff8 fe8 ff0 ff1 ff2 ff4 ff5 ff9 ffa ffb ffc ffd ffe fff 1000
May 15 09:42:52 pve1 corosync[24728]:  [TOTEM ] Retransmit List: fdf fe0 fe1 fe2 fcc fcd fce fe3 fe4 ff8 fe8 ff0 ff1 ff2 ff4 ff5 ff9 ffa ffb ffc ffd ffe fff 1000
May 15 09:42:52 pve1 corosync[24728]: notice  [TOTEM ] Retransmit List: ffe fff 1000 fe6 fe7 fe9 fcc fcd fce fea feb fec fed fee fef ff0 ff1 ff2 ff8 ff9 ffa ffb ffc ffd
May 15 09:42:52 pve1 corosync[24728]:  [TOTEM ] Retransmit List: ffe fff 1000 fe6 fe7 fe9 fcc fcd fce fea feb fec fed fee fef ff0 ff1 ff2 ff8 ff9 ffa ffb ffc ffd
May 15 09:42:52 pve1 corosync[24728]: notice  [TOTEM ] Retransmit List: fe2 fe3 fe4 fe5 fcc fcd fce fe8 ff8 ff9 fe7 fe9 ff0 ff1 ff2 ff3 ffa ffb ffc ffd ffe fff 1000
May 15 09:42:52 pve1 corosync[24728]:  [TOTEM ] Retransmit List: fe2 fe3 fe4 fe5 fcc fcd fce fe8 ff8 ff9 fe7 fe9 ff0 ff1 ff2 ff3 ffa ffb ffc ffd ffe fff 1000
May 15 09:42:52 pve1 corosync[24728]: notice  [TOTEM ] Retransmit List: fe1 fe2 fe3 fe4 fe6 fea feb fec fed fcc fcd fce fee fef ff8 fe7 fe8 ff0 ff1 ff2 ff4 ff9 ffa ffb ffc ffd ffe fff 1000
May 15 09:42:52 pve1 corosync[24728]:  [TOTEM ] Retransmit List: fe1 fe2 fe3 fe4 fe6 fea feb fec fed fcc fcd fce fee fef ff8 fe7 fe8 ff0 ff1 ff2 ff4 ff9 ffa ffb ffc ffd ffe fff 1000
May 15 09:42:52 pve1 corosync[24728]: notice  [TOTEM ] Retransmit List: fe0 fe1 fe2 fe3 fcc fcd fce fe5 fe9 ff8 fea ff0 ff1 ff2 ff3 ff5 ff9 ffa ffb ffc ffd ffe fff 1000
May 15 09:42:52 pve1 corosync[24728]:  [TOTEM ] Retransmit List: fe0 fe1 fe2 fe3 fcc fcd fce fe5 fe9 ff8 fea ff0 ff1 ff2 ff3 ff5 ff9 ffa ffb ffc ffd ffe fff 1000
May 15 09:42:52 pve1 corosync[24728]: notice  [TOTEM ] Retransmit List: ffe fff 1000 fe4 fe6 fe7 fcc fcd fce fe8 feb fec fe9 fed fee fef ff0 ff1 ff8 ff9 ffa ffb ffc ffd
May 15 09:42:52 pve1 corosync[24728]:  [TOTEM ] Retransmit List: ffe fff 1000 fe4 fe6 fe7 fcc fcd fce fe8 feb fec fe9 fed fee fef ff0 ff1 ff8 ff9 ffa ffb ffc ffd
May 15 09:42:52 pve1 corosync[24728]: notice  [TOTEM ] Retransmit List: fe2 fe3 fe5 fea fcc fcd fce ff8 ff9 ffa fe7 ff0 ff1 ff2 ff3 ff4 ffb ffc ffd ffe fff 1000
May 15 09:42:52 pve1 corosync[24728]:  [TOTEM ] Retransmit List: fe2 fe3 fe5 fea fcc fcd fce ff8 ff9 ffa fe7 ff0 ff1 ff2 ff3 ff4 ffb ffc ffd ffe fff 1000
May 15 09:42:52 pve1 corosync[24728]: notice  [TOTEM ] Retransmit List: fe2 fe3 fe4 fe5 fe6 fe8 fe9 feb fec fcc fcd fce fed fee fef fe7 fea ff0 ff1 ff2 ff3 ff8 ff9 ffa ffb ffc ffd ffe fff 1000

...

May 15 09:43:03 pve1 corosync[24728]: notice  [TOTEM ] A new membership (10.64.200.200:4632) was formed. Members left: 4
May 15 09:43:03 pve1 corosync[24728]: notice  [TOTEM ] Failed to receive the leave message. failed: 4
May 15 09:43:03 pve1 corosync[24728]:  [TOTEM ] A new membership (10.64.200.200:4632) was formed. Members left: 4
May 15 09:43:03 pve1 corosync[24728]:  [TOTEM ] Failed to receive the leave message. failed: 4
May 15 09:43:03 pve1 corosync[24728]: warning [CPG   ] downlist left_list: 1 received
May 15 09:43:03 pve1 corosync[24728]:  [CPG   ] downlist left_list: 1 received
May 15 09:43:03 pve1 corosync[24728]:  [CPG   ] downlist left_list: 1 received
May 15 09:43:03 pve1 corosync[24728]: warning [CPG   ] downlist left_list: 1 received
May 15 09:43:03 pve1 corosync[24728]: warning [CPG   ] downlist left_list: 1 received
May 15 09:43:03 pve1 corosync[24728]:  [CPG   ] downlist left_list: 1 received
May 15 09:43:03 pve1 corosync[24728]: warning [CPG   ] downlist left_list: 1 received
May 15 09:43:03 pve1 corosync[24728]:  [CPG   ] downlist left_list: 1 received
May 15 09:43:03 pve1 corosync[24728]: warning [CPG   ] downlist left_list: 1 received
May 15 09:43:03 pve1 corosync[24728]:  [CPG   ] downlist left_list: 1 received
May 15 09:43:03 pve1 corosync[24728]: warning [CPG   ] downlist left_list: 1 received
May 15 09:43:03 pve1 corosync[24728]:  [CPG   ] downlist left_list: 1 received
May 15 09:43:03 pve1 pmxcfs[25670]: [dcdb] notice: members: 1/25670, 2/17635, 3/25391, 5/27190, 6/12205, 7/21753
May 15 09:43:03 pve1 pmxcfs[25670]: [dcdb] notice: starting data syncronisation
May 15 09:43:03 pve1 corosync[24728]: notice  [QUORUM] Members[6]: 1 2 3 5 6 7
May 15 09:43:03 pve1 corosync[24728]: notice  [MAIN  ] Completed service synchronization, ready to provide service.
May 15 09:43:03 pve1 corosync[24728]:  [QUORUM] Members[6]: 1 2 3 5 6 7
May 15 09:43:03 pve1 corosync[24728]:  [MAIN  ] Completed service synchronization, ready to provide service.
May 15 09:43:03 pve1 pmxcfs[25670]: [dcdb] notice: cpg_send_message retried 1 times
May 15 09:43:03 pve1 pmxcfs[25670]: [status] notice: members: 1/25670, 2/17635, 3/25391, 5/27190, 6/12205, 7/21753
May 15 09:43:03 pve1 pmxcfs[25670]: [status] notice: starting data syncronisation
May 15 09:43:04 pve1 corosync[24728]: notice  [TOTEM ] A new membership (10.64.200.200:4636) was formed. Members
May 15 09:43:04 pve1 corosync[24728]:  [TOTEM ] A new membership (10.64.200.200:4636) was formed. Members
May 15 09:43:04 pve1 corosync[24728]: warning [CPG   ] downlist left_list: 0 received
May 15 09:43:04 pve1 corosync[24728]:  [CPG   ] downlist left_list: 0 received
May 15 09:43:04 pve1 corosync[24728]:  [CPG   ] downlist left_list: 0 received
May 15 09:43:04 pve1 corosync[24728]: warning [CPG   ] downlist left_list: 0 received
May 15 09:43:04 pve1 corosync[24728]: warning [CPG   ] downlist left_list: 0 received
May 15 09:43:04 pve1 corosync[24728]:  [CPG   ] downlist left_list: 0 received
May 15 09:43:04 pve1 corosync[24728]: warning [CPG   ] downlist left_list: 0 received
May 15 09:43:04 pve1 corosync[24728]:  [CPG   ] downlist left_list: 0 received
May 15 09:43:04 pve1 corosync[24728]: warning [CPG   ] downlist left_list: 0 received
May 15 09:43:04 pve1 corosync[24728]:  [CPG   ] downlist left_list: 0 received
May 15 09:43:04 pve1 corosync[24728]: warning [CPG   ] downlist left_list: 0 received
May 15 09:43:04 pve1 corosync[24728]:  [CPG   ] downlist left_list: 0 received
May 15 09:43:04 pve1 corosync[24728]: notice  [QUORUM] Members[6]: 1 2 3 5 6 7
May 15 09:43:04 pve1 corosync[24728]: notice  [MAIN  ] Completed service synchronization, ready to provide service.
May 15 09:43:04 pve1 corosync[24728]:  [QUORUM] Members[6]: 1 2 3 5 6 7
May 15 09:43:04 pve1 corosync[24728]:  [MAIN  ] Completed service synchronization, ready to provide service.
May 15 09:43:04 pve1 pmxcfs[25670]: [dcdb] notice: received sync request (epoch 1/25670/00000002)
May 15 09:43:04 pve1 pmxcfs[25670]: [status] notice: received sync request (epoch 1/25670/00000002)
May 15 09:43:04 pve1 pmxcfs[25670]: [dcdb] notice: received all states
May 15 09:43:04 pve1 pmxcfs[25670]: [dcdb] notice: leader is 1/25670
May 15 09:43:04 pve1 pmxcfs[25670]: [dcdb] notice: synced members: 1/25670, 2/17635, 3/25391, 5/27190, 6/12205, 7/21753
May 15 09:43:04 pve1 pmxcfs[25670]: [dcdb] notice: start sending inode updates
May 15 09:43:04 pve1 pmxcfs[25670]: [dcdb] notice: sent all (0) updates
May 15 09:43:04 pve1 pmxcfs[25670]: [dcdb] notice: all data is up to date
May 15 09:43:04 pve1 pmxcfs[25670]: [dcdb] notice: dfsm_deliver_queue: queue length 7
May 15 09:43:04 pve1 pmxcfs[25670]: [status] notice: received all states
May 15 09:43:04 pve1 pmxcfs[25670]: [status] notice: all data is up to date
May 15 09:43:04 pve1 pmxcfs[25670]: [status] notice: dfsm_deliver_queue: queue length 47
May 15 09:43:07 pve1 smartd[2703]: Device: /dev/sdf [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 70 to 71
May 15 09:43:07 pve1 smartd[2703]: Device: /dev/sdf [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 61 to 60
May 15 09:43:07 pve1 smartd[2703]: Device: /dev/sdg [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 63 to 60
May 15 09:43:09 pve1 corosync[24728]: notice  [TOTEM ] A new membership (10.64.200.200:4640) was formed. Members
May 15 09:43:09 pve1 corosync[24728]:  [TOTEM ] A new membership (10.64.200.200:4640) was formed. Members
May 15 09:43:09 pve1 corosync[24728]: warning [CPG   ] downlist left_list: 0 received
May 15 09:43:09 pve1 corosync[24728]: warning [CPG   ] downlist left_list: 0 received

I'm facing cluster stability problems on all 3 installations that I've upgraded from 5.3 to 5.4, with different switch types (vendors) and network topologies. I'm fully confident that all the switches are configured properly (IGMP snooping and querier), and all 3 setups had been working smoothly for a long time, starting from PVE 3.x.

The only thing they have in common is a bond (LACP) with a bridge on top of the bond.
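
On the Proxmox side, the bridge's IGMP handling can be checked as well; a sketch, assuming the bridge on top of the bond is called vmbr0:
Code:
# is IGMP snooping active on the bridge?
cat /sys/class/net/vmbr0/bridge/multicast_snooping
# is the bridge itself acting as IGMP querier?
cat /sys/class/net/vmbr0/bridge/multicast_querier
# temporarily enable a querier on the bridge for testing
echo 1 > /sys/class/net/vmbr0/bridge/multicast_querier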
 
Does corosync run on its own separate network, or do you have other traffic over the same network? The omping results indicate that it is indeed a networking problem.
 
Your trouble seems to start right after a worker was started, according to the logs, so probably something utilizes the network and corosync messages are not delivered in time (corosync is very latency sensitive).
Code:
May 15 09:42:25 pve1 pveproxy[3772]: worker 4842 started
May 15 09:42:50 pve1 corosync[24728]: notice  [TOTEM ] Retransmit List: ff1 ff2
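
If a dedicated network for corosync is an option, corosync 2.x can also run a redundant second ring there; a sketch of the corosync.conf changes (the 10.10.10.x addresses are placeholders for a separate NIC/VLAN, and config_version must be increased):
Code:
totem {
  ...
  rrp_mode: passive
}

nodelist {
  node {
    name: pve1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.64.200.200
    ring1_addr: 10.10.10.200   # placeholder: address on the dedicated corosync network
  }
  # add a ring1_addr to the remaining node entries as well
}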
 
Below is how the omping result looks now:
Code:
root@pve2:~# omping -c 600 -i 1 -q pve2 pve3 pve4A
pve3  : waiting for response msg
pve4A : waiting for response msg
pve4A : joined (S,G) = (*, 232.43.211.234), pinging
pve3  : joined (S,G) = (*, 232.43.211.234), pinging
pve3  : given amount of query messages was sent
pve4A : given amount of query messages was sent
pve3  :   unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.068/0.154/0.415/0.026
pve3  : multicast, xmt/rcv/%loss = 600/260/56% (seq>=2 56%), min/avg/max/std-dev = 0.127/0.165/0.229/0.020
pve4A :   unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.118/0.176/0.402/0.027
pve4A : multicast, xmt/rcv/%loss = 600/260/56% (seq>=2 56%), min/avg/max/std-dev = 0.139/0.195/0.409/0.031

There was no packet loss before the upgrade. The switch config has not been touched for years.
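
For comparison, a shorter high-rate burst test with the same nodes can help separate general multicast loss from IGMP snooping timeouts that only show up in the long 1-second-interval run:
Code:
omping -c 10000 -i 0.001 -q pve2 pve3 pve4A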
 
Does corosync run on its own separate network, or do you have other traffic over the same network? The omping results indicate that it is indeed a networking problem.
There is no dedicated network, but the switch is not loaded (according to SNMP stats). And once again: everything was fine before the upgrade.
 
