[SOLVED] Problem with corosync, Cluster stuck several minutes

David Calvache Casas

I have a 7-node cluster.
Corosync is configured with two rings on two different Ethernet interfaces.

Everything is running OK:

root@ceph6:~# pvecm status
Quorum information
------------------
Date:             Tue May  7 13:07:05 2019
Quorum provider:  corosync_votequorum
Nodes:            7
Node ID:          0x00000006
Ring ID:          1/2780
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   7
Highest expected: 7
Total votes:      7
Quorum:           4
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 10.9.5.151
0x00000002          1 10.9.5.152
0x00000003          1 10.9.5.153
0x00000004          1 10.9.5.154
0x00000005          1 10.9.5.155
0x00000006          1 10.9.5.156 (local)
0x00000007          1 10.9.5.157



But when I shut down a node (for example node 4), the cluster notices the node leaving and handles it fine.

May 7 13:09:00 ceph1 systemd[1]: Starting Proxmox VE replication runner...
May 7 13:09:00 ceph1 systemd[1]: Started Proxmox VE replication runner.
May 7 13:09:03 ceph1 pmxcfs[4047]: [dcdb] notice: members: 1/4047, 2/3707, 3/3671, 5/3539, 6/3691, 7/3421
May 7 13:09:03 ceph1 pmxcfs[4047]: [dcdb] notice: starting data syncronisation
May 7 13:09:03 ceph1 pmxcfs[4047]: [status] notice: members: 1/4047, 2/3707, 3/3671, 5/3539, 6/3691, 7/3421
May 7 13:09:03 ceph1 pmxcfs[4047]: [status] notice: starting data syncronisation
May 7 13:09:03 ceph1 corosync[4073]: notice [TOTEM ] A new membership (10.9.5.151:2784) was formed. Members left: 4
May 7 13:09:03 ceph1 corosync[4073]: [TOTEM ] A new membership (10.9.5.151:2784) was formed. Members left: 4
May 7 13:09:03 ceph1 corosync[4073]: warning [CPG ] downlist left_list: 1 received
May 7 13:09:03 ceph1 corosync[4073]: warning [CPG ] downlist left_list: 1 received
May 7 13:09:03 ceph1 corosync[4073]: [CPG ] downlist left_list: 1 received
May 7 13:09:03 ceph1 corosync[4073]: warning [CPG ] downlist left_list: 1 received
May 7 13:09:03 ceph1 corosync[4073]: warning [CPG ] downlist left_list: 1 received
May 7 13:09:03 ceph1 corosync[4073]: [CPG ] downlist left_list: 1 received
May 7 13:09:03 ceph1 corosync[4073]: warning [CPG ] downlist left_list: 1 received
May 7 13:09:03 ceph1 corosync[4073]: warning [CPG ] downlist left_list: 1 received
May 7 13:09:03 ceph1 corosync[4073]: [CPG ] downlist left_list: 1 received
May 7 13:09:03 ceph1 corosync[4073]: [CPG ] downlist left_list: 1 received
May 7 13:09:03 ceph1 corosync[4073]: [CPG ] downlist left_list: 1 received
May 7 13:09:03 ceph1 corosync[4073]: [CPG ] downlist left_list: 1 received
May 7 13:09:03 ceph1 corosync[4073]: [QUORUM] Members[6]: 1 2 3 5 6 7
May 7 13:09:03 ceph1 corosync[4073]: notice [QUORUM] Members[6]: 1 2 3 5 6 7
May 7 13:09:03 ceph1 corosync[4073]: notice [MAIN ] Completed service synchronization, ready to provide service.
May 7 13:09:03 ceph1 corosync[4073]: [MAIN ] Completed service synchronization, ready to provide service.
May 7 13:09:03 ceph1 pmxcfs[4047]: [dcdb] notice: received sync request (epoch 1/4047/00000008)
May 7 13:09:03 ceph1 pmxcfs[4047]: [status] notice: received sync request (epoch 1/4047/00000008)
May 7 13:09:03 ceph1 pmxcfs[4047]: [dcdb] notice: received all states
May 7 13:09:03 ceph1 pmxcfs[4047]: [dcdb] notice: leader is 1/4047
May 7 13:09:03 ceph1 pmxcfs[4047]: [dcdb] notice: synced members: 1/4047, 2/3707, 3/3671, 5/3539, 6/3691, 7/3421
May 7 13:09:03 ceph1 pmxcfs[4047]: [dcdb] notice: start sending inode updates
May 7 13:09:03 ceph1 pmxcfs[4047]: [dcdb] notice: sent all (0) updates
May 7 13:09:03 ceph1 pmxcfs[4047]: [dcdb] notice: all data is up to date
May 7 13:09:03 ceph1 pmxcfs[4047]: [dcdb] notice: dfsm_deliver_queue: queue length 4
May 7 13:09:03 ceph1 pmxcfs[4047]: [status] notice: received all states
May 7 13:09:03 ceph1 pmxcfs[4047]: [status] notice: all data is up to date
May 7 13:10:00 ceph1 systemd[1]: Starting Proxmox VE replication runner...
.
.
.
May 7 13:14:00 ceph1 systemd[1]: Starting Proxmox VE replication runner...
May 7 13:14:00 ceph1 systemd[1]: Started Proxmox VE replication runner.

root@ceph6:~# pvecm status
Quorum information
------------------
Date:             Tue May  7 13:15:50 2019
Quorum provider:  corosync_votequorum
Nodes:            6
Node ID:          0x00000006
Ring ID:          1/2784
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   7
Highest expected: 7
Total votes:      6
Quorum:           4
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 10.9.5.151
0x00000002          1 10.9.5.152
0x00000003          1 10.9.5.153
0x00000005          1 10.9.5.155
0x00000006          1 10.9.5.156 (local)
0x00000007          1 10.9.5.157


But when I power node 4 back up, the whole cluster goes down and quorum is lost.

root@ceph6:~# pvecm status
Quorum information
------------------
Date:             Tue May  7 13:29:45 2019
Quorum provider:  corosync_votequorum
Nodes:            1
Node ID:          0x00000006
Ring ID:          6/2796
Quorate:          No

Votequorum information
----------------------
Expected votes:   7
Highest expected: 7
Total votes:      1
Quorum:           4 Activity blocked
Flags:

Membership information
----------------------
    Nodeid      Votes Name
0x00000006          1 10.9.5.156 (local)


May 7 13:24:00 ceph1 systemd[1]: Started Proxmox VE replication runner.
May 7 13:25:00 ceph1 systemd[1]: Starting Proxmox VE replication runner...
May 7 13:25:00 ceph1 systemd[1]: Started Proxmox VE replication runner.
May 7 13:25:43 ceph1 corosync[4073]: notice [TOTEM ] A new membership (10.9.5.151:2792) was formed. Members joined: 4
May 7 13:25:43 ceph1 corosync[4073]: [TOTEM ] A new membership (10.9.5.151:2792) was formed. Members joined: 4
May 7 13:25:43 ceph1 corosync[4073]: notice [TOTEM ] Retransmit List: 2 4
May 7 13:25:43 ceph1 corosync[4073]: [TOTEM ] Retransmit List: 2 4
May 7 13:25:43 ceph1 corosync[4073]: notice [TOTEM ] Retransmit List: 7 4
May 7 13:25:43 ceph1 corosync[4073]: [TOTEM ] Retransmit List: 7 4
May 7 13:25:43 ceph1 corosync[4073]: notice [TOTEM ] Retransmit List: 8
May 7 13:25:43 ceph1 corosync[4073]: [TOTEM ] Retransmit List: 8
May 7 13:25:43 ceph1 corosync[4073]: notice [TOTEM ] Retransmit List: 8
May 7 13:25:43 ceph1 corosync[4073]: [TOTEM ] Retransmit List: 8
May 7 13:25:43 ceph1 corosync[4073]: notice [TOTEM ] Retransmit List: a
May 7 13:25:43 ceph1 corosync[4073]: [TOTEM ] Retransmit List: a
May 7 13:25:43 ceph1 corosync[4073]: notice [TOTEM ] Retransmit List: c e
May 7 13:25:43 ceph1 corosync[4073]: [TOTEM ] Retransmit List: c e
May 7 13:25:43 ceph1 corosync[4073]: notice [TOTEM ] Retransmit List: f 11
May 7 13:25:43 ceph1 corosync[4073]: [TOTEM ] Retransmit List: f 11
May 7 13:25:43 ceph1 corosync[4073]: warning [CPG ] downlist left_list: 0 received
May 7 13:25:43 ceph1 corosync[4073]: [CPG ] downlist left_list: 0 received
May 7 13:25:43 ceph1 corosync[4073]: notice [TOTEM ] Retransmit List: 11
May 7 13:25:43 ceph1 corosync[4073]: [TOTEM ] Retransmit List: 11
May 7 13:25:43 ceph1 corosync[4073]: notice [TOTEM ] Retransmit List: 12 14 15 17 19
May 7 13:25:43 ceph1 corosync[4073]: [TOTEM ] Retransmit List: 12 14 15 17 19
May 7 13:25:43 ceph1 corosync[4073]: [CPG ] downlist left_list: 0 received
May 7 13:25:43 ceph1 corosync[4073]: warning [CPG ] downlist left_list: 0 received
May 7 13:25:43 ceph1 corosync[4073]: notice [TOTEM ] Retransmit List: 14 1e 21 17
May 7 13:25:43 ceph1 corosync[4073]: [TOTEM ] Retransmit List: 14 1e 21 17
May 7 13:25:43 ceph1 corosync[4073]: warning [CPG ] downlist left_list: 0 received
May 7 13:25:43 ceph1 corosync[4073]: [CPG ] downlist left_list: 0 received
May 7 13:25:43 ceph1 corosync[4073]: [TOTEM ] Retransmit List: 24 17
May 7 13:25:43 ceph1 corosync[4073]: notice [TOTEM ] Retransmit List: 24 17
May 7 13:25:43 ceph1 corosync[4073]: notice [TOTEM ] Retransmit List: 17
May 7 13:25:43 ceph1 corosync[4073]: [TOTEM ] Retransmit List: 17
May 7 13:25:43 ceph1 corosync[4073]: warning [CPG ] downlist left_list: 0 received
May 7 13:25:43 ceph1 corosync[4073]: warning [CPG ] downlist left_list: 0 received
May 7 13:25:43 ceph1 corosync[4073]: warning [CPG ] downlist left_list: 0 received
May 7 13:25:43 ceph1 corosync[4073]: warning [CPG ] downlist left_list: 0 received
May 7 13:25:43 ceph1 corosync[4073]: [CPG ] downlist left_list: 0 received
May 7 13:25:43 ceph1 corosync[4073]: [CPG ] downlist left_list: 0 received
May 7 13:25:43 ceph1 corosync[4073]: [CPG ] downlist left_list: 0 received
May 7 13:25:43 ceph1 corosync[4073]: [CPG ] downlist left_list: 0 received
May 7 13:25:43 ceph1 pmxcfs[4047]: [dcdb] notice: members: 1/4047, 2/3707, 3/3671, 4/3421, 5/3539, 6/3691, 7/3421
May 7 13:25:43 ceph1 pmxcfs[4047]: [dcdb] notice: starting data syncronisation
May 7 13:25:43 ceph1 corosync[4073]: notice [TOTEM ] Retransmit List: 26
May 7 13:25:43 ceph1 corosync[4073]: [TOTEM ] Retransmit List: 26
May 7 13:25:43 ceph1 corosync[4073]: notice [TOTEM ] Retransmit List: 28 2a 2b 2d 2f 31
May 7 13:25:43 ceph1 corosync[4073]: [TOTEM ] Retransmit List: 28 2a 2b 2d 2f 31
May 7 13:25:43 ceph1 corosync[4073]: [QUORUM] Members[7]: 1 2 3 4 5 6 7
May 7 13:25:43 ceph1 corosync[4073]: notice [QUORUM] Members[7]: 1 2 3 4 5 6 7
May 7 13:25:43 ceph1 corosync[4073]: notice [MAIN ] Completed service synchronization, ready to provide service.
May 7 13:25:43 ceph1 corosync[4073]: [MAIN ] Completed service synchronization, ready to provide service.
May 7 13:25:43 ceph1 corosync[4073]: [TOTEM ] Retransmit List: 2b 2f
May 7 13:25:43 ceph1 corosync[4073]: notice [TOTEM ] Retransmit List: 2b 2f
May 7 13:25:43 ceph1 corosync[4073]: notice [TOTEM ] Retransmit List: 2b
May 7 13:25:43 ceph1 corosync[4073]: [TOTEM ] Retransmit List: 2b
May 7 13:25:43 ceph1 corosync[4073]: notice [TOTEM ] Retransmit List: 2b
May 7 13:25:43 ceph1 corosync[4073]: [TOTEM ] Retransmit List: 2b
May 7 13:25:43 ceph1 corosync[4073]: notice [TOTEM ] Retransmit List: 33 35 37 39 3b
May 7 13:25:43 ceph1 corosync[4073]: [TOTEM ] Retransmit List: 33 35 37 39 3b
May 7 13:25:43 ceph1 corosync[4073]: notice [TOTEM ] Retransmit List: 33 37 40 44 48 4c 3b
May 7 13:25:43 ceph1 corosync[4073]: [TOTEM ] Retransmit List: 33 37 40 44 48 4c 3b
May 7 13:25:43 ceph1 corosync[4073]: notice [TOTEM ] Retransmit List: 33 4c 3b
May 7 13:25:43 ceph1 corosync[4073]: [TOTEM ] Retransmit List: 33 4c 3b
May 7 13:25:43 ceph1 corosync[4073]: notice [TOTEM ] Retransmit List: 33
May 7 13:25:43 ceph1 corosync[4073]: [TOTEM ] Retransmit List: 33
May 7 13:25:43 ceph1 corosync[4073]: error [TOTEM ] Marking ringid 0 interface 10.9.5.151 FAULTY
May 7 13:25:43 ceph1 corosync[4073]: [TOTEM ] Marking ringid 0 interface 10.9.5.151 FAULTY
May 7 13:25:43 ceph1 pmxcfs[4047]: [dcdb] notice: cpg_send_message retried 1 times
May 7 13:25:43 ceph1 pmxcfs[4047]: [status] notice: members: 1/4047, 2/3707, 3/3671, 4/3421, 5/3539, 6/3691, 7/3421
May 7 13:25:43 ceph1 pmxcfs[4047]: [status] notice: starting data syncronisation
May 7 13:25:43 ceph1 pmxcfs[4047]: [dcdb] notice: received sync request (epoch 1/4047/00000009)
May 7 13:25:43 ceph1 pmxcfs[4047]: [status] notice: received sync request (epoch 1/4047/00000009)
May 7 13:25:43 ceph1 corosync[4073]: notice [TOTEM ] Retransmit List: 59 5b 5d 5f 61 63 65 67 69 6a 6c 6e 70 72 74 76 78 7a
May 7 13:25:43 ceph1 corosync[4073]: [TOTEM ] Retransmit List: 59 5b 5d 5f 61 63 65 67 69 6a 6c 6e 70 72 74 76 78 7a
May 7 13:25:43 ceph1 corosync[4073]: notice [TOTEM ] Retransmit List: 70 72 74 76 78 7a 8c 8e 90
May 7 13:25:43 ceph1 corosync[4073]: [TOTEM ] Retransmit List: 70 72 74 76 78 7a 8c 8e 90
May 7 13:25:43 ceph1 corosync[4073]: notice [TOTEM ] Retransmit List: 8c 8e 90 ba bc
May 7 13:25:43 ceph1 corosync[4073]: [TOTEM ] Retransmit List: 8c 8e 90 ba bc
May 7 13:25:43 ceph1 pmxcfs[4047]: [dcdb] notice: received all states
May 7 13:25:43 ceph1 pmxcfs[4047]: [dcdb] notice: leader is 1/4047
May 7 13:25:43 ceph1 pmxcfs[4047]: [dcdb] notice: synced members: 1/4047, 2/3707, 3/3671, 5/3539, 6/3691, 7/3421
May 7 13:25:43 ceph1 pmxcfs[4047]: [dcdb] notice: start sending inode updates
May 7 13:25:43 ceph1 pmxcfs[4047]: [dcdb] notice: sent all (9) updates
May 7 13:25:43 ceph1 pmxcfs[4047]: [dcdb] notice: all data is up to date
May 7 13:25:43 ceph1 pmxcfs[4047]: [dcdb] notice: dfsm_deliver_queue: queue length 6
May 7 13:25:43 ceph1 pmxcfs[4047]: [status] notice: received all states
May 7 13:25:43 ceph1 pmxcfs[4047]: [status] notice: all data is up to date
May 7 13:25:43 ceph1 pmxcfs[4047]: [status] notice: dfsm_deliver_queue: queue length 29
May 7 13:25:44 ceph1 corosync[4073]: notice [TOTEM ] Automatically recovered ring 0
May 7 13:25:44 ceph1 corosync[4073]: [TOTEM ] Automatically recovered ring 0
May 7 13:25:45 ceph1 corosync[4073]: notice [TOTEM ] Retransmit List: 3a4 3a5 3a6 3a7 3a8 3a9
May 7 13:25:45 ceph1 corosync[4073]: [TOTEM ] Retransmit List: 3a4 3a5 3a6 3a7 3a8 3a9
.
.
.
May 7 13:25:47 ceph1 corosync[4073]: notice [TOTEM ] Retransmit List: 3a4 3a5 3a6 3a7 3a8 3a9 3aa 3ab 3ac 3ad 3ae 3af 3b0 3b1 3b2 3b3 3b4 3b5 3b6 3b7 3b8 3c2 3c3 3c4 3c5 3c6 3c7 3c8 3c9 3ca
May 7 13:25:47 ceph1 corosync[4073]: [TOTEM ] Retransmit List: 3a4 3a5 3a6 3a7 3a8 3a9 3aa 3ab 3ac 3ad 3ae 3af 3b0 3b1 3b2 3b3 3b4 3b5 3b6 3b7 3b8 3c2 3c3 3c4 3c5 3c6 3c7 3c8 3c9 3ca
May 7 13:25:52 ceph1 corosync[4073]: notice [TOTEM ] A processor failed, forming new configuration.
May 7 13:25:52 ceph1 corosync[4073]: [TOTEM ] A processor failed, forming new configuration.
May 7 13:25:57 ceph1 corosync[4073]: notice [TOTEM ] A new membership (10.9.5.151:2796) was formed. Members left: 2 3 4 5 6 7
May 7 13:25:57 ceph1 corosync[4073]: notice [TOTEM ] Failed to receive the leave message. failed: 2 3 4 5 6 7
May 7 13:25:57 ceph1 corosync[4073]: [TOTEM ] A new membership (10.9.5.151:2796) was formed. Members left: 2 3 4 5 6 7
May 7 13:25:57 ceph1 corosync[4073]: warning [CPG ] downlist left_list: 6 received
May 7 13:25:57 ceph1 corosync[4073]: [TOTEM ] Failed to receive the leave message. failed: 2 3 4 5 6 7
May 7 13:25:57 ceph1 corosync[4073]: notice [QUORUM] This node is within the non-primary component and will NOT provide any services.
May 7 13:25:57 ceph1 corosync[4073]: notice [QUORUM] Members[1]: 1
May 7 13:25:57 ceph1 corosync[4073]: notice [MAIN ] Completed service synchronization, ready to provide service.
May 7 13:25:57 ceph1 corosync[4073]: [CPG ] downlist left_list: 6 received
May 7 13:25:57 ceph1 pmxcfs[4047]: [dcdb] notice: members: 1/4047
May 7 13:25:57 ceph1 pmxcfs[4047]: [status] notice: members: 1/4047
May 7 13:25:57 ceph1 corosync[4073]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
May 7 13:25:57 ceph1 corosync[4073]: [QUORUM] Members[1]: 1
May 7 13:25:57 ceph1 corosync[4073]: [MAIN ] Completed service synchronization, ready to provide service.
May 7 13:25:57 ceph1 pmxcfs[4047]: [status] notice: node lost quorum
May 7 13:25:57 ceph1 pmxcfs[4047]: [dcdb] crit: received write while not quorate - trigger resync
May 7 13:25:57 ceph1 pmxcfs[4047]: [dcdb] crit: leaving CPG group
May 7 13:25:57 ceph1 pve-ha-lrm[5619]: unable to write lrm status file - unable to open file '/etc/pve/nodes/ceph1/lrm_status.tmp.5619' - Permission denied
May 7 13:25:57 ceph1 pmxcfs[4047]: [dcdb] notice: start cluster connection
May 7 13:25:57 ceph1 pmxcfs[4047]: [dcdb] notice: members: 1/4047
May 7 13:25:57 ceph1 pmxcfs[4047]: [dcdb] notice: all data is up to date
May 7 13:26:00 ceph1 systemd[1]: Starting Proxmox VE replication runner...
May 7 13:26:00 ceph1 pvesr[35726]: trying to acquire cfs lock 'file-replication_cfg' ...
May 7 13:26:08 ceph1 pvesr[35726]: trying to acquire cfs lock 'file-replication_cfg' ...
May 7 13:26:09 ceph1 pvesr[35726]: error with cfs lock 'file-replication_cfg': no quorum!
May 7 13:26:09 ceph1 systemd[1]: pvesr.service: Main process exited, code=exited, status=13/n/a
May 7 13:26:09 ceph1 systemd[1]: Failed to start Proxmox VE replication runner.
May 7 13:26:09 ceph1 systemd[1]: pvesr.service: Unit entered failed state.
May 7 13:26:09 ceph1 systemd[1]: pvesr.service: Failed with result 'exit-code'.
May 7 13:26:18 ceph1 pvedaemon[4741]: <root@pam> successful auth for user 'root@pam'
May 7 13:26:28 ceph1 smartd[2644]: Device: /dev/sdc [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 60 to 59
May 7 13:26:28 ceph1 smartd[2644]: Device: /dev/sdc [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 60 to 59
May 7 13:27:00 ceph1 systemd[1]: Starting Proxmox VE replication runner...
May 7 13:27:00 ceph1 pvesr[36655]: trying to acquire cfs lock 'file-replication_cfg' ...
May 7 13:27:01 ceph1 pvesr[36655]: trying to acquire cfs lock 'file-replication_cfg' ...
May 7 13:27:08 ceph1 pvesr[36655]: trying to acquire cfs lock 'file-replication_cfg' ...
May 7 13:27:09 ceph1 pvesr[36655]: error with cfs lock 'file-replication_cfg': no quorum!
May 7 13:27:09 ceph1 systemd[1]: pvesr.service: Main process exited, code=exited, status=13/n/a
May 7 13:27:09 ceph1 systemd[1]: Failed to start Proxmox VE replication runner.
May 7 13:27:09 ceph1 systemd[1]: pvesr.service: Unit entered failed state.
May 7 13:27:09 ceph1 systemd[1]: pvesr.service: Failed with result 'exit-code'.
May 7 13:28:00 ceph1 systemd[1]: Starting Proxmox VE replication runner...
May 7 13:28:00 ceph1 pvesr[37606]: trying to acquire cfs lock 'file-replication_cfg' ...
May 7 13:28:08 ceph1 pvesr[37606]: trying to acquire cfs lock 'file-replication_cfg' ...
May 7 13:28:09 ceph1 pvesr[37606]: error with cfs lock 'file-replication_cfg': no quorum!
May 7 13:28:09 ceph1 systemd[1]: pvesr.service: Main process exited, code=exited, status=13/n/a
May 7 13:28:09 ceph1 systemd[1]: Failed to start Proxmox VE replication runner.
May 7 13:28:09 ceph1 systemd[1]: pvesr.service: Unit entered failed state.
May 7 13:28:09 ceph1 systemd[1]: pvesr.service: Failed with result 'exit-code'.
May 7 13:29:00 ceph1 systemd[1]: Starting Proxmox VE replication runner...
May 7 13:29:00 ceph1 pvesr[38462]: trying to acquire cfs lock 'file-replication_cfg' ...
May 7 13:29:08 ceph1 pvesr[38462]: trying to acquire cfs lock 'file-replication_cfg' ...
May 7 13:29:09 ceph1 pvesr[38462]: error with cfs lock 'file-replication_cfg': no quorum!
May 7 13:29:09 ceph1 systemd[1]: pvesr.service: Main process exited, code=exited, status=13/n/a
May 7 13:29:09 ceph1 systemd[1]: Failed to start Proxmox VE replication runner.
May 7 13:29:09 ceph1 systemd[1]: pvesr.service: Unit entered failed state.
May 7 13:29:09 ceph1 systemd[1]: pvesr.service: Failed with result 'exit-code'.
May 7 13:30:00 ceph1 systemd[1]: Starting Proxmox VE replication runner...
May 7 13:30:00 ceph1 pvesr[39341]: trying to acquire cfs lock 'file-replication_cfg' ...
May 7 13:30:01 ceph1 pvesr[39341]: trying to acquire cfs lock 'file-replication_cfg' ...
May 7 13:30:02 ceph1 corosync[4073]: notice [TOTEM ] A new membership (10.9.5.151:2968) was formed. Members joined: 2 3 4 5 6 7
May 7 13:30:02 ceph1 corosync[4073]: [TOTEM ] A new membership (10.9.5.151:2968) was formed. Members joined: 2 3 4 5 6 7
May 7 13:30:02 ceph1 corosync[4073]: notice [TOTEM ] Retransmit List: 1 2 3
May 7 13:30:02 ceph1 corosync[4073]: [TOTEM ] Retransmit List: 1 2 3
May 7 13:30:02 ceph1 corosync[4073]: notice [TOTEM ] Retransmit List: 7
May 7 13:30:02 ceph1 corosync[4073]: [TOTEM ] Retransmit List: 7
May 7 13:30:02 ceph1 corosync[4073]: notice [TOTEM ] Retransmit List: 9 b c d
May 7 13:30:02 ceph1 corosync[4073]: [TOTEM ] Retransmit List: 9 b c d
May 7 13:30:02 ceph1 corosync[4073]: notice [TOTEM ] Retransmit List: c
May 7 13:30:02 ceph1 corosync[4073]: [TOTEM ] Retransmit List: c
May 7 13:30:02 ceph1 corosync[4073]: notice [TOTEM ] Retransmit List: 10
May 7 13:30:02 ceph1 corosync[4073]: [TOTEM ] Retransmit List: 10
May 7 13:30:02 ceph1 corosync[4073]: notice [TOTEM ] Retransmit List: 10
May 7 13:30:02 ceph1 corosync[4073]: [TOTEM ] Retransmit List: 10
May 7 13:30:02 ceph1 corosync[4073]: warning [CPG ] downlist left_list: 0 received
May 7 13:30:02 ceph1 corosync[4073]: [CPG ] downlist left_list: 0 received
May 7 13:30:02 ceph1 corosync[4073]: notice [TOTEM ] Retransmit List: 13 16
May 7 13:30:02 ceph1 corosync[4073]: warning [CPG ] downlist left_list: 0 received
May 7 13:30:02 ceph1 corosync[4073]: [TOTEM ] Retransmit List: 13 16
May 7 13:30:02 ceph1 corosync[4073]: [CPG ] downlist left_list: 0 received
May 7 13:30:02 ceph1 corosync[4073]: [TOTEM ] Retransmit List: 1d 21 23 16
May 7 13:30:02 ceph1 corosync[4073]: notice [TOTEM ] Retransmit List: 1d 21 23 16
May 7 13:30:02 ceph1 corosync[4073]: warning [CPG ] downlist left_list: 0 received
May 7 13:30:02 ceph1 corosync[4073]: warning [CPG ] downlist left_list: 0 received
May 7 13:30:02 ceph1 corosync[4073]: warning [CPG ] downlist left_list: 0 received
May 7 13:30:02 ceph1 corosync[4073]: warning [CPG ] downlist left_list: 0 received
May 7 13:30:02 ceph1 corosync[4073]: warning [CPG ] downlist left_list: 0 received
May 7 13:30:02 ceph1 corosync[4073]: notice [TOTEM ] Retransmit List: 23 24
May 7 13:30:02 ceph1 corosync[4073]: [CPG ] downlist left_list: 0 received
May 7 13:30:02 ceph1 corosync[4073]: [CPG ] downlist left_list: 0 received
May 7 13:30:02 ceph1 corosync[4073]: [CPG ] downlist left_list: 0 received
May 7 13:30:02 ceph1 corosync[4073]: [CPG ] downlist left_list: 0 received
May 7 13:30:02 ceph1 corosync[4073]: [CPG ] downlist left_list: 0 received
May 7 13:30:02 ceph1 corosync[4073]: [TOTEM ] Retransmit List: 23 24
May 7 13:30:02 ceph1 pmxcfs[4047]: [dcdb] notice: members: 1/4047, 3/3671
May 7 13:30:02 ceph1 pmxcfs[4047]: [dcdb] notice: starting data syncronisation
May 7 13:30:02 ceph1 corosync[4073]: notice [TOTEM ] Retransmit List: 25 26 27
May 7 13:30:02 ceph1 corosync[4073]: [TOTEM ] Retransmit List: 25 26 27
May 7 13:30:02 ceph1 corosync[4073]: notice [TOTEM ] Retransmit List: 2a 26
May 7 13:30:02 ceph1 corosync[4073]: [TOTEM ] Retransmit List: 2a 26
May 7 13:30:02 ceph1 corosync[4073]: notice [QUORUM] This node is within the primary component and will provide service.
May 7 13:30:02 ceph1 corosync[4073]: notice [QUORUM] Members[7]: 1 2 3 4 5 6 7
May 7 13:30:02 ceph1 corosync[4073]: notice [MAIN ] Completed service synchronization, ready to provide service.
May 7 13:30:02 ceph1 corosync[4073]: [QUORUM] This node is within the primary component and will provide service.
May 7 13:30:02 ceph1 corosync[4073]: [QUORUM] Members[7]: 1 2 3 4 5 6 7
May 7 13:30:02 ceph1 corosync[4073]: [MAIN ] Completed service synchronization, ready to provide service.
May 7 13:30:02 ceph1 corosync[4073]: notice [TOTEM ] Retransmit List: 2b 2d 2f
May 7 13:30:02 ceph1 corosync[4073]: [TOTEM ] Retransmit List: 2b 2d 2f
May 7 13:30:02 ceph1 corosync[4073]: notice [TOTEM ] Retransmit List: 2b 2f
May 7 13:30:02 ceph1 corosync[4073]: [TOTEM ] Retransmit List: 2b 2f
May 7 13:30:02 ceph1 corosync[4073]: notice [TOTEM ] Retransmit List: 2f
May 7 13:30:02 ceph1 corosync[4073]: [TOTEM ] Retransmit List: 2f
May 7 13:30:02 ceph1 pmxcfs[4047]: [dcdb] notice: cpg_send_message retried 1 times
May 7 13:30:02 ceph1 pmxcfs[4047]: [status] notice: node has quorum
May 7 13:30:02 ceph1 pmxcfs[4047]: [dcdb] notice: members: 1/4047, 2/3707, 3/3671
May 7 13:30:02 ceph1 pmxcfs[4047]: [status] notice: members: 1/4047, 3/3671
May 7 13:30:02 ceph1 pmxcfs[4047]: [status] notice: starting data syncronisation
May 7 13:30:02 ceph1 pmxcfs[4047]: [dcdb] notice: members: 1/4047, 2/3707, 3/3671, 7/3421
May 7 13:30:02 ceph1 pmxcfs[4047]: [dcdb] notice: members: 1/4047, 2/3707, 3/3671, 6/3691, 7/3421
May 7 13:30:02 ceph1 pmxcfs[4047]: [status] notice: members: 1/4047, 2/3707, 3/3671
May 7 13:30:02 ceph1 pmxcfs[4047]: [dcdb] notice: members: 1/4047, 2/3707, 3/3671, 5/3539, 6/3691, 7/3421
May 7 13:30:02 ceph1 pmxcfs[4047]: [dcdb] notice: members: 1/4047, 2/3707, 3/3671, 4/3421, 5/3539, 6/3691, 7/3421
May 7 13:30:02 ceph1 pmxcfs[4047]: [status] notice: members: 1/4047, 2/3707, 3/3671, 7/3421
May 7 13:30:02 ceph1 pmxcfs[4047]: [status] notice: members: 1/4047, 2/3707, 3/3671, 6/3691, 7/3421
May 7 13:30:02 ceph1 pmxcfs[4047]: [status] notice: members: 1/4047, 2/3707, 3/3671, 5/3539, 6/3691, 7/3421
May 7 13:30:02 ceph1 pmxcfs[4047]: [status] notice: members: 1/4047, 2/3707, 3/3671, 4/3421, 5/3539, 6/3691, 7/3421
May 7 13:30:02 ceph1 pmxcfs[4047]: [dcdb] crit: ignore sync request from wrong member 2/3707
May 7 13:30:02 ceph1 pmxcfs[4047]: [dcdb] notice: received sync request (epoch 2/3707/00000021)
May 7 13:30:02 ceph1 pmxcfs[4047]: [status] crit: ignore sync request from wrong member 2/3707
May 7 13:30:02 ceph1 pmxcfs[4047]: [status] notice: received sync request (epoch 2/3707/0000006F)
May 7 13:30:02 ceph1 corosync[4073]: error [TOTEM ] Marking ringid 1 interface 10.9.6.151 FAULTY
May 7 13:30:02 ceph1 corosync[4073]: [TOTEM ] Marking ringid 1 interface 10.9.6.151 FAULTY
May 7 13:30:02 ceph1 corosync[4073]: notice [TOTEM ] Retransmit List: 33
May 7 13:30:02 ceph1 corosync[4073]: [TOTEM ] Retransmit List: 33
May 7 13:30:02 ceph1 pmxcfs[4047]: [dcdb] notice: received sync request (epoch 1/4047/0000000D)
May 7 13:30:02 ceph1 pmxcfs[4047]: [status] notice: received sync request (epoch 1/4047/0000000B)
May 7 13:30:02 ceph1 pmxcfs[4047]: [dcdb] notice: received sync request (epoch 1/4047/0000000E)
May 7 13:30:02 ceph1 pmxcfs[4047]: [dcdb] notice: received sync request (epoch 1/4047/0000000F)
May 7 13:30:02 ceph1 corosync[4073]: notice [TOTEM ] Retransmit List: 38 3c 40 44 48 4c 4f 51 53 55 57 59 5b 5d 5f 60 62 64 66
May 7 13:30:02 ceph1 corosync[4073]: [TOTEM ] Retransmit List: 38 3c 40 44 48 4c 4f 51 53 55 57 59 5b 5d 5f 60 62 64 66
May 7 13:30:02 ceph1 pmxcfs[4047]: [status] notice: received sync request (epoch 1/4047/0000000C)
May 7 13:30:02 ceph1 pmxcfs[4047]: [dcdb] notice: received sync request (epoch 1/4047/00000010)
May 7 13:30:02 ceph1 pmxcfs[4047]: [dcdb] notice: received sync request (epoch 1/4047/00000011)
May 7 13:30:02 ceph1 pmxcfs[4047]: [status] notice: received sync request (epoch 1/4047/0000000D)
May 7 13:30:02 ceph1 pmxcfs[4047]: [dcdb] notice: received sync request (epoch 1/4047/00000012)
May 7 13:30:02 ceph1 pmxcfs[4047]: [status] notice: received sync request (epoch 1/4047/0000000E)
May 7 13:30:02 ceph1 pmxcfs[4047]: [status] notice: received sync request (epoch 1/4047/0000000F)
May 7 13:30:02 ceph1 corosync[4073]: [TOTEM ] Retransmit List: 6f 48 53 57 5b 5f 62 66 69 6d 71 73 75 78 7a 7c 7f 81 83 86 88 8a
May 7 13:30:02 ceph1 corosync[4073]: notice [TOTEM ] Retransmit List: 6f 48 53 57 5b 5f 62 66 69 6d 71 73 75 78 7a 7c 7f 81 83 86 88 8a
May 7 13:30:02 ceph1 pmxcfs[4047]: [status] notice: received sync request (epoch 1/4047/00000010)
May 7 13:30:02 ceph1 corosync[4073]: notice [TOTEM ] Retransmit List: 8a 9a 5b 62 6d 73 78 7c 86 9c 9e a0 a2 a4 81
May 7 13:30:02 ceph1 corosync[4073]: [TOTEM ] Retransmit List: 8a 9a 5b 62 6d 73 78 7c 86 9c 9e a0 a2 a4 81
May 7 13:30:02 ceph1 corosync[4073]: notice [TOTEM ] Retransmit List: a0 a2 a4 ba bc be c0 c2 c4 c6 c8 ca
May 7 13:30:02 ceph1 corosync[4073]: [TOTEM ] Retransmit List: a0 a2 a4 ba bc be c0 c2 c4 c6 c8 ca
May 7 13:30:02 ceph1 corosync[4073]: notice [TOTEM ] Retransmit List: c2 c4 c6 c8 ca dc
May 7 13:30:02 ceph1 corosync[4073]: [TOTEM ] Retransmit List: c2 c4 c6 c8 ca dc
May 7 13:30:02 ceph1 pmxcfs[4047]: [dcdb] notice: received all states
May 7 13:30:02 ceph1 pmxcfs[4047]: [dcdb] notice: leader is 4/3421
May 7 13:30:02 ceph1 pmxcfs[4047]: [dcdb] notice: synced members: 4/3421
May 7 13:30:02 ceph1 pmxcfs[4047]: [dcdb] notice: waiting for updates from leader
May 7 13:30:02 ceph1 pmxcfs[4047]: [dcdb] notice: update complete - trying to commit (got 2 inode updates)
May 7 13:30:02 ceph1 pmxcfs[4047]: [dcdb] notice: all data is up to date
May 7 13:30:02 ceph1 pmxcfs[4047]: [dcdb] notice: dfsm_deliver_queue: queue length 1
May 7 13:30:02 ceph1 pvesr[39341]: trying to acquire cfs lock 'file-replication_cfg' ...
May 7 13:30:02 ceph1 pmxcfs[4047]: [status] notice: received all states
May 7 13:30:02 ceph1 pmxcfs[4047]: [status] notice: all data is up to date
May 7 13:30:02 ceph1 pmxcfs[4047]: [status] notice: dfsm_deliver_queue: queue length 7
May 7 13:30:03 ceph1 corosync[4073]: notice [TOTEM ] Automatically recovered ring 1
May 7 13:30:03 ceph1 corosync[4073]: [TOTEM ] Automatically recovered ring 1
May 7 13:30:03 ceph1 systemd[1]: Started Proxmox VE replication runner.
May 7 13:30:58 ceph1 pmxcfs[4047]: [status] notice: received log
May 7 13:31:00 ceph1 systemd[1]: Starting Proxmox VE replication runner...
May 7 13:31:00 ceph1 systemd[1]: Started Proxmox VE replication runner.


The omping test is OK.


Corosync.conf:

root@ceph6:~# more /etc/pve/corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: ceph1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.9.5.151
    ring1_addr: 10.9.6.151
  }
  node {
    name: ceph2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.9.5.152
    ring1_addr: 10.9.6.152
  }
  node {
    name: ceph3
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.9.5.153
    ring1_addr: 10.9.6.153
  }
  node {
    name: ceph4
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 10.9.5.154
    ring1_addr: 10.9.6.154
  }
  node {
    name: ceph5
    nodeid: 5
    quorum_votes: 1
    ring0_addr: 10.9.5.155
    ring1_addr: 10.9.6.155
  }
  node {
    name: ceph6
    nodeid: 6
    quorum_votes: 1
    ring0_addr: 10.9.5.156
    ring1_addr: 10.9.6.156
  }
  node {
    name: ceph7
    nodeid: 7
    quorum_votes: 1
    ring0_addr: 10.9.5.157
    ring1_addr: 10.9.6.157
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: clusterceph1
  config_version: 7
  interface {
    bindnetaddr: 10.9.5.151
    ringnumber: 0
  }
  interface {
    bindnetaddr: 10.9.6.151
    ringnumber: 1
  }
  ip_version: ipv4
  rrp_mode: passive
  secauth: on
  version: 2
}
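
Since the log above shows corosync marking a ring FAULTY and then recovering it, a quick way to inspect the ring state on a node is corosync-cfgtool (both options below are part of the stock corosync 2.x tooling):

# show the status of both rings on the local node
corosync-cfgtool -s

# clear a FAULTY state cluster-wide once the underlying problem is fixed
corosync-cfgtool -r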
 
Please post the 'omping' commands you ran and the output. Please also provide the output of 'pveversion -v'. Which hardware do you use? Which NICs? Switches?
 
root@ceph6:~# pveversion -v
proxmox-ve: 5.4-1 (running kernel: 4.15.18-14-pve)
pve-manager: 5.4-5 (running version: 5.4-5/c6fdb264)
pve-kernel-4.15: 5.4-2
pve-kernel-4.15.18-14-pve: 4.15.18-38
pve-kernel-4.15.18-13-pve: 4.15.18-37
pve-kernel-4.15.18-12-pve: 4.15.18-36
ceph: 12.2.12-pve1
corosync: 2.4.4-pve1
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.1-9
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-51
libpve-guest-common-perl: 2.0-20
libpve-http-server-perl: 2.0-13
libpve-storage-perl: 5.0-42
libqb0: 1.0.3-1~bpo9
lvm2: 2.02.168-pve6
lxc-pve: 3.1.0-3
lxcfs: 3.0.3-pve1
novnc-pve: 1.0.0-3
openvswitch-switch: 2.7.0-3
proxmox-widget-toolkit: 1.0-26
pve-cluster: 5.0-37
pve-container: 2.0-37
pve-docs: 5.4-2
pve-edk2-firmware: 1.20190312-1
pve-firewall: 3.0-20
pve-firmware: 2.0-6
pve-ha-manager: 2.0-9
pve-i18n: 1.1-4
pve-libspice-server1: 0.14.1-2
pve-qemu-kvm: 3.0.1-2
pve-xtermjs: 3.12.0-1
qemu-server: 5.0-51
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.13-pve1~bpo2


omping -c 10000 -i 0.001 -F -q 10.9.5.151 10.9.5.152 10.9.5.153 10.9.5.154 10.9.5.155 10.9.5.155 10.9.5.156 10.9.5.157

10.9.5.151 : unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.029/0.082/0.237/0.023
10.9.5.151 : multicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.030/0.084/0.238/0.024
10.9.5.152 : unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.027/0.084/0.230/0.022
10.9.5.152 : multicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.028/0.086/0.232/0.022
10.9.5.153 : unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.027/0.087/0.247/0.025
10.9.5.153 : multicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.029/0.089/0.254/0.025
10.9.5.154 : unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.026/0.102/0.336/0.039
10.9.5.154 : multicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.027/0.105/0.342/0.040
10.9.5.155 : unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.024/0.098/0.246/0.039
10.9.5.155 : multicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.026/0.103/0.275/0.041
10.9.5.157 : unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.025/0.113/0.346/0.048
10.9.5.157 : multicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.026/0.116/0.348/0.049





NICs: Intel Gigabit, igb driver.


Switches: Huawei S5700, with IGMP forwarding/querier enabled.
 
Leave omping running for 10 minutes... If it stops printing "multicast" lines, there's a problem.
Also try disabling multicast snooping on your bridges (and switches?):

echo 0 > /sys/devices/virtual/net/vmbr0/bridge/multicast_snooping

I was blaming the switch the whole time; it turned out the problem was in the bridge on the Proxmox server.
I don't know how to make this persistent... maybe put it in /etc/network/interfaces (see the sketch below).
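
For reference, a minimal sketch of how that could be made persistent in /etc/network/interfaces, assuming a bridge named vmbr0 (the address, netmask and bridge port below are placeholders, adapt them to your setup):

auto vmbr0
iface vmbr0 inet static
        address 192.0.2.10
        netmask 255.255.255.0
        bridge_ports eno1
        bridge_stp off
        bridge_fd 0
        # re-apply the snooping setting each time the bridge comes up
        post-up echo 0 > /sys/devices/virtual/net/vmbr0/bridge/multicast_snooping

post-up is a standard ifupdown hook, so the echo runs automatically whenever vmbr0 is brought up.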
 

I will run the omping test for 10 minutes...


There is no bridge on the corosync interfaces...


more /etc/network/interfaces
auto lo
iface lo inet loopback

auto enp175s0f0
iface enp175s0f0 inet static
        address 10.9.5.156
        netmask 255.255.255.0
#Corosync RING0

auto enp175s0f1
iface enp175s0f1 inet static
        address 10.9.6.156
        netmask 255.255.255.0
#Corosync RING1


These two interfaces are dedicated to corosync; all the other workloads travel over different interfaces...
 
The 10-minute test is OK too.



10.9.5.151 : unicast, xmt/rcv/%loss = 600000/600000/0%, min/avg/max/std-dev = 0.023/0.090/0.918/0.032
10.9.5.151 : multicast, xmt/rcv/%loss = 600000/600000/0%, min/avg/max/std-dev = 0.023/0.093/0.920/0.032
10.9.5.152 : unicast, xmt/rcv/%loss = 600000/600000/0%, min/avg/max/std-dev = 0.023/0.089/0.515/0.028
10.9.5.152 : multicast, xmt/rcv/%loss = 600000/600000/0%, min/avg/max/std-dev = 0.025/0.092/0.516/0.028
10.9.5.153 : unicast, xmt/rcv/%loss = 600000/600000/0%, min/avg/max/std-dev = 0.023/0.090/2.861/0.030
10.9.5.153 : multicast, xmt/rcv/%loss = 600000/600000/0%, min/avg/max/std-dev = 0.025/0.093/2.910/0.030
10.9.5.155 : unicast, xmt/rcv/%loss = 600000/600000/0%, min/avg/max/std-dev = 0.023/0.109/0.376/0.052
10.9.5.155 : multicast, xmt/rcv/%loss = 600000/600000/0%, min/avg/max/std-dev = 0.025/0.113/0.379/0.053
10.9.5.156 : unicast, xmt/rcv/%loss = 600000/600000/0%, min/avg/max/std-dev = 0.023/0.092/0.416/0.034
10.9.5.156 : multicast, xmt/rcv/%loss = 600000/600000/0%, min/avg/max/std-dev = 0.025/0.094/0.420/0.035
 

Did you update anything before this started to happen (PVE, BIOS, firmware, anything)?
It's strange that it takes 5 minutes to regain quorum. Could you try the omping command during this time?
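
For example, something along the lines of the command you used earlier, but spread over roughly ten minutes (600 probes at a one-second interval), started on all nodes shortly before node 4 is powered back on:

omping -c 600 -i 1 -q 10.9.5.151 10.9.5.152 10.9.5.153 10.9.5.154 10.9.5.155 10.9.5.156 10.9.5.157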
 
OK, issue resolved.
It was my fault: the IGMP querier was not enabled on the VLAN at the switch. I had changed the VLANs and forgot to re-enable it.

I apologize for the inconvenience caused.

Thank you very much for spending your time and brainpower on this.
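
For anyone hitting the same symptom: the fix was purely on the switch side. On a Huawei S5700 the per-VLAN querier setup looks roughly like the sketch below (the VLAN ID 905 is just a placeholder for the corosync VLAN, and the exact command names should be checked against the switch documentation):

system-view
 igmp-snooping enable
 vlan 905
  igmp-snooping enable
  igmp-snooping querier enable
 return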
 
Glad you solved it! Please mark the thread as solved.
 
