No Quorum

Jim Surles

Member
May 29, 2018
Hello,

I thought I had posted this already, but I can't seem to locate the post. Forgive me if I'm mistaken.

I have a 4 node cluster running:

proxmox-ve: 5.4-2 (running kernel: 4.15.18-24-pve)
pve-manager: 5.4-13 (running version: 5.4-13/aee6f0ec)
corosync: 2.4.4-pve1

They do not share any disks and are clustered only for the convenience of management.

Last night I received an email because the backups (vzdump) failed on the nodes. Checking into it, I see that they all failed with the same message:

INFO: starting new backup job: vzdump --mode snapshot --mailto redacted@redacted.com --compress lzo --mailnotification failure --storage vrtx-backup --quiet 1 --all 1
INFO: Starting Backup of VM 100 (qemu)
INFO: Backup started at 2020-05-14 00:00:02
INFO: status = stopped
INFO: unable to open file '/etc/pve/nodes/slot-1/qemu-server/100.conf.tmp.30553' - Permission denied
INFO: update VM 100: -lock backup
ERROR: Backup of VM 100 failed - command 'qm set 100 --lock backup' failed: exit code 2
INFO: Failed at 2020-05-14 00:00:03

I logged into the server to ensure it wasn't a disk space issue, and it was not.
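
For what it's worth, the check was just plain df; the backup storage path below is a guess (on PVE, network storages usually get mounted under /mnt/pve/<storage-id>), so adjust it to wherever vrtx-backup actually lives.

Code:
df -h
df -h /mnt/pve/vrtx-backup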

I then noticed in the web manager (https://ip.address:8006) that it wasn't showing all the nodes online. On each node I logged into, only that node showed as online, with the rest offline. So I went back to the command line and checked pvecm status, which showed the same thing on all hosts: each one only saw itself online.

Quorum information
------------------
Date: Thu May 14 17:46:34 2020
Quorum provider: corosync_votequorum
Nodes: 1
Node ID: 0x00000001
Ring ID: 1/110056
Quorate: No

Votequorum information
----------------------
Expected votes: 4
Highest expected: 4
Total votes: 1
Quorum: 3 Activity blocked
Flags:

Membership information
----------------------
Nodeid Votes Name
0x00000001 1 10.0.100.71 (local)

I restarted corosync, which didn't resolve anything. I also restarted pve-cluster, again with no results.
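
For completeness, the restarts were done through systemd, roughly like this:

Code:
systemctl restart corosync
systemctl restart pve-cluster
systemctl status corosync pve-cluster --no-pager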

I did some searching online and found a suggestion to switch corosync from multicast to unicast by adding "transport: udpu" to /etc/pve/corosync.conf. I tried, but it would not let me save the file, so I thought perhaps the filesystem was in read-only mode for whatever reason. At that point I performed a clean restart of node-3 and node-4, as they are there for redundancy, but once they came back up nothing had changed. All 4 nodes still only see themselves.
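
For reference, the edit I was attempting looked roughly like the totem section below; apart from the transport line, the values are approximations from memory (the cluster name, config_version and bind address will differ on the real cluster):

Code:
totem {
  # name and version below are placeholders;
  # config_version has to be bumped whenever the file is changed
  cluster_name: vrtx-cluster
  config_version: 5
  version: 2
  secauth: on
  # the line I was trying to add, to switch from multicast to unicast
  transport: udpu
  interface {
    # guessed from the node addresses (10.0.100.x)
    bindnetaddr: 10.0.100.0
    ringnumber: 0
  }
}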

Any advice or suggestions would be appreciated.
 
Hi,

FYI: your earlier post was caught in the forum spam filter, waiting for approval; I dropped it now to avoid a duplicate post.

Please check (and share) some logs, either via the web interface's node Syslog panel or over SSH with journalctl -b -u corosync -u pve-cluster; we need to get at the "real" error messages.
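
For example, something along these lines; the time window is just an illustration, adjust it to roughly when the problem started:

Code:
journalctl -b -u corosync -u pve-cluster --since "2020-05-13 09:00" --until "2020-05-13 12:00"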

Also, it seems that this happened "suddenly". Was anything changed in the network shortly before that?
 
Hi,

This does seem quite sudden. To my knowledge there were no changes. This is in a remote datacenter, and there are no records of anyone logging in to any of the network gear or the Proxmox hypervisors.

This cluster does run in a VRTX chassis, and I am hesitant to restart the whole thing. Looking at the logs you suggested, I see things like the following.

I see a lot of this, over and over:
Code:
May 13 10:35:35 slot-1 corosync[3489]: notice  [TOTEM ] Retransmit List: 1 3 4 5
May 13 10:35:35 slot-1 corosync[3489]:  [TOTEM ] Retransmit List: 1 3 4 5
May 13 10:35:35 slot-1 corosync[3489]: notice  [TOTEM ] Retransmit List: 1 3 4 5
May 13 10:35:35 slot-1 corosync[3489]:  [TOTEM ] Retransmit List: 1 3 4 5

This seems to be around when the logs start deviating from the norm.

Code:
May 13 07:53:29 slot-1 pmxcfs[2238]: [dcdb] notice: data verification successful
May 13 08:53:29 slot-1 pmxcfs[2238]: [dcdb] notice: data verification successful
May 13 09:53:29 slot-1 pmxcfs[2238]: [dcdb] notice: data verification successful
May 13 10:35:30 slot-1 corosync[3489]: notice  [TOTEM ] Retransmit List: aeaccf
May 13 10:35:30 slot-1 corosync[3489]:  [TOTEM ] Retransmit List: aeaccf
May 13 10:35:30 slot-1 corosync[3489]: notice  [TOTEM ] Retransmit List: aeacb8 aeacb9 aeaccf
May 13 10:35:30 slot-1 corosync[3489]:  [TOTEM ] Retransmit List: aeacb8 aeacb9 aeaccf
May 13 10:35:30 slot-1 corosync[3489]: notice  [TOTEM ] Retransmit List: aeacb5 aeacb7 aeacb8 aeacb9 aeaccf
May 13 10:35:30 slot-1 corosync[3489]:  [TOTEM ] Retransmit List: aeacb5 aeacb7 aeacb8 aeacb9 aeaccf
May 13 10:35:30 slot-1 corosync[3489]: notice  [TOTEM ] Retransmit List: aeacb4 aeacb5 aeacb7 aeacb8 aeacb9 aeaccf
May 13 10:35:30 slot-1 corosync[3489]:  [TOTEM ] Retransmit List: aeacb4 aeacb5 aeacb7 aeacb8 aeacb9 aeaccf
May 13 10:35:30 slot-1 corosync[3489]: notice  [TOTEM ] Retransmit List: aeacb3 aeacb4 aeacb5 aeacb7 aeacb8 aeacb9 aeaccf
May 13 10:35:30 slot-1 corosync[3489]:  [TOTEM ] Retransmit List: aeacb3 aeacb4 aeacb5 aeacb7 aeacb8 aeacb9 aeaccf
May 13 10:35:30 slot-1 corosync[3489]: notice  [TOTEM ] Retransmit List: aeacb2 aeacb3 aeacb4 aeacb5 aeacb7 aeacb8 aeacb9 aeaccf
May 13 10:35:30 slot-1 corosync[3489]:  [TOTEM ] Retransmit List: aeacb2 aeacb3 aeacb4 aeacb5 aeacb7 aeacb8 aeacb9 aeaccf
May 13 10:35:30 slot-1 corosync[3489]: notice  [TOTEM ] Retransmit List: aeacb1 aeacb2 aeacb3 aeacb4 aeacb5 aeacb7 aeacb8 aeacb9 aeaccf
May 13 10:35:30 slot-1 corosync[3489]:  [TOTEM ] Retransmit List: aeacb1 aeacb2 aeacb3 aeacb4 aeacb5 aeacb7 aeacb8 aeacb9 aeaccf
May 13 10:35:30 slot-1 corosync[3489]: notice  [TOTEM ] Retransmit List: aeacb0 aeacb1 aeacb2 aeacb3 aeacb4 aeacb5 aeacb7 aeacb8 aeacb9 aeaccf
May 13 10:35:30 slot-1 corosync[3489]:  [TOTEM ] Retransmit List: aeacb0 aeacb1 aeacb2 aeacb3 aeacb4 aeacb5 aeacb7 aeacb8 aeacb9 aeaccf
May 13 10:35:30 slot-1 corosync[3489]: notice  [TOTEM ] Retransmit List: aeacb0 aeacb1 aeacb2 aeacb3 aeacb4 aeacb5 aeacb7 aeacb8 aeacb9 aeaccf
May 13 10:35:30 slot-1 corosync[3489]:  [TOTEM ] Retransmit List: aeacb0 aeacb1 aeacb2 aeacb3 aeacb4 aeacb5 aeacb7 aeacb8 aeacb9 aeaccf
May 13 10:35:30 slot-1 corosync[3489]: notice  [TOTEM ] Retransmit List: aeacb0 aeacb1 aeacb2 aeacb3 aeacb4 aeacb5 aeacb7 aeacb8 aeacb9 aeaccf
May 13 10:35:30 slot-1 corosync[3489]:  [TOTEM ] Retransmit List: aeacb0 aeacb1 aeacb2 aeacb3 aeacb4 aeacb5 aeacb7 aeacb8 aeacb9 aeaccf
May 13 10:35:30 slot-1 corosync[3489]: notice  [TOTEM ] Retransmit List: aeacb0 aeacb1 aeacb2 aeacb3 aeacb4 aeacb5 aeacb7 aeacb8 aeacb9 aeaccf
May 13 10:35:30 slot-1 corosync[3489]:  [TOTEM ] Retransmit List: aeacb0 aeacb1 aeacb2 aeacb3 aeacb4 aeacb5 aeacb7 aeacb8 aeacb9 aeaccf
May 13 10:35:30 slot-1 corosync[3489]: notice  [TOTEM ] Retransmit List: aeacb0 aeacb1 aeacb2 aeacb3 aeacb4 aeacb5 aeacb7 aeacb8 aeacb9 aeaccf
May 13 10:35:30 slot-1 corosync[3489]:  [TOTEM ] Retransmit List: aeacb0 aeacb1 aeacb2 aeacb3 aeacb4 aeacb5 aeacb7 aeacb8 aeacb9 aeaccf
May 13 10:35:30 slot-1 corosync[3489]: notice  [TOTEM ] Retransmit List: aeacb0 aeacb1 aeacb2 aeacb3 aeacb4 aeacb5 aeacb7 aeacb8 aeacb9 aeaccf
May 13 10:35:30 slot-1 corosync[3489]:  [TOTEM ] Retransmit List: aeacb0 aeacb1 aeacb2 aeacb3 aeacb4 aeacb5 aeacb7 aeacb8 aeacb9 aeaccf
May 13 10:35:30 slot-1 corosync[3489]: notice  [TOTEM ] Retransmit List: aeacb0 aeacb1 aeacb2 aeacb3 aeacb4 aeacb5 aeacb7 aeacb8 aeacb9 aeaccf
May 13 10:35:30 slot-1 corosync[3489]:  [TOTEM ] Retransmit List: aeacb0 aeacb1 aeacb2 aeacb3 aeacb4 aeacb5 aeacb7 aeacb8 aeacb9 aeaccf
May 13 10:35:30 slot-1 corosync[3489]: notice  [TOTEM ] Retransmit List: aeacb0 aeacb1 aeacb2 aeacb3 aeacb4 aeacb5 aeacb7 aeacb8 aeacb9 aeaccf
May 13 10:35:30 slot-1 corosync[3489]:  [TOTEM ] Retransmit List: aeacb0 aeacb1 aeacb2 aeacb3 aeacb4 aeacb5 aeacb7 aeacb8 aeacb9 aeaccf
May 13 10:35:30 slot-1 corosync[3489]: notice  [TOTEM ] Retransmit List: aeacb0 aeacb1 aeacb2 aeacb3 aeacb4 aeacb5 aeacb7 aeacb8 aeacb9 aeaccf
May 13 10:35:30 slot-1 corosync[3489]:  [TOTEM ] Retransmit List: aeacb0 aeacb1 aeacb2 aeacb3 aeacb4 aeacb5 aeacb7 aeacb8 aeacb9 aeaccf
May 13 10:35:30 slot-1 corosync[3489]: notice  [TOTEM ] Retransmit List: aeacb0 aeacb1 aeacb2 aeacb3 aeacb4 aeacb5 aeacb7 aeacb8 aeacb9 aeaccf
May 13 10:35:30 slot-1 corosync[3489]:  [TOTEM ] Retransmit List: aeacb0 aeacb1 aeacb2 aeacb3 aeacb4 aeacb5 aeacb7 aeacb8 aeacb9 aeaccf
May 13 10:35:30 slot-1 corosync[3489]: notice  [TOTEM ] Retransmit List: aeacb0 aeacb1 aeacb2 aeacb3 aeacb4 aeacb5 aeacb7 aeacb8 aeacb9 aeaccf
May 13 10:35:30 slot-1 corosync[3489]:  [TOTEM ] Retransmit List: aeacb0 aeacb1 aeacb2 aeacb3 aeacb4 aeacb5 aeacb7 aeacb8 aeacb9 aeaccf
May 13 10:35:30 slot-1 corosync[3489]: notice  [TOTEM ] Retransmit List: aeacb0 aeacb1 aeacb2 aeacb3 aeacb4 aeacb5 aeacb7 aeacb8 aeacb9 aeaccf
May 13 10:35:30 slot-1 corosync[3489]:  [TOTEM ] Retransmit List: aeacb0 aeacb1 aeacb2 aeacb3 aeacb4 aeacb5 aeacb7 aeacb8 aeacb9 aeaccf
May 13 10:35:30 slot-1 corosync[3489]: notice  [TOTEM ] Retransmit List: aeacb0 aeacb1 aeacb2 aeacb3 aeacb4 aeacb5 aeacb7 aeacb8 aeacb9 aeaccf
May 13 10:35:30 slot-1 corosync[3489]:  [TOTEM ] Retransmit List: aeacb0 aeacb1 aeacb2 aeacb3 aeacb4 aeacb5 aeacb7 aeacb8 aeacb9 aeaccf
May 13 10:35:32 slot-1 corosync[3489]: notice  [TOTEM ] A processor failed, forming new configuration.
May 13 10:35:32 slot-1 corosync[3489]:  [TOTEM ] A processor failed, forming new configuration.
May 13 10:35:32 slot-1 corosync[3489]: notice  [TOTEM ] A new membership (10.0.100.71:109992) was formed. Members left: 2
May 13 10:35:32 slot-1 corosync[3489]: notice  [TOTEM ] Failed to receive the leave message. failed: 2
May 13 10:35:32 slot-1 corosync[3489]:  [TOTEM ] A new membership (10.0.100.71:109992) was formed. Members left: 2
May 13 10:35:32 slot-1 corosync[3489]:  [TOTEM ] Failed to receive the leave message. failed: 2
May 13 10:35:32 slot-1 corosync[3489]: warning [CPG   ] downlist left_list: 1 received
May 13 10:35:32 slot-1 corosync[3489]:  [CPG   ] downlist left_list: 1 received
May 13 10:35:32 slot-1 corosync[3489]: warning [CPG   ] downlist left_list: 1 received
May 13 10:35:32 slot-1 corosync[3489]:  [CPG   ] downlist left_list: 1 received
May 13 10:35:32 slot-1 corosync[3489]:  [CPG   ] downlist left_list: 1 received
May 13 10:35:32 slot-1 corosync[3489]: notice  [QUORUM] Members[3]: 1 3 4
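
In case it is relevant: given all those retransmit lists (which is also why I tried the unicast change above), it looks to me like multicast between the nodes may be getting lost somewhere. Would a multicast test between the nodes with omping, roughly like the following, be a sensible next step? The .72-.74 addresses are placeholders for the other three nodes.

Code:
# run simultaneously on all four nodes
omping -c 10000 -i 0.001 -F -q 10.0.100.71 10.0.100.72 10.0.100.73 10.0.100.74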
 
