Hello,
I thought I had posted this already, but I can't seem to locate the post. Forgive me if I'm mistaken.
I have a 4-node cluster running:
proxmox-ve: 5.4-2 (running kernel: 4.15.18-24-pve)
pve-manager: 5.4-13 (running version: 5.4-13/aee6f0ec)
corosync: 2.4.4-pve1
The nodes do not share any disks and are clustered only for convenience of management.
Last night, I received an email because the backups (vzdump) failed on the nodes. Looking into it, I saw that they all failed with the same message:
INFO: starting new backup job: vzdump --mode snapshot --mailto redacted@redacted.com --compress lzo --mailnotification failure --storage vrtx-backup --quiet 1 --all 1
INFO: Starting Backup of VM 100 (qemu)
INFO: Backup started at 2020-05-14 00:00:02
INFO: status = stopped
INFO: unable to open file '/etc/pve/nodes/slot-1/qemu-server/100.conf.tmp.30553' - Permission denied
INFO: update VM 100: -lock backup
ERROR: Backup of VM 100 failed - command 'qm set 100 --lock backup' failed: exit code 2
INFO: Failed at 2020-05-14 00:00:03
I logged into the server to make sure it wasn't a disk space issue, and it was not.
I then noticed that the web manager (https://ip.address:8006) wasn't showing all the nodes online. On each node I logged into, it showed only that node as online, with the rest offline. So I went back to the command line and ran pvecm status, which showed the same thing on every host: each node lists only itself as a member.
Quorum information
------------------
Date: Thu May 14 17:46:34 2020
Quorum provider: corosync_votequorum
Nodes: 1
Node ID: 0x00000001
Ring ID: 1/110056
Quorate: No
Votequorum information
----------------------
Expected votes: 4
Highest expected: 4
Total votes: 1
Quorum: 3 Activity blocked
Flags:
Membership information
----------------------
Nodeid Votes Name
0x00000001 1 10.0.100.71 (local)
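To sanity-check the numbers in that output, here is a minimal sketch of the majority arithmetic as I understand it. It assumes the default votequorum behaviour (strict majority of expected votes, no two_node or similar options set); the function names are my own and not part of any Proxmox or corosync tool.

```python
def majority_quorum(expected_votes: int) -> int:
    """Smallest number of votes that is a strict majority of expected_votes."""
    return expected_votes // 2 + 1


def is_quorate(total_votes: int, expected_votes: int) -> bool:
    """True when the votes currently present reach the majority threshold."""
    return total_votes >= majority_quorum(expected_votes)


if __name__ == "__main__":
    expected = 4  # "Expected votes: 4" from pvecm status
    total = 1     # "Total votes: 1" from pvecm status
    print("quorum needed:", majority_quorum(expected))
    print("quorate:", is_quorate(total, expected))
```

With 4 expected votes the threshold is 3, which matches the "Quorum: 3 Activity blocked" line, so a node that only sees its own single vote cannot become quorate.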
I restarted corosync, which didn't resolve anything. I also restarted pve-cluster, again with no change.
I did some searching online and found a suggestion to switch corosync from multicast to unicast by adding "transport: udpu" to /etc/pve/corosync.conf. I tried, but it would not let me save the file, so I thought perhaps the filesystem was in read-only mode for some reason. At that point I performed a clean restart of node-3 and node-4, as they are there for redundancy, but once they came back up nothing had changed. All 4 nodes still see only themselves.
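For reference, the edit I was attempting looked roughly like the fragment below. This is only a sketch: the cluster_name and config_version values are placeholders rather than my real settings, and the rest of the totem section is omitted. The only line I was actually trying to add is the transport one.

```
totem {
  version: 2
  # placeholder name, not my real cluster name
  cluster_name: example-cluster
  # placeholder value; as far as I know this has to be incremented on each change
  config_version: 5
  # the line suggested for switching from multicast to unicast
  transport: udpu
}
```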
Any advice or suggestions would be appreciated.