Hi Folks,
Long-time Proxmox user and supporter (I help on IRC where I can).
One of the clusters I work with has been 2 nodes for years now. Recently 2 more nodes were added to the cluster, with the intent of being temporary "lab" space. As such, these nodes are powered down out of hours to save power (they're old hardware and not very efficient).
Now, the problem I have is with backups. For those who don't already see the issue: when the cluster has 4 nodes and 2 are turned off, it loses quorum and goes into an "emergency" state of sorts. This happens because 50% of the nodes are offline, so the remaining 2 no longer hold a majority of the votes. Proxmox's default quorum configuration treats this as the point at which consistency can no longer reasonably be guaranteed in the cluster (seems like a reasonable default IMO).
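To make the arithmetic concrete, here is a minimal sketch of the majority rule described above. It assumes one vote per node and the plain "more than half" rule, with no special two-node or tie-breaker options; the function names are made up for illustration and are not part of any Proxmox tool.

```python
def quorum_threshold(expected_votes: int) -> int:
    """Smallest vote count that is a strict majority of expected_votes."""
    return expected_votes // 2 + 1


def is_quorate(expected_votes: int, votes_online: int) -> bool:
    """True if the online votes reach the majority threshold."""
    return votes_online >= quorum_threshold(expected_votes)


if __name__ == "__main__":
    # (expected votes, votes currently online)
    for expected, online in [(4, 4), (4, 3), (4, 2), (2, 2), (2, 1)]:
        state = "quorate" if is_quorate(expected, online) else "NOT quorate"
        print(f"expected={expected} online={online} "
              f"needs={quorum_threshold(expected)} -> {state}")
```

With 4 expected votes the threshold is 3, so 2 online nodes fall short. If the expected vote count were 2, the threshold would be 2, which the two always-on nodes would meet only while both are up.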
So what actual issue does this cause me? It breaks my nightly backups of the VMs. This is using the built-in backup mechanism, to an NFS share, nothing like Veeam or whatever. Backups have been veerrrry reliable up until I found this hiccup.
The logs show that certain file locks cannot be obtained when trying to back up (on ALL the VMs listed for backup):
"
INFO: Starting Backup of VM 212 (qemu)
INFO: status = running
INFO: unable to open file '/etc/pve/nodes/REDACTED/qemu-server/212.conf.tmp.10682' - Permission denied
INFO: update VM 212: -lock backup
ERROR: Backup of VM 212 failed - command 'qm set 212 --lock backup' failed: exit code 2
INFO: Starting Backup of VM 400 (qemu)
INFO: status = running
INFO: unable to open file '/etc/pve/nodes/REDACTED/qemu-server/400.conf.tmp.10692' - Permission denied
INFO: update VM 400: -lock backup
ERROR: Backup of VM 400 failed - command 'qm set 400 --lock backup' failed: exit code 2
"
I'm quite sure this is because the clustered filesystem is in a read-only (RO) state due to the 50% threshold mentioned earlier being met.
Anyway, I see two "solutions", but I like neither of them and was hoping a third could be found:
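One way to confirm the read-only suspicion before a backup run is to try creating a throwaway file in the directory in question. The sketch below is only an illustration: the helper name and the "rwcheck-" prefix are made up, and I have not verified how the clustered filesystem under /etc/pve reacts to temp files created this way, so treat it as a generic writability probe.

```python
import tempfile


def is_writable(directory: str) -> bool:
    """Return True if a temporary file can be created in `directory`.

    The file is removed again as soon as it is closed. Any OSError
    (permission denied, read-only filesystem, missing directory)
    is treated as "not writable".
    """
    try:
        with tempfile.NamedTemporaryFile(dir=directory, prefix="rwcheck-"):
            pass
        return True
    except OSError:
        return False


if __name__ == "__main__":
    # /etc/pve is the path from the log above; probing it here is an assumption.
    path = "/etc/pve"
    print(f"{path} writable: {is_writable(path)}")
```

If this prints False while the lab nodes are off and True while they are on, that would support the theory that the lock failures come from the filesystem going read-only.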
- Leave the "lab" nodes on all the time. The backups start at 1am and go for 4hrs-ish. This is known to work as it happened successfully last night.
- Remove those two nodes from this cluster, and do something else with them.
I have tried, with no success, telling pvecm to expect 2 votes; however, that value doesn't seem to _change_ OR _stick_. Furthermore, the only docs I can find about this are for Proxmox cluster v2; I can't find anything for v3 or v4.
Also, it's worth noting that the nodes are not in an HA cluster, just a regular one. I don't need HA clustering, and the first 2 nodes can't be fenced at this time anyway.
Well, I'd love to hear your thoughts, so please share them. Thanks peeps.