HA and backup via NFS

LnxBil

I'd like to know how you're doing backups in an HA environment. We have a Proxmox VE cluster running on a SAN with every component at least duplicated, so anything can fail. At least in theory. Our weak spot is our backup system. Currently we're backing up via NFS to a server (server-grade hardware), but it crashed recently and stale NFS handles are still a pain in the ... The Proxmox VE GUI hangs (first the graphs, due to problems with data acquisition, then all nodes including VMs appear offline). The machines keep running, so functionality is not affected, but you cannot control any machine via the GUI anymore, at least until the NFS server comes back up and all connections are reestablished.

Our backup server has local disks with ZFS, so a clustered NFS solution is not easily applicable.

Best,
LnxBil
 
Hi robhost,

Yes, simply waiting also works (it just takes longer), but manual intervention and waiting is not a good HA solution. Pushing the whole infrastructure to the limit with respect to high availability and then crippling everything with NFS just feels wrong.

I've been dealing with 'stale NFS handles' for almost two decades now. NFS is bad, but still better than anything else I've tried over the years. Samba works, of course, but has other drawbacks. Maybe I'll give it another try, but I'm open to other solutions or ideas.
 
Hi,

yes, this is really uncool ;-) But in our environment we also still haven't found a better-working alternative to NFS as a backup resource for PVE.

I'm wondering why PVE still uses NFSv3 by default and not v4. AFAIK you can change this via the "options" line in /etc/pve/storage.cfg.
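For example, an NFS storage entry in /etc/pve/storage.cfg with NFSv4 requested via the options line might look like the following; the storage name, server IP and export path here are placeholders, and I haven't tested this on every PVE version:

nfs: backup-nfs
        server <IP.of.your.backupserver>
        export /tank/backup
        path /mnt/pve/backup-nfs
        content backup
        options vers=4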
 
We're facing the same problem. But I've already done some tests to get rid of the NFS share and stream vzdump to stdout over SSH from the backup server. It works perfectly, but I need to script the whole procedure to log in to every PVE node and run vzdump for each VMID on the right node. Here are some rough steps so you can try what I describe.
1. Create an SSH key for root on your backup server (ssh-keygen)
2. Copy your public key (id_rsa.pub) to /etc/pve/priv/authorized_keys (you can add it manually or use ssh-copy-id)
3. Log in from your backup server to all your nodes to confirm the authenticity of each host and to verify that you have a working connection: ssh root@<IP.of.your.PVEnode>
4. Make sure that the local storage (defined in PVE at Datacenter->Storage) accepts VZDump backup files as content. IMO this is a bug in vzdump: if you run it with -stdout and don't specify a destination storage, vzdump assumes local storage as the destination. The backup is not actually stored on local storage, it's just a check inside vzdump.
5. Try to back up a VM. Log in as root on your backup server and run the following command (adjust accordingly):
ssh root@<IP.of.your.PVEnode> 'vzdump <VMID> -compress lzo -mode snapshot -stdout' > /path/to/backupstorage/vzdump-qemu-<VMID>-2016_11_04-12_40_00.vma.lzo 2> /path/to/backupstorage/vzdump-qemu-<VMID>-2016_11_04-12_40_00.log
The backup will be compressed on your PVE node, the compressed data will be streamed to the specified file on your backup server, and the command output will be logged, just like Proxmox normally does. You can add options to vzdump to be warned by email on failure, like this:
-mailnotification failure -mailto your@emailaddress.tld
See https://pve.proxmox.com/wiki/VZDump

There is no way to list these backups in the PVE GUI now, or to restore them directly. I'll experiment with a read-only NFS share defined in /etc/fstab on the PVE nodes, adding that mount point as a directory storage in PVE. It's similar to the Samba share procedure (https://pve.proxmox.com/wiki/Storag...MBa.29_share_on_Proxmox_VE_via_.2Fetc.2Ffstab). But I'm not sure whether the GUI fails when the NFS server is unreachable; if so, this won't help and we can try Samba or SSHFS instead.
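As a side note, restoring without the share might work the same way in reverse; this is an untested sketch, assuming qmrestore accepts "-" to read the archive from standard input, with the VMID and target storage as placeholders:

lzop -d -c /path/to/backupstorage/vzdump-qemu-<VMID>-2016_11_04-12_40_00.vma.lzo | ssh root@<IP.of.your.PVEnode> 'qmrestore - <VMID> -storage local'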

When I have more time I'm going to script it. A simple check for which VMIDs are on a node: ssh root@<IP.of.your.PVEnode> 'ls -1 /etc/pve/qemu-server | cut -d "." -f 1'

My script is going to look roughly like this:
1. Define your nodes
2. Loop through the nodes and list the VMIDs in a nested loop
3. For every VMID, check if a new backup needs to be created
4. Go to the next VMID if the backup file is recent
5. If the backup is too old, add a command that removes the oldest backup and creates a new one to a list of jobs
6. If a backup job was added to the list of commands, skip the rest of the nested loop and go back to the loop over the nodes. This prevents multiple backup jobs from running in parallel on the same node.
7. When the loop over all nodes is finished, we have a list of backup jobs - at most one per node.
8. Run the command list through GNU Parallel (https://www.gnu.org/software/parallel/) to execute all commands simultaneously on all nodes
9. When it's finished, re-run steps 2-8 until no more backup jobs need to be created.

I think steps 6-9 need some more intelligence. It would be better to build a queue of all backup jobs first and then run at most one job per node, starting the next job for a node as soon as the previous one finishes. But I'm not sure yet how to accomplish this.
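For what it's worth, here is a rough, untested sketch of the simple per-pass variant following steps 1-9 above; the node IPs, backup path and age threshold are placeholders, and it assumes passwordless SSH as root plus GNU Parallel on the backup server:

#!/bin/bash
# Untested sketch of the backup loop described above.
NODES="<IP.of.node1> <IP.of.node2>"          # 1. define your nodes
BACKUP_DIR=/path/to/backupstorage
MAX_AGE_DAYS=1                               # back up again if older than this

while :; do                                  # 9. repeat until nothing is queued
    JOBS=$(mktemp)
    STAMP=$(date +%Y_%m_%d-%H_%M_%S)
    for node in $NODES; do                   # 2. loop through the nodes
        for vmid in $(ssh root@"$node" 'ls -1 /etc/pve/qemu-server | cut -d "." -f 1'); do
            # 3./4. skip this VMID if a recent backup already exists
            find "$BACKUP_DIR" -name "vzdump-qemu-${vmid}-*.vma.lzo" -mtime -"$MAX_AGE_DAYS" | grep -q . && continue
            # 5. queue one backup job (removal of the oldest backup is left out here)
            f="$BACKUP_DIR/vzdump-qemu-${vmid}-${STAMP}"
            echo "ssh root@$node 'vzdump $vmid -compress lzo -mode snapshot -stdout' > $f.vma.lzo 2> $f.log" >> "$JOBS"
            break                            # 6. at most one job per node per pass
        done
    done
    [ -s "$JOBS" ] || { rm -f "$JOBS"; break; }   # 7. no jobs queued, we're done
    parallel -j0 < "$JOBS"                   # 8. run all queued jobs simultaneously
    rm -f "$JOBS"
done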
 
Thank you @hansm. We also thought about such a setup, but the "cannot be restored easily" part was crucial for us. We also thought about mounting the NFS share on demand using automount, but statd "needs" the storage almost every minute, so that was no good either.
 
