Backups failing, Proxmox configuration problem or NFS issue?

jerel

Hello all, I have a problem with our cluster that I've been working on for quite some time now and I haven't been able to work out what the issue is. We have a two-node cluster that runs VMs on local storage, snapshots all VMs nightly, and stores the .lzo archives on an NFS server. The problem is that the snapshot backups hang and never finish. When they hang they take down the web interface, and occasionally the VMs have become unavailable on the network (exciting). From the logs it seems that the snapshot itself works fine; it's the transfer that never finishes.

Here are the items modified in /etc/vzdump.conf (bwlimit is in KB/s and size is the LVM snapshot size in MB):
Code:
bwlimit: 50000
size: 15000

And my modified /etc/pve/storage.cfg:
Code:
nfs: file-server
        path /mnt/pve/file-server
        server 192.168.100.10
        export /home/user/storage/cluster
        options vers=3,soft,rsize=32768,wsize=32768
        content images,iso,vztmpl,rootdir,backup
        maxfiles 5
The NFS server is running Ubuntu 12.04, connected through a single gigabit switch, and the exports look like this:
Code:
/home/user/storage/cluster 192.168.100.1(rw,sync,no_subtree_check) 192.168.100.2(rw,sync,no_subtree_check)
When a backup hangs the vzdump log looks like this:
Code:
Sep 24 21:31:56 INFO: Starting Backup of VM 101 (qemu)
Sep 24 21:31:56 INFO: status = running
Sep 24 21:31:57 INFO: backup mode: snapshot
Sep 24 21:31:57 INFO: bandwidth limit: 50000 KB/s
Sep 24 21:31:57 INFO: ionice priority: 7
Sep 24 21:31:57 INFO:   Logical volume "vzsnap-intel-0" created
Sep 24 21:31:57 INFO: creating archive '/mnt/pve/file-server/dump/vzdump-qemu-101-2012_09_24-21_31_56.tar.lzo'
Sep 24 21:31:57 INFO: adding '/mnt/pve/file-server/dump/vzdump-qemu-101-2012_09_24-21_31_56.tmp/qemu-server.conf' to archive ('qemu-server.c$
Sep 24 21:31:57 INFO: adding '/mnt/vzsnap0/images/101/vm-101-disk-1.vmdk' to archive ('vm-disk-ide0.vmdk')
The modification date of the .dat file always shows the current time, but the size either doesn't change or grows so little that it isn't noticeable. It looks like it would stay like this forever, or possibly complete in a few days if I let it continue.
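
To see whether it's fully stalled or just crawling, I've been watching the dump directory directly; something along these lines (the once-a-minute interval is arbitrary):
Code:
# list the in-progress archive in megabytes once a minute to
# gauge the actual transfer rate
watch -n 60 'ls -l --block-size=M /mnt/pve/file-server/dump/'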

The VM that it usually fails on (the first large machine in line) generates a ~75GB archive when the backup succeeds, so it is a decently large disk. It does seem like anything over 10GB is fair game for trouble though. When it hangs I have to reboot the NFS server to recover and then restart services on the Proxmox node to get the web interface back. Rebooting the Proxmox node itself will at least let the next nightly backup session complete without issue. With all the default configs it made it 4 nights without hanging, and with the modified configs it made it one night. But I started backups of some large machines manually with the old configs and they never completed, so I don't think the config tweaks are helping or hurting.

Running top on the NFS server during a large backup shows something like this:
Code:
Cpu(s):  0.1%us,  0.8%sy,  0.0%ni, 72.6%id, 25.5%wa,  0.0%hi,  0.9%si,  0.0%st
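
That 25% iowait made me want to test the raw NFS write path without vzdump in the loop; a quick sketch (the test file name and the ~25GB size are arbitrary, just chosen to be well past the ~10GB mark where trouble starts):
Code:
# write ~25GB of zeros straight to the NFS mount, bypassing vzdump;
# conv=fsync makes dd flush to the server before reporting a rate
dd if=/dev/zero of=/mnt/pve/file-server/ddtest.img bs=1M count=25000 conv=fsync
rm /mnt/pve/file-server/ddtest.img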

And nfsstat on the Proxmox node shows this:
Code:
Client rpc stats:
calls      retrans    authrefrsh
9533134    16         8452156


Client nfs v3:
null         getattr      setattr      lookup       access       readlink     
0         0% 52731     0% 9         0% 71        0% 9464      0% 0         0% 
read         write        create       mkdir        symlink      mknod        
14        0% 9437898  99% 29        0% 10        0% 0         0% 0         0% 
remove       rmdir        rename       link         readdir      readdirplus  
12        0% 8         0% 8         0% 0         0% 0         0% 44        0% 
fsstat       fsinfo       pathconf     commit       
26798     0% 2         0% 1         0% 6035      0%

Does anybody have a suggestion on what to try next?
 
Storage: the NFS server has 2TB of free space.

LVM: an old Proxmox node was running with the default "size" in vzdump.conf and I believe it threw an error that pointed to that (I can't remember what the error was, though). So when I built this new node I assigned 15GB and I've never gotten any errors from vzdump. Is there a way I could check the usage during a backup?
 
I've let the backup from last night run all day today and it appears that it keeps making headway, but only at a crawl. For a 23GB machine it's taking about 6 hours instead of a few minutes.

Here's what lvs shows now while it's running:
Code:
  LV             VG   Attr     LSize  Pool Origin Data%  Move Log Copy%  Convert
  data           pve  owi-aos-  1.75t
  root           pve  -wi-ao-- 96.00g                                           
  swap           pve  -wi-ao-- 47.00g                                           
  vzsnap-intel-0 pve  swi-aos- 14.65g      data    13.23
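
To keep an eye on it while vzdump runs I just refresh that view:
Code:
# poll the snapshot's Data% (COW usage) every 10 seconds
watch -n 10 'lvs pve'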

Here's the relevant output of lvdisplay too if that's helpful:
Code:
  --- Logical volume ---
  LV Path                /dev/pve/vzsnap-intel-0
  LV Name                vzsnap-intel-0
  VG Name                pve
  LV UUID                ognFDc-wvn0-dQjo-9980-T7RZ-qtpH-SpSAlv
  LV Write Access        read/write
  LV Creation host, time intel, 2012-09-25 10:21:37 -0500
  LV snapshot status     active destination for data
  LV Status              available
  # open                 1
  LV Size                1.75 TiB
  Current LE             458177
  COW-table size         14.65 GiB
  COW-table LE           3750
  Allocated to snapshot  14.36%
  Snapshot chunk size    4.00 KiB
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:3
If we are running out of space I suppose it's possible to give it more room?
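
If so, I assume the fix is either raising "size" in /etc/vzdump.conf for future runs or growing the live snapshot directly; a sketch using the volume name from lvs above:
Code:
# add 10GB of COW space to the running snapshot; raising "size:"
# in /etc/vzdump.conf should do the same for future backups
lvextend -L +10G /dev/pve/vzsnap-intel-0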

[Edit] I should mention that these servers change very little at night as far as data goes. There might be a few KB worth of files changed during a backup but that should be it.
 
Just an update. I've been monitoring the backups and what I've found is that the NFS transfer slows and then fails shortly after 20GB. At that point the snapshot is only using 5% of the allocated space. Of course the snapshot keeps growing while it sits and waits on NFS, and I saw it as high as 11% before I aborted it.

So I don't know what else it could be but a bug in NFS. Any other ideas from you guys? Build another backup server with a different distro?
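
Before building a new box I figure it's worth grepping the server side for nfsd complaints around the stall; something like this on the Ubuntu machine (stock syslog paths assumed):
Code:
# on the NFS server: look for nfsd/rpc errors and check the
# server-side operation counters around the time of the stall
grep -i nfsd /var/log/syslog
dmesg | grep -iE 'nfs|rpc'
nfsstat -s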
 
I have often experienced that issue while backing up to NFS, mainly on internet servers, and mostly when multiple hosts are backing up to one NFS server at the same moment.

Right now, I've created on each host a dedicated LVM volume called backupLocal, and each host backs up to its own volume. It's much faster and more consistent. I know that the backup duration will be almost the same, day after day.

Later in the night, the backup server opens an rsync connection to each server and downloads the backups.
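
The pull step is nothing fancy, roughly like this (the hostnames and paths are placeholders, not our real ones):
Code:
#!/bin/bash
# run from cron on the backup server: pull each host's local
# dumps over SSH; --partial lets an interrupted copy resume
for host in pve1 pve2; do
    rsync -a --partial "root@${host}:/mnt/backupLocal/dump/" "/srv/backups/${host}/"
done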

No problems since that process has been in place!

Michel
 
