Hello all, I have a problem with our cluster that I've been working on for quite some time now, and I haven't been able to work out what the issue is. We have a two-node cluster that runs VMs on local storage, snapshots all VMs nightly, and stores the .lzo archives on an NFS server. The problem is that the snapshots hang and never finish. When they hang, the web interface goes down and occasionally the VMs have become unavailable on the network (exciting). From the logs, it seems that the snapshot itself works fine and it is the transfer that never finishes.
Here are the items I modified in /etc/vzdump.conf:
Code:
bwlimit: 50000
size: 15000
And my modified /etc/pve/storage.cfg:
Code:
nfs: file-server
        path /mnt/pve/file-server
        server 192.168.100.10
        export /home/user/storage/cluster
        options vers=3,soft,rsize=32768,wsize=32768
        content images,iso,vztmpl,rootdir,backup
        maxfiles 5
The NFS server is running Ubuntu 12.04 and is connected through a single gigabit switch. Its exports look like this:
Code:
/home/user/storage/cluster 192.168.100.1(rw,sync,no_subtree_check) 192.168.100.2(rw,sync,no_subtree_check)
When a backup hangs, the vzdump log looks like this:
Code:
Sep 24 21:31:56 INFO: Starting Backup of VM 101 (qemu)
Sep 24 21:31:56 INFO: status = running
Sep 24 21:31:57 INFO: backup mode: snapshot
Sep 24 21:31:57 INFO: bandwidth limit: 50000 KB/s
Sep 24 21:31:57 INFO: ionice priority: 7
Sep 24 21:31:57 INFO: Logical volume "vzsnap-intel-0" created
Sep 24 21:31:57 INFO: creating archive '/mnt/pve/file-server/dump/vzdump-qemu-101-2012_09_24-21_31_56.tar.lzo'
Sep 24 21:31:57 INFO: adding '/mnt/pve/file-server/dump/vzdump-qemu-101-2012_09_24-21_31_56.tmp/qemu-server.conf' to archive ('qemu-server.c$
Sep 24 21:31:57 INFO: adding '/mnt/vzsnap0/images/101/vm-101-disk-1.vmdk' to archive ('vm-disk-ide0.vmdk')
The modification time of the .dat file always shows the current time, but its size either doesn't change or changes by so few bytes that it isn't noticeable. It looks like it would stay this way forever, or might possibly complete in a few days if I let it continue.
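One way to put a number on "the size barely changes" is to sample the file size at a fixed interval and compute the write rate. The following is a minimal Python sketch and not part of vzdump. The function name `sample_growth`, the 5-second interval, and the sample count are my own choices, and the path is whichever file on the NFS mount you want to watch.

```python
import os
import sys
import time


def sample_growth(path, interval_s=5.0, samples=3, sleep=time.sleep):
    """Return (delta_bytes, bytes_per_second) for each sampling interval.

    `sleep` is injectable so the function can be exercised without waiting.
    """
    results = []
    prev = os.path.getsize(path)
    for _ in range(samples):
        sleep(interval_s)
        cur = os.path.getsize(path)
        delta = cur - prev
        results.append((delta, delta / interval_s))
        prev = cur
    return results


if __name__ == "__main__":
    # Usage (hypothetical file name): python growth.py /mnt/pve/file-server/dump/<file>
    for delta, rate in sample_growth(sys.argv[1]):
        print(f"grew {delta} bytes ({rate / 1024:.1f} KiB/s)")
```

A run of intervals with zero or near-zero growth while the modification time keeps updating would match the behaviour described above.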
The VM that it usually fails on (the first large machine in line) produces a ~75 GB archive when the backup succeeds, so it is a decently large disk. It does seem like anything over 10 GB is at risk, though. When it hangs I have to reboot the NFS server to recover and then restart services on Proxmox to get the web interface back. Rebooting the Proxmox node itself does seem to at least let the next nightly backup session complete without issue. With all the default configs it went 4 nights without hanging, and with the modified configs it went one night. I also started backups of some large machines manually with the old configs and they never completed, so I don't think the config tweaks are helping or hurting.
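For a sense of scale, here is a rough estimate of how long a ~75 GB archive would take if it were written at the configured 50000 KB/s limit. This is only a sketch under stated assumptions: it treats the limit as if it applied directly to archive bytes written (a simplification), takes 1 GB as 10^9 bytes and 1 KB as 1024 bytes, and the function name `seconds_at_limit` is made up for illustration.

```python
def seconds_at_limit(archive_bytes, limit_kb_per_s, kb=1024):
    """Seconds needed to move archive_bytes at limit_kb_per_s (KB = `kb` bytes)."""
    return archive_bytes / (limit_kb_per_s * kb)


if __name__ == "__main__":
    archive_bytes = 75 * 1000**3   # ~75 GB archive, decimal GB assumed
    bwlimit_kb_s = 50000           # bwlimit from /etc/vzdump.conf
    secs = seconds_at_limit(archive_bytes, bwlimit_kb_s)
    print(f"about {secs / 60:.1f} minutes at the configured limit")
```

Under these assumptions the transfer would take roughly 24 minutes, so a backup that looks like it needs days is running far below the configured limit.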
Running top on the NFS server during a large backup shows something like this:
Code:
Cpu(s): 0.1%us, 0.8%sy, 0.0%ni, 72.6%id, 25.5%wa, 0.0%hi, 0.9%si, 0.0%st
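To track the I/O wait figure over time rather than eyeballing it, the summary line can be parsed into named fields. This is a small Python sketch that assumes the `Cpu(s):` line format shown above (newer versions of top print it differently); `parse_cpu_line` is a made-up helper name.

```python
import re


def parse_cpu_line(line):
    """Parse a top 'Cpu(s):' summary line into {field: percent}."""
    # Each field looks like '25.5%wa': a number, a percent sign, a short name.
    return {name: float(value) for value, name in re.findall(r"([\d.]+)%(\w+)", line)}


if __name__ == "__main__":
    line = "Cpu(s): 0.1%us, 0.8%sy, 0.0%ni, 72.6%id, 25.5%wa, 0.0%hi, 0.9%si, 0.0%st"
    fields = parse_cpu_line(line)
    print(f"iowait: {fields['wa']}%  idle: {fields['id']}%")
```

In this sample, `wa` (time the CPUs sat idle waiting on I/O) is 25.5% while user and system time are both under 1%, which would be consistent with the server waiting on its disks rather than being short of CPU. A single snapshot is weak evidence, though.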
And nfsstat on the Proxmox node shows this:
Code:
Client rpc stats:
calls      retrans    authrefrsh
9533134    16         8452156
Client nfs v3:
null         getattr      setattr      lookup       access       readlink
0         0% 52731     0% 9         0% 71        0% 9464      0% 0         0%
read         write        create       mkdir        symlink      mknod
14        0% 9437898  99% 29        0% 10        0% 0         0% 0         0%
remove       rmdir        rename       link         readdir      readdirplus
12        0% 8         0% 8         0% 0         0% 0         0% 44        0%
fsstat       fsinfo       pathconf     commit
26798     0% 2         0% 1         0% 6035      0%
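The counters above can be reduced to two ratios that are easier to compare between runs: the share of RPC calls that were retransmitted and the share that were writes. This is a minimal Python sketch using the numbers from the nfsstat output; `rpc_ratios` is a made-up helper name, not an nfsstat feature.

```python
def rpc_ratios(calls, retrans, writes):
    """Return the retransmission ratio and write share of total RPC calls."""
    return {
        "retrans_ratio": retrans / calls,
        "write_share": writes / calls,
    }


if __name__ == "__main__":
    # Values copied from the nfsstat output on the Proxmox node.
    r = rpc_ratios(calls=9533134, retrans=16, writes=9437898)
    print(f"retransmissions: {r['retrans_ratio'] * 1e6:.2f} per million calls")
    print(f"writes: {r['write_share'] * 100:.1f}% of calls")
```

With these numbers, retransmissions come out at fewer than 2 per million calls and writes at about 99% of calls, matching the percentage nfsstat prints. Taken alone, such a low retransmission count does not point at packet loss on the link, though the counters are cumulative since mount and may not reflect what happens during a hang.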
Does anybody have a suggestion on what to try next?