snapshot backup to NFS crashes cluster

ewuewu

I have set up a two-node cluster using Proxmox 2.1 with DRBD. The cluster works fine. I configured it to run snapshot backups to our company NFS servers.
Sometimes this works perfectly, but sometimes one of the cluster nodes stops working correctly: the node loses the cluster connection. It seems to me that this behavior occurs when the NFS server is under heavy load. The backup log shows the following messages:

Code:
INFO: starting new backup job: vzdump 107 --remove 0 --mode snapshot --compress lzo --storage ubuntu-nfs-1 --node proxmox1
INFO: Starting Backup of VM 107 (qemu)
INFO: status = running
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: suspend vm to make snapshot
INFO: Logical volume "vzsnap-proxmox1-0" created
INFO: Logical volume "vzsnap-proxmox1-1" created
INFO: resume vm
INFO: vm is online again after 1 seconds
INFO: creating archive '/mnt/pve/ubuntu-nfs-1/dump/vzdump-qemu-107-2012_06_07-16_44_20.tar.lzo'
INFO: adding '/mnt/pve/ubuntu-nfs-1/dump/vzdump-qemu-107-2012_06_07-16_44_20.tmp/qemu-server.conf' to archive ('qemu-server.conf')
INFO: adding '/dev/drbdvg/vzsnap-proxmox1-0' to archive ('vm-disk-ide0.raw')
INFO: adding '/dev/drbdvg/vzsnap-proxmox1-1' to archive ('vm-disk-virtio0.raw')
INFO: Total bytes written: 214756756480 (64.00 MiB/s)
INFO: archive file size: 69.01GB
INFO: unable to open file '/etc/pve/nodes/proxmox1/qemu-server/107.conf.tmp.314129' - Device or resource busy
trying to aquire cfs lock 'storage-drbdvg' ...trying to aquire cfs lock 'storage-drbdvg' ...trying to aquire cfs lock 'storage-drbdvg' ...trying to aquire cfs lock 'storage-drbdvg' ...INFO: lvremove failed - trying again in 8 seconds
trying to aquire cfs lock 'storage-drbdvg' ...INFO: lvremove failed - trying again in 16 seconds
trying to aquire cfs lock 'storage-drbdvg' ...INFO: lvremove failed - trying again in 32 seconds
trying to aquire cfs lock 'storage-drbdvg' ...ERROR: got lock request timeout
trying to aquire cfs lock 'storage-drbdvg' ...trying to aquire cfs lock 'storage-drbdvg' ...trying to aquire cfs lock 'storage-drbdvg' ...trying to aquire cfs lock 'storage-drbdvg' ...INFO: lvremove failed - trying again in 8 seconds
trying to aquire cfs lock 'storage-drbdvg' ...INFO: lvremove failed - trying again in 16 seconds
trying to aquire cfs lock 'storage-drbdvg' ...INFO: lvremove failed - trying again in 32 seconds
trying to aquire cfs lock 'storage-drbdvg' ...ERROR: got lock request timeout
INFO: Finished Backup of VM 107 (00:58:13)
INFO: Backup job finished successfully
TASK OK

After this the node turned red in the web interface and wasn't accessible any more from the remaining node.

The syslog of the crashed node was flooded with:

Code:
...
Jun 9 06:47:35 proxmox1 pmxcfs[586854]: [status] crit: cpg_send_message failed: 9
Jun 9 06:47:35 proxmox1 pmxcfs[586854]: [status] crit: cpg_send_message failed: 9
Jun 9 06:47:35 proxmox1 pmxcfs[586854]: [status] crit: cpg_send_message failed: 9
Jun 9 06:47:35 proxmox1 pmxcfs[586854]: [status] crit: cpg_send_message failed: 9
...


I could remove the snapshots with lvremove, but I wasn't able to bring the node back into normal operation. Only a reboot of the node brought it back.

The NFS storage is configured with these options in storage.cfg:

Code:
nfs: ubuntu-nfs-1
    path /mnt/pve/ubuntu-nfs-1
    server 195.50.143.109
    export /backup-ext3-sdc/proxmox
    options vers=3,rw,rsize=32768,wsize=32768
    content images,iso,vztmpl,rootdir,backup
    maxfiles 4

The output of pveversion is:

Code:
# pveversion --verbose
pve-manager: 2.1-1 (pve-manager/2.1/f9b0f63a)
running kernel: 2.6.32-12-pve
proxmox-ve-2.6.32: 2.1-68
pve-kernel-2.6.32-12-pve: 2.6.32-68
pve-kernel-2.6.32-7-pve: 2.6.32-60
lvm2: 2.02.95-1pve2
clvm: 2.02.95-1pve2
corosync-pve: 1.4.3-1
openais-pve: 1.1.4-2
libqb: 0.10.1-2
redhat-cluster-pve: 3.1.8-3
resource-agents-pve: 3.9.2-3
fence-agents-pve: 3.1.7-2
pve-cluster: 1.0-26
qemu-server: 2.0-39
pve-firmware: 1.0-16
libpve-common-perl: 1.0-27
libpve-access-control: 1.0-21
libpve-storage-perl: 2.0-18
vncterm: 1.0-2
vzctl: 3.0.30-2pve5
vzprocps: 2.0.11-2
vzquota: 3.0.12-3
pve-qemu-kvm: 1.0-9
ksm-control-daemon: 1.1-1

My questions are: Is there a way to reintegrate the failed node into the cluster without rebooting after such a crash occurs, and, much more importantly, how can I avoid such a crash?

Currently snapshot backup is not usable for us.

Any help is appreciated
 
My questions are: Is there a way to reintegrate the failed node into the cluster without rebooting after such a crash occurs?

# /etc/init.d/cman start
# /etc/init.d/pve-cluster start
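
If the services come back cleanly, the node should rejoin the cluster; you can verify that with pvecm (assuming quorum is regained):

# pvecm status
# pvecm nodes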

and much more importantly: how can I avoid such a crash?

Currently snapshot backup is not usable for us.

Any help is appreciated

Use an extra physical network for cluster communication, or limit the backup bandwidth (/etc/vzdump.conf, bwlimit).
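
For example, to cap the backup rate, set bwlimit in /etc/vzdump.conf (the value is in KB/s; 40000 here is just an illustrative figure, pick whatever your network can spare):

Code:
# /etc/vzdump.conf
bwlimit: 40000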
 
First of all, thank you for your quick replies, guys, and sorry for my late response, but I've been out of the office for a couple of days.

Your hint about the network separation did the trick. But unfortunately I have a new issue with the online backup.

I have one VM with two relatively big disks (one IDE disk with 100 GB and one virtio disk with 100 GB).
Whenever I try to back up this VM, the process starts quickly, but after a while I get some error messages. These messages are repeated several times.

Code:
Jun 14 19:14:49 proxmox2 rrdcached[1810]: queue_thread_main: rrd_update_r (/var/lib/rrdcached/db/pve2-storage/proxmox2/drbdvg) failed with status -1. (/var/lib/rrdcached/db/pve2-storage/proxmox2/drbdvg: illegal attempt to update using time 1339693798 when last update time is 1339693842 (minimum one second step))
Jun 14 19:14:49 proxmox2 rrdcached[1810]: queue_thread_main: rrd_update_r (/var/lib/rrdcached/db/pve2-storage/proxmox2/local) failed with status -1. (/var/lib/rrdcached/db/pve2-storage/proxmox2/local: illegal attempt to update using time 1339693798 when last update time is 1339693842 (minimum one second step))
Jun 14 19:14:59 proxmox2 pvestatd[3200]: WARNING: command 'df -P -B 1 /mnt/pve/ubuntu-nfs1' failed: got timeout
Jun 14 19:15:29 proxmox2 pvestatd[3200]: WARNING: command 'df -P -B 1 /mnt/pve/ubuntu-nfs1' failed: got timeout
Jun 14 19:15:49 proxmox2 pvestatd[3200]: WARNING: command 'df -P -B 1 /mnt/pve/ubuntu-nfs1' failed: got timeout
Jun 14 19:15:49 proxmox2 pvestatd[3200]: status update time (12.237 seconds)
Jun 14 19:15:51 proxmox2 pvestatd[3200]: WARNING: command 'df -P -B 1 /mnt/pve/ubuntu-nfs1' failed: got timeout
Jun 14 19:15:53 proxmox2 kernel: INFO: task lzop:726799 blocked for more than 120 seconds.
Jun 14 19:15:53 proxmox2 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 14 19:15:53 proxmox2 kernel: lzop D ffff88026c426a00 0 726799 726797 0 0x00000000
Jun 14 19:15:53 proxmox2 kernel: ffff88026d94d1e8 0000000000000086 ffff88026d94d188 ffff88026fc73e00
Jun 14 19:15:53 proxmox2 kernel: 000000000000000c ffff88026d94d220 0000000000000000 ffffea00012156e8
Jun 14 19:15:53 proxmox2 kernel: ffff8802972a4280 ffff88026c426fa0 ffff88026d94dfd8 ffff88026d94dfd8
Jun 14 19:15:53 proxmox2 kernel: Call Trace:
Jun 14 19:15:53 proxmox2 kernel: [<ffffffff81120c60>] ? sync_page+0x0/0x50
Jun 14 19:15:53 proxmox2 kernel: [<ffffffff81513e33>] io_schedule+0x73/0xc0
Jun 14 19:15:53 proxmox2 kernel: [<ffffffff81120c9d>] sync_page+0x3d/0x50
Jun 14 19:15:53 proxmox2 kernel: [<ffffffff815147ff>] __wait_on_bit+0x5f/0x90
Jun 14 19:15:53 proxmox2 kernel: [<ffffffff81120e53>] wait_on_page_bit+0x73/0x80
Jun 14 19:15:53 proxmox2 kernel: [<ffffffff81094650>] ? wake_bit_function+0x0/0x40
Jun 14 19:15:53 proxmox2 kernel: [<ffffffff8113de0a>] shrink_page_list+0x15a/0x9e0
Jun 14 19:15:53 proxmox2 kernel: [<ffffffff81136f97>] ? get_dirty_limits+0x27/0x2f0
Jun 14 19:15:53 proxmox2 kernel: [<ffffffff81136eea>] ? determine_dirtyable_memory+0x1a/0x30
Jun 14 19:15:53 proxmox2 kernel: [<ffffffff81136f97>] ? get_dirty_limits+0x27/0x2f0
Jun 14 19:15:53 proxmox2 kernel: [<ffffffff8113fc38>] shrink_zone+0x5d8/0x9d0
Jun 14 19:15:53 proxmox2 kernel: [<ffffffff811499a9>] ? zone_statistics+0x99/0xc0
Jun 14 19:15:53 proxmox2 kernel: [<ffffffff81140d3d>] __zone_reclaim+0x22d/0x2f0
Jun 14 19:15:53 proxmox2 kernel: [<ffffffff8113d380>] ? isolate_pages_global+0x0/0x520
Jun 14 19:15:53 proxmox2 kernel: [<ffffffff81140f17>] zone_reclaim+0x117/0x150
Jun 14 19:15:53 proxmox2 kernel: [<ffffffff81130eec>] get_page_from_freelist+0x6ac/0x840
Jun 14 19:15:53 proxmox2 kernel: [<ffffffff81132b26>] __alloc_pages_nodemask+0x116/0xb50
Jun 14 19:15:53 proxmox2 kernel: [<ffffffff8126c52e>] ? radix_tree_tag_clear+0x1e/0x200
Jun 14 19:15:53 proxmox2 kernel: [<ffffffff81173a5a>] alloc_pages_current+0xaa/0x120
Jun 14 19:15:53 proxmox2 kernel: [<ffffffff81121f4b>] __page_cache_alloc+0x4b/0xf0
Jun 14 19:15:53 proxmox2 kernel: [<ffffffff8112316e>] grab_cache_page_write_begin+0xde/0x130
Jun 14 19:15:53 proxmox2 kernel: [<ffffffffa05bc537>] nfs_write_begin+0x77/0x230 [nfs]
Jun 14 19:15:53 proxmox2 kernel: [<ffffffffa05bd78d>] ? nfs_write_end+0x6d/0x290 [nfs]
Jun 14 19:15:53 proxmox2 kernel: [<ffffffff8112163c>] generic_file_buffered_write_iter+0xfc/0x270
Jun 14 19:15:53 proxmox2 kernel: [<ffffffff810630ab>] ? dequeue_task_fair+0x12b/0x130
Jun 14 19:15:53 proxmox2 kernel: [<ffffffff811233bd>] __generic_file_write_iter+0x1fd/0x400
Jun 14 19:15:53 proxmox2 kernel: [<ffffffff81071c97>] ? current_fs_time+0x27/0x30
Jun 14 19:15:53 proxmox2 kernel: [<ffffffff81123645>] __generic_file_aio_write+0x85/0xa0
Jun 14 19:15:53 proxmox2 kernel: [<ffffffff811236cf>] generic_file_aio_write+0x6f/0xe0
Jun 14 19:15:53 proxmox2 kernel: [<ffffffffa05bd2bc>] nfs_file_write+0x10c/0x210 [nfs]
Jun 14 19:15:53 proxmox2 kernel: [<ffffffff8118f4ea>] do_sync_write+0xfa/0x140
Jun 14 19:15:53 proxmox2 kernel: [<ffffffff81094610>] ? autoremove_wake_function+0x0/0x40
Jun 14 19:15:53 proxmox2 kernel: [<ffffffff8100984c>] ? __switch_to+0x1ac/0x320
Jun 14 19:15:53 proxmox2 kernel: [<ffffffff8118f7c8>] vfs_write+0xb8/0x1a0
Jun 14 19:15:53 proxmox2 kernel: [<ffffffff811901d1>] sys_write+0x51/0x90
Jun 14 19:15:53 proxmox2 kernel: [<ffffffff8100b182>] system_call_fastpath+0x16/0x1b
Jun 14 19:15:54 proxmox2 pvedaemon[2696]: worker 740014 finished
Jun 14 19:15:54 proxmox2 pvedaemon[2696]: starting 1 worker(s)
Jun 14 19:15:54 proxmox2 pvedaemon[2696]: worker 742614 started


There is another issue. It seems to me that the host disables one disk of the VM for a short while; this occurs only during backup. The VM runs Debian 6.0 and a content management system called Alfresco, with a moderate MySQL load. After the short disk stall, MySQL crashes, which doesn't really surprise me. The log messages from inside the VM are:

Code:
Jun 14 18:35:58 docs3 kernel: [448131.816078] ata1: lost interrupt (Status 0x50)
Jun 14 18:35:58 docs3 kernel: [448131.817872] sd 0:0:0:0: [sda] Unhandled error code
Jun 14 18:35:58 docs3 kernel: [448131.817880] sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
Jun 14 18:35:58 docs3 kernel: [448131.817884] sd 0:0:0:0: [sda] CDB: Write(10): 2a 00 01 25 73 68 00 00 08 00
Jun 14 18:35:58 docs3 kernel: [448131.819103] lost page write due to I/O error on dm-3
Jun 14 18:35:58 docs3 kernel: [448131.821042] JBD: Detected IO errors while flushing file data on dm-3

Does anyone have an idea what's going wrong here?
 
I've been hit by the same issue: during an NFS backup the client becomes unresponsive and several processes hang. The node eventually timed out and got fenced by another node (forcing HA to move the VM to another node, which then refused to start it since it was locked for backup).

Looking around, I've read about several issues with NFS hanging when writing large files, and I was able to reproduce the problem by creating a file dozens of GB large with "dd if=/dev/zero ..." on an NFS filesystem. A few options alter the behaviour. Using "async" writes on the server helped (it hung later), and NFS over UDP looked better than TCP in this respect. Limiting the bandwidth of vzdump is possibly a workaround (I've not tried it), but that would mean this already long backup would last even longer.
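
For reference, the reproduction was just a large sequential write to the NFS mount, something along these lines (path and size are placeholders, not my exact command):

Code:
dd if=/dev/zero of=/mnt/pve/ubuntu-nfs-1/testfile bs=1M count=50000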

The issue should be fixed in kernel 2.6.33 according to this bug id https://bugzilla.kernel.org/show_bug.cgi?id=15552, while the kernel I used was 2.6.32-13-pve. It's a one-line patch, so maybe it can be backported.
 
I made a test with 2.6.32-14-pve, and it behaved better. The test was made on a slightly different node (a bit weaker). Creating a 100 GB copy of /dev/zero on NFS over TCP through Gigabit Ethernet, the platform slowed down and the web interface was unresponsive and not usable, but it did not log any specific slowness and was not fenced like before. With NFS over UDP, the system responded better; the web interface was slow but usable with patience.

NFS over TCP had better throughput (100 MB/s, versus 60 MB/s with 2.6.32-13-pve). On the other hand, NFS over UDP was slower: 80 MB/s instead of 100 MB/s with the stable pve kernel.
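
For anyone who wants to try UDP as well: the standard NFS mount option is proto=udp, so in storage.cfg the options line from the first post would presumably become something like this (untested guess on my side):

Code:
options vers=3,proto=udp,rw,rsize=32768,wsize=32768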

I still wouldn't trust NFS too much for backing up large disks yet. An alternative for a transparent backup system might be a CIFS/Samba share mounted as if it were a local drive, in the hope that it doesn't have such an issue. Currently I'm using an rsync server to receive the backups without any trouble (although the rate is only 50 MB/s).
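
The rsync setup is nothing fancy; the push from the node looks roughly like this (server name and module name are placeholders for my actual setup, and /var/lib/vz/dump is simply where the local dumps land):

Code:
rsync -av /var/lib/vz/dump/ rsync://backupserver/proxmox-backups/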
 
