Hi, I am having a problem with snapshot backups. The backup process stalls just before finishing. It is reproducible every time with large VMs. I started a snapshot backup of a 200 GB single-disk VM at 18:25:
Code:
INFO: starting new backup job: vzdump 121 --remove 0 --mode snapshot --compress lzo --storage NFS_NexentaStor --node kvm44
INFO: Starting Backup of VM 121 (qemu)
INFO: status = running
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: trying to remove stale snapshot '/dev/ST4x2000/vzsnap-kvm44-0'
INFO: Logical volume "vzsnap-kvm44-0" created
INFO: creating archive '/mnt/pve/NFS_NexentaStor/dump/vzdump-qemu-121-2012_05_10-18_25_36.tar.lzo'
INFO: adding '/mnt/pve/NFS_NexentaStor/dump/vzdump-qemu-121-2012_05_10-18_25_36.tmp/qemu-server.conf' to archive ('qemu-server.conf')
INFO: adding '/dev/ST4x2000/vzsnap-kvm44-0' to archive ('vm-disk-ide0.raw')
INFO: Total bytes written: 215822109184 (126.66 MiB/s)
Within a few tens of minutes, as you can see, it was almost finished. On the NFS server there were vzdump-qemu-121-2012_05_10-18_25_36.tar.dat and vzdump-qemu-121-2012_05_10-18_25_36.tmp files. But even after the Proxmox UI reported the "INFO: Total bytes written: 215822109184 (126.66 MiB/s)" line, a few "ls -al" commands on the NFS server showed that Proxmox was still writing, very slowly. The tar.dat file kept growing, but only at a few KB/s. It eventually stopped growing after a few hours, and from then on nothing happened: the vzdump process seems stalled and never completes, even days later. The "Stop" button in the GUI doesn't cancel the backup either.
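For reference, this is roughly how I watch the growth rate, just a plain loop run in the dump directory on the NFS server (the filename is from this particular run):
Code:
# crude growth check, run on the NFS server side in the dump directory
while true; do
    ls -al vzdump-qemu-121-2012_05_10-18_25_36.tar.dat
    sleep 60
done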
There are plenty of lzop hung-task errors in syslog. They start about 20 minutes after the backup begins, and they seem to be the cause of the very slow write speed just before it stalls:
Code:
May 10 18:54:58 kvm44 kernel: INFO: task lzop:11558 blocked for more than 120 seconds.
May 10 18:54:58 kvm44 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 10 18:54:58 kvm44 kernel: lzop D ffff88063bab0480 0 11558 11556 0 0x00000000
May 10 18:54:58 kvm44 kernel: ffff88044fd23b68 0000000000000046 ffff88044fd23ae8 ffffffffa03e6ec0
May 10 18:54:58 kvm44 kernel: ffff8807932d99c0 ffff880634f9c9c0 ffff88044f000001 00000000000007a8
May 10 18:54:58 kvm44 kernel: ffffffffa03f51a8 ffff88063bab0a20 ffff88044fd23fd8 ffff88044fd23fd8
May 10 18:54:58 kvm44 kernel: Call Trace:
May 10 18:54:58 kvm44 kernel: [<ffffffff81120bf0>] ? sync_page+0x0/0x50
May 10 18:54:58 kvm44 kernel: [<ffffffff81512663>] io_schedule+0x73/0xc0
May 10 18:54:58 kvm44 kernel: [<ffffffff81120c2d>] sync_page+0x3d/0x50
May 10 18:54:58 kvm44 kernel: [<ffffffff8151302f>] __wait_on_bit+0x5f/0x90
May 10 18:54:58 kvm44 kernel: [<ffffffff81120de3>] wait_on_page_bit+0x73/0x80
May 10 18:54:58 kvm44 kernel: [<ffffffff810944f0>] ? wake_bit_function+0x0/0x40
May 10 18:54:58 kvm44 kernel: [<ffffffff811392f5>] ? pagevec_lookup_tag+0x25/0x40
May 10 18:54:58 kvm44 kernel: [<ffffffff811212eb>] wait_on_page_writeback_range+0xfb/0x190
May 10 18:54:58 kvm44 kernel: [<ffffffff811214b8>] filemap_write_and_wait_range+0x78/0x90
May 10 18:54:58 kvm44 kernel: [<ffffffff811bf75a>] vfs_fsync_range+0xba/0x190
May 10 18:54:58 kvm44 kernel: [<ffffffff811bf89d>] vfs_fsync+0x1d/0x20
May 10 18:54:58 kvm44 kernel: [<ffffffffa03aa920>] nfs_file_flush+0x70/0xa0 [nfs]
May 10 18:54:58 kvm44 kernel: [<ffffffff8118bd5c>] filp_close+0x3c/0x90
May 10 18:54:58 kvm44 kernel: [<ffffffff8106dddf>] put_files_struct+0x7f/0xf0
May 10 18:54:58 kvm44 kernel: [<ffffffff8106dea3>] exit_files+0x53/0x70
May 10 18:54:58 kvm44 kernel: [<ffffffff8106fa8d>] do_exit+0x1ad/0x920
May 10 18:54:58 kvm44 kernel: [<ffffffff81070258>] do_group_exit+0x58/0xd0
May 10 18:54:58 kvm44 kernel: [<ffffffff810702e7>] sys_exit_group+0x17/0x20
May 10 18:54:58 kvm44 kernel: [<ffffffff8100b182>] system_call_fastpath+0x16/0x1b
The only way to cancel the backup is to reboot the host.
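The hung tasks are in uninterruptible sleep (the "D" state in the trace above), so killing them by hand along these lines has no effect either:
Code:
# the blocked tasks show state D (uninterruptible sleep)
ps -eo pid,stat,comm | grep -E 'vzdump|lzop|tar'
kill -9 <pid>   # no visible effect while the task stays in D state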

pveversion -v output:
Code:
pve-manager: 2.1-1 (pve-manager/2.1/f9b0f63a)
running kernel: 2.6.32-11-pve
proxmox-ve-2.6.32: 2.0-66
pve-kernel-2.6.32-11-pve: 2.6.32-66
lvm2: 2.02.95-1pve2
clvm: 2.02.95-1pve2
corosync-pve: 1.4.3-1
openais-pve: 1.1.4-2
libqb: 0.10.1-2
redhat-cluster-pve: 3.1.8-3
resource-agents-pve: 3.9.2-3
fence-agents-pve: 3.1.7-2
pve-cluster: 1.0-26
qemu-server: 2.0-39
pve-firmware: 1.0-15
libpve-common-perl: 1.0-27
libpve-access-control: 1.0-21
libpve-storage-perl: 2.0-18
vncterm: 1.0-2
vzctl: 3.0.30-2pve5
vzprocps: 2.0.11-2
vzquota: 3.0.12-3
pve-qemu-kvm: 1.0-9
ksm-control-daemon: 1.1-1
The host's RAID controller, from lspci:
Code:
01:00.0 RAID bus controller [0104]: LSI Logic / Symbios Logic MegaRAID SAS 2108 [Liberator] [1000:0079] (rev 04)
I also tried a backup in stop mode and the same thing happens. From the old archives of this VM I can see that a successful lzo backup is about 18 GB. Watching the transfer rate on the NFS server with "ls -al", I see the tar.dat file grow at roughly 100-120 MB/s, but when it reaches 17 GB it slows down to a few KB/s, and some hours later it stops growing at around 18 GB.
VMs with small disks back up fine in both modes, by the way. And there is no free-space problem: the VG has over 2 TB free.
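I checked the free space along these lines (the VG name is the one from the backup log above):
Code:
# confirm the volume group has room for the snapshot LV
vgs ST4x2000
lvs ST4x2000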