vzdump snapshot backup causes vm hang

jcpham

Hi all. Long thread, but hopefully with lots of info on a problem that has puzzled me for some time. I've reported the problem previously and it went unsolved, but I figured I'd try again.

Essentially, I have two KVM Windows 2008 VMs on one PVE host. This box has been in production since PVE 0.9 beta. One is SBS 2008 with a 300GB qcow2 disk; the other is a regular 2008 server with an 80GB qcow2 disk. The PVE host has 8GB of physical RAM, with 6GB assigned to the VMs:

Code:
[root@volt:/etc/qemu-server]$ ls *
101.conf  102.conf

[root@volt:/etc/qemu-server]$ cat 101.conf
name: SBS
sockets: 1
bootdisk: ide0
ide0: vm-101-disk.qcow2
ostype: w2k8
memory: 4096
onboot: 1
vlan0: e1000=26:0A:5B:7F:00:F6
description: <edit>
hostusb: 067b:2303
cores: 2
boot: c
freeze: 0
cpuunits: 1000
acpi: 1
kvm: 1

[root@volt:/etc/qemu-server]$ cat 102.conf
name: SQL
ide2: none,media=cdrom
sockets: 1
bootdisk: ide0
ostype: w2k8
memory: 2048
onboot: 1
boot: dc
freeze: 0
cpuunits: 1000
acpi: 1
kvm: 1
ide0: vm-102-disk.qcow2
vlan0: e1000=82:A0:A5:67:0F:2F
description: <edit>
cores: 2
VZdump snapshots of the SBS frequently lock up VM 101 itself, and nothing else. Sometimes for 8-12 hours if left alone on a weekend. Sometimes the web interface cannot stop the VM, and I have to use qm to stop it. vmtar will sometimes just keep writing and fill up all the space on my backup device if left unchecked. To clarify, I believe vmtar (or some combination of hardware and software settings) is the root problem, not the vzdump perl script. Normally when this happens, vmtar is pegged at 100% CPU usage.
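
For reference, this is roughly how I confirm that vmtar is the stuck process (a minimal check; PIDs will differ per run):

Code:
# find the vmtar process spawned by vzdump
ps aux | grep '[v]mtar'
# a hung vmtar sits at ~100% CPU indefinitely
top -b -n 1 | grep vmtar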

I've posted these details on the forum before: log files of vzdump failing because it ran out of disk space on a 1TB disk during the backup of a <300GB VM. Initially we were backing up to an external USB disk. About a year ago we swapped to an internal disk on the same controller, and the problem is consistent across both.
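
When a run goes rogue, the growing archive on the backup target is easy to watch. Something like this works (paths per the vzdump.conf and the failure log further down):

Code:
# watch the backup target fill while vmtar keeps writing
watch -n 60 'df -h /backup; ls -lh /backup/vzdump-qemu-101-*.dat 2>/dev/null'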

Yes, I know I need to reboot; I updated this morning.

All the good stuff:
Code:
[root@volt:~]$ pveversion -v
pve-manager: 1.7-10 (pve-manager/1.7/5323)
running kernel: 2.6.24-9-pve
pve-kernel-2.6.24-7-pve: 2.6.24-11
pve-kernel-2.6.24-1-pve: 2.6.24-4
pve-kernel-2.6.24-9-pve: 2.6.24-18
pve-kernel-2.6.24-5-pve: 2.6.24-6
pve-kernel-2.6.24-2-pve: 2.6.24-5
qemu-server: 1.1-28
pve-firmware: 1.0-10
libpve-storage-perl: 1.0-16
vncterm: 0.9-2
vzctl: 3.0.24-1pve4
vzdump: 1.2-10
vzprocps: 2.0.11-1dso2
vzquota: 3.0.11-1

[root@volt:~]$ pveperf
CPU BOGOMIPS:      37240.26
REGEX/SECOND:      627087
HD SIZE:           94.49 GB (/dev/pve/root)
BUFFERED READS:    203.67 MB/sec
AVERAGE SEEK TIME: 9.58 ms
FSYNCS/SECOND:     1345.03
DNS EXT:           156.51 ms
DNS INT:           92.39 ms

[root@volt:~]$ pvs
  PV         VG   Fmt  Attr PSize   PFree
  /dev/sda2  pve  lvm2 a-   930.00G 4.00G
[root@volt:~]$ lvs
  LV   VG   Attr   LSize   Origin Snap%  Move Log Copy%  Convert
  data pve  -wi-ao 823.00G
  root pve  -wi-ao  96.00G
  swap pve  -wi-ao   7.00G
[root@volt:~]$ vgs
  VG   #PV #LV #SN Attr   VSize   VFree
  pve    1   3   0 wz--n- 930.00G 4.00G

[root@volt:~]$ lspci | grep RAID
01:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS 1078 (rev 04)

[root@volt:~]$ mount
/dev/pve/root on / type ext3 (rw,errors=remount-ro)
tmpfs on /lib/init/rw type tmpfs (rw,nosuid,mode=0755)
proc on /proc type proc (rw,noexec,nosuid,nodev)
sysfs on /sys type sysfs (rw,noexec,nosuid,nodev)
procbususb on /proc/bus/usb type usbfs (rw)
udev on /dev type tmpfs (rw,mode=0755)
tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev)
devpts on /dev/pts type devpts (rw,noexec,nosuid,gid=5,mode=620)
/dev/mapper/pve-data on /var/lib/vz type ext3 (rw)
/dev/sda1 on /boot type ext3 (rw)
/dev/sdb5 on /backup type ext3 (rw,errors=remount-ro)

[root@volt:~]$ ls /var/lib/vz/images/*
/var/lib/vz/images/101:
vm-101-disk.qcow2

/var/lib/vz/images/102:
vm-102-disk.qcow2

[root@volt:~]$ qm list
      VMID NAME                 STATUS     MEM(MB)    BOOTDISK(GB) PID
       101 SBS                  running    4096             300.00 24284
       102 SQL                  running    2048              80.00 8663

[root@volt:~]$ df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/pve/root          95G  2.0G   88G   3% /
tmpfs                 3.9G     0  3.9G   0% /lib/init/rw
udev                   10M  2.7M  7.4M  27% /dev
tmpfs                 3.9G     0  3.9G   0% /dev/shm
/dev/mapper/pve-data  811G  500G  311G  62% /var/lib/vz
/dev/sda1             496M  100M  371M  22% /boot
/dev/sdb5             917G  499G  372G  58% /backup
/etc/vzdump.conf:
Changing vzdump options has never really seemed to help consistently.
- size: 2048 was suggested by Dietmar; Tom suggested higher. I've been using 4096 since, with the same results. Go higher? (See the lvs monitoring sketch after the config below.)
- bwlimit: changing this doesn't seem to matter.
Code:
[root@volt:~]$ cat /etc/vzdump.conf
############################################
# vzdump static options configuration file #
# ALL settings commented out purposely     #
# July 1st 2010, new internal raid0 1TB    #
# drive installed. VZdump testing          #
############################################

# script: /root/script/usb-rm.pl
size: 4096
dumpdir: /backup
# bwlimit: 10000
maxfiles: 2
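
To see whether the snapshot itself is filling up, the Snap% column of lvs can be watched while a backup runs (a minimal sketch; vzsnap-volt-0 is the snapshot name vzdump uses here, per the log below):

Code:
# during a backup, watch how full the vzdump snapshot LV gets;
# if Snap% reaches 100 the snapshot is invalidated
watch -n 60 'lvs pve'
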
Here's where I kill it at 8:30 because it has been running for 10+ hours. This particular backup takes between 4 and 6 hours when it is successful. Lowering bwlimit doesn't seem to affect the locking up, just how long the backup takes to complete.

Code:
[root@volt:~]$ cat vzdump-qemu-101-2011_01_20-20_00_02.log
Jan 20 20:00:02 INFO: Starting Backup of VM 101 (qemu)
Jan 20 20:00:03 INFO: running
Jan 20 20:00:03 INFO: status = running
Jan 20 20:00:03 INFO: backup mode: snapshot
Jan 20 20:00:03 INFO: ionice priority: 7
Jan 20 20:00:04 INFO:   Logical volume "vzsnap-volt-0" created
Jan 20 20:00:04 INFO: creating archive '/backup/vzdump-qemu-101-2011_01_20-20_00_02.tar'
Jan 20 20:00:04 INFO: adding '/backup/vzdump-qemu-101-2011_01_20-20_00_02.tmp/qemu-server.conf' to archive ('qemu-server.conf')
Jan 20 20:00:04 INFO: adding '/mnt/vzsnap0/images/101/vm-101-disk.qcow2' to archive ('vm-disk-ide0.qcow2')
Jan 21 08:30:12 INFO: received signal - terminate process
Jan 21 08:30:13 INFO:   Logical volume "vzsnap-volt-0" successfully removed
Jan 21 08:32:32 ERROR: Backup of VM 101 failed - command '/usr/lib/qemu-server/vmtar '/backup/vzdump-qemu-101-2011_01_20-20_00_02.tmp/qemu-server.conf' 'qemu-server.conf' '/mnt/vzsnap0/images/101/vm-101-disk.qcow2' 'vm-disk-ide0.qcow2' >/backup/vzdump-qemu-101-2011_01_20-20_00_02.dat' failed with exit code 255
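
When it wedges like this, the manual cleanup I end up doing looks roughly like the following (an approximate sequence; names taken from the log above, so double-check mount and lvs before removing anything):

Code:
# kill the stuck vmtar if vzdump's terminate handler doesn't finish the job
kill $(pidof vmtar)
# unmount the snapshot mount point vzdump uses
umount /mnt/vzsnap0
# remove the leftover snapshot LV
lvremove /dev/pve/vzsnap-volt-0
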
We have plans to double the physical RAM in the PVE host for good measure. I do not believe this is a PVE-version-specific problem, because it has happened on every version of PVE we've installed. It only happens to one VM, not both, so the size of the VM's disk image seems to be a factor somehow, at least from the perspective of vzdump and vmtar.

Any help or suggestions would be appreciated. Previous thread(s):
http://forum.proxmox.com/threads/2990-Another-vzdump-problem.-VERY-STRANGE!
 
@jcpham,

Hello,

we are having the same or similar problems.
Did you resolve it?

I'm thinking that the snapshot LV is too small.
In vzdump.conf I've set the following:
size: 3900
But it doesn't seem to help.
Maybe we should resize the pve-data LV so we can increase
the size of the logical volume vzsnap-px-0?
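
One thing worth noting: vzdump allocates the snapshot LV from free space in the volume group, so the size option is bounded by VFree. A quick check (assuming the VG is named pve, as in the output above):

Code:
# the snapshot LV comes out of VG free space, so
# vzdump's 'size' setting cannot exceed VFree
vgs pve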

Regards
 
Hi, experiencing the same symptoms here. vmtar hangs at 100% CPU; I'll check whether the LVM snapshot volume is running out of space.
 
More information: there is over 200GB free in the snapshot volume. Running strace on the vmtar process gives output like this:

Code:
read(3, 0xfffff74b5b8d3960, 9571968363188) = -1 EFAULT (Bad address)
read(3, 0xfffff74b5b8d395f, 9571968363189) = -1 EFAULT (Bad address)
read(3, 0xfffff74b5b8d395e, 9571968363190) = -1 EFAULT (Bad address)
read(3, 0xfffff74b5b8d395d, 9571968363191) = -1 EFAULT (Bad address)
read(3, 0xfffff74b5b8d395c, 9571968363192) = -1 EFAULT (Bad address)
read(3, 0xfffff74b5b8d395b, 9571968363193) = -1 EFAULT (Bad address)
read(3, 0xfffff74b5b8d395a, 9571968363194) = -1 EFAULT (Bad address)
read(3, 0xfffff74b5b8d3959, 9571968363195) = -1 EFAULT (Bad address)
read(3, 0xfffff74b5b8d3958, 9571968363196) = -1 EFAULT (Bad address)
read(3, 0xfffff74b5b8d3957, 9571968363197) = -1 EFAULT (Bad address)
read(3, 0xfffff74b5b8d3956, 9571968363198) = -1 EFAULT (Bad address)
read(3, 0xfffff74b5b8d3955, 9571968363199) = -1 EFAULT (Bad address)
read(3, 0xfffff74b5b8d3954, 9571968363200) = -1 EFAULT (Bad address)
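
For completeness, the trace above came from attaching to the already-running process; roughly (exact flags are a guess, but -p attaches to a running PID):

Code:
# attach to the spinning vmtar and dump its syscalls
strace -p $(pidof vmtar)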
 
I just realised I'm out of date and there is a newer version of vzdump in the repository. I'll see if that makes the problem go away.
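
In case anyone else is behind too, checking and pulling the newer package is the usual apt route (a sketch, assuming the standard Proxmox repository is configured):

Code:
# compare installed vs available vzdump versions
apt-get update
apt-cache policy vzdump
# then pull the newer build
apt-get install vzdump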
 
I just realised I'm out of date and there is a newer version of vzdump in the repository. I'll see if that makes the problem go away.

Well, I just carefully reviewed the current code, and I am 100% sure that it cannot produce the above call trace!
 
I just reran the same backup job with an updated vzdump and it was successful. Sorry for the confusion; I should have checked that I was up to date first!