KVM machines very slow/unreachable during vmtar backup

I assume your snapshot is full. Check with 'lvdisplay'.
 
See this example, which shows that the snapshot is only 0.02% used (Allocated to snapshot: 0.02%):

Code:
lvdisplay /dev/pve/vzsnap*
  
--- Logical volume ---
  LV Name                /dev/pve/vzsnap-ns227086-0
  VG Name                pve
  LV UUID                aDHUMn-mkWJ-P0ct-a2es-OxUF-tH9D-za708x
  LV Write Access        read/write
  LV snapshot status     active destination for /dev/pve/data
  LV Status              available
  # open                 1
  LV Size                500.00 GiB
  Current LE             128000
  COW-table size         1.00 GiB
  COW-table LE           256
  Allocated to snapshot  0.02%
  Snapshot chunk size    4.00 KiB
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:2
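To check whether a snapshot is filling up, the "Allocated to snapshot" percentage can be turned into bytes used and bytes left in the COW table. Below is a minimal Python sketch, assuming the field labels shown in the lvdisplay output above (other LVM versions may label these fields differently). The function name snapshot_usage is hypothetical and not part of any LVM tool.

```python
import re

# Relevant fields copied from the lvdisplay example above (trimmed).
SAMPLE = """\
  LV Name                /dev/pve/vzsnap-ns227086-0
  COW-table size         1.00 GiB
  Allocated to snapshot  0.02%
  Snapshot chunk size    4.00 KiB
"""

UNITS = {"KiB": 1024, "MiB": 1024**2, "GiB": 1024**3, "TiB": 1024**4}


def snapshot_usage(text):
    """Return (used_bytes, free_bytes, percent) for one lvdisplay snapshot block.

    Hypothetical helper: assumes the 'COW-table size' and
    'Allocated to snapshot' labels as printed in the example output.
    """
    size = re.search(r"COW-table size\s+([\d.]+)\s+(\w+)", text)
    pct = re.search(r"Allocated to snapshot\s+([\d.]+)%", text)
    if not size or not pct:
        raise ValueError("not a snapshot LV block")
    total = float(size.group(1)) * UNITS[size.group(2)]
    percent = float(pct.group(1))
    used = total * percent / 100
    return used, total - used, percent


if __name__ == "__main__":
    used, free, pct = snapshot_usage(SAMPLE)
    print(f"{pct}% used, {free / 1024**2:.1f} MiB left")  # 0.02% used, 1023.8 MiB left
```

With the example values, 0.02% of a 1 GiB COW table is only about 210 KiB, so that snapshot is far from full.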
 
Yes, the snapshot is removed after the backup process.
 
I changed the I/O scheduler to noop and to deadline (tried both): same exact issue. Performance in all VMs is at a near-standstill right now, with many soft-locked processes. Yikes.

The virtual server has been unreachable via SSH for the last few minutes, though it still answers pings.

Interestingly, this does not happen when we gzip the backups (I assume because compression bottlenecks on the CPU before the I/O gets hit hard).

Any ideas?

Hardware:
i7-870, 16 GB DDR3
Areca 1280 RAID controller with 8x 1 TB SAS drives (raw LVM volumes for the VMs)
Backup to a 3 TB SATA drive, approx. 90 MB/s write speed
 
More info: the issue started around midnight (00:00) tonight, right in the middle of the backup of VM 107. Here are the logs for the backups of VMs 107 and 108:

vmserver:/var/log/vzdump# cat qemu-107.log
Nov 29 23:50:20 INFO: Starting Backup of VM 107 (qemu)
Nov 29 23:50:20 INFO: running
Nov 29 23:50:20 INFO: status = running
Nov 29 23:50:21 INFO: backup mode: snapshot
Nov 29 23:50:21 INFO: bandwidth limit: 65536 KB/s
Nov 29 23:50:21 INFO: ionice priority: 7
Nov 29 23:50:21 INFO: Logical volume "vzsnap-vmserver-0" created
Nov 29 23:50:21 INFO: creating archive '/offsite/vzdump-qemu-107-2011_11_29-23_50_20.tar'
Nov 29 23:50:21 INFO: adding '/offsite/vzdumptmp147138/qemu-server.conf' to archive ('qemu-server.conf')
Nov 29 23:50:21 INFO: adding '/dev/array/vzsnap-vmserver-0' to archive ('vm-disk-virtio0.raw')
Nov 30 00:20:26 INFO: Total bytes written: 86851864576 (45.89 MiB/s)
Nov 30 00:20:26 INFO: archive file size: 80.89GB
Nov 30 00:20:26 INFO: delete old backup '/offsite/vzdump-qemu-107-2011_11_28-23_46_34.tar'
Nov 30 00:20:45 INFO: Logical volume "vzsnap-vmserver-0" successfully removed
Nov 30 00:20:45 INFO: Finished Backup of VM 107 (00:30:25)
vmserver:/var/log/vzdump# cat qemu-108.log
Nov 30 00:20:45 INFO: Starting Backup of VM 108 (qemu)
Nov 30 00:20:45 INFO: running
Nov 30 00:20:45 INFO: status = running
Nov 30 00:20:46 INFO: backup mode: snapshot
Nov 30 00:20:46 INFO: bandwidth limit: 65536 KB/s
Nov 30 00:20:46 INFO: ionice priority: 7
Nov 30 00:20:46 INFO: Logical volume "vzsnap-vmserver-0" created
Nov 30 00:20:46 INFO: creating archive '/offsite/vzdump-qemu-108-2011_11_30-00_20_45.tar'
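The 45.89 MiB/s that vzdump reports for VM 107 is an average over the whole disk-archiving phase. As a rough check, it can be reproduced from the log timestamps. This is a minimal sketch that assumes the year 2011 (the log lines omit the year, so it is taken from the archive name); the helper names parse_ts and throughput_mib_s are hypothetical.

```python
from datetime import datetime

# Two lines from the VM 107 log above.
START = "Nov 29 23:50:21 INFO: adding '/dev/array/vzsnap-vmserver-0' to archive ('vm-disk-virtio0.raw')"
END = "Nov 30 00:20:26 INFO: Total bytes written: 86851864576 (45.89 MiB/s)"


def parse_ts(line, year=2011):
    """Parse the leading 'Mon DD HH:MM:SS' timestamp (year assumed, not in the log)."""
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")


def throughput_mib_s(start_line, end_line):
    """Average MiB/s between the disk 'adding' line and the 'Total bytes written' line."""
    seconds = (parse_ts(end_line) - parse_ts(start_line)).total_seconds()
    total_bytes = int(end_line.split("Total bytes written:")[1].split()[0])
    return total_bytes / seconds / 2**20, seconds


if __name__ == "__main__":
    rate, seconds = throughput_mib_s(START, END)
    print(f"{rate:.2f} MiB/s over {seconds:.0f} s")  # 45.89 MiB/s over 1805 s
```

The result matches the logged rate. Because it is a 30-minute average, it says nothing about short bursts above that rate.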
 
These are all KVM VMs. You can see that I have set the bandwidth limit to 65536 KB/s in vzdump.conf, but I'm not sure what that actually does, as I/O rates on the source and backup drives go _far_ above that during the backup.
 
Your limit seems quite high (too high); try 25000.
 
pve-manager: 1.9-26 (pve-manager/1.9/6567)
running kernel: 2.6.32-6-pve
proxmox-ve-2.6.32: 1.9-53
pve-kernel-2.6.24-9-pve: 2.6.24-18
pve-kernel-2.6.24-8-pve: 2.6.24-16
pve-kernel-2.6.32-6-pve: 2.6.32-53
qemu-server: 1.1-32
pve-firmware: 1.0-14
libpve-storage-perl: 1.0-19
vncterm: 0.9-2
vzctl: 3.0.29-3pve1
vzdump: 1.2-16
vzprocps: 2.0.11-2
vzquota: 3.0.11-1
pve-qemu-kvm: 0.15.0-2
ksm-control-daemon: 1.0-6
 
What does the 'limit' represent? I assumed it was KB/s. I have set it to 25000 and will report back after tonight's backup attempt.
 
Yes, KB/s; for details see 'man vzdump'.
 
I see that. However, I'm not sure it actually affects these KVM/LVM snapshot images: I had it set to 65536 and it didn't seem to limit I/O on any device. Regardless, I have changed it to 25000 and will report back after tonight's backup.
 
Your backup speed is 45.89 MiB/s (from the backup logs), and your limit is 65536 KB/s (roughly 64 MiB/s), so why should it trigger?
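Converting the limits to MiB/s makes the comparison explicit, and also gives a lower bound on how long a disk takes at a given limit. A minimal sketch, assuming vzdump's KB means 1024 bytes (if it means 1000 bytes, the numbers shift by a few percent and the conclusion is the same); the helper names are hypothetical.

```python
KIB = 1024
MIB = 1024 * 1024

VM107_BYTES = 86851864576  # "Total bytes written" from the VM 107 log


def limit_mib_s(bwlimit_kb_s):
    """Convert a vzdump bwlimit (KB/s, assumed 1 KB = 1024 bytes) to MiB/s."""
    return bwlimit_kb_s * KIB / MIB


def min_duration_s(total_bytes, bwlimit_kb_s):
    """Lower bound on backup time if the stream never exceeds the limit."""
    return total_bytes / (bwlimit_kb_s * KIB)


if __name__ == "__main__":
    print(f"65536 -> {limit_mib_s(65536):.1f} MiB/s")  # 65536 -> 64.0 MiB/s
    print(f"25000 -> {limit_mib_s(25000):.1f} MiB/s")  # 25000 -> 24.4 MiB/s
    minutes = min_duration_s(VM107_BYTES, 25000) / 60
    print(f"VM 107 at 25000: >= {minutes:.1f} min")  # VM 107 at 25000: >= 56.5 min
```

So 65536 sits above the measured 45.89 MiB/s and never engaged, while at 25000 the VM 107 disk alone would take at least about 56.5 minutes instead of roughly 30.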
 
Interesting. With 'iostat -xk 1' I see I/O on the devices much higher than that throughout the process. At any rate, I'll give the bwlimit of 25000 a shot and post back here.
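One possible reason iostat shows higher rates than the bwlimit: the limit applies to vzdump's own stream, while iostat reports per-device counters from /proc/diskstats. Those counters include guest I/O, any copy-on-write traffic LVM generates for the snapshot, and writeback that the kernel may flush in bursts; the same bytes also appear on each stacked device (the dm-* device and the disk under it). The sketch below computes the same kind of per-device kB/s from two /proc/diskstats samples. The function names and the sample numbers in the test are hypothetical.

```python
SECTOR_BYTES = 512  # /proc/diskstats counts sectors in 512-byte units


def parse_diskstats(text):
    """Map device name -> (sectors_read, sectors_written) from /proc/diskstats text."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 10:
            continue
        # fields: major, minor, name, reads, reads merged, sectors read,
        #         ms reading, writes, writes merged, sectors written, ...
        stats[fields[2]] = (int(fields[5]), int(fields[9]))
    return stats


def rates_kb_s(before, after, interval_s):
    """Per-device (read kB/s, write kB/s) between two diskstats snapshots."""
    a, b = parse_diskstats(before), parse_diskstats(after)
    rates = {}
    for dev in a.keys() & b.keys():
        rd = (b[dev][0] - a[dev][0]) * SECTOR_BYTES / 1024 / interval_s
        wr = (b[dev][1] - a[dev][1]) * SECTOR_BYTES / 1024 / interval_s
        rates[dev] = (rd, wr)
    return rates


# On a live host (not run here):
#   with open("/proc/diskstats") as f: before = f.read()
#   time.sleep(1)
#   with open("/proc/diskstats") as f: after = f.read()
#   print(rates_kb_s(before, after, 1.0))
```

Comparing the rate on the backup target with the rates on the source array and its dm-* devices can show where the extra I/O beyond the limit comes from.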
 
OK, last night (with bwlimit 25000) was much better: the machines were generally reachable during the entire backup. However, system load was still extremely high at points during the backup, with correspondingly high load on the guest VMs (the backup starts at 23:00):

[Attached image: load.PNG, system load graph during the backup]

VZdump logs: https://gist.github.com/1417896

*EDIT* After looking at the logs and the load graph above, it seems that the start of each of the following steps is what causes the high load:
Code:
adding '/dev/array/vzsnap-vmserver-0' to archive ('vm-disk-virtio0.raw')

I'm wondering why the backup affects system performance so much. Is LVM working hard to maintain the snapshot? There's not much I/O going on in the guests during the backup. Would a faster backup target drive improve things?
 
How fast is your disk subsystem? Please run:

# pveperf

(when there is no load on the host)
 
pveperf on PVE root (on SATA SSD):
Code:
CPU BOGOMIPS:      44756.82
REGEX/SECOND:      772976
HD SIZE:           7.14 GB (/dev/mapper/pve-root)
BUFFERED READS:    141.48 MB/sec
AVERAGE SEEK TIME: 0.25 ms
FSYNCS/SECOND:     149.38
DNS EXT:           53.63 ms
DNS INT:           42.97 ms (praece.com)

pveperf on the backup storage location (on SATA HDD):
Code:
CPU BOGOMIPS:      44756.82
REGEX/SECOND:      788643
HD SIZE:           2750.67 GB (/dev/sdd1)
BUFFERED READS:    87.43 MB/sec
AVERAGE SEEK TIME: 14.88 ms
FSYNCS/SECOND:     21.73
DNS EXT:           51.28 ms
DNS INT:           52.18 ms (praece.com)