Cannot back up while Disk Throttle is set

nanonettr

Hello,

I am using pve-manager/3.4-9/4b51d87a (running kernel: 3.10.0-11-pve) and testing Disk Throttle.
It works pretty well, but I have an issue with backups.

If a virtual hard disk has throttle limits set, the backup starts, then waits at 0% and never makes any progress.
Code:
INFO: starting new backup job: vzdump 99999 --remove 0 --mode stop --storage QNAPNFS --node node02
INFO: Starting Backup of VM 99999 (qemu)
INFO: status = stopped
INFO: update VM 99999: -lock backup
INFO: backup mode: stop
INFO: bandwidth limit: 61500 KB/s
INFO: ionice priority: 7
INFO: creating archive '/mnt/pve/QNAPNFS/dump/vzdump-qemu-99999-2015_09_16-23_28_27.vma'
INFO: starting kvm to execute backup task
INFO: started backup task 'c9a6a287-37f7-417b-99c0-6cfb0ff56eba'
INFO: status: 0% (2162688/146028888064), sparse 0% (12288), duration 3, 0/0 MB/s


Is this a known bug? Are there any workarounds?

 
Hi,

What is the output of qm config <your VM_ID>?
 
Code:
root@node02:~# qm config 99999
acpi: 1
boot: c
bootdisk: virtio0
cores: 4
cpuunits: 1000
description: Old Windows 2003
freeze: 0
hotplug: 0
ide2: none,media=cdrom
kvm: 1
memory: 6144
name: OldWindows2k3
net0: virtio=6A:6D:B1:4F:3E:7C,link_down=1
onboot: 0
ostype: wxp
sockets: 1
tablet: 0
virtio0: SANSSD:99999/vm-99999-disk-1.qcow2,format=qcow2,discard=on,iops_rd=70,iops_wr=70,iops_rd_max=350,iops_wr_max=350,mbps_rd=20,mbps_wr=20,mbps_rd_max=100,mbps_wr_max=100,size=34G
virtio1: SANSSD:99999/vm-99999-disk-2.qcow2,format=qcow2,discard=on,iops_rd=70,iops_wr=70,iops_rd_max=350,iops_wr_max=350,mbps_rd=20,mbps_wr=20,mbps_rd_max=100,mbps_wr_max=100,size=102G
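
A possible workaround sketch (hypothetical, not suggested anywhere in this thread): rebuild the disk spec string without the throttle options, apply it with qm set before the backup, and restore the original spec afterwards. The snippet below only shows the string manipulation; the spec is copied from the qm config output above, and the qm invocation in the comment is illustrative.

```python
# Hypothetical helper: strip Proxmox disk-throttle options from a disk spec
# string, so a plain spec can be applied with "qm set" before a backup.
THROTTLE_KEYS = {"iops_rd", "iops_wr", "iops_rd_max", "iops_wr_max",
                 "mbps_rd", "mbps_wr", "mbps_rd_max", "mbps_wr_max"}

def strip_throttle(spec):
    """Drop throttle key=value options from a Proxmox disk spec string."""
    parts = spec.split(",")
    kept = [parts[0]] + [p for p in parts[1:]
                         if p.split("=", 1)[0] not in THROTTLE_KEYS]
    return ",".join(kept)

# Spec copied from the qm config output in this thread:
spec = ("SANSSD:99999/vm-99999-disk-1.qcow2,format=qcow2,discard=on,"
        "iops_rd=70,iops_wr=70,iops_rd_max=350,iops_wr_max=350,"
        "mbps_rd=20,mbps_wr=20,mbps_rd_max=100,mbps_wr_max=100,size=34G")
print(strip_throttle(spec))
# → SANSSD:99999/vm-99999-disk-1.qcow2,format=qcow2,discard=on,size=34G
# Then, e.g.:  qm set 99999 --virtio0 "<stripped spec>"
```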
 
I tried to find the problem and strace'd the Proxmox backup task's PID (26074). It was looping like this:
Code:
stat("/etc/pve/nodes/node02/qemu-server/99999.conf", {st_dev=makedev(0, 22), st_ino=1101, st_mode=S_IFREG|0640, st_nlink=1, st_uid=0, st_gid=33, st_blksize=4096, st_blocks=1, st_size=697, st_atime=2015/09/17-18:08:54, st_mtime=2015/09/17-18:08:54, st_ctime=2015/09/17-18:08:54}) = 0
open("/var/run/qemu-server/99999.pid", O_RDONLY) = 18
ioctl(18, SNDCTL_TMR_TIMEBASE or TCGETS, 0x7ffeb415f940) = -1 ENOTTY (Inappropriate ioctl for device)
lseek(18, 0, SEEK_CUR)                  = 0
fstat(18, {st_dev=makedev(0, 16), st_ino=89630958, st_mode=S_IFREG|0600, st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=8, st_size=6, st_atime=2015/09/17-18:08:54, st_mtime=2015/09/17-18:08:54, st_ctime=2015/09/17-18:08:54}) = 0
fcntl(18, F_SETFD, FD_CLOEXEC)          = 0
fstat(18, {st_dev=makedev(0, 16), st_ino=89630958, st_mode=S_IFREG|0600, st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=8, st_size=6, st_atime=2015/09/17-18:08:54, st_mtime=2015/09/17-18:08:54, st_ctime=2015/09/17-18:08:54}) = 0
read(18, "26117\n", 8192)               = 6
close(18)                               = 0
open("/proc/26117/cmdline", O_RDONLY)   = 18
ioctl(18, SNDCTL_TMR_TIMEBASE or TCGETS, 0x7ffeb415f940) = -1 ENOTTY (Inappropriate ioctl for device)
lseek(18, 0, SEEK_CUR)                  = 0
fstat(18, {st_dev=makedev(0, 3), st_ino=89636995, st_mode=S_IFREG|0444, st_nlink=1, st_uid=0, st_gid=0, st_blksize=1024, st_blocks=0, st_size=0, st_atime=2015/09/17-18:08:54, st_mtime=2015/09/17-18:08:54, st_ctime=2015/09/17-18:08:54}) = 0
fcntl(18, F_SETFD, FD_CLOEXEC)          = 0
read(18, "/usr/bin/kvm\0-id\00099999\0-chardev\0"..., 8192) = 1502
read(18, "", 8192)                      = 0
close(18)                               = 0
open("/proc/26117/stat", O_RDONLY)      = 18
ioctl(18, SNDCTL_TMR_TIMEBASE or TCGETS, 0x7ffeb415f940) = -1 ENOTTY (Inappropriate ioctl for device)
lseek(18, 0, SEEK_CUR)                  = 0
fstat(18, {st_dev=makedev(0, 3), st_ino=89636996, st_mode=S_IFREG|0444, st_nlink=1, st_uid=0, st_gid=0, st_blksize=1024, st_blocks=0, st_size=0, st_atime=2015/09/17-18:08:54, st_mtime=2015/09/17-18:08:54, st_ctime=2015/09/17-18:08:54}) = 0
fcntl(18, F_SETFD, FD_CLOEXEC)          = 0
read(18, "26117 (kvm) S 1 26116 26116 0 -1"..., 8192) = 381
close(18)                               = 0
stat("/var/run/qemu-server/99999.qmp", {st_dev=makedev(0, 16), st_ino=89630957, st_mode=S_IFSOCK|0750, st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=0, st_size=0, st_atime=2015/09/17-18:08:54, st_mtime=2015/09/17-18:08:54, st_ctime=2015/09/17-18:08:54}) = 0
socket(PF_FILE, SOCK_STREAM, 0)         = 18
ioctl(18, SNDCTL_TMR_TIMEBASE or TCGETS, 0x7ffeb415fb20) = -1 ENOTTY (Inappropriate ioctl for device)
lseek(18, 0, SEEK_CUR)                  = -1 ESPIPE (Illegal seek)
ioctl(18, SNDCTL_TMR_TIMEBASE or TCGETS, 0x7ffeb415fb20) = -1 ENOTTY (Inappropriate ioctl for device)
lseek(18, 0, SEEK_CUR)                  = -1 ESPIPE (Illegal seek)
fcntl(18, F_SETFD, FD_CLOEXEC)          = 0
fcntl(18, F_GETFL)                      = 0x2 (flags O_RDWR)
fcntl(18, F_SETFL, O_RDWR|O_NONBLOCK)   = 0
connect(18, {sa_family=AF_FILE, path="/var/run/qemu-server/99999.qmp"}, 110) = 0
fcntl(18, F_GETFL)                      = 0x802 (flags O_RDWR|O_NONBLOCK)
fcntl(18, F_SETFL, O_RDWR)              = 0
fcntl(18, F_GETFL)                      = 0x2 (flags O_RDWR)
fcntl(18, F_SETFL, O_RDWR|O_NONBLOCK)   = 0
getsockopt(18, SOL_SOCKET, SO_TYPE, [1], [4]) = 0
select(24, [18], [18], NULL, {599, 999867}) = 2 (in [18], out [18], left {599, 999865})
read(18, "{\"QMP\": {\"version\": {\"qemu\": {\"m"..., 8192) = 105
write(18, "{\"execute\":\"qmp_capabilities\",\"i"..., 63) = 63
select(24, [18], [], NULL, {599, 999555}) = 1 (in [18], left {599, 999060})
read(18, "{\"return\": {}, \"id\": \"26074:2866"..., 8192) = 36
select(24, [18], [18], NULL, {599, 998853}) = 1 (out [18], left {599, 998850})
write(18, "{\"execute\":\"query-backup\",\"id\":\""..., 59) = 59
select(24, [18], [], NULL, {599, 998716}) = 1 (in [18], left {599, 998257})
read(18, "{\"return\": {\"total\": 14602888806"..., 8192) = 233
close(18)                               = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], SA_RESTORER, 0x7f200fed60a0}, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
nanosleep({1, 0}, 0x7ffeb415fd80)       = 0
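
The loop above is the vzdump task reconnecting to the VM's QMP control socket every second, negotiating capabilities, and asking query-backup for progress. A minimal sketch of that handshake, assuming the socket path and command names seen in the trace (this is an illustration, not the actual Proxmox code):

```python
# Sketch of the polling loop visible in the strace output: connect to the QMP
# socket, read the greeting, negotiate capabilities, then query backup status.
import json
import socket

def qmp_line(command):
    """Serialize one QMP command as a newline-terminated JSON line."""
    return json.dumps({"execute": command}) + "\n"

def query_backup(sock_path="/var/run/qemu-server/99999.qmp"):
    s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    s.connect(sock_path)
    f = s.makefile("rw")
    f.readline()                        # QMP greeting banner
    f.write(qmp_line("qmp_capabilities")); f.flush()
    f.readline()                        # {"return": {}}
    f.write(qmp_line("query-backup")); f.flush()
    reply = json.loads(f.readline())    # e.g. {"return": {"total": ..., ...}}
    s.close()
    return reply.get("return", {})

# The task then sleeps for one second and repeats, matching the connect()/
# close() pairs and the nanosleep({1, 0}) call in the trace.
```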

Then I checked the open files:
Code:
root@node02:/mnt/pve/SANSSD/images/99999# ls -lash
 14G -rw-r--r-- 1 root root  35G Sep 16 21:52 vm-99999-disk-1.qcow2
 68G -rw-r--r-- 1 root root 103G Sep 16 21:52 vm-99999-disk-2.qcow2

root@node02:/mnt/pve/SANSSD/images/99999# lsof vm-99999-disk-1.qcow2 
COMMAND   PID USER   FD   TYPE DEVICE    SIZE/OFF     NODE NAME
kvm     26117 root   14u   REG   0,30 36513054720 19529754 vm-99999-disk-1.qcow2

root@node02:/mnt/pve/SANSSD/images/99999# lsof vm-99999-disk-2.qcow2 
COMMAND   PID USER   FD   TYPE DEVICE     SIZE/OFF     NODE NAME
kvm     26117 root   11u   REG   0,30 109538705408 19529746 vm-99999-disk-2.qcow2

root@node02:/mnt/pve/QNAPNFS/dump# ls -lash
4.0K -rw-r--r-- 1 nobody nogroup   38 Sep 17 18:08 vzdump-qemu-99999-2015_09_17-18_08_53.vma.dat

root@node02:/mnt/pve/QNAPNFS/dump# lsof  vzdump-qemu-99999-2015_09_17-18_08_53.vma.dat
COMMAND   PID USER   FD   TYPE DEVICE SIZE/OFF     NODE NAME
task    26074 root   15w   REG   0,24       38 39518215 vzdump-qemu-99999-2015_09_17-18_08_53.vma.dat
lzop    26125 root    1w   REG   0,24       38 39518215 vzdump-qemu-99999-2015_09_17-18_08_53.vma.dat

root@node02:/mnt/pve/QNAPNFS/dump# lsof /var/run/qemu-server/99999.pid
COMMAND   PID USER   FD   TYPE DEVICE SIZE/OFF     NODE NAME
kvm     26117 root    8uW  REG   0,16        6 89630958 /run/qemu-server/99999.pid

root@node02:/mnt/pve/QNAPNFS/dump# lsof /var/run/qemu-server/99999.qmp
COMMAND   PID USER   FD   TYPE             DEVICE SIZE/OFF     NODE NAME
kvm     26117 root    7u  unix 0xffff881be49bc740      0t0 89630956 /var/run/qemu-server/99999.qmp


So I checked these PIDs (26117 and 26125). lzop (26125) was simply blocked in read(0, ..., waiting for data.

But the KVM process (26117) loops continuously:
it gets a "Resource temporarily unavailable" error, waits, reads, then errors again...
Code:
root@node02:/mnt/pve/QNAPNFS/dump# time strace -c -qvvfp 26117
^C
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 70.56    0.023371           5      5113           ppoll
 14.59    0.004832           1      5796           write
  7.15    0.002367           1      2821           recvmsg
  7.09    0.002347           1      2882        67 read
  0.16    0.000053           3        16           close
  0.15    0.000049           1        48           fcntl
  0.14    0.000045           3        16           accept4
  0.05    0.000018           1        16           fstat
  0.05    0.000017           1        16           getsockname
  0.04    0.000014           1        16           getpeername
  0.02    0.000007           1         7           ioctl
------ ----------- ----------- --------- --------- ----------------
100.00    0.033120                 16747        67 total

real    0m9.159s
user    0m0.041s
sys     0m0.226s

root@node02:/mnt/pve/QNAPNFS/dump# strace -e read,ppoll,write -qvvfp 26117
[pid 26117] ppoll([{fd=21, events=POLLIN|POLLERR|POLLHUP}, {fd=3, events=POLLIN|POLLERR|POLLHUP}, {fd=6, events=POLLIN}, {fd=15, events=POLLIN}, {fd=12, events=POLLIN}, {fd=5, events=POLLIN}, {fd=22, events=POLLIN}], 7, {0, 4000000}, NULL, 8) = 2 ([{fd=6, revents=POLLIN}, {fd=22, revents=POLLIN}], left {0, 3998677})
[pid 26117] read(6, "\2\0\0\0\0\0\0\0", 16) = 8
[pid 26117] write(6, "\1\0\0\0\0\0\0\0", 8) = 8
[pid 26117] write(6, "\1\0\0\0\0\0\0\0", 8) = 8
[pid 26117] ppoll([{fd=21, events=POLLIN|POLLERR|POLLHUP}, {fd=3, events=POLLIN|POLLERR|POLLHUP}, {fd=6, events=POLLIN}, {fd=15, events=POLLIN}, {fd=12, events=POLLIN}, {fd=5, events=POLLIN}, {fd=22, events=POLLIN}], 7, {0, 4000000}, NULL, 8) = 2 ([{fd=6, revents=POLLIN}, {fd=22, revents=POLLIN}], left {0, 3998648})
[pid 26117] read(6, "\2\0\0\0\0\0\0\0", 16) = 8
[pid 26117] write(6, "\1\0\0\0\0\0\0\0", 8) = 8
[pid 26117] write(6, "\1\0\0\0\0\0\0\0", 8) = 8
[pid 26117] read(6, 0x7ffc04ea4780, 16) = -1 EAGAIN (Resource temporarily unavailable)
[pid 26117] ppoll([{fd=21, events=POLLIN|POLLERR|POLLHUP}, {fd=3, events=POLLIN|POLLERR|POLLHUP}, {fd=6, events=POLLIN}, {fd=15, events=POLLIN}, {fd=12, events=POLLIN}, {fd=5, events=POLLIN}, {fd=7, events=POLLIN}], 7, {0, 4000000}, NULL, 8) = 0 (Timeout)
[pid 26117] ppoll([{fd=21, events=POLLIN|POLLERR|POLLHUP}, {fd=3, events=POLLIN|POLLERR|POLLHUP}, {fd=6, events=POLLIN}, {fd=15, events=POLLIN}, {fd=12, events=POLLIN}, {fd=5, events=POLLIN}, {fd=7, events=POLLIN}], 7, {0, 4000000}, NULL, 8) = 0 (Timeout)
[pid 26117] ppoll([{fd=21, events=POLLIN|POLLERR|POLLHUP}, {fd=3, events=POLLIN|POLLERR|POLLHUP}, {fd=6, events=POLLIN}, {fd=15, events=POLLIN}, {fd=12, events=POLLIN}, {fd=5, events=POLLIN}, {fd=7, events=POLLIN}], 7, {0, 4000000}, NULL, 8) = 0 (Timeout)
[pid 26117] ppoll([{fd=21, events=POLLIN|POLLERR|POLLHUP}, {fd=3, events=POLLIN|POLLERR|POLLHUP}, {fd=6, events=POLLIN}, {fd=15, events=POLLIN}, {fd=12, events=POLLIN}, {fd=5, events=POLLIN}, {fd=7, events=POLLIN}], 7, {0, 4000000}, NULL, 8) = 0 (Timeout)

As far as I understand, the throttle limits affect which system calls the backup process makes.

For comparison, here is the KVM process trace during a backup without any disk throttle:
Code:
root@node02:/mnt/pve/QNAPNFS/dump# time strace -c -qvvfp 15174
^C
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 59.85    0.096854           1     77134           write
 13.14    0.021265           2     11608           ppoll
  9.76    0.015795           1     16944       862 read
  9.65    0.015622           4      3597           io_submit
  5.00    0.008095           2      3597           io_getevents
  2.44    0.003948           1      3801           recvmsg
  0.06    0.000091           5        20           close
  0.03    0.000051           1        60           fcntl
  0.02    0.000037           2        20           accept4
  0.01    0.000020           1        20           getsockname
  0.01    0.000019           2        11           ioctl
  0.01    0.000017           1        20           fstat
  0.01    0.000014           1        20           getpeername
------ ----------- ----------- --------- --------- ----------------
100.00    0.161828                116852       862 total


real    0m9.945s
user    0m0.367s
sys     0m1.491s
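
The telling difference between the two summaries: with throttling active, io_submit and io_getevents never appear, i.e. KVM is not submitting any disk I/O for the backup at all. A quick sanity check over the syscall names from the two tables above:

```python
# Syscall names copied from the two "strace -c" summaries in this thread:
# throttled backup vs. unthrottled backup.
throttled = {"ppoll", "write", "recvmsg", "read", "close", "fcntl",
             "accept4", "fstat", "getsockname", "getpeername", "ioctl"}
unthrottled = throttled | {"io_submit", "io_getevents"}

# The syscalls that disappear once the throttle is set:
print(sorted(unthrottled - throttled))
# → ['io_getevents', 'io_submit']
```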
 
Hi
Thanks for the time you took on this issue. I can reproduce it with Proxmox VE 3.4, but everything is fine with 4.0 Beta2 (with or without VirtIO devices).
Could you maybe upgrade your host?
 
Hello, thanks for your reply.

It's a 3-node cluster. There isn't anything too important on it, but I don't want to risk it.
I think it's better to wait for the release. Thanks again.