ERROR: Backup of VM <vmid> failed - Node 'drive-virtio0' is busy: block device is in use by block job: backup

MartinZie

New Member
Jun 26, 2018
Hi,

I need some help - we are not able to back up our VMs and containers. For some reason the backup process stalls at some point and does not continue.

This is the log (the backup stalled at 23%; I stopped it manually after several hours):

INFO: starting new backup job: vzdump 107 --compress lzo --storage local --mode snapshot --node ariel3
INFO: Starting Backup of VM 107 (qemu)
INFO: status = running
INFO: update VM 107: -lock backup
INFO: VM Name: titan
INFO: include disk 'ide0' 'keeprbd:vm-107-disk-1' 50G
INFO: include disk 'virtio0' 'keeprbd:vm-107-disk-2' 400G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating archive '/var/lib/vz/dump/vzdump-qemu-107-2018_07_22-23_13_23.vma.lzo'
INFO: started backup task '67a28ad7-863a-45a2-97a0-2a34786bebfe'
INFO: status: 0% (778633216/483183820800), sparse 0% (476749824), duration 3, read/write 259/100 MB/s
INFO: status: 1% (5500239872/483183820800), sparse 0% (3599241216), duration 19, read/write 295/99 MB/s
INFO: status: 2% (9803268096/483183820800), sparse 1% (7525838848), duration 28, read/write 478/41 MB/s
INFO: status: 3% (14871953408/483183820800), sparse 2% (12471676928), duration 36, read/write 633/15 MB/s
INFO: status: 4% (19894697984/483183820800), sparse 3% (17262485504), duration 44, read/write 627/28 MB/s
INFO: status: 5% (24825954304/483183820800), sparse 4% (22168535040), duration 51, read/write 704/3 MB/s
INFO: status: 6% (29135536128/483183820800), sparse 4% (22929035264), duration 72, read/write 205/169 MB/s
INFO: status: 7% (33990377472/483183820800), sparse 5% (25684680704), duration 91, read/write 255/110 MB/s
INFO: status: 8% (38850920448/483183820800), sparse 5% (27206504448), duration 113, read/write 220/151 MB/s
INFO: status: 9% (43556339712/483183820800), sparse 6% (29629091840), duration 130, read/write 276/134 MB/s
INFO: status: 10% (48502276096/483183820800), sparse 6% (31348297728), duration 159, read/write 170/111 MB/s
INFO: status: 11% (54051471360/483183820800), sparse 7% (36365295616), duration 170, read/write 504/48 MB/s
INFO: status: 12% (58047332352/483183820800), sparse 7% (38584745984), duration 188, read/write 221/98 MB/s
INFO: status: 13% (62860492800/483183820800), sparse 8% (41522401280), duration 206, read/write 267/104 MB/s
INFO: status: 14% (67654975488/483183820800), sparse 8% (43381915648), duration 230, read/write 199/122 MB/s
INFO: status: 15% (72597176320/483183820800), sparse 9% (44818755584), duration 256, read/write 190/134 MB/s
INFO: status: 16% (77534593024/483183820800), sparse 9% (46654148608), duration 279, read/write 214/134 MB/s
INFO: status: 17% (82156716032/483183820800), sparse 9% (47831597056), duration 308, read/write 159/118 MB/s
INFO: status: 18% (87025909760/483183820800), sparse 10% (49060966400), duration 335, read/write 180/134 MB/s
INFO: status: 19% (91971584000/483183820800), sparse 10% (49986768896), duration 367, read/write 154/125 MB/s
INFO: status: 20% (96727400448/483183820800), sparse 10% (50781589504), duration 397, read/write 158/132 MB/s
INFO: status: 21% (101582766080/483183820800), sparse 11% (53600915456), duration 414, read/write 285/119 MB/s
INFO: status: 22% (106304962560/483183820800), sparse 11% (54811701248), duration 435, read/write 224/167 MB/s
INFO: status: 23% (111283732480/483183820800), sparse 11% (56046559232), duration 460, read/write 199/149 MB/s
ERROR: interrupted by signal
INFO: aborting backup job
ERROR: Backup of VM 107 failed - interrupted by signal
ERROR: Backup job failed - interrupted by signal

TASK ERROR: interrupted by signal
####
When I tried to start the backup manually, I got this message:

INFO: starting new backup job: vzdump 107 --remove 0 --mode snapshot --compress lzo --storage keepbackup --node ariel3
INFO: Starting Backup of VM 107 (qemu)
INFO: status = running
INFO: update VM 107: -lock backup
INFO: VM Name: titan
INFO: include disk 'ide0' 'keeprbd:vm-107-disk-1' 50G
INFO: include disk 'virtio0' 'keeprbd:vm-107-disk-2' 400G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: pending configuration changes found (not included into backup)
INFO: creating archive '/srv/backup/dump/vzdump-qemu-107-2018_07_23-08_10_11.vma.lzo'
ERROR: Node 'drive-virtio0' is busy: block device is in use by block job: backup
INFO: aborting backup job
ERROR: Backup of VM 107 failed - Node 'drive-virtio0' is busy: block device is in use by block job: backup
INFO: Backup job finished with errors
TASK ERROR: job errors

-> Virtual Environment 5.2-2
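
What I have not checked yet is whether the interrupted backup job is still attached to the drive inside the running QEMU process. As far as I understand, that could be inspected roughly like this (standard QEMU HMP commands via the Proxmox monitor; VMID 107 as in the log above, and these commands do not change anything):

Code:
# open the QEMU monitor of the VM
qm monitor 107
# inside the monitor: list block jobs still attached to the drives
info block-jobs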

Can anybody help? As the backup target I have tried both the local disk and the 'keepbackup' storage (see the logs above).
Thanks,
Martin
 
I've got the same problem: the backup got stuck, I cancelled it, and now when I try to back up the VM again I get this error:

Code:
ERROR: Node 'drive-virtio0' is busy: block device is in use by block job: backup
INFO: aborting backup job
ERROR: Backup of VM 699 failed - Node 'drive-virtio0' is busy: block device is in use by block job: backup
INFO: Backup job finished with errors
TASK ERROR: job errors
-------------------------------
qm unlock 699 doesn't help.

pve-manager/5.2-6/bcd5f008 (running kernel: 4.15.18-1-pve)
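
If I understand correctly, the stuck backup job lives inside the running QEMU process, so qm unlock only clears the config lock. In theory the job itself could be cancelled from the monitor - a sketch, not verified here; info block-jobs and block_job_cancel are standard QEMU HMP commands, and the PVE-patched QEMU is supposed to also provide a backup_cancel command:

Code:
# open the QEMU monitor of the stuck VM
qm monitor 699
# list the block jobs still attached to the drives
info block-jobs
# try to cancel the job on the busy drive (standard QEMU HMP command)
block_job_cancel drive-virtio0
# alternatively, the PVE-specific variant, if this QEMU build provides it
backup_cancel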


UPDATE:
I've tried live migration to another node and it got stuck at the end as well... aborting again:
Code:
2018-08-04 17:59:32 starting migration of VM 699 to node 'node-3' (x.x.x.x)
2018-08-04 17:59:32 copying disk images
2018-08-04 17:59:32 starting VM 699 on remote node 'node-3'
2018-08-04 17:59:35 start remote tunnel
2018-08-04 17:59:35 ssh tunnel ver 1
2018-08-04 17:59:35 starting online/live migration on unix:/run/qemu-server/699.migrate
2018-08-04 17:59:35 migrate_set_speed: 8589934592
2018-08-04 17:59:35 migrate_set_downtime: 0.1
2018-08-04 17:59:35 set migration_caps
2018-08-04 17:59:35 set cachesize: 1073741824
2018-08-04 17:59:35 start migrate command to unix:/run/qemu-server/699.migrate
2018-08-04 17:59:36 migration status: active (transferred 98670789, remaining 10594033664), total 10755055616)
2018-08-04 17:59:36 migration xbzrle cachesize: 1073741824 transferred 0 pages 0 cachemiss 0 overflow 0
......
2018-08-04 18:01:02 migration xbzrle cachesize: 1073741824 transferred 0 pages 0 cachemiss 154814 overflow 0
2018-08-04 18:01:02 migration status: active (transferred 10146249824, remaining 59158528), total 10755055616)
2018-08-04 18:01:02 migration xbzrle cachesize: 1073741824 transferred 580518 pages 3201 cachemiss 157556 overflow 24
2018-08-04 18:01:02 migration status: active (transferred 10158147745, remaining 26271744), total 10755055616)
2018-08-04 18:01:02 migration xbzrle cachesize: 1073741824 transferred 801508 pages 4695 cachemiss 160399 overflow 26
query migrate failed: interrupted by signal

2018-08-04 18:02:59 query migrate failed: interrupted by signal


Now the KVM process is stuck - I can't connect to the guest OS.

I tried to shut the VM down:
Code:
TASK ERROR: VM quit/powerdown failed

Then I tried to stop it:
Code:
VM quit/powerdown failed - terminating now with SIGTERM
VM still running - terminating now with SIGKILL
rbd: sysfs write failed
can't unmap rbd volume vm-699-disk-1: rbd: sysfs write failed
volume deactivation failed: ssd-ceph:vm-699-disk-1 at /usr/share/perl5/PVE/Storage.pm line 1003.
TASK OK
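
The "can't unmap rbd volume" part seems to mean that the kernel RBD mapping of the killed VM was left behind. As far as I know it can be checked and cleaned up manually, roughly like this (the /dev/rbd0 path is only an example and has to be taken from the rbd showmapped output; unmapping is only safe if nothing is actually using the device):

Code:
# list the kernel RBD mappings on this node
rbd showmapped
# check whether anything still holds the device open (example device path)
lsof /dev/rbd0
# try to unmap the stale mapping; newer kernels also accept '-o force'
rbd unmap /dev/rbd0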

Now when I try to start it:
Code:
rbd: sysfs write failed
can't unmap rbd volume vm-699-disk-1: rbd: sysfs write failed
TASK ERROR: start failed: org.freedesktop.systemd1.UnitExists: Unit 699.scope already exists.

-------------------
Code:
root@node-4:~# find / -name 699.scope
/var/lib/lxcfs/cgroup/pids/qemu.slice/699.scope
/var/lib/lxcfs/cgroup/cpu,cpuacct/qemu.slice/699.scope
/var/lib/lxcfs/cgroup/name=systemd/qemu.slice/699.scope
find: ‘/proc/127393/task/127393/net’: Invalid argument
find: ‘/proc/127393/task/127450/net’: Invalid argument
find: ‘/proc/127393/net’: Invalid argument
/run/systemd/transient/699.scope
/sys/fs/cgroup/pids/qemu.slice/699.scope
/sys/fs/cgroup/cpu,cpuacct/qemu.slice/699.scope
/sys/fs/cgroup/systemd/qemu.slice/699.scope
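
Since /run/systemd/transient/699.scope is still there, the scope looks like a leftover transient unit of the killed QEMU process. As far as I understand it could be cleared through systemd instead of digging through the cgroup tree (a sketch, only appropriate if the unit really is stale and no qemu process for this VM is running):

Code:
# check what systemd thinks about the leftover scope
systemctl status 699.scope
# stop the stale transient scope and clear any failed state
systemctl stop 699.scope
systemctl reset-failed 699.scope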

The image had no lock:
Code:
rbd -p ssd-pool lock ls --image vm-699-disk-1
# shows nothing

I renamed the image and edited the QEMU config file accordingly:
Code:
root@node-4:~# rbd -p ssd-pool mv vm-699-disk-1 vm-24699-disk-1

Now it starts and I can back it up.
I've got this problem on a few VMs on one node. How can I "unlock" a VM without stopping it or moving the disk?
 
There are two ways to unlock a VM without moving the disk (see the sketch below):
1. qm unlock <vmid>
2. Manually remove the file /var/lock/qemu-server/lock-<vmid>.conf
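
For example, assuming VMID 699 from the posts above:

Code:
# 1. clear the config lock left behind by the aborted backup
qm unlock 699
# 2. or remove the qemu-server lock file manually
rm /var/lock/qemu-server/lock-699.conf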
 
Removing /var/lock/qemu-server/lock-<vmid>.conf doesn't help.
Some VMs still can't finish a backup, and I get the same error when trying to stop/start those VMs.
I've also tried migrating the disk (both live and offline) to another Ceph pool, and the migration doesn't even start.
 
After qm unlock <vmid>, the /var/lock/qemu-server/lock-<vmid>.conf file is recreated. Even after removing it, I still have no control over backups of the machine whose backup stalled and which I interrupted manually:

INFO: starting new backup job:
INFO: Starting Backup of VM <vmid> (qemu)
INFO: status = running
INFO: update VM <vmid>: -lock backup
INFO: VM Name: MyServer
INFO: include disk 'virtio0' 'poolname:vm-<vmid>-disk-0' 300G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating archive 'file.vma.gz'
ERROR: Node 'drive-virtio0' is busy: block device is in use by block job: backup
INFO: aborting backup job
ERROR: Backup of VM <vmid> failed - Node 'drive-virtio0' is busy: block device is in use by block job: backup
INFO: Backup job finished with errors
TASK ERROR: job errors

I suppose this can happen if the VM is under heavy load during the backup, but how do I finally get rid of this stuck backup job without rebooting the Proxmox server?
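
At least I can check whether a vzdump worker is still running on the node and whether the task is still listed as active (paths as on a default PVE installation, as far as I know; this only shows the state and does not clean anything up):

Code:
# check for a leftover vzdump worker process
ps aux | grep '[v]zdump'
# check the active task index kept by the PVE task manager
cat /var/log/pve/tasks/active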

I have found a similar problem reported for containers, but I only have VMs:
https://bugzilla.proxmox.com/show_bug.cgi?id=1911

After the VM is stopped, I get another error:
rbd: sysfs write failed
can't unmap rbd device /dev/rbd/<poolname>/vm-<vmid>-disk-1: rbd: sysfs write failed
ERROR: Backup of VM <vmid> failed - start failed: org.freedesktop.systemd1.UnitExists: Unit <vmid>.scope already exists.

Of course, rebooting the Proxmox server helps, but that is not a real solution.


 
Maybe the problem has stopped; I will keep testing for some time and then report back.
After several backups, though only with a few clients (I was cautious after the many failures), I have not had a single error.
 