Scheduled VM backup fails randomly, even with the patch (post-5.2 version)

Thomas P.

Member
Sep 6, 2018
Hello,

I would like to report that we still see the problem of randomly failing scheduled backups, in "stop" mode, on VMs running Proxmox 5.2-5.

I read in another thread that there was a problem which has since been fixed (please see: https://forum.proxmox.com/threads/problems-with-backing-up-and-boot-again.35962/#post-210853).
With that patch the VMs now always reboot fine, which is great, but the backup itself still fails.
That is why I am opening this issue separately, to focus specifically on the backup failure.

The error message is the same each time:

INFO: Starting Backup of VM 100 (qemu)
INFO: status = running
INFO: update VM 100: -lock backup
INFO: backup mode: stop
INFO: ionice priority: 7
INFO: VM Name: webserver-int
INFO: include disk 'scsi0' 'zfs:vm-100-disk-1' 200G
INFO: stopping vm
INFO: creating archive '/mnt/pve/backup-daily/dump/vzdump-qemu-100-2018_09_06-04_37_01.vma.lzo'
INFO: starting kvm to execute backup task
INFO: restarting vm
INFO: vm is online again after 22 seconds
ERROR: Backup of VM 100 failed - start failed: org.freedesktop.systemd1.UnitExists: Unit 100.scope already exists.


This happens randomly. We back up the same 6 VMs every day (some of them pretty big), and about twice a week we hit this error, which makes the backup of that VM fail. Yesterday 2 of the 6 VMs failed.
It can happen on any VM, both those running Debian (with the PVE guest agent) and those without the guest agent (like a pfSense instance).

All VMs restart fine (I think this is what the patch fixed), but the backup itself is not done.

Our Proxmox is fairly new (fresh install a few months ago as Proxmox 5.1), and we recently upgraded to 5.2-5. There are no other problems with our Proxmox instance; everything works great except this backup issue.
We only update from the enterprise repository (with a subscription) to keep maximum stability.

We prefer using "stop" mode to garantee datas integrity during backup (important datas, with live production database running, etc.).
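For reference, our scheduled job is equivalent to running something like this by hand (just an illustration; the storage name and mail address below are placeholders, not our real values):

# stop-mode backup of VM 100 to the "backup-daily" storage, LZO compression,
# mail a report only on failure (storage name and address are placeholders)
vzdump 100 --mode stop --storage backup-daily --compress lzo --mailto admin@example.com --mailnotification failure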

Please feel free to ask for any information that can help you resolve this problem.
I hope it can be fixed, as it is quite annoying not to have a robust backup system that my company can trust.

Thanks.
 
Please post your:

> pveversion -v
 
Here is the output of "pveversion -v" on the Proxmox host:

proxmox-ve: 5.2-2 (running kernel: 4.15.18-1-pve)
pve-manager: 5.2-5 (running version: 5.2-5/eb24855a)
pve-kernel-4.15: 5.2-4
pve-kernel-4.15.18-1-pve: 4.15.18-15
pve-kernel-4.13.13-5-pve: 4.13.13-38
corosync: 2.4.2-pve5
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.0-8
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-35
libpve-guest-common-perl: 2.0-17
libpve-http-server-perl: 2.0-9
libpve-storage-perl: 5.0-24
libqb0: 1.0.1-1
lvm2: 2.02.168-pve6
lxc-pve: 3.0.0-3
lxcfs: 3.0.0-1
novnc-pve: 1.0.0-1
proxmox-widget-toolkit: 1.0-19
pve-cluster: 5.0-28
pve-container: 2.0-24
pve-docs: 5.2-4
pve-firewall: 3.0-13
pve-firmware: 2.0-5
pve-ha-manager: 2.0-5
pve-i18n: 1.0-6
pve-libspice-server1: 0.12.8-3
pve-qemu-kvm: 2.11.2-1
pve-xtermjs: 1.0-5
qemu-server: 5.0-29
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.9-pve1~bpo9
 
Sorry to bump this up, but is there any news on it? Do you need any more information?
Thanks, best regards.
 
Good morning everyone,
I have the same problem too.
Unfortunately, as long as the backup tool is unreliable, it affects my client's whole infrastructure. Is there no solution? The backup logs are below.
Thank you all.


INFO: starting new backup job: vzdump 107 --storage Disk1 --quiet 1 --mailto xxx@yyyy.it --compress lzo --mode stop --mailnotification failure
INFO: Starting Backup of VM 107 (qemu)
INFO: status = running
INFO: update VM 107: -lock backup
INFO: backup mode: stop
INFO: ionice priority: 7
INFO: VM Name: APPL-win2008-FC
INFO: include disk 'scsi0' 'FC-storage:107/vm-107-disk-1.qcow2' 163G
INFO: include disk 'scsi1' 'FC-storage:107/vm-107-disk-2.qcow2' 300G
INFO: stopping vm
INFO: creating archive '/Disk1/dump/vzdump-qemu-107-2018_09_20-22_00_02.vma.lzo'
INFO: starting kvm to execute backup task
INFO: restarting vm
INFO: vm is online again after 26 seconds
ERROR: Backup of VM 107 failed - start failed: org.freedesktop.systemd1.UnitExists: Unit 107.scope already exists.
INFO: Backup job finished with errors

TASK ERROR: job errors
 
Hi
I have just installed the latest release of Proxmox on a Dell R710 with 48 GB RAM and 6 x 4TB Seagate HDDs under ZFS RAID 2.
I get failures on both manual and scheduled backups when using stop mode.
Backups work when using snapshot mode.
Further details are below.
========================================================================================
root@pve01:~# pveversion -v
proxmox-ve: 5.3-1 (running kernel: 4.15.18-9-pve)
pve-manager: 5.3-6 (running version: 5.3-6/37b3c8df)
pve-kernel-4.15: 5.2-12
pve-kernel-4.15.18-9-pve: 4.15.18-30
corosync: 2.4.4-pve1
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.1-3
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-43
libpve-guest-common-perl: 2.0-18
libpve-http-server-perl: 2.0-11
libpve-storage-perl: 5.0-34
libqb0: 1.0.3-1~bpo9
lvm2: 2.02.168-pve6
lxc-pve: 3.0.2+pve1-5
lxcfs: 3.0.2-2
novnc-pve: 1.0.0-2
proxmox-widget-toolkit: 1.0-22
pve-cluster: 5.0-31
pve-container: 2.0-31
pve-docs: 5.3-1
pve-edk2-firmware: 1.20181023-1
pve-firewall: 3.0-16
pve-firmware: 2.0-6
pve-ha-manager: 2.0-5
pve-i18n: 1.0-9
pve-libspice-server1: 0.14.1-1
pve-qemu-kvm: 2.12.1-1
pve-xtermjs: 1.0-5
qemu-server: 5.0-43
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.12-pve1~bpo1
root@pve01:~#
=======================================================================================
Backup Log (Virtual Environment 5.3-6, node 'pve01')
CPU(s): 16 x Intel(R) Xeon(R) CPU X5550 @ 2.67GHz (2 Sockets), usage 0.67%
IO delay: 0.00% | Load average: 0.21, 0.28, 0.27
RAM usage: 72.36% (34.12 GiB of 47.15 GiB) | KSM sharing: 0 B
HD space (root): 6.92% (924.14 GiB of 13.03 TiB) | SWAP usage: N/A
Kernel Version: Linux 4.15.18-9-pve #1 SMP PVE 4.15.18-30 (Thu, 15 Nov 2018 13:32:46 +0100)
PVE Manager Version: pve-manager/5.3-6/37b3c8df
INFO: starting new backup job: vzdump 100 --storage local --mode stop --mailto tomc@html.com.au --remove 0 --node pve01 --compress lzo
INFO: Starting Backup of VM 100 (qemu)
INFO: status = running
INFO: update VM 100: -lock backup
INFO: backup mode: stop
INFO: ionice priority: 7
INFO: VM Name: CCD-Data
INFO: include disk 'virtio0' 'local-zfs:vm-100-disk-0' 900G
INFO: stopping vm
INFO: creating archive '/var/lib/vz/dump/vzdump-qemu-100-2018_12_28-09_31_08.vma.lzo'
INFO: starting kvm to execute backup task
INFO: restarting vm
INFO: start failed: org.freedesktop.systemd1.UnitExists: Unit 100.scope already exists.
command 'qm start 100 --skiplock' failed: exit code 255
ERROR: Backup of VM 100 failed - start failed: org.freedesktop.systemd1.UnitExists: Unit 100.scope already exists.
INFO: Backup job finished with errors
TASK ERROR: job errors
=================================================================================================================
 
Hm - I tried to reproduce the issue, but couldn't.
Does this happen every time?
Do you have the qemu-guest-agent installed and activated in the config?

If you can reliably reproduce it, what's the output of `systemctl status -l 100.scope`?
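A quick way to check the agent option, if useful (just a suggestion; replace 100 with your VMID - `agent: 1` in the output means the guest agent is enabled in the VM config):

# show only the guest-agent setting from the VM config
qm config 100 | grep agent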
 
Hello Stoiko,

This happens randomly, depending on the VM.
For instance, our VM 100 is a big one with 40 GiB of memory, 16 CPUs and a 200 GB disk (this VM contains a Docker host and its Docker containers). It has the guest agent installed (I can see the IP on the Summary tab of the VM in the Proxmox GUI) and its backup fails randomly, around 30% of the time I think.
We have another VM like this one (also big, with Docker and the guest agent installed) and it fails at about the same 30% rate. So the problem is not specific to one VM.

If we take another, our VM 102, it is a much smaller one (2 GiB memory, 2 CPUs, 4 GB disk), and the failure is rarer (around 5% I would say). This VM has no guest agent (I can see the "No Guest Agent configured" message on the Summary tab of the VM in the Proxmox GUI). It runs pfSense (our virtual router/firewall).

So it seems the backup in stop mode can fail with or without the guest agent. It fails more often when the VM is big.

Snapshot mode is currently configured, so I cannot capture "systemctl status" right after a failure, but below is the output of that command as I just ran it:

---

VM100 screenshots:

VM100-with-guest-agent.png

Here is a copy of "systemctl status -l 100.scope" (our big VM with docker inside):

● 100.scope
Loaded: loaded (/run/systemd/transient/100.scope; transient; vendor preset: enabled)
Transient: yes
Active: active (running) since Tue 2018-11-13 10:52:32 CET; 1 months 14 days ago
Tasks: 24 (limit: 4915)
CPU: 2w 3d 15h 3min 41.878s
CGroup: /qemu.slice/100.scope
└─30627 /usr/bin/kvm -id 100 -name webserver-int -chardev socket,id=qmp,path=/var/run/qemu-server/100.qmp,server,nowait -mon chardev=qmp,mode=control -chardev socket,id=qmp-event,path=/var/run/qemu-server/100-event.qmp,server,nowait -mon chardev=qmp-event,mode=control -pidfile /var/run/qemu-server/100.pid -daemonize -smbios type=1,uuid=26f69ebd-ba2d-48bf-ab46-e0157646b0da -smp 16,sockets=2,cores=8,maxcpus=16 -nodefaults -boot menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg -vga qxl -vnc unix:/var/run/qemu-server/100.vnc,x509,password -cpu kvm64,+lahf_lm,+sep,+kvm_pv_unhalt,+kvm_pv_eoi,enforce -m 40960 -object memory-backend-ram,id=ram-node0,size=20480M -numa node,nodeid=0,cpus=0-7,memdev=ram-node0 -object memory-backend-ram,id=ram-node1,size=20480M -numa node,nodeid=1,cpus=8-15,memdev=ram-node1 -device pci-bridge,id=pci.2,chassis_nr=2,bus=pci.0,addr=0x1f -device pci-bridge,id=pci.1,chassis_nr=1,bus=pci.0,addr=0x1e -device piix3-usb-uhci,id=uhci,bus=pci.0,addr=0x1.0x2 -chardev socket,id=serial0,path=/var/run/qemu-server/100.serial0,server,nowait -device isa-serial,chardev=serial0 -chardev socket,path=/var/run/qemu-server/100.qga,server,nowait,id=qga0 -device virtio-serial,id=qga0,bus=pci.0,addr=0x8 -device virtserialport,chardev=qga0,name=org.qemu.guest_agent.0 -spice tls-port=61001,addr=127.0.0.1,tls-ciphers=HIGH,seamless-migration=on -device virtio-serial,id=spice,bus=pci.0,addr=0x9 -chardev spicevmc,id=vdagent,name=vdagent -device virtserialport,chardev=vdagent,name=com.redhat.spice.0 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3 -drive if=none,id=drive-ide2,media=cdrom,aio=threads -device ide-cd,bus=ide.1,unit=0,drive=drive-ide2,id=ide2,bootindex=200 -device virtio-scsi-pci,id=scsihw0,bus=pci.0,addr=0x5 -drive file=/dev/zvol/rpool/vm-100-disk-1,if=none,id=drive-scsi0,discard=on,format=raw,cache=none,aio=native,detect-zeroes=unmap -device scsi-hd,bus=scsihw0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0,id=scsi0,bootindex=100 -netdev type=tap,id=net0,ifname=tap100i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on -device virtio-net-pci,mac=F6:62:80:37:ED:2D,netdev=net0,bus=pci.0,addr=0x12,id=net0,bootindex=300

---

VM102 screenshots:

VM102-without-guest-agent.png

Here is a copy of "systemctl status -l 102.scope" (our small VM with pfsense inside):

● 102.scope
Loaded: loaded (/run/systemd/transient/102.scope; transient; vendor preset: enabled)
Transient: yes
Active: active (running) since Tue 2018-11-13 04:37:47 CET; 1 months 14 days ago
Tasks: 8 (limit: 4915)
CPU: 2d 14h 15min 8.210s
CGroup: /qemu.slice/102.scope
└─31555 /usr/bin/kvm -id 102 -name pfsense -chardev socket,id=qmp,path=/var/run/qemu-server/102.qmp,server,nowait -mon chardev=qmp,mode=control -chardev socket,id=qmp-event,path=/var/run/qemu-server/102-event.qmp,server,nowait -mon chardev=qmp-event,mode=control -pidfile /var/run/qemu-server/102.pid -daemonize -smbios type=1,uuid=cadf84af-7ec7-4d6a-a0b6-fac36e5a2a00 -smp 2,sockets=1,cores=2,maxcpus=2 -nodefaults -boot menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg -vga std -vnc unix:/var/run/qemu-server/102.vnc,x509,password -cpu kvm64,+lahf_lm,+sep,+kvm_pv_unhalt,+kvm_pv_eoi,enforce -m 2048 -device pci-bridge,id=pci.1,chassis_nr=1,bus=pci.0,addr=0x1e -device pci-bridge,id=pci.2,chassis_nr=2,bus=pci.0,addr=0x1f -device piix3-usb-uhci,id=uhci,bus=pci.0,addr=0x1.0x2 -device usb-tablet,id=tablet,bus=uhci.0,port=1 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3 -drive if=none,id=drive-ide2,media=cdrom,aio=threads -device ide-cd,bus=ide.1,unit=0,drive=drive-ide2,id=ide2,bootindex=200 -device virtio-scsi-pci,id=scsihw0,bus=pci.0,addr=0x5 -drive file=/dev/zvol/rpool/vm-102-disk-1,if=none,id=drive-scsi0,format=raw,cache=none,aio=native,detect-zeroes=on -device scsi-hd,bus=scsihw0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0,id=scsi0,bootindex=100 -netdev type=tap,id=net0,ifname=tap102i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on -device virtio-net-pci,mac=02:00:00:56:9e:ac,netdev=net0,bus=pci.0,addr=0x12,id=net0,bootindex=300 -netdev type=tap,id=net1,ifname=tap102i1,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on -device virtio-net-pci,mac=06:CC:3B:43:E8:3F,netdev=net1,bus=pci.0,addr=0x13,id=net1,bootindex=301 -S

---

I hope this helps; feel free to ask any more questions.

Thanks.
 
Thanks for the update and the information!

I had a hunch that it might be caused by the VM not being able to shut down in a timely fashion, and tried reproducing that - but this yields a different error (`VM quit/powerdown failed - got timeout`).
My next guess would be that it's related to the VM config (vzdump in stop mode shuts down the guest, starts a kvm process which does the block jobs for the backup, and starts the VM in parallel) - maybe the start for the backup takes too long...
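If you can catch it live, one rough way to watch the scope during that restart window would be something like this (only a sketch - adjust the VMID and storage name to your setup):

# run a one-off stop-mode backup in the background...
vzdump 100 --mode stop --storage backup-daily --compress lzo &
# ...and watch the transient systemd scope while vzdump stops and restarts the VM;
# the error suggests the old 100.scope is still registered when the restart is attempted
watch -n 1 'systemctl status 100.scope --no-pager | head -n 5'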

Please open a bug at https://bugzilla.proxmox.com (and refer to this thread)!
Could you please also post the config files of the 2 VMs? (anonymized if needed)
 
Stoiko,

Yes, the stopping/starting delay could be the cause, especially for our big VM with its many Docker containers. Maybe the system needs to wait until all the containers are started... (there are many, around 50-75 containers).

OK, I will open a bug in Bugzilla. Could you just tell me how to get the config file, please?

Thanks.
 
Thanks!
`qm config <VMID>` on the command line (the actual file, including snapshots, is in `/etc/pve/qemu-server/<VMID>.conf`).
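For example (100 and 102 stand in for your VMIDs; the sed line is just one way to mask MAC addresses before posting the files publicly):

# dump the live configs to files
qm config 100 > vm100.conf
qm config 102 > vm102.conf
# optional: replace any MAC addresses before sharing
sed -i -E 's/([0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}/xx:xx:xx:xx:xx:xx/g' vm100.conf vm102.conf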
 
