ERROR: Backup failed - start failed Unit already exists

cefek

Member
Mar 18, 2018
Hello,

I have been trying to find a solution to this annoying problem for over six months. I have updated both servers (one has a community subscription and one is without a subscription), and it happens on both.

Whenever I run an automated or manual backup of my VMs (I don't use containers) on the latest Proxmox (and on several versions before it), there is roughly a 50/50 chance of hitting this error:

ERROR: Backup of VM 100 failed - start failed: org.freedesktop.systemd1.UnitExists: Unit 100.scope already exists.

and no backup is made.

The VM number and scope name change accordingly, of course.

Sometimes it is then also impossible to start the VM, either from the web interface or from the command line.

I have read all the forum threads on this issue, and it seems to me the only "fix" is to upgrade Proxmox - which I have done countless times. Still the error persists.

I am doing 'stop' mode backups because I care about data integrity.
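
For the record, this is the shape of the job that fails for me (a minimal example; the storage name and compression flags are just my settings):

Code:
vzdump 100 --mode stop --storage local --compress gzip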

Is there ANY solution? I am really tearing my hair out...
 
please post your:

> pveversion -v

and

> qm config VMID
 
Here it goes.

Code:
# pveversion -v
proxmox-ve: 5.1-42 (running kernel: 4.13.13-6-pve)
pve-manager: 5.1-46 (running version: 5.1-46/ae8241d4)
pve-kernel-4.13: 5.1-42
pve-kernel-4.13.13-6-pve: 4.13.13-42
pve-kernel-4.13.13-5-pve: 4.13.13-38
pve-kernel-4.13.13-2-pve: 4.13.13-33
pve-kernel-4.13.13-1-pve: 4.13.13-31
pve-kernel-4.4.95-1-pve: 4.4.95-99
pve-kernel-4.4.35-1-pve: 4.4.35-77
corosync: 2.4.2-pve3
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.0-8
libpve-common-perl: 5.0-28
libpve-guest-common-perl: 2.0-14
libpve-http-server-perl: 2.0-8
libpve-storage-perl: 5.0-17
libqb0: 1.0.1-1
lvm2: 2.02.168-pve6
lxc-pve: 2.1.1-3
lxcfs: 2.0.8-2
novnc-pve: 0.6-4
proxmox-widget-toolkit: 1.0-11
pve-cluster: 5.0-20
pve-container: 2.0-19
pve-docs: 5.1-16
pve-firewall: 3.0-5
pve-firmware: 2.0-4
pve-ha-manager: 2.0-5
pve-i18n: 1.0-4
pve-libspice-server1: 0.12.8-3
pve-qemu-kvm: 2.9.1-9
pve-xtermjs: 1.0-2
qemu-server: 5.0-22
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.6-pve1~bpo9

Code:
# qm config 100
balloon: 0
boot: cdn
bootdisk: scsi0
cores: 2
cpu: host
description: Poczta ponad wszystko
ide2: none,media=cdrom
memory: 16385
name: pompon
net0: virtio=C2:94:25:F2:CE:FE,bridge=vmbr0
numa: 0
onboot: 1
ostype: l26
protection: 1
scsi0: local-zfs:vm-100-disk-1,discard=on,size=520G
scsihw: virtio-scsi-pci
smbios1: uuid=f5a069dc-d6a0-4530-bde3-f8ea4299d779
sockets: 2

From what I am reading, qemu-server 5.0-24 should fix this, but it is not yet available for my Proxmox servers.
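
In the meantime, a quick way to check which qemu-server version the configured repositories currently offer (standard Debian tooling, nothing Proxmox-specific):

Code:
apt-get update
apt-cache policy qemu-server   # compare installed vs. candidate version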
 
So? Nothing new here? No new packages, nothing changed, and backups are still unreliable. Care to advise on next steps to solve this issue?
 
Any update, please? Two weeks have gone by since my last post, and the issue still persists.
 
> Any update, please? Two weeks have gone by since my last post, and the issue still persists.

Which qemu-server package do you run now? The latest is qemu-server 5.0-25.

(Check your version with "pveversion -v".)
 
> Which qemu-server package do you run now? The latest is qemu-server 5.0-25.
>
> (Check your version with "pveversion -v".)

I am an update junkie, always on the edge.

Code:
proxmox-ve: 5.1-42 (running kernel: 4.13.16-2-pve)
pve-manager: 5.1-51 (running version: 5.1-51/96be5354)
qemu-server: 5.0-25

I have just rebooted; we will see what happens on the next scheduled backup. Manual backups were never an issue.
 
Yes, I can confirm: qemu-server 5.0-25 did not change this. Backups can still exit with the error and the machine stays stopped (one has to manually run "systemctl kill $vmid.scope" and then "qm start $vmid"); this really sucks.
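
Spelled out, the manual recovery looks like this (note that qm start takes the bare VMID, not the scope name):

Code:
VMID=100                       # substitute the affected VM's ID
systemctl kill ${VMID}.scope   # remove the stale transient scope unit
qm start ${VMID}               # start the VM again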
 
> Yes, I can confirm: qemu-server 5.0-25 did not change this. Backups can still exit with the error and the machine stays stopped (one has to manually run "systemctl kill $vmid.scope" and then "qm start $vmid"); this really sucks.
Unfortunately it did not help in my case. I had to restart the whole hypervisor.
Quite annoying in a production environment with a couple of VMs running.
 
> Yes, I can confirm: qemu-server 5.0-25 did not change this. Backups can still exit with the error and the machine stays stopped (one has to manually run "systemctl kill $vmid.scope" and then "qm start $vmid"); this really sucks.
Next time this happens, can you please post the output of `systemctl status $vmid.scope` before doing the systemctl kill, so we can see which processes are still running in the scope?
 
OK - I have had to disable my temporary "fix", or rather workaround, which at least made the machine start correctly despite the above error (the error still cancels the backup). It was to change QemuServer.pm line 4822:

Code:
run_command(['/bin/systemctl', 'stop', "$vmid.scope"],

to

Code:
run_command(['/bin/systemctl', 'kill', "$vmid.scope"],

The above change still did not produce a backup and gave the same error message, but at least the machine was restarted, so the services came back online.

I have now reverted my change ("kill") back to the stock code ("stop"), and I need to wait a couple of days for the error to happen again.

Will post all the data then.
 
Oh, and BTW: I think this error that prevents the backup from running must have something to do with starting the machine's filesystem as a container so it can be backed up, and might therefore have something to do with LXC. I don't know how Proxmox's stop-mode backup works for KVM, but I have a strong feeling it is related to the way the disks are mounted for backup purposes (given that the machine is brought back online after the initial shutdown and stays online even while the backup is running). It is a neat way to do it and I appreciate it; there must just be some kind of race condition.
 
Ok, so I think it happened again:

This is the output of the backup task that is scheduled to run weekly:

Code:
INFO: starting new backup job: vzdump 200 --storage local --compress gzip --quiet 1 --mode stop --mailnotification always
INFO: Starting Backup of VM 200 (qemu)
INFO: status = running
INFO: update VM 200: -lock backup
INFO: backup mode: stop
INFO: ionice priority: 7
INFO: VM Name: brlnt
INFO: include disk 'scsi0' 'local-zfs:vm-200-disk-1' 640G
INFO: stopping vm
INFO: creating archive '/var/lib/vz/dump/vzdump-qemu-200-2018_05_07-02_20_02.vma.gz'
INFO: starting kvm to execute backup task
INFO: restarting vm
INFO: start failed: org.freedesktop.systemd1.UnitExists: Unit 200.scope already exists.
command 'qm start 200 --skiplock' failed: exit code 255
ERROR: Backup of VM 200 failed - start failed: org.freedesktop.systemd1.UnitExists: Unit 200.scope already exists.
INFO: Backup job finished with errors

TASK ERROR: job errors

The machine is stopped now. This is the 'systemctl status 200.scope':

Code:
root@machine:~# systemctl status 200.scope
● 200.scope
   Loaded: loaded (/run/systemd/transient/200.scope; transient; vendor preset: enabled)
Transient: yes
   Active: inactive (dead) since Mon 2018-05-07 02:20:07 CEST; 6h ago
      CPU: 5h 37min 41.299s
   CGroup: /qemu.slice/200.scope
          └─1167 gpg-agent --homedir /root/.gnupg --use-standard-socket --daemon
Apr 30 02:20:08 machine systemd[1]: Started 200.scope.
Apr 30 03:13:36 machine vzdump[1953]: INFO: Finished Backup of VM 200 (00:53:34)
Apr 30 03:13:36 machine vzdump[1953]: INFO: Backup job finished successfully
May 07 02:20:07 machine systemd[1]: Stopped 200.scope.

The 'Apr 30' entries are from the last successful backup of this VM, not the current one, obviously; the current run is from the morning of May 07 (around 2-3 AM).
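
As a side note, here is one way to double-check which processes systemd still accounts to that scope (hypothetical commands on my part, not output I captured):

Code:
# List processes systemd still tracks in the scope's cgroup
systemd-cgls /qemu.slice/200.scope
# Or read the cgroup directly (cgroup v1 path, as used on PVE 5.x)
cat /sys/fs/cgroup/systemd/qemu.slice/200.scope/cgroup.procs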

This is what happens when I try to start the machine:

Code:
root@machine:~# qm start 200
start failed: org.freedesktop.systemd1.UnitExists: Unit 200.scope already exists.

and this is a mandatory version report:

Code:
root@machine:~# pveversion -v
proxmox-ve: 5.1-43 (running kernel: 4.13.13-6-pve)
pve-manager: 5.1-52 (running version: 5.1-52/ba597a64)
pve-kernel-4.13: 5.1-44
pve-kernel-4.15: 5.1-3
pve-kernel-4.15.15-1-pve: 4.15.15-6
pve-kernel-4.13.16-2-pve: 4.13.16-47
pve-kernel-4.13.16-1-pve: 4.13.16-46
pve-kernel-4.13.13-6-pve: 4.13.13-42
pve-kernel-4.13.13-5-pve: 4.13.13-38
pve-kernel-4.13.13-2-pve: 4.13.13-33
pve-kernel-4.13.13-1-pve: 4.13.13-31
pve-kernel-4.4.95-1-pve: 4.4.95-99
pve-kernel-4.4.35-1-pve: 4.4.35-77
corosync: 2.4.2-pve5
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.0-8
libpve-apiclient-perl: 2.0-4
libpve-common-perl: 5.0-30
libpve-guest-common-perl: 2.0-15
libpve-http-server-perl: 2.0-8
libpve-storage-perl: 5.0-19
libqb0: 1.0.1-1
lvm2: 2.02.168-pve6
lxc-pve: 3.0.0-2
lxcfs: 3.0.0-1
novnc-pve: 0.6-4
proxmox-widget-toolkit: 1.0-15
pve-cluster: 5.0-26
pve-container: 2.0-22
pve-docs: 5.1-17
pve-firewall: 3.0-8
pve-firmware: 2.0-4
pve-ha-manager: 2.0-5
pve-i18n: 1.0-4
pve-libspice-server1: 0.12.8-3
pve-qemu-kvm: 2.11.1-5
pve-xtermjs: 1.0-3
qemu-server: 5.0-25
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.7-pve1~bpo9

Please try to fix this: it makes automated backups unreliable (the opposite of what a backup should be) and leaves the machine offline until a 'systemctl kill 200.scope' command is issued.

This is a very persistent bug and the current fixes are not working at all; please see my previous post for hints on what could be causing it.
 
> I don't know how Proxmox's stop-mode backup works for KVM
The machine is stopped to put it in a consistent state (OS shut down, disks cleanly unmounted, etc.), then started again, and the QEMU process itself performs the backup while letting the guest run; blocks the guest wants to write to are backed up early so the guest is not stalled for too long.
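
Roughly, the sequence looks like this (a sketch reconstructed from the task log above, not the actual vzdump code):

Code:
# Rough shape of a stop-mode VM backup (sketch, using VM 200 from the log)
qm shutdown 200            # "INFO: stopping vm" - consistent on-disk state
# vzdump starts a fresh kvm process that streams the disks into the archive
# while the guest runs; blocks the guest is about to overwrite are saved first
qm start 200 --skiplock    # "INFO: restarting vm" - the step that fails when
                           # the stale 200.scope unit is still around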

Code:
(...)
CGroup: /qemu.slice/200.scope
└─1167 gpg-agent --homedir /root/.gnupg --use-standard-socket --daemon
Well that's not supposed to be in there... gonna have to check where that comes from.
 
> Oh, and BTW: I think this error that prevents the backup from running must have something to do with starting the machine's filesystem as a container so it can be backed up, and might therefore have something to do with LXC. I don't know how Proxmox's stop-mode backup works for KVM, but I have a strong feeling it is related to the way the disks are mounted for backup purposes (given that the machine is brought back online after the initial shutdown and stays online even while the backup is running). It is a neat way to do it and I appreciate it; there must just be some kind of race condition.

I can assure you I run NO LXC at all, only VMs, and I was getting the same error. I recently applied the latest updates and restarted the host. The first backup went OK, so I am waiting for the next one today to make sure the issue is gone.
 
I installed GPG a couple of months ago to encrypt backup files that are later sent to external storage via a vzdump script. While I guess it is possible that gpg-agent launches whenever the backup process spawns, I would like to know how to disable it; there are some files in /usr/lib/systemd/user that supposedly launch gpg-agent.

Moreiq, do you also use GPG on the host machine?

The files mentioned above (gpg-agent.service and gpg-agent*.socket) are also the only files referenced from the directory /usr/lib/systemd/user/sockets.target.wants.

Should I just delete them from there? I am going to do that and then check what happens. This gpg-agent launch is clearly done by a userland unit, so maybe the agent gets stuck in the scope as it daemonizes.
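
If deleting packaged files feels too brutal, masking the user units might be a cleaner alternative (an assumption on my part - I have not tried this); masks live in the user's own configuration, so they should survive package updates:

Code:
# Run as the user whose session spawns the agent (root in this case);
# masking points the units at /dev/null instead of removing packaged files
systemctl --user mask gpg-agent.socket gpg-agent-extra.socket \
    gpg-agent-browser.socket gpg-agent-ssh.socket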
 
I have just deleted the sockets.target.wants directory (I hope it won't be reinstalled with GnuPG updates) and rebooted the less important node. We will see what happens.
 
