TASK ERROR: start failed: org.freedesktop.systemd1.UnitExists: Unit scope already exists.

encore

Member
May 4, 2018
95
0
6
31
Hi,

We have a big problem with our Proxmox cluster. The cluster consists of 25 nodes with 10-50 servers each (LXC & KVM).
KVM servers freeze 20-30 times a day. When that happens, neither the console nor the server itself is reachable.
When we stop the server, Proxmox displays "stopped".
If we start it again, we get this error message:
Total translation table size: 0
Total rockridge attributes bytes: 417
Total directory bytes: 0
Path table size(bytes): 10
Max brk space used 0
178 extents written (0 MB)
TASK ERROR: start failed: org.freedesktop.systemd1.UnitExists: Unit 1098559.scope already exists.
ps -ef | grep VMID
still shows the VM process, which cannot be stopped with kill -9.
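
A process that survives kill -9 is often in uninterruptible sleep (state D), blocked inside the kernel, or is already a zombie (state Z). A minimal sketch for checking which one applies, assuming the PID of the stuck kvm process is known from the ps output; the helper name proc_state and the PID variable are made up for illustration:

```shell
# Hypothetical helper: print the one-letter process state found in
# /proc/<pid>/status-style text read from stdin (R, S, D, Z, T, ...).
proc_state() {
    awk '/^State:/ { print $2; exit }'
}

# Usage (PID is a placeholder for the stuck kvm process):
#   proc_state < /proc/"$PID"/status
# To see whether systemd still tracks the scope from the error message:
#   systemctl status 1098559.scope
```

A D would mean the process is waiting inside the kernel, commonly on I/O, and will not act on any signal, including SIGKILL, until that wait ends.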

The nodes are up to date with the
deb http://download.proxmox.com/debian/pve stretch pve-no-subscription
repo.
 

dietmar

Proxmox Staff Member
Staff member
Apr 28, 2005
16,503
320
83
Austria
www.proxmox.com
Maybe a problem with your storage? What kind of storage do you use? In case of failure, please check the storage status with

# pvesm status
 

encore

Each node uses its own storage (just an SSD with ext4 as directory storage). The storages look fine. The issue happens on ALL nodes, seemingly at random. It happens on nodes with 5 VMs as well as on nodes with 30 VMs, so it is not overprovisioning.
VM disks are qcow2 images attached via VirtIO SCSI.

Any idea how to debug this further? Are there any log files that might give more details about the issue?

We are using Cloud-Init, so "serial port socket" is enabled on all VMs. Could that cause the issue?
 

encore

These "stuck" VMs don't respond to the guest tools and show a memory usage of >90%.
Last night I left some test servers running with absolutely no operations or load. Some of them have the same issue this morning.

Some of them had Windows installed, some Linux.
 

mira

Proxmox Staff Member
Staff member
Aug 1, 2018
357
28
28
Please post the output of 'pveversion -v', your storage config (/etc/pve/storage.cfg) and the config of a VM that exhibits this problem ('qm config <VMID>')
 

encore

Global storage.cfg:
dir: local
path /var/lib/vz
content vztmpl,iso
maxfiles 50
shared 0

lvmthin: local-lvm
thinpool data
vgname pve
content none

dir: captive002-lxcstor-01
disable
path /mnt/captive002-lxcstor-01/
content images,vztmpl,rootdir,backup
maxfiles 300
nodes captive002-77015
shared 1

dir: captive003-lxcstor-01
disable
path /mnt/captive003-lxcstor-01
content vztmpl,images,backup,rootdir
maxfiles 300
nodes captive003-77030
shared 1

dir: captive004-lxcstor-01
disable
path /mnt/captive004-lxcstor-01
content backup,rootdir,images,vztmpl
maxfiles 300
nodes captive004-77028
shared 0

dir: captive002-lxcstor-02
disable
path /mnt/captive002-lxcstor-02/
content rootdir,backup,vztmpl,images
maxfiles 300
nodes captive002-77015
shared 0

dir: captive003-lxcstor-02-LOCAL
path /mnt/captive003-lxcstor-02-LOCAL/
content rootdir,backup,vztmpl,images
maxfiles 300
nodes captive003-77030
shared 0

dir: captive004-lxcstor-02-LOCAL
path /mnt/captive004-lxcstor-02-LOCAL
content vztmpl,images,rootdir,backup
maxfiles 300
nodes captive004-77028
shared 0

dir: captive005-lxcstor-01-LOCAL
path /mnt/captive005-lxcstor-01-LOCAL
content backup,rootdir,images,vztmpl
maxfiles 300
nodes captive005-74001
shared 0

dir: captive005-lxcstor-02-LOCAL
path /mnt/captive005-lxcstor-02-LOCAL
content vztmpl,images,backup,rootdir
maxfiles 300
nodes captive005-74001
shared 0

dir: captive006-lxcstor-01-LOCAL
path /mnt/captive006-lxcstor-01-local
content images,vztmpl,backup,rootdir
maxfiles 300
nodes captive006-73029
shared 0

dir: captive007-lxcstor-01-LOCAL
path /mnt/captive007-lxcstor-01-local
content rootdir,backup,images,vztmpl
maxfiles 300
nodes captive003-77030,captive007-73030
shared 0

dir: captive008-lxcstor-01-LOCAL
path /mnt/captive008-lxcstor-01-LOCAL
content images,vztmpl,rootdir,backup
maxfiles 300
nodes captive008-74005
shared 0

dir: captive009-lxcstor-01-LOCAL
path /mnt/captive009-lxcstor-01-LOCAL
content backup,rootdir,vztmpl,images
maxfiles 300
nodes captive009-77014
shared 0

dir: captive009-lxcstor-02-LOCAL
path /mnt/captive009-lxcstor-02-LOCAL
content rootdir,backup,vztmpl,images
maxfiles 300
nodes captive009-77014
shared 0

dir: captive011-lxcstor-01-LOCAL
path /mnt/captive011-lxcstor-01-LOCAL
content backup,rootdir,vztmpl,images
maxfiles 300
nodes captive011-74007
shared 0

dir: captive011-lxcstor-02-LOCAL
path /mnt/captive011-lxcstor-02-LOCA
content images,vztmpl,rootdir,backup
maxfiles 300
nodes captive011-74007
shared 0

dir: captive001-lxcstor-01-localLV
path /mnt/captive001-lxcstor-01-localLV
content images,vztmpl,rootdir,backup
maxfiles 1
nodes captive001-72001-bl12
shared 0

dir: captive006-lxcstor-01-localLV
path /mnt/captive006-lxcstor-01-localLV
content backup,rootdir,images,vztmpl
maxfiles 1
nodes captive006-72011-bl09
shared 0

dir: captive002-lxcstor-01-LOCAL
path /mnt/captive002-lxcstor-01-LOCAL
content vztmpl,images,backup,rootdir
maxfiles 300
nodes captive002-77015
shared 0

dir: captive002-lxcstor-02-LOCAL
path /mnt/captive002-lxcstor-02-LOCAL
content images,vztmpl,rootdir,backup
maxfiles 100
nodes captive002-77015
shared 1

dir: captive004-lxcstor-01-LOCAL
path /mnt/captive004-lxcstor-01-LOCAL
content vztmpl,images,rootdir,backup
maxfiles 300
nodes captive004-77028
shared 0

dir: captive003-lxcstor-01-LOCAL
path /mnt/captive003-lxcstor-01-LOCAL
content backup,rootdir,vztmpl,images
maxfiles 100
nodes captive003-77030
shared 0

nfs: imageserver
export /var/pve
path /mnt/pve/imageserver
server 10.10.10.100
content iso,vztmpl
maxfiles 100
options vers=3

dir: imageserver-clones
disable
path /home/imageserver
content images
shared 1

nfs: solusmigrates
export /var/solus
path /mnt/pve/solusmigrates
server 10.10.10.100
content images
options vers=3

dir: captive007-lxcstor-01-localLV
path /mnt/captive007-lxcstor-01-localLV
content images,vztmpl,rootdir,backup
maxfiles 99
nodes captive007-72001-bl11
shared 0

dir: captive012-lxcstor01-localLV
path /mnt/captive012-lxcstor-01-localLV
content rootdir,backup,images,vztmpl
maxfiles 99
nodes captive012-72011-bl06
shared 0

dir: captive013-lxcstor01-localLV
path /mnt/captive013-lxcstor01-localLV
content backup,rootdir,vztmpl,images
maxfiles 99
nodes captive013-74050-bl08
shared 0

dir: captive014-lxcstor-01-localLV
path /mnt/captive014-lxcstor-01-localLV
content backup,rootdir,vztmpl,images
maxfiles 99
nodes captive014-72001-bl15
shared 0

dir: bondbabe001-lxcstor01-localLV
path /mnt/bondbabe001-lxcstor01-localLV
content rootdir,backup,vztmpl,images
maxfiles 99
nodes bondbabe001-74050-bl06
shared 0

dir: bondsir001-lxcstor01-localLV
path /mnt/bondsir001-lxcstor01-localLV
content rootdir,backup,vztmpl,images
maxfiles 99
nodes bondsir001-72011-bl14
shared 0

dir: captive015-lxcstor01-localLV
path /mnt/captive015-lxcstor01-localLV
content backup,rootdir,vztmpl,images
maxfiles 99
nodes captive015-74050-bl05
shared 0

dir: captive016-lxcstor01-localLV
path /mnt/captive016-lxcstor01-localLV/
content vztmpl,images,rootdir,backup
maxfiles 99
nodes captive016-72001-bl01
shared 0

dir: captive017-lxcstor01-localLV
path /mnt/captive017-lxcstor01-localLV
content backup,rootdir,images,vztmpl
maxfiles 99
nodes captive017-74050-bl09
shared 0

dir: bondsir002-lxcstor01-localLV
path /mnt/bondsir002-lxcstor01-localLV
content images,vztmpl,rootdir,backup
maxfiles 99
nodes bondsir002-72001-bl08
shared 0

dir: captive018-lxcstor01-localLV
path /mnt/captive018-lxcstor01-localLV
content images,vztmpl,rootdir,backup
maxfiles 0
nodes captive018-72001-bl04
shared 0

dir: bondbabe002-lxcstor01-localLV
path /mnt/bondbabe002-lxcstor01-localLV
content images,vztmpl,rootdir,backup
maxfiles 99
nodes bondbabe002-72011-bl12
shared 0

dir: captive019-lxcstor01-localLV
path /mnt/captive019-lxcstor01-localLV
content rootdir,backup,vztmpl,images
maxfiles 99
nodes captive019-74050-bl12
shared 0

dir: bondsir003-lxcstor01-localLV
path /mnt/bondsir003-lxcstor01-localLV
content vztmpl,images,rootdir,backup
maxfiles 99
nodes bondsir003-74050-bl10
shared 0

dir: captive010-lxcstor01-localLV
path /mnt/captive010-lxcstor01-localLV
content backup,rootdir,vztmpl,images
maxfiles 99
nodes captive010-74050-bl14
shared 0

dir: bondsir004-lxcstor01-localLV
path /mnt/bondsir004-lxcstor01-localLV
content backup,rootdir,images,vztmpl
maxfiles 99
nodes bondsir004-74050-bl11
shared 0

dir: captive020-lxcstor01-localLV
path /mnt/captive020-lxcstor01-localLV
content backup,rootdir,vztmpl,images
maxfiles 99
nodes captive020-74050-bl13
shared 0

dir: bondsir005-lxcstor01-localLV
path /mnt/bondsir005-lxcstor01-localLV
content backup,rootdir,images,vztmpl
maxfiles 99
nodes bondsir005-74050-bl16
shared 0

dir: captive021-lxcstor01-localLV
path /mnt/captive021-lxcstor01-localLV
content backup,rootdir,images,vztmpl
maxfiles 99
nodes captive021-74050-bl15-rev2
shared 0

dir: captive022-lxcstor01-localLV
path /mnt/captive022-lxcstor01-localLV
content backup,rootdir,images,vztmpl
maxfiles 99
nodes captive022-79001-bl01
shared 0

Node Bondsir003:
proxmox-ve: 5.4-1 (running kernel: 4.15.18-12-pve)
pve-manager: 5.4-3 (running version: 5.4-3/0a6eaa62)
pve-kernel-4.15: 5.3-3
pve-kernel-4.15.18-12-pve: 4.15.18-35
pve-kernel-4.13.13-2-pve: 4.13.13-33
corosync: 2.4.4-pve1
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.1-8
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-50
libpve-guest-common-perl: 2.0-20
libpve-http-server-perl: 2.0-13
libpve-storage-perl: 5.0-41
libqb0: 1.0.3-1~bpo9
lvm2: 2.02.168-pve6
lxc-pve: 3.1.0-3
lxcfs: 3.0.3-pve1
novnc-pve: 1.0.0-3
proxmox-widget-toolkit: 1.0-25
pve-cluster: 5.0-36
pve-container: 2.0-37
pve-docs: 5.4-2
pve-edk2-firmware: 1.20190312-1
pve-firewall: 3.0-19
pve-firmware: 2.0-6
pve-ha-manager: 2.0-9
pve-i18n: 1.1-4
pve-libspice-server1: 0.14.1-2
pve-qemu-kvm: 2.12.1-3
pve-xtermjs: 3.12.0-1
qemu-server: 5.0-50
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.13-pve1~bpo2

VM config:
root@bondsir003-74050-bl10:~# qm config 1083700
agent: 1
balloon: 0
bootdisk: scsi0
cipassword: **********
ciuser: root
cores: 4
cpu: host
cpulimit: 4
ide2: bondsir003-lxcstor01-localLV:1083700/vm-1083700-cloudinit.qcow2,media=cdrom
ipconfig0: gw=134.255.233.1,ip=134.255.233.248/24
memory: 8192
name: rs-zap411271-1.zap-srv.com
net0: e1000=3A:CD:2F:82:9E:82,bridge=vmbr0,rate=12.5
numa: 1
onboot: 1
ostype: l26
scsi0: bondsir003-lxcstor01-localLV:1083700/vm-1083700-disk-0.qcow2,discard=on,format=qcow2,size=60G,ssd=1
scsihw: virtio-scsi-pci
serial0: socket
smbios1: uuid=f127d443-0be0-447a-ba9e-9e47b117cec4
vmgenid: 6eca759a-b61c-4b8e-993d-8b7ecc6cf5ae
This is only one node and VM.
Currently I can reproduce the issue on 8 different nodes and on 20 VMs total.

LXC containers are not affected btw.
 

mira

Could you also post the journal? ('journalctl -b' everything since the last boot) Perhaps it contains some more information.
 

dthompson

Member
Nov 23, 2011
43
0
6
Canada
www.digitaltransitions.ca
I am getting this exact same issue as well. This is after I updated my cluster to the latest version of Proxmox (5.4-3) last night.
This morning I found one VM so far that I couldn't get started.

I was able to get the VM going again by removing it from HA, migrating it to another node in the cluster and starting it there. It's now up and running. I am a little nervous, though, that this might happen on another VM.

I haven't rebooted any of the nodes in the cluster yet, as I wasn't prompted to restart them after the latest update, but perhaps that is the issue at play here.

My error is the same as encore's:
TASK ERROR: start failed: org.freedesktop.systemd1.UnitExists: Unit 116.scope already exists.
 

encore

We have been facing this issue for months now, but it has become much worse as our cluster grows (we are currently moving all VMs from SolusVM to Proxmox, but have paused that process because of these issues).

I rebooted one node with only 5 VMs. I had disabled ksmtuning beforehand because of strange ballooning issues.
Directly after the reboot (the journalctl -b output is now very small, < 1 MB; I will attach it), one VM fails to start with "scope unit already exists". The VM ID here is 1098575.
It is a different node, btw.

Here is journal2.txt filtered with grep 1098575:
root@bondsir005-74050-bl16:~# cat journal2.txt | grep 1098575
Apr 19 12:54:44 bondsir005-74050-bl16 pvesh[1899]: Starting VM 1098575
Apr 19 12:54:44 bondsir005-74050-bl16 pve-guests[2068]: start VM 1098575: UPID:bondsir005-74050-bl16:00000814:000009CD:5CB9A8F4:qmstart:1098575:root@pam:
Apr 19 12:54:44 bondsir005-74050-bl16 pve-guests[1963]: <root@pam> starting task UPID:bondsir005-74050-bl16:00000814:000009CD:5CB9A8F4:qmstart:1098575:root@pam:
Apr 19 12:54:44 bondsir005-74050-bl16 systemd[1]: Started 1098575.scope.
Apr 19 12:54:44 bondsir005-74050-bl16 systemd-udevd[2086]: Could not generate persistent MAC address for tap1098575i0: No such file or directory
Apr 19 12:54:45 bondsir005-74050-bl16 kernel: device tap1098575i0 entered promiscuous mode
Apr 19 12:54:45 bondsir005-74050-bl16 kernel: vmbr0: port 3(tap1098575i0) entered blocking state
Apr 19 12:54:45 bondsir005-74050-bl16 kernel: vmbr0: port 3(tap1098575i0) entered disabled state
Apr 19 12:54:45 bondsir005-74050-bl16 kernel: vmbr0: port 3(tap1098575i0) entered blocking state
Apr 19 12:54:45 bondsir005-74050-bl16 kernel: vmbr0: port 3(tap1098575i0) entered forwarding state
Apr 19 13:12:33 bondsir005-74050-bl16 qm[8037]: VM 1098575 qmp command failed - VM 1098575 qmp command 'guest-ping' failed - got timeout
Apr 19 13:12:35 bondsir005-74050-bl16 pvedaemon[8066]: stop VM 1098575: UPID:bondsir005-74050-bl16:00001F82:0001AC52:5CB9AD23:qmstop:1098575:zap@pve:
Apr 19 13:12:35 bondsir005-74050-bl16 pvedaemon[1771]: <zap@pve> starting task UPID:bondsir005-74050-bl16:00001F82:0001AC52:5CB9AD23:qmstop:1098575:zap@pve:
Apr 19 13:12:38 bondsir005-74050-bl16 pvedaemon[8066]: VM 1098575 qmp command failed - VM 1098575 qmp command 'quit' failed - unable to connect to VM 1098575 qmp socket - timeout after 31 retries
Apr 19 13:12:49 bondsir005-74050-bl16 pvedaemon[1771]: <zap@pve> end task UPID:bondsir005-74050-bl16:00001F82:0001AC52:5CB9AD23:qmstop:1098575:zap@pve: OK
Apr 19 13:13:08 bondsir005-74050-bl16 pvedaemon[1771]: <zap@pve> update VM 1098575: -balloon 0 -delete shares
Apr 19 13:13:10 bondsir005-74050-bl16 pvedaemon[1771]: <zap@pve> starting task UPID:bondsir005-74050-bl16:00002005:0001BA23:5CB9AD46:qmstart:1098575:zap@pve:
Apr 19 13:13:10 bondsir005-74050-bl16 pvedaemon[8197]: start VM 1098575: UPID:bondsir005-74050-bl16:00002005:0001BA23:5CB9AD46:qmstart:1098575:zap@pve:
Apr 19 13:13:10 bondsir005-74050-bl16 systemd[1]: Stopped 1098575.scope.
Apr 19 13:13:11 bondsir005-74050-bl16 pvedaemon[8197]: start failed: org.freedesktop.systemd1.UnitExists: Unit 1098575.scope already exists.
Apr 19 13:13:11 bondsir005-74050-bl16 pvedaemon[1771]: <zap@pve> end task UPID:bondsir005-74050-bl16:00002005:0001BA23:5CB9AD46:qmstart:1098575:zap@pve: start failed: org.freedesktop.systemd1.UnitExists: Unit 1098575.scope already exists.
Apr 19 13:13:25 bondsir005-74050-bl16 pvedaemon[1772]: <zap@pve> update VM 1098575: -balloon 0 -delete shares
Apr 19 13:13:27 bondsir005-74050-bl16 pvedaemon[8439]: start VM 1098575: UPID:bondsir005-74050-bl16:000020F7:0001C0A5:5CB9AD57:qmstart:1098575:zap@pve:
Apr 19 13:13:27 bondsir005-74050-bl16 pvedaemon[1772]: <zap@pve> starting task UPID:bondsir005-74050-bl16:000020F7:0001C0A5:5CB9AD57:qmstart:1098575:zap@pve:
Apr 19 13:13:27 bondsir005-74050-bl16 pvedaemon[8439]: start failed: org.freedesktop.systemd1.UnitExists: Unit 1098575.scope already exists.
Apr 19 13:13:27 bondsir005-74050-bl16 pvedaemon[1772]: <zap@pve> end task UPID:bondsir005-74050-bl16:000020F7:0001C0A5:5CB9AD57:qmstart:1098575:zap@pve: start failed: org.freedesktop.systemd1.UnitExists: Unit 1098575.scope already exists.
Apr 19 13:15:07 bondsir005-74050-bl16 pvedaemon[1771]: <zap@pve> update VM 1098575: -balloon 0 -delete shares
Apr 19 13:15:09 bondsir005-74050-bl16 pvedaemon[1772]: <zap@pve> starting task UPID:bondsir005-74050-bl16:000023B1:0001E88A:5CB9ADBD:qmstart:1098575:zap@pve:
Apr 19 13:15:09 bondsir005-74050-bl16 pvedaemon[9137]: start VM 1098575: UPID:bondsir005-74050-bl16:000023B1:0001E88A:5CB9ADBD:qmstart:1098575:zap@pve:
Apr 19 13:15:09 bondsir005-74050-bl16 pvedaemon[9137]: start failed: org.freedesktop.systemd1.UnitExists: Unit 1098575.scope already exists.
Apr 19 13:15:09 bondsir005-74050-bl16 pvedaemon[1772]: <zap@pve> end task UPID:bondsir005-74050-bl16:000023B1:0001E88A:5CB9ADBD:qmstart:1098575:zap@pve: start failed: org.freedesktop.systemd1.UnitExists: Unit 1098575.scope already exists.
Apr 19 13:15:20 bondsir005-74050-bl16 pvedaemon[1771]: <zap@pve> update VM 1098575: -balloon 0 -delete shares
Apr 19 13:15:22 bondsir005-74050-bl16 pvedaemon[9230]: start VM 1098575: UPID:bondsir005-74050-bl16:0000240E:0001EDAB:5CB9ADCA:qmstart:1098575:zap@pve:
Apr 19 13:15:22 bondsir005-74050-bl16 pvedaemon[1771]: <zap@pve> starting task UPID:bondsir005-74050-bl16:0000240E:0001EDAB:5CB9ADCA:qmstart:1098575:zap@pve:
Apr 19 13:15:22 bondsir005-74050-bl16 pvedaemon[9230]: start failed: org.freedesktop.systemd1.UnitExists: Unit 1098575.scope already exists.
Apr 19 13:15:22 bondsir005-74050-bl16 pvedaemon[1771]: <zap@pve> end task UPID:bondsir005-74050-bl16:0000240E:0001EDAB:5CB9ADCA:qmstart:1098575:zap@pve: start failed: org.freedesktop.systemd1.UnitExists: Unit 1098575.scope already exists.
Apr 19 13:15:47 bondsir005-74050-bl16 pvedaemon[1771]: <zap@pve> update VM 1098575: -balloon 0 -delete shares
Apr 19 13:15:49 bondsir005-74050-bl16 pvedaemon[1771]: <zap@pve> starting task UPID:bondsir005-74050-bl16:000024BB:0001F82C:5CB9ADE5:qmstart:1098575:zap@pve:
Apr 19 13:15:49 bondsir005-74050-bl16 pvedaemon[9403]: start VM 1098575: UPID:bondsir005-74050-bl16:000024BB:0001F82C:5CB9ADE5:qmstart:1098575:zap@pve:
Apr 19 13:15:49 bondsir005-74050-bl16 pvedaemon[9403]: start failed: org.freedesktop.systemd1.UnitExists: Unit 1098575.scope already exists.
Apr 19 13:15:49 bondsir005-74050-bl16 pvedaemon[1771]: <zap@pve> end task UPID:bondsir005-74050-bl16:000024BB:0001F82C:5CB9ADE5:qmstart:1098575:zap@pve: start failed: org.freedesktop.systemd1.UnitExists: Unit 1098575.scope already exists.
Apr 19 13:16:06 bondsir005-74050-bl16 qm[9562]: VM 1098575 qmp command failed - VM 1098575 not running
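
The failed start attempts in an excerpt like the one above can be counted with a small filter. This is only a sketch: the function name count_unit_exists is hypothetical, and it assumes the journal format shown here, where each failure is logged once by the worker process and once more in the "end task" line.

```shell
# Hypothetical helper: count failed start attempts for one VMID in
# journal text read from stdin. Only the worker's own "start failed"
# line is matched, so the duplicate "end task" line is not counted.
count_unit_exists() {
    grep -c -- "]: start failed: .*Unit $1[.]scope already exists"
}

# Usage:
#   journalctl -b | count_unit_exists 1098575
```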
 

mira

3 things that seem strange in the log:
Code:
Apr 19 12:58:11 bondsir005-74050-bl16 corosync[1675]: notice [TOTEM ] Retransmit List: 1f2b 1f34 1f35 1f38 1f44 1f45 1f46 1f47 1f4b
Apr 19 13:12:33 bondsir005-74050-bl16 qm[8037]: VM 1098575 qmp command failed - VM 1098575 qmp command 'guest-ping' failed - got timeout
Apr 19 13:15:16 bondsir005-74050-bl16 pveproxy[1863]: unable to read '/etc/pve/nodes/captive001-72001-bl03/pve-ssl.pem' - No such file or directory
Is the guest-agent installed in the VM? If so, is it running?
Looks like there's also a problem with your corosync network. ('Retransmit List' line in the log)

Also the following messages:

Code:
Apr 19 13:12:38 bondsir005-74050-bl16 pvedaemon[8066]: VM 1098575 qmp command failed - VM 1098575 qmp command 'quit' failed - unable to connect to VM 1098575 qmp socket - timeout after 31 retries
Apr 19 13:12:38 bondsir005-74050-bl16 pvedaemon[8066]: VM quit/powerdown failed - terminating now with SIGTERM
Apr 19 13:12:48 bondsir005-74050-bl16 pvedaemon[8066]: VM still running - terminating now with SIGKILL
Edit: copy-paste error, the first 3 messages should now be the right ones.
 
encore

The Retransmit List entries were caused by three nodes I rebooted with KSM sharing disabled.
Yes, the guest agent is installed. When a VM gets stuck, the agent is no longer running.
Our panel does a QEMU agent ping; if it succeeds we trigger a "shutdown", if not, we send a "stop" to the Proxmox API.
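
The panel logic described here could look roughly like the following. This is a sketch, not the panel's actual code: it uses the qm command line tool instead of the API, and the function name graceful_or_hard_stop is made up for illustration.

```shell
# Hypothetical helper: ping the guest agent; request a clean shutdown
# if it answers, otherwise fall back to a hard stop.
graceful_or_hard_stop() {
    if qm agent "$1" ping >/dev/null 2>&1; then
        echo "shutdown"
        qm shutdown "$1"
    else
        echo "stop"
        qm stop "$1"
    fi
}

# Usage:
#   graceful_or_hard_stop 1098575
```

With a frozen guest the ping times out, so this path always ends in a hard stop, which matches the 'guest-ping' timeout followed by the stop task in the journal.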
 

mira

The 'Retransmit List' lines appear later on as well. As I said, this looks like a corosync network problem. (It could be related, though that is unlikely. You should still check your network; retransmits are never a good sign.)
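
To see whether the retransmits cluster around particular times, the journal lines can be grouped by hour. A minimal sketch, assuming the syslog-style timestamps shown in this thread; the function name retransmits_per_hour is hypothetical:

```shell
# Hypothetical helper: count corosync 'Retransmit List' lines per hour
# in journal text read from stdin (timestamps like "Apr 19 12:58:11").
retransmits_per_hour() {
    awk '/Retransmit List/ {
             split($3, t, ":")              # t[1] is the hour
             n[$1 " " $2 " " t[1]]++
         }
         END { for (k in n) print k ":00", n[k] }' | sort
}

# Usage:
#   journalctl -b -u corosync | retransmits_per_hour
```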
Any idea why the guest agent stops running? Anything in the logs of a 'stuck' VM regarding this?
 

encore

mira said:
Any idea why the guest agent stops running? Anything in the logs of a 'stuck' VM regarding this?
That is why I am here ;-) The VMs keep freezing and I have no clue why. When that happens, the console (VNC) does not work, the guest agent no longer responds, and the VM is not accessible via RDP/SSH. Then I try to STOP and START the server and the scope unit message occurs.

mira said:
Sorry, copy-paste error, the first 3 messages should be fixed now in
What do you mean by "should be fixed now"?
 

mira

Sorry, I pasted the same 3 messages twice instead of different ones. Now they are the ones I wanted to post originally.
 

encore

The retransmit issues are gone since we added a separate corosync ring. Unfortunately we are still getting "scope unit already exists" errors every day:
Total translation table size: 0
Total rockridge attributes bytes: 971
Total directory bytes: 6144
Path table size(bytes): 58
Max brk space used 17000
180 extents written (0 MB)
TASK ERROR: start failed: org.freedesktop.systemd1.UnitExists: Unit 1200186.scope already exists.
 

encore

We are now using a Ceph cluster (RBD) with raw VM disks. Unfortunately the issue still persists.
Windows VMs keep freezing after a while. CPU usage looks like this:
http://prntscr.com/ny3jkp (this is where they freeze). Stopping the VM and starting it again leads to:
Total translation table size: 0
Total rockridge attributes bytes: 971
Total directory bytes: 6144
Path table size(bytes): 58
Max brk space used 17000
180 extents written (0 MB)
TASK ERROR: start failed: org.freedesktop.systemd1.UnitExists: Unit 125746.scope already exists.
There is still a QEMU process for the VM. Killing it with kill -9 does not help.

The issue persists on Windows Server 2016 and 2019 Datacenter. I tried different driver versions, removed and re-added drivers, and ran many tests, but can't figure out what is causing these freezes.
 
