TASK ERROR: start failed: org.freedesktop.systemd1.UnitExists: Unit scope already exists.

Discussion in 'Proxmox VE: Installation and configuration' started by encore, Apr 18, 2019.

  1. encore

    encore Member

    Joined:
    May 4, 2018
    Messages:
    92
    Likes Received:
    0
    Hi,

    we have a big problem with our Proxmox cluster. The cluster consists of 25 nodes with 10-50 servers each (LXC & KVM).
    20-30 times a day, KVM servers freeze. The console is then no longer reachable, and neither is the server itself.
    When we stop such a server, Proxmox displays "stopped".
    If we then restart it, we get the "Unit ... scope already exists" error from the thread title.
    ps -ef | grep VMID
    still shows the VM process, which cannot be stopped even with kill -9.
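    For reference, this is roughly how we check a stuck VM on the node (VMID is a placeholder for the numeric VM ID; the systemctl line is just what we look at, not an official procedure):

    Code:
    ps -ef | grep VMID           # the kvm process is still listed
    kill -9 <PID>                # has no effect on these stuck processes
    systemctl status VMID.scope  # the systemd scope unit for the VM also still exists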

    The nodes are up to date with the
    deb http://download.proxmox.com/debian/pve stretch pve-no-subscription
    repo.
     
  2. dietmar

    dietmar Proxmox Staff Member
    Staff Member

    Joined:
    Apr 28, 2005
    Messages:
    16,456
    Likes Received:
    310
    Maybe a problem with your storage? What kind of storage do you use? In case of failure, please check the storage status with

    # pvesm status
     
  3. encore

    encore Member

    Joined:
    May 4, 2018
    Messages:
    92
    Likes Received:
    0
    We use a different storage (just an SSD with ext4 as a directory storage) on each node. The storages look fine. The issue happens on ALL nodes, seemingly at random. It also happens on nodes with 5 VMs and on nodes with 30 VMs, so it is not overprovisioning.
    The VMs use VirtIO SCSI and are stored as qcow2.

    Any idea how to debug this further? Are there any log files that might give more details about the issue?

    We are using Cloud-Init, so a "serial port socket" is enabled on all VMs. Could that cause the issue?
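    (For context, the serial socket shows up in the VM config roughly like this; illustrative excerpt only:)

    Code:
    # excerpt from qm config <VMID>
    serial0: socket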
     
  4. encore

    encore Member

    Joined:
    May 4, 2018
    Messages:
    92
    Likes Received:
    0
    these "stucky vms" dont response to guest tools and do have a memory usage of >90%.
    Last night I had running some test servers with absolutly no operations or load. Some of them do have the same issue this morning.

    Some of them had windows installed, some linux.
     
  5. mira

    mira Proxmox Staff Member
    Staff Member

    Joined:
    Feb 25, 2019
    Messages:
    160
    Likes Received:
    14
    Please post the output of 'pveversion -v', your storage config (/etc/pve/storage.cfg) and the config of a VM that exhibits this problem ('qm config <VMID>')
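    For example, roughly (run on an affected node; <VMID> is the ID of one of the stuck VMs):

    Code:
    pveversion -v
    cat /etc/pve/storage.cfg
    qm config <VMID>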
     
  6. encore

    encore Member

    Joined:
    May 4, 2018
    Messages:
    92
    Likes Received:
    0
    Global storage.cfg:
    dir: local
    path /var/lib/vz
    content vztmpl,iso
    maxfiles 50
    shared 0

    lvmthin: local-lvm
    thinpool data
    vgname pve
    content none

    dir: captive002-lxcstor-01
    disable
    path /mnt/captive002-lxcstor-01/
    content images,vztmpl,rootdir,backup
    maxfiles 300
    nodes captive002-77015
    shared 1

    dir: captive003-lxcstor-01
    disable
    path /mnt/captive003-lxcstor-01
    content vztmpl,images,backup,rootdir
    maxfiles 300
    nodes captive003-77030
    shared 1

    dir: captive004-lxcstor-01
    disable
    path /mnt/captive004-lxcstor-01
    content backup,rootdir,images,vztmpl
    maxfiles 300
    nodes captive004-77028
    shared 0

    dir: captive002-lxcstor-02
    disable
    path /mnt/captive002-lxcstor-02/
    content rootdir,backup,vztmpl,images
    maxfiles 300
    nodes captive002-77015
    shared 0

    dir: captive003-lxcstor-02-LOCAL
    path /mnt/captive003-lxcstor-02-LOCAL/
    content rootdir,backup,vztmpl,images
    maxfiles 300
    nodes captive003-77030
    shared 0

    dir: captive004-lxcstor-02-LOCAL
    path /mnt/captive004-lxcstor-02-LOCAL
    content vztmpl,images,rootdir,backup
    maxfiles 300
    nodes captive004-77028
    shared 0

    dir: captive005-lxcstor-01-LOCAL
    path /mnt/captive005-lxcstor-01-LOCAL
    content backup,rootdir,images,vztmpl
    maxfiles 300
    nodes captive005-74001
    shared 0

    dir: captive005-lxcstor-02-LOCAL
    path /mnt/captive005-lxcstor-02-LOCAL
    content vztmpl,images,backup,rootdir
    maxfiles 300
    nodes captive005-74001
    shared 0

    dir: captive006-lxcstor-01-LOCAL
    path /mnt/captive006-lxcstor-01-local
    content images,vztmpl,backup,rootdir
    maxfiles 300
    nodes captive006-73029
    shared 0

    dir: captive007-lxcstor-01-LOCAL
    path /mnt/captive007-lxcstor-01-local
    content rootdir,backup,images,vztmpl
    maxfiles 300
    nodes captive003-77030,captive007-73030
    shared 0

    dir: captive008-lxcstor-01-LOCAL
    path /mnt/captive008-lxcstor-01-LOCAL
    content images,vztmpl,rootdir,backup
    maxfiles 300
    nodes captive008-74005
    shared 0

    dir: captive009-lxcstor-01-LOCAL
    path /mnt/captive009-lxcstor-01-LOCAL
    content backup,rootdir,vztmpl,images
    maxfiles 300
    nodes captive009-77014
    shared 0

    dir: captive009-lxcstor-02-LOCAL
    path /mnt/captive009-lxcstor-02-LOCAL
    content rootdir,backup,vztmpl,images
    maxfiles 300
    nodes captive009-77014
    shared 0

    dir: captive011-lxcstor-01-LOCAL
    path /mnt/captive011-lxcstor-01-LOCAL
    content backup,rootdir,vztmpl,images
    maxfiles 300
    nodes captive011-74007
    shared 0

    dir: captive011-lxcstor-02-LOCAL
    path /mnt/captive011-lxcstor-02-LOCA
    content images,vztmpl,rootdir,backup
    maxfiles 300
    nodes captive011-74007
    shared 0

    dir: captive001-lxcstor-01-localLV
    path /mnt/captive001-lxcstor-01-localLV
    content images,vztmpl,rootdir,backup
    maxfiles 1
    nodes captive001-72001-bl12
    shared 0

    dir: captive006-lxcstor-01-localLV
    path /mnt/captive006-lxcstor-01-localLV
    content backup,rootdir,images,vztmpl
    maxfiles 1
    nodes captive006-72011-bl09
    shared 0

    dir: captive002-lxcstor-01-LOCAL
    path /mnt/captive002-lxcstor-01-LOCAL
    content vztmpl,images,backup,rootdir
    maxfiles 300
    nodes captive002-77015
    shared 0

    dir: captive002-lxcstor-02-LOCAL
    path /mnt/captive002-lxcstor-02-LOCAL
    content images,vztmpl,rootdir,backup
    maxfiles 100
    nodes captive002-77015
    shared 1

    dir: captive004-lxcstor-01-LOCAL
    path /mnt/captive004-lxcstor-01-LOCAL
    content vztmpl,images,rootdir,backup
    maxfiles 300
    nodes captive004-77028
    shared 0

    dir: captive003-lxcstor-01-LOCAL
    path /mnt/captive003-lxcstor-01-LOCAL
    content backup,rootdir,vztmpl,images
    maxfiles 100
    nodes captive003-77030
    shared 0

    nfs: imageserver
    export /var/pve
    path /mnt/pve/imageserver
    server 10.10.10.100
    content iso,vztmpl
    maxfiles 100
    options vers=3

    dir: imageserver-clones
    disable
    path /home/imageserver
    content images
    shared 1

    nfs: solusmigrates
    export /var/solus
    path /mnt/pve/solusmigrates
    server 10.10.10.100
    content images
    options vers=3

    dir: captive007-lxcstor-01-localLV
    path /mnt/captive007-lxcstor-01-localLV
    content images,vztmpl,rootdir,backup
    maxfiles 99
    nodes captive007-72001-bl11
    shared 0

    dir: captive012-lxcstor01-localLV
    path /mnt/captive012-lxcstor-01-localLV
    content rootdir,backup,images,vztmpl
    maxfiles 99
    nodes captive012-72011-bl06
    shared 0

    dir: captive013-lxcstor01-localLV
    path /mnt/captive013-lxcstor01-localLV
    content backup,rootdir,vztmpl,images
    maxfiles 99
    nodes captive013-74050-bl08
    shared 0

    dir: captive014-lxcstor-01-localLV
    path /mnt/captive014-lxcstor-01-localLV
    content backup,rootdir,vztmpl,images
    maxfiles 99
    nodes captive014-72001-bl15
    shared 0

    dir: bondbabe001-lxcstor01-localLV
    path /mnt/bondbabe001-lxcstor01-localLV
    content rootdir,backup,vztmpl,images
    maxfiles 99
    nodes bondbabe001-74050-bl06
    shared 0

    dir: bondsir001-lxcstor01-localLV
    path /mnt/bondsir001-lxcstor01-localLV
    content rootdir,backup,vztmpl,images
    maxfiles 99
    nodes bondsir001-72011-bl14
    shared 0

    dir: captive015-lxcstor01-localLV
    path /mnt/captive015-lxcstor01-localLV
    content backup,rootdir,vztmpl,images
    maxfiles 99
    nodes captive015-74050-bl05
    shared 0

    dir: captive016-lxcstor01-localLV
    path /mnt/captive016-lxcstor01-localLV/
    content vztmpl,images,rootdir,backup
    maxfiles 99
    nodes captive016-72001-bl01
    shared 0

    dir: captive017-lxcstor01-localLV
    path /mnt/captive017-lxcstor01-localLV
    content backup,rootdir,images,vztmpl
    maxfiles 99
    nodes captive017-74050-bl09
    shared 0

    dir: bondsir002-lxcstor01-localLV
    path /mnt/bondsir002-lxcstor01-localLV
    content images,vztmpl,rootdir,backup
    maxfiles 99
    nodes bondsir002-72001-bl08
    shared 0

    dir: captive018-lxcstor01-localLV
    path /mnt/captive018-lxcstor01-localLV
    content images,vztmpl,rootdir,backup
    maxfiles 0
    nodes captive018-72001-bl04
    shared 0

    dir: bondbabe002-lxcstor01-localLV
    path /mnt/bondbabe002-lxcstor01-localLV
    content images,vztmpl,rootdir,backup
    maxfiles 99
    nodes bondbabe002-72011-bl12
    shared 0

    dir: captive019-lxcstor01-localLV
    path /mnt/captive019-lxcstor01-localLV
    content rootdir,backup,vztmpl,images
    maxfiles 99
    nodes captive019-74050-bl12
    shared 0

    dir: bondsir003-lxcstor01-localLV
    path /mnt/bondsir003-lxcstor01-localLV
    content vztmpl,images,rootdir,backup
    maxfiles 99
    nodes bondsir003-74050-bl10
    shared 0

    dir: captive010-lxcstor01-localLV
    path /mnt/captive010-lxcstor01-localLV
    content backup,rootdir,vztmpl,images
    maxfiles 99
    nodes captive010-74050-bl14
    shared 0

    dir: bondsir004-lxcstor01-localLV
    path /mnt/bondsir004-lxcstor01-localLV
    content backup,rootdir,images,vztmpl
    maxfiles 99
    nodes bondsir004-74050-bl11
    shared 0

    dir: captive020-lxcstor01-localLV
    path /mnt/captive020-lxcstor01-localLV
    content backup,rootdir,vztmpl,images
    maxfiles 99
    nodes captive020-74050-bl13
    shared 0

    dir: bondsir005-lxcstor01-localLV
    path /mnt/bondsir005-lxcstor01-localLV
    content backup,rootdir,images,vztmpl
    maxfiles 99
    nodes bondsir005-74050-bl16
    shared 0

    dir: captive021-lxcstor01-localLV
    path /mnt/captive021-lxcstor01-localLV
    content backup,rootdir,images,vztmpl
    maxfiles 99
    nodes captive021-74050-bl15-rev2
    shared 0

    dir: captive022-lxcstor01-localLV
    path /mnt/captive022-lxcstor01-localLV
    content backup,rootdir,images,vztmpl
    maxfiles 99
    nodes captive022-79001-bl01
    shared 0

    Node Bondsir003:
    proxmox-ve: 5.4-1 (running kernel: 4.15.18-12-pve)
    pve-manager: 5.4-3 (running version: 5.4-3/0a6eaa62)
    pve-kernel-4.15: 5.3-3
    pve-kernel-4.15.18-12-pve: 4.15.18-35
    pve-kernel-4.13.13-2-pve: 4.13.13-33
    corosync: 2.4.4-pve1
    criu: 2.11.1-1~bpo90
    glusterfs-client: 3.8.8-1
    ksm-control-daemon: 1.2-2
    libjs-extjs: 6.0.1-2
    libpve-access-control: 5.1-8
    libpve-apiclient-perl: 2.0-5
    libpve-common-perl: 5.0-50
    libpve-guest-common-perl: 2.0-20
    libpve-http-server-perl: 2.0-13
    libpve-storage-perl: 5.0-41
    libqb0: 1.0.3-1~bpo9
    lvm2: 2.02.168-pve6
    lxc-pve: 3.1.0-3
    lxcfs: 3.0.3-pve1
    novnc-pve: 1.0.0-3
    proxmox-widget-toolkit: 1.0-25
    pve-cluster: 5.0-36
    pve-container: 2.0-37
    pve-docs: 5.4-2
    pve-edk2-firmware: 1.20190312-1
    pve-firewall: 3.0-19
    pve-firmware: 2.0-6
    pve-ha-manager: 2.0-9
    pve-i18n: 1.1-4
    pve-libspice-server1: 0.14.1-2
    pve-qemu-kvm: 2.12.1-3
    pve-xtermjs: 3.12.0-1
    qemu-server: 5.0-50
    smartmontools: 6.5+svn4324-1
    spiceterm: 3.0-5
    vncterm: 1.5-3
    zfsutils-linux: 0.7.13-pve1~bpo2

    VM config:
    This is only one node and one VM.
    Currently I can reproduce the issue on 8 different nodes and on 20 VMs in total.

    LXC containers are not affected btw.
     
  7. mira

    mira Proxmox Staff Member
    Staff Member

    Joined:
    Feb 25, 2019
    Messages:
    160
    Likes Received:
    14
    Could you also post the journal ('journalctl -b', i.e. everything since the last boot)? Perhaps it contains some more information.
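    For example (the output file name is just an example):

    Code:
    journalctl -b > /tmp/journal.txt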
     
  8. encore

    encore Member

    Joined:
    May 4, 2018
    Messages:
    92
    Likes Received:
    0
  9. dthompson

    dthompson Member

    Joined:
    Nov 23, 2011
    Messages:
    41
    Likes Received:
    0
    I am also getting this exact same issue as well. It started after I updated my cluster to the latest version of Proxmox (5.4-3) last night.
    This morning I found that I have 1 VM so far that I couldn't get started.

    I was able to get the VM going again by removing it from HA, migrating it to another node in the cluster, and starting it there (rough commands below). It's now up and running. I am a little nervous, though, that this might happen to another VM.

    I haven't rebooted any of my nodes in the cluster yet, as I wasn't prompted to restart them after the latest update, but perhaps that is the issue at play here.

    My error is the same as encore's:
    TASK ERROR: start failed: org.freedesktop.systemd1.UnitExists: Unit 116.scope already exists.
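    Roughly what I did, as CLI commands (sketch only; the target node name is a placeholder, and the same steps can be done via the GUI):

    Code:
    ha-manager remove vm:116        # remove the VM from HA management
    qm migrate 116 <target-node>    # offline-migrate the stopped VM to another node
    qm start 116                    # then start it on the target node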
     
  10. encore

    encore Member

    Joined:
    May 4, 2018
    Messages:
    92
    Likes Received:
    0
    We have been facing this issue for months now, but it has become much worse as our cluster has grown (we are currently moving all VMs from SolusVM to Proxmox, but stopped that process because of these issues).

    I rebooted one node with only 5 VMs. I had disabled KSM tuning beforehand because of strange ballooning issues.
    Directly after the reboot (journalctl -b is now very small, only < 1 MB, I will attach it), one VM does not start, again with the "scope unit already exists" error. The VM ID here is 1098575.
    It is a different node, btw.

    Here is journal2.txt filtered with grep 1098575:
    root@bondsir005-74050-bl16:~# cat journal2.txt | grep 1098575
    Apr 19 12:54:44 bondsir005-74050-bl16 pvesh[1899]: Starting VM 1098575
    Apr 19 12:54:44 bondsir005-74050-bl16 pve-guests[2068]: start VM 1098575: UPID:bondsir005-74050-bl16:00000814:000009CD:5CB9A8F4:qmstart:1098575:root@pam:
    Apr 19 12:54:44 bondsir005-74050-bl16 pve-guests[1963]: <root@pam> starting task UPID:bondsir005-74050-bl16:00000814:000009CD:5CB9A8F4:qmstart:1098575:root@pam:
    Apr 19 12:54:44 bondsir005-74050-bl16 systemd[1]: Started 1098575.scope.
    Apr 19 12:54:44 bondsir005-74050-bl16 systemd-udevd[2086]: Could not generate persistent MAC address for tap1098575i0: No such file or directory
    Apr 19 12:54:45 bondsir005-74050-bl16 kernel: device tap1098575i0 entered promiscuous mode
    Apr 19 12:54:45 bondsir005-74050-bl16 kernel: vmbr0: port 3(tap1098575i0) entered blocking state
    Apr 19 12:54:45 bondsir005-74050-bl16 kernel: vmbr0: port 3(tap1098575i0) entered disabled state
    Apr 19 12:54:45 bondsir005-74050-bl16 kernel: vmbr0: port 3(tap1098575i0) entered blocking state
    Apr 19 12:54:45 bondsir005-74050-bl16 kernel: vmbr0: port 3(tap1098575i0) entered forwarding state
    Apr 19 13:12:33 bondsir005-74050-bl16 qm[8037]: VM 1098575 qmp command failed - VM 1098575 qmp command 'guest-ping' failed - got timeout
    Apr 19 13:12:35 bondsir005-74050-bl16 pvedaemon[8066]: stop VM 1098575: UPID:bondsir005-74050-bl16:00001F82:0001AC52:5CB9AD23:qmstop:1098575:zap@pve:
    Apr 19 13:12:35 bondsir005-74050-bl16 pvedaemon[1771]: <zap@pve> starting task UPID:bondsir005-74050-bl16:00001F82:0001AC52:5CB9AD23:qmstop:1098575:zap@pve:
    Apr 19 13:12:38 bondsir005-74050-bl16 pvedaemon[8066]: VM 1098575 qmp command failed - VM 1098575 qmp command 'quit' failed - unable to connect to VM 1098575 qmp socket - timeout after 31 retries
    Apr 19 13:12:49 bondsir005-74050-bl16 pvedaemon[1771]: <zap@pve> end task UPID:bondsir005-74050-bl16:00001F82:0001AC52:5CB9AD23:qmstop:1098575:zap@pve: OK
    Apr 19 13:13:08 bondsir005-74050-bl16 pvedaemon[1771]: <zap@pve> update VM 1098575: -balloon 0 -delete shares
    Apr 19 13:13:10 bondsir005-74050-bl16 pvedaemon[1771]: <zap@pve> starting task UPID:bondsir005-74050-bl16:00002005:0001BA23:5CB9AD46:qmstart:1098575:zap@pve:
    Apr 19 13:13:10 bondsir005-74050-bl16 pvedaemon[8197]: start VM 1098575: UPID:bondsir005-74050-bl16:00002005:0001BA23:5CB9AD46:qmstart:1098575:zap@pve:
    Apr 19 13:13:10 bondsir005-74050-bl16 systemd[1]: Stopped 1098575.scope.
    Apr 19 13:13:11 bondsir005-74050-bl16 pvedaemon[8197]: start failed: org.freedesktop.systemd1.UnitExists: Unit 1098575.scope already exists.
    Apr 19 13:13:11 bondsir005-74050-bl16 pvedaemon[1771]: <zap@pve> end task UPID:bondsir005-74050-bl16:00002005:0001BA23:5CB9AD46:qmstart:1098575:zap@pve: start failed: org.freedesktop.systemd1.UnitExists: Unit 1098575.scope already exists.
    Apr 19 13:13:25 bondsir005-74050-bl16 pvedaemon[1772]: <zap@pve> update VM 1098575: -balloon 0 -delete shares
    Apr 19 13:13:27 bondsir005-74050-bl16 pvedaemon[8439]: start VM 1098575: UPID:bondsir005-74050-bl16:000020F7:0001C0A5:5CB9AD57:qmstart:1098575:zap@pve:
    Apr 19 13:13:27 bondsir005-74050-bl16 pvedaemon[1772]: <zap@pve> starting task UPID:bondsir005-74050-bl16:000020F7:0001C0A5:5CB9AD57:qmstart:1098575:zap@pve:
    Apr 19 13:13:27 bondsir005-74050-bl16 pvedaemon[8439]: start failed: org.freedesktop.systemd1.UnitExists: Unit 1098575.scope already exists.
    Apr 19 13:13:27 bondsir005-74050-bl16 pvedaemon[1772]: <zap@pve> end task UPID:bondsir005-74050-bl16:000020F7:0001C0A5:5CB9AD57:qmstart:1098575:zap@pve: start failed: org.freedesktop.systemd1.UnitExists: Unit 1098575.scope already exists.
    Apr 19 13:15:07 bondsir005-74050-bl16 pvedaemon[1771]: <zap@pve> update VM 1098575: -balloon 0 -delete shares
    Apr 19 13:15:09 bondsir005-74050-bl16 pvedaemon[1772]: <zap@pve> starting task UPID:bondsir005-74050-bl16:000023B1:0001E88A:5CB9ADBD:qmstart:1098575:zap@pve:
    Apr 19 13:15:09 bondsir005-74050-bl16 pvedaemon[9137]: start VM 1098575: UPID:bondsir005-74050-bl16:000023B1:0001E88A:5CB9ADBD:qmstart:1098575:zap@pve:
    Apr 19 13:15:09 bondsir005-74050-bl16 pvedaemon[9137]: start failed: org.freedesktop.systemd1.UnitExists: Unit 1098575.scope already exists.
    Apr 19 13:15:09 bondsir005-74050-bl16 pvedaemon[1772]: <zap@pve> end task UPID:bondsir005-74050-bl16:000023B1:0001E88A:5CB9ADBD:qmstart:1098575:zap@pve: start failed: org.freedesktop.systemd1.UnitExists: Unit 1098575.scope already exists.
    Apr 19 13:15:20 bondsir005-74050-bl16 pvedaemon[1771]: <zap@pve> update VM 1098575: -balloon 0 -delete shares
    Apr 19 13:15:22 bondsir005-74050-bl16 pvedaemon[9230]: start VM 1098575: UPID:bondsir005-74050-bl16:0000240E:0001EDAB:5CB9ADCA:qmstart:1098575:zap@pve:
    Apr 19 13:15:22 bondsir005-74050-bl16 pvedaemon[1771]: <zap@pve> starting task UPID:bondsir005-74050-bl16:0000240E:0001EDAB:5CB9ADCA:qmstart:1098575:zap@pve:
    Apr 19 13:15:22 bondsir005-74050-bl16 pvedaemon[9230]: start failed: org.freedesktop.systemd1.UnitExists: Unit 1098575.scope already exists.
    Apr 19 13:15:22 bondsir005-74050-bl16 pvedaemon[1771]: <zap@pve> end task UPID:bondsir005-74050-bl16:0000240E:0001EDAB:5CB9ADCA:qmstart:1098575:zap@pve: start failed: org.freedesktop.systemd1.UnitExists: Unit 1098575.scope already exists.
    Apr 19 13:15:47 bondsir005-74050-bl16 pvedaemon[1771]: <zap@pve> update VM 1098575: -balloon 0 -delete shares
    Apr 19 13:15:49 bondsir005-74050-bl16 pvedaemon[1771]: <zap@pve> starting task UPID:bondsir005-74050-bl16:000024BB:0001F82C:5CB9ADE5:qmstart:1098575:zap@pve:
    Apr 19 13:15:49 bondsir005-74050-bl16 pvedaemon[9403]: start VM 1098575: UPID:bondsir005-74050-bl16:000024BB:0001F82C:5CB9ADE5:qmstart:1098575:zap@pve:
    Apr 19 13:15:49 bondsir005-74050-bl16 pvedaemon[9403]: start failed: org.freedesktop.systemd1.UnitExists: Unit 1098575.scope already exists.
    Apr 19 13:15:49 bondsir005-74050-bl16 pvedaemon[1771]: <zap@pve> end task UPID:bondsir005-74050-bl16:000024BB:0001F82C:5CB9ADE5:qmstart:1098575:zap@pve: start failed: org.freedesktop.systemd1.UnitExists: Unit 1098575.scope already exists.
    Apr 19 13:16:06 bondsir005-74050-bl16 qm[9562]: VM 1098575 qmp command failed - VM 1098575 not running
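    For reference, a minimal sketch of how one might inspect and clear the stale scope unit before retrying the start (not an official fix, just what the error message suggests trying):

    Code:
    systemctl status 1098575.scope        # is the scope unit still registered?
    systemctl stop 1098575.scope          # try to stop it
    systemctl reset-failed 1098575.scope  # clear a possibly failed unit state
    qm start 1098575                      # retry the start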
     

    Attached Files:

  11. mira

    mira Proxmox Staff Member
    Staff Member

    Joined:
    Feb 25, 2019
    Messages:
    160
    Likes Received:
    14
    3 things that seem strange in the log:
    Code:
    Apr 19 12:58:11 bondsir005-74050-bl16 corosync[1675]: notice [TOTEM ] Retransmit List: 1f2b 1f34 1f35 1f38 1f44 1f45 1f46 1f47 1f4b
    Apr 19 13:12:33 bondsir005-74050-bl16 qm[8037]: VM 1098575 qmp command failed - VM 1098575 qmp command 'guest-ping' failed - got timeout
    Apr 19 13:15:16 bondsir005-74050-bl16 pveproxy[1863]: unable to read '/etc/pve/nodes/captive001-72001-bl03/pve-ssl.pem' - No such file or directory
    
    Is the guest-agent installed in the VM? If so, is it running?
    Looks like there's also a problem with your corosync network. ('Retransmit List' line in the log)

    Also the following messages:

    Code:
    Apr 19 13:12:38 bondsir005-74050-bl16 pvedaemon[8066]: VM 1098575 qmp command failed - VM 1098575 qmp command 'quit' failed - unable to connect to VM 1098575 qmp socket - timeout after 31 retries
    Apr 19 13:12:38 bondsir005-74050-bl16 pvedaemon[8066]: VM quit/powerdown failed - terminating now with SIGTERM
    Apr 19 13:12:48 bondsir005-74050-bl16 pvedaemon[8066]: VM still running - terminating now with SIGKILL
    Edit: copy-paste error, the first 3 messages should now be the right ones.
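    To check the cluster/corosync health you could, for example, run (standard commands, nothing specific to this issue):

    Code:
    pvecm status          # cluster membership and quorum overview
    corosync-cfgtool -s   # ring status of the local node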
     
    #11 mira, Apr 19, 2019
    Last edited: Apr 19, 2019
  12. encore

    encore Member

    Joined:
    May 4, 2018
    Messages:
    92
    Likes Received:
    0
    The Retransmit List entries were caused by three nodes I rebooted with KSM sharing disabled.
    Yes, the guest agent is installed. When a VM gets stuck, the agent is not running anymore.
    Our panel does a QEMU agent ping; if it succeeds we trigger a "shutdown", if not, we trigger a "stop" via the Proxmox API (rough sketch below).
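    The logic is roughly this (sketch using the qm CLI; our panel actually uses the equivalent HTTP API calls, and <VMID> is a placeholder):

    Code:
    if qm agent <VMID> ping >/dev/null 2>&1; then
        qm shutdown <VMID>   # agent responds -> clean shutdown
    else
        qm stop <VMID>       # agent not responding -> hard stop
    fi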
     
  13. mira

    mira Proxmox Staff Member
    Staff Member

    Joined:
    Feb 25, 2019
    Messages:
    160
    Likes Received:
    14
    The 'Retransmit List' lines appear later on as well. As I said, this looks like a corosync network problem. (It could be related, though that is unlikely; you should still check your network, as retransmits are never a good sign.)
    Any idea why the guest agent stops running? Anything in the logs of a 'stuck' VM regarding this?
     
  14. mira

    mira Proxmox Staff Member
    Staff Member

    Joined:
    Feb 25, 2019
    Messages:
    160
    Likes Received:
    14
  15. encore

    encore Member

    Joined:
    May 4, 2018
    Messages:
    92
    Likes Received:
    0
    This is why I am here ;-) The VMs keep freezing and I have no clue why. When that happens, the console (VNC) does not work, the guest agent does not respond anymore, and the VM is not accessible via RDP/SSH. Then I try to STOP and START the server, and the scope unit message occurs.

    What do you mean by "should be fixed now"?
     
  16. encore

    encore Member

    Joined:
    May 4, 2018
    Messages:
    92
    Likes Received:
    0
    When they freeze, they look like this:
    http://prntscr.com/ne7eu3

    The host looks fine:
    http://prntscr.com/ne7f1l

    When I stop this VM now and start it again, I get the scope unit error.

    Btw:
    This happens if the guest agent is enabled in Proxmox but the guest agent is not running inside the VM (e.g. it is still booting up, or it has frozen like it does all the time).
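    (The option in question is the agent flag in the VM config; illustrative commands:)

    Code:
    qm config <VMID> | grep agent   # e.g. "agent: 1"
    qm set <VMID> --agent 0         # disable it for testing, if needed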
     
  17. mira

    mira Proxmox Staff Member
    Staff Member

    Joined:
    Feb 25, 2019
    Messages:
    160
    Likes Received:
    14
    Sorry, I pasted the same 3 messages twice instead of different ones. Now they are the ones I wanted to post originally.
     
  18. encore

    encore Member

    Joined:
    May 4, 2018
    Messages:
    92
    Likes Received:
    0
    #18 encore, Apr 20, 2019
    Last edited: Apr 20, 2019
  19. encore

    encore Member

    Joined:
    May 4, 2018
    Messages:
    92
    Likes Received:
    0
    The retransmit issues are gone since we added a separate corosync ring (rough sketch of the setup below). Unfortunately, we are still getting "scope unit already exists" errors every day:
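    (Roughly, the second ring looks like this in corosync.conf; heavily abbreviated, with placeholder addresses, only to illustrate what "separate ring" means here:)

    Code:
    # excerpt from /etc/pve/corosync.conf (abbreviated, placeholder addresses)
    nodelist {
      node {
        name: node1
        ring0_addr: 10.10.10.1   # original cluster network
        ring1_addr: 10.20.20.1   # new, dedicated corosync network
        ...
      }
    }
    totem {
      rrp_mode: passive          # needed for a second ring on corosync 2.x
      ...
    }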
     
  20. encore

    encore Member

    Joined:
    May 4, 2018
    Messages:
    92
    Likes Received:
    0
    We are now using a Ceph cluster (RBD) with raw VM disks. Unfortunately, the issue still persists.
    Windows VMs keep freezing after a while; CPU usage looks like this:
    http://prntscr.com/ny3jkp (this is where they freeze). Stopping the VM and starting it again leads to the same scope unit error as before:
    There is still a QEMU process for the VM. Killing it with kill -9 does not help.

    The issue persists on Windows Server 2016 and 2019 Datacenter. I tried different driver versions, removed drivers, added drivers, and did many tests, but I can't figure out what is causing these freezes.
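    For completeness, this is roughly how we look at the leftover process (PID is a placeholder); it only confirms the process is stuck in the kernel:

    Code:
    ps -o pid,stat,wchan:32,cmd -p <PID>   # "D" state = uninterruptible sleep; kill -9 cannot help there
    cat /proc/<PID>/stack                  # kernel stack of the stuck task (run as root)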
     