Shutdown/reboot error nodes.

kbelov · Nov 12, 2020

Hi.

I am implementing a Proxmox HA cluster of 3 nodes.

Code:

proxmox-ve: 6.2-2 (running kernel: 5.4.65-1-pve)
pve-manager: 6.2-15 (running version: 6.2-15 / 48bd51b6)
pve-kernel-5.4: 6.2-7
pve-kernel-helper: 6.2-7
pve-kernel-5.4.65-1-pve: 5.4.65-1
pve-kernel-5.4.34-1-pve: 5.4.34-2
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: not correctly installed
ifupdown2: 3.0.0-1+pve3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.16-pve1
libproxmox-acme-perl: 1.0.5
libpve-access-control: 6.1-3
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.2-4
libpve-guest-common-perl: 3.1-3
libpve-http-server-perl: 3.0-6
libpve-storage-perl: 6.2-10
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.3-1
lxcfs: 4.0.3-pve3
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.0.1-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.3-10
pve-cluster: 6.2-1
pve-container: 3.2-2
pve-docs: 6.2-6
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.1-3
pve-ha-manager: 3.1-1
pve-i18n: 2.2-2
pve-qemu-kvm: 5.1.0-6
pve-xtermjs: 4.7.0-2
qemu-server: 6.2-19
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 0.8.4-pve2

During testing, I ran into a problem when shutting down or rebooting nodes: from the host's point of view the guests are not shut down cleanly. The server waits for them to terminate and then kills them when the timeout expires, even though the virtual machines themselves shut down correctly.

Code:

Nov 12 00:41:42 pve3 pve-guests[14365]: <root@pam> starting task UPID:pve3:0000381F:0004F226:5FAC5A96:stopall::root@pam:
Nov 12 00:41:42 pve3 pvesh[14365]: Stopping VM 107 (timeout = 180 seconds)
Nov 12 00:41:42 pve3 pve-guests[14367]: <root@pam> starting task UPID:pve3:00003820:0004F22B:5FAC5A96:qmshutdown:107:root@pam:
Nov 12 00:41:42 pve3 pve-guests[14368]: shutdown VM 107: UPID:pve3:00003820:0004F22B:5FAC5A96:qmshutdown:107:root@pam:
Nov 12 00:41:42 pve3 pvesh[14365]: Stopping VM 102 (timeout = 180 seconds)
Nov 12 00:41:42 pve3 pve-guests[14367]: <root@pam> starting task UPID:pve3:00003821:0004F22D:5FAC5A96:qmshutdown:102:root@pam:
Nov 12 00:41:42 pve3 pve-guests[14369]: shutdown VM 102: UPID:pve3:00003821:0004F22D:5FAC5A96:qmshutdown:102:root@pam:
Nov 12 00:41:46 pve3 QEMU[36643]: kvm: Unable to connect character device qmp-event: Failed to connect socket /var/run/qmeventd.sock: Connection refused
Nov 12 00:41:46 pve3 QEMU[2865]: kvm: Unable to connect character device qmp-event: Failed to connect socket /var/run/qmeventd.sock: Connection refused
Nov 12 00:44:42 pve3 pve-guests[14368]: VM 107 qmp command failed - VM 107 qmp command 'guest-shutdown' failed - got timeout
Nov 12 00:44:42 pve3 QEMU[36643]: kvm: terminating on signal 15 from pid 14368 (task UPID:pve3:00003820:0004F22B:5FAC5A96:qmshutdown:107:root@pam:)
Nov 12 00:44:42 pve3 pve-guests[14369]: VM 102 qmp command failed - VM 102 qmp command 'guest-shutdown' failed - got timeout
Nov 12 00:44:42 pve3 QEMU[2865]: kvm: terminating on signal 15 from pid 14369 (task UPID:pve3:00003821:0004F22D:5FAC5A96:qmshutdown:102:root@pam:)

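The problem turned out to be that the qmeventd service exits before pve-guests and pve-ha-lrm.
Adding a dependency on qmeventd to both of these services got rid of the errors that caused the problems when rebooting or shutting down nodes. Because systemd stops units in the reverse of their start order, After=qmeventd.service keeps qmeventd running until these services have been stopped.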

Code:

--- old/pve-ha-lrm.service    2020-11-12 10:10:57.535673153 +0300
+++ new/pve-ha-lrm.service    2020-11-12 02:23:58.611344867 +0300
@@ -4,6 +4,7 @@
 Wants=pve-cluster.service
 Wants=watchdog-mux.service
 Wants=pvedaemon.service
+Wants=qmeventd.service
 Wants=pve-ha-crm.service
 Wants=lxc.service
 Wants=pve-storage.target
@@ -14,6 +15,7 @@
 After=pve-storage.target
 After=pvedaemon.service
 After=pveproxy.service
+After=qmeventd.service
 After=ssh.service
 After=syslog.service
 After=watchdog-mux.service

Code:

--- old/pve-guests.service    2020-11-12 10:10:44.975752106 +0300
+++ new/pve-guests.service    2020-11-12 01:01:50.887244327 +0300
@@ -3,11 +3,13 @@
 ConditionPathExists=/usr/bin/pvesh
 RefuseManualStart=true
 RefuseManualStop=true
+Wants=qmeventd.service
 Wants=pvestatd.service
 Wants=pveproxy.service
 Wants=spiceproxy.service
 Wants=pve-firewall.service
 Wants=lxc.service
+After=qmeventd.service
 After=pveproxy.service
 After=pvestatd.service
 After=spiceproxy.service
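
One way to confirm that the new ordering is actually in effect is to look at the After= list that systemd reports for each unit. This is only a sketch: lists_unit is a helper name made up for this example, and it assumes a systemd version that supports --value for systemctl show.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the space-separated unit list read from
# stdin contains exactly the unit name given as $1.
lists_unit() {
    tr ' ' '\n' | grep -qxF -- "$1"
}

# On a node, feed it the effective After= list of each patched unit.
# Guarded so the script does nothing on machines without systemctl.
if command -v systemctl >/dev/null 2>&1; then
    for unit in pve-guests.service pve-ha-lrm.service; do
        if systemctl show --property=After --value "$unit" 2>/dev/null \
                | lists_unit qmeventd.service; then
            echo "$unit: ordered after qmeventd.service"
        else
            echo "$unit: NOT ordered after qmeventd.service"
        fi
    done
fi
```

If a unit is still reported as not ordered after qmeventd.service, a systemctl daemon-reload after editing the unit files may be what is missing.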

My solution may not be the most correct one, but it solves the problem and may be useful to someone. I hope the developers will fix this error in a future release.

Stefan_R · Nov 12, 2020

Thanks for reporting the issue, and even including the fix! I've reproduced the issue and sent a fix similar to yours to our mailing list: https://lists.proxmox.com/pipermail/pve-devel/2020-November/045897.html

Stefan Radman · Nov 15, 2020

In my case the (standalone) host waited for 3 minutes even though all guests had shut down long before.
I actually used a drop-in snippet instead of editing the original service template.

Code:

root@pve:~# systemctl edit pve-guests
...
root@pve:~# cat /etc/systemd/system/pve-guests.service.d/override.conf
# stop qmeventd.service after pve-guests.service
# https://lists.proxmox.com/pipermail/pve-devel/2020-November/045897.html
[Unit]
After=qmeventd.service
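
For a scripted setup, the same drop-in could probably be written without the interactive editor. A minimal sketch, assuming the default drop-in path shown above; write_dropin is a hypothetical helper name, not a Proxmox or systemd tool:

```shell
#!/bin/sh
# Hypothetical helper: writes the override from this post into the drop-in
# directory given as $1, creating the directory if needed.
write_dropin() {
    mkdir -p "$1" || return 1
    cat > "$1/override.conf" <<'EOF'
# stop qmeventd.service after pve-guests.service
[Unit]
After=qmeventd.service
EOF
}

# On a node (as root) one would then run something like:
#   write_dropin /etc/systemd/system/pve-guests.service.d
#   systemctl daemon-reload
#   systemctl cat pve-guests.service   # the override should appear after the main unit file
```

Unlike systemctl edit, this does not reload systemd by itself, hence the explicit daemon-reload.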

Your report and fix solved the issue for me. Now the host shuts down immediately after the last VM. Thank you.

Best regards
Stefan
