Long delay from command issue to command execution

alexskysilk

Distinguished Member
I have what appears to be an intermittent problem with container shutdowns taking a LONG time. For example:
[Attachment: upload_2018-8-21_9-16-45.png]

As you can see, there is a NEARLY 7 MINUTE delay from the end of the stop request to the shutdown command. What is the cause of this delay, and how can it be mitigated?
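To narrow down where the time goes, I'm considering timing the stop by hand, outside of the HA stack (the container ID below is just an example from one of the affected CTs):

Code:
# time an orderly shutdown vs. a hard stop of the same container, whichever applies
# (1101832 is just an example CT ID; both normally return within seconds)
time pct shutdown 1101832
time pct stop 1101832

If those return quickly, the delay would seem to sit in the management/HA layer rather than in the guest itself.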
 
There is no indication in the VM logs that anything was amiss. Moreover, this happens on start tasks as well:

[Attachment: upload_2018-8-30_8-34-18.png]

The system is not overloaded, is not very busy, and seems to be operating normally. There are no obvious indications in dmesg, and pvestatd and pveproxy don't show any problems.
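For reference, this is roughly how I've been checking the logs around a stalled task window (the timestamps are only an example):

Code:
# pull everything the PVE daemons and the HA stack logged around the stalled task
# (adjust --since/--until to the actual task window)
journalctl --since "2018-08-30 08:25:00" --until "2018-08-30 08:45:00" \
    -u pvedaemon -u pvestatd -u pveproxy -u pve-ha-lrm -u pve-ha-crm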

Code:
# pveversion -v
proxmox-ve: 5.2-2 (running kernel: 4.15.17-2-pve)
pve-manager: 5.2-1 (running version: 5.2-1/0fcd7879)
pve-kernel-4.15: 5.2-2
pve-kernel-4.15.17-2-pve: 4.15.17-10
pve-kernel-4.13.13-6-pve: 4.13.13-42
pve-kernel-4.13.13-2-pve: 4.13.13-33
ceph: 12.2.5-pve1
corosync: 2.4.2-pve5
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.0-8
libpve-apiclient-perl: 2.0-4
libpve-common-perl: 5.0-31
libpve-guest-common-perl: 2.0-16
libpve-http-server-perl: 2.0-8
libpve-storage-perl: 5.0-23
libqb0: 1.0.1-1
lvm2: 2.02.168-pve6
lxc-pve: 3.0.0-3
lxcfs: 3.0.0-1
novnc-pve: 0.6-4
openvswitch-switch: 2.7.0-2
proxmox-widget-toolkit: 1.0-18
pve-cluster: 5.0-27
pve-container: 2.0-23
pve-docs: 5.2-4
pve-firewall: 3.0-9
pve-firmware: 2.0-4
pve-ha-manager: 2.0-5
pve-i18n: 1.0-5
pve-libspice-server1: 0.12.8-3
pve-qemu-kvm: 2.11.1-5
pve-xtermjs: 1.0-5
qemu-server: 5.0-26
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.9-pve1~bpo9
 
More information:

During the interim period after the HA command shows as complete (status OK), if I try to start the container manually, e.g.

lxc-start -n 1101832

I get the following response:

No container config specified
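As far as I understand, the config under /var/lib/lxc/<CTID>/config is only generated when the PVE tooling (pct/pvedaemon) starts the container, which would explain that error from a bare lxc-start. For foreground debug output I'd use something along these lines (the log path is just an example):

Code:
# start the container in the foreground with debug logging to see where it hangs
# (assumes the PVE tooling has already generated /var/lib/lxc/1101832/config)
lxc-start -n 1101832 -F -l DEBUG -o /tmp/lxc-1101832.log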
I can't do anything with the container during that period (which can stretch to 12-15 minutes), including migrating it off the node. I tried that because the issue doesn't seem to happen on all nodes at the same time. Could this be lxcfs related?

Code:
# service lxcfs status
● lxcfs.service - FUSE filesystem for LXC
   Loaded: loaded (/lib/systemd/system/lxcfs.service; enabled; vendor preset: enabled)
   Active: active (running) since Thu 2018-06-28 10:23:10 PDT; 2 months 14 days ago
 Main PID: 5431 (lxcfs)
    Tasks: 11 (limit: 4915)
   Memory: 28.9M
      CPU: 11h 44min 22.188s
   CGroup: /system.slice/lxcfs.service
           └─5431 /usr/bin/lxcfs /var/lib/lxcfs/

There doesn't seem to be any indication of a fault there.
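To rule out a stalled FUSE mount, one quick check I'm thinking of is timing a read through lxcfs (the mount point is taken from the CGroup line above):

Code:
# a read through the lxcfs FUSE mount should return immediately;
# a multi-second stall here would point at lxcfs rather than the container
time cat /var/lib/lxcfs/proc/uptime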

Code:
# service pve-cluster status
● pve-cluster.service - The Proxmox VE cluster filesystem
   Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled; vendor preset: enabled)
   Active: active (running) since Thu 2018-06-28 10:23:12 PDT; 2 months 14 days ago
 Main PID: 8266 (pmxcfs)
    Tasks: 13 (limit: 4915)
   Memory: 106.4M
      CPU: 2d 16h 15min 46.869s
   CGroup: /system.slice/pve-cluster.service
           └─8266 /usr/bin/pmxcfs
Sep 11 13:31:51 sky10 pmxcfs[8266]: [status] notice: received log
Sep 11 13:31:52 sky10 pmxcfs[8266]: [status] notice: received log
Sep 11 13:31:52 sky10 pmxcfs[8266]: [status] notice: received log
Sep 11 13:31:54 sky10 pmxcfs[8266]: [status] notice: received log
Sep 11 13:31:54 sky10 pmxcfs[8266]: [status] notice: received log
Sep 11 13:32:17 sky10 pmxcfs[8266]: [status] notice: received log
Sep 11 13:32:43 sky10 pmxcfs[8266]: [status] notice: received log
Sep 11 13:32:52 sky10 pmxcfs[8266]: [status] notice: received log
Sep 11 13:32:53 sky10 pmxcfs[8266]: [status] notice: received log
Sep 11 13:34:33 sky10 pmxcfs[8266]: [status] notice: received log

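Along the same lines, timing access to /etc/pve and checking quorum should show whether pmxcfs itself is stalling (just a sanity check; I haven't caught it in the act yet):

Code:
# /etc/pve is backed by pmxcfs; a slow listing here would implicate the cluster filesystem
time ls /etc/pve/nodes
# quorum / membership sanity check
pvecm status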
Nothing in the pmxcfs log either. This seems to have started relatively recently, although this specific node has an uptime of 75 days, and I see it on other clusters as well (some updated more recently). Needless to say, this is causing me grief. Any help would be appreciated.