I have a container that has failed to start, and a hung pveproxy is denying any new activity on the node, which is showing a grey question mark in the GUI. Pveproxy itself appears to be running:
Code:
# service pveproxy status
● pveproxy.service - PVE API Proxy Server
Loaded: loaded (/lib/systemd/system/pveproxy.service; enabled; vendor preset: enabled)
Active: active (running) since Tue 2018-01-09 09:18:38 PST; 1 weeks 5 days ago
Process: 3186117 ExecReload=/usr/bin/pveproxy restart (code=exited, status=0/SUCCESS)
Process: 15563 ExecStart=/usr/bin/pveproxy start (code=exited, status=0/SUCCESS)
Main PID: 15628 (pveproxy)
Tasks: 4 (limit: 4915)
Memory: 179.8M
CPU: 6min 36.275s
CGroup: /system.slice/pveproxy.service
├─ 15628 pveproxy
├─569289 pveproxy worker
├─681981 pveproxy worker
└─761737 pveproxy worker
Jan 21 13:40:28 sky11 pveproxy[310902]: proxy detected vanished client connection
Jan 21 13:40:28 sky11 pveproxy[310902]: proxy detected vanished client connection
Jan 21 13:46:47 sky11 pveproxy[3186692]: worker exit
Jan 21 13:46:47 sky11 pveproxy[15628]: worker 3186692 finished
Jan 21 13:46:47 sky11 pveproxy[15628]: starting 1 worker(s)
Jan 21 13:46:47 sky11 pveproxy[15628]: worker 681981 started
Jan 21 14:06:24 sky11 pveproxy[15628]: worker 310902 finished
Jan 21 14:06:24 sky11 pveproxy[15628]: starting 1 worker(s)
Jan 21 14:06:24 sky11 pveproxy[15628]: worker 761737 started
Jan 21 14:06:25 sky11 pveproxy[761736]: worker exit
However, pct list just hangs.
Going through ps I see this:
581286 ? Ds 0:13 [lxc monitor] /var/lib/lxc 20103
along with a ton of matching lxc-info processes. Trying to kill the process with kill -9 does nothing, and strace shows nothing either. Restarting pveproxy doesn't change anything. There are no hung mounts, and Ceph is showing healthy. The rbd for the container can be mounted using pct mount, but attempting fsck reports that the disk is still in use (which makes sense, since the container task is still present).
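For reference, this is roughly how I've been enumerating the stuck tasks. Since the process is in D (uninterruptible sleep) state, SIGKILL can't be delivered until it returns from the kernel, which explains why kill -9 is a no-op:
Code:
# list every D-state task and dump its kernel stack; a task in D state
# only handles signals (including SIGKILL) once it leaves the kernel,
# so kill -9 does nothing until whatever it is waiting on gets released
for pid in $(ps -eo pid=,stat= | awk '$2 ~ /^D/ {print $1}'); do
    echo "== PID $pid ($(cat /proc/$pid/comm 2>/dev/null)) =="
    cat /proc/$pid/wchan 2>/dev/null; echo
    cat /proc/$pid/stack 2>/dev/null
done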
I have two questions:
1. What is causing this?
2. Is there a way to correct it that doesn't involve rebooting the node?
diagnostic data:
Code:
# cat /proc/581286/wchan
call_rwsem_down_write_failed
# cat /proc/581286/stack
[<ffffffffbc3283f7>] call_rwsem_down_write_failed+0x17/0x30
[<ffffffffbbbd1a29>] unregister_shrinker+0x19/0x60
[<ffffffffbbc5641b>] deactivate_locked_super+0x3b/0x70
[<ffffffffbbc5693e>] deactivate_super+0x4e/0x60
[<ffffffffbbc785ff>] cleanup_mnt+0x3f/0x80
[<ffffffffbbc78682>] __cleanup_mnt+0x12/0x20
[<ffffffffbbaa5480>] task_work_run+0x80/0xa0
[<ffffffffbba031c4>] exit_to_usermode_loop+0xc4/0xd0
[<ffffffffbba03a19>] syscall_return_slowpath+0x59/0x60
[<ffffffffbc4000ec>] entry_SYSCALL_64_fastpath+0x7f/0x81
[<ffffffffffffffff>] 0xffffffffffffffff
# cat /proc/581286/status
Name: lxc-start
Umask: 0022
State: D (disk sleep)
Tgid: 581286
Ngid: 0
Pid: 581286
PPid: 1
TracerPid: 0
Uid: 0 0 0 0
Gid: 0 0 0 0
FDSize: 64
Groups:
NStgid: 581286
NSpid: 581286
NSpgid: 581286
NSsid: 581286
VmPeak: 50260 kB
VmSize: 50216 kB
VmLck: 0 kB
VmPin: 0 kB
VmHWM: 3616 kB
VmRSS: 3600 kB
RssAnon: 744 kB
RssFile: 2856 kB
RssShmem: 0 kB
VmData: 496 kB
VmStk: 132 kB
VmExe: 16 kB
VmLib: 6312 kB
VmPTE: 116 kB
VmPMD: 12 kB
VmSwap: 0 kB
HugetlbPages: 0 kB
Threads: 1
SigQ: 9/644304
SigPnd: 0000000000000100
ShdPnd: 0000000000000100
SigBlk: fffffffe77fbfab7
SigIgn: 0000000000001000
SigCgt: 0000000180000000
CapInh: 0000000000000000
CapPrm: 0000003fffffffff
CapEff: 0000003fffffffff
CapBnd: 0000003fffffffff
CapAmb: 0000000000000000
NoNewPrivs: 0
Seccomp: 0
Cpus_allowed: ffffffff
Cpus_allowed_list: 0-31
Mems_allowed: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000003
Mems_allowed_list: 0-1
voluntary_ctxt_switches: 580526
nonvoluntary_ctxt_switches: 393
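If I'm reading the stack right, the monitor's exit is stuck in the final unmount path (deactivate_super -> unregister_shrinker), waiting on a rwsem that something else apparently holds. In case it helps, dumping all blocked tasks via SysRq 'w' should show whatever else is blocked alongside it (assuming sysrq is enabled, or enabled temporarily as below):
Code:
# temporarily allow all sysrq functions, then dump every blocked
# (D-state) task with its kernel stack to the kernel ring buffer
echo 1 > /proc/sys/kernel/sysrq
echo w > /proc/sysrq-trigger
dmesg | tail -n 100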
system info:
Code:
# pveversion -v
proxmox-ve: 5.1-34 (running kernel: 4.13.13-3-pve)
pve-manager: 5.1-41 (running version: 5.1-41/0b958203)
pve-kernel-4.13.4-1-pve: 4.13.4-26
pve-kernel-4.13.13-2-pve: 4.13.13-33
pve-kernel-4.10.17-2-pve: 4.10.17-20
pve-kernel-4.13.8-3-pve: 4.13.8-30
pve-kernel-4.13.13-3-pve: 4.13.13-34
pve-kernel-4.10.17-3-pve: 4.10.17-23
libpve-http-server-perl: 2.0-8
lvm2: 2.02.168-pve6
corosync: 2.4.2-pve3
libqb0: 1.0.1-1
pve-cluster: 5.0-19
qemu-server: 5.0-18
pve-firmware: 2.0-3
libpve-common-perl: 5.0-25
libpve-guest-common-perl: 2.0-14
libpve-access-control: 5.0-7
libpve-storage-perl: 5.0-17
pve-libspice-server1: 0.12.8-3
vncterm: 1.5-3
pve-docs: 5.1-15
pve-qemu-kvm: 2.9.1-5
pve-container: 2.0-18
pve-firewall: 3.0-5
pve-ha-manager: 2.0-4
ksm-control-daemon: 1.2-2
glusterfs-client: 3.8.8-1
lxc-pve: 2.1.1-2
lxcfs: 2.0.8-1
criu: 2.11.1-1~bpo90
novnc-pve: 0.6-4
smartmontools: 6.5+svn4324-1
zfsutils-linux: 0.7.3-pve1~bpo9
ceph: 12.2.2-pve1
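For completeness, a quick way to check for stale users of the container's rbd device is below; /dev/rbd0 is just a placeholder, the real path is whatever rbd showmapped reports for this container's disk:
Code:
# map of rbd images to block devices on this node
rbd showmapped
# anything (processes or mounts) still using the device?
# /dev/rbd0 is a placeholder for the device showmapped lists
fuser -vm /dev/rbd0
lsof /dev/rbd0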