LXC Container stuck on startup, hangs pveproxy

alexskysilk

I have a container that has failed to start and has hung pveproxy, denying any new activity to the node, which is showing with a grey question mark in the GUI. pveproxy itself appears to be running:
Code:
# service pveproxy status
● pveproxy.service - PVE API Proxy Server
   Loaded: loaded (/lib/systemd/system/pveproxy.service; enabled; vendor preset: enabled)
   Active: active (running) since Tue 2018-01-09 09:18:38 PST; 1 weeks 5 days ago
  Process: 3186117 ExecReload=/usr/bin/pveproxy restart (code=exited, status=0/SUCCESS)
  Process: 15563 ExecStart=/usr/bin/pveproxy start (code=exited, status=0/SUCCESS)
 Main PID: 15628 (pveproxy)
    Tasks: 4 (limit: 4915)
   Memory: 179.8M
      CPU: 6min 36.275s
   CGroup: /system.slice/pveproxy.service
           ├─ 15628 pveproxy
           ├─569289 pveproxy worker
           ├─681981 pveproxy worker
           └─761737 pveproxy worker

Jan 21 13:40:28 sky11 pveproxy[310902]: proxy detected vanished client connection
Jan 21 13:40:28 sky11 pveproxy[310902]: proxy detected vanished client connection
Jan 21 13:46:47 sky11 pveproxy[3186692]: worker exit
Jan 21 13:46:47 sky11 pveproxy[15628]: worker 3186692 finished
Jan 21 13:46:47 sky11 pveproxy[15628]: starting 1 worker(s)
Jan 21 13:46:47 sky11 pveproxy[15628]: worker 681981 started
Jan 21 14:06:24 sky11 pveproxy[15628]: worker 310902 finished
Jan 21 14:06:24 sky11 pveproxy[15628]: starting 1 worker(s)
Jan 21 14:06:24 sky11 pveproxy[15628]: worker 761737 started
Jan 21 14:06:25 sky11 pveproxy[761736]: worker exit

However, pct list just hangs.

Going through ps I see this:
581286 ? Ds 0:13 [lxc monitor] /var/lib/lxc 20103

along with a ton of matching lxc-info processes. Trying to kill the process with kill -9 does nothing, and strace shows nothing either. Restarting pveproxy doesn't change anything. There are no hung mounts, and Ceph is showing healthy. The RBD for the container can be mounted using pct mount, but attempting fsck reports that the disk is still in use (which makes sense, since the container task is still present).
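
For reference, the sequence was roughly this (20103 being the affected container; whether fsck is invoked via pct or directly shouldn't change the outcome):
Code:
pct mount 20103     # mounts the CT rootfs on the host
pct fsck 20103      # fails here, reporting the volume is still in use
pct unmount 20103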

I have two questions:
1. What is causing this?
2. Is there a way to correct it that doesn't involve rebooting the node?
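
For anyone hitting the same symptom, a quick, generic way to list every uninterruptible (D-state) task together with the kernel function it is blocked in:
Code:
ps -eo pid,ppid,stat,wchan:40,cmd | awk 'NR==1 || $3 ~ /^D/'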

diagnostic data:
Code:
# cat /proc/581286/wchan
call_rwsem_down_write_failed

# cat /proc/581286/stack
[<ffffffffbc3283f7>] call_rwsem_down_write_failed+0x17/0x30
[<ffffffffbbbd1a29>] unregister_shrinker+0x19/0x60
[<ffffffffbbc5641b>] deactivate_locked_super+0x3b/0x70
[<ffffffffbbc5693e>] deactivate_super+0x4e/0x60
[<ffffffffbbc785ff>] cleanup_mnt+0x3f/0x80
[<ffffffffbbc78682>] __cleanup_mnt+0x12/0x20
[<ffffffffbbaa5480>] task_work_run+0x80/0xa0
[<ffffffffbba031c4>] exit_to_usermode_loop+0xc4/0xd0
[<ffffffffbba03a19>] syscall_return_slowpath+0x59/0x60
[<ffffffffbc4000ec>] entry_SYSCALL_64_fastpath+0x7f/0x81
[<ffffffffffffffff>] 0xffffffffffffffff

# cat /proc/581286/status
Name:   lxc-start
Umask:  0022
State:  D (disk sleep)
Tgid:   581286
Ngid:   0
Pid:    581286
PPid:   1
TracerPid:      0
Uid:    0       0       0       0
Gid:    0       0       0       0
FDSize: 64
Groups:
NStgid: 581286
NSpid:  581286
NSpgid: 581286
NSsid:  581286
VmPeak:    50260 kB
VmSize:    50216 kB
VmLck:         0 kB
VmPin:         0 kB
VmHWM:      3616 kB
VmRSS:      3600 kB
RssAnon:             744 kB
RssFile:            2856 kB
RssShmem:              0 kB
VmData:      496 kB
VmStk:       132 kB
VmExe:        16 kB
VmLib:      6312 kB
VmPTE:       116 kB
VmPMD:        12 kB
VmSwap:        0 kB
HugetlbPages:          0 kB
Threads:        1
SigQ:   9/644304
SigPnd: 0000000000000100
ShdPnd: 0000000000000100
SigBlk: fffffffe77fbfab7
SigIgn: 0000000000001000
SigCgt: 0000000180000000
CapInh: 0000000000000000
CapPrm: 0000003fffffffff
CapEff: 0000003fffffffff
CapBnd: 0000003fffffffff
CapAmb: 0000000000000000
NoNewPrivs:     0
Seccomp:        0
Cpus_allowed:   ffffffff
Cpus_allowed_list:      0-31
Mems_allowed:   00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000003
Mems_allowed_list:      0-1
voluntary_ctxt_switches:        580526
nonvoluntary_ctxt_switches:     393
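
Since the stack shows lxc-start blocked on a rw-semaphore inside unregister_shrinker() during an unmount, here is a generic loop to spot any other task sitting in the same code path (the actual lock holder may well be running and not show up here, but it's a cheap check):
Code:
# dump the kernel stack of every task whose stack mentions the shrinker/rwsem path
for s in /proc/[0-9]*/stack; do
    if grep -qE 'rwsem|shrinker' "$s" 2>/dev/null; then
        p=${s%/stack}
        echo "== PID ${p##*/} =="
        cat "$s"
    fi
done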

system info:
Code:
# pveversion -v
proxmox-ve: 5.1-34 (running kernel: 4.13.13-3-pve)
pve-manager: 5.1-41 (running version: 5.1-41/0b958203)
pve-kernel-4.13.4-1-pve: 4.13.4-26
pve-kernel-4.13.13-2-pve: 4.13.13-33
pve-kernel-4.10.17-2-pve: 4.10.17-20
pve-kernel-4.13.8-3-pve: 4.13.8-30
pve-kernel-4.13.13-3-pve: 4.13.13-34
pve-kernel-4.10.17-3-pve: 4.10.17-23
libpve-http-server-perl: 2.0-8
lvm2: 2.02.168-pve6
corosync: 2.4.2-pve3
libqb0: 1.0.1-1
pve-cluster: 5.0-19
qemu-server: 5.0-18
pve-firmware: 2.0-3
libpve-common-perl: 5.0-25
libpve-guest-common-perl: 2.0-14
libpve-access-control: 5.0-7
libpve-storage-perl: 5.0-17
pve-libspice-server1: 0.12.8-3
vncterm: 1.5-3
pve-docs: 5.1-15
pve-qemu-kvm: 2.9.1-5
pve-container: 2.0-18
pve-firewall: 3.0-5
pve-ha-manager: 2.0-4
ksm-control-daemon: 1.2-2
glusterfs-client: 3.8.8-1
lxc-pve: 2.1.1-2
lxcfs: 2.0.8-1
criu: 2.11.1-1~bpo90
novnc-pve: 0.6-4
smartmontools: 6.5+svn4324-1
zfsutils-linux: 0.7.3-pve1~bpo9
ceph: 12.2.2-pve1
 
If you map the rbd image, does the lxc-monitor change its state?
 
The RBD is mapped at the time of the fault (in this case to rbd31). Do you mean to map it to another device?

Additional information:

It has happened again, this time with a container stuck stopping. The common denominator between these events is that the systems were processing vzdump tasks at the time; the tasks appeared to have finished but did not clear, and had to be killed. Even after the vzdump task was killed, the hung container cannot be killed.

# cat /proc/2052791/status
Name: lxc-start
Umask: 0022
State: D (disk sleep)

# cat /proc/2052791/stack
[<ffffffffb69283f7>] call_rwsem_down_write_failed+0x17/0x30
[<ffffffffb61d1a29>] unregister_shrinker+0x19/0x60
[<ffffffffb625641b>] deactivate_locked_super+0x3b/0x70
[<ffffffffb625693e>] deactivate_super+0x4e/0x60
[<ffffffffb62785ff>] cleanup_mnt+0x3f/0x80
[<ffffffffb6278682>] __cleanup_mnt+0x12/0x20
[<ffffffffb60a5480>] task_work_run+0x80/0xa0
[<ffffffffb60031c4>] exit_to_usermode_loop+0xc4/0xd0
[<ffffffffb6003a19>] syscall_return_slowpath+0x59/0x60
[<ffffffffb6a000ec>] entry_SYSCALL_64_fastpath+0x7f/0x81
[<ffffffffffffffff>] 0xffffffffffffffff
 
For what it's worth, I believe I am also having this same issue, although my storage is local ZFS rather than Ceph. '/etc/pve' is empty, corosync is running, 'pve-cluster' won't start, and 'pmxcfs' is in D status and can't be killed. At first I tried restarting pveproxy and pvestatd, which didn't help. Trying to restart pve-cluster is when I lost the '/etc/pve' filesystem. This all happened after trying to start an LXC container that hung. I'm going to try rebooting tonight after hours.
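
For completeness, the state above can be confirmed with something along these lines (generic commands, nothing specific to my setup):
Code:
systemctl status corosync pve-cluster pveproxy pvestatd
ps -o pid,stat,wchan:32,cmd -C pmxcfs   # shows the D state and what it waits on
findmnt /etc/pve                        # no output once the fuse mount is gone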
 
Couldn't edit my last post; it said it was spam-like or something. I'm seeing:

Code:
# ps aux|grep pmxcfs
root      5612  0.2  0.0 812048 17652 ?        Ds   Jan16  19:03 /usr/bin/pmxcfs

# cat /proc/5612/stack 
[<ffffffff9f129497>] call_rwsem_down_write_failed+0x17/0x30
[<ffffffff9e9d22a9>] unregister_shrinker+0x19/0x60
[<ffffffff9ea56c9b>] deactivate_locked_super+0x3b/0x70
[<ffffffff9ea571be>] deactivate_super+0x4e/0x60
[<ffffffff9ea78e7f>] cleanup_mnt+0x3f/0x80
[<ffffffff9ea78f02>] __cleanup_mnt+0x12/0x20
[<ffffffff9e8a5cd0>] task_work_run+0x80/0xa0
[<ffffffff9e8031c4>] exit_to_usermode_loop+0xc4/0xd0
[<ffffffff9e803a19>] syscall_return_slowpath+0x59/0x60
[<ffffffff9f20024d>] entry_SYSCALL_64_fastpath+0xa1/0xa3
[<ffffffffffffffff>] 0xffffffffffffffff

# ls -l /etc/pve
total 0

# pvecm status
pve configuration filesystem not mounted

# pveversion -v
proxmox-ve: 5.1-36 (running kernel: 4.13.13-5-pve)
pve-manager: 5.1-43 (running version: 5.1-43/bdb08029)
pve-kernel-4.13.4-1-pve: 4.13.4-26
pve-kernel-4.13.13-4-pve: 4.13.13-35
pve-kernel-4.13.13-2-pve: 4.13.13-33
pve-kernel-4.13.13-5-pve: 4.13.13-36
pve-kernel-4.13.13-1-pve: 4.13.13-31
libpve-http-server-perl: 2.0-8
lvm2: 2.02.168-pve6
corosync: 2.4.2-pve3
libqb0: 1.0.1-1
pve-cluster: 5.0-19
qemu-server: 5.0-20
pve-firmware: 2.0-3
libpve-common-perl: 5.0-25
libpve-guest-common-perl: 2.0-14
libpve-access-control: 5.0-7
libpve-storage-perl: 5.0-17
pve-libspice-server1: 0.12.8-3
vncterm: 1.5-3
pve-docs: 5.1-16
pve-qemu-kvm: 2.9.1-6
pve-container: 2.0-18
pve-firewall: 3.0-5
pve-ha-manager: 2.0-4
ksm-control-daemon: 1.2-2
glusterfs-client: 3.8.8-1
lxc-pve: 2.1.1-2
lxcfs: 2.0.8-1
criu: 2.11.1-1~bpo90
novnc-pve: 0.6-4
smartmontools: 6.5+svn4324-1
zfsutils-linux: 0.7.3-pve1~bpo9
 
The RBD is mapped at the time of the fault (in this case to rbd31). Do you mean to map it to another device?
If the rbd image is still mapped, try to unmap it. If that fails, check whether it has any watchers.

Code:
rbd showmapped
Shows all images mapped locally.

Code:
rbd status <pool>/<disk>
Tells you if any disk has watchers (open connections).
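
If the image refuses to unmap even though the watcher looks stale, a forced unmap is worth a try (the pool/image name below is only a placeholder; -o force needs a reasonably recent kernel and discards in-flight I/O, so treat it as a last resort):
Code:
rbd unmap /dev/rbd31                    # normal unmap, by device
rbd unmap -o force rbd/vm-20103-disk-1  # forced unmap, by pool/image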
 
Today I had another occurrence. I force-unmapped the rbd as suggested, but the container remains in D state. Once again, it happened during vzdump backups, although it doesn't appear the affected container was involved.

There is a vzdump task still holding the queue even though its log file shows the task finished:

659637 ? Ss 0:00 task UPID:sky11:000A10B5:02B72A45:5A6C0D56:vzdump::skysilkapi2@pve:
Code:
find /var/log/pve/tasks/ -name *5A6C0D56:vzdump* | xargs -I {} cat {}
INFO: starting new backup job: vzdump 15663 --mode snapshot --node sky11 --storage adhoc --remove 0 --compress lzo
INFO: Starting Backup of VM 15663 (lxc)
INFO: status = running
INFO: CT Name: wordpressweb.site
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: create storage snapshot 'vzdump'
/dev/rbd44
INFO: creating archive '/mnt/pve/adhoc/dump/vzdump-lxc-15663-2018_01_26-21_25_42.tar.lzo'
INFO: Total bytes written: 27697930240 (26GiB, 62MiB/s)
INFO: archive file size: 10.82GB

Once killed, the task updated to show:
command 'umount -l -d /mnt/vzsnap0/' failed: interrupted by signal

The plot thickens.

# ps ax | grep vzsnap
737063 ? D 0:00 umount -l -d /mnt/vzsnap0/
1179554 ? D 0:00 mount -o ro,noload /dev/rbd68 /mnt/vzsnap0//

Lookie what's happening: we have contention for /mnt/vzsnap0, and the mountpoint is getting clobbered. I will need to reboot the node to clear this up, but there is obviously a race condition with backups. Devs, this is probably a bug.
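
Before rebooting, this is roughly what I'd run to document who is sitting on the shared snapshot mountpoint (it won't resolve the D-state, but it captures the overlap):
Code:
findmnt /mnt/vzsnap0                          # what, if anything, is mounted there
fuser -vm /mnt/vzsnap0 2>/dev/null            # processes holding the mountpoint open
cat /proc/737063/stack /proc/1179554/stack    # the stuck umount/mount from above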
 
How is (are) your backup plan(s) configured? And what does your storage configuration look like?
 
How is (are) your backup plan(s) configured?
vzdump --mailto $recipient --mailnotification failure --mode snapshot --storage altbackup --compress lzo --quiet 1 --all 1 --exclude ${exclude[*]}

And what does your storage configuration look like?
From:
rbd: vdisk-3pg
content rootdir,images
krbd 1
monhost 10.18.21.21;10.18.21.23;10.18.21.25;10.18.21.26;10.18.21.27
pool rbd
username admin​

To:
nfs: altbackup
export /nfsroot/backup61
path /mnt/pve/altbackup
server 10.14.0.61
content backup
maxfiles 3
options vers=3,noatime,tcp,nolock​
 
# ps ax | grep vzsnap
737063 ? D 0:00 umount -l -d /mnt/vzsnap0/
1179554 ? D 0:00 mount -o ro,noload /dev/rbd68 /mnt/vzsnap0//
Is the Ceph storage working properly when this happens? As the mount is already waiting in the 'D' state, is it possible to create a snapshot, then map and mount it by hand? Is there anything in the Ceph logs on the host?

monhost 10.18.21.21;10.18.21.23;10.18.21.25;10.18.21.26;10.18.21.27
As an aside, unless you have a cluster with thousands of Ceph hosts/clients, you don't need 5 monitors. The extra two add a little performance overhead.
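
Back to the snapshot test: to try it by hand, something along these lines should do (pool/image name and mountpoint are placeholders; clean the snapshot up afterwards):
Code:
rbd snap create rbd/vm-20103-disk-1@manualtest
DEV=$(rbd map rbd/vm-20103-disk-1@manualtest)   # rbd map prints the /dev/rbdX device
mkdir -p /mnt/manualtest
mount -o ro,noload "$DEV" /mnt/manualtest       # noload assumes an ext4 rootfs
ls /mnt/manualtest
umount /mnt/manualtest
rbd unmap "$DEV"
rbd snap rm rbd/vm-20103-disk-1@manualtest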
 
Is the Ceph storage working properly when this happens? As the mount is already waiting in the 'D' state, is it possible to create a snapshot, then map and mount it by hand? Is there anything in the Ceph logs on the host?

The filesystem is healthy during the event (ceph health shows HEALTH_OK). As for the mount, if you're referring to the underlying rbd: yes, it's working normally, is mounted, and can be unmounted and unmapped. The snapshot isn't. Ceph logs don't show any unusual behavior. I can produce a complete data set for regression (during the weekend ;)), just let me know what you'll be looking for.

As an aside, unless you have a cluster with thousands of Ceph hosts/clients, you don't need 5 monitors. The extra two add a little performance overhead.

Thanks :) I'll kill the extras.
 
I am still looking for the bottleneck that hangs the container and hence the backup. Could you run the container in the foreground and attach strace to it, to see where it might hang?
Code:
lxc-start -n ID -F -l DEBUG -o /tmp/lxc-ID.log
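
For the strace part, attaching from a second shell with something like this usually does it (ID as above, PID being the foreground lxc-start):
Code:
strace -f -tt -o /tmp/lxc-ID.strace -p <PID>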

Maybe you also need to increase the logging for RBD to get something in the logs.
http://docs.ceph.com/docs/master/rados/troubleshooting/log-and-debug/
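
For librbd clients that usually means something like the following in ceph.conf; since the storage here uses krbd, the kernel client logs to dmesg instead and needs dynamic debug enabled (both snippets are generic examples, not tuned values):
Code:
# /etc/ceph/ceph.conf, librbd clients
[client]
    debug rbd = 20
    debug ms = 1

# krbd: enable kernel rbd debug messages (requires CONFIG_DYNAMIC_DEBUG), then watch dmesg
echo 'module rbd +p' > /sys/kernel/debug/dynamic_debug/control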

Please also update to the newest kernel, there have been some follow-up fixes on the Meltdown/Spectre bug.
 
I'm seeing something similar to this, where all of my LXCs and the node itself show a "?".

[screenshot: the node and all containers marked with grey question marks in the GUI]


But the containers are mostly still running. This is after trying to restart one LXC.

root@pve:~# ps aux| grep pmxcfs
root 1955 0.1 0.1 710848 43664 ? Ssl Jan29 7:15 /usr/bin/pmxcfs
root 8292 0.0 0.0 12788 976 pts/12 S+ 13:37 0:00 grep pmxcfs

root@pve:~# cat /proc/1955/stack
[<ffffffffb47116ea>] futex_wait_queue_me+0xca/0x130
[<ffffffffb4712019>] futex_wait+0x119/0x260
[<ffffffffb4714477>] do_futex+0x2d7/0xaa0
[<ffffffffb4714cc3>] SyS_futex+0x83/0x180
[<ffffffffb500008b>] entry_SYSCALL_64_fastpath+0x1e/0x81
[<ffffffffffffffff>] 0xffffffffffffffff

root@pve:~# ls -l /etc/pve
total 4
-rw-r----- 1 root www-data 451 Dec 23 2015 authkey.pub
-rw-r----- 1 root www-data 441 Jul 27 2017 corosync.conf
-rw-r----- 1 root www-data 16 Dec 23 2015 datacenter.cfg
drwxr-xr-x 2 root www-data 0 Jul 27 2017 ha
lrwxr-xr-x 1 root www-data 0 Dec 31 1969 local -> nodes/pve
lrwxr-xr-x 1 root www-data 0 Dec 31 1969 lxc -> nodes/pve/lxc
drwxr-xr-x 2 root www-data 0 Dec 23 2015 nodes
lrwxr-xr-x 1 root www-data 0 Dec 31 1969 openvz -> nodes/pve/openvz
drwx------ 2 root www-data 0 Dec 23 2015 priv
-rw-r----- 1 root www-data 1350 Dec 23 2015 pve-root-ca.pem
-rw-r----- 1 root www-data 1679 Dec 23 2015 pve-www.key
lrwxr-xr-x 1 root www-data 0 Dec 31 1969 qemu-server -> nodes/pve/qemu-server
-rw-r----- 1 root www-data 782 Jan 2 16:47 storage.cfg
-rw-r----- 1 root www-data 48 May 12 2016 user.cfg
-rw-r----- 1 root www-data 618 Jan 3 17:19 vzdump.cron
 
Could you run the container in the foreground and attach strace to it, to see where it might hang?

So it's Saturday, and I have my now-predictable hung container (only one this time). I can't run it in the foreground since it's already running and in D state, and I can't kill it to restart it. Ceph is healthy, and there is nothing overt in the log to suggest a ceph/rbd fault. New tasks operate normally, but the node is greyed out (as in @lasrb0isct's illustration above) and pct commands work for specific containers, but not for node-wide operations (e.g. pct list).

It is becoming painfully clear that I should not use vzdump in production; regardless of the source of the fault, the potential for blocking cluster management is a major design flaw.
 
The node is grey because the pvestatd daemon tries to establish the state of the container and hangs. Is it possible to create a snapshot by hand with the ceph tools (rbd snap) and then map it?
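
A quick, generic way to confirm that pvestatd is the wedged piece:
Code:
systemctl status pvestatd
journalctl -u pvestatd -n 50 --no-pager
ps -o pid,stat,wchan:32,cmd -C pvestatd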
 
No, I meant in general. If so, then it would mean that there is a problem with mapping a snapshot, and we could start to investigate deeper there.
 
Not really. There is a substantial difference in the fault, namely that the hung process in @denos's situation is recoverable. In my scenario the lxc process is a zombie and cannot be killed. While both setups utilize kernel modules to access the storage, I don't know if this particular issue is specific to ceph (me), zfs (denos), or the kernel...
 
