LXC Container stuck on startup, hangs pveproxy

alexskysilk

I have a container that has failed to start and has hung pveproxy, denying any new activity to the node, which is showing with a grey question mark in the GUI. pveproxy itself appears to be running:
Code:
# service pveproxy status
● pveproxy.service - PVE API Proxy Server
   Loaded: loaded (/lib/systemd/system/pveproxy.service; enabled; vendor preset: enabled)
   Active: active (running) since Tue 2018-01-09 09:18:38 PST; 1 weeks 5 days ago
  Process: 3186117 ExecReload=/usr/bin/pveproxy restart (code=exited, status=0/SUCCESS)
  Process: 15563 ExecStart=/usr/bin/pveproxy start (code=exited, status=0/SUCCESS)
 Main PID: 15628 (pveproxy)
    Tasks: 4 (limit: 4915)
   Memory: 179.8M
      CPU: 6min 36.275s
   CGroup: /system.slice/pveproxy.service
           ├─ 15628 pveproxy
           ├─569289 pveproxy worker
           ├─681981 pveproxy worker
           └─761737 pveproxy worker

Jan 21 13:40:28 sky11 pveproxy[310902]: proxy detected vanished client connection
Jan 21 13:40:28 sky11 pveproxy[310902]: proxy detected vanished client connection
Jan 21 13:46:47 sky11 pveproxy[3186692]: worker exit
Jan 21 13:46:47 sky11 pveproxy[15628]: worker 3186692 finished
Jan 21 13:46:47 sky11 pveproxy[15628]: starting 1 worker(s)
Jan 21 13:46:47 sky11 pveproxy[15628]: worker 681981 started
Jan 21 14:06:24 sky11 pveproxy[15628]: worker 310902 finished
Jan 21 14:06:24 sky11 pveproxy[15628]: starting 1 worker(s)
Jan 21 14:06:24 sky11 pveproxy[15628]: worker 761737 started
Jan 21 14:06:25 sky11 pveproxy[761736]: worker exit

However, pct list just hangs.

Going through ps I see this:
581286 ? Ds 0:13 [lxc monitor] /var/lib/lxc 20103

along with a ton of matching lxc-info processes. Trying to kill the process with kill -9 does nothing, and strace shows nothing either. Restarting pveproxy doesn't change anything. There are no hung mounts, and Ceph is showing healthy. The RBD for the container can be mounted using pct mount, but attempting fsck reports that the disk is still in use (which makes sense, since the container task is still present).
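
For reference, the sequence was roughly this (20103 being the affected container; whether fsck is invoked via pct or directly shouldn't change the outcome):
Code:
pct mount 20103     # mounts the CT rootfs on the host
pct fsck 20103      # fails here, reporting the volume is still in use
pct unmount 20103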

I have two questions:
1. What is causing this?
2. Is there a way to correct it that doesn't involve rebooting the node?
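
For anyone hitting the same symptom, a quick, generic way to list every uninterruptible (D-state) task together with the kernel function it is blocked in:
Code:
ps -eo pid,ppid,stat,wchan:40,cmd | awk 'NR==1 || $3 ~ /^D/'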

diagnostic data:
Code:
# cat /proc/581286/wchan
call_rwsem_down_write_failed

# cat /proc/581286/stack
[<ffffffffbc3283f7>] call_rwsem_down_write_failed+0x17/0x30
[<ffffffffbbbd1a29>] unregister_shrinker+0x19/0x60
[<ffffffffbbc5641b>] deactivate_locked_super+0x3b/0x70
[<ffffffffbbc5693e>] deactivate_super+0x4e/0x60
[<ffffffffbbc785ff>] cleanup_mnt+0x3f/0x80
[<ffffffffbbc78682>] __cleanup_mnt+0x12/0x20
[<ffffffffbbaa5480>] task_work_run+0x80/0xa0
[<ffffffffbba031c4>] exit_to_usermode_loop+0xc4/0xd0
[<ffffffffbba03a19>] syscall_return_slowpath+0x59/0x60
[<ffffffffbc4000ec>] entry_SYSCALL_64_fastpath+0x7f/0x81
[<ffffffffffffffff>] 0xffffffffffffffff

# cat /proc/581286/status
Name:   lxc-start
Umask:  0022
State:  D (disk sleep)
Tgid:   581286
Ngid:   0
Pid:    581286
PPid:   1
TracerPid:      0
Uid:    0       0       0       0
Gid:    0       0       0       0
FDSize: 64
Groups:
NStgid: 581286
NSpid:  581286
NSpgid: 581286
NSsid:  581286
VmPeak:    50260 kB
VmSize:    50216 kB
VmLck:         0 kB
VmPin:         0 kB
VmHWM:      3616 kB
VmRSS:      3600 kB
RssAnon:             744 kB
RssFile:            2856 kB
RssShmem:              0 kB
VmData:      496 kB
VmStk:       132 kB
VmExe:        16 kB
VmLib:      6312 kB
VmPTE:       116 kB
VmPMD:        12 kB
VmSwap:        0 kB
HugetlbPages:          0 kB
Threads:        1
SigQ:   9/644304
SigPnd: 0000000000000100
ShdPnd: 0000000000000100
SigBlk: fffffffe77fbfab7
SigIgn: 0000000000001000
SigCgt: 0000000180000000
CapInh: 0000000000000000
CapPrm: 0000003fffffffff
CapEff: 0000003fffffffff
CapBnd: 0000003fffffffff
CapAmb: 0000000000000000
NoNewPrivs:     0
Seccomp:        0
Cpus_allowed:   ffffffff
Cpus_allowed_list:      0-31
Mems_allowed:   00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000003
Mems_allowed_list:      0-1
voluntary_ctxt_switches:        580526
nonvoluntary_ctxt_switches:     393
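
Since the stack shows lxc-start blocked on a rw-semaphore inside unregister_shrinker() during an unmount, here is a generic loop to spot any other task sitting in the same code path (the actual lock holder may well be running and not show up here, but it's a cheap check):
Code:
# dump the kernel stack of every task whose stack mentions the shrinker/rwsem path
for s in /proc/[0-9]*/stack; do
    if grep -qE 'rwsem|shrinker' "$s" 2>/dev/null; then
        p=${s%/stack}
        echo "== PID ${p##*/} =="
        cat "$s"
    fi
done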

system info:
Code:
# pveversion -v
proxmox-ve: 5.1-34 (running kernel: 4.13.13-3-pve)
pve-manager: 5.1-41 (running version: 5.1-41/0b958203)
pve-kernel-4.13.4-1-pve: 4.13.4-26
pve-kernel-4.13.13-2-pve: 4.13.13-33
pve-kernel-4.10.17-2-pve: 4.10.17-20
pve-kernel-4.13.8-3-pve: 4.13.8-30
pve-kernel-4.13.13-3-pve: 4.13.13-34
pve-kernel-4.10.17-3-pve: 4.10.17-23
libpve-http-server-perl: 2.0-8
lvm2: 2.02.168-pve6
corosync: 2.4.2-pve3
libqb0: 1.0.1-1
pve-cluster: 5.0-19
qemu-server: 5.0-18
pve-firmware: 2.0-3
libpve-common-perl: 5.0-25
libpve-guest-common-perl: 2.0-14
libpve-access-control: 5.0-7
libpve-storage-perl: 5.0-17
pve-libspice-server1: 0.12.8-3
vncterm: 1.5-3
pve-docs: 5.1-15
pve-qemu-kvm: 2.9.1-5
pve-container: 2.0-18
pve-firewall: 3.0-5
pve-ha-manager: 2.0-4
ksm-control-daemon: 1.2-2
glusterfs-client: 3.8.8-1
lxc-pve: 2.1.1-2
lxcfs: 2.0.8-1
criu: 2.11.1-1~bpo90
novnc-pve: 0.6-4
smartmontools: 6.5+svn4324-1
zfsutils-linux: 0.7.3-pve1~bpo9
ceph: 12.2.2-pve1
 
If you map the rbd image, does the lxc-monitor change its state?
 
The RBD is mapped at the time of the fault (in this case to rbd31). Do you mean to map it to another device?

Additional information:

It has happened again, this time with a container stuck stopping. The common denominator between these events is that the systems were processing vzdump tasks at the time; the tasks appeared to have finished but did not clear, and had to be killed. Even after the vzdump task was killed, the hung container cannot be killed.

# cat /proc/2052791/status
Name: lxc-start
Umask: 0022
State: D (disk sleep)

# cat /proc/2052791/stack
[<ffffffffb69283f7>] call_rwsem_down_write_failed+0x17/0x30
[<ffffffffb61d1a29>] unregister_shrinker+0x19/0x60
[<ffffffffb625641b>] deactivate_locked_super+0x3b/0x70
[<ffffffffb625693e>] deactivate_super+0x4e/0x60
[<ffffffffb62785ff>] cleanup_mnt+0x3f/0x80
[<ffffffffb6278682>] __cleanup_mnt+0x12/0x20
[<ffffffffb60a5480>] task_work_run+0x80/0xa0
[<ffffffffb60031c4>] exit_to_usermode_loop+0xc4/0xd0
[<ffffffffb6003a19>] syscall_return_slowpath+0x59/0x60
[<ffffffffb6a000ec>] entry_SYSCALL_64_fastpath+0x7f/0x81
[<ffffffffffffffff>] 0xffffffffffffffff
 
For what it's worth, I believe I am also having this same issue, although my storage is local ZFS rather than Ceph. '/etc/pve' is empty, corosync is running, 'pve-cluster' won't start, and 'pmxcfs' is in D status and can't be killed. At first I tried restarting pveproxy and pvestatd, which didn't help. Trying to restart pve-cluster is when I lost the '/etc/pve' filesystem. This all happened after trying to start an LXC container that hung. I'm going to try rebooting tonight after hours.
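
For completeness, the state above can be confirmed with something along these lines (generic commands, nothing specific to my setup):
Code:
systemctl status corosync pve-cluster pveproxy pvestatd
ps -o pid,stat,wchan:32,cmd -C pmxcfs   # shows the D state and what it waits on
findmnt /etc/pve                        # no output once the fuse mount is gone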
 
Couldn't edit my last post; it said it was spam-like or something. I'm seeing:

Code:
# ps aux|grep pmxcfs
root      5612  0.2  0.0 812048 17652 ?        Ds   Jan16  19:03 /usr/bin/pmxcfs

# cat /proc/5612/stack 
[<ffffffff9f129497>] call_rwsem_down_write_failed+0x17/0x30
[<ffffffff9e9d22a9>] unregister_shrinker+0x19/0x60
[<ffffffff9ea56c9b>] deactivate_locked_super+0x3b/0x70
[<ffffffff9ea571be>] deactivate_super+0x4e/0x60
[<ffffffff9ea78e7f>] cleanup_mnt+0x3f/0x80
[<ffffffff9ea78f02>] __cleanup_mnt+0x12/0x20
[<ffffffff9e8a5cd0>] task_work_run+0x80/0xa0
[<ffffffff9e8031c4>] exit_to_usermode_loop+0xc4/0xd0
[<ffffffff9e803a19>] syscall_return_slowpath+0x59/0x60
[<ffffffff9f20024d>] entry_SYSCALL_64_fastpath+0xa1/0xa3
[<ffffffffffffffff>] 0xffffffffffffffff

# ls -l /etc/pve
total 0

# pvecm status
pve configuration filesystem not mounted

# pveversion -v
proxmox-ve: 5.1-36 (running kernel: 4.13.13-5-pve)
pve-manager: 5.1-43 (running version: 5.1-43/bdb08029)
pve-kernel-4.13.4-1-pve: 4.13.4-26
pve-kernel-4.13.13-4-pve: 4.13.13-35
pve-kernel-4.13.13-2-pve: 4.13.13-33
pve-kernel-4.13.13-5-pve: 4.13.13-36
pve-kernel-4.13.13-1-pve: 4.13.13-31
libpve-http-server-perl: 2.0-8
lvm2: 2.02.168-pve6
corosync: 2.4.2-pve3
libqb0: 1.0.1-1
pve-cluster: 5.0-19
qemu-server: 5.0-20
pve-firmware: 2.0-3
libpve-common-perl: 5.0-25
libpve-guest-common-perl: 2.0-14
libpve-access-control: 5.0-7
libpve-storage-perl: 5.0-17
pve-libspice-server1: 0.12.8-3
vncterm: 1.5-3
pve-docs: 5.1-16
pve-qemu-kvm: 2.9.1-6
pve-container: 2.0-18
pve-firewall: 3.0-5
pve-ha-manager: 2.0-4
ksm-control-daemon: 1.2-2
glusterfs-client: 3.8.8-1
lxc-pve: 2.1.1-2
lxcfs: 2.0.8-1
criu: 2.11.1-1~bpo90
novnc-pve: 0.6-4
smartmontools: 6.5+svn4324-1
zfsutils-linux: 0.7.3-pve1~bpo9
 
The RBD is mapped at the time of the fault (in this case to rbd31). Do you mean to map it to another device?
If the rbd image is still mapped, try to unmap it. If that fails, check whether it has any watchers.

Code:
rbd showmapped
Shows all images mapped locally.

Code:
rbd status <pool>/<disk>
Tells you if any disk has watchers (open connections).
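
If the image refuses to unmap even though the watcher looks stale, a forced unmap is worth a try (the pool/image name below is only a placeholder; -o force needs a reasonably recent kernel and discards in-flight I/O, so treat it as a last resort):
Code:
rbd unmap /dev/rbd31                    # normal unmap, by device
rbd unmap -o force rbd/vm-20103-disk-1  # forced unmap, by pool/image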
 
Today I had another occurrence. I force-unmapped the rbd as suggested, but the container remains in D state. Once again, it happened during vzdump backups, although it doesn't appear the affected container was involved.

There is a vzdump task still holding the queue even though its log file shows the task finished:

659637 ? Ss 0:00 task UPID:sky11:000A10B5:02B72A45:5A6C0D56:vzdump::skysilkapi2@pve:
Code:
find /var/log/pve/tasks/ -name *5A6C0D56:vzdump* | xargs -I {} cat {}
INFO: starting new backup job: vzdump 15663 --mode snapshot --node sky11 --storage adhoc --remove 0 --compress lzo
INFO: Starting Backup of VM 15663 (lxc)
INFO: status = running
INFO: CT Name: wordpressweb.site
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: create storage snapshot 'vzdump'
/dev/rbd44
INFO: creating archive '/mnt/pve/adhoc/dump/vzdump-lxc-15663-2018_01_26-21_25_42.tar.lzo'
INFO: Total bytes written: 27697930240 (26GiB, 62MiB/s)
INFO: archive file size: 10.82GB

Once killed, the task updated to show:
command 'umount -l -d /mnt/vzsnap0/' failed: interrupted by signal

The plot thickens.

# ps ax | grep vzsnap
737063 ? D 0:00 umount -l -d /mnt/vzsnap0/
1179554 ? D 0:00 mount -o ro,noload /dev/rbd68 /mnt/vzsnap0//

Lookie what's happening: we have contention for /mnt/vzsnap0, and the mountpoint is getting clobbered. I will need to reboot the node to clear this up, but there is obviously a race condition with backups. Devs, this is probably a bug.
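
Before rebooting, this is roughly what I'd run to document who is sitting on the shared snapshot mountpoint (it won't resolve the D-state, but it captures the overlap):
Code:
findmnt /mnt/vzsnap0                          # what, if anything, is mounted there
fuser -vm /mnt/vzsnap0 2>/dev/null            # processes holding the mountpoint open
cat /proc/737063/stack /proc/1179554/stack    # the stuck umount/mount from above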
 
How is (are) your backup plan(s) configured? And what does your storage configuration look like?
 
How is (are) your backup plan(s) configured?
vzdump --mailto $recipient --mailnotification failure --mode snapshot --storage altbackup --compress lzo --quiet 1 --all 1 --exclude ${exclude[*]}

And what does your storage configuration look like?
From:
rbd: vdisk-3pg
content rootdir,images
krbd 1
monhost 10.18.21.21;10.18.21.23;10.18.21.25;10.18.21.26;10.18.21.27
pool rbd
username admin​

To:
nfs: altbackup
export /nfsroot/backup61
path /mnt/pve/altbackup
server 10.14.0.61
content backup
maxfiles 3
options vers=3,noatime,tcp,nolock​
 
# ps ax | grep vzsnap
737063 ? D 0:00 umount -l -d /mnt/vzsnap0/
1179554 ? D 0:00 mount -o ro,noload /dev/rbd68 /mnt/vzsnap0//
Is the Ceph storage working properly when this happens? As the mount is already waiting in the 'D' state, is it possible to create a snapshot, then map and mount it by hand? Is there anything in the Ceph logs on the host?

monhost 10.18.21.21;10.18.21.23;10.18.21.25;10.18.21.26;10.18.21.27
As an aside, unless you have a cluster with thousands of Ceph hosts/clients, you don't need 5 monitors. The extra two add a little performance overhead.
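
Back to the snapshot test: to try it by hand, something along these lines should do (pool/image name and mountpoint are placeholders; clean the snapshot up afterwards):
Code:
rbd snap create rbd/vm-20103-disk-1@manualtest
DEV=$(rbd map rbd/vm-20103-disk-1@manualtest)   # rbd map prints the /dev/rbdX device
mkdir -p /mnt/manualtest
mount -o ro,noload "$DEV" /mnt/manualtest       # noload assumes an ext4 rootfs
ls /mnt/manualtest
umount /mnt/manualtest
rbd unmap "$DEV"
rbd snap rm rbd/vm-20103-disk-1@manualtest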
 
Is the Ceph storage working properly when this happens? As the mount is already waiting in the 'D' state, is it possible to create a snapshot, then map and mount it by hand? Is there anything in the Ceph logs on the host?

The filesystem is healthy during the event (ceph health shows HEALTH_OK). As for the mount, if you're referring to the underlying rbd: yes, it's working normally, is mounted, and can be unmounted and unmapped. The snapshot isn't. Ceph logs don't show any unusual behavior. I can produce a complete data set for regression (during the weekend ;)), just let me know what you'll be looking for.

As an aside, unless you have a cluster with thousands of Ceph hosts/clients, you don't need 5 monitors. The extra two add a little performance overhead.

Thanks :) I'll kill the extras.
 
I am still looking for the bottleneck that hangs the container and hence the backup. Could you run the container in the foreground and attach strace to it, to see where it might hang?
Code:
lxc-start -n ID -F -l DEBUG -o /tmp/lxc-ID.log
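
For the strace part, attaching from a second shell with something like this usually does it (ID as above, PID being the foreground lxc-start):
Code:
strace -f -tt -o /tmp/lxc-ID.strace -p <PID>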

Maybe you also need to increase the logging for RBD to get something in the logs.
http://docs.ceph.com/docs/master/rados/troubleshooting/log-and-debug/
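
For librbd clients that usually means something like the following in ceph.conf; since the storage here uses krbd, the kernel client logs to dmesg instead and needs dynamic debug enabled (both snippets are generic examples, not tuned values):
Code:
# /etc/ceph/ceph.conf, librbd clients
[client]
    debug rbd = 20
    debug ms = 1

# krbd: enable kernel rbd debug messages (requires CONFIG_DYNAMIC_DEBUG), then watch dmesg
echo 'module rbd +p' > /sys/kernel/debug/dynamic_debug/control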

Please also update to the newest kernel, there have been some follow-up fixes on the Meltdown/Spectre bug.
 
I'm seeing something similar to this, where all of my LXCs and the node itself show a "?".

[screenshot: the node and all containers marked with grey question marks in the GUI]


But the containers are mostly still running. This is after trying to restart one LXC.

root@pve:~# ps aux| grep pmxcfs
root 1955 0.1 0.1 710848 43664 ? Ssl Jan29 7:15 /usr/bin/pmxcfs
root 8292 0.0 0.0 12788 976 pts/12 S+ 13:37 0:00 grep pmxcfs

root@pve:~# cat /proc/1955/stack
[<ffffffffb47116ea>] futex_wait_queue_me+0xca/0x130
[<ffffffffb4712019>] futex_wait+0x119/0x260
[<ffffffffb4714477>] do_futex+0x2d7/0xaa0
[<ffffffffb4714cc3>] SyS_futex+0x83/0x180
[<ffffffffb500008b>] entry_SYSCALL_64_fastpath+0x1e/0x81
[<ffffffffffffffff>] 0xffffffffffffffff

root@pve:~# ls -l /etc/pve
total 4
-rw-r----- 1 root www-data 451 Dec 23 2015 authkey.pub
-rw-r----- 1 root www-data 441 Jul 27 2017 corosync.conf
-rw-r----- 1 root www-data 16 Dec 23 2015 datacenter.cfg
drwxr-xr-x 2 root www-data 0 Jul 27 2017 ha
lrwxr-xr-x 1 root www-data 0 Dec 31 1969 local -> nodes/pve
lrwxr-xr-x 1 root www-data 0 Dec 31 1969 lxc -> nodes/pve/lxc
drwxr-xr-x 2 root www-data 0 Dec 23 2015 nodes
lrwxr-xr-x 1 root www-data 0 Dec 31 1969 openvz -> nodes/pve/openvz
drwx------ 2 root www-data 0 Dec 23 2015 priv
-rw-r----- 1 root www-data 1350 Dec 23 2015 pve-root-ca.pem
-rw-r----- 1 root www-data 1679 Dec 23 2015 pve-www.key
lrwxr-xr-x 1 root www-data 0 Dec 31 1969 qemu-server -> nodes/pve/qemu-server
-rw-r----- 1 root www-data 782 Jan 2 16:47 storage.cfg
-rw-r----- 1 root www-data 48 May 12 2016 user.cfg
-rw-r----- 1 root www-data 618 Jan 3 17:19 vzdump.cron
 
Could you run the container in the foreground and attach strace to it, to see where it might hang?

So it's Saturday, and I have my now-predictable hung container (only one this time). I can't run it in the foreground since it's already running and in D state, and I can't kill it to restart it. Ceph is healthy, and there is nothing overt in the log to suggest a ceph/rbd fault. New tasks operate normally, but the node is greyed out (as in @lasrb0isct's illustration above) and pct commands work for specific containers, but not for node-wide operations (e.g. pct list).

It is becoming painfully clear that I should not use vzdump in production; regardless of the source of the fault, the potential for blocking cluster management is a major design flaw.
 
The node is grey because the pvestatd daemon tries to establish the state of the container and hangs. Is it possible to create a snapshot by hand with the ceph tools (rbd snap) and then map it?
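
A quick, generic way to confirm that pvestatd is the wedged piece:
Code:
systemctl status pvestatd
journalctl -u pvestatd -n 50 --no-pager
ps -o pid,stat,wchan:32,cmd -C pvestatd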
 
No, I meant in general. If so, then it would mean that there is a problem with mapping a snapshot, and we could start to investigate deeper there.
 
Not really. There is a substantial difference in the fault, namely that the hung process in @denos's situation is recoverable. In my scenario the lxc process is a zombie and cannot be killed. While both setups utilize kernel modules to access the storage, I don't know if this particular issue is specific to ceph (me), zfs (denos), or the kernel...
 
