LXC container reboot fails - LXC becomes unusable

denos · Mar 5, 2018

Correct, 4.14.20 (or later) completely resolves the issue.

Vasu Sreekumar · Mar 6, 2018

Today we had a different issue.

We terminated CT 154, and the node went red. CT 154 was deleted; the node and all other CTs still respond to ping.

Result of ps aux | grep 154

root@P158:~# ps aux | grep 154
root 154 0.0 0.0 0 0 ? S< Mar02 0:00 [netns]
27 5095 0.0 0.0 113276 1548 ? Ss Mar02 0:00 /bin/sh /usr/bin/mysqld_safe --basedir=/usr
root 5148 0.0 0.0 20576 1548 ? Ss Mar02 0:03 /usr/sbin/dovecot
114 5389 0.0 0.0 225184 15484 ? S Mar02 0:08 /usr/lib/postgresql/9.4/bin/postgres -D /var/lib/postgresql/9.4/main -c config_file=/etc/postgresql/9.4/main/postgresql.conf
root 6154 0.0 0.0 0 0 ? S 15:34 0:00 [kworker/10:0]
root 15428 0.0 0.0 125092 3544 ? Ss Mar04 0:06 /sbin/init
110 15449 0.0 0.0 47292 5684 ? S 15:41 0:00 smtpd -n smtp -t inet -u -c -o stress= -s 2
110 15450 0.0 0.0 39960 3232 ? S 15:41 0:00 proxymap -t unix -u
root 15453 0.0 0.1 544616 59176 ? S 11:30 0:15 pvedaemon worker
110 15460 0.0 0.0 39960 3268 ? S 15:41 0:00 anvil -l -t unix -u -c
root 20988 0.0 0.1 111540 63076 ? Ss Mar02 1:52 /usr/lib/systemd/systemd-journald
root 22626 0.0 0.0 12788 920 pts/19 S+ 15:43 0:00 grep 154
root 25154 0.2 0.3 1065716 99068 ? Sl Mar04 3:10 /usr/bin/node ./eachDomainWise/domainWise rockstarfly.com 1342

Vasu Sreekumar · Mar 6, 2018

service pvestatd restart

This turned the node green, but all CTs are still grey.

denos · Mar 6, 2018

Vasu Sreekumar said:
Today we had a different issue.

We terminated CT 154, and the node went red. CT 154 was deleted; the node and all other CTs still respond to ping.

There are still many processes for CT 154 running.

root@P158:~# ps aux | grep 154

That command doesn't show processes from container 154. It's simply matching anything in the output of ps with the text "154" anywhere on the line.
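To illustrate the point, here is a minimal sketch of one way to list only the processes that belong to a container, by checking each process's cgroup membership instead of matching text in the ps output. It assumes the container's cgroup path contains "/lxc/<CTID>", which may differ between LXC versions; the function name ct_pids and the optional proc-root argument are hypothetical and exist only for illustration and testing.

```shell
#!/bin/sh
# List the PIDs whose cgroup path places them inside a given LXC container.
# Assumption: container cgroups are named ".../lxc/<CTID>"; adjust the
# pattern if your LXC version lays out cgroups differently.
# The second argument (proc root) is a hypothetical hook; it defaults to /proc.
ct_pids() {
    ctid=$1
    proc_root=${2:-/proc}
    for f in "$proc_root"/[0-9]*/cgroup; do
        [ -r "$f" ] || continue
        # Require "/" or end of line after the ID so that CT 154
        # does not also match CT 1540.
        if grep -Eq "/lxc/${ctid}(/|\$)" "$f" 2>/dev/null; then
            pid=${f%/cgroup}
            echo "${pid##*/}"
        fi
    done
}
```

Calling `ct_pids 154` as root on the node prints one PID per line; an empty result means no process is left in that container's cgroups.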

Vasu Sreekumar · Mar 6, 2018

What I meant is that the lxc monitor is not running like in the other issue.

I didn't mention in detail.

I assumed we all knew.

fabian · Mar 6, 2018

denos said:
Correct, 4.14.20 (or later) completely resolves the issue.

could you please test 4.14 (the first release of the 4.14 kernel series) and report whether it works or not? if it does not, this trims down the range of potentially fixing commits quite a lot!

denos · Mar 6, 2018

fabian said:
could you please test 4.14 (the first release of the 4.14 kernel series) and report whether it works or not? if it does not, this trims down the range of potentially fixing commits quite a lot!

Will do. If it fails, I'll figure out exactly which version works but it may take a few days as I can only test in the evenings.

FibreFoX · Mar 6, 2018

denos said:
Will do. If it fails, I'll figure out exactly which version works but it may take a few days as I can only test in the evenings.

Thanks a lot. I'm hitting this bug too and thought I had done something wrong. Hopefully you can help pinpoint this annoying bug. Thanks a lot! (I only recently started using Proxmox, so this is not an upgrade bug for me, and it was getting frustrating not being able to use LXC containers...)

Vasu Sreekumar · Mar 7, 2018

You are lucky.

I have 25 live nodes, and for the last week I have been having sleepless nights.

I have completely lost faith in Proxmox.

denos · Mar 7, 2018

Good news: I bisected the 4.14 kernel releases and determined that 4.14.4 is the first working kernel version. Looking at the change log, I'd bet that this is the fix we're after:

Code:

commit 84779085fa10014b9e8208d7e71b54bced73075c
Author: Vasily Averin <vvs@virtuozzo.com>
Date:   Thu Nov 2 13:03:42 2017 +0300

    lockd: lost rollback of set_grace_period() in lockd_down_net()
    
    commit 3a2b19d1ee5633f76ae8a88da7bc039a5d1732aa upstream.
    
    Commit efda760fe95ea ("lockd: fix lockd shutdown race") is incorrect,
    it removes lockd_manager and disarm grace_period_end for init_net only.
    
    If nfsd was started from another net namespace lockd_up_net() calls
    set_grace_period() that adds lockd_manager into per-netns list
    and queues grace_period_end delayed work.
    
    These action should be reverted in lockd_down_net().
    Otherwise it can lead to double list_add on after restart nfsd in netns,
    and to use-after-free if non-disarmed delayed work will be executed after netns destroy.
    
    Fixes: efda760fe95e ("lockd: fix lockd shutdown race")
    Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
    Signed-off-by: J. Bruce Fields <bfields@redhat.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

Fingers crossed and looking forward to the next pve-kernel to test.

Vasu Sreekumar · Mar 7, 2018

Great news.

I hope we will get the new update for proxmox soon.

That will end my sleepless nights.

fabian · Mar 7, 2018

are all of you mounting or exporting NFS shares inside your containers? if so it would have been a good idea to include this in your reports, as it is a setup that we advise against and do not test at all.

I can reproduce a hang ONLY when I either mount or export NFS shares within a container (which requires modifying / disabling AppArmor!), and even then rebooting the container in question generates the following kernel BUG trace, which would have immediately pointed to NFS as the culprit:

Code:

Mar 07 12:50:14 host kernel: ------------[ cut here ]------------
Mar 07 12:50:14 host kernel: kernel BUG at fs/nfs_common/grace.c:107!
Mar 07 12:50:14 host kernel: invalid opcode: 0000 [#1] SMP PTI
Mar 07 12:50:14 host kernel: Modules linked in: rpcsec_gss_krb5 nfsv4 nfsd auth_rpcgss veth rbd libceph nfsv3 nfs_acl nfs lockd grace fscache ip_set ip6table_filter ip6_tables xfs iptable_filter ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack softdog nfnetlink_log nfnetlink dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd glue_helper ppdev hid_generic cryptd zfs(PO) zunicode(PO) zavl(PO) icp(PO) snd_pcm snd_timer snd soundcore pcspkr joydev input_leds serio_raw shpchp parport_pc parport qemu_fw_cfg mac_hid usbhid hid zcommon(PO) znvpair(PO) spl(O) vhost_net vhost tap ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp
Mar 07 12:50:14 host kernel:  libiscsi scsi_transport_iscsi sunrpc ip_tables x_tables autofs4 btrfs xor raid6_pq psmouse virtio_net virtio_scsi floppy pata_acpi i2c_piix4
Mar 07 12:50:14 host kernel: CPU: 1 PID: 90 Comm: kworker/u4:2 Tainted: P           O    4.13.13-6-pve #1
Mar 07 12:50:14 host kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.0-0-g63451fca13-prebuilt.qemu-project.org 04/01/2014
Mar 07 12:50:14 host kernel: Workqueue: netns cleanup_net
Mar 07 12:50:14 host kernel: task: ffff941fe5475f00 task.stack: ffffb9a181d30000
Mar 07 12:50:14 host kernel: RIP: 0010:grace_exit_net+0x24/0x30 [grace]
Mar 07 12:50:14 host kernel: RSP: 0000:ffffb9a181d33dc8 EFLAGS: 00010212
Mar 07 12:50:14 host kernel: RAX: ffff941fe6f209e0 RBX: ffff941f902aaf80 RCX: 0000000000000000
Mar 07 12:50:14 host kernel: RDX: ffff941f9010ed38 RSI: ffffffffc0ac1020 RDI: ffff941f902aaf80
Mar 07 12:50:14 host kernel: RBP: ffffb9a181d33dc8 R08: ffff941f9010e0c0 R09: 000000018015000d
Mar 07 12:50:14 host kernel: R10: ffffb9a181d33d18 R11: 0000000000000000 R12: ffffb9a181d33e20
Mar 07 12:50:14 host kernel: R13: ffffffffc0ac1018 R14: ffffffffc0ac1020 R15: 0000000000000000
Mar 07 12:50:14 host kernel: FS:  0000000000000000(0000) GS:ffff941fffd00000(0000) knlGS:0000000000000000
Mar 07 12:50:14 host kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar 07 12:50:14 host kernel: CR2: 000056234982b078 CR3: 000000029700a002 CR4: 00000000003606e0
Mar 07 12:50:14 host kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Mar 07 12:50:14 host kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Mar 07 12:50:14 host kernel: Call Trace:
Mar 07 12:50:14 host kernel:  ops_exit_list.isra.8+0x3b/0x70
Mar 07 12:50:14 host kernel:  cleanup_net+0x1ca/0x2b0
Mar 07 12:50:14 host kernel:  process_one_work+0x1ee/0x410
Mar 07 12:50:14 host kernel:  worker_thread+0x4b/0x420
Mar 07 12:50:14 host kernel:  kthread+0x10c/0x140
Mar 07 12:50:14 host kernel:  ? process_one_work+0x410/0x410
Mar 07 12:50:14 host kernel:  ? kthread_create_on_node+0x70/0x70
Mar 07 12:50:14 host kernel:  ret_from_fork+0x35/0x40
Mar 07 12:50:14 host kernel: Code: 1f 84 00 00 00 00 00 0f 1f 44 00 00 8b 15 79 22 00 00 48 8b 87 88 12 00 00 55 48 89 e5 48 8b 04 d0 48 8b 10 48 39 d0 75 02 5d c3 <0f> 0b 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 8b 15 49 22
Mar 07 12:50:14 host kernel: RIP: grace_exit_net+0x24/0x30 [grace] RSP: ffffb9a181d33dc8
Mar 07 12:50:14 host kernel: ---[ end trace ce4a24d79fcca3bb ]---

I'll verify that the commit in question fixes the issue (it seems very likely).

Vasu Sreekumar · Mar 7, 2018

I don't use NFS.

I use ZFS.

Only local drives.

I have 25 live nodes, all with local ZFS drives, and all 25 nodes are facing issues at least once every 2 days.

FibreFoX · Mar 7, 2018

I can second this. I do not use NFS, only ZFS (raid1/mirror mode) with two drives and no special ZFS configuration (the out-of-the-box configuration).

I have several QEMU/KVM VMs running, and some of them produce some load. Running containers work, but when I reboot or shut them down via the web interface, they shut down and then fail to come up again.

This is not NFS-related AFAIK.

fabian · Mar 7, 2018

@FibreFoX @Vasu Sreekumar

then it is likely that your issue is a different one and not the one @denos bisected to - the commit in question is for code that is only used for NFS AFAICT. I tried reproducing with just an NFS mount and/or export on the PVE node itself, but that does not trigger the problem so far.

@denos: can you confirm whether you are using NFS inside the container or not?

FibreFoX · Mar 7, 2018

@fabian
Thanks for the response, I will try to create something to reproduce it. Let's wait for @denos to give some more input.

Vasu Sreekumar · Mar 7, 2018

My issue is exactly what denos is talking about, and no NFS is involved.

Vasu Sreekumar · Mar 7, 2018

Reproducing is easy.

Create 5 LXC CTs and run a cron job that stops and starts each CT every minute.

Within minutes you will see the issue.
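The steps above can be sketched as a small script to call from cron. This is only an illustration under stated assumptions: it uses the Proxmox `pct stop` and `pct start` commands, the container IDs in the usage note are hypothetical, and the PCT override variable is something introduced here so the loop can be dry-run without touching real containers.

```shell
#!/bin/sh
# Stop and then start each container ID passed as an argument, once.
# PCT is a hypothetical override for dry runs; it defaults to the real pct.
cycle_cts() {
    for id in "$@"; do
        "${PCT:-pct}" stop "$id"
        "${PCT:-pct}" start "$id"
    done
}
```

For example, a cron entry running every minute could call `cycle_cts 201 202 203 204 205` (hypothetical IDs). Setting `PCT=echo` first prints the commands instead of executing them. Once the hang occurs, the start step is expected to block, so do not run this on a production node.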

FibreFoX · Mar 7, 2018

Vasu Sreekumar said:
Reproducing is easy.

Create 5 LXC CTs and run a cron job that stops and starts each CT every minute.

Within minutes you will see the issue.

That seems not to be so easy for the Proxmox team,

so I'll try to make this reproducible inside VirtualBox or something like that.

denos · Mar 7, 2018

Fabian: I do use NFS inside containers on my home server but not on two of the hypervisors that have had a network namespace lockup at work. It has been very easy to duplicate the issue at home (within minutes, likely the bug addressed by the NFS patch listed) but much harder on the hypervisors at work (up to 3 days of reboots before it occurs). I was excited to have made some progress but I agree with your assessment - the patch looks like it's only addressing an NFS namespace issue. I think we're looking at more than one network namespace kernel issue that has been addressed somewhere in 4.14, which is frustrating for everyone.

I appreciate that this is a very difficult issue to investigate, especially without steps to duplicate and am grateful for everyone's effort trying to pin it down.

If you have landed on this thread and want to confirm that it's relevant, wait for the issue to occur then run this command:

Code:

grep copy_net_ns /proc/*/stack

If that returns anything, this thread will be relevant. If not, you have a different issue.
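As a slightly more verbose variant of the check above, the following sketch prints the PID and command name of every task whose kernel stack mentions copy_net_ns, which makes it easier to see what is stuck. Reading /proc/<pid>/stack normally requires root. The function name and the optional proc-root argument are hypothetical and only serve illustration and testing.

```shell
#!/bin/sh
# Print "PID COMMAND" for every task whose kernel stack contains copy_net_ns.
# The optional argument (proc root) is a hypothetical hook; default is /proc.
stuck_in_copy_net_ns() {
    proc_root=${1:-/proc}
    for s in "$proc_root"/[0-9]*/stack; do
        [ -r "$s" ] || continue
        if grep -q copy_net_ns "$s" 2>/dev/null; then
            dir=${s%/stack}
            # comm holds the short command name; fall back to "?" if unreadable.
            comm=$(cat "$dir/comm" 2>/dev/null)
            echo "${dir##*/} ${comm:-?}"
        fi
    done
}
```

Running `stuck_in_copy_net_ns` with no arguments scans the live /proc; no output means no task is currently waiting in copy_net_ns.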

As noted earlier in this thread, Docker users have reported an error with similar symptoms and a similar stack trace (hang on copy_net_ns):
https://github.com/coreos/bugs/issues/254
The bottom post in that thread carries on to this thread:
https://github.com/moby/moby/issues/5618
where they indicate that kernel patches introduced as recently as February 2018 may be relevant.

To reiterate, any server running a plain 4.14.20 kernel or later has had no further recurrence of this issue.
