all the linked commits are already cherry-picked/backported into our 4.13.13-6-pve /-41 kernel (except for the NFS one you bisected to, which we will include in the next round of upgrades). We cannot reproduce this issue at all - we have been trying with various setups and machines without any luck. Unfortunately, this is an area where we have seen frequent regressions in the past, so it is likely there is some race/refcount leak that is triggered by some yet-unknown factor.

Fabian: I do use NFS inside containers on my home server, but not on two of the hypervisors that have had a network namespace lockup at work. It has been very easy to duplicate the issue at home (minutes - likely the NFS patch listed), but much harder on the hypervisors at work (up to 3 days of reboots before it occurs). I was excited to have made some progress, but I agree with your assessment - the patch looks like it's only addressing an NFS namespace issue. I think we're looking at more than one network namespace kernel issue that has been addressed somewhere in 4.14, which is frustrating for everyone.
I appreciate that this is a very difficult issue to investigate, especially without steps to duplicate it, and I am grateful for everyone's effort in trying to pin it down.
If you have landed on this thread and want to confirm that it's relevant, wait for the issue to occur, then run this command:
Code:
grep copy_net_ns /proc/*/stack
If that returns anything, this thread is relevant. If not, you have a different issue.
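For anyone who wants to check for the hang without watching the console, the same test can be wrapped in a small script. This is just a sketch of that one grep - the function name and messages are my own, and reading /proc/*/stack requires root:

```shell
#!/bin/sh
# check_netns_hang: succeed (and print the offending stack files) if any
# task is currently stuck in copy_net_ns, fail otherwise.
check_netns_hang() {
    # Processes may exit mid-scan; 2>/dev/null hides that harmless noise.
    stacks=$(grep -l copy_net_ns /proc/[0-9]*/stack 2>/dev/null)
    if [ -n "$stacks" ]; then
        echo "copy_net_ns hang detected in:"
        echo "$stacks"
        return 0
    fi
    return 1
}

if check_netns_hang; then
    echo "this thread is relevant to you"
else
    echo "no copy_net_ns hang right now"
fi
```

Run it from cron or a loop and you'll catch the moment the box first wedges, instead of discovering it days later.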
As noted earlier in this thread, Docker users have reported an error with similar symptoms and a similar stack trace (a hang in copy_net_ns):
The bottom post in that thread leads on to this one:
where they indicate that kernel patches introduced as recently as February 2018 may be relevant.
To reiterate: no server running a plain 4.14.20 or later kernel has had a recurrence of this issue.
dpkg -i linux-image-4.14.25-lxcfix-2_amd64.deb
dpkg --purge linux-image-4.14.25-lxcfix
a19f3b2228be2cbd64aad41049e25e73c42a64f71f7fd0f96ac1b0046f6e99ee  linux-headers-4.14.25-lxcfix-2_amd64.deb
ecfe10018a7093af3d2757041867d65b6954c0be77bb96d74e9c791de53bee0d  linux-image-4.14.25-lxcfix-2_amd64.deb
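If you download those packages, you can verify them against the posted checksums before running dpkg -i. A sketch (assumes the .deb files are in the current directory; sha256sum -c needs two spaces between hash and filename):

```shell
# Save the posted checksums into a file sha256sum understands.
cat > SHA256SUMS <<'EOF'
a19f3b2228be2cbd64aad41049e25e73c42a64f71f7fd0f96ac1b0046f6e99ee  linux-headers-4.14.25-lxcfix-2_amd64.deb
ecfe10018a7093af3d2757041867d65b6954c0be77bb96d74e9c791de53bee0d  linux-image-4.14.25-lxcfix-2_amd64.deb
EOF
# Exits non-zero if either file is missing or has been altered.
sha256sum -c SHA256SUMS
```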
Linux proxmox 4.15.18-5-pve #1 SMP PVE 4.15.18-24 (Thu, 13 Sep 2018 09:15:10 +0200) x86_64 GNU/Linux
lxc-start -n 200 -F -l DEBUG -o /root/lxc-200.log
# systemctl status lxc@200.service
● lxc@200.service - LXC Container: 200
   Loaded: loaded (/lib/systemd/system/lxc@.service; disabled; vendor preset: enabled)
  Drop-In: /lib/systemd/system/lxc@.service.d
           └─pve-reboot.conf
   Active: inactive (dead)
     Docs: man:lxc-start
           man:lxc
# systemctl status pvestatd
● pvestatd.service - PVE Status Daemon
   Loaded: loaded (/lib/systemd/system/pvestatd.service; enabled; vendor preset: enabled)
   Active: active (running) since Sun 2018-10-07 16:44:26 EDT; 3 weeks 4 days ago
  Process: 6972 ExecStart=/usr/bin/pvestatd start (code=exited, status=0/SUCCESS)
 Main PID: 7102 (pvestatd)
    Tasks: 2 (limit: 4915)
   Memory: 88.2M
      CPU: 1d 1h 58min 48.573s
   CGroup: /system.slice/pvestatd.service
           ├─ 7102 pvestatd
           └─11278 lxc-info -n 200 -p

Oct 07 16:44:25 proxmox systemd: Starting PVE Status Daemon...
Oct 07 16:44:26 proxmox pvestatd: starting server
Oct 07 16:44:26 proxmox systemd: Started PVE Status Daemon.
Oct 08 14:51:57 proxmox pvestatd: modified cpu set for lxc/200: 1-2,4,13
Oct 08 15:02:06 proxmox pvestatd: unable to get PID for CT 225 (not running?)
Oct 08 15:02:06 proxmox pvestatd: unable to get PID for CT 225 (not running?)
root@proxmox:~/container_config# ps -Af | grep -i lxc-info
root     11278  7102  0 14:28 ?        00:00:00 lxc-info -n 200 -p
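A healthy lxc-info call returns almost immediately, so a wedged one like the child above stands out by its age. A quick way to spot them (a sketch; the 60-second threshold is my own arbitrary choice):

```shell
# List lxc-info processes older than 60 seconds: pid, age in seconds,
# and command line. Anything that old is probably blocked on the stuck
# network namespace rather than doing real work.
ps -o pid=,etimes=,args= -C lxc-info | awk '$2 > 60 {print}'
```

If that prints the same PID across repeated runs with a growing etimes value, the process is hung, which matches the stuck pvestatd child shown in the status output above.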