Fabian: I do use NFS inside containers on my home server, but not on two of the hypervisors at work that have had a network namespace lockup. The issue has been very easy to duplicate at home (within minutes, likely the issue fixed by the NFS patch listed) but much harder on the hypervisors at work (up to 3 days of reboots before it occurs). I was excited to have made some progress, but I agree with your assessment: the patch looks like it's only addressing an NFS namespace issue. I think we're looking at more than one network namespace kernel issue that has been addressed somewhere in 4.14, which is frustrating for everyone.
I appreciate that this is a very difficult issue to investigate, especially without steps to duplicate it, and I am grateful for everyone's effort in trying to pin it down.
If you have landed on this thread and want to confirm that it's relevant, wait for the issue to occur, then run this command:
Code:
grep copy_net_ns /proc/*/stack
If that returns anything, this thread will be relevant. If not, you have a different issue.
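If you also want to know which processes are stuck, the same check can be wrapped in a small shell function. This is only a rough sketch: the function name find_copy_net_ns and its optional directory argument are made up for illustration, and it assumes your kernel exposes /proc/<pid>/stack (reading it normally requires root).
Code:
```shell
#!/bin/sh
# Sketch only: print "PID COMMAND" for every process whose kernel stack
# mentions copy_net_ns. The function name and the optional first argument
# (an alternative proc root, handy for testing) are illustrative assumptions.
find_copy_net_ns() {
    proc_root="${1:-/proc}"
    for stack in "$proc_root"/[0-9]*/stack; do
        # Skip entries we cannot read (not root, or the process has exited).
        [ -r "$stack" ] || continue
        if grep -q copy_net_ns "$stack" 2>/dev/null; then
            pid_dir=${stack%/stack}
            pid=${pid_dir##*/}
            # comm holds the short command name; fall back to "?" if it is gone.
            comm=$(cat "$pid_dir/comm" 2>/dev/null || echo '?')
            echo "$pid $comm"
        fi
    done
}
```
Run find_copy_net_ns as root once the hang has occurred. No output means the same thing as the plain grep returning nothing.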
As noted earlier in this thread, Docker users have reported an error with similar symptoms and a similar stack trace (hang on copy_net_ns):
https://github.com/coreos/bugs/issues/254
The last post in that thread points on to this one:
https://github.com/moby/moby/issues/5618
where they indicate that kernel patches introduced as recently as February 2018 may be relevant.
To reiterate, no server running a plain 4.14.20 kernel or later has had a recurrence of this issue.