Proxmox 4.0 LXC container network unstable

Code:
root@node9 ~ # dmesg
[188956.209429] usb 1-7: USB disconnect, device number 7
[188956.209466] usb 1-7.1: USB disconnect, device number 8
[188956.229334] usb 1-7.2: USB disconnect, device number 9
[188956.245358] usb 1-7.3: USB disconnect, device number 10
[188956.265212] usb 1-7.4: USB disconnect, device number 11
[235745.376921] audit: type=1400 audit(1459048921.271:91): apparmor="DENIED" operation="mount" info="failed flags match" error=-13 profile="lxc-container-default" name="/" pid=19729 comm="(httpd)" flags="rw, rslave"
[241182.662828] audit: type=1400 audit(1459054363.629:92): apparmor="DENIED" operation="mount" info="failed flags match" error=-13 profile="lxc-container-default" name="/" pid=7713 comm="(amavisd)" flags="rw, rslave"
[271343.369155] audit: type=1400 audit(1459084552.414:93): apparmor="DENIED" operation="mount" info="failed flags match" error=-13 profile="lxc-container-default" name="/" pid=10109 comm="(tmpwatch)" flags="rw, rslave"
[271343.379485] audit: type=1400 audit(1459084552.422:94): apparmor="DENIED" operation="mount" info="failed flags match" error=-13 profile="lxc-container-default" name="/" pid=10137 comm="(tmpwatch)" flags="rw, rslave"
[357667.083984] audit: type=1400 audit(1459170956.541:95): apparmor="DENIED" operation="mount" info="failed flags match" error=-13 profile="lxc-container-default" name="/" pid=30827 comm="(tmpwatch)" flags="rw, rslave"
[357667.094662] audit: type=1400 audit(1459170956.549:96): apparmor="DENIED" operation="mount" info="failed flags match" error=-13 profile="lxc-container-default" name="/" pid=30842 comm="(tmpwatch)" flags="rw, rslave"
[443989.746415] audit: type=1400 audit(1459257359.610:97): apparmor="DENIED" operation="mount" info="failed flags match" error=-13 profile="lxc-container-default" name="/" pid=10163 comm="(tmpwatch)" flags="rw, rslave"
[443989.757883] audit: type=1400 audit(1459257359.622:98): apparmor="DENIED" operation="mount" info="failed flags match" error=-13 profile="lxc-container-default" name="/" pid=10178 comm="(tmpwatch)" flags="rw, rslave"
[530311.584988] audit: type=1400 audit(1459343761.849:99): apparmor="DENIED" operation="mount" info="failed flags match" error=-13 profile="lxc-container-default" name="/" pid=17824 comm="(tmpwatch)" flags="rw, rslave"
[530311.593891] audit: type=1400 audit(1459343761.857:100): apparmor="DENIED" operation="mount" info="failed flags match" error=-13 profile="lxc-container-default" name="/" pid=17839 comm="(tmpwatch)" flags="rw, rslave"
[595485.893994] INFO: task bash:1879 blocked for more than 120 seconds.
[595485.894033]       Tainted: G        W       4.2.8-1-pve #1
[595485.894067] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[595485.894125] bash            D 0000000000000009     0  1879   1876 0x00000104
[595485.894169]  ffff880581ddbbe8 0000000000000086 ffff881ff2139b80 ffff880008f53700
[595485.894238]  0000000000000246 ffff880581ddc000 ffff880581ddbc38 ffff881fedf3b390
[595485.894306]  ffff881fe906a840 fffffffffffffe00 ffff880581ddbc08 ffffffff81806967
[595485.894375] Call Trace:
[595485.894406]  [<ffffffff81806967>] schedule+0x37/0x80
[595485.894443]  [<ffffffff812fb863>] request_wait_answer+0x163/0x280
[595485.894480]  [<ffffffff810bdd30>] ? wait_woken+0x90/0x90
[595485.894514]  [<ffffffff812fba10>] __fuse_request_send+0x90/0xa0
[595485.894550]  [<ffffffff812fba47>] fuse_request_send+0x27/0x30
[595485.894586]  [<ffffffff81306128>] fuse_direct_io+0x3a8/0x5b0
[595485.894621]  [<ffffffff81306374>] __fuse_direct_read+0x44/0x60
[595485.894656]  [<ffffffff813063d0>] fuse_direct_read_iter+0x40/0x60
[595485.894693]  [<ffffffff811fd264>] new_sync_read+0x94/0xd0
[595485.894728]  [<ffffffff811fd2c6>] __vfs_read+0x26/0x40
[595485.894761]  [<ffffffff811fd91a>] vfs_read+0x8a/0x130
[595485.894796]  [<ffffffff811fe7e5>] SyS_read+0x55/0xc0
[595485.894830]  [<ffffffff8180aaf2>] entry_SYSCALL_64_fastpath+0x16/0x75
[595605.782323] INFO: task bash:1879 blocked for more than 120 seconds.
[595605.782363]       Tainted: G        W       4.2.8-1-pve #1
[595605.782397] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[595605.782455] bash            D 0000000000000009     0  1879   1876 0x00000104
[595605.782499]  ffff880581ddbbe8 0000000000000086 ffff881ff2139b80 ffff880008f53700
[595605.782567]  0000000000000246 ffff880581ddc000 ffff880581ddbc38 ffff881fedf3b390
[595605.782636]  ffff881fe906a840 fffffffffffffe00 ffff880581ddbc08 ffffffff81806967
[595605.782704] Call Trace:
[595605.782735]  [<ffffffff81806967>] schedule+0x37/0x80
[595605.782772]  [<ffffffff812fb863>] request_wait_answer+0x163/0x280
[595605.782809]  [<ffffffff810bdd30>] ? wait_woken+0x90/0x90
[595605.782843]  [<ffffffff812fba10>] __fuse_request_send+0x90/0xa0
[595605.782879]  [<ffffffff812fba47>] fuse_request_send+0x27/0x30
[595605.782915]  [<ffffffff81306128>] fuse_direct_io+0x3a8/0x5b0
[595605.782950]  [<ffffffff81306374>] __fuse_direct_read+0x44/0x60
[595605.782985]  [<ffffffff813063d0>] fuse_direct_read_iter+0x40/0x60
[595605.783022]  [<ffffffff811fd264>] new_sync_read+0x94/0xd0
[595605.783057]  [<ffffffff811fd2c6>] __vfs_read+0x26/0x40
[595605.783091]  [<ffffffff811fd91a>] vfs_read+0x8a/0x130
[595605.783125]  [<ffffffff811fe7e5>] SyS_read+0x55/0xc0
[595605.783159]  [<ffffffff8180aaf2>] entry_SYSCALL_64_fastpath+0x16/0x75
 

Attachments

  • dmesg_host.txt
Definitely the cron processes inside the containers:
-crond-+-1150*[crond]

(see the attached pstree_host.txt for the full process tree)
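For reference, a quick way to reproduce that count on the host is something like this (a minimal sketch, assuming the procps and psmisc tools are installed):

Code:
# count crond processes visible on the host (pgrep matches the process name)
pgrep -c crond
# or show the collapsed tree, as in the attached pstree_host.txt
pstree | grep crond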
 

Attachments

  • pstree_host.txt
I tried to kill the crond processes but ended up with 5,000+ zombies. I got rid of the problem by rebooting the server. Everything is okay again for now.

I have seen that our other Proxmox server does not have this issue anymore: no never-ending crond processes, even though it uses the same container template.

The only difference is that I rebooted that machine after upgrading it. Maybe that is the cause, so my hope is that the issue is now resolved by the reboot. I will keep you up to date. Thanks Fabian!
 
OK, some news. I don't see any hanging crond processes anymore:

root@node9 ~ # ps faxl|grep cron
0 0 26101 26036 20 0 14196 2132 pipe_w S+ pts/4 0:00 \_ grep cron
4 0 1484 1 20 0 28972 2740 hrtime Ss ? 0:00 /usr/sbin/cron -f
4 0 2402 1822 20 0 22760 2820 hrtime Ss ? 0:00 \_ /usr/sbin/crond -n
4 0 4569 3360 20 0 22760 2676 hrtime Ss ? 0:00 \_ /usr/sbin/crond -n
4 0 6407 5042 20 0 22756 2664 hrtime Ss ? 0:00 \_ /usr/sbin/crond -n
4 0 8018 6946 20 0 22760 2664 hrtime Ss ? 0:00 \_ /usr/sbin/crond -n

I bet the reboot was really needed after installing the updates.
 
This morning I checked the processes again. No duplicate cron daemons. The issue looks completely resolved after rebooting.
 
If you haven't rebooted yet, install lxcfs-dbg and gdb and attach gdb to process 12919 (lxcfs):
Code:
apt-get install lxcfs-dbg gdb
gdb
attach 12919
bt
detach
quit

and post this and the full output of "pveversion -v" here.
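If an interactive session is awkward, the same information can be collected non-interactively as well (a sketch, assuming a single lxcfs process that pidof can find):

Code:
apt-get install lxcfs-dbg gdb
# print the backtrace of the running lxcfs process, then detach automatically
gdb -batch -ex bt -p $(pidof lxcfs)
# swap "-ex bt" for "-ex 'thread apply all bt'" to see all threads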
 
Hi Fabian,

I did already reboot, but again I confirm that all problems are resolved. :)

Rebooting is optional most of the time, but in some specific scenarios it is really needed, especially in my case after updating lxcfs.
 
Picking up on this... I still see this happening randomly on different hosts.

Latest backtrace:

Code:
#0  0x00007fbc42ebb536 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x7ffdc1090c00) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
#1  do_futex_wait (sem=sem@entry=0x7ffdc1090c00, abstime=0x0) at sem_waitcommon.c:111
#2  0x00007fbc42ebb5e4 in __new_sem_wait_slow (sem=0x7ffdc1090c00, abstime=0x0) at sem_waitcommon.c:181
#3  0x00007fbc432dff88 in fuse_session_loop_mt () from /lib/x86_64-linux-gnu/libfuse.so.2
#4  0x00007fbc432e5727 in fuse_loop_mt () from /lib/x86_64-linux-gnu/libfuse.so.2
#5  0x00007fbc432e83dd in ?? () from /lib/x86_64-linux-gnu/libfuse.so.2
#6  0x00005590a887739f in main (argc=<optimized out>, argv=<optimized out>) at lxcfs.c:889

Code:
proxmox-ve: 5.0-16 (running kernel: 4.10.15-1-pve)
pve-manager: 5.0-23 (running version: 5.0-23/af4267bf)
pve-kernel-4.10.15-1-pve: 4.10.15-15
pve-kernel-4.4.67-1-pve: 4.4.67-92
pve-kernel-4.10.17-1-pve: 4.10.17-16
libpve-http-server-perl: 2.0-5
lvm2: 2.02.168-pve2
corosync: 2.4.2-pve3
libqb0: 1.0.1-1
pve-cluster: 5.0-12
qemu-server: 5.0-14
pve-firmware: 2.0-2
libpve-common-perl: 5.0-16
libpve-guest-common-perl: 2.0-11
libpve-access-control: 5.0-5
libpve-storage-perl: 5.0-12
pve-libspice-server1: 0.12.8-3
vncterm: 1.5-2
pve-docs: 5.0-9
pve-qemu-kvm: 2.9.0-2
pve-container: 2.0-15
pve-firewall: 3.0-2
pve-ha-manager: 2.0-2
ksm-control-daemon: 1.2-2
glusterfs-client: 3.8.8-1
lxc-pve: 2.0.8-3
lxcfs: 2.0.7-pve2
criu: 2.11.1-1~bpo90
novnc-pve: 0.6-4
smartmontools: 6.5+svn4324-1
zfsutils-linux: 0.6.5.9-pve16~bpo90
 
@morph027 please describe in more depth what kind of issue you are experiencing, including any logs / error messages. The backtrace you posted just shows that lxcfs is waiting in the FUSE loop, which is normal.
 
Example:

Code:
# lxc-exec-all "pgrep -cx cron"
executing "pgrep -cx cron" container #103 (smtp)
1
executing "pgrep -cx cron" container #106 (fileserver)
36
executing "pgrep -cx cron" container #109 (dokuwiki)
97
executing "pgrep -cx cron" container #111 (monitor)
1
executing "pgrep -cx cron" container #112 (apt-cacher)
1
executing "pgrep -cx cron" container #113 (repo)
1
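For reference, lxc-exec-all is not a stock Proxmox command; a rough, untested equivalent using pct could look like this:

Code:
#!/bin/bash
# run a command inside every running container and print the result per CTID
cmd="pgrep -cx cron"
for ctid in $(pct list | awk 'NR>1 && $2=="running" {print $1}'); do
    echo "executing \"$cmd\" in container #$ctid"
    pct exec "$ctid" -- $cmd
done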
 
Not sure if this AppArmor stuff is involved:

Code:
Aug 22 15:42:53 proxmox3 pct[5934]: <root@pam> starting task UPID:proxmox3:00001730:01085DB5:599C34DD:vzstart:109:root@pam:
Aug 22 15:42:53 proxmox3 pct[5936]: starting CT 109: UPID:proxmox3:00001730:01085DB5:599C34DD:vzstart:109:root@pam:
Aug 22 15:42:53 proxmox3 systemd[1]: Starting LXC Container: 109...
Aug 22 15:42:54 proxmox3 systemd-udevd[5955]: Could not generate persistent MAC address for vethSIPTH7: No such file or directory
Aug 22 15:42:54 proxmox3 kernel: IPv6: ADDRCONF(NETDEV_UP): veth109i0: link is not ready
Aug 22 15:42:54 proxmox3 kernel: vmbr0: port 5(veth109i0) entered blocking state
Aug 22 15:42:54 proxmox3 kernel: vmbr0: port 5(veth109i0) entered disabled state
Aug 22 15:42:54 proxmox3 kernel: device veth109i0 entered promiscuous mode
Aug 22 15:42:54 proxmox3 kernel: eth0: renamed from vethSIPTH7
Aug 22 15:42:55 proxmox3 systemd[1]: Started LXC Container: 109.
Aug 22 15:42:55 proxmox3 pct[5934]: <root@pam> end task UPID:proxmox3:00001730:01085DB5:599C34DD:vzstart:109:root@pam: OK
Aug 22 15:42:55 proxmox3 audit[6179]: AVC apparmor="DENIED" operation="mount" info="failed type match" error=-13 profile="lxc-container-default-cgns" name="/sys/fs/pstore/" pid=6179 comm="mount" fstype="pstore" srcname="pstore"
Aug 22 15:42:55 proxmox3 audit[6179]: AVC apparmor="DENIED" operation="mount" info="failed type match" error=-13 profile="lxc-container-default-cgns" name="/sys/fs/pstore/" pid=6179 comm="mount" fstype="pstore" srcname="pstore" flags="ro"
Aug 22 15:42:55 proxmox3 kernel: audit: type=1400 audit(1503409375.364:81): apparmor="DENIED" operation="mount" info="failed type match" error=-13 profile="lxc-container-default-cgns" name="/sys/fs/pstore/" pid=6179 comm="mount" fstype="pstore" srcname="pstore"
Aug 22 15:42:55 proxmox3 kernel: audit: type=1400 audit(1503409375.364:82): apparmor="DENIED" operation="mount" info="failed type match" error=-13 profile="lxc-container-default-cgns" name="/sys/fs/pstore/" pid=6179 comm="mount" fstype="pstore" srcname="pstore" flags="ro"
Aug 22 15:42:55 proxmox3 kernel: audit: type=1400 audit(1503409375.364:83): apparmor="DENIED" operation="mount" info="failed type match" error=-13 profile="lxc-container-default-cgns" name="/sys/kernel/config/" pid=6187 comm="mount" fstype="configfs" srcname="configfs"
Aug 22 15:42:55 proxmox3 kernel: audit: type=1400 audit(1503409375.364:84): apparmor="DENIED" operation="mount" info="failed type match" error=-13 profile="lxc-container-default-cgns" name="/sys/kernel/config/" pid=6187 comm="mount" fstype="configfs" srcname="configfs" flags="ro"
Aug 22 15:42:55 proxmox3 kernel: audit: type=1400 audit(1503409375.440:85): apparmor="DENIED" operation="mount" info="failed flags match" error=-13 profile="lxc-container-default-cgns" name="/" pid=6296 comm="mount" flags="rw, remount, silent"
Aug 22 15:42:55 proxmox3 kernel: audit: type=1400 audit(1503409375.440:86): apparmor="DENIED" operation="mount" info="failed flags match" error=-13 profile="lxc-container-default-cgns" name="/" pid=6297 comm="mount" flags="rw, remount"
Aug 22 15:42:55 proxmox3 kernel: audit: type=1400 audit(1503409375.468:87): apparmor="DENIED" operation="mount" info="failed type match" error=-13 profile="lxc-container-default-cgns" name="/run/" pid=6352 comm="mount" flags="rw, nosuid, noexec, remount, relatime"
Aug 22 15:42:55 proxmox3 kernel: audit: type=1400 audit(1503409375.488:88): apparmor="DENIED" operation="mount" info="failed type match" error=-13 profile="lxc-container-default-cgns" name="/run/lock/" pid=6363 comm="mount" flags="rw, nosuid, nodev, noexec, remount, relatime"
Aug 22 15:42:55 proxmox3 kernel: audit: type=1400 audit(1503409375.536:89): apparmor="DENIED" operation="mount" info="failed type match" error=-13 profile="lxc-container-default-cgns" name="/run/shm/" pid=6450 comm="mount" flags="rw, nosuid, nodev, noexec, remount, relatime"
Aug 22 15:42:55 proxmox3 kernel: audit: type=1400 audit(1503409375.536:90): apparmor="DENIED" operation="mount" info="failed type match" error=-13 profile="lxc-container-default-cgns" name="/dev/pts/" pid=6458 comm="mount" flags="rw, nosuid, noexec, remount, relatime"
Aug 22 15:42:55 proxmox3 kernel: vmbr0: port 5(veth109i0) entered blocking state
Aug 22 15:42:55 proxmox3 kernel: vmbr0: port 5(veth109i0) entered forwarding state

Looking at /etc/apparmor.d/lxc/lxc-default-cgns, this should be fine ;)
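To double-check, you can confirm that the container is actually confined by that profile and list the mount rules it grants (a hedged sketch; paths and package names may differ slightly between versions):

Code:
# is the profile loaded, and which processes run under it?
aa-status | grep lxc-container-default-cgns
# which mount rules do the profile and its lxc abstractions allow?
grep -Rn mount /etc/apparmor.d/lxc/lxc-default-cgns /etc/apparmor.d/abstractions/lxc/ 2>/dev/null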
 
When restarting:

Code:
lxc-wait: monitor.c: lxc_monitor_open: 237 Failed to connect to monitor socket. Retrying in 10 ms: Connection refused
lxc-wait: monitor.c: lxc_monitor_open: 237 Failed to connect to monitor socket. Retrying in 50 ms: Connection refused
lxc-wait: monitor.c: lxc_monitor_open: 237 Failed to connect to monitor socket. Retrying in 100 ms: Connection refused
 
Are there any processes inside the container which are hanging? If so, their states and backtraces might be interesting.
 
Will keep an eye on it! Same gdb commands to debug? (Sorry, I'm just a DevOps guy, not a programmer *g*)

For now the content of /proc/PID/stack is probably enough (replace PID accordingly).
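If several crons are stuck, a small loop collects all of their kernel stacks at once (a sketch; run as root on the host, and pgrep matches on the process name, so it should cover both cron and crond):

Code:
# dump the kernel stack of every cron/crond process on the host
for pid in $(pgrep cron); do
    echo "== PID $pid: $(tr '\0' ' ' < /proc/$pid/cmdline) =="
    cat /proc/$pid/stack
done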
 
Code:
# gdb
GNU gdb (Debian 7.12-6) 7.12.0.20161007-git
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word".
(gdb) attach
Argument required (process-id to attach).
(gdb) attach 4200
Attaching to process 4200
[New LWP 4258]
[New LWP 4259]
[New LWP 30905]
warning: Could not load vsyscall page because no executable was specified
0x00007f14c5040536 in ?? ()
(gdb) bt
#0  0x00007f14c5040536 in ?? ()
#1  0x0000000000021000 in ?? ()
#2  0x26c6942ac2269e00 in ?? ()
#3  0x00007fff6bdc2340 in ?? ()
#4  0xfffffffeffffffff in ?? ()
#5  0x00007fff6bdc2290 in ?? ()
#6  0x00007f14c50405e4 in ?? ()
#7  0x0000000000000000 in ?? ()
(gdb)

Inside LXC:

Code:
# ps aux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.1  15488  1716 ?        Ss   Aug22   0:00 init [2]  
root       814  0.0  0.1   8436  1840 ?        S    Aug22   0:00 /usr/sbin/syslogd --no-forward
root       857  0.0  0.5 214708  5456 ?        Ss   Aug22   0:00 /usr/sbin/nmbd -D
message+   873  0.0  0.1  42008  1956 ?        Ss   Aug22   0:00 /usr/bin/dbus-daemon --system
root       874  0.0  0.2  51008  2820 ?        Ss   Aug22   0:00 /usr/sbin/sshd
root       882  0.0  0.1  16040  1856 ?        Ss   Aug22   0:00 /usr/sbin/xinetd -pidfile /run/xinetd.pid -stayalive -inetd_compat -inetd_ipv6
root       913  0.0  1.2 317492 13544 ?        Ss   Aug22   0:01 /usr/sbin/smbd -D
root       916  0.0  0.4 308864  4504 ?        S    Aug22   0:01 /usr/sbin/smbd -D
root       919  0.0  0.4 308880  4264 ?        S    Aug22   0:00 /usr/sbin/smbd -D
root      1000  0.0  0.1  21156  1876 ?        Ss   Aug22   0:00 /usr/sbin/cron
root      1024  0.0  0.1   6344  1600 tty1     Ss+  Aug22   0:00 /sbin/getty --noclear 38400 tty1
root      1025  0.0  0.1   6344  1572 tty2     Ss+  Aug22   0:00 /sbin/getty --noclear 38400 tty2
root      1028  0.0  0.5 318000  5924 ?        S    Aug22   0:00 /usr/sbin/smbd -D
root     14852  0.0  2.0 333992 21464 ?        S    11:16   0:00 /usr/sbin/smbd -D
root     22149  0.0  1.7 334704 18468 ?        S    14:24   0:00 /usr/sbin/smbd -D
root     23968  0.0  1.0 274876 11168 ?        Ss   14:59   0:00 /usr/sbin/winbindd
root     23969  0.0  1.3 401972 14044 ?        S    14:59   0:00 /usr/sbin/winbindd
root     23975  0.0  0.9 275484  9520 ?        S    15:00   0:00 /usr/sbin/winbindd
root     24325  0.0  1.8 335104 19768 ?        S    15:10   0:00 /usr/sbin/smbd -D
root     25708  0.0  0.2  56944  2504 ?        S    16:00   0:00 /usr/sbin/CRON
root     25709  0.0  1.9 332208 20252 ?        S    16:00   0:43 /usr/sbin/smbd -D
root     25711  0.0  0.2  56944  2504 ?        S    16:01   0:00 /usr/sbin/CRON
root     25767  0.0  0.2  56944  2504 ?        S    16:02   0:00 /usr/sbin/CRON
root     25769  0.0  0.2  56944  2504 ?        S    16:03   0:00 /usr/sbin/CRON
root     25825  0.0  0.2  56944  2504 ?        S    16:04   0:00 /usr/sbin/CRON
root     25827  0.0  0.2  56944  2504 ?        S    16:05   0:00 /usr/sbin/CRON
root     25883  0.0  0.2  56944  2504 ?        S    16:06   0:00 /usr/sbin/CRON
root     25886  0.0  0.3  16552  3944 ?        Ss   16:06   0:00 /bin/bash
root     25892  0.0  0.1  12780  2048 ?        R+   16:07   0:00 ps aux

One of those crons:

Code:
# cat /proc/25708/stack
[<ffffffffaca6b7d9>] unix_wait_for_peer+0x99/0xe0
[<ffffffffaca6fa59>] unix_dgram_sendmsg+0x6a9/0x700
[<ffffffffac99fb28>] sock_sendmsg+0x38/0x50
[<ffffffffac9a0131>] SYSC_sendto+0x101/0x190
[<ffffffffac9a0c8e>] SyS_sendto+0xe/0x10
[<ffffffffac203b6b>] do_syscall_64+0x5b/0xc0
[<ffffffffacafa4af>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff
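For what it's worth, a stack ending in unix_wait_for_peer / unix_dgram_sendmsg means the process is blocked sending on a local datagram socket (often /dev/log, i.e. syslog); a quick, hedged way to see which socket the blocked cron is actually holding:

Code:
# list the open file descriptors of the blocked cron job (PID 25708 from above)
ls -l /proc/25708/fd
# show unix domain sockets together with the processes holding them
ss -xp | grep cron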
 
That looks like a different problem, unrelated to FUSE/lxcfs. Anything else that looks wrong in the journal? Are all containers affected, or just some with specific distros, workloads, or software installed? How is the general system load?
 
System load (I/O, as this is a file server ;)) looks high in this case:

[Screenshot: Bildschirmfoto vom 2017-08-24 09-58-21.png]
But we also encounter this across the whole cluster, on systems with low load and on random containers.

[Screenshot: Bildschirmfoto vom 2017-08-24 10-03-39.png]

The most affected containers are:
- postfix relay
- samba domain controller (plus dhcp server)
- samba file server
- databases (mysql/postgres)
- check_mk instance (running all sorts off daemons)

There's no pattern that I can see; it's just that when this occurs, a lot of cron PIDs are a common symptom.
 
