lxcfs segfaulting after upgrade to 6.2 (lxc 4.0)

Hi there,

We just upgraded to the 6.2 release with LXC 4.0 and, after running about 250 containers on each node, we now get the following error (previous thread: https://forum.proxmox.com/threads/lxcfs-br0ke-cgroup-limit.69015/#post-309442):

Code:
root@lxc-prox4:~# grep -A 5 -B 5 lxcfs /var/log/messages
Jun 15 12:31:15 lxc-prox4 kernel: [ 2007.338970] perf: interrupt took too long (3191 > 3168), lowering kernel.perf_event_max_sample_rate to 62500
Jun 15 12:36:15 lxc-prox4 kernel: [ 2307.893094] perf: interrupt took too long (3996 > 3988), lowering kernel.perf_event_max_sample_rate to 50000
Jun 15 12:40:58 lxc-prox4 kernel: [ 2590.728211] perf: interrupt took too long (5032 > 4995), lowering kernel.perf_event_max_sample_rate to 39500
Jun 15 12:43:23 lxc-prox4 kernel: [ 2735.065076] TCP: request_sock_TCP: Possible SYN flooding on port 465. Sending cookies. Check SNMP counters.
Jun 15 12:47:50 lxc-prox4 kernel: [ 3002.370383] perf: interrupt took too long (6291 > 6290), lowering kernel.perf_event_max_sample_rate to 31750
Jun 15 12:50:27 lxc-prox4 kernel: [ 3159.108157] lxcfs[4011170]: segfault at 0 ip 00007f2a08d8f99c sp 00007f2242ffc7c0 error 4 in liblxcfs.so[7f2a08d82000+14000]
Jun 15 12:50:27 lxc-prox4 kernel: [ 3159.108177] Code: 25 80 80 80 80 74 e9 89 c2 c1 ea 10 a9 80 80 00 00 0f 44 c2 48 8d 53 02 48 0f 44 da 89 c6 40 00 c6 58 5a 48 83 db 03 4c 29 eb <41> 80 3e 00 75 25 eb 4a 0f 1f 40 00 be 0a 00 00 00 4c 89 f7 e8 ab
Jun 15 12:50:27 lxc-prox4 kernel: [ 3159.513823] new mount options do not match the existing superblock, will be ignored
Jun 15 13:07:36 lxc-prox4 kernel: [ 4188.598719] vmbr0: port 112(veth559i0) entered disabled state
Jun 15 13:08:00 lxc-prox4 kernel: [ 4212.297433] vmbr0: port 112(veth559i0) entered disabled state
Jun 15 13:08:00 lxc-prox4 kernel: [ 4212.299033] device veth559i0 left promiscuous mode

See the attached journal output as well (journalctl -u lxcfs).

Any idea why this happens?
 

Attachments

  • lxcfs.log
    558.7 KB
pveversion -v

Code:
proxmox-ve: 6.2-1 (running kernel: 5.4.41-1-pve)
pve-manager: 6.2-4 (running version: 6.2-4/9824574a)
pve-kernel-5.4: 6.2-2
pve-kernel-helper: 6.2-2
pve-kernel-5.3: 6.1-6
pve-kernel-5.0: 6.0-11
pve-kernel-5.4.41-1-pve: 5.4.41-1
pve-kernel-4.15: 5.4-6
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-5.3.13-1-pve: 5.3.13-1
pve-kernel-5.3.10-1-pve: 5.3.10-1
pve-kernel-5.0.21-5-pve: 5.0.21-10
pve-kernel-4.15.18-18-pve: 4.15.18-44
pve-kernel-4.15.18-12-pve: 4.15.18-36
ceph: 14.2.9-pve1
ceph-fuse: 14.2.9-pve1
corosync: 3.0.3-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: residual config
ifupdown2: 2.0.1-1+pve8
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.15-pve1
libproxmox-acme-perl: 1.0.4
libpve-access-control: 6.1-1
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.1-2
libpve-guest-common-perl: 3.0-10
libpve-http-server-perl: 3.0-5
libpve-storage-perl: 6.1-8
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.2-1
lxcfs: 4.0.3-pve2
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.2-1
pve-cluster: 6.1-8
pve-container: 3.1-6
pve-docs: 6.2-4
pve-edk2-firmware: 2.20200229-1
pve-firewall: 4.1-2
pve-firmware: 3.1-1
pve-ha-manager: 3.0-9
pve-i18n: 2.1-2
pve-qemu-kvm: 5.0.0-2
pve-xtermjs: 4.3.0-1
qemu-server: 6.2-2
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.4-pve1
 
Inside the lxcfs process's fd directory it seems that something is running away/leaking: we hit 1024 FDs, whereas normally the FD count stays around 20.
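A rough way to watch this (assuming a single lxcfs process; the 1000 threshold is just an arbitrary warning level for illustration):

Code:
#!/bin/bash
# Count the open file descriptors of the lxcfs process and warn when it
# approaches the default soft limit of 1024.
PID=$(pidof lxcfs)
while sleep 60; do
    COUNT=$(ls "/proc/$PID/fd" | wc -l)
    echo "$(date -Is) lxcfs open fds: $COUNT"
    if [ "$COUNT" -gt 1000 ]; then
        echo "WARNING: lxcfs is close to its file descriptor limit"
    fi
done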
 

Attachments

  • lxc-4.fds.txt
    68 KB
We did some adjustments (mainly setting LimitNOFILE to a higher value) and observed the following behaviour:

Everything works well until we run about 220 guests per node (~20-25 FDs used); then the Prometheus node_exporter in every running guest produces too much "noise" (scraped every 120 seconds), and lxcfs "burns" FDs until it reaches the NOFILE limit and segfaults after a while (not deterministic, between 2 and 20 minutes).
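For reference, the LimitNOFILE bump was done with a systemd drop-in along these lines (the value 65535 is just an illustrative number, adjust to taste; also note that restarting lxcfs on a node with running containers should be done with care):

Code:
# Drop-in that raises the file descriptor limit for the lxcfs service
mkdir -p /etc/systemd/system/lxcfs.service.d
cat > /etc/systemd/system/lxcfs.service.d/limits.conf <<'EOF'
[Service]
LimitNOFILE=65535
EOF

# Apply the new unit configuration
systemctl daemon-reload
systemctl restart lxcfs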
 
Hi,

There were some upstream fixes regarding leaked FDs and memory in LXCFS; we'll look into them and prepare a package update in the coming days. No promises yet, but it sounds like your issue could be one of the fixed ones.
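Once it's out, you can check with something like the following whether the newer lxcfs build has reached your configured repository:

Code:
apt update
apt-cache policy lxcfs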
 
Hi, I have the same issue with lxcfs on multiple servers.
As a workaround, I have downgraded it to 3.0.3-pve60 on the production servers, and the issue is gone.
There is definitely something wrong with the current lxcfs.
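The downgrade itself was roughly the following (assuming the old version is still available in the repository or your local apt cache; the hold keeps it from being upgraded again):

Code:
apt install lxcfs=3.0.3-pve60
apt-mark hold lxcfs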

Here is one of the recent crashes, before downgrading to 3.0.3-pve60:

Code:
Jun 09 04:55:23 zeus systemd[1]: Started FUSE filesystem for LXC.
Jun 09 04:55:23 zeus lxcfs[599]: Running constructor lxcfs_init to reload liblxcfs
Jun 09 04:55:23 zeus lxcfs[599]: mount namespace: 4
Jun 09 04:55:23 zeus lxcfs[599]: hierarchies:
Jun 09 04:55:23 zeus lxcfs[599]: 0: fd: 5:
Jun 09 04:55:23 zeus lxcfs[599]: 1: fd: 6: name=systemd
Jun 09 04:55:23 zeus lxcfs[599]: 2: fd: 7: freezer
Jun 09 04:55:23 zeus lxcfs[599]: 3: fd: 8: blkio
Jun 09 04:55:23 zeus lxcfs[599]: 4: fd: 9: rdma
Jun 09 04:55:23 zeus lxcfs[599]: 5: fd: 10: perf_event
Jun 09 04:55:23 zeus lxcfs[599]: 6: fd: 11: hugetlb
Jun 09 04:55:23 zeus lxcfs[599]: 7: fd: 12: cpuset
Jun 09 04:55:23 zeus lxcfs[599]: 8: fd: 13: net_cls,net_prio
Jun 09 04:55:23 zeus lxcfs[599]: 9: fd: 14: devices
Jun 09 04:55:23 zeus lxcfs[599]: 10: fd: 15: pids
Jun 09 04:55:23 zeus lxcfs[599]: 11: fd: 16: cpu,cpuacct
Jun 09 04:55:23 zeus lxcfs[599]: 12: fd: 17: memory
Jun 09 04:55:23 zeus lxcfs[599]: Kernel supports pidfds
Jun 09 04:55:23 zeus lxcfs[599]: api_extensions:
Jun 09 04:55:23 zeus lxcfs[599]: - cgroups
Jun 09 04:55:23 zeus lxcfs[599]: - sys_cpu_online
Jun 09 04:55:23 zeus lxcfs[599]: - proc_cpuinfo
Jun 09 04:55:23 zeus lxcfs[599]: - proc_diskstats
Jun 09 04:55:23 zeus lxcfs[599]: - proc_loadavg
Jun 09 04:55:23 zeus lxcfs[599]: - proc_meminfo
Jun 09 04:55:23 zeus lxcfs[599]: - proc_stat
Jun 09 04:55:23 zeus lxcfs[599]: - proc_swaps
Jun 09 04:55:23 zeus lxcfs[599]: - proc_uptime
Jun 09 04:55:23 zeus lxcfs[599]: - shared_pidns
Jun 09 04:55:23 zeus lxcfs[599]: - cpuview_daemon
Jun 09 04:55:23 zeus lxcfs[599]: - loadavg_daemon
Jun 09 04:55:23 zeus lxcfs[599]: - pidfds
Jun 12 06:25:45 zeus lxcfs[599]: free(): double free detected in tcache 2
Jun 12 06:25:45 zeus systemd[1]: lxcfs.service: Main process exited, code=killed, status=6/ABRT
 
