lxcfs segfaulting after upgrade to 6.2 (lxc 4.0)

Hi there,

We just upgraded to the 6.2 release (with lxc 4.0), and after running about 250 containers on each node we now get the following error (previous thread: https://forum.proxmox.com/threads/lxcfs-br0ke-cgroup-limit.69015/#post-309442):

Code:
root@lxc-prox4:~# grep -A 5 -B 5 lxcfs /var/log/messages
Jun 15 12:31:15 lxc-prox4 kernel: [ 2007.338970] perf: interrupt took too long (3191 > 3168), lowering kernel.perf_event_max_sample_rate to 62500
Jun 15 12:36:15 lxc-prox4 kernel: [ 2307.893094] perf: interrupt took too long (3996 > 3988), lowering kernel.perf_event_max_sample_rate to 50000
Jun 15 12:40:58 lxc-prox4 kernel: [ 2590.728211] perf: interrupt took too long (5032 > 4995), lowering kernel.perf_event_max_sample_rate to 39500
Jun 15 12:43:23 lxc-prox4 kernel: [ 2735.065076] TCP: request_sock_TCP: Possible SYN flooding on port 465. Sending cookies. Check SNMP counters.
Jun 15 12:47:50 lxc-prox4 kernel: [ 3002.370383] perf: interrupt took too long (6291 > 6290), lowering kernel.perf_event_max_sample_rate to 31750
Jun 15 12:50:27 lxc-prox4 kernel: [ 3159.108157] lxcfs[4011170]: segfault at 0 ip 00007f2a08d8f99c sp 00007f2242ffc7c0 error 4 in liblxcfs.so[7f2a08d82000+14000]
Jun 15 12:50:27 lxc-prox4 kernel: [ 3159.108177] Code: 25 80 80 80 80 74 e9 89 c2 c1 ea 10 a9 80 80 00 00 0f 44 c2 48 8d 53 02 48 0f 44 da 89 c6 40 00 c6 58 5a 48 83 db 03 4c 29 eb <41> 80 3e 00 75 25 eb 4a 0f 1f 40 00 be 0a 00 00 00 4c 89 f7 e8 ab
Jun 15 12:50:27 lxc-prox4 kernel: [ 3159.513823] new mount options do not match the existing superblock, will be ignored
Jun 15 13:07:36 lxc-prox4 kernel: [ 4188.598719] vmbr0: port 112(veth559i0) entered disabled state
Jun 15 13:08:00 lxc-prox4 kernel: [ 4212.297433] vmbr0: port 112(veth559i0) entered disabled state
Jun 15 13:08:00 lxc-prox4 kernel: [ 4212.299033] device veth559i0 left promiscuous mode

See the attached output from the journal as well (journalctl -u lxcfs).

Any idea why this happens?
 

pveversion -v

Code:
proxmox-ve: 6.2-1 (running kernel: 5.4.41-1-pve)
pve-manager: 6.2-4 (running version: 6.2-4/9824574a)
pve-kernel-5.4: 6.2-2
pve-kernel-helper: 6.2-2
pve-kernel-5.3: 6.1-6
pve-kernel-5.0: 6.0-11
pve-kernel-5.4.41-1-pve: 5.4.41-1
pve-kernel-4.15: 5.4-6
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-5.3.13-1-pve: 5.3.13-1
pve-kernel-5.3.10-1-pve: 5.3.10-1
pve-kernel-5.0.21-5-pve: 5.0.21-10
pve-kernel-4.15.18-18-pve: 4.15.18-44
pve-kernel-4.15.18-12-pve: 4.15.18-36
ceph: 14.2.9-pve1
ceph-fuse: 14.2.9-pve1
corosync: 3.0.3-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: residual config
ifupdown2: 2.0.1-1+pve8
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.15-pve1
libproxmox-acme-perl: 1.0.4
libpve-access-control: 6.1-1
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.1-2
libpve-guest-common-perl: 3.0-10
libpve-http-server-perl: 3.0-5
libpve-storage-perl: 6.1-8
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.2-1
lxcfs: 4.0.3-pve2
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.2-1
pve-cluster: 6.1-8
pve-container: 3.1-6
pve-docs: 6.2-4
pve-edk2-firmware: 2.20200229-1
pve-firewall: 4.1-2
pve-firmware: 3.1-1
pve-ha-manager: 3.0-9
pve-i18n: 2.1-2
pve-qemu-kvm: 5.0.0-2
pve-xtermjs: 4.3.0-1
qemu-server: 6.2-2
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.4-pve1
 
We made some adjustments (mainly setting LimitNOFILE to a higher value) and observed the following behaviour:

Everything works well until we run about 220 guests per node (~20-25 FDs used). At that point the prometheus node_exporter running in every guest produces too much "noise" (it is scraped every 120 seconds), and lxcfs "burns" FDs until it hits the NOFILE limit and segfaults after a while (not deterministic, anywhere between 2 and 20 minutes).
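For reference, a minimal sketch of the kind of override we mean; the drop-in path and the value here are only examples, adjust them to your environment:

Code:
# /etc/systemd/system/lxcfs.service.d/limits.conf  (example drop-in)
[Service]
LimitNOFILE=1048576

After a systemctl daemon-reload the new limit applies once lxcfs has been restarted (keep in mind that a full restart of lxcfs affects running containers). The current FD usage of lxcfs can be watched with something like ls /proc/$(pidof lxcfs)/fd | wc -l.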
 
Hi,

There were some upstream fixes regarding leaked FDs and memory in LXCFS. We'll look into them and prepare a package update in the coming days (no promises yet), but it sounds like your issue could be one of the fixed ones.
 
Hi, I have the same issue with lxcfs on multiple servers.
As a workaround, I have downgraded it to 3.0.3-pve60 on the production servers and the issue is gone.
There is definitely something wrong with the current lxcfs.
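In case it helps, roughly what the downgrade looks like, assuming the old version is still available from the repository or the local apt cache:

Code:
apt install lxcfs=3.0.3-pve60
apt-mark hold lxcfs    # keep it from being upgraded again until this is fixed

You may want to try this on a non-critical node first, since running containers rely on the lxcfs FUSE mounts.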

Here is one of the recent crashes, before downgrading to 3.0.3-pve60:
Code:
Jun 09 04:55:23 zeus systemd[1]: Started FUSE filesystem for LXC.
Jun 09 04:55:23 zeus lxcfs[599]: Running constructor lxcfs_init to reload liblxcfs
Jun 09 04:55:23 zeus lxcfs[599]: mount namespace: 4
Jun 09 04:55:23 zeus lxcfs[599]: hierarchies:
Jun 09 04:55:23 zeus lxcfs[599]: 0: fd: 5:
Jun 09 04:55:23 zeus lxcfs[599]: 1: fd: 6: name=systemd
Jun 09 04:55:23 zeus lxcfs[599]: 2: fd: 7: freezer
Jun 09 04:55:23 zeus lxcfs[599]: 3: fd: 8: blkio
Jun 09 04:55:23 zeus lxcfs[599]: 4: fd: 9: rdma
Jun 09 04:55:23 zeus lxcfs[599]: 5: fd: 10: perf_event
Jun 09 04:55:23 zeus lxcfs[599]: 6: fd: 11: hugetlb
Jun 09 04:55:23 zeus lxcfs[599]: 7: fd: 12: cpuset
Jun 09 04:55:23 zeus lxcfs[599]: 8: fd: 13: net_cls,net_prio
Jun 09 04:55:23 zeus lxcfs[599]: 9: fd: 14: devices
Jun 09 04:55:23 zeus lxcfs[599]: 10: fd: 15: pids
Jun 09 04:55:23 zeus lxcfs[599]: 11: fd: 16: cpu,cpuacct
Jun 09 04:55:23 zeus lxcfs[599]: 12: fd: 17: memory
Jun 09 04:55:23 zeus lxcfs[599]: Kernel supports pidfds
Jun 09 04:55:23 zeus lxcfs[599]: api_extensions:
Jun 09 04:55:23 zeus lxcfs[599]: - cgroups
Jun 09 04:55:23 zeus lxcfs[599]: - sys_cpu_online
Jun 09 04:55:23 zeus lxcfs[599]: - proc_cpuinfo
Jun 09 04:55:23 zeus lxcfs[599]: - proc_diskstats
Jun 09 04:55:23 zeus lxcfs[599]: - proc_loadavg
Jun 09 04:55:23 zeus lxcfs[599]: - proc_meminfo
Jun 09 04:55:23 zeus lxcfs[599]: - proc_stat
Jun 09 04:55:23 zeus lxcfs[599]: - proc_swaps
Jun 09 04:55:23 zeus lxcfs[599]: - proc_uptime
Jun 09 04:55:23 zeus lxcfs[599]: - shared_pidns
Jun 09 04:55:23 zeus lxcfs[599]: - cpuview_daemon
Jun 09 04:55:23 zeus lxcfs[599]: - loadavg_daemon
Jun 09 04:55:23 zeus lxcfs[599]: - pidfds
Jun 12 06:25:45 zeus lxcfs[599]: free(): double free detected in tcache 2
Jun 12 06:25:45 zeus systemd[1]: lxcfs.service: Main process exited, code=killed, status=6/ABRT
 
