lxcfs.service: Main process exited code=killed, status=11/SEGV

michaelj · Nov 29, 2020

Hi Community,

I wanted to know if anyone has ever encountered the following problem:

logs:

Code:

2020-11-29T04:37:30.054129+01:00        hv18 systemd[1]: lxcfs.service: Main process exited,        code=killed, status=11/SEGV
     2020-11-29T04:37:30.094752+01:00 hv18 systemd[1]: lxcfs.service:      Unit entered failed state.
     2020-11-29T04:37:30.095112+01:00 hv18 systemd[1]: lxcfs.service:      Failed with result 'signal'.
     2020-11-29T04:37:30.423594+01:00 hv18 systemd[1]: lxcfs.service:      Service hold-off time over, scheduling restart.
     2020-11-29T04:37:30.424265+01:00 hv18 systemd[1]: Stopped FUSE        filesystem for LXC.
     2020-11-29T04:37:30.424911+01:00 hv18 systemd[1]: Started FUSE        filesystem for LXC.
     2020-11-29T04:37:30.457851+01:00 hv18 lxcfs[18507]: mount      namespace: 5
     2020-11-29T04:37:30.458194+01:00 hv18 lxcfs[18507]: hierarchies:
     2020-11-29T04:37:30.458425+01:00 hv18 lxcfs[18507]:   0: fd:   6:      rdma
     2020-11-29T04:37:30.458632+01:00 hv18 lxcfs[18507]:   1: fd:   7:      net_cls,net_prio
     2020-11-29T04:37:30.458827+01:00 hv18 lxcfs[18507]:   2: fd:   8:      perf_event
     2020-11-29T04:37:30.458985+01:00 hv18 lxcfs[18507]:   3: fd:   9:      freezer
     2020-11-29T04:37:30.459184+01:00 hv18 lxcfs[18507]:   4: fd:  10:      hugetlb
     2020-11-29T04:37:30.459388+01:00 hv18 lxcfs[18507]:   5: fd:  11:      cpuset
     2020-11-29T04:37:30.459607+01:00 hv18 lxcfs[18507]:   6: fd:  12:      cpu,cpuacct
     2020-11-29T04:37:30.459809+01:00 hv18 lxcfs[18507]:   7: fd:  13:      blkio
     2020-11-29T04:37:30.459973+01:00 hv18 lxcfs[18507]:   8: fd:  14:      pids
     2020-11-29T04:37:30.460179+01:00 hv18 lxcfs[18507]:   9: fd:  15:      devices
     2020-11-29T04:37:30.460371+01:00 hv18 lxcfs[18507]:  10: fd:  16:      memory
     2020-11-29T04:37:30.460572+01:00 hv18 lxcfs[18507]:  11: fd:  17:      name=systemd
     2020-11-29T04:37:30.460816+01:00 hv18 lxcfs[18507]:  12: fd:  18:      unified

The consequence on lxc containers :

Code:

Error: /proc must be mounted
  To mount /proc at boot you need an /etc/fstab line like:
      proc   /proc   proc    defaults
  In the meantime, run "mount proc /proc -t proc"

 /proc/cpuinfo is not accessible: Transport endpoint is not connected

Informations :

proxmox-ve: 5.4-2 (running kernel: 4.15.18-30-pve)
pve-manager: 5.4-15 (running version: 5.4-15/d0ec33c6)
corosync: 2.4.4-pve1
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.1-12
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-56
libpve-guest-common-perl: 2.0-20
libpve-http-server-perl: 2.0-14
libpve-storage-perl: 5.0-44
libqb0: 1.0.3-1~bpo9
lvm2: 2.02.168-pve6
lxc-pve: 3.1.0-7
lxcfs: 3.0.3-pve1
novnc-pve: 1.0.0-3
proxmox-widget-toolkit: 1.0-28
pve-cluster: 5.0-38
pve-container: 2.0-42
pve-docs: 5.4-2
pve-edk2-firmware: 1.20190312-1
pve-firewall: 3.0-22
pve-firmware: 2.0-7
pve-ha-manager: 2.0-9
pve-i18n: 1.1-4
pve-libspice-server1: 0.14.1-2
pve-qemu-kvm: 3.0.1-4
pve-xtermjs: 3.12.0-1
qemu-server: 5.0-56
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.13-pve1~bpo2

If i umount /proc then remount it inside lxc container, the errors disappear but I would like to know the origin of the problem and know if we can correct it?

Plus, the problem happens randomly.

Regards,

Michael.

oguz · Nov 30, 2020

hi,

your node is outdated (5.x is EOL, see update instructions [0]), meanwhile for these pakcages we're at

Code:

lxc-pve: 4.0.3-1
lxcfs: 4.0.3-pve3

[0]: https://pve.proxmox.com/wiki/Upgrade_from_5.x_to_6.0

michaelj · Nov 30, 2020

Thanks Oguz, we will plan to switch to the v6.

Regards.

michaelj · Mar 17, 2021

Hi @oguz ,

I comment again on this case because I encountered this error on an updated version.

logs on the host :

Code:

`Mar 16 15:55:00 hvr2 kernel: lxcfs20228: segfault at 8018 ip 00007fcecfa6e00e sp 00007fceaf7fdaa0 error 4 in liblxcfs.so[7fcecfa5e000+14000]
Mar 16 15:55:00 hvr2 kernel: Code: 41 57 41 56 41 55 41 54 53 48 81 ec d8 00 00 00 48 89 8d 38 ff ff ff 49 8b 58 18 64 48 8b 04 25 28 00 00 00 48 89 45 c8 31 c0 <83> 7b 18 0
Mar 16 15:55:00 hvr2 systemd1: lxcfs.service: Main process exited, code=killed, status=11/SEGV
Mar 16 15:55:00 hvr2 systemd1: Starting Proxmox VE replication runner...
Mar 16 15:55:00 hvr2 systemd1: var-lib-lxcfs.mount: Succeeded.
Mar 16 15:55:00 hvr2 systemd1: lxcfs.service: Failed with result 'signal'.
Mar 16 15:55:00 hvr2 systemd1: lxcfs.service: Service RestartSec=100ms expired, scheduling restart.
Mar 16 15:55:00 hvr2 systemd1: lxcfs.service: Scheduled restart job, restart counter is at 1.
Mar 16 15:55:00 hvr2 systemd1: Stopped FUSE filesystem for LXC.
Mar 16 15:55:00 hvr2 systemd1: Started FUSE filesystem for LXC.
Mar 16 15:55:00 hvr2 lxcfs27186: Running constructor lxcfs_init to reload liblxcfs

Code:

pveversion
pve-manager/6.3-6/2184247e (running kernel: 5.4.78-2-pve)

dpkg -l |grep lxc
ii  lxc-pve                              4.0.6-2                      amd64        Linux containers userspace tools
ii  lxcfs                                4.0.6-pve1                   amd64        LXC userspace filesystem
ii  pve-lxc-syscalld                     0.9.1-1                      amd64        PVE LXC syscall daemon

Can we dig this error ?

Regards.

fiona · Mar 18, 2021

Hi,
could you try and install the debug symbols apt install lxcfs-dbgsym and systemd-coredump? When the crash happens next time, there should be a core dump in /var/lib/systemd/coredump. It would be great if you could provide that. And please provide the exact version that is installed at that time.

michaelj · Mar 22, 2021

Hi @Fabian_E,

I had the problem again (twice) on proxmox v5, can you still have a look?

Code:

ii  lxcfs-dbgsym                         3.0.3-pve1                        amd64        Debug symbols for lxcfs
ii  systemd-coredump                     232-25+deb9u12                    amd64        tools for storing and retrieving coredumps

As soon as the problem reappears on v6, I will post the same informations.

Regards.

fiona · Mar 23, 2021

AFAICT this looks like a memory corruption is going on (whose origin might be in a lot of places), so sadly, it's not easy to debug.

michaelj · May 10, 2021

Hi @Fabian_E,

This happened to me again severals time recently, i understand your last answer but how can we try to find the origin of the problem?

This is problematic in a production environment, we are obliged to stop star the containers.

Code:

proxmox-ve: 6.4-1 (running kernel: 5.4.78-2-pve)
pve-manager: 6.4-5 (running version: 6.4-5/6c7bf5de)
pve-kernel-5.4: 6.4-1
pve-kernel-helper: 6.4-1
pve-kernel-5.4.106-1-pve: 5.4.106-1
pve-kernel-5.4.103-1-pve: 5.4.103-1
pve-kernel-5.4.101-1-pve: 5.4.101-1
pve-kernel-5.4.98-1-pve: 5.4.98-1
pve-kernel-5.4.78-2-pve: 5.4.78-2
pve-kernel-5.4.73-1-pve: 5.4.73-1
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.1.2-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.20-pve1
libproxmox-acme-perl: 1.0.8
libproxmox-backup-qemu0: 1.0.3-1
libpve-access-control: 6.4-1
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.4-2
libpve-guest-common-perl: 3.1-5
libpve-http-server-perl: 3.2-1
libpve-storage-perl: 6.4-1
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.6-2
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.1.5-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.5-3
pve-cluster: 6.4-1
pve-container: 3.3-5
pve-docs: 6.4-1
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.2-2
pve-ha-manager: 3.1-1
pve-i18n: 2.3-1
pve-qemu-kvm: 5.2.0-6
pve-xtermjs: 4.7.0-3
qemu-server: 6.4-2
smartmontools: 7.2-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 2.0.4-pve1

oguz · May 11, 2021

michaelj said:
proxmox-ve: 6.4-1 (running kernel: 5.4.78-2-pve)
pve-manager: 6.4-5 (running version: 6.4-5/6c7bf5de)
pve-kernel-5.4: 6.4-1
pve-kernel-helper: 6.4-1
pve-kernel-5.4.106-1-pve: 5.4.106-1

please reboot your machine (kernel upgrade) and see if problem persists

michaelj · May 19, 2021

Hi @oguz ,

I restarted the server on a recent kernel proxmox-ve: 6.4-1 (running kernel: 5.4.106-1-pve), but the problem occurred again.

Do you have other leads ?

oguz · May 19, 2021

michaelj said:
Hi @oguz ,

I restarted the server on a recent kernel proxmox-ve: 6.4-1 (running kernel: 5.4.106-1-pve), but the problem occurred again.

Do you have other leads ?

thanks for reporting

could you provide the coredump from the new crash?
this should help us identifying where or why the crash is happening.

please also provide the full output of pveversion -v again.

i also realize you haven't given us much info about your setup.

* how many containers do you have? and how many are running at the time of the crash?
* it would also be helpful if you could provide some container configs from the affected machine (pct config CTID. more interesting would be if you could identify which container is causing it, and provide the config for that one.
* how often did/does this crash happen?

michaelj · May 19, 2021

thanks for answering

could you provide the coredump from the new crash?

Unfortunately the folder is empty in /var/lib/systemd/coredump, and a search for a filename starting with 'core.lxcfs' does not return any results, while packages are installed:

Code:

ii lxcfs-dbgsym 4.0.6-pve1 amd64 debug symbols for lxcfs
ii systemd-coredump 241-7 ~ deb10u7 amd64 tools for storing and retrieving coredumps

Do you have any idea why the directory /var/lib/systemd/coredump is empty?

please also provide the full output of pveversion -v again.

Code:

proxmox-ve: 6.4-1 (running kernel: 5.4.106-1-pve)
pve-manager: 6.4-6 (running version: 6.4-6/be2fa32c)
pve-kernel-5.4: 6.4-2
pve-kernel-helper: 6.4-2
pve-kernel-5.4.114-1-pve: 5.4.114-1
pve-kernel-5.4.106-1-pve: 5.4.106-1
pve-kernel-5.4.103-1-pve: 5.4.103-1
pve-kernel-5.4.101-1-pve: 5.4.101-1
pve-kernel-5.4.98-1-pve: 5.4.98-1
pve-kernel-5.4.78-2-pve: 5.4.78-2
pve-kernel-5.4.73-1-pve: 5.4.73-1
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.1.2-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.20-pve1
libproxmox-acme-perl: 1.1.0
libproxmox-backup-qemu0: 1.0.3-1
libpve-access-control: 6.4-1
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.4-3
libpve-guest-common-perl: 3.1-5
libpve-http-server-perl: 3.2-2
libpve-storage-perl: 6.4-1
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.6-2
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.1.6-2
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.5-4
pve-cluster: 6.4-1
pve-container: 3.3-5
pve-docs: 6.4-2
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.2-3
pve-ha-manager: 3.1-1
pve-i18n: 2.3-1
pve-qemu-kvm: 5.2.0-6
pve-xtermjs: 4.7.0-3
qemu-server: 6.4-2
smartmontools: 7.2-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 2.0.4-pve1

* how many containers do you have? and how many are running at the time of the crash?

Our recent crash are on two servers, one hosts 6 containers, the other 8, the other servers host the same types of containers as well, very little change.

All the containers are running at the time of the crash.

* it would also be helpful if you could provide some container configs from the affected machine (pct config CTID. more interesting would be if you could identify which container is causing it, and provide the config for that one.

We are unable to identify one container more than another.

Here is the pct config of the containers which is similar between the two servers

Code:

arch: amd64
cores: 8
features: nesting=1
hostname: xxx
memory: 6096
mp0: /apps/scripts,mp=/apps/scripts
mp1: /share,mp=/share
net0: name=eth3,bridge=vmbr2,hwaddr=ddd:01:2F:DC:8D,ip=yyyyy/16,type=veth
net1: name=eth2,bridge=vmbr2,gw=xxx,hwaddr=eeee1:BC:EE:F2,ip=eeee/27,type=veth
onboot: 1
ostype: debian
rootfs: local-zfs:subvol-319-disk-0,size=20G
swap: 4096
lxc.prlimit.nofile: 65536

-----


arch: amd64
cpulimit: 8
hostname: yyyy
memory: 20000
mp0: /apps/scripts,mp=/apps/scripts
mp1: /apps/releases,mp=/apps/releases
mp2: /opt/sources,mp=/opt/sources
mp3: /ftp,mp=/ftp
mp4: /share,mp=/share
mp5: /nfslogs,mp=/nfslogs
mp6: /logs/538,mp=/logs
nameserver: xxxxx
net0: name=eth3,bridge=vmbr2,hwaddr=eee:F1:F1:EA,ip=xxxx/16,type=veth
onboot: 1
ostype: debian
rootfs: local-zfs:subvol-538-disk-0,size=15G
searchdomain: vrack
swap: 2048

* how often did/does this crash happen?

This is random, on May's month, once each on these two servers.

Thanks for your help !

oguz · May 19, 2021

thanks for the infos.

michaelj said:
Do you have any idea why the directory /var/lib/systemd/coredump is empty?

hmmm,

* can you post the journal entry with the crash like last time?

* what do you get if you run coredumpctl?

* can you check also the other servers (maybe it crashed on a different node) ?

if there's nothing, i suggest to install the lxcfs-dbgsym and systemd-coredump on all your nodes, and we can catch it next time it happens.

michaelj · May 19, 2021

* can you post the journal entry with the crash like last time?

Journactl's system log retention was not active, I just did it for the next few times.

* what do you get if you run coredumpctl?

No coredumps found.

* can you check also the other servers (maybe it crashed on a different node) ?

Both packages are installed on all nodes.

We will therefore wait for the next crash to have a few logs.

Quick question : usually we have the following errors in the containers once the problem occurs:

Code:

Error: / proc must be mounted
To mount / proc at boot you need an / etc / fstab line like:
proc / proc proc defaults

Is it healthy and if possible to re-mount inside the containers "/proc" with the commandline ? Or does the container have to be restarted to resume a safe state?

Regards.

oguz · May 20, 2021

you can remount it like mount proc /proc -t proc but i'm not sure if the container will work without problems.

i'll wait for the coredump from your side to debug this further for now

michaelj · Jun 14, 2021

Hi @oguz,

I hope you are doing well.

I got the same message again, I have the same logs in journalctl but in coredumpctl I do not see anything related to lxcfs.

Code:

Jun 13 15:35:45 hvr2 kernel: lxcfs[12664]: segfault at 8018 ip 00007f389acfe00e sp 00007f38617f9aa0 error 4 in liblxcfs.so[7f389acee000+14000]
Jun 13 15:40:45 hvr2 systemd[1]: lxcfs.service: Main process exited, code=killed, status=11/SEGV
Jun 13 15:40:45 hvr2 systemd[1]: var-lib-lxcfs.mount: Succeeded.
Jun 13 15:40:45 hvr2 systemd[1]: lxcfs.service: Failed with result 'signal'.
Jun 13 15:40:45 hvr2 systemd[1]: lxcfs.service: Service RestartSec=100ms expired, scheduling restart.
Jun 13 15:40:45 hvr2 systemd[1]: lxcfs.service: Scheduled restart job, restart counter is at 2.
Jun 13 15:40:45 hvr2 lxcfs[17136]: Running constructor lxcfs_init to reload liblxcfs
Jun 13 15:40:45 hvr2 lxcfs[17136]: mount namespace: 4
Jun 13 15:40:45 hvr2 lxcfs[17136]: hierarchies:
Jun 13 15:40:45 hvr2 lxcfs[17136]:   0: fd:   5:
Jun 13 15:40:45 hvr2 lxcfs[17136]:   1: fd:   6: name=systemd
Jun 13 15:40:45 hvr2 lxcfs[17136]:   2: fd:   7: rdma
Jun 13 15:40:45 hvr2 lxcfs[17136]:   3: fd:   8: cpuset
Jun 13 15:40:45 hvr2 lxcfs[17136]:   4: fd:   9: net_cls,net_prio
Jun 13 15:40:45 hvr2 lxcfs[17136]:   5: fd:  10: devices
Jun 13 15:40:45 hvr2 lxcfs[17136]:   6: fd:  11: freezer
Jun 13 15:40:45 hvr2 lxcfs[17136]:   7: fd:  12: cpu,cpuacct
Jun 13 15:40:45 hvr2 lxcfs[17136]:   8: fd:  13: memory
Jun 13 15:40:45 hvr2 lxcfs[17136]:   9: fd:  14: pids
Jun 13 15:40:45 hvr2 lxcfs[17136]:  10: fd:  15: blkio
Jun 13 15:40:45 hvr2 lxcfs[17136]:  11: fd:  16: perf_event
Jun 13 15:40:45 hvr2 lxcfs[17136]:  12: fd:  17: hugetlb
Jun 13 15:40:45 hvr2 lxcfs[17136]: Kernel supports pidfds
Jun 13 15:40:45 hvr2 lxcfs[17136]: Kernel supports swap accounting
Jun 13 15:40:45 hvr2 lxcfs[17136]: api_extensions:
Jun 13 15:40:45 hvr2 lxcfs[17136]: - cgroups
Jun 13 15:40:45 hvr2 lxcfs[17136]: - sys_cpu_online
Jun 13 15:40:45 hvr2 lxcfs[17136]: - proc_cpuinfo
Jun 13 15:40:45 hvr2 lxcfs[17136]: - proc_diskstats
Jun 13 15:40:45 hvr2 lxcfs[17136]: - proc_loadavg
Jun 13 15:40:45 hvr2 lxcfs[17136]: - proc_meminfo
Jun 13 15:40:45 hvr2 lxcfs[17136]: - proc_stat
Jun 13 15:40:45 hvr2 lxcfs[17136]: - proc_swaps
Jun 13 15:40:45 hvr2 lxcfs[17136]: - proc_uptime
Jun 13 15:40:45 hvr2 lxcfs[17136]: - shared_pidns
Jun 13 15:40:45 hvr2 lxcfs[17136]: - cpuview_daemon
Jun 13 15:40:45 hvr2 lxcfs[17136]: - loadavg_daemon
Jun 13 15:40:45 hvr2 lxcfs[17136]: - pidfds

coredumpctl :

Code:

Sun 2021-06-13 15:54:40 CEST  18421     0     0   6 present   /usr/sbin/keepalived
Sun 2021-06-13 15:54:40 CEST  18403     0     0   6 present   /usr/sbin/keepalived

I think the keepalived coredump is linked to the stop/start of the virtual machine that I made to restart in order to get a stable state.

I don't understand why i have no coredump related to lxcfs? Do you have another research idea?

Regards.

oguz · Jun 16, 2021

michaelj said:
I don't understand why i have no coredump related to lxcfs? Do you have another research idea?

could you check the contents of /etc/systemd/coredump.conf? it's possible that the coredump is too big (although unlikely, since the default is set to 2G)

have you applied latest upgrades and rebooted the nodes? (just asking since your last pveversion -v output still has a newer kernel installed than the running one). it's also good to restart lxcfs and the containers after a kernel upgrade has been made.

you should restart or migrate the containers soon after the kernel upgrades are performed. ~~restarting lxcfs is also a good idea.~~

it's also worthwhile to create a bug report on https://bugzilla.proxmox.com to keep track of the issue better

EDIT: remove recommendation about restarting lxcfs -- it has a hot-reload mechanism during package upgrades

michaelj · Jun 16, 2021

Hi @oguz,

Thanks for your feedback, this is the content of /etc/systemd/coredump.conf :

Code:

[Coredump]
#Storage=external
#Compress=yes
#ProcessSizeMax=2G
#ExternalSizeMax=2G
#JournalSizeMax=767M
#MaxUse=
#KeepFree=

I've applied the update but haven't rebooted yet as in a production environment it is complicated.

Thank you for your recommendations.

I also created a bug report.

Regards.

oguz · Jun 16, 2021

michaelj said:
Thanks for your feedback, this is the content of /etc/systemd/coredump.conf :

looks normal, really puzzling why there's no crashdump...also not easy since it seems to happen at random.

thanks for creating the bug report!

lxcfs.service: Main process exited code=killed, status=11/SEGV

Renowned Member

Proxmox Retired Staff

Renowned Member

Renowned Member

Proxmox Staff Member

Renowned Member

Attachments

Proxmox Staff Member

Renowned Member

Proxmox Retired Staff

Renowned Member

Proxmox Retired Staff

Renowned Member

Proxmox Retired Staff

Renowned Member

Proxmox Retired Staff

Renowned Member

Proxmox Retired Staff

Renowned Member

Proxmox Retired Staff

We value your privacy