Segfault: error 4 in pmxcfs

erik.deneve

New Member
Mar 16, 2020
Hello,

This morning we had a major incident: all our Proxmox nodes were fenced at the same time.
In the log, this seems to be the problem:
messages.log

Sep 1 10:50:14 hostname kernel: [932130.006753] show_signal_msg: 6 callbacks suppressed
Sep 1 10:50:14 hostname kernel: [932130.006757] cfs_loop[1832]: segfault at 7f7300c07f99 ip 000055bf7f70e7b0 sp 00007f730634f318 error 4 in pmxcfs[55bf7f6f5000+1b000]
Sep 1 10:50:14 hostname kernel: [932130.006771] Code: 10 48 89 c6 48 89 ef 48 89 10 48 8b 53 08 48 89 50 08 48 89 c2 e8 50 74 fe ff b8 01 00 00 00 e9 4a ff ff ff 66 0f 1f 44 00 00 <8b> 47 0c 8b 56 0c 39 d0 75 0d 48 8b 47 10 48 8b 56 10 48 39 d0 74
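For reference, the segfault line itself can be decoded a bit further. Below is a minimal bash sketch using the values from the log above. Subtracting the mapping start from the instruction pointer gives the offset inside the pmxcfs mapping. The x86 page-fault error code 4 has only bit 2 set, which means a user-mode read of a page that is not present (often an invalid or dangling pointer). `decode_segfault` is a made-up helper name. Also note that the offset is relative to the mapping the kernel printed, so it may need adjusting by that segment's file offset before you feed it to a symbolizer such as addr2line.

```shell
#!/usr/bin/env bash
# Decode a kernel line of the form
#   "segfault at ADDR ip IP sp SP error N in BINARY[BASE+SIZE]"
# decode_segfault is a hypothetical helper name, not a Proxmox tool.

decode_segfault() {  # usage: decode_segfault IP BASE ERROR
    local ip=$1 base=$2 err=$3 cause access mode

    # Offset of the faulting instruction inside the mapping the kernel printed.
    printf 'offset into mapping: 0x%x\n' "$(( ip - base ))"

    # x86 page-fault error code bits:
    #   bit 0: 0 = page not present, 1 = protection violation
    #   bit 1: 0 = read,             1 = write
    #   bit 2: 0 = kernel mode,      1 = user mode
    if (( err & 1 )); then cause="protection violation"; else cause="page not present"; fi
    if (( err & 2 )); then access="write"; else access="read"; fi
    if (( err & 4 )); then mode="user"; else mode="kernel"; fi
    printf 'error %d: %s-mode %s, %s\n' "$err" "$mode" "$access" "$cause"
}

# Values from the cfs_loop segfault line above.
decode_segfault 0x55bf7f70e7b0 0x55bf7f6f5000 4
```

For this log it prints an offset of 0x197b0 (inside the 0x1b000-byte mapping) and "error 4: user-mode read, page not present".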


This is very annoying because it's our production environment.
Does anyone know what triggered this issue? Is this a bug?

I found some related threads with the same problem, but no real solution:
https://forum.proxmox.com/threads/segfault-on-all-cluster-nodes.51960/
https://forum.proxmox.com/threads/a...cluster-mit-12-nodes-segfault-cfs_loop.51913/

Any ideas what triggered this segfault and how to avoid it in the future?
I have attached the daemon.log with some more info.

Thanks!
Kind regards,
Erik

Proxmox version
Code:
# pveversion -v
proxmox-ve: 6.2-1 (running kernel: 5.4.44-2-pve)
pve-manager: 6.2-10 (running version: 6.2-10/a20769ed)
pve-kernel-5.4: 6.2-4
pve-kernel-helper: 6.2-4
pve-kernel-5.4.44-2-pve: 5.4.44-2
pve-kernel-5.4.44-1-pve: 5.4.44-1
pve-kernel-5.4.41-1-pve: 5.4.41-1
pve-kernel-5.4.34-1-pve: 5.4.34-2
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: residual config
ifupdown2: 3.0.0-1+pve2
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.16-pve1
libproxmox-acme-perl: 1.0.4
libpve-access-control: 6.1-2
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.1-5
libpve-guest-common-perl: 3.1-2
libpve-http-server-perl: 3.0-6
libpve-storage-perl: 6.2-5
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.2-1
lxcfs: 4.0.3-pve3
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.2-9
pve-cluster: 6.1-8
pve-container: 3.1-12
pve-docs: 6.2-5
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-2
pve-firmware: 3.1-1
pve-ha-manager: 3.0-9
pve-i18n: 2.1-3
pve-qemu-kvm: 5.0.0-11
pve-xtermjs: 4.3.0-1
qemu-server: 6.2-11
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.4-pve1
 

Attachments

  • daemon.log
    10.5 KB

aaron

Proxmox Staff Member
Jun 3, 2019
Do you have any coredumps in /var/lib/systemd/coredump? These could help tremendously in figuring out what is going wrong.

If this is happening regularly, I recommend temporarily disabling the HA services to avoid any fencing.

To do so, you can first stop the LRM on all nodes and then the CRM on all nodes.
Code:
systemctl stop pve-ha-lrm
systemctl stop pve-ha-crm

Just be aware that they will start again if the node reboots.
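With more than a couple of nodes, a small loop helps keep the order (all LRMs first, then all CRMs). This is only a sketch. The node names pve1 to pve3 are placeholders, and it assumes root can SSH to every node, as is usual inside a PVE cluster. By default it only prints the commands; set DRY_RUN=0 to actually run them.

```shell
#!/usr/bin/env bash
# Stop the HA services cluster-wide in this order:
# pve-ha-lrm on every node first, then pve-ha-crm on every node.
# NODES holds placeholder names; replace them with your node names.
# Assumes root can SSH to each node. Dry run by default.

NODES=(pve1 pve2 pve3)
DRY_RUN=${DRY_RUN:-1}   # 1 = only print commands, 0 = execute them

run_on() {  # usage: run_on NODE COMMAND...
    local node=$1; shift
    local cmd=(ssh "root@$node" "$@")
    if [[ $DRY_RUN == 1 ]]; then
        printf '%s\n' "${cmd[*]}"
    else
        "${cmd[@]}"
    fi
}

stop_ha() {  # usage: stop_ha NODE...
    local svc node
    # Outer loop over services: every LRM is stopped before any CRM.
    for svc in pve-ha-lrm pve-ha-crm; do
        for node in "$@"; do
            run_on "$node" systemctl stop "$svc"
        done
    done
}

stop_ha "${NODES[@]}"
```

Running it once in dry-run mode is a cheap way to check the order and the node list before executing it for real.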
 

erik.deneve

New Member
Mar 16, 2020

We don't have coredumps enabled.
Can you tell us how to enable coredumps (or point us to the documentation)? We will do that.

Thanks,
Erik
 

aaron

Proxmox Staff Member
Jun 3, 2019
They should be generated automatically. However, if you have the HA services enabled, it is possible that the nodes fence themselves before the coredump is safely written to disk.

Therefore, if the problem happens repeatedly, I suggest disabling them, both to avoid fencing the cluster and to get the coredumps.

As usual, the Arch Linux wiki has a good article about coredumps: https://wiki.archlinux.org/index.php/Core_dump
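One way to check whether a crashing pmxcfs would actually leave a dump is to look at where the kernel sends core dumps (kernel.core_pattern). If coredumpctl is available, you can also ask it for existing pmxcfs dumps. Here is a rough bash sketch. It assumes pmxcfs is installed as /usr/bin/pmxcfs, and `describe_core_pattern` is a made-up helper name.

```shell
#!/usr/bin/env bash
# Show how this node handles core dumps and list recorded pmxcfs crashes.
# describe_core_pattern is a hypothetical helper; /usr/bin/pmxcfs is the
# assumed install path of the pmxcfs binary.

describe_core_pattern() {  # usage: describe_core_pattern PATTERN
    case $1 in
        *systemd-coredump*) echo "systemd-coredump (stored under /var/lib/systemd/coredump by default)" ;;
        '|'*)               echo "piped to another handler: ${1#'|'}" ;;
        *)                  echo "written as a file named: $1" ;;
    esac
}

# The kernel decides where a dump goes from kernel.core_pattern.
pattern=$(< /proc/sys/kernel/core_pattern)
echo "core_pattern: $(describe_core_pattern "$pattern")"

# If coredumpctl is available, list dumps whose executable is pmxcfs.
if command -v coredumpctl >/dev/null 2>&1; then
    coredumpctl --no-pager list /usr/bin/pmxcfs || echo "no pmxcfs coredumps recorded"
fi
```

If systemd-coredump is the handler and a dump shows up, `coredumpctl info /usr/bin/pmxcfs` may also show a backtrace of the most recent crash.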
 
