Proxmox cluster keeps crashing - segfault in pmxcfs?

victorhooi

Hi,

I have a 4-node cluster running Proxmox/Ceph.

In the last week, two of the nodes have gone down multiple times. Each time, the node itself still seems responsive, but it disappears from the cluster.

On the console I see a message about a segfault in pmxcfs:

[Screenshot: console output showing the pmxcfs segfault message]
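
If it helps, the text of that message should also be retrievable from the kernel log with something like:

Code:
# pull any recent segfault lines for pmxcfs out of the kernel ring buffer
dmesg -T | grep -iE 'segfault|pmxcfs'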

Here is the output of pveversion from one of the nodes as well:

Code:
# pveversion --verbose
proxmox-ve: 6.1-2 (running kernel: 5.3.18-3-pve)
pve-manager: 6.1-8 (running version: 6.1-8/806edfe1)
pve-kernel-helper: 6.1-8
pve-kernel-5.3: 6.1-6
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-5.3.10-1-pve: 5.3.10-1
ceph: 14.2.8-pve1
ceph-fuse: 14.2.8-pve1
corosync: 3.0.3-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: residual config
ifupdown2: 2.0.1-1+pve8
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.15-pve1
libpve-access-control: 6.0-6
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.0-19
libpve-guest-common-perl: 3.0-6
libpve-http-server-perl: 3.0-5
libpve-storage-perl: 6.1-5
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.0-2
lxcfs: 4.0.2-pve1
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.1-3
pve-cluster: 6.1-5
pve-container: 3.1-1
pve-docs: 6.1-6
pve-edk2-firmware: 2.20200229-1
pve-firewall: 4.0-10
pve-firmware: 3.0-7
pve-ha-manager: 3.0-9
pve-i18n: 2.0-5
pve-qemu-kvm: 4.2.0-1
pve-xtermjs: 4.3.0-1
qemu-server: 6.1-13
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.3-pve1

Any ideas on what's going on?

(Underlying hardware is an AMD Rome-based (EPYC 7002) system, if that matters.)

Thanks,
Victor
 
Do you see any problems in the syslog regarding Corosync?
Code:
grep corosync /var/log/syslog | less
 
There is nothing about corosync in the syslogs on any of the four nodes.

From the crash message - are you thinking the issue is in corosync?

I just saw this earlier thread; based on that, I installed the systemd-coredump package and edited /etc/systemd/journald.conf to add "Storage=persistent" (roughly as sketched below).
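
For reference, the changes were roughly the following (a sketch, not the exact diff):

Code:
# install the coredump collector so future crashes are captured
apt install systemd-coredump

# /etc/systemd/journald.conf - keep the journal across reboots
[Journal]
Storage=persistent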

Will that help you debug if it crashes again?

I did try restarting the journald service instead of rebooting the node, but then I got this error:

Code:
# systemctl restart systemd-journald
Job for systemd-journald.service failed because a fatal signal was delivered causing the control process to dump core.
See "systemctl status systemd-journald.service" and "journalctl -xe" for details.

And then when I checked journalctl -xe, I saw:
Code:
Apr 15 05:48:10 example-node3 systemd[1]: Starting Journal Service...
Apr 15 05:48:10 example-node3 systemd-journald[4187007]: Failed to create new runtime journal: No such file or directory
Apr 15 05:48:10 example-node3 systemd-journald[4187007]: Assertion 'f' failed at ../src/journal/journal-file.c:338, function journal_file_close(). Aborting.
Apr 15 05:48:10 example-node3 systemd[1]: systemd-journald.service: Main process exited, code=dumped, status=6/ABRT
Apr 15 05:48:10 example-node3 systemd[1]: systemd-journald.service: Failed with result 'core-dump'.
Apr 15 05:48:10 example-node3 systemd[1]: Failed to start Journal Service.
Apr 15 05:48:10 example-node3 systemd[1]: Dependency failed for Flush Journal to Persistent Storage.
Apr 15 05:48:10 example-node3 systemd[1]: systemd-journal-flush.service: Job systemd-journal-flush.service/start failed with result 'dependency'.
Apr 15 05:48:10 example-node3 systemd[1]: systemd-journald.service: Service has no hold-off time (RestartSec=0), scheduling restart.
Apr 15 05:48:10 example-node3 systemd[1]: systemd-journald.service: Scheduled restart job, restart counter is at 1.
Apr 15 05:48:10 example-node3 systemd[1]: Stopped Journal Service.
Apr 15 05:48:10 example-node3 systemd[1]: Starting Journal Service...
Apr 15 05:48:10 example-node3 systemd-coredump[4187559]: Process 4187007 (systemd-journal) of user 0 dumped core.
Apr 15 05:48:10 example-node3 systemd-coredump[4187559]: Coredump diverted to /var/lib/systemd/coredump/core.systemd-journal.0.b501a2f2e02e4af9bec9015d471a9239.4187007.15868936900000
Apr 15 05:48:10 example-node3 systemd-coredump[4187559]: Stack trace of thread 4187007:
Apr 15 05:48:10 example-node3 systemd-coredump[4187559]: #0  0x00007f02dc0897bb raise (libc.so.6)
Apr 15 05:48:10 example-node3 systemd-coredump[4187559]: #1  0x00007f02dc074535 abort (libc.so.6)
Apr 15 05:48:10 example-node3 systemd-coredump[4187559]: #2  0x00007f02dbe1007a n/a (libsystemd-shared-241.so)
Apr 15 05:48:10 example-node3 systemd-coredump[4187559]: #3  0x00007f02dbe2e691 journal_file_close (libsystemd-shared-241.so)
Apr 15 05:48:10 example-node3 systemd-coredump[4187559]: #4  0x000055d738bd883e n/a (systemd-journald)
Apr 15 05:48:10 example-node3 systemd-coredump[4187559]: #5  0x000055d738bcc485 n/a (systemd-journald)
Apr 15 05:48:10 example-node3 systemd-coredump[4187559]: #6  0x00007f02dc07609b __libc_start_main (libc.so.6)
Apr 15 05:48:10 example-node3 kernel: printk: systemd-coredum: 7 output lines suppressed due to ratelimiting
Apr 15 05:48:10 example-node3 systemd-journald[4187563]: Journal started
-- Subject: The journal has been started
-- Defined-By: systemd
-- Support: https://www.debian.org/support
--
-- The system journal process has started up, opened the journal
-- files for writing and is now ready to process requests.
Apr 15 05:48:10 example-node3 systemd-journald[4187563]: System journal (/var/log/journal/410a57a049f74b1b9b98eb17e755bf5a) is 79.1M, max 4.0G, 3.9G free.
-- Subject: Disk space used by the journal
-- Defined-By: systemd
-- Support: https://www.debian.org/support
--
-- System journal (/var/log/journal/410a57a049f74b1b9b98eb17e755bf5a) is currently using 79.1M.
-- Maximum allowed usage is set to 4.0G.
-- Leaving at least 4.0G free (of currently available 920.6G of disk space).
-- Enforced usage limit is thus 4.0G, of which 3.9G are still available.
--
-- The limits controlling how much disk space is used by the journal may
-- be configured with SystemMaxUse=, SystemKeepFree=, SystemMaxFileSize=,
-- RuntimeMaxUse=, RuntimeKeepFree=, RuntimeMaxFileSize= settings in
-- /etc/systemd/journald.conf. See journald.conf(5) for details.
Apr 15 05:48:04 example-node3 pvestatd[6565]: unable to activate storage 'example_cephfs' - directory '/mnt/pve/example_cephfs' does not exist or is unreachable
Apr 15 05:48:10 example-node3 systemd[1]: Started Journal Service.
Apr 15 05:48:14 example-node3 pvestatd[6565]: unable to activate storage 'example_cephfs' - directory '/mnt/pve/example_cephfs' does not exist or is unreachable
Apr 15 05:48:15 example-node3 pmxcfs[2790706]: [status] notice: received log
Apr 15 05:48:15 example-node3 pmxcfs[2790706]: [status] notice: received log
Apr 15 05:48:24 example-node3 pvestatd[6565]: unable to activate storage 'example_cephfs' - directory '/mnt/pve/example_cephfs' does not exist or is unreachable

Anyhow, I've rebooted that node, and journald seems to be up now.

Should I just monitor to see if the crash recurs?
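
If it does crash again, my plan is to pull the core dump with coredumpctl, roughly:

Code:
# list any captured pmxcfs crashes
coredumpctl list pmxcfs

# show metadata plus a stack trace for the most recent one
coredumpctl info pmxcfs

# or load it into gdb (debug symbols would be needed for a useful backtrace)
coredumpctl gdb pmxcfs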
 
From the crash message - are you thinking the issue is in corosync?
I am not sure; it was a first guess to see if that could be it.

Will that help you debug if it crashes again?
Maybe.

To rule out possible hardware problems: have you checked whether there are BIOS updates available? If possible, a memory test would be interesting too (sketch below).
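
For example, something like this should show the currently flashed BIOS version to compare against the vendor's latest release (dmidecode is the assumed tool here):

Code:
# report the installed BIOS vendor, version and date
dmidecode -s bios-vendor
dmidecode -s bios-version
dmidecode -s bios-release-date

For the memory test, booting memtest86+ (or the vendor's own diagnostics) from a USB stick outside the running system is the usual approach.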

In the last week - two of the nodes have gone down multiple times
Did you change anything before the problem started to show? Installed updates maybe?
 

About

The Proxmox community has been around for many years and offers help and support for Proxmox VE, Proxmox Backup Server, and Proxmox Mail Gateway.
We think our community is one of the best thanks to people like you!

Get your subscription!

The Proxmox team works very hard to make sure you are running the best software and getting stable updates and security enhancements, as well as quick enterprise support. Tens of thousands of happy customers have a Proxmox subscription. Get yours easily in our online shop.

Buy now!