Proxmox cluster keeps crashing - segfault in pmxcfs?

victorhooi

Hi,

I have a 4-node cluster running Proxmox/Ceph.

In the last week, two of the nodes have gone down multiple times. Each time, the affected node itself seems responsive, but it disappears from the cluster.

On the console I see a message about a segfault in pmxcfs:

[Screenshot attachment: Screen Shot 2020-04-14 at 7.43.32 pm.png]
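For reference, the same message should also be in the kernel log; something like this ought to pull it out in text form (the grep pattern is just a guess at what to match):

Code:
dmesg -T | grep -iE 'pmxcfs|segfault'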

Here is the output of pveversion from one of the nodes as well:

Code:
# pveversion --verbose
proxmox-ve: 6.1-2 (running kernel: 5.3.18-3-pve)
pve-manager: 6.1-8 (running version: 6.1-8/806edfe1)
pve-kernel-helper: 6.1-8
pve-kernel-5.3: 6.1-6
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-5.3.10-1-pve: 5.3.10-1
ceph: 14.2.8-pve1
ceph-fuse: 14.2.8-pve1
corosync: 3.0.3-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: residual config
ifupdown2: 2.0.1-1+pve8
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.15-pve1
libpve-access-control: 6.0-6
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.0-19
libpve-guest-common-perl: 3.0-6
libpve-http-server-perl: 3.0-5
libpve-storage-perl: 6.1-5
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.0-2
lxcfs: 4.0.2-pve1
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.1-3
pve-cluster: 6.1-5
pve-container: 3.1-1
pve-docs: 6.1-6
pve-edk2-firmware: 2.20200229-1
pve-firewall: 4.0-10
pve-firmware: 3.0-7
pve-ha-manager: 3.0-9
pve-i18n: 2.0-5
pve-qemu-kvm: 4.2.0-1
pve-xtermjs: 4.3.0-1
qemu-server: 6.1-13
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.3-pve1

Any ideas on what's going on?

(The underlying hardware is an AMD Rome-based (EPYC 7002) system, if that matters.)

Thanks,
Victor
 
Do you see any problems in the syslog regarding Corosync?
Code:
grep corosync /var/log/syslog | less
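If the nodes log to the systemd journal as well, an equivalent check would be something like:

Code:
journalctl -u corosync --since "7 days ago" | less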
 
There is nothing about corosync in any of the syslogs on any of the four nodes.

From the crash message - are you thinking the issue is in corosync?

I just saw this earlier thread - based on that, I installed the systemd-coredump package and edited /etc/systemd/journald.conf to add "Storage=persistent".
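For reference, this is roughly what I did - the package install plus the config change from that thread:

Code:
apt install systemd-coredump

# added to /etc/systemd/journald.conf, under the [Journal] section:
Storage=persistent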

Will that help you debug if it crashes again?

I did try restarting the journald service instead of rebooting the node, but then I got this error:

Code:
# systemctl restart systemd-journald
Job for systemd-journald.service failed because a fatal signal was delivered causing the control process to dump core.
See "systemctl status systemd-journald.service" and "journalctl -xe" for details.

When I then checked journalctl -xe, I saw:
Code:
Apr 15 05:48:10 example-node3 systemd[1]: Starting Journal Service...
Apr 15 05:48:10 example-node3 systemd-journald[4187007]: Failed to create new runtime journal: No such file or directory
Apr 15 05:48:10 example-node3 systemd-journald[4187007]: Assertion 'f' failed at ../src/journal/journal-file.c:338, function journal_file_close(). Aborting.
Apr 15 05:48:10 example-node3 systemd[1]: systemd-journald.service: Main process exited, code=dumped, status=6/ABRT
Apr 15 05:48:10 example-node3 systemd[1]: systemd-journald.service: Failed with result 'core-dump'.
Apr 15 05:48:10 example-node3 systemd[1]: Failed to start Journal Service.
Apr 15 05:48:10 example-node3 systemd[1]: Dependency failed for Flush Journal to Persistent Storage.
Apr 15 05:48:10 example-node3 systemd[1]: systemd-journal-flush.service: Job systemd-journal-flush.service/start failed with result 'dependency'.
Apr 15 05:48:10 example-node3 systemd[1]: systemd-journald.service: Service has no hold-off time (RestartSec=0), scheduling restart.
Apr 15 05:48:10 example-node3 systemd[1]: systemd-journald.service: Scheduled restart job, restart counter is at 1.
Apr 15 05:48:10 example-node3 systemd[1]: Stopped Journal Service.
Apr 15 05:48:10 example-node3 systemd[1]: Starting Journal Service...
Apr 15 05:48:10 example-node3 systemd-coredump[4187559]: Process 4187007 (systemd-journal) of user 0 dumped core.
Apr 15 05:48:10 example-node3 systemd-coredump[4187559]: Coredump diverted to /var/lib/systemd/coredump/core.systemd-journal.0.b501a2f2e02e4af9bec9015d471a9239.4187007.15868936900000
Apr 15 05:48:10 example-node3 systemd-coredump[4187559]: Stack trace of thread 4187007:
Apr 15 05:48:10 example-node3 systemd-coredump[4187559]: #0  0x00007f02dc0897bb raise (libc.so.6)
Apr 15 05:48:10 example-node3 systemd-coredump[4187559]: #1  0x00007f02dc074535 abort (libc.so.6)
Apr 15 05:48:10 example-node3 systemd-coredump[4187559]: #2  0x00007f02dbe1007a n/a (libsystemd-shared-241.so)
Apr 15 05:48:10 example-node3 systemd-coredump[4187559]: #3  0x00007f02dbe2e691 journal_file_close (libsystemd-shared-241.so)
Apr 15 05:48:10 example-node3 systemd-coredump[4187559]: #4  0x000055d738bd883e n/a (systemd-journald)
Apr 15 05:48:10 example-node3 systemd-coredump[4187559]: #5  0x000055d738bcc485 n/a (systemd-journald)
Apr 15 05:48:10 example-node3 systemd-coredump[4187559]: #6  0x00007f02dc07609b __libc_start_main (libc.so.6)
Apr 15 05:48:10 example-node3 kernel: printk: systemd-coredum: 7 output lines suppressed due to ratelimiting
Apr 15 05:48:10 example-node3 systemd-journald[4187563]: Journal started
-- Subject: The journal has been started
-- Defined-By: systemd
-- Support: https://www.debian.org/support
--
-- The system journal process has started up, opened the journal
-- files for writing and is now ready to process requests.
Apr 15 05:48:10 example-node3 systemd-journald[4187563]: System journal (/var/log/journal/410a57a049f74b1b9b98eb17e755bf5a) is 79.1M, max 4.0G, 3.9G free.
-- Subject: Disk space used by the journal
-- Defined-By: systemd
-- Support: https://www.debian.org/support
--
-- System journal (/var/log/journal/410a57a049f74b1b9b98eb17e755bf5a) is currently using 79.1M.
-- Maximum allowed usage is set to 4.0G.
-- Leaving at least 4.0G free (of currently available 920.6G of disk space).
-- Enforced usage limit is thus 4.0G, of which 3.9G are still available.
--
-- The limits controlling how much disk space is used by the journal may
-- be configured with SystemMaxUse=, SystemKeepFree=, SystemMaxFileSize=,
-- RuntimeMaxUse=, RuntimeKeepFree=, RuntimeMaxFileSize= settings in
-- /etc/systemd/journald.conf. See journald.conf(5) for details.
Apr 15 05:48:04 example-node3 pvestatd[6565]: unable to activate storage 'example_cephfs' - directory '/mnt/pve/example_cephfs' does not exist or is unreachable
Apr 15 05:48:10 example-node3 systemd[1]: Started Journal Service.
Apr 15 05:48:14 example-node3 pvestatd[6565]: unable to activate storage 'example_cephfs' - directory '/mnt/pve/example_cephfs' does not exist or is unreachable
Apr 15 05:48:15 example-node3 pmxcfs[2790706]: [status] notice: received log
Apr 15 05:48:15 example-node3 pmxcfs[2790706]: [status] notice: received log
Apr 15 05:48:24 example-node3 pvestatd[6565]: unable to activate storage 'example_cephfs' - directory '/mnt/pve/example_cephfs' does not exist or is unreachable

Anyhow, I've rebooted that node, and journald seems to be up now.
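In case it's useful, this is how I'm checking that persistent logging and coredump capture are actually active now (coredumpctl ships with the systemd-coredump package):

Code:
# the persistent journal directory should now exist and contain journal files
ls /var/log/journal/

# list any captured core dumps; pmxcfs should show up here if it crashes again
coredumpctl list pmxcfs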

Should I just monitor to see if the crash recurs?
 
From the crash message - are you thinking the issue is in corosync?
I am not sure; it was a first guess, to see if that could be it.

Will that help you debug if it crashes again?
Maybe.

To rule out possible hardware problems: have you checked whether BIOS updates are available? If possible, a memory test would be interesting too. See the sketch below for one way to check both.
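Something like this shows the currently installed BIOS version to compare against the vendor's latest release, and memtester can do a rough in-OS memory test (a full memtest86+ run from boot media is more thorough):

Code:
# currently installed BIOS version and date
dmidecode -s bios-version
dmidecode -s bios-release-date

# quick in-OS memory test: lock 4 GiB and run one pass
apt install memtester
memtester 4096M 1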

In the last week, two of the nodes have gone down multiple times.
Did you change anything before the problem started to show up? Installed updates, maybe?