All nodes in cluster suddenly rebooted

zaxx

Dec 28, 2018
It happened to me today on the latest 5.3: all nodes in the cluster rebooted simultaneously, and I have no idea what happened:

Dec 28 07:27:38 kvm-02 pmxcfs[18320]: [status] notice: received log
Dec 28 07:27:38 kvm-02 kernel: [1074940.672331] server[18412]: segfault at 7f7b23bdbaaa ip 0000562167d517b9 sp 00007f7ab916fa10 error 4 in pmxcfs[562167d34000+2b000]
Dec 28 07:27:38 kvm-02 systemd[1]: pve-cluster.service: Main process exited, code=killed, status=11/SEGV
Dec 28 07:27:38 kvm-02 systemd[1]: pve-cluster.service: Unit entered failed state.
Dec 28 07:27:38 kvm-02 systemd[1]: pve-cluster.service: Failed with result 'signal'.
Dec 28 07:27:39 kvm-02 pveproxy[7450]: ipcc_send_rec[1] failed: Connection refused

The same log appears on all nodes. I have already checked the network side; it's OK.
 
Do you have any more information about your cluster (hardware, software, etc.)? How about all the other log files?
 
It's a 4-node cluster based on Supermicro servers with RBD and NFS storage.
Which other log files are you asking about?

proxmox-ve: 5.3-1 (running kernel: 4.15.18-9-pve)
pve-manager: 5.3-5 (running version: 5.3-5/97ae681d)
pve-kernel-4.15: 5.2-12
pve-kernel-4.13: 5.2-2
pve-kernel-4.15.18-9-pve: 4.15.18-30
corosync: 2.4.4-pve1
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.1-3
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-43
libpve-guest-common-perl: 2.0-18
libpve-http-server-perl: 2.0-11
libpve-storage-perl: 5.0-33
libqb0: 1.0.3-1~bpo9
lvm2: 2.02.168-pve6
lxc-pve: 3.0.2+pve1-5
lxcfs: 3.0.2-2
novnc-pve: 1.0.0-2
proxmox-widget-toolkit: 1.0-22
pve-cluster: 5.0-31
pve-container: 2.0-31
pve-docs: 5.3-1
pve-edk2-firmware: 1.20181023-1
pve-firewall: 3.0-16
pve-firmware: 2.0-6
pve-ha-manager: 2.0-5
pve-i18n: 1.0-9
pve-libspice-server1: 0.14.1-1
pve-qemu-kvm: 2.12.1-1
pve-xtermjs: 1.0-5
qemu-server: 5.0-43
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.12-pve1~bpo1
 
So you are using Ceph? As HCI or standalone? Let us know more about your configuration: what motherboard, CPUs, and network, how many LOMs you have connected, whether you have 1 GbE or 10 GbE, or whether you are using FC or InfiniBand. How about your server load? Do you have metrics from a monitoring system (check_mk, Icinga, Nagios, etc.)? What about the HA functionality in PVE?

You have multiple log files in your log folder, for example dmesg, syslog, messages, etc.; there could be more information in them.

It's very hard to help you in this case if you are not able to provide such basic information. We do not have any knowledge about your setup, so we do not know where the problem might be. So let us know and explain your setup to us ;)
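For example, to pull everything logged around the crash on each node, something like the following could be run. This is only a minimal sketch: the timestamp is taken from your first post, the paths are the Debian defaults, and journalctl -b -1 only shows the previous boot if the journal is persistent.
Code:
# grep the standard Debian log files around the time of the crash
grep "Dec 28 07:2" /var/log/syslog /var/log/messages /var/log/kern.log
# if the systemd journal is persistent, the previous boot's log is also available
journalctl -b -1 -e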
 
It happened again today.

From syslog (same at all nodes):
Jan 14 16:16:49 kvm-02 kernel: [ 8528.966852] cfs_loop[16796]: segfault at 7f3c0cc98578 ip 000055e9bb75d6b0 sp 00007f3ba70b73b8 error 4 in pmxcfs[55e9bb740000+2b000]
Jan 14 16:16:49 kvm-02 systemd[1]: pve-cluster.service: Main process exited, code=killed, status=11/SEGV
Jan 14 16:16:49 kvm-02 systemd[1]: pve-cluster.service: Unit entered failed state.
Jan 14 16:16:49 kvm-02 systemd[1]: pve-cluster.service: Failed with result 'signal'.
Jan 14 16:16:50 kvm-02 pve-ha-crm[17071]: status change slave => wait_for_quorum
Jan 14 16:16:51 kvm-02 pveproxy[5364]: ipcc_send_rec[1] failed: Connection refused

cat /var/log/messages | grep "Jan 14 16:16"
Jan 14 16:16:49 kvm-02 kernel: [ 8528.966852] cfs_loop[16796]: segfault at 7f3c0cc98578 ip 000055e9bb75d6b0 sp 00007f3ba70b73b8 error 4 in pmxcfs[55e9bb740000+2b000]

It is a cluster with 5 nodes. CPU load is ~30% and memory usage ~50% on each node. Each node has 1x Xeon E5-2680 v4 and 256 GB RAM, and is connected to the network via two bonded 10 Gb Ethernet interfaces, which are also used for the VMs. Two Ceph-based storages are attached for the VMs, plus an NFS-based storage for VM backups.
 
I would guess you had problems on the network where corosync is communicating, and because you have HA enabled, the nodes rebooted themselves after losing quorum. Add a redundant ring to corosync or disable HA.
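For reference, a redundant ring in corosync 2.x (as shipped with PVE 5) looks roughly like the sketch below. The addresses and the node entry are made-up examples, not your actual networks; on PVE the file is edited via /etc/pve/corosync.conf and config_version must be increased on every change.
Code:
totem {
  version: 2
  cluster_name: yourcluster      # example name
  config_version: 5              # bump on every change
  rrp_mode: passive              # enables the redundant ring protocol
  interface {
    ringnumber: 0
    bindnetaddr: 10.10.10.0      # existing corosync network (example)
  }
  interface {
    ringnumber: 1
    bindnetaddr: 10.10.20.0      # separate physical network for the second ring (example)
  }
}

nodelist {
  node {
    name: kvm-02
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.10.10.2       # example addresses
    ring1_addr: 10.10.20.2
  }
  # ... one entry per node, each with ring0_addr and ring1_addr
}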
 
A network problem was my first thought too, but the network team double-checked and did not find any issue.
I understand why the nodes rebooted: they lost each other. But why did the nodes lose each other?
 
Because communication among the servers was/is hindered.
Because all of them rebooted, not just a few, the problem is not local to the nodes but comes from the network.
There is a small chance that the problem was local and happened on all nodes at once, but I doubt it.

I would not take the network team's word for it and would set up my own monitoring to prove them wrong.
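As a rough idea of what that monitoring could look like (hostnames and the peer IP are placeholders, not your real ones): omping, which the Proxmox cluster documentation uses for testing the cluster network, plus a timestamped ping log to catch latency spikes over time.
Code:
# run on all nodes at the same time; tests multicast on the corosync network
omping -c 10000 -i 0.001 -F -q kvm-01 kvm-02 kvm-03 kvm-04 kvm-05
# continuous latency log against a peer on the corosync network (example IP);
# -D prints a Unix timestamp before each line
ping -D -i 1 10.10.10.3 >> /var/log/corosync-ping.log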
I would open a new thread to discuss this (unless someone explains it here):
Code:
Jan 14 16:16:49 kvm-02 kernel: [ 8528.966852] cfs_loop[16796]: segfault at 7f3c0cc98578 ip 000055e9bb75d6b0 sp 00007f3ba70b73b8 error 4 in pmxcfs[55e9bb740000+2b000]
Jan 14 16:16:49 kvm-02 systemd[1]: pve-cluster.service: Main process exited, code=killed, status=11/SEGV
because I'm not sure whether it is normal for cfs_loop to segfault when the network goes down.
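To give the developers something to work with, the segfault could be captured as a core dump and turned into a backtrace. A minimal sketch, assuming /usr/bin/pmxcfs is the binary and using an example core_pattern path; note that restarting pve-cluster briefly interrupts /etc/pve:
Code:
# allow the pve-cluster service (pmxcfs) to write a core dump on the next segfault
mkdir -p /etc/systemd/system/pve-cluster.service.d
cat > /etc/systemd/system/pve-cluster.service.d/coredump.conf <<'EOF'
[Service]
LimitCORE=infinity
EOF
# write core files to a known location (pattern is just an example)
echo 'kernel.core_pattern=/var/tmp/core.%e.%p' > /etc/sysctl.d/90-coredump.conf
sysctl -p /etc/sysctl.d/90-coredump.conf
systemctl daemon-reload
systemctl restart pve-cluster

# after the next crash, get a backtrace (<PID> is a placeholder; debug symbols help if available)
gdb -batch -ex 'bt full' /usr/bin/pmxcfs /var/tmp/core.pmxcfs.<PID>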
 
