All cluster nodes suddenly rebooted

zaxx

New Member
Dec 28, 2018
It happened to me today on the latest 5.3: all nodes in the cluster rebooted simultaneously, and I have no idea what happened:

Dec 28 07:27:38 kvm-02 pmxcfs[18320]: [status] notice: received log
Dec 28 07:27:38 kvm-02 kernel: [1074940.672331] server[18412]: segfault at 7f7b23bdbaaa ip 0000562167d517b9 sp 00007f7ab916fa10 error 4 in pmxcfs[562167d34000+2b000]
Dec 28 07:27:38 kvm-02 systemd[1]: pve-cluster.service: Main process exited, code=killed, status=11/SEGV
Dec 28 07:27:38 kvm-02 systemd[1]: pve-cluster.service: Unit entered failed state.
Dec 28 07:27:38 kvm-02 systemd[1]: pve-cluster.service: Failed with result 'signal'.
Dec 28 07:27:39 kvm-02 pveproxy[7450]: ipcc_send_rec[1] failed: Connection refused

The same log appears on all nodes. I have already checked the network side; it's OK.
 

sb-jw

Active Member
Jan 23, 2018
Do you have any more information about your cluster (hardware, software, etc.)? What about all the other log files?
 

zaxx

New Member
Dec 28, 2018
It's a 4-node cluster based on Supermicro servers, with RBD and NFS storage.
Which other log files are you asking about?

proxmox-ve: 5.3-1 (running kernel: 4.15.18-9-pve)
pve-manager: 5.3-5 (running version: 5.3-5/97ae681d)
pve-kernel-4.15: 5.2-12
pve-kernel-4.13: 5.2-2
pve-kernel-4.15.18-9-pve: 4.15.18-30
corosync: 2.4.4-pve1
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.1-3
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-43
libpve-guest-common-perl: 2.0-18
libpve-http-server-perl: 2.0-11
libpve-storage-perl: 5.0-33
libqb0: 1.0.3-1~bpo9
lvm2: 2.02.168-pve6
lxc-pve: 3.0.2+pve1-5
lxcfs: 3.0.2-2
novnc-pve: 1.0.0-2
proxmox-widget-toolkit: 1.0-22
pve-cluster: 5.0-31
pve-container: 2.0-31
pve-docs: 5.3-1
pve-edk2-firmware: 1.20181023-1
pve-firewall: 3.0-16
pve-firmware: 2.0-6
pve-ha-manager: 2.0-5
pve-i18n: 1.0-9
pve-libspice-server1: 0.14.1-1
pve-qemu-kvm: 2.12.1-1
pve-xtermjs: 1.0-5
qemu-server: 5.0-43
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.12-pve1~bpo1
 

sb-jw

Active Member
Jan 23, 2018
So you're using Ceph? As HCI or standalone? Let us know more about your configuration: what motherboard and CPUs, what network, how many LOMs you have connected, and whether you have 1 GbE or 10 GbE, or use FC or InfiniBand. How about your server load? Do you have metrics from a monitoring system (check_mk, Icinga, Nagios, etc.)? What about the HA functionality in PVE?

You have multiple log files in your log folder, for example dmesg, syslog, messages, etc.; there could be more information in them.

It's very hard to help in a case like this if you can't provide the basic information we need. We don't have any knowledge of your setup, so we don't know where the problem might be. So let us know and explain your setup ;)
 

zaxx

New Member
Dec 28, 2018
It happened again today.

From syslog (same at all nodes):
Jan 14 16:16:49 kvm-02 kernel: [ 8528.966852] cfs_loop[16796]: segfault at 7f3c0cc98578 ip 000055e9bb75d6b0 sp 00007f3ba70b73b8 error 4 in pmxcfs[55e9bb740000+2b000]
Jan 14 16:16:49 kvm-02 systemd[1]: pve-cluster.service: Main process exited, code=killed, status=11/SEGV
Jan 14 16:16:49 kvm-02 systemd[1]: pve-cluster.service: Unit entered failed state.
Jan 14 16:16:49 kvm-02 systemd[1]: pve-cluster.service: Failed with result 'signal'.
Jan 14 16:16:50 kvm-02 pve-ha-crm[17071]: status change slave => wait_for_quorum
Jan 14 16:16:51 kvm-02 pveproxy[5364]: ipcc_send_rec[1] failed: Connection refused

cat /var/log/messages | grep "Jan 14 16:16"
Jan 14 16:16:49 kvm-02 kernel: [ 8528.966852] cfs_loop[16796]: segfault at 7f3c0cc98578 ip 000055e9bb75d6b0 sp 00007f3ba70b73b8 error 4 in pmxcfs[55e9bb740000+2b000]

It's a cluster with 5 nodes. CPU load is ~30% and memory usage ~50% on each node. Each node has 1x Xeon E5-2680 v4 and 256 GB RAM, and is connected to the network via two bonded 10 Gb Ethernet interfaces, which are also used for VM traffic. Two Ceph-based storages are attached for VMs, plus an NFS-based one for VM backups.
 

mailinglists

Active Member
Mar 14, 2012
I would guess you had problems on the network where corosync is communicating, and because you have HA enabled, the nodes rebooted (self-fenced) after losing quorum. Add a redundant ring to corosync or disable HA.
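
For reference, on corosync 2.x (as shipped with PVE 5.3) a redundant ring means a second interface section in corosync.conf plus a ring1_addr per node, ideally on a separate physical network. A minimal sketch, assuming a hypothetical second network on 10.10.10.0/24 (all addresses and names here are placeholders, and config_version must be bumped when editing):

```
# /etc/pve/corosync.conf (sketch, not your actual config)
totem {
  version: 2
  rrp_mode: passive          # second ring is used as a fallback
  interface {
    ringnumber: 0
    bindnetaddr: 192.168.1.0
  }
  interface {
    ringnumber: 1
    bindnetaddr: 10.10.10.0  # separate physical network for the backup ring
  }
}
nodelist {
  node {
    name: kvm-02
    ring0_addr: 192.168.1.2
    ring1_addr: 10.10.10.2   # every node needs a ring1_addr as well
  }
  # ... remaining nodes
}
```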
 

zaxx

New Member
Dec 28, 2018
A network problem was my first thought too, but the network team double-checked and found no issue.
I understand why the nodes rebooted: they lost each other. But why did the nodes lose each other?
 

mailinglists

Active Member
Mar 14, 2012
Because communication among the servers was/is hindered.
Because all of them rebooted, not just a few, the problem is not local to the nodes but comes from the network.
There is a small chance that the problem was local and happened on all nodes at once, but I doubt it.

I would not take the network guys' word for it and would set up my own monitoring to prove them wrong.
I would also open a new thread to discuss this (unless someone explains it here):
Code:
Jan 14 16:16:49 kvm-02 kernel: [ 8528.966852] cfs_loop[16796]: segfault at 7f3c0cc98578 ip 000055e9bb75d6b0 sp 00007f3ba70b73b8 error 4 in pmxcfs[55e9bb740000+2b000]
Jan 14 16:16:49 kvm-02 systemd[1]: pve-cluster.service: Main process exited, code=killed, status=11/SEGV
because I'm not sure it is normal to get a segfault in cfs_loop if the network goes down.
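
On the monitoring point: a trivial way to collect your own evidence is to log a timestamped round-trip sample to each cluster node and correlate latency spikes or timeouts with the corosync/syslog timestamps. A minimal sketch (the NODES list and log path are placeholders; for the corosync network itself, Proxmox documents `omping` as the dedicated test tool):

```shell
#!/bin/sh
# Log one timestamped RTT sample per cluster node; run it from cron or in a
# loop on every node. NODES and LOG are placeholders -- adjust to your setup.
NODES="127.0.0.1"
LOG=/tmp/cluster-latency.log
for n in $NODES; do
    # one probe, 1 s timeout; awk extracts the "time=..." field if present
    rtt=$(ping -c 1 -W 1 "$n" 2>/dev/null | awk -F'time=' '/time=/{print $2}')
    # failed probes are recorded as "timeout" so gaps are visible in the log
    echo "$(date '+%F %T') $n ${rtt:-timeout}" >> "$LOG"
done
```

A grep for "timeout" over a few days of this log is usually enough to show the network team exactly when and where packets were lost.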
 
