All cluster nodes suddenly rebooted

zaxx

New Member
Dec 28, 2018
It happened to me today on the latest 5.3: all nodes in the cluster rebooted simultaneously, and I have no idea what happened:

Dec 28 07:27:38 kvm-02 pmxcfs[18320]: [status] notice: received log
Dec 28 07:27:38 kvm-02 kernel: [1074940.672331] server[18412]: segfault at 7f7b23bdbaaa ip 0000562167d517b9 sp 00007f7ab916fa10 error 4 in pmxcfs[562167d34000+2b000]
Dec 28 07:27:38 kvm-02 systemd[1]: pve-cluster.service: Main process exited, code=killed, status=11/SEGV
Dec 28 07:27:38 kvm-02 systemd[1]: pve-cluster.service: Unit entered failed state.
Dec 28 07:27:38 kvm-02 systemd[1]: pve-cluster.service: Failed with result 'signal'.
Dec 28 07:27:39 kvm-02 pveproxy[7450]: ipcc_send_rec[1] failed: Connection refused

The same log appears on all nodes. I have already checked the network side; it's OK.
 

sb-jw

Active Member
Jan 23, 2018
Do you have any more information about your cluster (hardware, software, etc.)? What about all the other log files?
 

zaxx

New Member
Dec 28, 2018
It's a 4-node cluster based on Supermicro servers, with RBD and NFS storage.
Which other log files are you asking about?

proxmox-ve: 5.3-1 (running kernel: 4.15.18-9-pve)
pve-manager: 5.3-5 (running version: 5.3-5/97ae681d)
pve-kernel-4.15: 5.2-12
pve-kernel-4.13: 5.2-2
pve-kernel-4.15.18-9-pve: 4.15.18-30
corosync: 2.4.4-pve1
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.1-3
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-43
libpve-guest-common-perl: 2.0-18
libpve-http-server-perl: 2.0-11
libpve-storage-perl: 5.0-33
libqb0: 1.0.3-1~bpo9
lvm2: 2.02.168-pve6
lxc-pve: 3.0.2+pve1-5
lxcfs: 3.0.2-2
novnc-pve: 1.0.0-2
proxmox-widget-toolkit: 1.0-22
pve-cluster: 5.0-31
pve-container: 2.0-31
pve-docs: 5.3-1
pve-edk2-firmware: 1.20181023-1
pve-firewall: 3.0-16
pve-firmware: 2.0-6
pve-ha-manager: 2.0-5
pve-i18n: 1.0-9
pve-libspice-server1: 0.14.1-1
pve-qemu-kvm: 2.12.1-1
pve-xtermjs: 1.0-5
qemu-server: 5.0-43
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.12-pve1~bpo1
 

sb-jw

Active Member
Jan 23, 2018
So you're using Ceph? As HCI or standalone? Let us know more about your configuration: what motherboard and CPUs, what network, how many LOMs you have connected, and whether you have 1 GbE or 10 GbE, or use FC or InfiniBand. How about your server load? Do you have metrics from a monitoring system (check_mk, Icinga, Nagios, etc.)? What about the HA functionality in PVE?

You have multiple log files in your log folder, for example dmesg, syslog, messages, etc.; there could be more information in them.

It's very hard to help in a case like this if you can't provide the basic information we need. We don't have any knowledge of your setup, so we don't know where the problem might be. So let us know and explain your setup ;)
 

zaxx

New Member
Dec 28, 2018
It happened again today.

From syslog (same at all nodes):
Jan 14 16:16:49 kvm-02 kernel: [ 8528.966852] cfs_loop[16796]: segfault at 7f3c0cc98578 ip 000055e9bb75d6b0 sp 00007f3ba70b73b8 error 4 in pmxcfs[55e9bb740000+2b000]
Jan 14 16:16:49 kvm-02 systemd[1]: pve-cluster.service: Main process exited, code=killed, status=11/SEGV
Jan 14 16:16:49 kvm-02 systemd[1]: pve-cluster.service: Unit entered failed state.
Jan 14 16:16:49 kvm-02 systemd[1]: pve-cluster.service: Failed with result 'signal'.
Jan 14 16:16:50 kvm-02 pve-ha-crm[17071]: status change slave => wait_for_quorum
Jan 14 16:16:51 kvm-02 pveproxy[5364]: ipcc_send_rec[1] failed: Connection refused

cat /var/log/messages | grep "Jan 14 16:16"
Jan 14 16:16:49 kvm-02 kernel: [ 8528.966852] cfs_loop[16796]: segfault at 7f3c0cc98578 ip 000055e9bb75d6b0 sp 00007f3ba70b73b8 error 4 in pmxcfs[55e9bb740000+2b000]

It's a cluster with 5 nodes. CPU load is ~30% and memory usage ~50% on each node. Each node has 1x Xeon E5-2680 v4 and 256 GB RAM, and is connected to the network via two bonded 10 Gb Ethernet interfaces, which are also used for VM traffic. Two Ceph-based storages are attached for VMs, plus an NFS-based one for VM backups.
 

mailinglists

Active Member
Mar 14, 2012
I would guess you had problems on the network where corosync is communicating, and because you have HA enabled, the nodes rebooted (self-fenced) after losing quorum. Add a redundant ring to corosync or disable HA.
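
For reference, on corosync 2.x (as shipped with PVE 5.3) a redundant ring means a second interface section in corosync.conf plus a ring1_addr per node, ideally on a separate physical network. A minimal sketch, assuming a hypothetical second network on 10.10.10.0/24 (all addresses and names here are placeholders, and config_version must be bumped when editing):

```
# /etc/pve/corosync.conf (sketch, not your actual config)
totem {
  version: 2
  rrp_mode: passive          # second ring is used as a fallback
  interface {
    ringnumber: 0
    bindnetaddr: 192.168.1.0
  }
  interface {
    ringnumber: 1
    bindnetaddr: 10.10.10.0  # separate physical network for the backup ring
  }
}
nodelist {
  node {
    name: kvm-02
    ring0_addr: 192.168.1.2
    ring1_addr: 10.10.10.2   # every node needs a ring1_addr as well
  }
  # ... remaining nodes
}
```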
 

zaxx

New Member
Dec 28, 2018
A network problem was my first thought too, but the network team double-checked and found no issue.
I understand why the nodes rebooted: they lost each other. But why did the nodes lose each other?
 

mailinglists

Active Member
Mar 14, 2012
Because communication among the servers was/is hindered.
Because all of them rebooted, not just a few, the problem is not local to the nodes but comes from the network.
There is a small chance that the problem was local and happened on all nodes at once, but I doubt it.

I would not take the network guys' word for it and would set up my own monitoring to prove them wrong.
I would also open a new thread to discuss this (unless someone explains it here):
Code:
Jan 14 16:16:49 kvm-02 kernel: [ 8528.966852] cfs_loop[16796]: segfault at 7f3c0cc98578 ip 000055e9bb75d6b0 sp 00007f3ba70b73b8 error 4 in pmxcfs[55e9bb740000+2b000]
Jan 14 16:16:49 kvm-02 systemd[1]: pve-cluster.service: Main process exited, code=killed, status=11/SEGV
because I'm not sure it is normal to get a segfault in cfs_loop if the network goes down.
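
On the monitoring point: a trivial way to collect your own evidence is to log a timestamped round-trip sample to each cluster node and correlate latency spikes or timeouts with the corosync/syslog timestamps. A minimal sketch (the NODES list and log path are placeholders; for the corosync network itself, Proxmox documents `omping` as the dedicated test tool):

```shell
#!/bin/sh
# Log one timestamped RTT sample per cluster node; run it from cron or in a
# loop on every node. NODES and LOG are placeholders -- adjust to your setup.
NODES="127.0.0.1"
LOG=/tmp/cluster-latency.log
for n in $NODES; do
    # one probe, 1 s timeout; awk extracts the "time=..." field if present
    rtt=$(ping -c 1 -W 1 "$n" 2>/dev/null | awk -F'time=' '/time=/{print $2}')
    # failed probes are recorded as "timeout" so gaps are visible in the log
    echo "$(date '+%F %T') $n ${rtt:-timeout}" >> "$LOG"
done
```

A grep for "timeout" over a few days of this log is usually enough to show the network team exactly when and where packets were lost.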
 
