All nodes in cluster suddenly rebooted

Discussion in 'Proxmox VE: Installation and configuration' started by zaxx, Dec 28, 2018.

  1. zaxx

    zaxx New Member

    Joined:
    Dec 28, 2018
    Messages:
    4
    Likes Received:
    0
    It happened to me today on the latest 5.3: all nodes in the cluster rebooted simultaneously, and I have no idea what happened:

    Code:
    Dec 28 07:27:38 kvm-02 pmxcfs[18320]: [status] notice: received log
    Dec 28 07:27:38 kvm-02 kernel: [1074940.672331] server[18412]: segfault at 7f7b23bdbaaa ip 0000562167d517b9 sp 00007f7ab916fa10 error 4 in pmxcfs[562167d34000+2b000]
    Dec 28 07:27:38 kvm-02 systemd[1]: pve-cluster.service: Main process exited, code=killed, status=11/SEGV
    Dec 28 07:27:38 kvm-02 systemd[1]: pve-cluster.service: Unit entered failed state.
    Dec 28 07:27:38 kvm-02 systemd[1]: pve-cluster.service: Failed with result 'signal'.
    Dec 28 07:27:39 kvm-02 pveproxy[7450]: ipcc_send_rec[1] failed: Connection refused

    The same log appears on all nodes. I have already checked the network side; it's fine.
     
    #1 zaxx, Dec 28, 2018
    Last edited: Dec 28, 2018
  2. sb-jw

    sb-jw Active Member

    Joined:
    Jan 23, 2018
    Messages:
    486
    Likes Received:
    42
    Do you have any more information about your cluster (hardware, software, etc.)? What about all the other log files?
     
  3. zaxx

    zaxx New Member

    Joined:
    Dec 28, 2018
    Messages:
    4
    Likes Received:
    0
    It's a 4-node cluster based on Supermicro servers with RBD and NFS storage.
    Which other log files are you asking about?

    Code:
    proxmox-ve: 5.3-1 (running kernel: 4.15.18-9-pve)
    pve-manager: 5.3-5 (running version: 5.3-5/97ae681d)
    pve-kernel-4.15: 5.2-12
    pve-kernel-4.13: 5.2-2
    pve-kernel-4.15.18-9-pve: 4.15.18-30
    corosync: 2.4.4-pve1
    criu: 2.11.1-1~bpo90
    glusterfs-client: 3.8.8-1
    ksm-control-daemon: 1.2-2
    libjs-extjs: 6.0.1-2
    libpve-access-control: 5.1-3
    libpve-apiclient-perl: 2.0-5
    libpve-common-perl: 5.0-43
    libpve-guest-common-perl: 2.0-18
    libpve-http-server-perl: 2.0-11
    libpve-storage-perl: 5.0-33
    libqb0: 1.0.3-1~bpo9
    lvm2: 2.02.168-pve6
    lxc-pve: 3.0.2+pve1-5
    lxcfs: 3.0.2-2
    novnc-pve: 1.0.0-2
    proxmox-widget-toolkit: 1.0-22
    pve-cluster: 5.0-31
    pve-container: 2.0-31
    pve-docs: 5.3-1
    pve-edk2-firmware: 1.20181023-1
    pve-firewall: 3.0-16
    pve-firmware: 2.0-6
    pve-ha-manager: 2.0-5
    pve-i18n: 1.0-9
    pve-libspice-server1: 0.14.1-1
    pve-qemu-kvm: 2.12.1-1
    pve-xtermjs: 1.0-5
    qemu-server: 5.0-43
    smartmontools: 6.5+svn4324-1
    spiceterm: 3.0-5
    vncterm: 1.5-3
    zfsutils-linux: 0.7.12-pve1~bpo1
     
  4. sb-jw

    sb-jw Active Member

    Joined:
    Jan 23, 2018
    Messages:
    486
    Likes Received:
    42
    So you are using Ceph? As HCI or standalone? Tell us more about your configuration: which motherboard, CPUs, and network? How many LOMs do you have connected? Do you have 1GbE or 10GbE, or are you using FC or InfiniBand? What about your server load? Do you have metrics from a monitoring system (check_mk, Icinga, Nagios, etc.)? What about the HA functionality in PVE?
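    A few quick commands would answer the load and HA questions on each node; a rough sketch (bond0 is a placeholder for your actual bond device):

    Code:
    # shows whether any HA resources are configured and the CRM/LRM states
    ha-manager status
    # link state of the bonded NICs (bond0 is a placeholder)
    cat /proc/net/bonding/bond0
    # quick load and memory snapshot
    uptime && free -m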

    You have multiple log files in /var/log, for example dmesg, syslog, messages, etc.; there could be more information in them.
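    For example, to pull everything around the crash time from the first post (adjust the window to your own crash):

    Code:
    # kernel and service messages around the crash, across the usual files
    grep -h "Dec 28 07:2" /var/log/syslog /var/log/messages /var/log/kern.log
    # the journal view of the affected units for the same window
    journalctl --since "2018-12-28 07:20" --until "2018-12-28 07:35" -u pve-cluster -u corosync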

    It's very hard to help you in a case like this if you cannot provide the basic information we need. We have no knowledge of your setup, so we do not know where the problem might be. So please explain your setup to us ;)
     
  5. zaxx

    zaxx New Member

    Joined:
    Dec 28, 2018
    Messages:
    4
    Likes Received:
    0
    It happened again today.

    From syslog (same on all nodes):
    Code:
    Jan 14 16:16:49 kvm-02 kernel: [ 8528.966852] cfs_loop[16796]: segfault at 7f3c0cc98578 ip 000055e9bb75d6b0 sp 00007f3ba70b73b8 error 4 in pmxcfs[55e9bb740000+2b000]
    Jan 14 16:16:49 kvm-02 systemd[1]: pve-cluster.service: Main process exited, code=killed, status=11/SEGV
    Jan 14 16:16:49 kvm-02 systemd[1]: pve-cluster.service: Unit entered failed state.
    Jan 14 16:16:49 kvm-02 systemd[1]: pve-cluster.service: Failed with result 'signal'.
    Jan 14 16:16:50 kvm-02 pve-ha-crm[17071]: status change slave => wait_for_quorum
    Jan 14 16:16:51 kvm-02 pveproxy[5364]: ipcc_send_rec[1] failed: Connection refused

    Code:
    cat /var/log/messages | grep "Jan 14 16:16"
    Jan 14 16:16:49 kvm-02 kernel: [ 8528.966852] cfs_loop[16796]: segfault at 7f3c0cc98578 ip 000055e9bb75d6b0 sp 00007f3ba70b73b8 error 4 in pmxcfs[55e9bb740000+2b000]

    The cluster has 5 nodes, with CPU load around 30% and memory around 50% on each node. Each node has one Xeon E5-2680 v4 and 256 GB RAM, and is connected to the network via two bonded 10 Gb Ethernet interfaces, which also carry the VM traffic. Two Ceph-based storages are attached for VMs, and an NFS-based one for VM backups.
     
  6. mailinglists

    mailinglists Active Member

    Joined:
    Mar 14, 2012
    Messages:
    372
    Likes Received:
    33
    My guess is that you had problems on the network corosync communicates over, and because you have HA enabled, the nodes fenced (rebooted) themselves after losing quorum. Add a redundant ring to corosync or disable HA.
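    Something like this in /etc/pve/corosync.conf would add a second ring; a rough sketch for corosync 2.4 as shipped with PVE 5.x (the subnets, addresses, and cluster name are placeholders, and config_version must be bumped when editing):

    Code:
    totem {
      version: 2
      cluster_name: mycluster
      # redundant ring protocol, passive mode
      rrp_mode: passive
      interface {
        ringnumber: 0
        # existing corosync network (placeholder subnet)
        bindnetaddr: 10.0.0.0
      }
      interface {
        ringnumber: 1
        # separate, dedicated network (placeholder subnet)
        bindnetaddr: 10.0.1.0
      }
    }
    nodelist {
      node {
        name: kvm-02
        nodeid: 2
        quorum_votes: 1
        ring0_addr: 10.0.0.2
        # this node's address on the second ring (placeholder)
        ring1_addr: 10.0.1.2
      }
      # ... one node block per cluster member
    }

    Ideally the second ring runs over physically separate switches, so a failure of the VM/storage network does not take corosync down with it.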
     
  7. zaxx

    zaxx New Member

    Joined:
    Dec 28, 2018
    Messages:
    4
    Likes Received:
    0
    A network problem was my first thought too, but the network team double-checked and found no issue.
    I understand why the nodes rebooted: they lost each other. But why did a node lose the others in the first place?
     
  8. mailinglists

    mailinglists Active Member

    Joined:
    Mar 14, 2012
    Messages:
    372
    Likes Received:
    33
    Because communication among the servers was/is hindered.
    Since all of them rebooted, not just a few, the problem is not local to the nodes but comes from the network.
    There is a small chance that the problem was local and happened on all nodes at once, but I doubt it.

    I would not take the network guys' word for it; I would set up my own monitoring to prove them wrong, for example like this:
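    A rough sketch to run on each node; the addresses and hostnames are placeholders for your corosync ring IPs, and omping is what the PVE docs suggest for testing the cluster network:

    Code:
    # continuously log latency and loss towards the other nodes
    # (addresses are placeholders; use your corosync ring IPs)
    while true; do
      date >> /var/log/corosync-net-watch.log
      for node in 10.0.0.1 10.0.0.2 10.0.0.3 10.0.0.4 10.0.0.5; do
        ping -c 5 -i 0.2 -W 1 "$node" | tail -n 2 >> /var/log/corosync-net-watch.log
      done
      sleep 30
    done

    # one-off multicast test, run on all nodes at the same time
    omping -c 600 -i 1 -q kvm-01 kvm-02 kvm-03 kvm-04 kvm-05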
    I would also open a new thread to discuss the following (unless someone explains it here):
    Code:
    Jan 14 16:16:49 kvm-02 kernel: [ 8528.966852] cfs_loop[16796]: segfault at 7f3c0cc98578 ip 000055e9bb75d6b0 sp 00007f3ba70b73b8 error 4 in pmxcfs[55e9bb740000+2b000]
    Jan 14 16:16:49 kvm-02 systemd[1]: pve-cluster.service: Main process exited, code=killed, status=11/SEGV
    
    because I'm not sure whether it is normal for cfs_loop to segfault when the network goes down.
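    If it happens again, a core dump of pmxcfs would show where it crashes; a rough sketch, assuming systemd-coredump is available on Debian 9 (it is not installed by default):

    Code:
    # capture core dumps of crashing services
    apt-get install systemd-coredump gdb
    # after the next segfault, list and inspect the pmxcfs dump
    coredumpctl list pmxcfs
    # 'bt' inside gdb prints the backtrace to attach to a bug report
    coredumpctl gdb pmxcfs

    With a backtrace the Proxmox developers can tell whether this is a pmxcfs bug or just fallout from losing the network.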
     