corosync crashes when the network is unstable

Discussion in 'Proxmox VE: Networking and Firewall' started by Alibek, Sep 28, 2018.

  1. Alibek

    Alibek Member

    Joined:
    Jan 13, 2017
    Messages:
    30
    Likes Received:
    3
    Hi all!

    I ran into the following situation:
    when a link is unstable (for example, a 10 Gbps network card connected to the switch through RJ45 connectors without gold plating, where the contacts on the card and on the connector can oxidize), the switch can repeatedly bring the link up and down. After that, corosync goes "crazy": it starts loading one CPU core at 30-100%, and after a few hours it crashes. Corosync crashes on an arbitrary subset of nodes, for example on 5 of the 8 cluster servers.
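    To confirm that a link really is going up and down, one option is to watch the kernel's per-interface carrier counter. The following is only a minimal sketch: the function names, the `SYSFS_NET` override, and the default interval and threshold are assumptions made for illustration, not something from the original setup. It relies on the standard Linux sysfs file `carrier_changes`, which counts link up/down transitions.

```shell
#!/usr/bin/env bash
# Sketch: detect a flapping link by sampling the kernel's carrier_changes counter.
# SYSFS_NET can be overridden to point at a fake directory for testing (assumption).
SYSFS_NET="${SYSFS_NET:-/sys/class/net}"

# Print the number of link up/down transitions recorded for an interface.
carrier_changes() {
    local iface="$1"
    cat "${SYSFS_NET}/${iface}/carrier_changes"
}

# Return 0 if the counter grew by more than max_changes within interval seconds,
# 1 if it did not, and 2 if the counter could not be read.
link_is_flapping() {
    local iface="$1" interval="${2:-60}" max_changes="${3:-2}"
    local before after
    before=$(carrier_changes "$iface") || return 2
    sleep "$interval"
    after=$(carrier_changes "$iface") || return 2
    (( after - before > max_changes ))
}
```

    A possible call, with `eno1` as a placeholder interface name: `link_is_flapping eno1 60 2 && echo "link unstable"`.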

    I work around it with a simple script:
    Code:
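    #!/usr/bin/env bash
    # Force-kill the misbehaving corosync process (SIGKILL).
    killall -9 corosync
    sleep 2
    # Stop the HA local resource manager, then the HA cluster resource manager.
    systemctl stop pve-ha-lrm.service
    sleep 2
    systemctl stop pve-ha-crm.service
    sleep 2
    # Restart the PVE API daemon, then bring the local resource manager back up.
    systemctl restart pvedaemon.service
    sleep 2
    systemctl start pve-ha-lrm.service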
    
    But the Corosync code probably needs a more thorough check.
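    Since the symptom described above is one core loaded at 30-100%, a workaround like this could be gated on corosync's CPU usage instead of being run blindly. This is a hedged sketch, not a tested fix: the function names and the default threshold of 30 are assumptions. Note that `ps` reports `%cpu` as CPU time divided by process lifetime, so a recent spike shows up only gradually.

```shell
#!/usr/bin/env bash
# Sketch: decide whether corosync looks stuck before running a restart workaround.

# Print the lifetime-average CPU usage (percent) of the first corosync process.
# Prints nothing if corosync is not running.
corosync_cpu() {
    ps -C corosync -o %cpu= | head -n 1 | tr -d ' '
}

# Return 0 if usage (may be fractional, e.g. "87.5") is at or above threshold.
cpu_exceeds() {
    local usage="$1" threshold="$2"
    awk -v u="$usage" -v t="$threshold" 'BEGIN { exit !(u + 0 >= t + 0) }'
}

# Return 0 only if corosync is running and its CPU usage reaches the threshold.
should_restart() {
    local threshold="${1:-30}" usage
    usage=$(corosync_cpu)
    [ -n "$usage" ] && cpu_exceeds "$usage" "$threshold"
}
```

    For example, `should_restart 30 && echo "corosync looks stuck"` prints the message only when a running corosync process is at or above 30%.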

    Code:
    # pveversion --verbose
    proxmox-ve: 5.2-2 (running kernel: 4.15.18-4-pve)
    pve-manager: 5.2-8 (running version: 5.2-8/fdf39912)
    pve-kernel-4.15: 5.2-7
    pve-kernel-4.15.18-4-pve: 4.15.18-23
    pve-kernel-4.15.18-2-pve: 4.15.18-21
    pve-kernel-4.15.18-1-pve: 4.15.18-19
    pve-kernel-4.15.17-3-pve: 4.15.17-14
    ceph: 12.2.8-pve1
    corosync: 2.4.2-pve5
    criu: 2.11.1-1~bpo90
    glusterfs-client: 3.8.8-1
    ksm-control-daemon: not correctly installed
    libjs-extjs: 6.0.1-2
    libpve-access-control: 5.0-8
    libpve-apiclient-perl: 2.0-5
    libpve-common-perl: 5.0-38
    libpve-guest-common-perl: 2.0-17
    libpve-http-server-perl: 2.0-10
    libpve-storage-perl: 5.0-27
    libqb0: 1.0.1-1
    lvm2: 2.02.168-pve6
    lxc-pve: 3.0.2+pve1-2
    lxcfs: 3.0.0-1
    novnc-pve: 1.0.0-2
    proxmox-widget-toolkit: 1.0-19
    pve-cluster: 5.0-30
    pve-container: 2.0-26
    pve-docs: 5.2-8
    pve-firewall: 3.0-14
    pve-firmware: 2.0-5
    pve-ha-manager: 2.0-5
    pve-i18n: 1.0-6
    pve-libspice-server1: 0.12.8-3
    pve-qemu-kvm: 2.11.2-1
    pve-xtermjs: 1.0-5
    qemu-server: 5.0-33
    smartmontools: 6.5+svn4324-1
    spiceterm: 3.0-5
    vncterm: 1.5-3
    zfsutils-linux: 0.7.9-pve1~bpo9
    
     
  2. wolfgang

    wolfgang Proxmox Staff Member
    Staff Member

    Joined:
    Oct 1, 2014
    Messages:
    3,862
    Likes Received:
    230
    Hi,
    Corosync is not made for unreliable networks.
    Corosync is a real-time messaging service, and you can't make a real-time application latency-tolerant, because that is the opposite of what a real-time application is.
     
  3. Alibek

    Alibek Member

    Joined:
    Jan 13, 2017
    Messages:
    30
    Likes Received:
    3