Problem after network failure

Discussion in 'Proxmox VE: Installation and configuration' started by axion.joey, Aug 15, 2019.

  1. axion.joey

    axion.joey Member

    Hey Everyone,

    We recently had a network failure in one of our data centers. The failure caused all of the Proxmox nodes in our cluster to fence themselves. They're back up and running, and the cluster shows all nodes as members, but we're having the following issues:
    1. HA no longer works. Containers managed by HA can't be started; to start them we have to remove them from HA.
    2. We can't add new nodes to the cluster. When I try to add a new node I get this response:
    pvecm add 10.3.16.20
    root@10.3.16.20's password:
    copy corosync auth key
    stopping pve-cluster service
    backup old database
    Job for corosync.service failed. See 'systemctl status corosync.service' and 'journalctl -xn' for details.
    waiting for quorum...
    And then it hangs.

    Any help would be greatly appreciated.
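    Happy to post more output if it helps. Here's a minimal sketch of what I can pull from the affected nodes (assuming the stock PVE 4.x tools: pvecm, ha-manager and the shipped systemd units):

    # membership and quorum as corosync/pmxcfs see it
    pvecm status
    # HA resource state as the HA stack sees it
    ha-manager status
    # the HA services themselves (cluster resource manager + local resource manager)
    systemctl status pve-ha-crm pve-ha-lrm
    # recent corosync / pmxcfs log context around the failure
    journalctl -u corosync -u pve-cluster --since "2019-08-10"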
     
  2. axion.joey

    axion.joey Member

    Here's some additional info, starting with pveversion -v:
    proxmox-ve: 4.2-48 (running kernel: 4.4.6-1-pve)
    pve-manager: 4.2-2 (running version: 4.2-2/725d76f0)
    pve-kernel-4.4.6-1-pve: 4.4.6-48
    pve-kernel-4.2.6-1-pve: 4.2.6-36
    lvm2: 2.02.116-pve2
    corosync-pve: 2.3.5-2
    libqb0: 1.0-1
    pve-cluster: 4.0-39
    qemu-server: 4.0-72
    pve-firmware: 1.1-8
    libpve-common-perl: 4.0-59
    libpve-access-control: 4.0-16
    libpve-storage-perl: 4.0-50
    pve-libspice-server1: 0.12.5-2
    vncterm: 1.2-1
    pve-qemu-kvm: 2.5-14
    pve-container: 1.0-62
    pve-firewall: 2.0-25
    pve-ha-manager: 1.0-28
    ksm-control-daemon: 1.2-1
    glusterfs-client: 3.5.2-2+deb8u1
    lxc-pve: 1.1.5-7
    lxcfs: 2.0.0-pve2
    cgmanager: 0.39-pve1
    criu: 1.6.0-1
    zfsutils: 0.6.5-pve9~jessie

    systemctl status corosync.service
    ● corosync.service - Corosync Cluster Engine
    Loaded: loaded (/lib/systemd/system/corosync.service; enabled)
    Active: active (running) since Sat 2019-08-10 16:55:47 PDT; 4 days ago
    Process: 3199 ExecStart=/usr/share/corosync/corosync start (code=exited, status=0/SUCCESS)
    Main PID: 3209 (corosync)
    CGroup: /system.slice/corosync.service
    └─3209 corosync
    Aug 11 22:50:25 proxmoxnj1 corosync[3209]: [QUORUM] Members[9]: 1 2 3 4 5 6 7 8 9
    Aug 11 22:50:25 proxmoxnj1 corosync[3209]: [MAIN ] Completed service synchronization, ready to provide service.
    Aug 14 18:35:21 proxmoxnj1 corosync[3209]: [CFG ] Config reload requested by node 1
    Aug 14 19:04:51 proxmoxnj1 corosync[3209]: [TOTEM ] A new membership (10.3.16.20:1316) was formed. Members left: 4
    Aug 14 19:04:51 proxmoxnj1 corosync[3209]: [QUORUM] Members[8]: 1 2 3 5 6 7 8 9
    Aug 14 19:04:51 proxmoxnj1 corosync[3209]: [MAIN ] Completed service synchronization, ready to provide service.
    Aug 14 19:11:22 proxmoxnj1 corosync[3209]: [TOTEM ] A new membership (10.3.16.20:1320) was formed. Members joined: 4
    Aug 14 19:11:22 proxmoxnj1 corosync[3209]: [QUORUM] Members[9]: 1 2 3 4 5 6 7 8 9
    Aug 14 19:11:22 proxmoxnj1 corosync[3209]: [MAIN ] Completed service synchronization, ready to provide service.
    Aug 14 19:20:54 proxmoxnj1 corosync[3209]: [CFG ] Config reload requested by node 4

    cat /etc/pve/corosync.conf
    logging {
      debug: off
      to_syslog: yes
    }

    nodelist {
      node {
        name: proxmoxnj1
        nodeid: 1
        quorum_votes: 1
        ring0_addr: proxmoxnj1
      }
      node {
        name: ProxmoxCoreNJ2
        nodeid: 10
        quorum_votes: 1
        ring0_addr: ProxmoxCoreNJ2
      }
      node {
        name: proxmoxnj2
        nodeid: 2
        quorum_votes: 1
        ring0_addr: proxmoxnj2
      }
      node {
        name: proxmoxnj3
        nodeid: 3
        quorum_votes: 1
        ring0_addr: proxmoxnj3
      }
      node {
        name: ProxmoxCoreNJ1
        nodeid: 11
        quorum_votes: 1
        ring0_addr: ProxmoxCoreNJ1
      }
      node {
        name: ProxmoxNJ4
        nodeid: 4
        quorum_votes: 1
        ring0_addr: ProxmoxNJ4
      }
      node {
        name: ProxmoxNJ6
        nodeid: 6
        quorum_votes: 1
        ring0_addr: ProxmoxNJ6
      }
      node {
        name: ProxmoxNJ8
        nodeid: 8
        quorum_votes: 1
        ring0_addr: ProxmoxNJ8
      }
      node {
        name: ProxmoxNJ9
        nodeid: 9
        quorum_votes: 1
        ring0_addr: ProxmoxNJ9
      }
      node {
        name: ProxmoxNJ7
        nodeid: 7
        quorum_votes: 1
        ring0_addr: ProxmoxNJ7
      }
      node {
        name: ProxmoxNJ5
        nodeid: 5
        quorum_votes: 1
        ring0_addr: ProxmoxNJ5
      }
    }

    quorum {
      provider: corosync_votequorum
    }

    totem {
      cluster_name: proxmoxnj
      config_version: 15
      ip_version: ipv4
      secauth: on
      version: 2
      interface {
        bindnetaddr: 10.3.16.20
        ringnumber: 0
      }
    }
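
    One sanity check that might be worth running on each node (a sketch, assuming the stock PVE layout where pmxcfs syncs /etc/pve/corosync.conf out to the local /etc/corosync/corosync.conf): the two copies should be identical, and every node should report the same config_version (15 above).

    # cluster-wide config vs. what corosync loaded locally
    diff /etc/pve/corosync.conf /etc/corosync/corosync.conf
    # should print config_version: 15 on every node
    grep config_version /etc/corosync/corosync.conf
    # corosync's own runtime view of quorum and votes
    corosync-quorumtool -s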
     
  3. axion.joey

    axion.joey Member

    pvecm status
    Quorum information
    ------------------
    Date: Wed Aug 14 20:29:31 2019
    Quorum provider: corosync_votequorum
    Nodes: 9
    Node ID: 0x00000001
    Ring ID: 1320
    Quorate: Yes
    Votequorum information
    ----------------------
    Expected votes: 11
    Highest expected: 11
    Total votes: 9
    Quorum: 6
    Flags: Quorate
    Membership information
    ----------------------
    Nodeid Votes Name
    0x00000001 1 10.3.16.20 (local)
    0x00000002 1 10.3.16.21
    0x00000003 1 10.3.16.22
    0x00000004 1 10.3.16.25
    0x00000005 1 10.3.16.26
    0x00000006 1 10.3.16.27
    0x00000007 1 10.3.16.28
    0x00000008 1 10.3.16.40
    0x00000009 1 10.3.16.41

    The two missing nodes are the ones I tried to add to the cluster today.
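
    If I'm reading the votequorum arithmetic right, that also explains why the cluster still reports Quorate despite the two missing nodes (nodeids 10 and 11, ProxmoxCoreNJ1/ProxmoxCoreNJ2 from corosync.conf):

    expected votes: 11
    quorum threshold: floor(11 / 2) + 1 = 6
    votes present: 9 (>= 6, so Quorate)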
     