[SOLVED] after deb upgrade and reboot node can not rejoin cluster

RobFantini

Hello

After the deb update, the node will not rejoin the cluster. From /var/log/apt/history.log, these packages were updated:
Code:
Start-Date: 2018-04-11  17:05:02 
Commandline: apt-get dist-upgrade 
Install: genisoimage:amd64 (9:1.1.11-3+b2, automatic), pve-edk2-firmware:amd64 (1.20180316-1, automatic), libpve-apiclient-perl:amd64 (2.0-4, automatic), pve-kernel-4.13.16-2-pve:amd64 (4.13.16-47, automatic)
Upgrade: proxmox-widget-toolkit:amd64 (1.0-11, 1.0-14), libpve-storage-perl:amd64 (5.0-17, 5.0-18), pve-qemu-kvm:amd64 (2.9.1-9, 2.11.1-5), pve-docs:amd64 (5.1-16, 5.1-17), zfs-initramfs:amd64 (0.7.6-pve1~bpo9, 0.7.7-pve1~bpo9), pve-firewall:amd64 (3.0-5, 3.0-7), pve-container:amd64 (2.0-19, 2.0-21), zfs-zed:amd64 (0.7.6-pve1~bpo9, 0.7.7-pve1~bpo9), pve-cluster:amd64 (5.0-20, 5.0-24), pve-kernel-4.13.16-1-pve:amd64 (4.13.16-43, 4.13.16-46), zfsutils-linux:amd64 (0.7.6-pve1~bpo9, 0.7.7-pve1~bpo9), pve-manager:amd64 (5.1-46, 5.1-49), spl:amd64 (0.7.6-pve1~bpo9, 0.7.7-pve1~bpo9), libzfs2linux:amd64 (0.7.6-pve1~bpo9, 0.7.7-pve1~bpo9), libpve-common-perl:amd64 (5.0-28, 5.0-30), lxc-pve:amd64 (2.1.1-3, 3.0.0-2), qemu-server:amd64 (5.0-22, 5.0-24), libzpool2linux:amd64 (0.7.6-pve1~bpo9, 0.7.7-pve1~bpo9), pve-kernel-4.13:amd64 (5.1-43, 5.1-44), libnvpair1linux:amd64 (0.7.6-pve1~bpo9, 0.7.7-pve1~bpo9), libuutil1linux:amd64 (0.7.6-pve1~bpo9, 0.7.7-pve1~bpo9), lxcfs:amd64 (2.0.8-2, 3.0.0-1)
End-Date: 2018-04-11  17:08:07

I can ssh between all nodes.

We have 5 nodes and 4 are OK.

Early tomorrow I'll continue to work on this.

Any suggestions?

Also, per Udo's suggestion in another thread, I tried this; the results are identical on a working node and on the bad node:
Code:
# find  /etc/pve/nodes -name pve-ssl.pem -exec openssl x509  -issuer -in {} -noout \; 
issuer=CN = Proxmox Virtual Environment, OU = 2e5b2fb4-38a7-474b-b2e4-7cd2710eac33, O = PVE Cluster Manager CA 
issuer=CN = Proxmox Virtual Environment, OU = 2e5b2fb4-38a7-474b-b2e4-7cd2710eac33, O = PVE Cluster Manager CA 
issuer=CN = Proxmox Virtual Environment, OU = 2e5b2fb4-38a7-474b-b2e4-7cd2710eac33, O = PVE Cluster Manager CA 
issuer=CN = Proxmox Virtual Environment, OU = 2e5b2fb4-38a7-474b-b2e4-7cd2710eac33, O = PVE Cluster Manager CA 
issuer=CN = Proxmox Virtual Environment, OU = 2e5b2fb4-38a7-474b-b2e4-7cd2710eac33, O = PVE Cluster Manager CA 
issuer=CN = Proxmox Virtual Environment, OU = 2e5b2fb4-38a7-474b-b2e4-7cd2710eac33, O = PVE Cluster Manager CA 
issuer=CN = Proxmox Virtual Environment, OU = 2e5b2fb4-38a7-474b-b2e4-7cd2710eac33, O = PVE Cluster Manager CA 
issuer=CN = Proxmox Virtual Environment, OU = 2e5b2fb4-38a7-474b-b2e4-7cd2710eac33, O = PVE Cluster Manager CA

Thanks for reading; any suggestions, even a shot in the dark, are appreciated.

My guess is that it has something to do with the pve-cluster deb update.
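
One way to check that suspicion is to compare installed package versions between the bad node and a working one. A sketch, assuming root ssh between nodes (which I have) and sys13 as the example working node:
Code:
# on each node, dump the PVE package versions to a file
pveversion -v > /tmp/versions-$(hostname -s).txt
# pull the working node's list over and diff against the bad node's
scp sys13:/tmp/versions-sys13.txt /tmp/
diff /tmp/versions-sys15.txt /tmp/versions-sys13.txt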
 
Code:
# systemctl list-units --state=failed
  UNIT                   LOAD   ACTIVE SUB    DESCRIPTION               
● ceph-mon@sys15.service loaded failed failed Ceph cluster monitor daemon
● corosync.service       loaded failed failed Corosync Cluster Engine

Code:
# systemctl status corosync.service
● corosync.service - Corosync Cluster Engine
   Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Wed 2018-04-11 17:16:27 EDT; 8h ago
     Docs: man:corosync
           man:corosync.conf
           man:corosync_overview
  Process: 4095 ExecStart=/usr/sbin/corosync -f $COROSYNC_OPTIONS (code=exited, status=20)
 Main PID: 4095 (code=exited, status=20)
      CPU: 69ms

Apr 11 17:16:27 sys15 corosync[4095]: info    [WD    ] no resources configured.
Apr 11 17:16:27 sys15 corosync[4095]: notice  [SERV  ] Service engine loaded: corosync watchdog service [7]
Apr 11 17:16:27 sys15 corosync[4095]: notice  [QUORUM] Using quorum provider corosync_votequorum
Apr 11 17:16:27 sys15 corosync[4095]: crit    [QUORUM] Quorum provider: corosync_votequorum failed to initialize.
Apr 11 17:16:27 sys15 corosync[4095]: error   [SERV  ] Service engine 'corosync_quorum' failed to load for reason 'configuration error:
Apr 11 17:16:27 sys15 corosync[4095]: error   [MAIN  ] Corosync Cluster Engine exiting with status 20 at service.c:356.
Apr 11 17:16:27 sys15 systemd[1]: corosync.service: Main process exited, code=exited, status=20/n/a
Apr 11 17:16:27 sys15 systemd[1]: Failed to start Corosync Cluster Engine.
Apr 11 17:16:27 sys15 systemd[1]: corosync.service: Unit entered failed state.
Apr 11 17:16:27 sys15 systemd[1]: corosync.service: Failed with result 'exit-code'.
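
The status output truncates the interesting 'configuration error' line; the full text should be in the journal (standard systemd tooling, nothing PVE-specific):
Code:
# full corosync log for the current boot, no line truncation
journalctl -b -u corosync.service --no-pager
# or just the most recent startup attempt
journalctl -u corosync.service -n 30 --no-pager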

Still debugging...
 
A grep for nodelist shows that corosync.conf uses it, so here is the file:
Code:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: pve10
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.1.10.10
  }
  node {
    name: pve3
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.1.10.3
  }
  node {
    name: sys13
    nodeid: 5
    quorum_votes: 1
    ring0_addr: 10.1.10.13
  }
  node {
    name: sys15
    nodeid: 4
    quorum_votes: 1
    ring0_addr: sys15
  }
  node {
    name: sys8
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.1.10.8
  }
}

quorum {
  provider: corosync_votequorum
}
totem {
  cluster_name: 20170226
  config_version: 39
  interface {
    bindnetaddr: 10.1.10.10
    ringnumber: 0
  }
  ip_version: ipv4
  secauth: on
  version: 2
}
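
One thing that stands out: sys15's ring0_addr is a hostname, while every other node uses an IP address. It seems worth checking what that name resolves to, in case votequorum is tripping over an entry it cannot map (plain resolver checks, nothing corosync-specific):
Code:
# what does the libc resolver return for the name used in ring0_addr?
getent hosts sys15
# compare against the addresses actually configured on the ring0 network
ip -4 addr show | grep '10\.1\.10\.'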


On the plus side, Ceph is OK. I ran this from sys15, the disconnected cluster node:
Code:
# ceph -s
  cluster:
    id:     220b9a53-4556-48e3-a73c-28deff665e45
    health: HEALTH_OK
  services:
    mon: 3 daemons, quorum pve3,sys8,pve10
    mgr: sys8(active), standbys: pve10, pve3
    osd: 24 osds: 24 up, 24 in
  data:
    pools:   2 pools, 1024 pgs
    objects: 246k objects, 946 GB
    usage:   2795 GB used, 7933 GB / 10728 GB avail
    pgs:     1024 active+clean
  io:
    client:   1713 kB/s wr, 0 op/s rd, 152 op/s wr
 
Hi Rob,

You should never mix names and IP addresses in corosync.conf.
Change this line to the host's IP address:

ring0_addr: sys15
 
Solved by changing corosync.conf to use an IP address for sys15's ring0_addr:
Code:
  node {
    name: sys15
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 10.1.10.15
  }

After making the edit on a working node and rebooting sys15, it joined the cluster.
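
For anyone following along, a minimal sketch of how I applied the edit, assuming the standard PVE layout where /etc/pve/corosync.conf is the cluster-wide copy and config_version must be incremented for the change to propagate:
Code:
# on a quorate node: work on a copy so a half-edited file never goes live
cp /etc/pve/corosync.conf /root/corosync.conf.new
# set ring0_addr to 10.1.10.15 and bump config_version (39 -> 40)
nano /root/corosync.conf.new
# moving it back into /etc/pve activates it and syncs it to all nodes
mv /root/corosync.conf.new /etc/pve/corosync.conf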

Note that sys15 had been rebooted 10+ times since it was created a little over a week ago; I am doing hardware tests, so it got restarted often.

I am not sure what caused the issue. I'll look for DNS issues.
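
A starting point for that check: make sure sys15 resolves the same way on every node, since a stale /etc/hosts entry on one box would explain it. A sketch using this cluster's node names:
Code:
# compare /etc/hosts entries and resolver answers for sys15 on every node
for n in pve3 sys8 pve10 sys13 sys15; do
    echo "== $n =="
    ssh "$n" 'grep sys15 /etc/hosts; getent hosts sys15'
done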