[SOLVED] after deb upgrade and reboot node can not rejoin cluster

RobFantini

Hello

After a deb update, the node will not rejoin the cluster. From /var/log/apt/history.log, these packages were updated:
Code:
Start-Date: 2018-04-11  17:05:02 
Commandline: apt-get dist-upgrade 
Install: genisoimage:amd64 (9:1.1.11-3+b2, automatic), pve-edk2-firmware:amd64 (1.20180316-1, automatic), libpve-apiclient-perl:amd64 (2.0-4, automatic), pve-kernel-4.13.16-2-pve:amd64 (4.13.16-47, automatic)
Upgrade: proxmox-widget-toolkit:amd64 (1.0-11, 1.0-14), libpve-storage-perl:amd64 (5.0-17, 5.0-18), pve-qemu-kvm:amd64 (2.9.1-9, 2.11.1-5), pve-docs:amd64 (5.1-16, 5.1-17), zfs-initramfs:amd64 (0.7.6-pve1~bpo9, 0.7.7-pve1~bpo9), pve-firewall:amd64 (3.0-5, 3.0-7), pve-container:amd64 (2.0-19, 2.0-21), zfs-zed:amd64 (0.7.6-pve1~bpo9, 0.7.7-pve1~bpo9), pve-cluster:amd64 (5.0-20, 5.0-24), pve-kernel-4.13.16-1-pve:amd64 (4.13.16-43, 4.13.16-46), zfsutils-linux:amd64 (0.7.6-pve1~bpo9, 0.7.7-pve1~bpo9), pve-manager:amd64 (5.1-46, 5.1-49), spl:amd64 (0.7.6-pve1~bpo9, 0.7.7-pve1~bpo9), libzfs2linux:amd64 (0.7.6-pve1~bpo9, 0.7.7-pve1~bpo9), libpve-common-perl:amd64 (5.0-28, 5.0-30), lxc-pve:amd64 (2.1.1-3, 3.0.0-2), qemu-server:amd64 (5.0-22, 5.0-24), libzpool2linux:amd64 (0.7.6-pve1~bpo9, 0.7.7-pve1~bpo9), pve-kernel-4.13:amd64 (5.1-43, 5.1-44), libnvpair1linux:amd64 (0.7.6-pve1~bpo9, 0.7.7-pve1~bpo9), libuutil1linux:amd64 (0.7.6-pve1~bpo9, 0.7.7-pve1~bpo9), lxcfs:amd64 (2.0.8-2, 3.0.0-1)
End-Date: 2018-04-11  17:08:07

I can ssh between all nodes.

We have 5 nodes; 4 are OK.

Early tomorrow I'll continue working on this.

Any suggestions?

Also, per Udo's suggestion in another thread, I tried this; the results are identical on a working node and on the bad node:
Code:
# find  /etc/pve/nodes -name pve-ssl.pem -exec openssl x509  -issuer -in {} -noout \; 
issuer=CN = Proxmox Virtual Environment, OU = 2e5b2fb4-38a7-474b-b2e4-7cd2710eac33, O = PVE Cluster Manager CA 
issuer=CN = Proxmox Virtual Environment, OU = 2e5b2fb4-38a7-474b-b2e4-7cd2710eac33, O = PVE Cluster Manager CA 
issuer=CN = Proxmox Virtual Environment, OU = 2e5b2fb4-38a7-474b-b2e4-7cd2710eac33, O = PVE Cluster Manager CA 
issuer=CN = Proxmox Virtual Environment, OU = 2e5b2fb4-38a7-474b-b2e4-7cd2710eac33, O = PVE Cluster Manager CA 
issuer=CN = Proxmox Virtual Environment, OU = 2e5b2fb4-38a7-474b-b2e4-7cd2710eac33, O = PVE Cluster Manager CA 
issuer=CN = Proxmox Virtual Environment, OU = 2e5b2fb4-38a7-474b-b2e4-7cd2710eac33, O = PVE Cluster Manager CA 
issuer=CN = Proxmox Virtual Environment, OU = 2e5b2fb4-38a7-474b-b2e4-7cd2710eac33, O = PVE Cluster Manager CA 
issuer=CN = Proxmox Virtual Environment, OU = 2e5b2fb4-38a7-474b-b2e4-7cd2710eac33, O = PVE Cluster Manager CA
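A quick way to confirm the per-node certificates agree is to deduplicate the issuer lines: exactly one distinct line means every node certificate was signed by the same cluster CA. A minimal sketch (the issuer lines here are stand-ins for the openssl output above; on a real node, pipe the find/openssl output in instead):

```shell
# Stand-in issuer lines (assumption: placeholder for the real output of
# the find/openssl command above).
issuers='issuer=CN = Proxmox Virtual Environment, O = PVE Cluster Manager CA
issuer=CN = Proxmox Virtual Environment, O = PVE Cluster Manager CA'

# Count distinct issuer lines; 1 means all certificates share one issuer.
unique=$(printf '%s\n' "$issuers" | sort -u | wc -l)
echo "distinct issuers: $unique"
```

If the count is greater than 1, at least one node holds a certificate from a different CA and would be the first suspect.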

Thanks for reading; any suggestion, even a shot in the dark, is appreciated.

My guess is that it has something to do with the pve-cluster deb update.
 
Code:
# systemctl list-units --state=failed
  UNIT                   LOAD   ACTIVE SUB    DESCRIPTION               
● ceph-mon@sys15.service loaded failed failed Ceph cluster monitor daemon
● corosync.service       loaded failed failed Corosync Cluster Engine

Code:
# systemctl status corosync.service
● corosync.service - Corosync Cluster Engine
   Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Wed 2018-04-11 17:16:27 EDT; 8h ago
     Docs: man:corosync
           man:corosync.conf
           man:corosync_overview
  Process: 4095 ExecStart=/usr/sbin/corosync -f $COROSYNC_OPTIONS (code=exited, status=20)
 Main PID: 4095 (code=exited, status=20)
      CPU: 69ms

Apr 11 17:16:27 sys15 corosync[4095]: info    [WD    ] no resources configured.
Apr 11 17:16:27 sys15 corosync[4095]: notice  [SERV  ] Service engine loaded: corosync watchdog service [7]
Apr 11 17:16:27 sys15 corosync[4095]: notice  [QUORUM] Using quorum provider corosync_votequorum
Apr 11 17:16:27 sys15 corosync[4095]: crit    [QUORUM] Quorum provider: corosync_votequorum failed to initialize.
Apr 11 17:16:27 sys15 corosync[4095]: error   [SERV  ] Service engine 'corosync_quorum' failed to load for reason 'configuration error:
Apr 11 17:16:27 sys15 corosync[4095]: error   [MAIN  ] Corosync Cluster Engine exiting with status 20 at service.c:356.
Apr 11 17:16:27 sys15 systemd[1]: corosync.service: Main process exited, code=exited, status=20/n/a
Apr 11 17:16:27 sys15 systemd[1]: Failed to start Corosync Cluster Engine.
Apr 11 17:16:27 sys15 systemd[1]: corosync.service: Unit entered failed state.
Apr 11 17:16:27 sys15 systemd[1]: corosync.service: Failed with result 'exit-code'.

still debugging...
 
Grepping for nodelist shows it is used in corosync.conf:
Code:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: pve10
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.1.10.10
  }
  node {
    name: pve3
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.1.10.3
  }
  node {
    name: sys13
    nodeid: 5
    quorum_votes: 1
    ring0_addr: 10.1.10.13
  }
  node {
    name: sys15
    nodeid: 4
    quorum_votes: 1
    ring0_addr: sys15
  }
  node {
    name: sys8
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.1.10.8
  }
}

quorum {
  provider: corosync_votequorum
}
totem {
  cluster_name: 20170226
  config_version: 39
  interface {
    bindnetaddr: 10.1.10.10
    ringnumber: 0
  }
  ip_version: ipv4
  secauth: on
  version: 2
}
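One entry above uses a hostname (ring0_addr: sys15) while the rest use IP addresses. Such entries can be flagged mechanically; a sketch (the sample file below is an assumption standing in for the real /etc/corosync/corosync.conf):

```shell
# Write a sample config for illustration (assumption: stand-in for the
# real corosync.conf shown above).
CONF=$(mktemp)
cat > "$CONF" <<'EOF'
nodelist {
  node {
    name: pve10
    ring0_addr: 10.1.10.10
  }
  node {
    name: sys15
    ring0_addr: sys15
  }
}
EOF

# Flag ring0_addr values that are not dotted-quad IPs; a hostname there
# relies on name resolution working before the cluster is up.
bad=$(awk '$1 == "ring0_addr:" && $2 !~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/ \
             { print $2 }' "$CONF")
echo "non-IP ring0_addr: $bad"
rm -f "$CONF"
```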


On the plus side, Ceph is OK. I ran this from sys15, the disconnected cluster node:
Code:
# ceph -s
  cluster:
    id:     220b9a53-4556-48e3-a73c-28deff665e45
    health: HEALTH_OK
  services:
    mon: 3 daemons, quorum pve3,sys8,pve10
    mgr: sys8(active), standbys: pve10, pve3
    osd: 24 osds: 24 up, 24 in
  data:
    pools:   2 pools, 1024 pgs
    objects: 246k objects, 946 GB
    usage:   2795 GB used, 7933 GB / 10728 GB avail
    pgs:     1024 active+clean
  io:
    client:   1713 kB/s wr, 0 op/s rd, 152 op/s wr
 
Hi Rob,

you should never mix names and IP addresses in ring0_addr.
Change this line to the host's IP address:

ring0_addr: sys15
 
Solved by changing corosync.conf to use an IP ring0_addr for sys15:
Code:
  node {
    name: sys15
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 10.1.10.15
  }

After making the edit on a working node and rebooting sys15, it joined the cluster.
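Worth noting for this kind of edit: the totem section carries a config_version (39 in the config above), and it should be incremented when the file changes so the other nodes accept the edited config as newer. A sed sketch against a scratch copy (the scratch path is an assumption; in practice you edit /etc/pve/corosync.conf on a quorate node):

```shell
# Scratch file standing in for /etc/pve/corosync.conf (assumption).
F=$(mktemp)
printf '  config_version: 39\n' > "$F"

# Read the current version, then bump it by one in place.
v=$(awk '$1 == "config_version:" { print $2 }' "$F")
sed -i "s/config_version: $v/config_version: $((v + 1))/" "$F"

new=$(awk '$1 == "config_version:" { print $2 }' "$F")
echo "config_version: $v -> $new"
rm -f "$F"
```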

Note: sys15 had been rebooted 10+ times since it was created a little over a week ago; I am doing hardware tests, so it got restarted often.

I am not sure what caused the issue. I'll look for DNS issues.
 
