[SOLVED] after deb upgrade and reboot node can not rejoin cluster

RobFantini

Hello

After the deb update, the node will not rejoin the cluster. From /var/log/apt/history.log, these packages were updated:
Code:
Start-Date: 2018-04-11  17:05:02 
Commandline: apt-get dist-upgrade 
Install: genisoimage:amd64 (9:1.1.11-3+b2, automatic), pve-edk2-firmware:amd64 (1.20180316-1, automatic), libpve-apiclient-perl:amd64 (2.0-4, automatic), pve-kernel-4.13.16-2-pve:amd64 (4.13.16-47, automatic)
Upgrade: proxmox-widget-toolkit:amd64 (1.0-11, 1.0-14), libpve-storage-perl:amd64 (5.0-17, 5.0-18), pve-qemu-kvm:amd64 (2.9.1-9, 2.11.1-5), pve-docs:amd64 (5.1-16, 5.1-17), zfs-initramfs:amd64 (0.7.6-pve1~bpo9, 0.7.7-pve1~bpo9), pve-firewall:amd64 (3.0-5, 3.0-7), pve-container:amd64 (2.0-19, 2.0-21), zfs-zed:amd64 (0.7.6-pve1~bpo9, 0.7.7-pve1~bpo9), pve-cluster:amd64 (5.0-20, 5.0-24), pve-kernel-4.13.16-1-pve:amd64 (4.13.16-43, 4.13.16-46), zfsutils-linux:amd64 (0.7.6-pve1~bpo9, 0.7.7-pve1~bpo9), pve-manager:amd64 (5.1-46, 5.1-49), spl:amd64 (0.7.6-pve1~bpo9, 0.7.7-pve1~bpo9), libzfs2linux:amd64 (0.7.6-pve1~bpo9, 0.7.7-pve1~bpo9), libpve-common-perl:amd64 (5.0-28, 5.0-30), lxc-pve:amd64 (2.1.1-3, 3.0.0-2), qemu-server:amd64 (5.0-22, 5.0-24), libzpool2linux:amd64 (0.7.6-pve1~bpo9, 0.7.7-pve1~bpo9), pve-kernel-4.13:amd64 (5.1-43, 5.1-44), libnvpair1linux:amd64 (0.7.6-pve1~bpo9, 0.7.7-pve1~bpo9), libuutil1linux:amd64 (0.7.6-pve1~bpo9, 0.7.7-pve1~bpo9), lxcfs:amd64 (2.0.8-2, 3.0.0-1)
End-Date: 2018-04-11  17:08:07

I can ssh between all nodes.

We have 5 nodes and 4 are OK.

Early tomorrow I'll continue to work on this.

Any suggestions?

Also, per Udo's suggestion in another thread, I tried this; the results are identical on a working node and on the bad node:
Code:
# find  /etc/pve/nodes -name pve-ssl.pem -exec openssl x509  -issuer -in {} -noout \; 
issuer=CN = Proxmox Virtual Environment, OU = 2e5b2fb4-38a7-474b-b2e4-7cd2710eac33, O = PVE Cluster Manager CA 
issuer=CN = Proxmox Virtual Environment, OU = 2e5b2fb4-38a7-474b-b2e4-7cd2710eac33, O = PVE Cluster Manager CA 
issuer=CN = Proxmox Virtual Environment, OU = 2e5b2fb4-38a7-474b-b2e4-7cd2710eac33, O = PVE Cluster Manager CA 
issuer=CN = Proxmox Virtual Environment, OU = 2e5b2fb4-38a7-474b-b2e4-7cd2710eac33, O = PVE Cluster Manager CA 
issuer=CN = Proxmox Virtual Environment, OU = 2e5b2fb4-38a7-474b-b2e4-7cd2710eac33, O = PVE Cluster Manager CA 
issuer=CN = Proxmox Virtual Environment, OU = 2e5b2fb4-38a7-474b-b2e4-7cd2710eac33, O = PVE Cluster Manager CA 
issuer=CN = Proxmox Virtual Environment, OU = 2e5b2fb4-38a7-474b-b2e4-7cd2710eac33, O = PVE Cluster Manager CA 
issuer=CN = Proxmox Virtual Environment, OU = 2e5b2fb4-38a7-474b-b2e4-7cd2710eac33, O = PVE Cluster Manager CA

Thanks for reading; any suggestions, even a shot in the dark, are appreciated.

My guess is that it has something to do with the pve-cluster deb update.
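
One way to check that suspicion is to compare installed package versions between the bad node and a working one. A sketch, assuming root ssh between nodes (which I have) and sys13 as the example working node:
Code:
# on each node, dump the PVE package versions to a file
pveversion -v > /tmp/versions-$(hostname -s).txt
# pull the working node's list over and diff against the bad node's
scp sys13:/tmp/versions-sys13.txt /tmp/
diff /tmp/versions-sys15.txt /tmp/versions-sys13.txt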
 
Code:
# systemctl list-units --state=failed
  UNIT                   LOAD   ACTIVE SUB    DESCRIPTION               
● ceph-mon@sys15.service loaded failed failed Ceph cluster monitor daemon
● corosync.service       loaded failed failed Corosync Cluster Engine

Code:
# systemctl status corosync.service
● corosync.service - Corosync Cluster Engine
   Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Wed 2018-04-11 17:16:27 EDT; 8h ago
     Docs: man:corosync
           man:corosync.conf
           man:corosync_overview
  Process: 4095 ExecStart=/usr/sbin/corosync -f $COROSYNC_OPTIONS (code=exited, status=20)
 Main PID: 4095 (code=exited, status=20)
      CPU: 69ms

Apr 11 17:16:27 sys15 corosync[4095]: info    [WD    ] no resources configured.
Apr 11 17:16:27 sys15 corosync[4095]: notice  [SERV  ] Service engine loaded: corosync watchdog service [7]
Apr 11 17:16:27 sys15 corosync[4095]: notice  [QUORUM] Using quorum provider corosync_votequorum
Apr 11 17:16:27 sys15 corosync[4095]: crit    [QUORUM] Quorum provider: corosync_votequorum failed to initialize.
Apr 11 17:16:27 sys15 corosync[4095]: error   [SERV  ] Service engine 'corosync_quorum' failed to load for reason 'configuration error:
Apr 11 17:16:27 sys15 corosync[4095]: error   [MAIN  ] Corosync Cluster Engine exiting with status 20 at service.c:356.
Apr 11 17:16:27 sys15 systemd[1]: corosync.service: Main process exited, code=exited, status=20/n/a
Apr 11 17:16:27 sys15 systemd[1]: Failed to start Corosync Cluster Engine.
Apr 11 17:16:27 sys15 systemd[1]: corosync.service: Unit entered failed state.
Apr 11 17:16:27 sys15 systemd[1]: corosync.service: Failed with result 'exit-code'.
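
The status output truncates the interesting 'configuration error' line; the full text should be in the journal (standard systemd tooling, nothing PVE-specific):
Code:
# full corosync log for the current boot, no line truncation
journalctl -b -u corosync.service --no-pager
# or just the most recent startup attempt
journalctl -u corosync.service -n 30 --no-pager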

Still debugging...
 
A grep for nodelist shows that corosync.conf uses it, so here is the file:
Code:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: pve10
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.1.10.10
  }
  node {
    name: pve3
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.1.10.3
  }
  node {
    name: sys13
    nodeid: 5
    quorum_votes: 1
    ring0_addr: 10.1.10.13
  }
  node {
    name: sys15
    nodeid: 4
    quorum_votes: 1
    ring0_addr: sys15
  }
  node {
    name: sys8
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.1.10.8
  }
}

quorum {
  provider: corosync_votequorum
}
totem {
  cluster_name: 20170226
  config_version: 39
  interface {
    bindnetaddr: 10.1.10.10
    ringnumber: 0
  }
  ip_version: ipv4
  secauth: on
  version: 2
}
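
One thing that stands out: sys15's ring0_addr is a hostname, while every other node uses an IP address. It seems worth checking what that name resolves to, in case votequorum is tripping over an entry it cannot map (plain resolver checks, nothing corosync-specific):
Code:
# what does the libc resolver return for the name used in ring0_addr?
getent hosts sys15
# compare against the addresses actually configured on the ring0 network
ip -4 addr show | grep '10\.1\.10\.'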


On the plus side, Ceph is OK. I ran this from sys15, the disconnected cluster node:
Code:
# ceph -s
  cluster:
    id:     220b9a53-4556-48e3-a73c-28deff665e45
    health: HEALTH_OK
  services:
    mon: 3 daemons, quorum pve3,sys8,pve10
    mgr: sys8(active), standbys: pve10, pve3
    osd: 24 osds: 24 up, 24 in
  data:
    pools:   2 pools, 1024 pgs
    objects: 246k objects, 946 GB
    usage:   2795 GB used, 7933 GB / 10728 GB avail
    pgs:     1024 active+clean
  io:
    client:   1713 kB/s wr, 0 op/s rd, 152 op/s wr
 
Hi Rob,

You should never mix names and IP addresses in corosync.conf.
Change this line to the host's IP address:

ring0_addr: sys15
 
Solved by changing corosync.conf to use an IP address for sys15's ring0_addr:
Code:
  node {
    name: sys15
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 10.1.10.15
  }

After making the edit on a working node and rebooting sys15, it joined the cluster.
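
For anyone following along, a minimal sketch of how I applied the edit, assuming the standard PVE layout where /etc/pve/corosync.conf is the cluster-wide copy and config_version must be incremented for the change to propagate:
Code:
# on a quorate node: work on a copy so a half-edited file never goes live
cp /etc/pve/corosync.conf /root/corosync.conf.new
# set ring0_addr to 10.1.10.15 and bump config_version (39 -> 40)
nano /root/corosync.conf.new
# moving it back into /etc/pve activates it and syncs it to all nodes
mv /root/corosync.conf.new /etc/pve/corosync.conf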

Note that sys15 had been rebooted 10+ times since it was created a little over a week ago; I am doing hardware tests, so it got restarted often.

I am not sure what caused the issue. I'll look for DNS issues.
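
A starting point for that check: make sure sys15 resolves the same way on every node, since a stale /etc/hosts entry on one box would explain it. A sketch using this cluster's node names:
Code:
# compare /etc/hosts entries and resolver answers for sys15 on every node
for n in pve3 sys8 pve10 sys13 sys15; do
    echo "== $n =="
    ssh "$n" 'grep sys15 /etc/hosts; getent hosts sys15'
done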