[SOLVED] Node out of cluster after upgrade?

NPK

Hi,

I have a problem with an upgrade on my 17-node cluster.

I've upgraded my first node to PVE 8.0.4 and, after rebooting it, all the other nodes are marked with a red cross. It's impossible to connect to the other nodes via the GUI (SSH is OK): "please wait" hangs for a long time and then, finally, "Login failed".

On node 1:

Code:
root@cloud-node-01:/var/log# pvecm status
Cluster information
-------------------
Name:             CLOUD
Config Version:   22
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Mon Aug  7 16:44:10 2023
Quorum provider:  corosync_votequorum
Nodes:            1
Node ID:          0x00000001
Ring ID:          1.5545
Quorate:          No

Votequorum information
----------------------
Expected votes:   17
Highest expected: 17
Total votes:      1
Quorum:           9 Activity blocked
Flags:

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 172.16.240.1 (local)

On another node, after many long seconds...:

Code:
root@cloud-node-08:~# pvecm status
Cluster information
-------------------
Name:             CLOUD
Config Version:   22
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Mon Aug  7 16:46:07 2023
Quorum provider:  corosync_votequorum
Nodes:            16
Node ID:          0x0000000e
Ring ID:          2.4bc8
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   17
Highest expected: 17
Total votes:      16
Quorum:           9
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000002          1 172.16.240.2
0x00000003          1 172.16.240.5
0x00000004          1 172.16.240.4
0x00000005          1 172.16.240.3
0x00000006          1 172.16.240.80
0x00000007          1 172.16.240.51
0x00000008          1 172.16.240.52
0x00000009          1 172.16.240.53
0x0000000a          1 172.16.240.82
0x0000000b          1 172.16.240.81
0x0000000c          1 172.16.240.6
0x0000000d          1 172.16.240.7
0x0000000e          1 172.16.240.8 (local)
0x0000000f          1 172.16.240.9
0x00000010          1 172.16.240.10
0x00000011          1 172.16.240.11
root@cloud-node-08:~#

It looks like two clusters with the same name... what's wrong?

Thanks!
 
Hi,
it looks like the upgraded node lost connection to the rest of the cluster; check your cluster network. Please post the output of cat /etc/network/interfaces and attach the content of journalctl -b > journal.txt. Maybe your interfaces got renamed?
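Something along these lines should gather everything in one go (just a sketch using the standard tools; adjust paths and file names as you like):

Code:
# On the upgraded node: collect network config, interface names and the boot journal
cat /etc/network/interfaces           # current network configuration
ip -br link                           # quick overview, useful to spot renamed interfaces
corosync-cfgtool -s                   # link status as corosync itself sees it
journalctl -b > journal.txt           # full journal of the current boot, to attach here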
 
Thanks for the reply.

On node 1 (upgraded):

Code:
auto lo
iface lo inet loopback

iface eno1 inet manual

iface ens4f1 inet manual

auto ens3f0
iface ens3f0 inet manual

iface enx3a68dd057237 inet manual

auto ens3f1
iface ens3f1 inet manual

iface ens3f2 inet manual

iface eno2 inet manual

iface ens3f3 inet manual

auto ens4f0
iface ens4f0 inet static
        address 192.168.0.1/24
        mtu 1500
#Carte 10G

auto bond0
iface bond0 inet manual
        bond-slaves ens3f0 ens3f1
        bond-miimon 100
        bond-mode 802.3ad

auto vmbr0
iface vmbr0 inet static
        address 172.16.240.1/20
        gateway 172.16.255.254
        bridge-ports eno1
        bridge-stp off
        bridge-fd 0

auto vmbr1
iface vmbr1 inet manual
        bridge-ports bond0
        bridge-stp off
        bridge-fd 0

On node 8 (not upgraded):

Code:
auto lo
iface lo inet loopback

iface eno1 inet manual

auto ens2f0
iface ens2f0 inet manual

auto ens2f1
iface ens2f1 inet manual

auto ens2f2
iface ens2f2 inet manual

iface eno2 inet manual

auto ens4f0
iface ens4f0 inet static
        address 192.168.0.8/24
#Carte 10G

auto ens2f3
iface ens2f3 inet manual

iface ens4f1 inet manual

auto bond0
iface bond0 inet manual
        bond-slaves ens2f0 ens2f1 ens2f2 ens2f3
        bond-miimon 100
        bond-mode 802.3ad

auto vmbr0
iface vmbr0 inet static
        address 172.16.240.8/24
        gateway 172.16.255.254
        bridge-ports eno1
        bridge-stp off
        bridge-fd 0

auto vmbr1
iface vmbr1 inet manual
        bridge-ports bond0
        bridge-stp off
        bridge-fd 0

Ping is OK between node 1 & node 8.

Network cards on node 1:

Code:
root@cloud-node-01:/var/log# ip link show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master vmbr0 state UP mode DEFAULT group default qlen 1000
    link/ether 38:68:dd:05:72:30 brd ff:ff:ff:ff:ff:ff
    altname enp10s0f0
3: eno2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 38:68:dd:05:72:31 brd ff:ff:ff:ff:ff:ff
    altname enp10s0f1
4: ens3f0: <NO-CARRIER,BROADCAST,MULTICAST,SLAVE,UP> mtu 1500 qdisc mq master bond0 state DOWN mode DEFAULT group default qlen 1000
    link/ether b4:96:91:55:56:ac brd ff:ff:ff:ff:ff:ff
    altname enp6s0f0
5: ens4f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 68:05:ca:9e:98:e8 brd ff:ff:ff:ff:ff:ff
    altname enp88s0f0
6: ens3f1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond0 state UP mode DEFAULT group default qlen 1000
    link/ether b4:96:91:55:56:ac brd ff:ff:ff:ff:ff:ff permaddr b4:96:91:55:56:ad
    altname enp6s0f1
7: ens4f1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 68:05:ca:9e:98:e9 brd ff:ff:ff:ff:ff:ff
    altname enp88s0f1
8: ens3f2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether b4:96:91:55:56:ae brd ff:ff:ff:ff:ff:ff
    altname enp6s0f2
9: ens3f3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether b4:96:91:55:56:af brd ff:ff:ff:ff:ff:ff
    altname enp6s0f3
11: vmbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether 38:68:dd:05:72:30 brd ff:ff:ff:ff:ff:ff
12: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue master vmbr1 state UP mode DEFAULT group default qlen 1000
    link/ether b4:96:91:55:56:ac brd ff:ff:ff:ff:ff:ff
13: vmbr1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether b4:96:91:55:56:ac brd ff:ff:ff:ff:ff:ff
16: enx3a68dd057237: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 3a:68:dd:05:72:37 brd ff:ff:ff:ff:ff:ff

And node 8:

Code:
root@cloud-node-08:~# ip link show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master vmbr0 state UP mode DEFAULT group default qlen 1000
    link/ether 38:68:dd:8d:a7:10 brd ff:ff:ff:ff:ff:ff
    altname enp8s0f0
3: eno2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 38:68:dd:8d:a7:11 brd ff:ff:ff:ff:ff:ff
    altname enp8s0f1
4: ens2f0: <NO-CARRIER,BROADCAST,MULTICAST,SLAVE,UP> mtu 1500 qdisc mq master bond0 state DOWN mode DEFAULT group default qlen 1000
    link/ether 2e:37:07:a9:e6:1b brd ff:ff:ff:ff:ff:ff permaddr 6c:fe:54:53:f2:30
    altname enp47s0f0
5: ens4f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 6c:fe:54:51:1d:bc brd ff:ff:ff:ff:ff:ff
    altname enp88s0f0
6: ens4f1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 6c:fe:54:51:1d:bd brd ff:ff:ff:ff:ff:ff
    altname enp88s0f1
7: ens2f1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond0 state UP mode DEFAULT group default qlen 1000
    link/ether 2e:37:07:a9:e6:1b brd ff:ff:ff:ff:ff:ff permaddr 6c:fe:54:53:f2:31
    altname enp47s0f1
8: ens2f2: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond0 state UP mode DEFAULT group default qlen 1000
    link/ether 2e:37:07:a9:e6:1b brd ff:ff:ff:ff:ff:ff permaddr 6c:fe:54:53:f2:32
    altname enp47s0f2
9: ens2f3: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond0 state UP mode DEFAULT group default qlen 1000
    link/ether 2e:37:07:a9:e6:1b brd ff:ff:ff:ff:ff:ff permaddr 6c:fe:54:53:f2:33
    altname enp47s0f3
10: vmbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether 38:68:dd:8d:a7:10 brd ff:ff:ff:ff:ff:ff
11: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue master vmbr1 state UP mode DEFAULT group default qlen 1000
    link/ether 2e:37:07:a9:e6:1b brd ff:ff:ff:ff:ff:ff
12: vmbr1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether 2e:37:07:a9:e6:1b brd ff:ff:ff:ff:ff:ff
 

Attachments

Some additional information:

On the upgraded node:

Code:
root@cloud-node-01:~# systemctl status pvescheduler.service
○ pvescheduler.service - Proxmox VE scheduler
     Loaded: loaded (/lib/systemd/system/pvescheduler.service; enabled; preset: enabled)
     Active: inactive (dead)

On the non-upgraded nodes:

Code:
root@cloud-node-08:~# systemctl status pvescheduler.service
● pvescheduler.service - Proxmox VE scheduler
     Loaded: loaded (/lib/systemd/system/pvescheduler.service; enabled; vendor preset: enabled)
     Active: active (running) since Wed 2023-08-02 12:56:37 CEST; 5 days ago
    Process: 17612 ExecStart=/usr/bin/pvescheduler start (code=exited, status=0/SUCCESS)
   Main PID: 17613 (pvescheduler)
      Tasks: 3 (limit: 309034)
     Memory: 107.5M
        CPU: 5min 31.918s
     CGroup: /system.slice/pvescheduler.service
             ├─  17613 pvescheduler
             ├─1482700 pvescheduler
             └─1482701 pvescheduler

Aug 02 14:48:03 cloud-node-08 pvescheduler[286508]: replication: cfs-lock 'file-replication_cfg' error: got lock request timeout
Aug 02 14:48:09 cloud-node-08 pvescheduler[293898]: jobs: cfs-lock 'file-jobs_cfg' error: got lock request timeout
Aug 02 16:24:14 cloud-node-08 pvescheduler[526787]: jobs: cfs-lock 'file-jobs_cfg' error: got lock request timeout
Aug 02 16:24:57 cloud-node-08 pvescheduler[526786]: replication: cfs-lock 'file-replication_cfg' error: got lock request timeout
Aug 02 17:05:17 cloud-node-08 pvescheduler[637815]: jobs: cfs-lock 'file-jobs_cfg' error: got lock request timeout
Aug 02 17:05:17 cloud-node-08 pvescheduler[637814]: replication: cfs-lock 'file-replication_cfg' error: got lock request timeout
Aug 02 17:18:12 cloud-node-08 pvescheduler[671571]: jobs: cfs-lock 'file-jobs_cfg' error: got lock request timeout
Aug 02 17:18:12 cloud-node-08 pvescheduler[671570]: replication: cfs-lock 'file-replication_cfg' error: got lock request timeout
Aug 03 14:38:05 cloud-node-08 pvescheduler[3342350]: jobs: cfs-lock 'file-jobs_cfg' error: got lock request timeout
Aug 03 14:39:03 cloud-node-08 pvescheduler[3342349]: replication: cfs-lock 'file-replication_cfg' error: got lock request timeout
 
From your journal of node 1 we can see that the node sees the links to the other nodes come and go, but it never manages to become part of the cluster. The other nodes should, however, be quorate and work as expected, which, judging from the output of the pvescheduler service, does not seem to be the case. Please also provide the logs from one of the other nodes.
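In the meantime you could also watch the membership and the corosync traffic itself on node 1 while the other nodes are running, roughly like this (a sketch; it assumes the knet defaults of UDP port 5405 and an MTU of 1500, and uses node 8's address from your output above):

Code:
corosync-quorumtool -m                # continuously monitor quorum and membership changes
journalctl -fu corosync               # follow the corosync log live
ss -uanp | grep 5405                  # corosync should be bound to UDP port 5405
ping -M do -s 1472 172.16.240.8       # 1472 + 28 bytes of headers = 1500, checks the path MTU towards node 8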
 
Here is the journal from node 08. As with the one from yesterday, I replaced some details for privacy reasons (CLOUD instead of the cluster's real name, and internet.local instead of our real domain name).
 

Attachments

Additional information:

I restarted node 08 (without upgrading it, so node 01 is on PVE v8.0.4 and the other nodes are on PVE v7.4.16). After the reboot, several things changed:
• I can log in on the GUI
• pvescheduler on node 08 is dead (like on node 01)
• Node 01 and node 08 are in the same cluster...

Node 01:
Code:
root@cloud-node-01:~# pvecm status
Cluster information
-------------------
Name:             CLOUD
Config Version:   22
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Tue Aug  8 11:06:17 2023
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          0x00000001
Ring ID:          1.aea5
Quorate:          No

Votequorum information
----------------------
Expected votes:   17
Highest expected: 17
Total votes:      2
Quorum:           9 Activity blocked
Flags:

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 172.16.240.1 (local)
0x0000000e          1 172.16.240.8

Node 08:
Code:
root@cloud-node-08:~# pvecm status
Cluster information
-------------------
Name:             CLOUD
Config Version:   22
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Tue Aug  8 10:59:03 2023
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          0x0000000e
Ring ID:          1.ae19
Quorate:          No

Votequorum information
----------------------
Expected votes:   17
Highest expected: 17
Total votes:      2
Quorum:           9 Activity blocked
Flags:

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 172.16.240.1
0x0000000e          1 172.16.240.8 (local)

Maybe I have to restart every node to fix this cluster problem? But what about pvescheduler: I suppose it won't work until quorum is reached?
 
Last edited:
Yes, pvescheduler wants to acquire a lock which is provided by the Proxmox cluster filesystem (pmxcfs); that lock cannot be acquired while there is no quorum.

However, this should not have happened just because node 1 was upgraded to PVE 8. Did you perform the checks by running pve7to8 --full first? Did you follow the steps in https://pve.proxmox.com/wiki/Upgrade_from_7_to_8?
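The checklist can be re-run at any time, for example (a sketch; pveversion is only there to double-check the installed versions on the upgraded node):

Code:
# On a node that is still on PVE 7: re-run the checklist with the more thorough checks enabled
pve7to8 --full
# On the already-upgraded node 1: verify the installed package versions
pveversion -v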
 
Yes, pve7to8 was executed without warnings. I didn't try the --full mode (is there any point if pve7to8 reports no failures and no warnings?).

I used this resource to upgrade. I upgraded 2 other platforms before this one with the same resource, and had no problem with them.

Can I restart all the nodes and, after that, upgrade another node?
 
Some news this morning: another node is "green checked" again, in the same cluster as node-01 & node-08, although I haven't updated or restarted this node (uptime: 6 days, so no power outage).

Can I restart all the nodes and, after that, upgrade the other nodes?
 
If by restarting the nodes you are able to rejoin them to the cluster, then yes, although you could first check whether a simple restart of corosync.service and pve-cluster.service achieves the same.
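A sketch of what that would look like on one affected node (service names are the standard PVE ones; check the membership again afterwards):

Code:
systemctl restart corosync.service      # restart the cluster communication first
systemctl restart pve-cluster.service   # then restart pmxcfs, which needs corosync for cluster mode
systemctl status corosync.service pve-cluster.service
pvecm status                            # check whether the node rejoined and quorum is back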
 
I tried restarting some nodes, or just restarting corosync.service & pve-cluster.service. Some nodes ended up alone in their own cluster, some didn't... I finally decided to stop all 17 nodes and start them again: that was the solution, all nodes came up after the restart, all in the same cluster.

Yesterday and today I upgraded the remaining 16 nodes (the first one was already upgraded before asking for help here) to 8.0.4 with no problem. The whole cluster is alive and clean.

Now this thread can be closed.

So, many thanks for your help @Chris!
 
Last edited:
