I have a cluster with 6 node members. Each node has 2 bonded interfaces, which look like this:
Bond0
  vmbr0v450 << VLAN 450 for cluster and management
  IPs: 10.100.0.0/24
  vmbr0v451 << VLAN 451 for Ceph storage
  IPs: 10.1.1.0/24
Bond1
  vmbr1 << for guest VM public interface
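
For context, here is a minimal sketch of what the matching /etc/network/interfaces stanzas look like on one node. The physical NIC names (eno1..eno4), the bond mode, and the host addresses (.104, i.e. this node) are illustrative assumptions, not copied from my real config:

Code:
auto bond0
iface bond0 inet manual
        bond-slaves eno1 eno2        # assumed NIC names
        bond-miimon 100
        bond-mode active-backup      # assumed bond mode

auto bond0.450
iface bond0.450 inet manual          # VLAN 450 tagged on bond0

auto vmbr0v450
iface vmbr0v450 inet static          # cluster + management
        address 10.100.0.104/24
        bridge-ports bond0.450
        bridge-stp off
        bridge-fd 0

auto bond0.451
iface bond0.451 inet manual          # VLAN 451 tagged on bond0

auto vmbr0v451
iface vmbr0v451 inet static          # Ceph storage
        address 10.1.1.104/24
        bridge-ports bond0.451
        bridge-stp off
        bridge-fd 0

auto bond1
iface bond1 inet manual
        bond-slaves eno3 eno4        # assumed NIC names
        bond-miimon 100
        bond-mode active-backup      # assumed bond mode

auto vmbr1
iface vmbr1 inet manual              # guest VM public interface
        bridge-ports bond1
        bridge-stp off
        bridge-fd 0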
The latest incident happened when I wanted to move the bond0 link of 10.100.0.102 to another switch. Before doing that, I migrated all VMs on that node to the other nodes, then unplugged the cable. Not long after, all nodes suddenly rebooted. I don't understand what caused the reboots, since only one node should have been affected.
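
The migrations were done with the standard live-migration command, roughly like this (the VMID and node names are placeholders):

Code:
root@pve03:~# qm migrate 101 pve04 --online

The cluster came back up and is quorate again; pvecm status on pve05 currently reports: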
Code:
root@pve05:~# pvecm status
Cluster information
-------------------
Name:             rajamitra
Config Version:   8
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Mon Feb 19 22:14:56 2024
Quorum provider:  corosync_votequorum
Nodes:            6
Node ID:          0x00000005
Ring ID:          1.216a
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   6
Highest expected: 6
Total votes:      6
Quorum:           4
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 10.100.0.100
0x00000002          1 10.100.0.101
0x00000003          1 10.100.0.102
0x00000004          1 10.100.0.103
0x00000005          1 10.100.0.104 (local)
0x00000006          1 10.100.0.125
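
If I understand the votequorum numbers correctly, the quorum value is a simple majority of the expected votes, which matches the output above:

Code:
quorum = floor(expected_votes / 2) + 1
       = floor(6 / 2) + 1
       = 4

So unplugging bond0 on a single node should have dropped the total to 5 votes at worst, still above quorum, which is why I can't explain why every node rebooted instead of just the one I was working on.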