All cluster nodes rebooted while migrating a VM

woodstock

Hi all,

wow, that was a busy hour or so ...

On our 8-node cluster (with Ceph) I started updating the nodes one by one.

I've done this by migrating all VMs (no LXC in use) to other nodes and then triggering the update and a restart from the web interface (roughly the command-line equivalent is sketched below). This procedure has worked many times before.
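For reference, this is roughly what one round looks like on the command line (the VM ID and target node name are just examples; I actually trigger these steps from the web interface):

Code:
# live-migrate a running VM to another cluster node (example VM ID and node name)
qm migrate 101 node2 --online

# once the node is empty, update it and reboot
apt-get update && apt-get dist-upgrade -y
reboot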

While migrating the very first VM off node #7 (6 nodes already updated), all nodes suddenly restarted at once.
Luckily, all VMs except one are up and running again, and Ceph has also recovered to the HEALTH_OK state.

What happened here, and what can I do to never experience that again?
I've been close to a stroke :)

Any advice is very welcome.

pveversion output before the update:

Code:
proxmox-ve: 4.2-64 (running kernel: 4.4.16-1-pve)
pve-manager: 4.2-18 (running version: 4.2-18/158720b9)
pve-kernel-4.4.13-1-pve: 4.4.13-56
pve-kernel-4.4.13-2-pve: 4.4.13-58
pve-kernel-4.4.16-1-pve: 4.4.16-64
pve-kernel-4.4.10-1-pve: 4.4.10-54
lvm2: 2.02.116-pve3
corosync-pve: 2.4.0-1
libqb0: 1.0-1
pve-cluster: 4.0-44
qemu-server: 4.0-86
pve-firmware: 1.1-9
libpve-common-perl: 4.0-72
libpve-access-control: 4.0-19
libpve-storage-perl: 4.0-57
pve-libspice-server1: 0.12.8-1
vncterm: 1.2-1
pve-qemu-kvm: 2.6.1-2
pve-container: 1.0-73
pve-firewall: 2.0-29
pve-ha-manager: 1.0-33
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u2
lxc-pve: 2.0.4-1
lxcfs: 2.0.3-pve1
cgmanager: 0.39-pve1
criu: 1.6.0-1
novnc-pve: 0.5-8
zfsutils: 0.6.5.7-pve10~bpo80
ceph: 0.94.9-1~bpo80+1

pveversion output after the update:
Code:
proxmox-ve: 4.3-66 (running kernel: 4.4.19-1-pve)
pve-manager: 4.3-3 (running version: 4.3-3/557191d3)
pve-kernel-4.4.13-1-pve: 4.4.13-56
pve-kernel-4.4.13-2-pve: 4.4.13-58
pve-kernel-4.4.16-1-pve: 4.4.16-64
pve-kernel-4.4.19-1-pve: 4.4.19-66
pve-kernel-4.4.10-1-pve: 4.4.10-54
lvm2: 2.02.116-pve3
corosync-pve: 2.4.0-1
libqb0: 1.0-1
pve-cluster: 4.0-46
qemu-server: 4.0-91
pve-firmware: 1.1-9
libpve-common-perl: 4.0-75
libpve-access-control: 4.0-19
libpve-storage-perl: 4.0-66
pve-libspice-server1: 0.12.8-1
vncterm: 1.2-1
pve-qemu-kvm: 2.6.2-2
pve-container: 1.0-78
pve-firewall: 2.0-31
pve-ha-manager: 1.0-35
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u2
lxc-pve: 2.0.5-1
lxcfs: 2.0.4-pve1
criu: 1.6.0-1
novnc-pve: 0.5-8
zfsutils: 0.6.5.7-pve10~bpo80
ceph: 0.94.9-1~bpo80+1
 
Hi,
Is there anything suspicious in /var/log/messages?

For comparison, this is the output in my /var/log/messages during a clean shutdown from the command line:

Code:
Oct 17 18:39:30 susa systemd[1]: Started Synchronise Hardware Clock to System Clock.
Oct 17 18:39:30 susa systemd[1]: Unmounting /mnt/pve/ekserv...
Oct 17 18:39:30 susa systemd[1]: Stopping Session 1 of user e.kasper.
Oct 17 18:39:30 susa systemd[1]: Stopping 104.scope.
Oct 17 18:39:30 susa systemd[1]: Stopped 104.scope.
Oct 17 18:39:30 susa systemd[1]: Stopping 107.scope.
Oct 17 18:39:30 susa systemd[1]: Stopped 107.scope.
Oct 17 18:39:30 susa systemd[1]: Stopping 106.scope.
Oct 17 18:39:30 susa systemd[1]: Stopped 106.scope.
Oct 17 18:39:30 susa systemd[1]: Stopping 105.scope.
Oct 17 18:39:30 susa systemd[1]: Stopped 105.scope.
Oct 17 18:39:30 susa systemd[1]: Stopping 102.scope.
Oct 17 18:39:30 susa systemd[1]: Stopped 102.scope.
Oct 17 18:39:30 susa systemd[1]: Stopping 101.scope.
Oct 17 18:39:30 susa systemd[1]: Stopped 101.scope.
Oct 17 18:39:30 susa systemd[1]: Stopping qemu.slice.

Is that an HA cluster?
 
Hi manu,

thanks for your reply.

After taking a look at the logs, it seems there was a short period of network problems between the cluster nodes.
I see corosync totem retransmit messages on all nodes just before the watchdog restarted them.
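This is how I spotted them, in case it helps anyone else (the log paths and the time window below are just examples; the exact wording of the retransmit messages may differ between corosync versions):

Code:
# search the syslog-style logs for totem retransmits
grep -i "retransmit" /var/log/daemon.log /var/log/syslog

# or, if persistent journaling is enabled, narrow it down via the journal
journalctl -u corosync --since "2016-10-17 14:00" --until "2016-10-17 15:00" | grep -i retransmit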

All our nodes have two 1G NICs bonded for cluster communication and two 10G NICs bonded for storage traffic.
The two 1G NICs are also used for the bridge and the VM traffic.

I cannot say whether migrating the VM caused the network problem, but I see a peak in the traffic graphs at exactly that time.

What I am wondering about now is the behavior of the watchdog.

With IPMI fencing (versions before 4.0), fencing was triggered by the remaining cluster nodes as long as they had quorum - right?
Now every node restarts itself via the watchdog if it cannot reach the other nodes?

If I'm right (and I'm not sure), that makes me feel uncomfortable.
Can you please clarify?

We also took a look at setting up redundant cluster communication: https://pve.proxmox.com/wiki/Separate_Cluster_Network#Redundant_Ring_Protocol
Is it safe to edit this configuration on a production cluster?

Thanks.
 
By default, PVE 4.0 and later use a software watchdog. If a node cannot contact the rest of the cluster, the software watchdog will indeed reboot the node after some time.
You can also use a hardware watchdog, see http://pve.proxmox.com/pve-docs/chapter-ha-manager.html#_configure_hardware_watchdog
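In short, you point the HA stack at the watchdog kernel module you want to use (the module name below is only an example for an IPMI board; pick the one that matches your hardware, as described in the linked chapter):

Code:
# /etc/default/pve-ha-manager
# select the watchdog module (if unset, the software watchdog is used)
WATCHDOG_MODULE=ipmi_watchdog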

Concerning your problem, it could be that the switch your cluster network was running on was not operating properly for some time, which triggered a general fencing, as all nodes lost contact with the cluster at the same time.

Corosync on bonded devices can be tricky. It is safe if you use active-passive (active-backup) mode with the two NICs connected to two different switches; the other bonding modes are less tested.
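A minimal sketch of such a bond in /etc/network/interfaces (interface names, addresses and the bridge are placeholders, not your actual configuration):

Code:
# active-backup bond for cluster/VM traffic, one slave NIC per switch
auto bond0
iface bond0 inet manual
        bond-slaves eth0 eth1
        bond-mode active-backup
        bond-miimon 100
        bond-primary eth0

# bridge for the VMs and the node's cluster address, on top of the bond
auto vmbr0
iface vmbr0 inet static
        address 192.168.1.11
        netmask 255.255.255.0
        gateway 192.168.1.1
        bridge_ports bond0
        bridge_stp off
        bridge_fd 0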

Concerning adding a redundant ring protocol, do the change during a maintenance window.
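For orientation, a second ring in /etc/corosync/corosync.conf looks roughly like this (names and addresses are placeholders; follow the wiki article linked above, increase config_version when editing, and keep a console open on every node during the change):

Code:
totem {
  version: 2
  cluster_name: mycluster
  rrp_mode: passive
  interface {
    ringnumber: 0
    bindnetaddr: 192.168.1.0
  }
  interface {
    ringnumber: 1
    bindnetaddr: 10.10.10.0
  }
}

nodelist {
  node {
    name: node1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.1.11
    ring1_addr: 10.10.10.11
  }
  # ... one entry per node, each with ring0_addr and ring1_addr ...
}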
 
