Lost 2/3 of nodes during massive restore from PBS

slapshot

Renowned Member
Feb 28, 2012
Hello, my setup is a cluster with 3 nodes. I dedicated the second network interface of every node to corosync, on a physically separate network behind its own switch. iftop -i eno2 shows that everything looks fine: the corosync traffic stays off the rest of the network.

Now, I did a first restore on pve3 from PBS and it worked well. Then I tried a double task, i.e. two simultaneous restores from the same PBS. Both tasks started, but after a few seconds I lost the pve2 and pve3 nodes: they showed as unavailable in the GUI, though SSH still worked. They became active again in the GUI after a few minutes. I thought I had got the network configuration wrong and checked once again, but eno2 on each node really does carry only corosync traffic. The two VMs were left locked, so I had to unlock them with qm unlock and then delete them.
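For reference, the unlock-and-delete cleanup can be sketched like this (the VMID is hypothetical; use the ID shown in the failed qmrestore task):

```shell
# Cleanup for a restore that was interrupted mid-task.
# VMID is illustrative -- substitute the ID of the stuck restore.
VMID=351

# Guarded so this is a no-op on machines without the PVE tools.
if command -v qm >/dev/null 2>&1; then
    qm unlock "$VMID"    # clears the lock left by the aborted restore
    qm destroy "$VMID"   # removes the half-restored VM and its disks
fi
```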

A new single restore worked flawlessly.

Any idea what happened here?

Thank you
 
Well, syslog gave me some information. At 15:31, both of the other cluster nodes, pve2 and pve3, rebooted at the same time. I really don't know why; I will try to investigate further. Any advice is appreciated.

Thank you
 
A node that cannot reach enough members of the cluster will reboot (self-fence). Check journalctl from before the reboot for a message about losing quorum. If that is the case, double-check the network configuration: it indicates that the nodes could not reach each other while the network was busy with the restores.
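Searching the journal of the boot that ended in the reboot might look like this (a sketch; the unit names and grep pattern are just a starting point):

```shell
# Inspect the previous boot's journal (-b -1) for corosync/quorum trouble.
# Guarded so it is a no-op on machines without systemd-journald.
if command -v journalctl >/dev/null 2>&1; then
    journalctl -b -1 -u corosync -u pve-cluster \
        | grep -iE 'quorum|token|link|fence' || true
fi
```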
 
Thank you for your reply. I checked journalctl, and on pve3 I can see the two restore tasks start and then the reboot:

Code:
Mar 12 15:27:58 pve3 lvm[457]: Monitoring thin pool pve-data-tpool.
Mar 12 15:30:16 pve3 pvedaemon[14382]: <root@pam> end task UPID:pve3:0000FA04:00102F92:640DE16B:qmrestore:351:root@pam: OK
Mar 12 15:30:36 pve3 pvedaemon[14384]: <root@pam> starting task UPID:pve3:0001016B:00106E76:640DE20C:qmrestore:352:root@pam:
Mar 12 15:31:24 pve3 pvedaemon[14383]: <root@pam> starting task UPID:pve3:000103A7:00108156:640DE23C:qmrestore:353:root@pam:
-- Boot 3734f14ea0c94befb7c35df2a8bbb436 --
Mar 12 15:34:57 pve3 kernel: Linux version 5.15.85-1-pve (build@proxmox) (gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2) #1 SMP PVE 5.15.85-1 (2023-02-01T00:00Z) ()
Mar 12 15:34:57 pve3 kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-5.15.85-1-pve root=/dev/mapper/pve-root ro quiet

On pve2, here it is; exactly at 15:31:24 there is a reboot:
Code:
Mar 12 15:30:56 pve2 pvestatd[1164]: storage 'OMV' is not online
Mar 12 15:30:58 pve2 pvestatd[1164]: status update time (6.768 seconds)
Mar 12 15:31:06 pve2 pvestatd[1164]: storage 'OMV' is not online
Mar 12 15:31:08 pve2 pvestatd[1164]: status update time (6.950 seconds)
Mar 12 15:31:24 pve2 pmxcfs[509490]: [status] notice: received log
-- Boot 23a22eafa1f342a4819a3525256fec85 --
Mar 12 15:37:25 pve2 kernel: Linux version 5.15.85-1-pve (build@proxmox) (gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2) #1 SMP PVE 5.15.85-1 (2023-02-01T00:00Z) ()
Mar 12 15:37:25 pve2 kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-5.15.85-1-pve root=/dev/mapper/pve-root ro quiet

I can investigate a bit more, of course, and I will. But I cannot understand the reasoning behind this behaviour. I can understand a reboot of pve3, but not of pve2, which was doing practically nothing. Shouldn't a cluster provide redundancy? A failing node should not cause another node to reboot.
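For context on why more than one node can go down: corosync requires a strict majority of votes, so a partition keeps quorum only if it holds at least floor(n/2)+1 of them; with HA active, a node on the minority side self-fences. The arithmetic for a 3-node cluster:

```shell
# Majority quorum for an n-node cluster (standard votequorum behaviour):
# a partition stays quorate only with at least floor(n/2)+1 votes.
n=3
echo $(( n / 2 + 1 ))   # a lone node (1 < 2) loses quorum and self-fences
```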

Furthermore, I dedicated an isolated physical network with a managed switch (still no VLANs), so each node uses its second 1 Gbit interface only for corosync, and increased latency should therefore be unlikely. Anyway, I searched the forum for a similar problem (one with 7 nodes rebooting) and there is a tip to edit the corosync.conf file to cope with the latency at this link.
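The tuning that kind of tip usually refers to is the totem token timeout. A hedged sketch of what such an edit looks like (the 10000 ms value is illustrative only; on Proxmox VE the file should be edited via /etc/pve/corosync.conf and config_version bumped so the change propagates):

```
totem {
  # illustrative value only -- raising the token timeout makes the cluster
  # tolerate longer network stalls before declaring a node dead
  token: 10000
}
```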

This is my pveversion -v :
Code:
proxmox-ve: 7.3-1 (running kernel: 5.15.85-1-pve)
pve-manager: 7.3-6 (running version: 7.3-6/723bb6ec)
pve-kernel-helper: 7.3-6
pve-kernel-5.15: 7.3-2
pve-kernel-5.15.85-1-pve: 5.15.85-1
pve-kernel-5.15.74-1-pve: 5.15.74-1
ceph-fuse: 15.2.17-pve1
corosync: 3.1.7-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve2
libproxmox-acme-perl: 1.4.4
libproxmox-backup-qemu0: 1.3.1-1
libpve-access-control: 7.3-2
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.3-2
libpve-guest-common-perl: 4.2-3
libpve-http-server-perl: 4.1-6
libpve-storage-perl: 7.3-2
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.2-2
lxcfs: 5.0.3-pve1
novnc-pve: 1.4.0-1
proxmox-backup-client: 2.3.3-1
proxmox-backup-file-restore: 2.3.3-1
proxmox-mail-forward: 0.1.1-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.5.5
pve-cluster: 7.3-2
pve-container: 4.4-2
pve-docs: 7.3-1
pve-edk2-firmware: 3.20221111-1
pve-firewall: 4.2-7
pve-firmware: 3.6-3
pve-ha-manager: 3.5.1
pve-i18n: 2.8-3
pve-qemu-kvm: 7.2.0-5
pve-xtermjs: 4.16.0-1
qemu-server: 7.3-4
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.8.0~bpo11+3
vncterm: 1.7-1
zfsutils-linux: 2.1.9-pve1


Thank you