[SOLVED] All nodes with VMs crash during backup task to Proxmox Backup Server

Every node running VMs suddenly reboots during our nightly backup task to a Proxmox Backup Server installation.

Package Versions:
Code:
proxmox-ve: 7.0-2 (running kernel: 5.11.22-5-pve)
pve-manager: 7.0-11 (running version: 7.0-11/63d82f4e)
pve-kernel-helper: 7.1-2
pve-kernel-5.11: 7.0-8
pve-kernel-5.4: 6.4-4
pve-kernel-5.3: 6.1-6
pve-kernel-5.11.22-5-pve: 5.11.22-10
pve-kernel-5.11.22-4-pve: 5.11.22-9
pve-kernel-5.4.124-1-pve: 5.4.124-1
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-5.3.10-1-pve: 5.3.10-1
ceph: 16.2.6-pve2
ceph-fuse: 16.2.6-pve2
corosync: 3.1.5-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: not correctly installed
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.22-pve1
libproxmox-acme-perl: 1.3.0
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.0-4
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.0-9
libpve-guest-common-perl: 4.0-2
libpve-http-server-perl: 4.0-2
libpve-storage-perl: 7.0-11
libqb0: 1.0.5-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.9-4
lxcfs: 4.0.8-pve2
novnc-pve: 1.2.0-3
proxmox-backup-client: 2.0.9-2
proxmox-backup-file-restore: 2.0.9-2
proxmox-mini-journalreader: 1.2-1
proxmox-widget-toolkit: 3.3-6
pve-cluster: 7.0-3
pve-container: 4.0-10
pve-docs: 7.0-5
pve-edk2-firmware: 3.20200531-1
pve-firewall: 4.2-3
pve-firmware: 3.3-2
pve-ha-manager: 3.3-1
pve-i18n: 2.5-1
pve-qemu-kvm: 6.0.0-4
pve-xtermjs: 4.12.0-1
qemu-server: 7.0-14
smartmontools: 7.2-pve2
spiceterm: 3.2-2
vncterm: 1.7-1
zfsutils-linux: 2.0.5-pve1

The PBS is running in this same cluster as a VM. I know this is not ideal; we are in the process of building a new host for the PBS.
 
Further investigation suggests a corosync congestion issue with the current layout. Is there a good place to look for logging regarding corosync errors?
You can increase logging via corosync.conf and/or use the corosync-blackbox tool. The default log level should already provide some detail, though; when a network issue occurs, you can check with journalctl -b -u corosync -u pve-cluster
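For example, something along these lines. A minimal sketch only: on PVE the file to edit is /etc/pve/corosync.conf (bump config_version so the change propagates), and debug logging is verbose, so turn it back off once you have what you need.
Code:
# corosync.conf logging section (illustrative values)
logging {
  to_syslog: yes
  debug: on
}

# corosync / pmxcfs messages from the current boot
journalctl -b -u corosync -u pve-cluster

# dump corosync's in-memory flight recorder
corosync-blackbox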
 
Not entirely sure whether this says anything useful:
Code:
-- Journal begins at Thu 2021-07-08 11:26:31 PDT, ends at Fri 2021-10-08 08:05:08 PDT. --
Oct 08 01:35:56 pvenode1 systemd[1]: Starting The Proxmox VE cluster filesystem...
Oct 08 01:35:56 pvenode1 pmxcfs[6572]: [quorum] crit: quorum_initialize failed: 2
Oct 08 01:35:56 pvenode1 pmxcfs[6572]: [quorum] crit: can't initialize service
Oct 08 01:35:56 pvenode1 pmxcfs[6572]: [confdb] crit: cmap_initialize failed: 2
Oct 08 01:35:56 pvenode1 pmxcfs[6572]: [confdb] crit: can't initialize service
Oct 08 01:35:56 pvenode1 pmxcfs[6572]: [dcdb] crit: cpg_initialize failed: 2
Oct 08 01:35:56 pvenode1 pmxcfs[6572]: [dcdb] crit: can't initialize service
Oct 08 01:35:56 pvenode1 pmxcfs[6572]: [status] crit: cpg_initialize failed: 2
Oct 08 01:35:56 pvenode1 pmxcfs[6572]: [status] crit: can't initialize service
Oct 08 01:35:57 pvenode1 systemd[1]: Started The Proxmox VE cluster filesystem.
Oct 08 01:35:57 pvenode1 systemd[1]: Starting Corosync Cluster Engine...
Oct 08 01:35:57 pvenode1 corosync[6806]:   [MAIN  ] Corosync Cluster Engine 3.1.5 starting up
Oct 08 01:35:57 pvenode1 corosync[6806]:   [MAIN  ] Corosync built-in features: dbus monitoring watchdog systemd xmlconf vqsim nozzle snmp pie relro bindnow
Oct 08 01:35:57 pvenode1 corosync[6806]:   [TOTEM ] Initializing transport (Kronosnet).
Oct 08 01:35:57 pvenode1 corosync[6806]:   [TOTEM ] totemknet initialized
Oct 08 01:35:57 pvenode1 corosync[6806]:   [KNET  ] common: crypto_nss.so has been loaded from /usr/lib/x86_64-linux-gnu/kronosnet/crypto_nss.so
Oct 08 01:35:57 pvenode1 corosync[6806]:   [SERV  ] Service engine loaded: corosync configuration map access [0]
Oct 08 01:35:57 pvenode1 corosync[6806]:   [QB    ] server name: cmap
Oct 08 01:35:57 pvenode1 corosync[6806]:   [SERV  ] Service engine loaded: corosync configuration service [1]
Oct 08 01:35:57 pvenode1 corosync[6806]:   [QB    ] server name: cfg
Oct 08 01:35:57 pvenode1 corosync[6806]:   [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
Oct 08 01:35:57 pvenode1 corosync[6806]:   [QB    ] server name: cpg
Oct 08 01:35:57 pvenode1 corosync[6806]:   [SERV  ] Service engine loaded: corosync profile loading service [4]
Oct 08 01:35:57 pvenode1 corosync[6806]:   [SERV  ] Service engine loaded: corosync resource monitoring service [6]
Oct 08 01:35:57 pvenode1 corosync[6806]:   [WD    ] Watchdog not enabled by configuration
Oct 08 01:35:57 pvenode1 corosync[6806]:   [WD    ] resource load_15min missing a recovery key.
Oct 08 01:35:57 pvenode1 corosync[6806]:   [WD    ] resource memory_used missing a recovery key.
Oct 08 01:35:57 pvenode1 corosync[6806]:   [WD    ] no resources configured.
Oct 08 01:35:57 pvenode1 corosync[6806]:   [SERV  ] Service engine loaded: corosync watchdog service [7]
Oct 08 01:35:57 pvenode1 corosync[6806]:   [QUORUM] Using quorum provider corosync_votequorum
Oct 08 01:35:57 pvenode1 corosync[6806]:   [SERV  ] Service engine loaded: corosync vote quorum service v1.0 [5]
Oct 08 01:35:57 pvenode1 corosync[6806]:   [QB    ] server name: votequorum
Oct 08 01:35:57 pvenode1 corosync[6806]:   [SERV  ] Service engine loaded: corosync cluster quorum service v0.1 [3]
Oct 08 01:35:57 pvenode1 corosync[6806]:   [QB    ] server name: quorum
Oct 08 01:35:57 pvenode1 corosync[6806]:   [TOTEM ] Configuring link 0
Oct 08 01:35:57 pvenode1 corosync[6806]:   [TOTEM ] Configured link number 0: local addr: 192.168.10.3, port=5405
Oct 08 01:35:57 pvenode1 corosync[6806]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 0)
Oct 08 01:35:57 pvenode1 corosync[6806]:   [KNET  ] host: host: 1 has no active links
Oct 08 01:35:57 pvenode1 corosync[6806]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 0)
Oct 08 01:35:57 pvenode1 corosync[6806]:   [KNET  ] host: host: 2 has no active links
Oct 08 01:35:57 pvenode1 corosync[6806]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Oct 08 01:35:57 pvenode1 corosync[6806]:   [KNET  ] host: host: 2 has no active links
Oct 08 01:35:57 pvenode1 corosync[6806]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Oct 08 01:35:57 pvenode1 corosync[6806]:   [KNET  ] host: host: 2 has no active links
Oct 08 01:35:57 pvenode1 corosync[6806]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 0)
Oct 08 01:35:57 pvenode1 corosync[6806]:   [KNET  ] host: host: 3 has no active links
Oct 08 01:35:57 pvenode1 corosync[6806]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Oct 08 01:35:57 pvenode1 corosync[6806]:   [KNET  ] host: host: 3 has no active links
Oct 08 01:35:57 pvenode1 corosync[6806]:   [QUORUM] Sync members[1]: 1
Oct 08 01:35:57 pvenode1 corosync[6806]:   [QUORUM] Sync joined[1]: 1
Oct 08 01:35:57 pvenode1 corosync[6806]:   [TOTEM ] A new membership (1.16b2) was formed. Members joined: 1
Oct 08 01:35:57 pvenode1 corosync[6806]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Oct 08 01:35:57 pvenode1 corosync[6806]:   [KNET  ] host: host: 3 has no active links
Oct 08 01:35:57 pvenode1 corosync[6806]:   [KNET  ] host: host: 4 (passive) best link: 0 (pri: 0)
Oct 08 01:35:57 pvenode1 corosync[6806]:   [KNET  ] host: host: 4 has no active links

Also, last night the backup task was NOT running and there was still a fencing event, so I am starting to doubt that congestion is the cause.
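In case it helps, this is roughly how I have been checking the logs around a fencing event; the grep filter is just mine, and -b -1 reads the previous boot (the one that ended in the fence).
Code:
# corosync / pmxcfs messages from the boot that ended in the fence
journalctl -b -1 -u corosync -u pve-cluster | grep -Ei 'link|retransmit|token|membership'

# HA / watchdog side of the story
journalctl -b -1 -u watchdog-mux -u pve-ha-lrm -u pve-ha-crm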
 
I had this issue twice in the past on a standalone node, but not in the last few PVE versions, so updating your nodes may help.
I don't remember the exact versions where this happened, though.
 
My nodes are current.

Further investigation suggests a potential network infrastructure issue. Will post results.

EDIT: The RSTP priorities were misconfigured; the lowest-performing switch in the stack had become the root bridge. Switch stats show usage peaks on its uplink and downlink ports shortly before the fencing events. Realigned the RSTP priorities to a saner layout; waiting on long-term results.
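While waiting, I am keeping an eye on the knet link stats from the node side, roughly like below; the grep pattern is just my filter and the exact stats keys may vary between corosync versions.
Code:
# per-link knet runtime stats (latency, link up/down counts)
corosync-cmapctl -m stats | grep -E 'link[0-9]+\.(latency|up_count|down_count)'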
 
Verified: it was a network infrastructure issue.

RSTP was indeed forcing corosync and other traffic along a route it had no business on. Adjusted RSTP values and have not had a node crash since.
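For completeness, this is roughly what I used to confirm the links look healthy now, all standard corosync/PVE tooling:
Code:
# per-link connectivity as corosync sees it
corosync-cfgtool -s

# node list and link status
corosync-cfgtool -n

# overall cluster / quorum state
pvecm status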
 
