Cluster nodes fencing when performing backup

M-SK

Hello,

We have started testing PBS on our 5-node cluster, which had been running very stably up to this point. Although some jobs ran fine, we often see a node drop out of the cluster without warning while performing a backup to a local /var/tmp/vzdumptmpXXXXX directory (we usually get a fence event mail when that happens). The node then reboots, typically 5-10 minutes after the initial dump begins.

I have also noticed that it usually happens on larger CTs, and I'm wondering what happens when the CT dump is larger than the available disk space (we only use 200 GB SSDs for the OS; the rest is on shared storage).

Otherwise, we're very interested in testing and implementing PBS.

Regards,
Marko
 
We have started testing PBS on our 5-node cluster, which had been running very stably up to this point. Although some jobs ran fine, we often see a node drop out of the cluster without warning while performing a backup to a local /var/tmp/vzdumptmpXXXXX directory (we usually get a fence event mail when that happens). The node then reboots, typically 5-10 minutes after the initial dump begins.
Do you, by any chance, use the same network for the PVE Cluster traffic (Corosync) as for the backups?
Do you have HA enabled guests?
 
Yes, we do. However, we also have another (redundant) ring over 10G iSCSI:

Code:
LINK ID 0
        addr    = 172.20.10.17
        status:
                nodeid  1:      connected
                nodeid  3:      localhost
                nodeid  4:      connected
                nodeid  5:      connected
                nodeid  6:      connected
                nodeid  7:      connected
LINK ID 1
        addr    = 10.2.2.91
        status:
                nodeid  1:      connected
                nodeid  3:      localhost
                nodeid  4:      connected
                nodeid  5:      connected
                nodeid  6:      connected
                nodeid  7:      connected
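
For reference, that link status was pulled with the corosync status tool, something along the lines of:

Code:
  corosync-cfgtool -s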

I was also under the impression that the backup was first written to the local disk and then transferred over the network?
 
I was also under the impression that the backup was first written to the local disk and then transferred over the network?
AFAIU the backup is transmitted over the network on the fly to the target storage. Otherwise, you would not be able to back up a large VM unless you had double that space available.

The problem you are seeing is most likely that the backup traffic is congesting the physical network, so the corosync packets no longer get through in time. The cluster then falls apart, and with HA-enabled guests present, the affected nodes will fence themselves after ~2 minutes without contact to the quorate part of the cluster.
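
If the network does turn out to be the culprit, a common mitigation is to cap the backup bandwidth so the corosync packets still get through, for example with a global bwlimit in /etc/vzdump.conf (the value below is just an illustration, in KiB/s):

Code:
  # /etc/vzdump.conf -- cap backup bandwidth at roughly 100 MB/s
  bwlimit: 100000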

This is a PVE topic and not a PBS one. If you check the syslogs for `corosync` messages, you should see how the nodes lose their connection to the cluster. If you search the forum, you should find quite a bit about it.
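
To check, something like the following should surface any corosync messages around the time of a reboot (assuming journald; adjust the time window accordingly):

Code:
  # corosync log entries for today between 13:00 and 13:10
  journalctl -u corosync --since 13:00 --until 13:10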

Ideally, corosync has at least one dedicated NIC for itself, so other services don't interfere. If you need further help, I would suggest moving the conversation to the PVE subforum.
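
For reference, with knet (corosync 3.x) the dedicated NIC can be made the preferred link via link priorities in corosync.conf; the link numbers and priority values below are made up for illustration (in the default passive link mode, the highest-priority connected link carries the traffic):

Code:
  # totem section of /etc/pve/corosync.conf (excerpt)
  interface {
          linknumber: 0
          knet_link_priority: 20   # dedicated corosync NIC, preferred
  }
  interface {
          linknumber: 1
          knet_link_priority: 10   # shared/backup network, fallback
  }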
 
I was mainly wondering how a VM larger than the local disks could be dumped at all, so thanks for clarifying.

We had plenty of corosync woes before, mainly when it went from multicast to unicast (version 3.0, IIRC).
That's when we implemented the redundant ring for quorum. Our links are practically never saturated, yet corosync could still be flaky; thankfully, that was fully resolved by adding the second ring.

But I can see no corosync issues at the given time. Here's the syslog:

Code:
  13:03:44  host kernel: BUG: kernel NULL pointer dereference, address: 0000000000000039
  13:03:44  host kernel: #PF: supervisor read access in kernel mode
  13:03:44  host kernel: #PF: error_code(0x0000) - not-present page
  13:03:44  host kernel: PGD 0 P4D 0
  13:03:44  host kernel: Oops: 0000 [#1] SMP PTI
  13:03:44  host kernel: CPU: 3 PID: 4158 Comm: kworker/3:1H Tainted: P           O      5.4.44-2-pve #1
  13:03:44  host kernel: Hardware name: Intel Corporation S2600WT2R/S2600WT2R, BIOS SE5C610.86B.01.01.0016.033120161139 03/31/2016
  13:03:44  host kernel: Workqueue:  0x0 (kblockd)
  13:03:44  host kernel: RIP: 0010:find_busiest_group+0x4a/0x530
  13:03:44  host kernel: Code: 8d bd 08 ff ff ff 48 81 ec d8 00 00 00 65 48 8b 04 25 28 00 00 00 48 89 45 d8 31 c0 f3 48 ab 48 89 df e8 29 f8 ff ff 48 8b 03 <f6> 40 39 08 48 8b 8
  13:03:44  host kernel: RSP: 0018:ffffaa5ca9b73af0 EFLAGS: 00010046
  13:03:44  host kernel: RAX: 0000000000000000 RBX: ffffaa5ca9b73c30 RCX: 0000000000000014
  13:03:44  host kernel: RDX: 0000000000000002 RSI: 000000000002ad40 RDI: 0000000000000000
  13:03:44  host kernel: RBP: ffffaa5ca9b73ba0 R08: 0000000000000014 R09: 0000000000003c00
  13:03:44  host kernel: R10: 0000000000000003 R11: 0000000000000000 R12: ffff8e5f7fb2ad40
  13:03:44  host kernel: R13: ffffaa5ca9b73d08 R14: 0000000000000002 R15: ffff8e5f77dce000
  13:03:44  host kernel: FS:  0000000000000000(0000) GS:ffff8e5f7f8c0000(0000) knlGS:0000000000000000
  13:03:44  host kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  13:03:44  host kernel: CR2: 00000000000000b0 CR3: 0000000cda80a006 CR4: 00000000003606e0
  13:03:44  host kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  13:03:44  host kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  13:03:44  host kernel: Call Trace:
  13:03:44  host kernel:  find_busiest_group+0x47/0x530
  13:03:44  host kernel:  load_balance+0x15c/0xaf0
  13:03:44  host kernel:  newidle_balance+0x18f/0x3c0
  13:03:44  host kernel:  pick_next_task_fair+0x46/0x3b0
  13:03:44  host kernel:  __schedule+0x177/0x6f0
  13:03:44  host kernel:  schedule+0x33/0xa0
  13:03:44  host kernel:  worker_thread+0xbf/0x400
  13:03:44  host kernel:  kthread+0x120/0x140
  13:03:44  host kernel:  ? process_one_work+0x3d0/0x3d0
  13:03:44  host kernel:  ? kthread_park+0x90/0x90
  13:03:44  host kernel:  ret_from_fork+0x35/0x40
  13:03:44  host kernel: Modules linked in: ipt_REJECT nf_reject_ipv4 xt_conntrack xt_state veth nfsv3 nfs_acl nfs lockd grace fscache ebtable_filter ebtables ip_set ip6table_raw
  13:03:44  host kernel:  x_tables autofs4 zfs(PO) zunicode(PO) zlua(PO) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) btrfs xor zstd_compress raid6_pq libcrc32c hid_generic us
  13:03:44  host kernel: CR2: 0000000000000039
  13:03:44  host kernel: ---[ end trace a0a2d43490143a09 ]---
  13:03:44  host kernel: RIP: 0010:find_busiest_group+0x4a/0x530

The host then rebooted at 13:07; there are no corosync messages at that time.
I have similar logs from other hosts (e.g. kernel: watchdog: BUG: soft lockup - CPU#23 stuck for 22s! [pvesr:17458]).
 
I've also checked the other nodes that failed: no corosync issues there either. Most show some variation of the above kernel bug.
 
