watchdog timeout on slow NFS backups

mensinck

Hi all,

From version 6.0 up to the current version 6.2, we have seen the following behavior when running backups over WAN to NFS.

We have an 8-host cluster (all HP DL380, G7 up to W9) that is running fine.

When doing backups over a WAN connection to a QNAP, we first see a lot of this:

May 21 22:39:03 pve56 kernel: [210590.778116] rpc_check_timeout: 247 callbacks suppressed
May 21 22:39:03 pve56 kernel: [210590.778117] nfs: server XXX.XXX.XXX.XXX not responding, still trying
May 21 22:39:03 pve56 kernel: [210590.798205] nfs: server XXX.XXX.XXX.XXX not responding, still trying

This is somewhat expected, because we can only use a 100 Mbit connection.

But then the watchdog times out and the host is restarted.

May 21 22:39:05 pve56 watchdog-mux[1323]: client watchdog expired - disable watchdog updates


This is reproducible and only happens while running backups over NFS. It happens on all nodes whenever they get timeouts on NFS.

These are the package versions we are using:
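To see how long the NFS timeouts last before the watchdog fires, the log lines above can be correlated with a small script. This is only a sketch: the function name `summarize`, the `year` default (syslog timestamps carry no year) and the matching strings are assumptions based on the excerpt in this post, and month names are parsed assuming an English locale.

```python
import re
from datetime import datetime

# Sample lines copied from the log excerpt in this thread.
SAMPLE_LOG = """\
May 21 22:39:03 pve56 kernel: [210590.778116] rpc_check_timeout: 247 callbacks suppressed
May 21 22:39:03 pve56 kernel: [210590.778117] nfs: server XXX.XXX.XXX.XXX not responding, still trying
May 21 22:39:03 pve56 kernel: [210590.798205] nfs: server XXX.XXX.XXX.XXX not responding, still trying
May 21 22:39:05 pve56 watchdog-mux[1323]: client watchdog expired - disable watchdog updates
"""

# Classic syslog prefix: "Mon DD HH:MM:SS host rest-of-line"
LINE_RE = re.compile(r"^(\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(\S+)\s+(.*)$")


def summarize(log_text, year=2020):
    """Count NFS timeout messages and measure the gap to the watchdog expiry.

    `year` is an assumption, because syslog timestamps do not include one.
    """
    nfs_times = []
    watchdog_time = None
    for line in log_text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue  # skip lines without a syslog-style prefix
        stamp, _host, message = match.groups()
        when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
        if "nfs: server" in message and "not responding" in message:
            nfs_times.append(when)
        elif "watchdog-mux" in message and "client watchdog expired" in message:
            if watchdog_time is None:
                watchdog_time = when  # keep the first expiry only
    gap = None
    if nfs_times and watchdog_time is not None:
        # Seconds from the first NFS timeout message to the watchdog expiry.
        gap = (watchdog_time - min(nfs_times)).total_seconds()
    return {
        "nfs_timeouts": len(nfs_times),
        "watchdog_expired_at": watchdog_time,
        "gap_seconds": gap,
    }


if __name__ == "__main__":
    print(summarize(SAMPLE_LOG))
```

The excerpt only covers the last seconds before the reset, so the script is more useful when fed the full syslog of a backup run, for example `summarize(open("/var/log/syslog").read())`.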


proxmox-ve: 6.2-1 (running kernel: 5.4.41-1-pve)
pve-manager: 6.2-4 (running version: 6.2-4/9824574a)
pve-kernel-5.4: 6.2-2
pve-kernel-helper: 6.2-2
pve-kernel-5.3: 6.1-6
pve-kernel-5.0: 6.0-11
pve-kernel-5.4.41-1-pve: 5.4.41-1
pve-kernel-5.4.34-1-pve: 5.4.34-2
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-5.3.18-2-pve: 5.3.18-2
pve-kernel-5.0.21-5-pve: 5.0.21-10
pve-kernel-5.0.15-1-pve: 5.0.15-1
ceph: 14.2.9-pve1
ceph-fuse: 14.2.9-pve1
corosync: 3.0.3-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.15-pve1
libproxmox-acme-perl: 1.0.4
libpve-access-control: 6.1-1
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.1-2
libpve-guest-common-perl: 3.0-10
libpve-http-server-perl: 3.0-5
libpve-storage-perl: 6.1-8
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.2-1
lxcfs: 4.0.3-pve2
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.2-1
pve-cluster: 6.1-8
pve-container: 3.1-6
pve-docs: 6.2-4
pve-edk2-firmware: 2.20200229-1
pve-firewall: 4.1-2
pve-firmware: 3.1-1
pve-ha-manager: 3.0-9
pve-i18n: 2.1-2
pve-qemu-kvm: 5.0.0-2
pve-xtermjs: 4.3.0-1
qemu-server: 6.2-2
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.4-pve1

This is the storage configuration in storage.cfg:

nfs: backup
    export /volume1/backup2
    path /mnt/pve/backupKI
    server XXX.XXX.XXX.XXX
    content backup,vztmpl
    maxfiles 7
    options vers=3



Any hints on how to set the NFS timeout (timeo) in the configuration?

Regards, Lukas
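For readers with the same question, here is a sketch of what such a setting could look like. The options line in storage.cfg takes NFS mount options, and timeo (given in tenths of a second) and retrans are standard options described in nfs(5). The values below are assumptions chosen only for illustration, not tested recommendations, and nothing in this thread establishes that a longer timeout prevents the watchdog reset.

```
nfs: backup
    export /volume1/backup2
    path /mnt/pve/backupKI
    server XXX.XXX.XXX.XXX
    content backup,vztmpl
    maxfiles 7
    # assumed values: 120 s per attempt (timeo=1200), 5 retries
    options vers=3,timeo=1200,retrans=5
```

Mount options are applied when the share is mounted, so an already mounted share will most likely need to be unmounted and mounted again before a change shows up in the output of `mount`.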
 
Thanks for your answer, "spirit"!

Corosync and backup setup so far:

This is the corosync configuration:

for one node:
node {
  name: pve56
  nodeid: 7
  quorum_votes: 1
  ring0_addr: 192.168.24.56
  ring1_addr: 192.168.25.56
}


Here 192.168.24.0/24 is a separate network with a dedicated switch, and 192.168.25.0/24 is our "storage" network on 10G switches.

The backup runs over 192.168.20.0/24 (the frontend interface, also on the 10G switches). Only the host that is actually running the backup gets reset.
 
That's strange. And you already have 2 rings for corosync, so it's not a network overload problem.

Maybe the NFS hang is really hanging the whole system.

What about CPU usage on your host during the backup?
 
You are right, it's strange.

CPU load is about 1 or lower while the backups are running.

We have 2 NFS mounts on the cluster. One goes to a local QNAP and runs fine without any issues.

The other one is remote, connected via a gateway and IPsec. This one produces the "not responding" messages, which is not surprising given the other traffic on the tunnel (other backups or user traffic).

But what is strange is the reset of the node.
 
