backup of templates over nfs is failing

morik_proxmox · Jul 31, 2024

Hello experts,

Background: After receiving generous help, I now have 2 templates successfully built on local storage of a cluster. VM cloning with my cloud-init customizations works great on it as well. I've setup backup tasks on local disks. It works fine as well.

Issue: I'm trying to perform template/VM/container backups over NFS share. NFS share is detected and mounted fine. But, during back, nfs doesn't work.

Homework: I've combed through forum threads, reddit, google etc with similar background and issues. I've tried various recipes but to no avail. Please note, NFS share is on ZFS filesystem and is exposed as an NFS share. All ACL permissions have been turned off. It can be read, and written by any user from any where on trusted LAN addresses. Also, it is not a firewall issue. Below is log from an ubuntu VM (on a different host) to the same NFS server.

Code:

showmount -e <fqdn>
Export list for <fqdn>:
/mnt/z_store/nfs_pve (everyone)

mount | grep nfs
<fqdn>:/mnt/z_store/home on /mnt/home type nfs4 (rw,noatime,vers=4.2,rsize=131072,wsize=131072,namlen=255,acregmin=1800,acregmax=1800,acdirmin=1800,acdirmax=1800,soft,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=192.168.100.9,local_lock=none,addr=192.168.121.7)
<fqdn>:/mnt/z_store/logs on /mnt/logs type nfs4 (rw,noatime,vers=4.2,rsize=131072,wsize=131072,namlen=255,acregmin=1800,acregmax=1800,acdirmin=1800,acdirmax=1800,soft,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=192.168.100.9,local_lock=none,addr=192.168.121.7)
<fqdn>:/mnt/z_store/images on /mnt/conf_files type nfs4 (rw,noatime,vers=4.2,rsize=131072,wsize=131072,namlen=255,acregmin=1800,acregmax=1800,acdirmin=1800,acdirmax=1800,soft,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=192.168.100.9,local_lock=none,addr=192.168.121.7)

System: pve-manager/8.2.4/faa83925c9641325 (running kernel: 6.8.8-3-pve), cluster, no CEPH, local + local-lvm on SSDs running on Intel i11/i14 NUCs

journalctl log:

Code:

Jul 30 18:49:32 i11-nuc pvedaemon[7306]: INFO: Starting Backup of VM 9001 (qemu)
Jul 30 18:50:19 i11-nuc pveproxy[1091]: worker exit
Jul 30 18:50:19 i11-nuc pveproxy[1088]: worker 1091 finished
Jul 30 18:50:19 i11-nuc pveproxy[1088]: starting 1 worker(s)
Jul 30 18:50:19 i11-nuc pveproxy[1088]: worker 7446 started
Jul 30 18:52:41 i11-nuc kernel: nfs: server <fqdn> not responding, still trying
Jul 30 18:52:41 i11-nuc kernel: nfs: server <fqdn> not responding, still trying
Jul 30 18:52:41 i11-nuc kernel: nfs: server <fqdn> not responding, still trying
Jul 30 18:52:41 i11-nuc kernel: nfs: server <fqdn> not responding, still trying
Jul 30 18:53:13 i11-nuc kernel: INFO: task zstd:7330 blocked for more than 122 seconds.
Jul 30 18:53:13 i11-nuc kernel:       Tainted: P           O       6.8.8-4-pve #1
Jul 30 18:53:13 i11-nuc kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 30 18:53:13 i11-nuc kernel: task:zstd            state:D stack:0     pid:7330  tgid:7330  ppid:7326   flags:0x00004002
Jul 30 18:53:13 i11-nuc kernel: Call Trace:
Jul 30 18:53:13 i11-nuc kernel:  <TASK>
Jul 30 18:53:13 i11-nuc kernel:  __schedule+0x401/0x15e0
Jul 30 18:53:13 i11-nuc kernel:  ? nfs_pageio_complete+0xee/0x140 [nfs]
Jul 30 18:53:13 i11-nuc kernel:  schedule+0x33/0x110
Jul 30 18:53:13 i11-nuc kernel:  io_schedule+0x46/0x80
Jul 30 18:53:13 i11-nuc kernel:  folio_wait_bit_common+0x136/0x330
Jul 30 18:53:13 i11-nuc kernel:  ? __pfx_wake_page_function+0x10/0x10
Jul 30 18:53:13 i11-nuc kernel:  folio_wait_bit+0x18/0x30
Jul 30 18:53:13 i11-nuc kernel:  folio_wait_writeback+0x2b/0xa0
Jul 30 18:53:13 i11-nuc kernel:  __filemap_fdatawait_range+0x90/0x100
Jul 30 18:53:13 i11-nuc kernel:  filemap_write_and_wait_range+0x94/0xc0
Jul 30 18:53:13 i11-nuc kernel:  nfs_wb_all+0x27/0x130 [nfs]
Jul 30 18:53:13 i11-nuc kernel:  nfs4_file_flush+0x7e/0xe0 [nfsv4]
Jul 30 18:53:13 i11-nuc kernel:  filp_flush+0x35/0x90
Jul 30 18:53:13 i11-nuc kernel:  __x64_sys_close+0x34/0x90
Jul 30 18:53:13 i11-nuc kernel:  x64_sys_call+0x1a20/0x24b0
Jul 30 18:53:13 i11-nuc kernel:  do_syscall_64+0x81/0x170
Jul 30 18:53:13 i11-nuc kernel:  ? clear_bhb_loop+0x15/0x70
Jul 30 18:53:13 i11-nuc kernel:  ? clear_bhb_loop+0x15/0x70
Jul 30 18:53:13 i11-nuc kernel:  ? clear_bhb_loop+0x15/0x70
Jul 30 18:53:13 i11-nuc kernel:  entry_SYSCALL_64_after_hwframe+0x78/0x80
Jul 30 18:53:13 i11-nuc kernel: RIP: 0033:0x7a2f9fc25d57
Jul 30 18:53:13 i11-nuc kernel: RSP: 002b:00007ffe4159cd68 EFLAGS: 00000293 ORIG_RAX: 0000000000000003
Jul 30 18:53:13 i11-nuc kernel: RAX: ffffffffffffffda RBX: 00007a2f9fcfc760 RCX: 00007a2f9fc25d57
Jul 30 18:53:13 i11-nuc kernel: RDX: 00007a2f9fcf79e0 RSI: 00007a2f88000b70 RDI: 0000000000000001
Jul 30 18:53:13 i11-nuc kernel: RBP: 00007a2f9fcf85e0 R08: 0000000000000000 R09: 0000000000000000
Jul 30 18:53:13 i11-nuc kernel: R10: 0000000000000000 R11: 0000000000000293 R12: 0000000000000000
Jul 30 18:53:13 i11-nuc kernel: R13: 0000000000000002 R14: 00005a6a414b7018 R15: 000000007935e800
Jul 30 18:53:13 i11-nuc kernel:  </TASK>
Jul 30 18:53:39 i11-nuc kernel: nfs: server <fqdn> not responding, still trying
Jul 30 18:53:39 i11-nuc kernel: nfs: server <fqdn> not responding, still trying
Jul 30 18:53:39 i11-nuc kernel: nfs: server <fqdn> not responding, still trying
Jul 30 18:53:39 i11-nuc kernel: nfs: server <fqdn> not responding, still trying
Jul 30 18:53:39 i11-nuc kernel: nfs: server <fqdn> not responding, still trying
Jul 30 18:54:45 i11-nuc pvedaemon[973]: got timeout
Jul 30 18:54:45 i11-nuc pvedaemon[973]: unable to activate storage 'zfs_nfs_pve' - directory '/mnt/pve/zfs_nfs_pve' does not exist or is unreachable

storage.cfg:

Code:

nfs: zfs_nfs_pve
    export /mnt/z_store/nfs_pve
    path /mnt/pve/zfs_nfs_pve
    server <fqdn>
    content iso,images,rootdir,vztmpl,backup,snippets
    preallocation metadata
    prune-backups keep-daily=1,keep-monthly=6,keep-weekly=4,keep-yearly=1

nfs mount info:

Code:

findmnt /mnt/pve/zfs_nfs_pve
TARGET               SOURCE                                  FSTYPE OPTIONS
/mnt/pve/zfs_nfs_pve <fqdn>:/mnt/z_store/nfs_pve nfs4   rw,relatime,vers=4
.2,rsize=131072,wsize=131072,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,cli
entaddr=<ip1>,local_lock=none,addr=<ip2>

making the nfs options same as another VM (where nfs mount works - please see Homework section above) results in the same issue of hung thread dumps.

pvesm status:

Code:

pvesm status
Name               Type     Status           Total            Used       Available        %
local               dir     active        98497780         5522676        87925556    5.61%
local-lvm       lvmthin     active       492216320               0       492216320    0.00%
zfs_nfs_pve         nfs     active     33830317824             384     33830317440    0.00%

The cluster nodes seemed to be hung up on this inaccesible nfs share. To prevent sluggish behavior of the nodes, I had to issue

Code:

pvesm set zfs_nfs_pve --disable

Help please.

bbgeek17 · Jul 31, 2024

The easiest explanation is : network problem.
Check your MTUs, ensure you can ping at full packet size. The next step is to try NFSv3. If there are no clues in prior steps - tcpdump on both sides may be in order.

Good luck

Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox

morik_proxmox · Jul 31, 2024

The next step is to try NFSv3.

Thank you. I already attempted this (per mention in Homework

. No relief.

Check your MTUs, ensure you can ping at full packet size.

I should've mentioned doing so. Neither a reachability nor an MTU issue either.

pve interface+bridge:

Code:

auto lo
iface lo inet loopback

iface enp2s0 inet manual
    mtu 9000

auto vmbr0.140
iface vmbr0.140 inet static
    mtu 9000
..

auto vmbr0
iface vmbr0 inet manual
    bridge-ports enp2s0
    bridge-stp off
    bridge-fd 0
    bridge-vlan-aware yes
    bridge-vids 2-4094
    bridge-pvid 3
    mtu 9000
...

NFS server:

Code:

cxl0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 9000
    description: cxl0    options=6ec07bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,VLAN_HWTSO,LINKSTATE,RXCSUM_IPV6,TXCSUM_IPV6,HWRXTSTMP,NOMAP>
...
cxl1: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 9000    options=6ec07bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,VLAN_HWTSO,LINKSTATE,RXCSUM_IPV6,TXCSUM_IPV6,HWRXTSTMP,NOMAP>

Switch config:

Code:

interface xyz
  description Proxmox node
  switchport mode trunk
  switchport trunk native vlan 3
  switchport trunk allowed vlan 2-3,100,120-121,140,250
  mtu 9216

If there are no clues in prior steps - tcpdump on both sides may be in order.

Thank you for the guidance. I'll attempt and revert.

morik_proxmox · Aug 29, 2024

morik_proxmox said:
Thank you for the guidance. I'll attempt and revert.

Having tried everything, I stand stumped. I can confirm it not being a network issue because of the following reasons:

(per above) mount of the nfs share in question works on other ubuntu, mac, windows and freebsd VMs and physical hosts
mount of nfs share in question works on ubuntu (22.04) VM from inside the proxmox hosts itself

`/etc/pve/storage.cfg` on proxmox cluster:

Code:

nfs: zfs_nfs_pve
    export /mnt/z_store/nfs_pve
    path /mnt/pve/zfs_nfs_pve
    server <fqdn>
    content images,rootdir,backup,snippets,iso,vztmpl
    options rw,noatime,vers=4.2,rsize=131072,wsize=131072,namlen=255,acregmin=1800,acregmax=1800,acdirmin=1800,acdirmax=1800,soft,proto=tcp,timeo=600,retrans=2,sec=sys,local_lock=none
    preallocation metadata
    prune-backups keep-daily=1,keep-monthly=6,keep-weekly=4,keep-yearly=1

nfs `zfs_nfs_pve` mounts for a few seconds and/or few operations like `cp, mv` etc. Then, despite the `soft` option, puts the proxmox hosts in a weird hanging state whereby nothing else works e.g. no shell tasks can be launched, GUI refreshes freeze, new ssh sessions can't be connected etc. Reboot clears the state for a bit until a similar issue is observed.

nfs mount is successful from proxmox host, and can see directories/files etc:

Code:

root@i11-nuc:/var/lib/vz# ls -al /mnt/pve/zfs_nfs_pve
total 88
drwxrwxrwx 8 root   nogroup    8 Aug 15 23:29 .
drwxr-xr-x 4 root   root    4096 Jul 31 15:27 ..
drwxr-xr-x 5 nobody nogroup    7 Aug 27 21:47 dump
drwxrwxrwx 5 nobody nogroup    5 Aug 19 14:44 frigate_videos
drwxr-xr-x 2 nobody nogroup    2 Jul 30 19:07 images
drwxr-xr-x 2 nobody nogroup    2 Jul 30 19:07 private
drwxr-xr-x 2 nobody nogroup    9 Aug 27 21:57 snippets
drwxr-xr-x 4 nobody nogroup    4 Jul 30 19:07 template

A copy operation on NFS share works fine too:

Code:

root@i11-nuc:/var/lib/vz# cp -vr dump/ /mnt/pve/zfs_nfs_pve/dump/
'dump/vzdump-qemu-1001-2024_08_18-12_29_33.vma.zst' -> '/mnt/pve/zfs_nfs_pve/dump/dump/vzdump-qemu-1001-2024_08_18-12_29_33.vma.zst'
'dump/vzdump-qemu-1001-2024_08_18-12_29_33.vma.zst.notes' -> '/mnt/pve/zfs_nfs_pve/dump/dump/vzdump-qemu-1001-2024_08_18-12_29_33.vma.zst.notes'
'dump/vzdump-qemu-1001-2024_08_18-12_29_33.log' -> '/mnt/pve/zfs_nfs_pve/dump/dump/vzdump-qemu-1001-2024_08_18-12_29_33.log'

then after some time, `pvestatd` gets a timeout:

Code:

Aug 28 18:27:45 i11-nuc pvestatd[1022]: status update time (10.092 seconds)
Aug 28 18:29:45 i11-nuc pvestatd[1022]: status update time (10.090 seconds)
Aug 28 18:30:05 i11-nuc pvestatd[1022]: status update time (10.087 seconds)
Aug 28 18:31:25 i11-nuc pvestatd[1022]: status update time (10.086 seconds)
Aug 28 18:35:35 i11-nuc pvestatd[1022]: status update time (10.094 seconds)
Aug 28 18:35:46 i11-nuc pvestatd[1022]: status update time (10.095 seconds)
Aug 28 18:37:17 i11-nuc pvestatd[1022]: status update time (10.134 seconds)
Aug 28 18:37:37 i11-nuc pvestatd[1022]: status update time (10.089 seconds)
Aug 28 18:37:47 i11-nuc pvestatd[1022]: status update time (10.092 seconds)
Aug 28 18:38:57 i11-nuc pvestatd[1022]: status update time (10.087 seconds)
Aug 28 18:39:07 i11-nuc pvestatd[1022]: status update time (10.095 seconds)
Aug 28 18:40:27 i11-nuc pvestatd[1022]: status update time (10.087 seconds)
Aug 28 18:40:57 i11-nuc pvestatd[1022]: status update time (10.087 seconds)
Aug 28 18:42:48 i11-nuc pvestatd[1022]: status update time (10.088 seconds)
Aug 28 18:42:58 i11-nuc pvestatd[1022]: status update time (10.101 seconds)
Aug 28 18:44:39 i11-nuc pvestatd[1022]: status update time (10.088 seconds)
Aug 28 18:46:49 i11-nuc pvestatd[1022]: status update time (10.093 seconds)
Aug 28 18:48:41 i11-nuc pvestatd[1022]: got timeout
Aug 28 18:48:41 i11-nuc pvestatd[1022]: status update time (12.084 seconds)
Aug 28 18:49:31 i11-nuc pvestatd[1022]: status update time (10.126 seconds)

and then the same error message as the first post.

Might this be worth posting as an issue on github?

morik_proxmox · Oct 9, 2024

Others experiencing similar issue may be able to learn from my (inadvertent) mistake. It turned out to be an issue of MTU related discards on a non-primary interface (which happened to be on the same LAN as the one used for management). One of the ports on NAS server was rightfully configured w/ MTU=9000. But, the change was not made permanent. A reboot on the NAS server caused the MTUs to revert to 1500.

Thank you @bbgeek17 for nudging me in the right direction.

Search

Search

backup of templates over nfs is failing

morik_proxmox

New Member

bbgeek17

Distinguished Member

morik_proxmox

New Member

morik_proxmox

New Member

morik_proxmox

New Member