vzdump locking up to NFS share on 6.3-3

ejmerkel

Renowned Member
Sep 20, 2012
I have a newly set up Proxmox server and I am having an issue where the vzdump process locks up on the NFS connection after about 2% of the backup is complete. At that point the VM is locked up and I cannot kill the vzdump process; a full reboot of the Proxmox host is necessary to bring things back online. Given my notes below, does anyone have any suggestions on how to fix this?

Here are a few notes about the setup.

1) The backup network is a separate 20G network (2 x 10G LAGG) between the PVE host and the NFS server
2) I have about 10 other Proxmox nodes running older versions of Proxmox with the same setup that have no problem backing up to the same server and volume
3) The NFS server is not under load at the time of these backups. I am testing outside of our normal backup window, so there is little to no activity on the NFS server when the backup is attempted
4) The backup volumes are mounted as NFS v3 (we had issues using v4 when PVE 6 came out, so we switched to this version)
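
For reference, the backup storage is defined in /etc/pve/storage.cfg roughly as follows (a reconstruction from the mount output further down in this thread; the exact entry on the hosts may differ slightly):

Code:
nfs: backup3
        export /volume3/proxmoxv3
        path /mnt/pve/backup3
        server 172.16.4.10
        content backup
        options vers=3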

Here are some snippets of the syslog when this happens.

Code:
Jan 26 15:16:56 ind-exp-hv13 pvedaemon[46634]: <root@pam> starting task UPID:ind-exp-hv13:00000975:03B1314D:601078B8:vzdump:550:root@pam:
Jan 26 15:16:56 ind-exp-hv13 pvedaemon[2421]: INFO: starting new backup job: vzdump 550 --compress lzo --mode snapshot --node ind-exp-hv13 --remove 0 --storage backup3
Jan 26 15:16:56 ind-exp-hv13 pvedaemon[2421]: INFO: Starting Backup of VM 550 (qemu)
Jan 26 15:17:00 ind-exp-hv13 systemd[1]: Starting Proxmox VE replication runner...
Jan 26 15:17:00 ind-exp-hv13 systemd[1]: pvesr.service: Succeeded.
Jan 26 15:17:00 ind-exp-hv13 systemd[1]: Started Proxmox VE replication runner.
Jan 26 15:17:01 ind-exp-hv13 CRON[2462]: pam_unix(cron:session): session opened for user root by (uid=0)
Jan 26 15:17:01 ind-exp-hv13 CRON[2463]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Jan 26 15:17:01 ind-exp-hv13 CRON[2462]: pam_unix(cron:session): session closed for user root
Jan 26 15:17:04 ind-exp-hv13 pvestatd[2267]: got timeout
Jan 26 15:17:07 ind-exp-hv13 pvestatd[2267]: got timeout
Jan 26 15:17:09 ind-exp-hv13 pvestatd[2267]: got timeout
Jan 26 15:17:09 ind-exp-hv13 pvestatd[2267]: unable to activate storage 'backup3' - directory '/mnt/pve/backup3' does not exist or is unreachable
Jan 26 15:17:09 ind-exp-hv13 pvestatd[2267]: status update time (6.368 seconds)
Jan 26 15:17:14 ind-exp-hv13 pvestatd[2267]: got timeout
Jan 26 15:17:16 ind-exp-hv13 pvestatd[2267]: got timeout
Jan 26 15:17:18 ind-exp-hv13 pvestatd[2267]: got timeout
Jan 26 15:17:18 ind-exp-hv13 pvestatd[2267]: unable to activate storage 'backup3' - directory '/mnt/pve/backup3' does not exist or is unreachable
Jan 26 15:17:18 ind-exp-hv13 pvestatd[2267]: status update time (6.344 seconds)
Jan 26 15:17:23 ind-exp-hv13 corosync[2249]:   [TOTEM ] Retransmit List: 33dea1
Jan 26 15:17:24 ind-exp-hv13 pvestatd[2267]: got timeout
Jan 26 15:17:24 ind-exp-hv13 pvestatd[2267]: unable to activate storage 'backup3' - directory '/mnt/pve/backup3' does not exist or is unreachable
Jan 26 15:17:26 ind-exp-hv13 pvestatd[2267]: got timeout
Jan 26 15:17:28 ind-exp-hv13 pvestatd[2267]: got timeout
Jan 26 15:17:28 ind-exp-hv13 pvestatd[2267]: unable to activate storage 'backup' - directory '/mnt/pve/backup' does not exist or is unreachable
Jan 26 15:17:28 ind-exp-hv13 pvestatd[2267]: status update time (6.344 seconds)
Jan 26 15:17:35 ind-exp-hv13 pvestatd[2267]: got timeout
Jan 26 15:17:35 ind-exp-hv13 pvestatd[2267]: unable to activate storage 'backup3' - directory '/mnt/pve/backup3' does not exist or is unreachable
Jan 26 15:17:37 ind-exp-hv13 pvestatd[2267]: got timeout
Jan 26 15:17:37 ind-exp-hv13 pvestatd[2267]: unable to activate storage 'backup' - directory '/mnt/pve/backup' does not exist or is unreachable
Jan 26 15:17:39 ind-exp-hv13 pvestatd[2267]: got timeout
Jan 26 15:17:39 ind-exp-hv13 pvestatd[2267]: status update time (6.381 seconds)
Jan 26 15:17:44 ind-exp-hv13 pvestatd[2267]: got timeout
Jan 26 15:17:44 ind-exp-hv13 pvestatd[2267]: unable to activate storage 'backup' - directory '/mnt/pve/backup' does not exist or is unreachable
Jan 26 15:17:46 ind-exp-hv13 pvestatd[2267]: got timeout
Jan 26 15:17:48 ind-exp-hv13 pvestatd[2267]: got timeout
Jan 26 15:17:48 ind-exp-hv13 pvestatd[2267]: unable to activate storage 'backup3' - directory '/mnt/pve/backup3' does not exist or is unreachable

A little further down in the log, I see the following.

Code:
Jan 26 15:19:54 ind-exp-hv13 pvestatd[2267]: got timeout
Jan 26 15:19:54 ind-exp-hv13 pvestatd[2267]: unable to activate storage 'backup2' - directory '/mnt/pve/backup2' does not exist or is unreachable
Jan 26 15:19:57 ind-exp-hv13 pvestatd[2267]: got timeout
Jan 26 15:19:57 ind-exp-hv13 pvestatd[2267]: unable to activate storage 'backup' - directory '/mnt/pve/backup' does not exist or is unreachable
Jan 26 15:19:59 ind-exp-hv13 pvestatd[2267]: got timeout
Jan 26 15:19:59 ind-exp-hv13 pvestatd[2267]: unable to activate storage 'backup3' - directory '/mnt/pve/backup3' does not exist or is unreachable
Jan 26 15:19:59 ind-exp-hv13 pvestatd[2267]: status update time (6.351 seconds)
Jan 26 15:20:00 ind-exp-hv13 systemd[1]: Starting Proxmox VE replication runner...
Jan 26 15:20:00 ind-exp-hv13 systemd[1]: pvesr.service: Succeeded.
Jan 26 15:20:00 ind-exp-hv13 systemd[1]: Started Proxmox VE replication runner.
Jan 26 15:20:04 ind-exp-hv13 kernel: nfs: server 172.16.4.10 not responding, still trying
Jan 26 15:20:04 ind-exp-hv13 kernel: nfs: server 172.16.4.10 not responding, still trying
Jan 26 15:20:04 ind-exp-hv13 kernel: INFO: task lzop:2435 blocked for more than 120 seconds.
Jan 26 15:20:04 ind-exp-hv13 kernel: Tainted: P O 5.4.78-2-pve #1
Jan 26 15:20:04 ind-exp-hv13 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 26 15:20:04 ind-exp-hv13 kernel: lzop D 0 2435 2421 0x00000000
Jan 26 15:20:04 ind-exp-hv13 kernel: Call Trace:
Jan 26 15:20:04 ind-exp-hv13 kernel: __schedule+0x2e6/0x6f0
Jan 26 15:20:04 ind-exp-hv13 kernel: ? bit_wait_timeout+0xa0/0xa0
Jan 26 15:20:04 ind-exp-hv13 kernel: schedule+0x33/0xa0
Jan 26 15:20:04 ind-exp-hv13 kernel: io_schedule+0x16/0x40
Jan 26 15:20:04 ind-exp-hv13 kernel: bit_wait_io+0x11/0x50
Jan 26 15:20:04 ind-exp-hv13 kernel: __wait_on_bit+0x33/0xa0
Jan 26 15:20:04 ind-exp-hv13 kernel: out_of_line_wait_on_bit+0x90/0xb0
Jan 26 15:20:04 ind-exp-hv13 kernel: ? var_wake_function+0x30/0x30
Jan 26 15:20:04 ind-exp-hv13 kernel: nfs_wait_on_request+0x41/0x50 [nfs]
Jan 26 15:20:04 ind-exp-hv13 kernel: nfs_lock_and_join_requests+0x89/0x550 [nfs]
Jan 26 15:20:04 ind-exp-hv13 kernel: ? xas_load+0xc/0x80
Jan 26 15:20:04 ind-exp-hv13 kernel: ? find_get_entry+0xb1/0x180
Jan 26 15:20:04 ind-exp-hv13 kernel: nfs_updatepage+0x1b5/0x9b0 [nfs]
Jan 26 15:20:04 ind-exp-hv13 kernel: ? nfs_flush_incompatible+0x189/0x1d0 [nfs]
Jan 26 15:20:04 ind-exp-hv13 kernel: nfs_write_end+0x67/0x4f0 [nfs]
Jan 26 15:20:04 ind-exp-hv13 kernel: ? iov_iter_copy_from_user_atomic+0xd8/0x370
Jan 26 15:20:04 ind-exp-hv13 kernel: generic_perform_write+0x135/0x1b0
Jan 26 15:20:04 ind-exp-hv13 kernel: ? _cond_resched+0x19/0x30
Jan 26 15:20:04 ind-exp-hv13 kernel: ? _cond_resched+0x19/0x30
Jan 26 15:20:04 ind-exp-hv13 kernel: nfs_file_write+0x103/0x270 [nfs]
Jan 26 15:20:04 ind-exp-hv13 kernel: new_sync_write+0x125/0x1c0
Jan 26 15:20:04 ind-exp-hv13 kernel: __vfs_write+0x29/0x40
Jan 26 15:20:04 ind-exp-hv13 kernel: vfs_write+0xab/0x1b0
Jan 26 15:20:04 ind-exp-hv13 kernel: ksys_write+0x61/0xe0
Jan 26 15:20:04 ind-exp-hv13 kernel: __x64_sys_write+0x1a/0x20
Jan 26 15:20:04 ind-exp-hv13 kernel: do_syscall_64+0x57/0x190
Jan 26 15:20:04 ind-exp-hv13 kernel: entry_SYSCALL_64_after_hwframe+0x44/0xa9
Jan 26 15:20:04 ind-exp-hv13 kernel: RIP: 0033:0x7ff9583d0504
Jan 26 15:20:04 ind-exp-hv13 kernel: Code: Bad RIP value.
Jan 26 15:20:04 ind-exp-hv13 kernel: RSP: 002b:00007ffd751b9708 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
Jan 26 15:20:04 ind-exp-hv13 kernel: RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007ff9583d0504
Jan 26 15:20:04 ind-exp-hv13 kernel: RDX: 0000000000000004 RSI: 00007ffd751b9770 RDI: 0000000000000001
Jan 26 15:20:04 ind-exp-hv13 kernel: RBP: 0000000000000004 R08: 000000000000000c R09: 00007ff95823a000
Jan 26 15:20:04 ind-exp-hv13 kernel: R10: 0000000000000007 R11: 0000000000000246 R12: 00007ff9582e36c0
Jan 26 15:20:04 ind-exp-hv13 kernel: R13: 0000000000000001 R14: 0000000000000019 R15: 00007ffd751b9770
Jan 26 15:20:04 ind-exp-hv13 pvestatd[2267]: got timeout
Jan 26 15:20:04 ind-exp-hv13 pvestatd[2267]: unable to activate storage 'backup2' - directory '/mnt/pve/backup2' does not exist or is unreachable
Jan 26 15:20:06 ind-exp-hv13 pvestatd[2267]: got timeout
Jan 26 15:20:06 ind-exp-hv13 pvestatd[2267]: unable to activate storage 'backup' - directory '/mnt/pve/backup' does not exist or is unreachable
Jan 26 15:20:08 ind-exp-hv13 pvestatd[2267]: got timeout
Jan 26 15:20:08 ind-exp-hv13 pvestatd[2267]: unable to activate storage 'backup3' - directory '/mnt/pve/backup3' does not exist or is unreachable
Jan 26 15:20:08 ind-exp-hv13 pvestatd[2267]: status update time (6.328 seconds)

Below are the Proxmox package versions.

proxmox-ve: 6.3-1 (running kernel: 5.4.78-2-pve)
pve-manager: 6.3-3 (running version: 6.3-3/eee5f901)
pve-kernel-5.4: 6.3-3
pve-kernel-helper: 6.3-3
pve-kernel-5.4.78-2-pve: 5.4.78-2
pve-kernel-5.4.60-1-pve: 5.4.60-2
pve-kernel-5.4.34-1-pve: 5.4.34-2
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.16-pve1
libproxmox-acme-perl: 1.0.7
libproxmox-backup-qemu0: 1.0.2-1
libpve-access-control: 6.1-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.3-2
libpve-guest-common-perl: 3.1-3
libpve-http-server-perl: 3.1-1
libpve-storage-perl: 6.3-3
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.3-1
lxcfs: 4.0.3-pve3
novnc-pve: 1.1.0-1
openvswitch-switch: 2.12.0-1
proxmox-backup-client: 1.0.6-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.4-3
pve-cluster: 6.2-1
pve-container: 3.3-2
pve-docs: 6.3-1
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.1-3
pve-ha-manager: 3.1-1
pve-i18n: 2.2-2
pve-qemu-kvm: 5.1.0-7
pve-xtermjs: 4.7.0-3
qemu-server: 6.3-2
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 0.8.5-pve1

Thanks,
Eric
 
This hints at a network issue:
Code:
Jan 26 15:17:23 ind-exp-hv13 corosync[2249]:   [TOTEM ] Retransmit List: 33dea1

Have you checked the network config (ip -details a) and compared it to the other nodes?
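
For example, one way to compare them (a minimal sketch; the file names are placeholders):

Code:
# on each node, dump the bond details into a file named after the host
ip -details address show bond2 > /tmp/bond2-$(hostname).txt
# copy both files to one machine, then diff the working node against the failing one, e.g.:
diff /tmp/bond2-workinghost.txt /tmp/bond2-failinghost.txt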

When the connection to an NFS storage is lost, the processes involved can be stuck in an uninterruptible sleep. The only solution to get rid of these is to reboot.
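
A quick way to spot such stuck processes (standard tooling, just as an illustration):

Code:
# list processes in uninterruptible sleep (state D); a vzdump/lzop writer blocked
# on a dead NFS mount will typically show up here and cannot be killed
ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /D/'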
 
Here are two nodes for comparison: the first is the working node and the second is the non-working one. The only difference is that the working one runs an older 6.x version of Proxmox. FWIW, 200+ VMs backed up OK from the older version this weekend to the same NAS the newer ones are having issues with.

Working

6: eno1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 9000 qdisc mq master bond2 state UP group default qlen 1000
link/ether 24:6e:96:07:1d:f0 brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 9710
bond_slave state ACTIVE mii_status UP link_failure_count 0 perm_hwaddr 24:6e:96:07:1d:f0 queue_id 0 ad_aggregator_id 1 ad_actor_oper_port_state 61 ad_partner_oper_port_state 63 numtxqueues 64 numrxqueues 64 gso_max_size 65536 gso_max_segs 65535
7: eno2: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 9000 qdisc mq master bond2 state UP group default qlen 1000
link/ether 24:6e:96:07:1d:f0 brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 9710
bond_slave state ACTIVE mii_status UP link_failure_count 0 perm_hwaddr 24:6e:96:07:1d:f2 queue_id 0 ad_aggregator_id 1 ad_actor_oper_port_state 61 ad_partner_oper_port_state 63 numtxqueues 64 numrxqueues 64 gso_max_size 65536 gso_max_segs 65535
12: bond2: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP group default qlen 1000
link/ether 24:6e:96:07:1d:f0 brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 65535
bond mode 802.3ad miimon 100 updelay 0 downdelay 0 use_carrier 1 arp_interval 0 arp_validate none arp_all_targets any primary_reselect always fail_over_mac none xmit_hash_policy layer2 resend_igmp 1 num_grat_arp 1 all_slaves_active 0 min_links 0 lp_interval 1 packets_per_slave 1 lacp_rate slow ad_select stable ad_aggregator 1 ad_num_ports 2 ad_actor_key 15 ad_partner_key 104 ad_partner_mac 84:c1:c1:42:48:00 ad_actor_sys_prio 65535 ad_user_port_key 0 ad_actor_system 00:00:00:00:00:00 tlb_dynamic_lb 1 numtxqueues 16 numrxqueues 16 gso_max_size 65536 gso_max_segs 65535
inet 172.16.4.254/24 brd 172.16.4.255 scope global bond2
valid_lft forever preferred_lft forever
inet6 fe80::266e:96ff:fe07:1df0/64 scope link
valid_lft forever preferred_lft forever


Not Working

6: eno1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 9000 qdisc mq master bond2 state UP group default qlen 1000
link/ether b8:ca:3a:62:77:70 brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 9710
bond_slave state ACTIVE mii_status UP link_failure_count 0 perm_hwaddr b8:ca:3a:62:77:70 queue_id 0 ad_aggregator_id 1 ad_actor_oper_port_state 61 ad_partner_oper_port_state 63 numtxqueues 64 numrxqueues 64 gso_max_size 65536 gso_max_segs 65535
7: eno2: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 9000 qdisc mq master bond2 state UP group default qlen 1000
link/ether b8:ca:3a:62:77:70 brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 9710
bond_slave state ACTIVE mii_status UP link_failure_count 0 perm_hwaddr b8:ca:3a:62:77:72 queue_id 0 ad_aggregator_id 1 ad_actor_oper_port_state 61 ad_partner_oper_port_state 63 numtxqueues 64 numrxqueues 64 gso_max_size 65536 gso_max_segs 65535
12: bond2: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP group default qlen 1000
link/ether b8:ca:3a:62:77:70 brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 65535
bond mode 802.3ad miimon 100 updelay 0 downdelay 0 use_carrier 1 arp_interval 0 arp_validate none arp_all_targets any primary_reselect always fail_over_mac none xmit_hash_policy layer2 resend_igmp 1 num_grat_arp 1 all_slaves_active 0 min_links 0 lp_interval 1 packets_per_slave 1 lacp_rate slow ad_select stable ad_aggregator 1 ad_num_ports 2 ad_actor_key 15 ad_partner_key 133 ad_partner_mac 84:c1:c1:42:48:00 ad_actor_sys_prio 65535 ad_user_port_key 0 ad_actor_system 00:00:00:00:00:00 tlb_dynamic_lb 1 numtxqueues 16 numrxqueues 16 gso_max_size 65536 gso_max_segs 65535
inet 172.16.4.251/24 brd 172.16.4.255 scope global bond2
valid_lft forever preferred_lft forever
inet6 fe80::baca:3aff:fe62:7770/64 scope link
valid_lft forever preferred_lft forever
 
Is there anything in the syslog on the PVE host and the NFS server?
 
Is there anything in the syslog on the PVE host and the NFS server?
No, nothing to speak of on the NFS server. I have another host on the latest PVE and it has the same problem, so it doesn't seem to be host-specific, and as I mentioned, the older PVE hosts have no issue backing up to the same NAS. I've read in some other threads that there seems to be a timeout issue in the backup software; I'll have to see if I can look further down that line.
 
Could you post the output of mount for both a working host and one that doesn't? The line containing the NFS server should be enough.
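
For example (filtering on the NFS server address keeps the output to the relevant line):

Code:
mount | grep 172.16.4.10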
 
Could you post the output of mount for both a working host and one that doesn't? The line containing the NFS server should be enough.
Here is the output of the mount command on both hosts.

Working

172.16.4.10:/volume3/proxmoxv3 on /mnt/pve/backup3 type nfs (rw,relatime,vers=3,rsize=131072,wsize=131072,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=172.16.4.10,mountvers=3,mountport=892,mountproto=udp,local_lock=none,addr=172.16.4.10)

Not Working

172.16.4.10:/volume3/proxmoxv3 on /mnt/pve/backup3 type nfs (rw,relatime,vers=3,rsize=131072,wsize=131072,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=172.16.4.10,mountvers=3,mountport=892,mountproto=udp,local_lock=none,addr=172.16.4.10)
 
Just want to add that the same symptoms happen with CIFS too: https://forum.proxmox.com/threads/after-daily-backup-fails-node-is-not-usable.77679/

I also found similar-sounding threads here in the forum, all starting with 6.2 and continuing until today; most of them have zero replies, so no solution yet.
That the integrated backup solution can take down whole nodes is a bit worrisome.
We still have the problem on one of our two clusters; the backup cluster works like a charm.
 
Here is the output of the mount command on both hosts.

Working

172.16.4.10:/volume3/proxmoxv3 on /mnt/pve/backup3 type nfs (rw,relatime,vers=3,rsize=131072,wsize=131072,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=172.16.4.10,mountvers=3,mountport=892,mountproto=udp,local_lock=none,addr=172.16.4.10)

Not Working

172.16.4.10:/volume3/proxmoxv3 on /mnt/pve/backup3 type nfs (rw,relatime,vers=3,rsize=131072,wsize=131072,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=172.16.4.10,mountvers=3,mountport=892,mountproto=udp,local_lock=none,addr=172.16.4.10)
Have you tried it with active-backup bond mode?
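
For reference, an active-backup bond in /etc/network/interfaces would look roughly like this (a sketch based on the addresses in this thread; the slave names and any bridge setup on the real hosts may differ, and the switch-side LAG would need to be adjusted accordingly):

Code:
auto bond2
iface bond2 inet static
        address 172.16.4.251
        netmask 255.255.255.0
        bond-slaves eno1 eno2
        bond-miimon 100
        bond-mode active-backup
        bond-primary eno1
        mtu 9000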

Could you try the new kernel that's available in the pvetest repository?
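
A sketch of how that could be done on PVE 6.x (Debian Buster repository line; disable the pvetest repo again once testing is finished):

Code:
echo "deb http://download.proxmox.com/debian/pve buster pvetest" > /etc/apt/sources.list.d/pvetest.list
apt update
apt install pve-kernel-5.4    # metapackage pulling in the newest 5.4 kernel
reboot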
 
