I have a newly set up Proxmox server and I am having an issue where the vzdump process locks up on the NFS connection after about 2% of the backup is complete. At this point the VM is locked up and I cannot kill the vzdump process. A full reboot of the Proxmox host is necessary to bring things back online. Given my notes below, does anyone have any suggestions on how to fix this?
Here are a few notes about the setup.
1) The backup network is a separate 20G network (2 x 10G LAGG) between the PVE host and the NFS server.
2) I have about 10 other Proxmox nodes running an older version of Proxmox with the same setup, and they have no problem backing up to the same server and volume.
3) The NFS server is not under load at the time of these backups. I am testing outside of our normal backup window, so there is little to no activity on the NFS server when the backup is attempted.
4) The backup volumes are mounted as NFS v3 (we had issues using v4 when PVE 6 came out, so we switched to this version).
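For anyone wanting to confirm which NFS version a node actually negotiated, here is a minimal sketch that reads /proc/mounts and reports the `vers=` option for each NFS mount. The function name `nfs_mounts` is made up for illustration, and the sketch assumes the kernel exposes the version as a `vers=` mount option, which is the usual case on Linux.

```python
from pathlib import Path


def nfs_mounts(mounts_text):
    """Return (mountpoint, fstype, vers) for each NFS entry in /proc/mounts-style text."""
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields: device, mountpoint, fstype, options, dump, pass
        if len(fields) < 4 or fields[2] not in ("nfs", "nfs4"):
            continue
        options = fields[3].split(",")
        # vers is None if the option is not listed for this mount
        vers = next((o.split("=", 1)[1] for o in options if o.startswith("vers=")), None)
        found.append((fields[1], fields[2], vers))
    return found


if __name__ == "__main__":
    proc_mounts = Path("/proc/mounts")
    if proc_mounts.exists():
        for mountpoint, fstype, vers in nfs_mounts(proc_mounts.read_text()):
            print(f"{mountpoint}: type={fstype} vers={vers}")
```

Running it on both the new node and one of the older working nodes would show whether the mount options differ between them.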
Here are some snippets of the syslog when this happens.
Jan 26 15:16:56 ind-exp-hv13 pvedaemon[46634]: <root@pam> starting task UPID:ind-exp-hv13:00000975:03B1314D:601078B8:vzdump:550:root@pam:
Jan 26 15:16:56 ind-exp-hv13 pvedaemon[2421]: INFO: starting new backup job: vzdump 550 --compress lzo --mode snapshot --node ind-exp-hv13 --remove 0 --storage backup3
Jan 26 15:16:56 ind-exp-hv13 pvedaemon[2421]: INFO: Starting Backup of VM 550 (qemu)
Jan 26 15:17:00 ind-exp-hv13 systemd[1]: Starting Proxmox VE replication runner...
Jan 26 15:17:00 ind-exp-hv13 systemd[1]: pvesr.service: Succeeded.
Jan 26 15:17:00 ind-exp-hv13 systemd[1]: Started Proxmox VE replication runner.
Jan 26 15:17:01 ind-exp-hv13 CRON[2462]: pam_unix(cron:session): session opened for user root by (uid=0)
Jan 26 15:17:01 ind-exp-hv13 CRON[2463]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Jan 26 15:17:01 ind-exp-hv13 CRON[2462]: pam_unix(cron:session): session closed for user root
Jan 26 15:17:04 ind-exp-hv13 pvestatd[2267]: got timeout
Jan 26 15:17:07 ind-exp-hv13 pvestatd[2267]: got timeout
Jan 26 15:17:09 ind-exp-hv13 pvestatd[2267]: got timeout
Jan 26 15:17:09 ind-exp-hv13 pvestatd[2267]: unable to activate storage 'backup3' - directory '/mnt/pve/backup3' does not exist or is unreachable
Jan 26 15:17:09 ind-exp-hv13 pvestatd[2267]: status update time (6.368 seconds)
Jan 26 15:17:14 ind-exp-hv13 pvestatd[2267]: got timeout
Jan 26 15:17:16 ind-exp-hv13 pvestatd[2267]: got timeout
Jan 26 15:17:18 ind-exp-hv13 pvestatd[2267]: got timeout
Jan 26 15:17:18 ind-exp-hv13 pvestatd[2267]: unable to activate storage 'backup3' - directory '/mnt/pve/backup3' does not exist or is unreachable
Jan 26 15:17:18 ind-exp-hv13 pvestatd[2267]: status update time (6.344 seconds)
Jan 26 15:17:23 ind-exp-hv13 corosync[2249]: [TOTEM ] Retransmit List: 33dea1
Jan 26 15:17:24 ind-exp-hv13 pvestatd[2267]: got timeout
Jan 26 15:17:24 ind-exp-hv13 pvestatd[2267]: unable to activate storage 'backup3' - directory '/mnt/pve/backup3' does not exist or is unreachable
Jan 26 15:17:26 ind-exp-hv13 pvestatd[2267]: got timeout
Jan 26 15:17:28 ind-exp-hv13 pvestatd[2267]: got timeout
Jan 26 15:17:28 ind-exp-hv13 pvestatd[2267]: unable to activate storage 'backup' - directory '/mnt/pve/backup' does not exist or is unreachable
Jan 26 15:17:28 ind-exp-hv13 pvestatd[2267]: status update time (6.344 seconds)
Jan 26 15:17:35 ind-exp-hv13 pvestatd[2267]: got timeout
Jan 26 15:17:35 ind-exp-hv13 pvestatd[2267]: unable to activate storage 'backup3' - directory '/mnt/pve/backup3' does not exist or is unreachable
Jan 26 15:17:37 ind-exp-hv13 pvestatd[2267]: got timeout
Jan 26 15:17:37 ind-exp-hv13 pvestatd[2267]: unable to activate storage 'backup' - directory '/mnt/pve/backup' does not exist or is unreachable
Jan 26 15:17:39 ind-exp-hv13 pvestatd[2267]: got timeout
Jan 26 15:17:39 ind-exp-hv13 pvestatd[2267]: status update time (6.381 seconds)
Jan 26 15:17:44 ind-exp-hv13 pvestatd[2267]: got timeout
Jan 26 15:17:44 ind-exp-hv13 pvestatd[2267]: unable to activate storage 'backup' - directory '/mnt/pve/backup' does not exist or is unreachable
Jan 26 15:17:46 ind-exp-hv13 pvestatd[2267]: got timeout
Jan 26 15:17:48 ind-exp-hv13 pvestatd[2267]: got timeout
Jan 26 15:17:48 ind-exp-hv13 pvestatd[2267]: unable to activate storage 'backup3' - directory '/mnt/pve/backup3' does not exist or is unreachable
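The excerpt above mentions more than one storage, so it may help to count how often each one is reported unreachable. This is a minimal sketch, assuming the pvestatd message format stays exactly as shown in this log; the function name `count_unreachable` is hypothetical.

```python
import re
from collections import Counter

# Matches the pvestatd message format seen in the syslog excerpt.
UNREACHABLE = re.compile(r"pvestatd\[\d+\]: unable to activate storage '([^']+)'")


def count_unreachable(log_lines):
    """Count 'unable to activate storage' messages per storage name."""
    counts = Counter()
    for line in log_lines:
        match = UNREACHABLE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    # Three lines copied from the syslog in this post.
    sample = [
        "Jan 26 15:17:09 ind-exp-hv13 pvestatd[2267]: unable to activate storage 'backup3' - directory '/mnt/pve/backup3' does not exist or is unreachable",
        "Jan 26 15:17:28 ind-exp-hv13 pvestatd[2267]: unable to activate storage 'backup' - directory '/mnt/pve/backup' does not exist or is unreachable",
        "Jan 26 15:19:54 ind-exp-hv13 pvestatd[2267]: unable to activate storage 'backup2' - directory '/mnt/pve/backup2' does not exist or is unreachable",
    ]
    for storage, hits in count_unreachable(sample).most_common():
        print(storage, hits)
```

If every storage exported by the same NFS server shows up, that points at the connection to the server rather than at a single export.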
A little further down in the log I see the following.
Jan 26 15:19:54 ind-exp-hv13 pvestatd[2267]: got timeout
Jan 26 15:19:54 ind-exp-hv13 pvestatd[2267]: unable to activate storage 'backup2' - directory '/mnt/pve/backup2' does not exist or is unreachable
Jan 26 15:19:57 ind-exp-hv13 pvestatd[2267]: got timeout
Jan 26 15:19:57 ind-exp-hv13 pvestatd[2267]: unable to activate storage 'backup' - directory '/mnt/pve/backup' does not exist or is unreachable
Jan 26 15:19:59 ind-exp-hv13 pvestatd[2267]: got timeout
Jan 26 15:19:59 ind-exp-hv13 pvestatd[2267]: unable to activate storage 'backup3' - directory '/mnt/pve/backup3' does not exist or is unreachable
Jan 26 15:19:59 ind-exp-hv13 pvestatd[2267]: status update time (6.351 seconds)
Jan 26 15:20:00 ind-exp-hv13 systemd[1]: Starting Proxmox VE replication runner...
Jan 26 15:20:00 ind-exp-hv13 systemd[1]: pvesr.service: Succeeded.
Jan 26 15:20:00 ind-exp-hv13 systemd[1]: Started Proxmox VE replication runner.
Jan 26 15:20:04 ind-exp-hv13 kernel: nfs: server 172.16.4.10 not responding, still trying
Jan 26 15:20:04 ind-exp-hv13 kernel: nfs: server 172.16.4.10 not responding, still trying
Jan 26 15:20:04 ind-exp-hv13 kernel: INFO: task lzop:2435 blocked for more than 120 seconds.
Jan 26 15:20:04 ind-exp-hv13 kernel: Tainted: P O 5.4.78-2-pve #1
Jan 26 15:20:04 ind-exp-hv13 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 26 15:20:04 ind-exp-hv13 kernel: lzop D 0 2435 2421 0x00000000
Jan 26 15:20:04 ind-exp-hv13 kernel: Call Trace:
Jan 26 15:20:04 ind-exp-hv13 kernel: __schedule+0x2e6/0x6f0
Jan 26 15:20:04 ind-exp-hv13 kernel: ? bit_wait_timeout+0xa0/0xa0
Jan 26 15:20:04 ind-exp-hv13 kernel: schedule+0x33/0xa0
Jan 26 15:20:04 ind-exp-hv13 kernel: io_schedule+0x16/0x40
Jan 26 15:20:04 ind-exp-hv13 kernel: bit_wait_io+0x11/0x50
Jan 26 15:20:04 ind-exp-hv13 kernel: __wait_on_bit+0x33/0xa0
Jan 26 15:20:04 ind-exp-hv13 kernel: out_of_line_wait_on_bit+0x90/0xb0
Jan 26 15:20:04 ind-exp-hv13 kernel: ? var_wake_function+0x30/0x30
Jan 26 15:20:04 ind-exp-hv13 kernel: nfs_wait_on_request+0x41/0x50 [nfs]
Jan 26 15:20:04 ind-exp-hv13 kernel: nfs_lock_and_join_requests+0x89/0x550 [nfs]
Jan 26 15:20:04 ind-exp-hv13 kernel: ? xas_load+0xc/0x80
Jan 26 15:20:04 ind-exp-hv13 kernel: ? find_get_entry+0xb1/0x180
Jan 26 15:20:04 ind-exp-hv13 kernel: nfs_updatepage+0x1b5/0x9b0 [nfs]
Jan 26 15:20:04 ind-exp-hv13 kernel: ? nfs_flush_incompatible+0x189/0x1d0 [nfs]
Jan 26 15:20:04 ind-exp-hv13 kernel: nfs_write_end+0x67/0x4f0 [nfs]
Jan 26 15:20:04 ind-exp-hv13 kernel: ? iov_iter_copy_from_user_atomic+0xd8/0x370
Jan 26 15:20:04 ind-exp-hv13 kernel: generic_perform_write+0x135/0x1b0
Jan 26 15:20:04 ind-exp-hv13 kernel: ? _cond_resched+0x19/0x30
Jan 26 15:20:04 ind-exp-hv13 kernel: ? _cond_resched+0x19/0x30
Jan 26 15:20:04 ind-exp-hv13 kernel: nfs_file_write+0x103/0x270 [nfs]
Jan 26 15:20:04 ind-exp-hv13 kernel: new_sync_write+0x125/0x1c0
Jan 26 15:20:04 ind-exp-hv13 kernel: __vfs_write+0x29/0x40
Jan 26 15:20:04 ind-exp-hv13 kernel: vfs_write+0xab/0x1b0
Jan 26 15:20:04 ind-exp-hv13 kernel: ksys_write+0x61/0xe0
Jan 26 15:20:04 ind-exp-hv13 kernel: __x64_sys_write+0x1a/0x20
Jan 26 15:20:04 ind-exp-hv13 kernel: do_syscall_64+0x57/0x190
Jan 26 15:20:04 ind-exp-hv13 kernel: entry_SYSCALL_64_after_hwframe+0x44/0xa9
Jan 26 15:20:04 ind-exp-hv13 kernel: RIP: 0033:0x7ff9583d0504
Jan 26 15:20:04 ind-exp-hv13 kernel: Code: Bad RIP value.
Jan 26 15:20:04 ind-exp-hv13 kernel: RSP: 002b:00007ffd751b9708 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
Jan 26 15:20:04 ind-exp-hv13 kernel: RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007ff9583d0504
Jan 26 15:20:04 ind-exp-hv13 kernel: RDX: 0000000000000004 RSI: 00007ffd751b9770 RDI: 0000000000000001
Jan 26 15:20:04 ind-exp-hv13 kernel: RBP: 0000000000000004 R08: 000000000000000c R09: 00007ff95823a000
Jan 26 15:20:04 ind-exp-hv13 kernel: R10: 0000000000000007 R11: 0000000000000246 R12: 00007ff9582e36c0
Jan 26 15:20:04 ind-exp-hv13 kernel: R13: 0000000000000001 R14: 0000000000000019 R15: 00007ffd751b9770
Jan 26 15:20:04 ind-exp-hv13 pvestatd[2267]: got timeout
Jan 26 15:20:04 ind-exp-hv13 pvestatd[2267]: unable to activate storage 'backup2' - directory '/mnt/pve/backup2' does not exist or is unreachable
Jan 26 15:20:06 ind-exp-hv13 pvestatd[2267]: got timeout
Jan 26 15:20:06 ind-exp-hv13 pvestatd[2267]: unable to activate storage 'backup' - directory '/mnt/pve/backup' does not exist or is unreachable
Jan 26 15:20:08 ind-exp-hv13 pvestatd[2267]: got timeout
Jan 26 15:20:08 ind-exp-hv13 pvestatd[2267]: unable to activate storage 'backup3' - directory '/mnt/pve/backup3' does not exist or is unreachable
Jan 26 15:20:08 ind-exp-hv13 pvestatd[2267]: status update time (6.328 seconds)
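The hung-task trace shows lzop in state D (uninterruptible sleep), which would explain why the backup process cannot be killed. As a rough sketch, the following lists every process currently in that state by reading /proc/<pid>/stat. The helper names `parse_stat` and `blocked_tasks` are made up for illustration.

```python
from pathlib import Path


def parse_stat(stat_line):
    """Parse a /proc/<pid>/stat line into (pid, comm, state, ppid)."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the last ')' rather than on whitespace.
    head, _, tail = stat_line.rpartition(")")
    pid_text, _, comm = head.partition("(")
    fields = tail.split()
    return int(pid_text), comm, fields[0], int(fields[1])


def blocked_tasks(proc_root="/proc"):
    """Return (pid, comm, ppid) for every process in state D."""
    blocked = []
    for entry in Path(proc_root).iterdir():
        if not entry.name.isdigit():
            continue
        try:
            pid, comm, state, ppid = parse_stat((entry / "stat").read_text())
        except (OSError, ValueError, IndexError):
            continue  # process exited or the line was unreadable
        if state == "D":
            blocked.append((pid, comm, ppid))
    return blocked


if __name__ == "__main__":
    if Path("/proc").is_dir():
        for pid, comm, ppid in blocked_tasks():
            print(f"pid={pid} comm={comm} ppid={ppid} state=D")
```

A process stuck in D state on an NFS write generally stays there until the server responds again, so the list can confirm which parts of the backup pipeline are waiting on the mount.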
Below are the Proxmox package versions.
proxmox-ve: 6.3-1 (running kernel: 5.4.78-2-pve)
pve-manager: 6.3-3 (running version: 6.3-3/eee5f901)
pve-kernel-5.4: 6.3-3
pve-kernel-helper: 6.3-3
pve-kernel-5.4.78-2-pve: 5.4.78-2
pve-kernel-5.4.60-1-pve: 5.4.60-2
pve-kernel-5.4.34-1-pve: 5.4.34-2
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.16-pve1
libproxmox-acme-perl: 1.0.7
libproxmox-backup-qemu0: 1.0.2-1
libpve-access-control: 6.1-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.3-2
libpve-guest-common-perl: 3.1-3
libpve-http-server-perl: 3.1-1
libpve-storage-perl: 6.3-3
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.3-1
lxcfs: 4.0.3-pve3
novnc-pve: 1.1.0-1
openvswitch-switch: 2.12.0-1
proxmox-backup-client: 1.0.6-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.4-3
pve-cluster: 6.2-1
pve-container: 3.3-2
pve-docs: 6.3-1
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.1-3
pve-ha-manager: 3.1-1
pve-i18n: 2.2-2
pve-qemu-kvm: 5.1.0-7
pve-xtermjs: 4.7.0-3
qemu-server: 6.3-2
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 0.8.5-pve1
Thanks,
Eric