NFS export times out

ChrisRedz

New Member
Dec 4, 2020
System topology

Hypervisor: Proxmox VE 6.3-2 | 2x 10Gb bonded (balance-rr)
Switch: MikroTik 10Gb, receiving on a bonded port (also balance-rr), transmit hash policy layer 2+3
Storage: Dell EMC Isilon
Protocol: NFS
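
The bond is defined in /etc/network/interfaces roughly as follows (a sketch; the NIC names eno1/eno2 and the address are placeholders, not my exact values):

Code:
auto bond0
iface bond0 inet manual
        bond-slaves eno1 eno2
        bond-mode balance-rr
        bond-miimon 100

auto vmbr0
iface vmbr0 inet static
        address 10.40.100.10/24
        bridge-ports bond0
        bridge-stp off
        bridge-fd 0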

I am completely frustrated after three days of troubleshooting with no luck. Please help me out, you marvelous souls.

Problem:

When I try to move a disk, converting it to qcow2 via the GUI, the NFS export hangs, forcing me to reboot the node. The same happens if I upload an ISO or do any other operation on the NFS share. Everything gets greyed out, and even though the VMs keep running, the GUI gets no response:

Code:
Dec 04 11:49:01 PMXNODE1 systemd[1]: pvesr.service: Succeeded.
Dec 04 11:49:01 PMXNODE1 systemd[1]: Started Proxmox VE replication runner.
Dec 04 11:49:03 PMXNODE1 pvestatd[1435]: got timeout
Dec 04 11:49:03 PMXNODE1 pvestatd[1435]: unable to activate storage 'test13' - directory '/mnt/pve/test13' does not exist or is unreachable

Since I have tried so many things in the Proxmox, MikroTik, and Isilon configurations, I think the best approach would be for you to share your thoughts and we can work from there.

I beg you guys to put me out of my misery

Thanks in advance
 
Have you tried active-backup bond mode?
Please provide the syslog, from a little before the start of the 'Move disk' operation until some time after the issue appears.
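Something along these lines will capture that window (the timestamps are just an example; adjust them to when you start the operation):

Code:
journalctl --since "2020-12-04 13:55:00" --until "2020-12-04 14:05:00" > syslog-move-disk.txt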
 
I have not tried modifying the bond; the idea was to increase the bandwidth. Do you think this could be the issue? Here is the log, from the time I rebooted the server until the 'Move disk' operation failed:

Code:
Dec 04 13:57:09 PMXNODE1 sshd[2689]: Accepted password for root from 10.28.0.1 port 59481 ssh2
Dec 04 13:57:09 PMXNODE1 sshd[2689]: pam_unix(sshd:session): session opened for user root by (uid=0)
Dec 04 13:57:09 PMXNODE1 systemd[1]: Created slice User Slice of UID 0.
Dec 04 13:57:09 PMXNODE1 systemd[1]: Starting User Runtime Directory /run/user/0...
Dec 04 13:57:09 PMXNODE1 systemd-logind[1083]: New session 1 of user root.
Dec 04 13:57:09 PMXNODE1 systemd[1]: Started User Runtime Directory /run/user/0.
Dec 04 13:57:09 PMXNODE1 systemd[1]: Starting User Manager for UID 0...
Dec 04 13:57:09 PMXNODE1 systemd[2804]: pam_unix(systemd-user:session): session opened for user root by (uid=0)
Dec 04 13:57:10 PMXNODE1 systemd[2804]: Reached target Paths.
Dec 04 13:57:10 PMXNODE1 systemd[2804]: Listening on GnuPG network certificate management daemon.
Dec 04 13:57:10 PMXNODE1 systemd[2804]: Listening on GnuPG cryptographic agent and passphrase cache.
Dec 04 13:57:10 PMXNODE1 systemd[2804]: Listening on GnuPG cryptographic agent and passphrase cache (restricted).
Dec 04 13:57:10 PMXNODE1 systemd[2804]: Reached target Timers.
Dec 04 13:57:10 PMXNODE1 systemd[2804]: Listening on GnuPG cryptographic agent (ssh-agent emulation).
Dec 04 13:57:10 PMXNODE1 systemd[2804]: Listening on GnuPG cryptographic agent and passphrase cache (access for web browsers).
Dec 04 13:57:10 PMXNODE1 systemd[2804]: Reached target Sockets.
Dec 04 13:57:10 PMXNODE1 systemd[2804]: Reached target Basic System.
Dec 04 13:57:10 PMXNODE1 systemd[2804]: Reached target Default.
Dec 04 13:57:10 PMXNODE1 systemd[2804]: Startup finished in 168ms.
Dec 04 13:57:10 PMXNODE1 systemd[1]: Started User Manager for UID 0.
Dec 04 13:57:10 PMXNODE1 systemd[1]: Started Session 1 of user root.
Dec 04 13:57:17 PMXNODE1 kernel: nfs: server 10.40.100.200 not responding, still trying
Dec 04 13:58:00 PMXNODE1 systemd[1]: Starting Proxmox VE replication runner...
Dec 04 13:58:01 PMXNODE1 systemd[1]: pvesr.service: Succeeded.
Dec 04 13:58:01 PMXNODE1 systemd[1]: Started Proxmox VE replication runner.
Dec 04 13:59:00 PMXNODE1 systemd[1]: Starting Proxmox VE replication runner...
Dec 04 13:59:01 PMXNODE1 systemd[1]: pvesr.service: Succeeded.
Dec 04 13:59:01 PMXNODE1 systemd[1]: Started Proxmox VE replication runner.
Dec 04 13:59:57 PMXNODE1 kernel: nfs: server 10.40.100.200 not responding, still trying
Dec 04 14:00:00 PMXNODE1 systemd[1]: Starting Proxmox VE replication runner...
Dec 04 14:00:01 PMXNODE1 systemd[1]: pvesr.service: Succeeded.
Dec 04 14:00:01 PMXNODE1 systemd[1]: Started Proxmox VE replication runner.
Dec 04 14:00:26 PMXNODE1 pvedaemon[1532]: got timeout
Dec 04 14:00:54 PMXNODE1 kernel: nfs: server 10.40.100.200 OK
Dec 04 14:00:55 PMXNODE1 pvedaemon[1532]: <root@pam> move disk VM 102: move --disk virtio0 --storage test4
Dec 04 14:00:56 PMXNODE1 pvedaemon[1532]: <root@pam> starting task UPID:PMXNODE1:00000D61:0000FF1A:5FCA3308:qmmove:102:root@pam:
Dec 04 14:01:00 PMXNODE1 systemd[1]: Starting Proxmox VE replication runner...
Dec 04 14:01:01 PMXNODE1 systemd[1]: pvesr.service: Succeeded.
Dec 04 14:01:01 PMXNODE1 systemd[1]: Started Proxmox VE replication runner.
Dec 04 14:01:08 PMXNODE1 pvestatd[1519]: got timeout
Dec 04 14:01:18 PMXNODE1 pvestatd[1519]: got timeout
Dec 04 14:01:28 PMXNODE1 pvestatd[1519]: got timeout
Dec 04 14:01:28 PMXNODE1 pvestatd[1519]: unable to activate storage 'test4' - directory '/mnt/pve/test4' does not exist or is unreachable
Dec 04 14:01:38 PMXNODE1 pvestatd[1519]: got timeout
Dec 04 14:01:38 PMXNODE1 pvestatd[1519]: unable to activate storage 'test4' - directory '/mnt/pve/test4' does not exist or is unreachable
Dec 04 14:01:48 PMXNODE1 pvestatd[1519]: got timeout
Dec 04 14:01:48 PMXNODE1 pvestatd[1519]: unable to activate storage 'test4' - directory '/mnt/pve/test4' does not exist or is unreachable
Dec 04 14:01:58 PMXNODE1 pvestatd[1519]: got timeout
Dec 04 14:01:58 PMXNODE1 pvestatd[1519]: unable to activate storage 'test4' - directory '/mnt/pve/test4' does not exist or is unreachable
Dec 04 14:02:00 PMXNODE1 systemd[1]: Starting Proxmox VE replication runner...
 
Is 10.40.100.200 the NFS server you're using? If so, there seems to have been a problem even before the 'move disk' started:
Code:
Dec 04 13:57:17 PMXNODE1 kernel: nfs: server 10.40.100.200 not responding, still trying
You should check your network.
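For example, from the node (the IP is the NFS server from your log):

Code:
# basic reachability
ping -c 3 10.40.100.200
# RPC services the NFS client needs (portmapper, mountd, nfs)
rpcinfo -p 10.40.100.200
# is the export still published?
showmount -e 10.40.100.200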
 
Believe it or not, I have managed to fix the issue after talking to you. I was pretty sure the problem was in the network, and I had tried so many different settings on the MikroTik side, but I had never considered that the Proxmox bond could be the problem. And it was!

As soon as I destroyed the bond, it started working properly. Now my question is: each link is 10Gb, and I wanted to bond them to maximize the bandwidth. Why doesn't Proxmox like it? How can I make the bond not time out?

Thanks!
 
balance-rr mode requires support from both sides, and because it stripes packets round-robin across the links, it can deliver TCP segments out of order, which NFS over TCP copes with badly. Does your switch support LACP? If so, you could try it instead.
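On the Proxmox side that would mean a bond stanza in /etc/network/interfaces roughly like this (a sketch; eno1/eno2 are placeholder NIC names, and the corresponding MikroTik bond must be set to 802.3ad/LACP as well):

Code:
auto bond0
iface bond0 inet manual
        bond-slaves eno1 eno2
        bond-mode 802.3ad
        bond-xmit-hash-policy layer2+3
        bond-miimon 100

With LACP the switch negotiates the aggregation, so each TCP flow stays on one link and out-of-order delivery is no longer an issue.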
 
When you say both sides, do you mean the MikroTik switch as well? Because I had a bond on the MikroTik side with balance-rr too. I have just changed both MikroTik and Proxmox to an LACP bond. Let's see what happens.
 
The LACP bond works flawlessly. I am about to set the MTU on Proxmox to 9000. Any tips on how to do so?
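
In case it helps others: MTU is set per interface in /etc/network/interfaces, and it has to match on every hop (bond, bridge, MikroTik ports, Isilon). A sketch, assuming the bond/bridge names from above:

Code:
auto bond0
iface bond0 inet manual
        bond-slaves eno1 eno2
        bond-mode 802.3ad
        mtu 9000

auto vmbr0
iface vmbr0 inet static
        address 10.40.100.10/24
        bridge-ports bond0
        mtu 9000

Afterwards, ping -M do -s 8972 10.40.100.200 from the node confirms that jumbo frames actually pass end to end (8972 = 9000 minus 28 bytes of IP + ICMP headers).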
 
