Strange share performance issue

bferrell

Well-Known Member
Nov 16, 2018
I'm seeing a very odd and difficult-to-troubleshoot disk share performance issue; posting in case anyone has thoughts.

I have a 4-node PVE cluster, served by a few FreeNAS boxes with NFS shares. One of the FreeNAS boxes is new, with SSD storage, and as part of bringing it online I rearranged the cluster network connections on my UniFi XG switch. Each node has a 10G connection on VLAN100 (192.168.100.0/24) for data and a 10G connection on VLAN101 (192.168.101.0/24) for SAN/FreeNAS traffic. All nodes use the same NFS share for image files, but one node is seeing extremely slow disk access to the share. Iperf3 tests to the share come back at 10G speeds.
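
For comparison's sake, each node should be mounting the share identically; something like the following shows the cluster-wide storage definition and the negotiated NFS mount options on a node (the storage name below is a placeholder for whatever the share is actually called):
Code:
# Cluster-wide storage definitions (shared across all PVE nodes)
cat /etc/pve/storage.cfg
# Active NFS mounts with their negotiated options (vers, rsize/wsize, proto)
nfsstat -m
# PVE mounts NFS storages under /mnt/pve/<storage-id>
findmnt /mnt/pve/freenas2-nfs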

I've tried using different ports on the switch and checking that the interfaces are really connected to the right networks (the kind of checks I mean are sketched below), and I'm at a complete loss as to what to do. It has to be related to the change, but I can't just 'put it back': I moved enough ports that I don't know which ones it was on previously. The other 3 nodes are fine, and there are no new ports in the configuration; I simply set it up so that the nodes are all in one column on the switch, with VLAN100 on top for every node and VLAN101 on the bottom. I've tried moving the node from SFP+ to native copper ports on the switch with no apparent effect, and I've rebooted the node multiple times.
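
(enp5s0f1 below stands in for whatever the VLAN101-facing NIC is actually called)
Code:
# One-line-per-interface summary of link state and addresses
ip -br link
ip -br addr
# Negotiated speed, duplex, and link detection on the SAN-facing port
ethtool enp5s0f1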

Here is some data.

Iperf3 results on the FreeNAS share from Node 1 (which is having the slow reads):
Code:
root@freenas2[~]# iperf3 -s
-----------------------------------------------------------
Server listening on 5201
-----------------------------------------------------------
Accepted connection from 192.168.101.11, port 42486
[ 5] local 192.168.101.102 port 5201 connected to 192.168.101.11 port 42488
[ ID] Interval Transfer Bitrate
[ 5] 0.00-1.00 sec 1.05 GBytes 9.00 Gbits/sec
[ 5] 1.00-2.00 sec 1.04 GBytes 8.92 Gbits/sec
[ 5] 2.00-3.00 sec 972 MBytes 8.15 Gbits/sec
[ 5] 3.00-4.00 sec 1.08 GBytes 9.29 Gbits/sec
[ 5] 4.00-5.00 sec 1.10 GBytes 9.41 Gbits/sec
[ 5] 5.00-6.00 sec 1.10 GBytes 9.41 Gbits/sec
[ 5] 6.00-7.00 sec 1.10 GBytes 9.41 Gbits/sec
[ 5] 7.00-8.00 sec 1.10 GBytes 9.41 Gbits/sec
[ 5] 8.00-9.00 sec 1.07 GBytes 9.21 Gbits/sec
[ 5] 9.00-10.00 sec 1.03 GBytes 8.82 Gbits/sec
[ 5] 10.00-10.00 sec 1.32 MBytes 9.38 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate
[ 5] 0.00-10.00 sec 10.6 GBytes 9.10 Gbits/sec receiver
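
Worth noting: iperf3 sends from client to server by default, and the symptom here is slow reads (FreeNAS -> node), so the reverse direction is the more interesting one; -R makes the server do the sending:
Code:
# Run on the node; -R reverses the test so FreeNAS transmits, matching the NFS read path
iperf3 -c 192.168.101.102 -R -t 10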

Disk performance test from a VM running on Node 1:
Code:
bferrell@ntp:~$ sudo hdparm -Tt /dev/sda
[sudo] password for bferrell:

/dev/sda:
 Timing cached reads:    2 MB in 32.17 seconds = 63.66 kB/sec
 Timing buffered disk reads:     2 MB in 28.34 seconds = 72.26 kB/sec
bferrell@ntp:~$

Similar test on Node 2:
Code:
bferrell@pihole:~$ sudo hdparm -Tt /dev/sda
[sudo] password for bferrell:

/dev/sda:
Timing cached reads: 15382 MB in 1.99 seconds = 7744.40 MB/sec
Timing buffered disk reads: 396 MB in 3.00 seconds = 131.98 MB/sec
bferrell@pihole:~$
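
hdparm is a fairly coarse benchmark for a virtual disk backed by NFS; if anyone wants to reproduce this with more control, a fio run inside the VM is more telling (the filename and size here are just examples):
Code:
# Sequential 1M reads with direct I/O, bypassing the guest page cache
fio --name=seqread --rw=read --bs=1M --size=1G \
    --ioengine=libaio --direct=1 --filename=/root/fio.test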


Some stats from Node 1:
Code:
root@svr-01:~# traceroute 192.168.101.102
traceroute to 192.168.101.102 (192.168.101.102), 30 hops max, 60 byte packets
1 192.168.101.102 (192.168.101.102) 0.449 ms 0.418 ms 0.396 ms
root@svr-01:~# ping -c 4 192.168.101.102
PING 192.168.101.102 (192.168.101.102) 56(84) bytes of data.
64 bytes from 192.168.101.102: icmp_seq=1 ttl=64 time=0.233 ms
64 bytes from 192.168.101.102: icmp_seq=2 ttl=64 time=0.117 ms
64 bytes from 192.168.101.102: icmp_seq=3 ttl=64 time=0.218 ms
64 bytes from 192.168.101.102: icmp_seq=4 ttl=64 time=0.214 ms

--- 192.168.101.102 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 55ms
rtt min/avg/max/mdev = 0.117/0.195/0.233/0.047 ms
root@svr-01:~#

Same stats from Node 2:
Code:
root@svr-02:~# traceroute 192.168.101.102
traceroute to 192.168.101.102 (192.168.101.102), 30 hops max, 60 byte packets
1 192.168.101.102 (192.168.101.102) 0.224 ms 0.189 ms 0.176 ms
root@svr-02:~# ping -c 4 192.168.101.102
PING 192.168.101.102 (192.168.101.102) 56(84) bytes of data.
64 bytes from 192.168.101.102: icmp_seq=1 ttl=64 time=0.332 ms
64 bytes from 192.168.101.102: icmp_seq=2 ttl=64 time=0.147 ms
64 bytes from 192.168.101.102: icmp_seq=3 ttl=64 time=0.171 ms
64 bytes from 192.168.101.102: icmp_seq=4 ttl=64 time=0.245 ms

--- 192.168.101.102 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 56ms
rtt min/avg/max/mdev = 0.147/0.223/0.332/0.074 ms
root@svr-02:~#
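
One more thing that can bite after re-cabling VLANs: an MTU mismatch between node, switch port, and FreeNAS leaves small pings looking perfect while NFS crawls. A don't-fragment ping at full frame size checks the whole path (the 8972-byte test is only relevant if jumbo frames are configured anywhere):
Code:
# 1472 = 1500 MTU minus 28 bytes of IP+ICMP headers; -M do forbids fragmentation
ping -c 3 -M do -s 1472 192.168.101.102
# 8972 = 9000 MTU minus the same 28 bytes, for jumbo-frame paths
ping -c 3 -M do -s 8972 192.168.101.102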
 
So, as soon as I posted that, it started working properly. What I've been able to determine is that the second port on the node's 10G card brings the switchport down on VLAN101 when plugged into any of my SFP+ ports, but it works fine in the copper ports. I've also confirmed that my laptop and the other nodes can use that SFP+ port without issue. I have no idea what that's telling me, to be honest, but at least it's functioning. I'd be interested to hear theories.
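
In case it helps anyone theorize: assuming the NIC driver supports it, ethtool can read the SFP+ module's own EEPROM and diagnostics, which might show a marginal module or cable (enp5s0f1 again stands in for the real interface name):
Code:
# Dump the SFP+ module EEPROM: vendor, part number, and (if supported) rx/tx power
ethtool -m enp5s0f1
# Driver and firmware versions of the NIC itself
ethtool -i enp5s0f1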
 
Did you check the ROM versions of the NICs? Maybe they're different. Physical damage could also be possible, especially when other devices behave fine on the same SFP+ port...
 
* hmm - check the `dmesg` output and the journal of Node 1 (the one with the problems) for any messages from the NIC (commands sketched below)
* maybe it's a faulty SFP+ module on the NIC of the node?
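
e.g. something along these lines (ixgbe is just a guess at the 10G driver; substitute whatever the card actually uses):
Code:
# Kernel ring buffer, filtered for NIC/link/SFP messages
dmesg | grep -i -e ixgbe -e sfp -e link
# Kernel messages from the journal for the current boot
journalctl -k -b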
 
