Strange ZFS performance issue

Hello,
I have an old-ish HP ProLiant MicroServer Gen8. It has 4x 3.5" bays (currently populated with 2x WD Reds = /dev/sda, /dev/sdb) + 1x old laptop 2.5" HDD (holding the Proxmox OS only = /dev/sdc).

Proxmox is installed on that single 2.5" HDD; the whole disk was partitioned automatically by the installer (ext4 for root).
The 2x WD Reds are in a ZFS mirror.

On top of that I run 1x VM + 2x LXCs.

Here is my Proxmox storage config.

Bash:
root@pmx[ ~ ]# cat /etc/pve/storage.cfg
dir: local
    path /var/lib/vz
    content vztmpl,iso,backup

lvmthin: local-lvm
    thinpool data
    vgname pve
    content rootdir,images

zfspool: storage
    pool zp11/pmx/storage
    content rootdir,images
    mountpoint /zp11/pmx/storage
    sparse 0

dir: repo
    path /zp11/pmx/repo
    content iso,vztmpl
    shared 0

dir: backups
    path /zp11/backups/pmx
    content backup
    shared 0

root@pmx[ ~ ]#
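(For completeness, this is roughly how the mirror layout and the pool's sector-size setting can be double-checked - zp11 being the pool from the config above; I haven't pasted the output here.)

Code:
root@pmx[ ~ ]# zpool status zp11
root@pmx[ ~ ]# zpool get ashift zp11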

The strange thing for me is that whenever I place the 3 guests on the zfspool (/dev/sda + /dev/sdb), their disk writes (measured from the Proxmox host) are far higher than when the same guests are hosted on the lvmthin storage (/dev/sdc).

I have read a lot about how zvols differ from "other" volumes (LVM) and that one must be careful when using them. But a zvol is used only for the single VM I have; the 2x LXCs use ZFS datasets, not zvols. Therefore I don't understand why there are performance implications with the 2x LXCs alone as well.
I underline >implications< because I don't have a real performance problem; it's just that while measuring and comparing the two storage types (lvmthin vs zfspool), the writes in kB/s towards the underlying disks (measured with sar/iostat from the Proxmox host) are much higher for the zfspool.
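(For reference, this is roughly how I check the block sizes involved - the zvol name below is only an example, the real one differs:)

Code:
root@pmx[ ~ ]# zfs list -t filesystem,volume -r -o name,type zp11/pmx/storage
root@pmx[ ~ ]# zfs get volblocksize zp11/pmx/storage/vm-100-disk-0
root@pmx[ ~ ]# zfs get recordsize zp11/pmx/storage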

Originally I planned to host my VMs/LXCs on the zfspool because it's mirrored, and leave the standalone disk for the hypervisor only.

Code:
root@pmx[ ~ ]# smartctl -a /dev/sda
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.64-1-pve] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD40EFRX-XXXXXX
Serial Number:    XXXXXXXXX
LU WWN Device Id: 5 0014ee 211de03b9
Firmware Version: 82.00A82
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon Nov 14 21:43:41 2022 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

Code:
root@pmx[ ~ ]# smartctl -a /dev/sdc
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.64-1-pve] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Hitachi/HGST Travelstar 5K750
Device Model:     Hitachi XXXXXXXX
Serial Number:    XXXXXXX
LU WWN Device Id: 5 000cca 6dfe7df00
Firmware Version: JE4OA50A
User Capacity:    750,156,374,016 bytes [750 GB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Form Factor:      2.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 6
SATA Version is:  SATA 2.6, 3.0 Gb/s
Local Time is:    Mon Nov 14 21:44:15 2022 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

Below you can see ~0.5 MB/s of writes onto the zfspool (/dev/sda = WD Red) while all 3 guests reside on that zfspool.
Code:
root@pmx[ ~ ]# sar -d --dev=sda -s 20:00 -e 20:10 |cat
Linux 5.15.64-1-pve (pmx)     14/11/22     _x86_64_    (8 CPU)

20:00:10          DEV       tps     rkB/s     wkB/s     dkB/s   areq-sz    aqu-sz     await     %util
20:01:06          sda     16.37      0.29    582.93      0.00     35.62      0.34     20.65     17.64
20:02:01          sda     16.46      0.00    498.06      0.00     30.25      0.33     19.95     16.74
20:03:06          sda     17.29      0.12    569.22      0.00     32.93      0.34     19.68     17.64
20:04:10          sda     15.88      0.25    508.02      0.00     32.01      0.33     20.62     16.69
20:05:01          sda     16.12      0.16    509.92      0.00     31.64      0.32     19.74     16.41
20:06:06          sda     16.71      0.12    571.38      0.00     34.21      0.35     20.99     17.98
20:07:01          sda     15.46      0.15    528.85      0.00     34.21      0.30     19.22     15.70
20:08:06          sda     15.82      0.00    481.61      0.00     30.43      0.34     21.69     17.07
20:09:26          sda     17.17      0.20    609.72      0.00     35.52      0.35     20.34     17.93
Average:          sda     16.41      0.14    542.82      0.00     33.09      0.33     20.35     17.15
root@pmx[ ~ ]#

Of course, during the same timeframe the single root disk (/dev/sdc = small 2.5" laptop HDD) is not utilized much.
Code:
root@pmx[ ~ ]# sar -d --dev=sdc -s 20:00 -e 20:10 |cat
Linux 5.15.64-1-pve (pmx)     14/11/22     _x86_64_    (8 CPU)

20:00:10          DEV       tps     rkB/s     wkB/s     dkB/s   areq-sz    aqu-sz     await     %util
20:01:06          sdc      4.54     46.11     29.68      0.00     16.70      0.14     31.46      2.33
20:02:01          sdc      3.36     55.82     19.77      0.00     22.49      0.11     31.74      1.90
20:03:06          sdc      3.70     47.38     25.29      0.00     19.63      0.11     29.18      2.15
20:04:10          sdc      2.71     51.50     18.76      0.00     25.94      0.08     28.82      1.62
20:05:01          sdc      3.73     55.63     23.63      0.00     21.23      0.11     28.73      2.21
20:06:06          sdc      3.71     47.61     24.21      0.00     19.38      0.10     27.80      1.95
20:07:01          sdc      3.54     55.81     23.40      0.00     22.36      0.11     30.09      1.94
20:08:06          sdc      3.84     47.20     26.12      0.00     19.09      0.10     26.87      2.17
20:09:26          sdc      2.76     51.30     18.34      0.00     25.27      0.08     27.61      1.51
Average:          sdc      3.50     50.73     23.05      0.00     21.06      0.10     29.09      1.95
root@pmx[ ~ ]#

Then I migrate the storage of all 3 guests from the zpool (/dev/sda + /dev/sdb) to the lvmthin (root disk /dev/sdc).
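(The moves are done with the "Move disk" / "Move Volume" actions in the GUI; on the CLI it would be something along these lines - exact subcommand names depend on the PVE version, and 100/101/scsi0/rootfs are only example IDs/names:)

Code:
root@pmx[ ~ ]# qm move-disk 100 scsi0 local-lvm
root@pmx[ ~ ]# pct move-volume 101 rootfs local-lvm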

After an hour I measure again.

The zpool is now much more relaxed; no guests are hosted there, so the wkB/s figures are far lower.

Code:
root@pmx[ ~ ]# sar -d --dev=sda -s 21:00 -e 21:10 |cat
Linux 5.15.64-1-pve (pmx)     14/11/22     _x86_64_    (8 CPU)

21:00:12          DEV       tps     rkB/s     wkB/s     dkB/s   areq-sz    aqu-sz     await     %util
21:01:06          sda      3.19      0.00     43.83      0.00     13.72      0.07     22.40      3.70
21:02:01          sda      5.78      0.00     81.05      0.00     14.03      0.12     20.38      6.01
21:03:06          sda      1.97      0.00     24.35      0.00     12.34      0.04     20.57      2.08
21:04:22          sda      1.53      0.00     21.95      0.00     14.39      0.03     21.34      1.67
21:05:01          sda      3.08      0.00     46.38      0.00     15.03      0.07     21.60      3.39
21:06:06          sda      2.84      0.00     38.48      0.00     13.55      0.06     20.46      2.98
21:07:01          sda      5.61      0.00     76.67      0.00     13.66      0.11     19.85      5.72
21:08:06          sda      1.77      0.00     25.10      0.00     14.16      0.04     20.58      1.93
21:09:10          sda      1.87      0.00     28.27      0.00     15.11      0.04     21.03      2.01
Average:          sda      2.94      0.00     40.94      0.00     13.93      0.06     20.75      3.13
root@pmx[ ~ ]#

Taking a look at the lvmthin then blows my mind, because the average wkB/s is less than 100, whereas on the zfspool it was over 500.
Code:
root@pmx[ ~ ]# sar -d --dev=sdc -s 21:00 -e 21:10 |cat
Linux 5.15.64-1-pve (pmx)     14/11/22     _x86_64_    (8 CPU)

21:00:12          DEV       tps     rkB/s     wkB/s     dkB/s   areq-sz    aqu-sz     await     %util
21:01:06          sdc     11.10     57.49     77.70      0.00     12.17      0.28     25.25      7.41
21:02:01          sdc      7.69     55.90     68.98      0.00     16.25      0.23     30.44      8.05
21:03:06          sdc      7.40     47.71     71.69      0.00     16.14      0.23     31.64      6.26
21:04:22          sdc      7.53     53.55     74.71      0.00     17.02      0.23     30.89      6.29
21:05:01          sdc      7.15     52.65     63.55      0.00     16.26      0.19     26.21      5.66
21:06:06          sdc      9.01     48.62     79.94      0.00     14.26      0.32     35.71      6.73
21:07:01          sdc      7.72     59.23     74.13      0.00     17.27      0.20     25.29      5.64
21:08:06          sdc      7.77     47.43     77.46      0.00     16.07      0.24     30.44      6.60
21:09:10          sdc      7.27     47.73     69.16      0.00     16.08      0.22     30.76      6.55
Average:          sdc      8.06     51.97     73.48      0.00     15.57      0.24     29.88      6.59
root@pmx[ ~ ]#

I did many tests with a single VM or LXC as well, moving it between the storages. The lvmthin always writes less data onto the physical disk than the zfspool.
I also don't understand the %util (last column of the sar output) and why it goes above 15% when the 3 guests are hosted there, as the guests don't do many IOPS.

The three guests have a steady workload pattern; it cannot be that they just happen to be busy exactly when they are switched over to the zfspool.

I also tried to get disk measurements (with sar/iostat) from inside the guests. With the LXCs I had no luck, as they share the same environment as the host, so the numbers were exactly the same as from the Proxmox OS (nothing extra to compare with).
The disk stats from inside the VM showed a disk that is barely used at all (wkB/s less than 100; %util less than 3), regardless of which storage backs the VM.
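(The only other per-guest numbers I could think of getting are the cgroup I/O counters for the containers - assuming cgroup v2, and the path below is just my guess for a container with ID 101 - plus plain iostat inside the VM:)

Code:
root@pmx[ ~ ]# cat /sys/fs/cgroup/lxc/101/io.stat
root@vm[ ~ ]# iostat -dxk 60 5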

The VM's internal OS always reports less than 100 wkB/s, but when:
1/ the underlying storage is the zfspool, the Proxmox host reports over 500 wkB/s
2/ the underlying storage is the lvmthin, the Proxmox host reports the same less-than-100 wkB/s (even though the Proxmox OS itself also runs on that disk and counts towards the same counters)

So to me something is not optimal somewhere between the zpool disks, KVM, LXC and Proxmox.
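My next idea for narrowing it down (if someone can confirm it makes sense) is to compare what ZFS itself reports per vdev against what sar sees on the raw disks, and to check the relevant dataset properties:

Code:
root@pmx[ ~ ]# zpool iostat -v zp11 60
root@pmx[ ~ ]# zfs get sync,compression,atime zp11/pmx/storage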

If anyone bears with me, please give me a hint how to continue troubleshooting.
In the end... I don't really care about the system (it's a home setup), I just want to find out why it behaves like this, because it doesn't make sense to me. [IT nature calls :)]
 
