[SOLVED] ZFS on HDD massive performance drop after update from Proxmox 6.2 to 6.3

Hello everyone,

Any help with the issue below is highly appreciated!

Thanks!
Christian

[In short]
After upgrading from Proxmox 6.2 to 6.3 in mid-December I noticed a massive performance drop on my HDD-based ZFS pool.
In my SAMBA use case, performance dropped from 110 MByte/s network copy speed to below 1 MByte/s. Copy jobs from the HDD pool to the SSD-based ZFS pool and vice versa
show similar performance hits.

[Symptoms in detail]
I create full images of my workstation on a regular basis so that I can get back on track quickly in case of a hard drive defect. The images are first stored on a local USB drive and
in a second step copied over to the SAMBA server running on my Proxmox host via robocopy. Before the upgrade to Proxmox 6.3, the robocopy transfer to the SAMBA server
finished in roughly 80-90 minutes for an image of 490 GByte. Since the upgrade, the copy process takes forever. In the Windows task manager you'll see that the
data is first copied at full line speed, at a rate of 110 MByte/s. After a few seconds the performance drops to below 1 MByte/s and stays there for roughly a minute. At the
same time the IO delay on the server shoots to over 60%, sometimes 80%. Then performance comes back for a few seconds before it drops again. This looks like a sawtooth function.
To pinpoint the problem I ran some tests with copy processes from my SSD pool to the HDD pool and vice versa with no CTs and VMs running. The effect is the same: the copy starts at high rates
and after a few seconds the performance drops to unusable speeds below 1 MByte/s.

Copy with pv
Code:
root@proxmox02:/hdd_zfs_guests# head -c 100G </dev/urandom > /hdd_zfs_ssd/random_data.ran
root@proxmox02:/hdd_zfs_guests# pv -pra  /hdd_zfs_ssd/random_data.ran > ./random_data.ran
[1.42MiB/s] [33.5MiB/s] [===========>                          ]  8%
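
(If anyone wants to reproduce this: the drops can be watched live from a second shell while pv runs, for example with the commands below. This is only a sketch; zpool iostat is part of ZFS, iostat needs the sysstat package.)
Code:
# in a second shell while the pv copy is running
zpool iostat -v hdd_zfs_guests 2    # per-vdev bandwidth and IOPS every 2 seconds
zpool iostat -q hdd_zfs_guests 2    # queue depths, shows where requests pile up
iostat -x 2                         # %util and await per physical disk (sysstat package)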

Another symptom is a low FSYNCS/second value on the HDD pool after a few minutes of runtime, even with the CTs and VMs deactivated:
Directly after a reboot with CTs and VMs deactivated

Code:
root@proxmox02:/hdd_zfs_ssd# pveperf /hdd_zfs_guests
CPU BOGOMIPS:      83997.84
REGEX/SECOND:      4209528
HD SIZE:           2947.06 GB (hdd_zfs_guests)
FSYNCS/SECOND:     151.42
DNS EXT:           34.98 ms
DNS INT:           88.97 ms ()

After a few minutes:

Code:
root@proxmox02:~/zfs_debug# pveperf /hdd_zfs_guests
CPU BOGOMIPS:      83997.84
REGEX/SECOND:      4475092
HD SIZE:           2983.47 GB (hdd_zfs_guests)
FSYNCS/SECOND:     75.64
DNS EXT:           58.20 ms
DNS INT:           67.37 ms ()

Sometimes you see FSYNCS/SECOND values between 50 and 60.
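
(A more direct way to measure sync writes on the dataset would be a small fio job, for example the sketch below; fio is not installed by default and is not part of the outputs above.)
Code:
# 4k writes with an fsync after each write - roughly what FSYNCS/SECOND measures
fio --name=fsync-test --directory=/hdd_zfs_guests \
    --rw=write --bs=4k --size=256M --ioengine=psync --fsync=1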

For reference I've attached screenshots from the web interface.

[Hardware and Software Configuration]
Proxmox 6.3-3
Kernel Version Linux: 5.4.78-2-pve #1 SMP PVE 5.4.78-2 (Thu, 03 Dec 2020 14:26:17 +0100)
PVE Manager Version: pve-manager/6.3-3/eee5f901

CPU: Intel Xeon E-2146G
Memory: 64 GB Kingston ECC
Mainboard: Fujitsu (now Kontron): D3644-B
Network: 1x 1 GbE Intel I219-LM onboard for maintenance and the web interface
1x 10 GbE Intel X550T (second port not used)
HDD: 4x Seagate ST4000VN008 (Ironwolf) Configured as ZFS RAID-10 (2x2) for bind mounts (name: hdd_zfs_guests)
SSD: 2x Crucial CT2000MX500 - Configured as ZFS RAID-1 for VMs and CTs (name: hdd_zfs_ssd)
NVMe: 1x Samsung 970 EVO 250 GB as boot device and Proxmox installation drive

Server Build date: Spring 2019 - Upgrade with 2x SSDs for the second pool early November 2020

Number of typical used LXC containers:
9 (8 turned off for the SAMBA tests and all turned off for tests without SAMBA)

Number of typical running VMs:
2 (turned off for the tests)

Samba CT:
privileged LXC container with Debian 10
SAMBA Version: 4.9.5-Debian
SAMBA data is stored on the HDD ZFS pool and mounted into the CT via a bind mount



ZPOOL Status
Code:
root@proxmox02:~# zpool status
  pool: hdd_zfs_guests
 state: ONLINE
  scan: scrub repaired 0B in 0 days 06:14:17 with 0 errors on Sun Jan  3 00:50:29 2021
config:

        NAME                        STATE     READ WRITE CKSUM
        hdd_zfs_guests              ONLINE       0     0     0
          mirror-0                  ONLINE       0     0     0
            wwn-0x5000c500b3a2d8c4  ONLINE       0     0     0
            wwn-0x5000c500b3a2edef  ONLINE       0     0     0
          mirror-1                  ONLINE       0     0     0
            wwn-0x5000c500b38ee3ed  ONLINE       0     0     0
            wwn-0x5000c500b3a2e636  ONLINE       0     0     0

errors: No known data errors

  pool: hdd_zfs_ssd
 state: ONLINE
  scan: scrub repaired 0B in 0 days 00:14:13 with 0 errors on Sat Jan  2 18:50:52 2021
config:

        NAME                        STATE     READ WRITE CKSUM
        hdd_zfs_ssd                 ONLINE       0     0     0
          mirror-0                  ONLINE       0     0     0
            wwn-0x500a0751e4a94a86  ONLINE       0     0     0
            wwn-0x500a0751e4a94af8  ONLINE       0     0     0

errors: No known data errors
root@proxmox02:~#

ZFS List
Code:
NAME                                   USED  AVAIL     REFER  MOUNTPOINT
hdd_zfs_guests                        4.49T  2.91T      176K  /hdd_zfs_guests
hdd_zfs_guests/home                   12.1G  2.91T     12.1G  /hdd_zfs_guests/home
hdd_zfs_guests/shares                  112K  2.91T      112K  /hdd_zfs_guests/shares
hdd_zfs_guests/shares-client_backups  2.33T  2.91T      937G  /hdd_zfs_guests/shares-client_backups
hdd_zfs_guests/shares-incoming         897G  2.91T      897G  /hdd_zfs_guests/shares-incoming
hdd_zfs_guests/shares-install          104K  2.91T      104K  /hdd_zfs_guests/shares-install
hdd_zfs_guests/shares-iso-images      9.76G  2.91T     9.76G  /hdd_zfs_guests/shares-iso-images
hdd_zfs_guests/shares-lost-n-found    75.3G  2.91T     75.3G  /hdd_zfs_guests/shares-lost-n-found
hdd_zfs_guests/shares-maintenance       96K  2.91T       96K  /hdd_zfs_guests/shares-maintenance
hdd_zfs_guests/shares-media            129G  2.91T      129G  /hdd_zfs_guests/shares-media
hdd_zfs_guests/shares-nextcloud        206G  2.91T      206G  /hdd_zfs_guests/shares-nextcloud
hdd_zfs_guests/shares-photos           112K  2.91T      112K  /hdd_zfs_guests/shares-photos
hdd_zfs_guests/shares-plex-library    11.6G  2.91T     9.65G  /hdd_zfs_guests/shares-plex-library
hdd_zfs_guests/shares-server_backup    925M  2.91T      925M  /hdd_zfs_guests/shares-server_backup
hdd_zfs_guests/timemachine             856G  2.91T      765G  /hdd_zfs_guests/timemachine
hdd_zfs_ssd                            349G  1.42T      144K  /hdd_zfs_ssd
hdd_zfs_ssd/subvol-301-disk-0          677M  19.3G      677M  /hdd_zfs_ssd/subvol-301-disk-0
hdd_zfs_ssd/subvol-302-disk-0          735M  7.28G      735M  /hdd_zfs_ssd/subvol-302-disk-0
hdd_zfs_ssd/subvol-401-disk-0          643M  7.37G      643M  /hdd_zfs_ssd/subvol-401-disk-0
hdd_zfs_ssd/subvol-404-disk-0         1.09G  28.9G     1.09G  /hdd_zfs_ssd/subvol-404-disk-0
hdd_zfs_ssd/subvol-406-disk-0         1.36G  6.64G     1.36G  /hdd_zfs_ssd/subvol-406-disk-0
hdd_zfs_ssd/subvol-407-disk-0         1.40G   149G     1.40G  /hdd_zfs_ssd/subvol-407-disk-0
hdd_zfs_ssd/subvol-408-disk-0         1.13G  18.9G     1.13G  /hdd_zfs_ssd/subvol-408-disk-0
hdd_zfs_ssd/subvol-409-disk-0         3.04G  6.96G     3.04G  /hdd_zfs_ssd/subvol-409-disk-0
hdd_zfs_ssd/subvol-410-disk-0         1.33G  6.67G     1.33G  /hdd_zfs_ssd/subvol-410-disk-0
hdd_zfs_ssd/subvol-501-disk-0         3.10G  12.9G     3.10G  /hdd_zfs_ssd/subvol-501-disk-0
hdd_zfs_ssd/vm-100-disk-0             33.0G  1.44T     10.3G  -
hdd_zfs_ssd/vm-1001-disk-0            33.0G  1.45T     1.37G  -
hdd_zfs_ssd/vm-1001-disk-1             103G  1.51T     11.2G  -
hdd_zfs_ssd/vm-1002-disk-0            33.0G  1.45T     1.86G  -
hdd_zfs_ssd/vm-114-disk-0              132G  1.46T     89.1G  -

ZFS get

See zfs_get_all.txt

ARC summary
See arc_summary.txt

dmesg
Code:
dmesg | grep -i ahci
[    1.544825] ahci 0000:00:17.0: version 3.0
[    1.555236] ahci 0000:00:17.0: AHCI 0001.0301 32 slots 6 ports 6 Gbps 0x3f impl SATA mode
[    1.555237] ahci 0000:00:17.0: flags: 64bit ncq sntf pm clo only pio slum part ems deso sadm sds apst
[    1.620251] scsi host0: ahci
[    1.620478] scsi host1: ahci
[    1.620708] scsi host2: ahci
[    1.620881] scsi host3: ahci
[    1.620947] scsi host4: ahci
[    1.621070] scsi host5: ahci

lspci
Code:
root@proxmox02:~# lspci
00:00.0 Host bridge: Intel Corporation 8th Gen Core Processor Host Bridge/DRAM Registers (rev 07)
00:02.0 VGA compatible controller: Intel Corporation Device 3e96
00:12.0 Signal processing controller: Intel Corporation Cannon Lake PCH Thermal Controller (rev 10)
00:14.0 USB controller: Intel Corporation Cannon Lake PCH USB 3.1 xHCI Host Controller (rev 10)
00:14.2 RAM memory: Intel Corporation Cannon Lake PCH Shared SRAM (rev 10)
00:16.0 Communication controller: Intel Corporation Cannon Lake PCH HECI Controller (rev 10)
00:17.0 SATA controller: Intel Corporation Cannon Lake PCH SATA AHCI Controller (rev 10)
00:1b.0 PCI bridge: Intel Corporation Cannon Lake PCH PCI Express Root Port (rev f0)
00:1b.4 PCI bridge: Intel Corporation Cannon Lake PCH PCI Express Root Port (rev f0)
00:1d.0 PCI bridge: Intel Corporation Cannon Lake PCH PCI Express Root Port (rev f0)
00:1f.0 ISA bridge: Intel Corporation Device a309 (rev 10)
00:1f.4 SMBus: Intel Corporation Cannon Lake PCH SMBus Controller (rev 10)
00:1f.5 Serial bus controller [0c80]: Intel Corporation Cannon Lake PCH SPI Controller (rev 10)
00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection (7) I219-LM (rev 10)
02:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981
03:00.0 Ethernet controller: Intel Corporation Ethernet Controller 10G X550T (rev 01)
03:00.1 Ethernet controller: Intel Corporation Ethernet Controller 10G X550T (rev 01)

SMART Values
See smart_sd*.txt

System Temperatures
Code:
root@proxmox02:~/zfs_debug# sensors
coretemp-isa-0000
Adapter: ISA adapter
Package id 0:  +38.0°C  (high = +80.0°C, crit = +100.0°C)
Core 0:        +35.0°C  (high = +80.0°C, crit = +100.0°C)
Core 1:        +35.0°C  (high = +80.0°C, crit = +100.0°C)
Core 2:        +38.0°C  (high = +80.0°C, crit = +100.0°C)
Core 3:        +36.0°C  (high = +80.0°C, crit = +100.0°C)
Core 4:        +36.0°C  (high = +80.0°C, crit = +100.0°C)
Core 5:        +36.0°C  (high = +80.0°C, crit = +100.0°C)

pch_cannonlake-virtual-0
Adapter: Virtual device
temp1:        +47.0°C
 


This looks like a sawtooth function.
Not sure if it helps, but this behaviour is typical for caches kicking in / saturating and then destaging.

I was thinking about SMR as well, but your disks seem to be PMR. That also doesn't fit your description of the change from 6.2 to 6.3.

Have you tried adding an SLOG to your HDD pool to see if that relaxes the situation? I have noticed no difference in behaviour myself, but I am not running such huge copy jobs...
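If you want to give it a try: a log vdev can be added and removed again without touching the data, roughly like this (the device path is only a placeholder for a suitable SSD/NVMe partition):
Code:
# add a dedicated SLOG device to the HDD pool (placeholder path!)
zpool add hdd_zfs_guests log /dev/disk/by-id/<your-slog-device>
# ... test ...
# remove it again if it does not help
zpool remove hdd_zfs_guests /dev/disk/by-id/<your-slog-device>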
 
Not sure if it helps, but this behaviour is typical for caches kicking in / saturating and then destaging.

I was thinking about SMR as well, but your disks seem to be PMR. That also doesn't fit your description of the change from 6.2 to 6.3.

Have you tried adding an SLOG to your HDD pool to see if that relaxes the situation? I have noticed no difference in behaviour myself, but I am not running such huge copy jobs...
Thanks for your answer.

Yes, it looks like cache saturation, but the system ran for 18 months without this issue.
When I was building the system I ran some extensive tests, and I could always reach at least 130 MByte/s write speed to the HDDs.
So I could always saturate a 1 gigabit link with headroom to spare. The problem started with the upgrade to 6.3.

Additional log drives would surely relax the situation, but that is just a workaround; I would like to find the root cause of the problem.

If you look at the atop screenshots:
A block- or stream-based write at 8 MByte/s is a really bad result for a modern HDD. The drives are 90% busy for large writes? That makes me wonder. This can only happen if the drives are completely fragmented, are defective, or the driver that issues the write requests fragments them into small chunks.
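(To rule out the "defective drives" theory, the usual SMART counters can be checked on each pool member; just a sketch, sdb to sde are example device names:)
Code:
# look for reallocated / pending sectors on every disk of the HDD pool
for d in sdb sdc sdd sde; do
    echo "== /dev/$d =="
    smartctl -A /dev/$d | grep -Ei 'Reallocated_Sector|Current_Pending|Offline_Uncorrectable'
done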
 
When I was building the system I ran some extensive tests, and I could always reach at least 130 MByte/s write speed to the HDDs.
I understand. However, those tests were tied to a specific situation, so they may not be applicable anymore once the pool is in use and perhaps fragmented.
Keep in mind that sequential access to a disk is way faster than random access.
I always read that fragmentation is not an issue for ZFS, but once the heads need to reposition (which happens when there are overwrites / deletes), things can get really slow compared to true sequential access.

Again, that does not line up with your upgrade, so I think the best approach would be to research the ZFS changes in the ZOL version currently in use. Software gets modified and behaves differently or even needs new tuning. It could also be totally unrelated to ZFS itself and instead be tied to the firmware of your controller or new drivers ...
 
Have you tried adding an SLOG to your HDD pool to see if that relaxes the situation? I have noticed no difference in behaviour myself, but I am not running such huge copy jobs...
Just one question regarding the SLOG:

When setting sync=disabled for testing purposes, I should see the same performance increase as with an SLOG device, right?

I know that disabling sync is in general a bad idea, but as a quick and dirty test it should be fine.
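For the test I would do something like this (just a sketch, dataset name as above, default restored afterwards):
Code:
zfs get sync hdd_zfs_guests            # should show "standard" by default
zfs set sync=disabled hdd_zfs_guests   # treat all writes as async - test only!
# ... run the copy test ...
zfs set sync=standard hdd_zfs_guests   # back to the default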

Thanks!

Christian
 
To be honest: I don't know.
But an SLOG is primarily used for sync writes, so I assume it should have the same effect.
 
OK. In the meantime I did a first test with sync=disabled. Same effect: the HDDs are painfully slow on writes.
The FSYNCS increase dramatically, but the write issue remains the same.

This is driving me nuts. I had a perfectly working system for so long.
 
Perhaps @fabian or @Dominic have an idea?
My guess is that some driver or whatever has changed (could be ZOL version as well). Other than that I am running out of ideas.
 
my guess is also that your pool is starting to get fuller (or your access pattern/load changed), and your spinning disks can't keep up with the now more random access (atop shows them busy with high wait times and lots of requests but little throughput)
 
Hi,

Some ideas:

- If you want to optimise your pool, do not concentrate on the low FSYNC value, because in your case you do not have much SYNC IO:

ZIL committed transactions: 125.7k

- I would try a test: create a plain file, e.g. head -c 100G </dev/urandom > myfile, and then copy this file to the target pool (maybe /dev/urandom is a bottleneck on your system). During the copy it will be interesting to see:

zpool iostat -q 2
arc_summary (at the end of the test)
zpool list -v
- 3 successive tests would be OK
- zfs get all dataset_source (where myfile is) and the same for dataset_destination
- What ashift do you have (maybe 12?) on both pools?
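
Put together, the whole test could look roughly like this (only a sketch, using the pool paths from this thread):
Code:
# create the source file once on the SSD pool
head -c 100G </dev/urandom > /hdd_zfs_ssd/random_data.ran
# copy it to the HDD pool; run the iostat command in a second shell during the copy
pv -pra /hdd_zfs_ssd/random_data.ran > /hdd_zfs_guests/random_data.ran
zpool iostat -q 2                     # during the copy
arc_summary                           # after the copy
zpool list -v
zfs get all hdd_zfs_ssd               # source dataset
zfs get all hdd_zfs_guests            # destination dataset
zdb -C hdd_zfs_guests | grep ashift   # one way to check the ashift in use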

Good luck /Bafta !
 
you'll see that the
data is first copied at full line speed, at a rate of 110 MByte/s. After a few seconds the performance drops to below 1 MByte/s and stays there for roughly a minute

Do these spikes occur about every 5 seconds? If not, what is the time between two spikes?
 
my guess is also that your pool is starting to get fuller (or your access pattern/load changed), and your spinning disks can't keep up with the now more random access (atop shows them busy with high wait times and lots of requests but little throughput)
Hello Fabian,

thanks for the reply. I did some testing over the weekend. Trusting my backup, I reverted to ZFS 0.8.3.
The transfer rate increased massively. The dips still occur, but they only go down to roughly 40 MByte/s, which is slower than before the upgrade to 6.3 but much better than with 0.8.5.

The use case didn't change. It is always the same: transferring huge files over SMB to the pool, or from the SSD pool to the HDD pool. The pool itself is 45% full.

I went back to 0.8.5 -> the problem is there again. Performance drops to a few hundred kByte/s.
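
(For reference, the ZFS version that is actually active can be checked like this; these are standard commands, nothing Proxmox-specific:)
Code:
zfs version                       # userland and kernel module version (ZoL >= 0.8)
modinfo zfs | grep -i '^version'  # version of the loaded kernel module
dpkg -l | grep -E 'zfsutils|zfs-initramfs|libzfs|libzpool'   # installed packages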

Cheers,

Christian
 
Hi,

Some ideas:

- If you want to optimise your pool, do not concentrate on the low FSYNC value, because in your case you do not have much SYNC IO:

ZIL committed transactions: 125.7k

- I would try a test: create a plain file, e.g. head -c 100G </dev/urandom > myfile, and then copy this file to the target pool (maybe /dev/urandom is a bottleneck on your system). During the copy it will be interesting to see:

zpool iostat -q 2
arc_summary (at the end of the test)
zpool list -v
- 3 successive tests would be OK
- zfs get all dataset_source (where myfile is) and the same for dataset_destination
- What ashift do you have (maybe 12?) on both pools?

Good luck /Bafta !
Thanks for your help.

I will do the test tomorrow.
Just one remark: as you can see in my previous post, I first created the random file and only afterwards transferred it from the SSD pool to the HDD pool. So even if /dev/urandom were a bottleneck, it would not affect the copy speed.

And yes: ashift on both pools is 12.

So just for my understanding: when is a write synchronous and when is it asynchronous in the sense of ZFS?

Thanks!

Christian
 
So just for my understanding: when is a write synchronous and when is it asynchronous in the sense of ZFS?

Hi,

Most of the time, and for most applications, IO will be ASYNC. SYNC IO is typical for database engines and mail servers (because they do direct IO). There can also be applications that do SYNC IO because of bad design or bugs, but that is the exception, not the general rule. All ASYNC writes first go into the ARC and are written out to the pool with the next transaction group, after at most 5 seconds by default.
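
A quick way to see the difference on the pool itself is to compare a normal (buffered) write with one that forces sync semantics, and to look at the transaction-group interval. Just a sketch, the file names are only examples:
Code:
# async: lands in RAM first and is flushed with the next transaction group
dd if=/dev/zero of=/hdd_zfs_guests/test_async.bin bs=1M count=1024
# sync: every write waits until it is safe on stable storage (ZIL)
# (with compression enabled, zeros compress away - use a real file for meaningful numbers)
dd if=/dev/zero of=/hdd_zfs_guests/test_sync.bin bs=1M count=1024 oflag=sync
# default flush interval for transaction groups, in seconds
cat /sys/module/zfs/parameters/zfs_txg_timeout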


What intrigues me is that I have a setup very similar to yours (samba4 in an Ubuntu and a CentOS CT, starting with PMX 5 and later upgraded to version 6), but I do not encounter a performance degradation like yours.

Good luck / Bafta !
 
Hello Fabian,

thanks for the reply. I did some testing over the weekend. Trusting my backup, I reverted to ZFS 0.8.3.
The transfer rate increased massively. The dips still occur, but they only go down to roughly 40 MByte/s, which is slower than before the upgrade to 6.3 but much better than with 0.8.5.

The use case didn't change. It is always the same: transferring huge files over SMB to the pool, or from the SSD pool to the HDD pool. The pool itself is 45% full.

I went back to 0.8.5 -> the problem is there again. Performance drops to a few hundred kByte/s.

Cheers,

Christian
which packages did you downgrade?
 
Hi,

Some ideas:

- If you want to optimise your pool, do not concentrate on the low FSYNC value, because in your case you do not have much SYNC IO:

ZIL committed transactions: 125.7k

- I would try a test: create a plain file, e.g. head -c 100G </dev/urandom > myfile, and then copy this file to the target pool (maybe /dev/urandom is a bottleneck on your system). During the copy it will be interesting to see:

zpool iostat -q 2
arc_summary (at the end of the test)
zpool list -v
- 3 successive tests would be OK
- zfs get all dataset_source (where myfile is) and the same for dataset_destination
- What ashift do you have (maybe 12?) on both pools?

Good luck /Bafta !
Hi,

here is the result for zpool list -v:
Code:
root@proxmox02:~# zpool list -v
NAME                         SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
hdd_zfs_guests              7.25T  4.24T  3.01T        -         -    25%    58%  1.19x    ONLINE  -
  mirror                    3.62T  2.12T  1.50T        -         -    24%  58.6%      -  ONLINE
    wwn-0x5000c500b3a2d8c4      -      -      -        -         -      -      -      -  ONLINE
    wwn-0x5000c500b3a2edef      -      -      -        -         -      -      -      -  ONLINE
  mirror                    3.62T  2.12T  1.50T        -         -    26%  58.5%      -  ONLINE
    wwn-0x5000c500b38ee3ed      -      -      -        -         -      -      -      -  ONLINE
    wwn-0x5000c500b3a2e636      -      -      -        -         -      -      -      -  ONLINE
hdd_zfs_ssd                 1.81T   237G  1.58T        -         -     6%    12%  1.00x    ONLINE  -
  mirror                    1.81T   237G  1.58T        -         -     6%  12.8%      -  ONLINE
    wwn-0x500a0751e4a94a86      -      -      -        -         -      -      -      -  ONLINE
    wwn-0x500a0751e4a94af8      -      -      -        -         -      -      -      -  ONLINE



The interesting part is that the first 4 tests went through more or less flawlessly, which drove me crazy. So I started a fifth test, and now the problems appeared.
Also interesting: the IO load didn't go up that much.
Maybe it has something to do with the reboot I did for some tests with other kernels. I have the feeling that the problems accumulate over time.

Chris
 

