Verify jobs - Terrible IO performance

harmonyp

Member
Nov 26, 2020
I already explained that. The above command won't create a normal raid10. It creates two striped quad mirrors.
And I also wrote the correct command for a raid10 some posts above:




zfs set recordsize=1M YourPoolName/DatasetUsedAsDatastore
Sorry I misread.

One last question: what blocksize do you suggest I set for an average PBS backup-only server?

I have only set the blocksize for the special device, not for the pool.

zfs set special_small_blocks=4K zfs
 

Dunuin

Famous Member
Jun 30, 2020
Germany
Sorry I misread.

One last question: what blocksize do you suggest I set for an average PBS backup-only server?
Best you don't store your datastore on the pool's root, but create a new dataset for it first. You can then set the recordsize to 1M just for that dataset, so only the PBS datastore with all of its 4MB chunk files will use that big recordsize. Everything else can then continue using the default 128K recordsize, which is more universal.
I have only set the blocksize for the special device, not for the pool.

zfs set special_small_blocks=4K zfs
That's not really the blocksize. It still uses whatever you set as recordsize/volblocksize for the pool. "special_small_blocks" defines if and what data will be stored on the SSDs instead of the HDDs. You set it to 4K, so only data between 512B and 2K, plus all metadata, will be stored on the SSDs. You used an ashift of 12 (=4K), so no data at all should be stored on the SSDs, as all data will be at least 4K in size.
See here: https://pve.proxmox.com/wiki/ZFS_on_Linux#sysadmin_zfs_special_device
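As a quick illustration (assuming a reasonably recent OpenZFS and the pool name "zfs" used in this thread), you can check what is currently set, and you could raise the threshold so small files actually land on the SSDs:

Code:
# show the current settings of the pool "zfs"
zfs get special_small_blocks zfs
zpool get ashift zfs

# illustrative only: store blocks up to 16K on the special vdev
# (affects only data written after the change)
zfs set special_small_blocks=16K zfs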
 

harmonyp

Member
Nov 26, 2020
Best you don't store your datastore on the pool's root, but create a new dataset for it first. You can then set the recordsize to 1M just for that dataset, so only the PBS datastore with all of its 4MB chunk files will use that big recordsize. Everything else can then continue using the default 128K recordsize, which is more universal.

That's not really the blocksize. It still uses whatever you set as recordsize/volblocksize for the pool. "special_small_blocks" defines if and what data will be stored on the SSDs instead of the HDDs. You set it to 4K, so only data between 512B and 2K, plus all metadata, will be stored on the SSDs. You used an ashift of 12 (=4K), so no data at all should be stored on the SSDs, as all data will be at least 4K in size.
See here: https://pve.proxmox.com/wiki/ZFS_on_Linux#sysadmin_zfs_special_device
I am fairly new to ZFS. What do you mean by "don't store your datastore on the pool's root, but create a new dataset for it first. You can then set the recordsize to 1M just for that dataset, so only the PBS datastore with all of its 4MB chunk files will use that big recordsize", and how would I go about doing that?
 

Dunuin

Famous Member
Jun 30, 2020
Germany
I am fairly new to ZFS. What do you mean by "don't store your datastore on the pool's root, but create a new dataset for it first. You can then set the recordsize to 1M just for that dataset, so only the PBS datastore with all of its 4MB chunk files will use that big recordsize", and how would I go about doing that?
You should learn a little bit about the ZFS basics.

This is part of a PVE encryption tutorial I am writing right now, where I try to explain the ZFS terminology:

4 A.) ZFS Definition

The first thing people often get wrong is that ZFS isn't just a software raid. It's way more than that. It's software raid, it's a volume manager like LVM, and it's even a filesystem. It's a complete enterprise-grade all-in-one package that manages everything from the individual disks down to single files and folders.
You really have to read some books or at least several tutorials to really understand what it is doing and how it is doing it. It's very different compared to traditional raid or filesystems. So don't make the mistake of thinking it will work like the other tools you have used so far and are familiar with.


Maybe I should explain some common ZFS terms so you can follow the tutorial a bit better:

  • Vdev:
    Vdev is the short form of "virtual device" and means a single disk or a group of disks that are pooled together. So for example a single disk could be a vdev. A raidz1/raidz2 (aka raid5/raid6) of multiple disks could be a vdev. Or a mirror of 2 or more disks could be a vdev.
    All vdevs have in common that no matter how many disks that vdev consists of, the IOPS performance of that vdev won't be faster than the single slowest disk that is part of that vdev.
    So you can do a raidz1 (raid5) of 100 HDDs and get great throughput performance and data-to-parity ratio, but IOPS performance will still be the same as a vdev that is just a single HDD. So think of a vdev like a single virtual device that can only do one thing at a time and needs to wait for all member disks to finish their work before the next operation can start.
  • Stripe:
    When you want to get more IOPS performance you will have to stripe multiple vdevs. You could for example stripe multiple mirror vdevs (aka raid1) to form a striped mirror (aka raid10). Striping vdevs will add up the capacity of each of the vdevs, and the IOPS performance will increase with the number of striped vdevs. So if you got 4 mirror vdevs of 2 disks each and stripe these 4 mirror vdevs together, then you will get four times the IOPS performance, as work will be split across all vdevs and done in parallel. But be aware that as soon as you lose a single complete vdev, the data on all vdevs is lost. So when you need IOPS performance it's better to have multiple small vdevs that are striped together than having just a single big vdev. I wouldn't recommend it, but you could even stripe a mirror vdev (raid1) and a raidz1 vdev (raid5) to form something like a raid510 ;-).
  • Pool:
    A pool is the biggest possible ZFS construct and can consist of a single vdev or multiple vdevs that are striped together. But it can't be multiple vdevs that are not striped together. If you want multiple mirrors (raid1) but don't want a striped mirror (raid10) you will have to create multiple pools. All pools are completely independent.
  • Zvol:
    A zvol is a volume: a block device. Think of it like an LV if you are familiar with LVM, or like a virtual disk. It can't store files or folders on its own, but you can format it with the filesystem of your choice and store files/folders on that filesystem. PVE will use these zvols to store the virtual disks of your VMs.
  • Volblocksize:
    Every block device has a fixed block size that it will work with. For HDDs this is called a sector, which nowadays usually is 4KB in size. That means no matter how small or how big your data is, it has to be stored/read in full blocks that are a multiple of the block size. If you want to store 1KB of data on a HDD it will still consume a full 4KB, as a HDD knows nothing smaller than a single block. And when you want to store 42KB it will write 11 full blocks, so 44KB will be consumed to store it. What the sector size is for a HDD, the volblocksize is for a zvol. The bigger your volblocksize gets, the more capacity you will waste and the more performance you will lose when storing/accessing small amounts of data. Every zvol can use a different volblocksize, but it can only be set once at the creation of the zvol and not changed later. And when using a raidz1/raidz2/raidz3 vdev you will need to change it, because the default volblocksize of 8K is too small for that.
  • Ashift:
    The ashift is defined pool wide at creation, can't be changed later and is the smallest block size a pool can work with. Usually, you want it to be the same as the biggest sector size of all your disks the pool consists of. Let's say you got some HDDs that report using a physical sector size of 512B and some that report using a physical sector size of 4K. Then you usually want the ashift to be 4K too, as everything smaller would cause massive read/write amplification when reading/writing from the disks that can't handle blocks smaller than 4K. But you can't just write ashift=4K. Ashift is noted as 2^X where you just set the X. So if you want your pool to use a 512B block size you will have to use an ashift of 9 (because 2^9 = 512). If you want a block size of 4K you need to write ashift=12 (because 2^12 = 4096) and so on.
  • Dataset:
    As I already mentioned before, ZFS is also a filesystem. This is where datasets come into play. The root of the pool itself is also handled like a dataset, so you can directly store files and folders on it. Each dataset is its own filesystem, so don't think of them as normal folders, even if you can nest them like this: YourPool/FirstDataset/SecondDataset/ThirdDataset.
    When PVE creates virtual disks for LXCs, it won't use zvols like for VMs, it will use datasets instead. The root filesystem PVE uses is also a dataset.
  • Recordsize: Everything a dataset stores is stored in records. The size of a record is dynamic and will be a multiple of the ashift, but will never be bigger than the recordsize. The default recordsize is 128K. So with an ashift of 12 (so 4K) and a recordsize of 128K, a record can be 4K, 8K, 16K, 32K, 64K or 128K. If you now want to save a 50K file it will store it as a 64K record. If you want to store a 6K file it will create an 8K record. So it will always use the next bigger possible record size. With files that are bigger than the recordsize this is a bit different: when storing a 1M file it will create eight 128K records. So the recordsize is usually not as critical as the volblocksize for zvols, as it is quite versatile because of its dynamic nature. (See the short command sketch right after this list for how to inspect these properties.)
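A minimal sketch of how you could look these values up, purely as an illustration; the pool/dataset/zvol names here (YourPool, YourPool/SomeDataset, YourPool/vm-100-disk-0) are placeholders:

Code:
# pool-wide ashift (shown as the exponent, e.g. 12 = 2^12 = 4K)
zpool get ashift YourPool

# recordsize of a dataset (placeholder name)
zfs get recordsize YourPool/SomeDataset

# volblocksize of a zvol, e.g. a PVE VM disk (placeholder name)
zfs get volblocksize YourPool/vm-100-disk-0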

You can create a dataset with zfs create YourPool/NameOfNewDataset. As datasets are filesystems, you can store files and folders in them. A PBS datastore is just a folder, so you could use the path "/YourPool/NameOfNewDataset/" as your datastore. And when doing a zfs set recordsize=1M YourPool/NameOfNewDataset, only files stored in that dataset will use a 1M recordsize. Everything else still uses the default 128K recordsize.
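Put together, and sticking with the placeholder names from above (the datastore name "MyDatastore" is made up as well), it might look like this:

Code:
zfs create YourPool/NameOfNewDataset
zfs set recordsize=1M YourPool/NameOfNewDataset
# point a PBS datastore at that dataset's mountpoint
proxmox-backup-manager datastore create MyDatastore /YourPool/NameOfNewDataset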
 

harmonyp

Member
Nov 26, 2020
Makes sense, thanks for the replies. I was wondering why I could not create a virtual machine when I set the whole pool to 4MB in the GUI.

For now I have run the following commands:

Code:
zpool create -f zfs -o ashift=12 mirror /dev/sda /dev/sdb mirror /dev/sdc /dev/sdd mirror /dev/sde /dev/sdf mirror /dev/sdg /dev/sdh special mirror /dev/nvme0n1 /dev/nvme1n1
zfs set special_small_blocks=4K zfs
zfs set compression=lz4 zfs

zfs create zfs/PBS
zfs set recordsize=1M zfs/PBS

proxmox-backup-manager datastore create TEST /zfs/PBS/TEST

I did want to set it to 4M but it threw the following error. A few mentions I found online suggested leaving it at 1M to prevent any unknown errors, but those were from 2017, so I guess it's been tested more since then. I just can't find how to change it so I can test the performance.

Code:
cannot set property for 'zfs/PBS': 'recordsize' must be power of 2 from 512B to 1M

From reading your last reply, I just want to confirm that ashift=12 is suitable?

Code:
Disk /dev/sdf: 12.73 TiB, 14000519643136 bytes, 27344764928 sectors
Disk model: WUH721414AL5201
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
 

Dunuin

Famous Member
Jun 30, 2020
Germany
I did want to set it to 4M but it threw the following error. A few mentions I found online suggested leaving it at 1M to prevent any unknown errors, but those were from 2017, so I guess it's been tested more since then. I just can't find how to change it so I can test the performance.

Code:
cannot set property for 'zfs/PBS': 'recordsize' must be power of 2 from 512B to 1M
You have to change "zfs_max_recordsize" in the ZFS module config first. It supports up to 16M but defaults to 1M.
And you will have to enable the "large_blocks" feature of your pool first.
See here: https://openzfs.github.io/openzfs-docs/Performance and Tuning/Module Parameters.html#zfs-max-recordsize
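A rough sketch of what that could look like, assuming the pool name "zfs" and dataset "zfs/PBS" from this thread; whether 4M records actually help is something you'd have to benchmark:

Code:
# raise the module limit on the running system (value in bytes, 16M here)
echo 16777216 > /sys/module/zfs/parameters/zfs_max_recordsize

# make it persistent, e.g. via /etc/modprobe.d/zfs.conf:
# options zfs zfs_max_recordsize=16777216

# make sure the pool feature is enabled, then set the bigger recordsize
zpool set feature@large_blocks=enabled zfs
zfs set recordsize=4M zfs/PBS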
From reading your last reply, I just want to confirm that ashift=12 is suitable?
Jup.
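If you want to double-check it yourself, a quick look at the reported sector sizes and the pool's ashift (illustrative commands):

Code:
# physical/logical sector sizes of all block devices
lsblk -o NAME,MODEL,PHY-SEC,LOG-SEC

# the ashift the pool "zfs" was created with
zpool get ashift zfs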
 

Dunuin

Famous Member
Jun 30, 2020
Germany
@Dunuin Question regarding the special devices /dev/nvme0n1 /dev/nvme1n1

How can I check how full they are?
With zpool list -v:
Code:
root@MainNAS[~]# zpool list -v
NAME                                             SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
HDDpool                                         29.3T  22.2T  7.12T        -     -     3%    75%  1.00x    ONLINE  /mnt
  raidz1-0                                      29.1T  22.1T  6.98T        -     -     3%  76.0%      -    ONLINE
    gptid/c368738b-a623-11ec-aaa2-002590467989  7.28T      -      -        -     -      -      -      -    ONLINE
    gptid/c37c8a34-a623-11ec-aaa2-002590467989  7.28T      -      -        -     -      -      -      -    ONLINE
    gptid/c379fcd0-a623-11ec-aaa2-002590467989  7.28T      -      -        -     -      -      -      -    ONLINE
    gptid/c37b3604-a623-11ec-aaa2-002590467989  7.28T      -      -        -     -      -      -      -    ONLINE
special                                             -      -      -        -     -      -      -      -  -
  mirror-1                                       186G  47.5G   139G        -     -    45%  25.5%      -    ONLINE
    gptid/c2e04c14-a623-11ec-aaa2-002590467989   186G      -      -        -     -      -      -      -    ONLINE
    gptid/c2e45ed6-a623-11ec-aaa2-002590467989   186G      -      -        -     -      -      -      -    ONLINE
So here my special device mirror has 47.5 of 186GiB used. Once it is 75% full, metadata/data will spill over to the normal data vdevs (in my case the raidz1).
 

harmonyp

Member
Nov 26, 2020
With zpool list -v:
Code:
root@MainNAS[~]# zpool list -v
NAME                                             SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
HDDpool                                         29.3T  22.2T  7.12T        -     -     3%    75%  1.00x    ONLINE  /mnt
  raidz1-0                                      29.1T  22.1T  6.98T        -     -     3%  76.0%      -    ONLINE
    gptid/c368738b-a623-11ec-aaa2-002590467989  7.28T      -      -        -     -      -      -      -    ONLINE
    gptid/c37c8a34-a623-11ec-aaa2-002590467989  7.28T      -      -        -     -      -      -      -    ONLINE
    gptid/c379fcd0-a623-11ec-aaa2-002590467989  7.28T      -      -        -     -      -      -      -    ONLINE
    gptid/c37b3604-a623-11ec-aaa2-002590467989  7.28T      -      -        -     -      -      -      -    ONLINE
special                                             -      -      -        -     -      -      -      -  -
  mirror-1                                       186G  47.5G   139G        -     -    45%  25.5%      -    ONLINE
    gptid/c2e04c14-a623-11ec-aaa2-002590467989   186G      -      -        -     -      -      -      -    ONLINE
    gptid/c2e45ed6-a623-11ec-aaa2-002590467989   186G      -      -        -     -      -      -      -    ONLINE
So here my special device mirror has 47.5 of 186GiB used. Once it is 75% full, metadata/data will spill over to the normal data vdevs (in my case the raidz1).
Great, here is mine:

Code:
NAME          SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
zfs          51.3T  8.75T  42.6T        -         -     2%    17%  1.00x    ONLINE  -
  mirror-0   12.7T  2.15T  10.6T        -         -     2%  16.9%      -    ONLINE
    sda      12.7T      -      -        -         -      -      -      -    ONLINE
    sdb      12.7T      -      -        -         -      -      -      -    ONLINE
  mirror-1   12.7T  2.18T  10.5T        -         -     2%  17.1%      -    ONLINE
    sdc      12.7T      -      -        -         -      -      -      -    ONLINE
    sdd      12.7T      -      -        -         -      -      -      -    ONLINE
  mirror-2   12.7T  2.17T  10.5T        -         -     2%  17.1%      -    ONLINE
    sde      12.7T      -      -        -         -      -      -      -    ONLINE
    sdf      12.7T      -      -        -         -      -      -      -    ONLINE
  mirror-3   12.7T  2.17T  10.6T        -         -     2%  17.0%      -    ONLINE
    sdg      12.7T      -      -        -         -      -      -      -    ONLINE
    sdh      12.7T      -      -        -         -      -      -      -    ONLINE
special          -      -      -        -         -      -      -      -  -
  mirror-4    476G  90.1G   386G        -         -     5%  18.9%      -    ONLINE
    nvme0n1   477G      -      -        -         -      -      -      -    ONLINE
    nvme1n1   477G      -      -        -         -      -      -      -    ONLINE

Can you modify that 75% value? Why not set it closer to 95%?
 

Dunuin

Famous Member
Jun 30, 2020
Germany
Can you modify that 75% value? Why not set it closer to 95%?
As far as I know that is hardcoded. And you shouldn't fill a vdev that much anyway. The ZFS documentation recommends not filling a pool more than 80% for best performance. If you fill it too much it will fragment and become slow.
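If you want to keep an eye on that, a quick (illustrative) per-pool overview of usage and fragmentation:

Code:
zpool list -o name,size,allocated,free,capacity,fragmentation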
 

niziak

Member
Apr 18, 2020
Thanks for this thread.
I don't have a fast SSD/NVMe for metadata yet. I just added a consumer SSD as L2ARC. I found that switching the L2ARC policy to MFU-only also helps a lot (the cache is not flooded with every new backup):
Add these ZFS module parameters to /etc/modprobe.d/zfs.conf:
Code:
options zfs l2arc_mfuonly=1 l2arc_noprefetch=0
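To apply this without a reboot and make sure it also sticks when ZFS is loaded from the initramfs (e.g. ZFS on root), something along these lines should work (a sketch, not from the post above):

Code:
# apply at runtime
echo 1 > /sys/module/zfs/parameters/l2arc_mfuonly
echo 0 > /sys/module/zfs/parameters/l2arc_noprefetch

# verify what is currently in effect
grep . /sys/module/zfs/parameters/l2arc_mfuonly /sys/module/zfs/parameters/l2arc_noprefetch

# rebuild the initramfs so the modprobe.d options are picked up at boot
update-initramfs -u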
 

mow

New Member
Nov 3, 2022
With zpool list -v:
Code:
root@MainNAS[~]# zpool list -v
NAME                                             SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
HDDpool                                         29.3T  22.2T  7.12T        -     -     3%    75%  1.00x    ONLINE  /mnt
  raidz1-0                                      29.1T  22.1T  6.98T        -     -     3%  76.0%      -    ONLINE
So zpool always reports the raw size, and if you allocate 30GB, FREE will go down and ALLOC go up by 40GB? https://openzfs.github.io/openzfs-docs/man/8/zpool-list.8.html unfortunately doesn't explain what is actually shown in those columns.
 
