Verify jobs - Terrible IO performance

I already explained that. The above command won't create a normal raid10. It creates two striped quad mirrors.
And I also wrote the correct command for a raid10 some posts above:
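For reference, a striped mirror (raid10) is built by listing several two-disk mirror vdevs in a single zpool create, like the command used later in this thread; the pool name and device paths below are only placeholders:

Code:
zpool create -o ashift=12 YourPoolName mirror /dev/sda /dev/sdb mirror /dev/sdc /dev/sdd mirror /dev/sde /dev/sdf mirror /dev/sdg /dev/sdh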




zfs set recordsize=1M YourPoolName/DatasetUsedAsDatastore
 
Sorry I misread.

One last question what do you suggest I set the blocksize to for an average PBS backup only server.
Best you don't store your datastore on the pool's root, but create a new dataset for it first. You can then set the recordsize to 1M just for that dataset, so only the PBS datastore with all of its 4MB chunk files will use that big recordsize. Everything else can then continue using the default 128K recordsize, which is more universal.
I have only set the blocksize for special device not the pool.

zfs set special_small_blocks=4K zfs
That's not really the blocksize. It still uses whatever you set as recordsize/volblocksize for the pool. "special_small_blocks" defines whether and which data will be stored on the SSDs instead of the HDDs. You set it to 4K, so only data between 512B and 2K plus all metadata will be stored on the SSDs. You used an ashift of 12 (=4K), so no data at all should be stored on the SSDs, as all data will be at least 4K in size.
See here: https://pve.proxmox.com/wiki/ZFS_on_Linux#sysadmin_zfs_special_device
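To see what is currently set, and as a hedged example of raising the threshold so that small file blocks would actually land on the SSDs ("zfs" is the pool name used in this thread, 64K is just an example value):

Code:
zfs get special_small_blocks,recordsize zfs
zfs set special_small_blocks=64K zfs   # example only; see the wiki link above for the exact threshold semantics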
 
I am fairly new to ZFS. What do you mean by "don't store your datastore on the pool's root, but create a new dataset for it first. You can then set the recordsize to 1M just for that dataset, so only the PBS datastore with all of its 4MB chunk files will use that big recordsize"? How would I go about doing that?
 
You should learn a little bit about the ZFS basics.

This is part of a PVE encryption tutorial I am writing right now, where I try to explain the ZFS terminology:

4 A.) ZFS Definition

The first thing people often get wrong is that ZFS isn't just a software raid. It's way more than that. It's a software raid, it's a volume manager like LVM, and it's even a filesystem. It's a complete enterprise-grade all-in-one package that manages everything from the individual disks down to single files and folders.
You really have to read some books or at least several tutorials to understand what it is doing and how it is doing it. It's very different compared to traditional raid or filesystems. So don't make the mistake of thinking it will work like the other tools you have used so far and are familiar with.


Maybe I should explain some common ZFS terms so you can follow the tutorial a bit better:

  • Vdev:
    Vdev is the short form of "virtual device" and means a single disk or a group of disks that are pooled together. So for example a single disk could be a vdev. A raidz1/raidz2 (aka raid5/raid6) of multiple disks could be a vdev. Or a mirror of 2 or more disks could be a vdev.
    All vdevs have in common that no matter how many disks that vdev consists of, the IOPS performance of that vdev won't be faster than the single slowest disk that is part of that vdev.
    So you can do a raidz1 (raid5) of 100 HDDs and get great throughput performance and data-to-parity ratio, but IOPS performance will still be the same as a vdev that is just a single HDD. So think of a vdev like a single virtual device that can only do one thing at a time and needs to wait for all member disks to finish their work before the next operation can be started.
  • Stripe:
    When you want more IOPS performance you will have to stripe multiple vdevs. You could for example stripe multiple mirror vdevs (aka raid1) to form a striped mirror (aka raid10). Striping vdevs will add up the capacity of each of the vdevs, and the IOPS performance will increase with the number of striped vdevs. So if you got 4 mirror vdevs of 2 disks each and stripe these 4 mirror vdevs together, you will get four times the IOPS performance, as work will be split across all vdevs and be done in parallel. But be aware that as soon as you lose a single complete vdev, the data on all vdevs is lost. So when you need IOPS performance it's better to have multiple small vdevs that are striped together than just a single big vdev. I wouldn't recommend it, but you could even stripe a mirror vdev (raid1) and a raidz1 vdev (raid5) to form something like a raid510 ;-).
  • Pool:
    A pool is the biggest possible ZFS construct and can consist of a single vdev or multiple vdevs that are striped together. But it can't be multiple vdevs that are not striped together. If you want multiple mirrors (raid1) but don't want a striped mirror (raid10) you will have to create multiple pools. All pools are completely independent.
  • Zvol:
    A zvol is a volume, a block device. Think of it like an LV if you are familiar with LVM, or like a virtual disk. It can't store files or folders on its own, but you can format it with the filesystem of your choice and store files/folders on that filesystem. PVE will use these zvols to store the virtual disks of your VMs.
  • Volblocksize:
    Every block device has a fixed block size that it works with. For HDDs this is called a sector, which nowadays usually is 4KB in size. That means no matter how small or how big your data is, it has to be stored/read in full blocks that are a multiple of the block size. If you want to store 1KB of data on a HDD it will still consume the full 4KB, as a HDD knows nothing smaller than a single block. And when you want to store 42KB it will write 11 full blocks, so 44KB will be consumed to store it. What the sector size is for a HDD, the volblocksize is for a zvol. The bigger your volblocksize gets, the more capacity you will waste and the more performance you will lose when storing/accessing small amounts of data. Every zvol can use a different volblocksize, but this can only be set once at the creation of the zvol and not changed later. And when using a raidz1/raidz2/raidz3 vdev you will need to change it, because the default volblocksize of 8K is too small for that.
  • Ashift:
    The ashift is defined pool wide at creation, can't be changed later and is the smallest block size a pool can work with. Usually, you want it to be the same as the biggest sector size of all your disks the pool consists of. Let's say you got some HDDs that report using a physical sector size of 512B and some that report using a physical sector size of 4K. Then you usually want the ashift to be 4K too, as everything smaller would cause massive read/write amplification when reading/writing from the disks that can't handle blocks smaller than 4K. But you can't just write ashift=4K. Ashift is noted as 2^X where you just set the X. So if you want your pool to use a 512B block size you will have to use an ashift of 9 (because 2^9 = 512). If you want a block size of 4K you need to write ashift=12 (because 2^12 = 4096) and so on.
  • Dataset:
    As I already mentioned, ZFS is also a filesystem. This is where datasets come into play. The root of the pool itself is also handled like a dataset, so you can directly store files and folders on it. Each dataset is its own filesystem, so don't think of them as normal folders, even if you can nest them like this: YourPool/FirstDataset/SecondDataset/ThirdDataset.
    When PVE creates virtual disks for LXCs, it won't use zvols like it does for VMs; it will use datasets instead. The root filesystem PVE uses is also a dataset.
  • Recordsize: Everything a dataset stores is stored in records. The size of a record is dynamic; it will be a multiple of the ashift but will never be bigger than the recordsize. The default recordsize is 128K. So with an ashift of 12 (so 4K) and a recordsize of 128K, a record can be 4K, 8K, 16K, 32K, 64K or 128K. If you now want to save a 50K file it will be stored as a 64K record. If you want to store a 6K file it will create an 8K record. So it will always use the next bigger possible record size. With files that are bigger than the recordsize this is a bit different: when storing a 1M file it will create eight 128K records. So the recordsize is usually not as critical as the volblocksize for zvols, as it is quite versatile because of its dynamic nature. A short command sketch tying these terms together follows below.
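For example (pool "tank", the dataset/zvol names and the disks below are placeholders, not names from this thread):

Code:
# pool "tank" = two mirror vdevs striped together, created with ashift 12 (4K)
zpool create -o ashift=12 tank mirror /dev/sda /dev/sdb mirror /dev/sdc /dev/sdd
# a dataset (filesystem) with its own recordsize
zfs create -o recordsize=1M tank/backups
# a zvol (block device) of 100G with a fixed volblocksize of 16K
zfs create -V 100G -o volblocksize=16K tank/vm-disk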

You can create a dataset with zfs create YourPool/NameOfNewDataset. As datasets are filesystems, you can store files and folders in them. A PBS datastore is just a folder, so you could use the path "/YourPool/NameOfNewDataset/" as your datastore. And when doing a zfs set recordsize=1M YourPool/NameOfNewDataset, only files stored in that dataset will use a 1M recordsize. Everything else still uses the default 128K recordsize.
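Put together as a minimal sketch (placeholder names as above; the datastore name "MyStore" is made up, and the datastore-create step uses proxmox-backup-manager as shown later in this thread):

Code:
zfs create YourPool/NameOfNewDataset
zfs set recordsize=1M YourPool/NameOfNewDataset
proxmox-backup-manager datastore create MyStore /YourPool/NameOfNewDataset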
 
Makes sense, thanks for the replies. I was wondering why I could not create a virtual machine when I set the whole pool to 4MB in the GUI.

For now I have done the following commands

Code:
zpool create -f zfs -o ashift=12 mirror /dev/sda /dev/sdb mirror /dev/sdc /dev/sdd mirror /dev/sde /dev/sdf mirror /dev/sdg /dev/sdh special mirror /dev/nvme0n1 /dev/nvme1n1
zfs set special_small_blocks=4K zfs
zfs set compression=lz4 zfs

zfs create zfs/PBS
zfs set recordsize=1M zfs/PBS

proxmox-backup-manager datastore create TEST /zfs/PBS/TEST

I did want to set it to 4M but it threw the following error. A few mentions I found online suggested leaving it at 1M to prevent any unknown errors, but those were from 2017, so I guess it's been tested more since then. I just can't find how to change it so I can test the performance.

Code:
cannot set property for 'zfs/PBS': 'recordsize' must be power of 2 from 512B to 1M

From reading your last reply I just want to confirm the ashift=12 is suitable?

Code:
Disk /dev/sdf: 12.73 TiB, 14000519643136 bytes, 27344764928 sectors
Disk model: WUH721414AL5201
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
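A quick way to list the logical/physical sector sizes of all disks at once is lsblk (a hedged example; these column names are provided by util-linux):

Code:
lsblk -o NAME,SIZE,LOG-SEC,PHY-SEC,MODEL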
 
I did want to set it to 4M but it threw the following error. A few mentions I found online suggested leaving it at 1M to prevent any unknown errors, but those were from 2017, so I guess it's been tested more since then. I just can't find how to change it so I can test the performance.

Code:
cannot set property for 'zfs/PBS': 'recordsize' must be power of 2 from 512B to 1M
You have to change "zfs_max_recordsize" first; it's a ZFS kernel module parameter. It supports up to 16M but defaults to 1M.
And you will have to enable the "large_blocks" feature of your pool first.
See here: https://openzfs.github.io/openzfs-docs/Performance and Tuning/Module Parameters.html#zfs-max-recordsize
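A hedged sketch of what that could look like (4M = 4194304 bytes; "zfs" and "zfs/PBS" are the pool/dataset names used in this thread; whether the parameter can be changed at runtime via /sys should be checked in the linked documentation):

Code:
# persist across reboots
echo "options zfs zfs_max_recordsize=4194304" >> /etc/modprobe.d/zfs.conf
# possibly apply at runtime (if the parameter is dynamic)
echo 4194304 > /sys/module/zfs/parameters/zfs_max_recordsize
# make sure the pool feature is enabled, then raise the recordsize of the datastore dataset
zpool set feature@large_blocks=enabled zfs
zfs set recordsize=4M zfs/PBS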
From reading your last reply I just want to confirm the ashift=12 is suitable?
Jup.
 
@Dunuin Question regarding the special devices /dev/nvme0n1 /dev/nvme1n1

How can I check how full they are?
With zpool list -v:
Code:
root@MainNAS[~]# zpool list -v
NAME                                             SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
HDDpool                                         29.3T  22.2T  7.12T        -     -     3%    75%  1.00x    ONLINE  /mnt
  raidz1-0                                      29.1T  22.1T  6.98T        -     -     3%  76.0%      -    ONLINE
    gptid/c368738b-a623-11ec-aaa2-002590467989  7.28T      -      -        -     -      -      -      -    ONLINE
    gptid/c37c8a34-a623-11ec-aaa2-002590467989  7.28T      -      -        -     -      -      -      -    ONLINE
    gptid/c379fcd0-a623-11ec-aaa2-002590467989  7.28T      -      -        -     -      -      -      -    ONLINE
    gptid/c37b3604-a623-11ec-aaa2-002590467989  7.28T      -      -        -     -      -      -      -    ONLINE
special                                             -      -      -        -     -      -      -      -  -
  mirror-1                                       186G  47.5G   139G        -     -    45%  25.5%      -    ONLINE
    gptid/c2e04c14-a623-11ec-aaa2-002590467989   186G      -      -        -     -      -      -      -    ONLINE
    gptid/c2e45ed6-a623-11ec-aaa2-002590467989   186G      -      -        -     -      -      -      -    ONLINE
So here my special device mirror has 47.5 of 186GiB used. Once it is 75% full, metadata/data will spill over to the normal data vdevs (in my case the raidz1).
 
Great, here is mine

Code:
NAME          SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
zfs          51.3T  8.75T  42.6T        -         -     2%    17%  1.00x    ONLINE  -
  mirror-0   12.7T  2.15T  10.6T        -         -     2%  16.9%      -    ONLINE
    sda      12.7T      -      -        -         -      -      -      -    ONLINE
    sdb      12.7T      -      -        -         -      -      -      -    ONLINE
  mirror-1   12.7T  2.18T  10.5T        -         -     2%  17.1%      -    ONLINE
    sdc      12.7T      -      -        -         -      -      -      -    ONLINE
    sdd      12.7T      -      -        -         -      -      -      -    ONLINE
  mirror-2   12.7T  2.17T  10.5T        -         -     2%  17.1%      -    ONLINE
    sde      12.7T      -      -        -         -      -      -      -    ONLINE
    sdf      12.7T      -      -        -         -      -      -      -    ONLINE
  mirror-3   12.7T  2.17T  10.6T        -         -     2%  17.0%      -    ONLINE
    sdg      12.7T      -      -        -         -      -      -      -    ONLINE
    sdh      12.7T      -      -        -         -      -      -      -    ONLINE
special          -      -      -        -         -      -      -      -  -
  mirror-4    476G  90.1G   386G        -         -     5%  18.9%      -    ONLINE
    nvme0n1   477G      -      -        -         -      -      -      -    ONLINE
    nvme1n1   477G      -      -        -         -      -      -      -    ONLINE

Can you modify that 75% value? Why not closer to 95%?
 
As far as I know that is hardcoded. And you shouldn't fill a vdev that much anyway. The ZFS documentation recommends not filling a pool more than 80% for best performance. If you fill it too much it will fragment and become slow.
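To keep an eye on that, capacity and fragmentation can be queried per pool, for example ("zfs" being the pool name from this thread):

Code:
zpool list -o name,size,allocated,free,capacity,fragmentation zfs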
 
Thanks for this thread.
I don't have a fast SSD/NVMe for metadata yet. I just added a consumer SSD as L2ARC. I found that switching the L2ARC policy to MFU only also helps a lot (the cache is not flooded with every new backup):
Add the following ZFS module parameters to /etc/modprobe.d/zfs.conf:
Code:
options zfs l2arc_mfuonly=1 l2arc_noprefetch=0
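These take effect when the module is loaded. Both parameters are also exposed under /sys, so assuming they are runtime-changeable (most l2arc_* tunables are), they could be applied immediately like this:

Code:
echo 1 > /sys/module/zfs/parameters/l2arc_mfuonly
echo 0 > /sys/module/zfs/parameters/l2arc_noprefetch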
 
With zpool list -v:
Code:
root@MainNAS[~]# zpool list -v
NAME                                             SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
HDDpool                                         29.3T  22.2T  7.12T        -     -     3%    75%  1.00x    ONLINE  /mnt
  raidz1-0                                      29.1T  22.1T  6.98T        -     -     3%  76.0%      -    ONLINE
So zpool always reports the raw size, and if you allocate 30GB, FREE will go down and ALLOC up by 40GB? https://openzfs.github.io/openzfs-docs/man/8/zpool-list.8.html unfortunately doesn't explain what is actually shown in those columns.
 
Got a bit of an issue on my end now

Code:
root@storage:~# zpool list -v
NAME          SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
zfs          51.3T  29.0T  22.4T        -         -     2%    56%  1.00x    ONLINE  -
  mirror-0   12.7T  7.09T  5.63T        -         -     2%  55.8%      -    ONLINE
    sda      12.7T      -      -        -         -      -      -      -    ONLINE
    sdb      12.7T      -      -        -         -      -      -      -    ONLINE
  mirror-1   12.7T  7.16T  5.56T        -         -     2%  56.3%      -    ONLINE
    sdc      12.7T      -      -        -         -      -      -      -    ONLINE
    sdd      12.7T      -      -        -         -      -      -      -    ONLINE
  mirror-2   12.7T  7.15T  5.57T        -         -     2%  56.2%      -    ONLINE
    sde      12.7T      -      -        -         -      -      -      -    ONLINE
    sdf      12.7T      -      -        -         -      -      -      -    ONLINE
  mirror-3   12.7T  7.13T  5.59T        -         -     2%  56.0%      -    ONLINE
    sdg      12.7T      -      -        -         -      -      -      -    ONLINE
    sdh      12.7T      -      -        -         -      -      -      -    ONLINE
special          -      -      -        -         -      -      -      -  -
  mirror-4    476G   433G  42.6G        -         -    43%  91.0%      -    ONLINE
    nvme0n1   477G      -      -        -         -      -      -      -    ONLINE
    nvme1n1   477G      -      -        -         -      -      -      -    ONLINE

Performance seems to have dropped a bit. I may just replace them with larger 2TB disks, but if I don't, what happens once it fills up? Will I lose any data?

You said it spills over, so I presume that means no data loss, but what happens to the data that spilled over once the special device has lots of free space again? Will it move back automatically?
 
I don't think it will be moved back to the special device. It should be the same as adding a special device to a pool that didn't have one before, where only new writes end up on the special device. If ZFS could move old metadata from normal vdevs to special vdevs, then I guess it would already do that.
The usual way to get old metadata onto the special devices is to move all the old data off the pool and back (or to move the files/folders of a dataset into another dataset).
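A very rough sketch of that "move it off and back" idea for a single dataset (the dataset name "zfs/PBS_new" is made up; rewriting all chunk files will take a while, and free space should be checked first):

Code:
zfs create -o recordsize=1M zfs/PBS_new
cp -a /zfs/PBS/. /zfs/PBS_new/   # rewriting the files lets the new copies and their metadata land on the special vdev
# after verifying the copy, point the datastore at the new path and destroy the old dataset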
 
I don't think it will be moved back to the special device. It should be the same as adding a special device to a pool that didn't have one before, where only new writes end up on the special device. If ZFS could move old metadata from normal vdevs to special vdevs, then I guess it would already do that.
I wouldn't worry about that, though: data that is rarely accessed doesn't matter anyway, and data that is accessed frequently has a higher probability of being overwritten, which would fix the problem.
 
@Dunuin just want to confirm: if I want RAIDZ-2, the correct command would be

Code:
zpool create -f zfs -o ashift=12 mirror /dev/sda /dev/sdb mirror /dev/sdc /dev/sdd mirror /dev/sde /dev/sdf mirror /dev/sdg /dev/sdh special mirror /dev/nvme0n1 /dev/nvme1n1
 
Your pool should be called "zfs"? Your command would create a raid10, not a raidz2.

An 8-disk raidz2 called "zfs" with 2 special devices in a mirror would be:
zpool create -f -o ashift=12 zfs raidz2 /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh special mirror /dev/nvme0n1 /dev/nvme1n1
And I wouldn't use sda, sdb, sdc and so on but "/dev/disk/by-id/". So:
zpool create -f -o ashift=12 zfs raidz2 /dev/disk/by-id/your1stDisk /dev/disk/by-id/your2ndDisk /dev/disk/by-id/your3rdDisk /dev/disk/by-id/your4thDisk /dev/disk/by-id/your5thDisk /dev/disk/by-id/your6thDisk /dev/disk/by-id/your7thDisk /dev/disk/by-id/your8thDisk special mirror /dev/disk/by-id/1stSpecial /dev/disk/by-id/2ndSpecial

And don't forget to change the blocksize to at least 16K for your ZFSPool storage before creating your first VM.
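If the pool is added as a ZFSPool storage in PVE, this can be set with pvesm; a hedged example (the storage ID "zfs" is only an assumption):

Code:
pvesm set zfs --blocksize 16k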
 
blocksize is the same thing as recordsize???


After creating the pool I will

Code:
zfs set special_small_blocks=4K zfs
zfs set compression=lz4 zfs
zfs create zfs/PBS
zfs set recordsize=1M zfs/PBS

then create the following, which I will mount for VM data
Code:
zfs create zfs/VMDATA
zfs set recordsize=16K zfs/VMDATA

For the disk IDs, what value do I use? For example, /dev/sda is:

Code:
ls -rtl
lrwxrwxrwx 1 root root  9 Apr  8 19:02 wwn-0x5000cca28f6d86a0 -> ../../sda
lrwxrwxrwx 1 root root  9 Apr  8 19:02 scsi-35000cca28f6d86a0 -> ../../sda

do I use wwn-0x5000cca28f6d86a0 or scsi-35000cca28f6d86a0?
 
