ZFS Datastore on PBS - ZFS Recordsize/Dedup...

Ramalama

It looks to me like ZFS deduplication is not really needed for PBS, because PBS already does deduplication itself.
On my pool with a lot of backups I have a dedup ratio of 1.1, so some dedup does happen on the ZFS side, but it doesn't seem worth it.

Same for ZFS compression, which runs here with ZSTD as the compression algorithm: I only get a ratio of around 1.2x.

Neither seems worth it to me.
I changed logbias to throughput and recordsize to 1M on the pool, which gave me a ton of speed: backup speed went from 200 MB/s to 900 MB/s.
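For anyone who wants to try the same, the change boils down to two dataset properties (the dataset name below is just a placeholder for whatever backs your PBS datastore):
Code:
# placeholder dataset name; point it at the dataset backing your PBS datastore
zfs set logbias=throughput HDD-POOL/pbs-datastore
zfs set recordsize=1M HDD-POOL/pbs-datastore
# recordsize only applies to newly written data; existing chunks keep their old record size
zfs get logbias,recordsize HDD-POOL/pbs-datastore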

Now I'm thinking about how to further optimize the backup ZFS pool to gain even more speed.
Does it make sense to use a higher ashift value, like 13 for 8K blocks, or even 14?
Maybe even set the recordsize to 2M or 4M, since the chunks are all 4 MB in size?

Are there any pros here who have already gone through all of this to squeeze the most out of PBS?
It's a dedicated PBS server with 2x E5-2637 v3 and 256 GB RAM, so I have enough horsepower and memory to not run into any limits there.

I already destroyed the pool to reformat all the SAS disks with sg_format to a 4K logical block size; they are 4K physical, but were 512B logical before.
I'm not sure whether sg_format actually changes the logical block size in the drive's firmware, but so far it looks like sg_format just sends the format command to the drive's firmware and polls the progress. So chances are high that the logical block size really changes to 4K.
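For reference, the sg_format invocation I mean looks roughly like this (the device name is a placeholder, and reformatting wipes the disk):
Code:
# DANGER: this low-level format destroys all data on the disk
sg_format --format --size=4096 /dev/sdX     # /dev/sdX is a placeholder
# afterwards verify the reported logical/physical block sizes
sg_readcap --long /dev/sdX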
@Dunuin As far as I remember, you have the most PBS knowledge around here. Have you already done some tuning?
I bet you had the same questions when you started with PBS.

Everyone is welcome to share their findings here!
Cheers
 
https://forum.proxmox.com/threads/writes-sync-or-async.84612
In that thread I showed the speed increase from the logbias and recordsize changes, while still having deduplication enabled on ZFS.
But at that time I didn't look closely at whether deduplication makes sense. I know better now: it doesn't really help, it probably just slows things down.
I'm just looking for a way to further increase speed.

Cheers
 
Most of the data is chunks, and those are already ZSTD-compressed, so ZFS compression doesn't help much. I use LZ4 compression, as that at least won't hurt performance.
And ZFS deduplication isn't worth it either.
It makes sense to increase the recordsize to 4M (you might need to enable that pool feature first, otherwise you are limited to 1M). Chunks are at most 4 MB, but usually more like 2 MB because of compression.
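Roughly like this (pool/dataset names are placeholders; depending on the OpenZFS version you might also have to raise the zfs_max_recordsize module parameter first):
Code:
zpool get feature@large_blocks tank                        # placeholder pool name
# on older OpenZFS releases the module limit may still default to 1M
echo 4194304 > /sys/module/zfs/parameters/zfs_max_recordsize
zfs set recordsize=4M tank/pbs-datastore                   # placeholder dataset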

What helps most is adding some enterprise SSDs as a special device, in case you are running HDD-only.

Ashift is a difficult topic. It shouldn't help with HDDs, and for SSDs it depends on the model; no manufacturer will tell you what real blocksize the SSD works with internally. The only way to find out whether it makes sense is to benchmark different ashift values and compare what performs best for your SSD models.
 
Hmmh, thanks for the tips!
Yeah, it's a pool of HDDs. A special device would surely be a benefit, but it drives the cost through the roof for a small gain.
-> The server has no U.2/U.3, so PCIe adapter cards + at least 400€ per drive, and I need a minimum of 2 for a mirror. That's around +1000€.
- Additionally, the server only has Gen3 PCIe. Even with Optane from eBay it would be around 1000€.
--> For around 720€ I could instead add another 2 SAS drives and increase the throughput by +200 MB/s (8-disk ZFS RAID10 instead of 6).

I could instead upgrade the server to 512 GB of memory at the cost of slightly lower memory speed, since I would then run 2 DIMMs per channel instead of one. But on the other hand, I don't know how I would utilize that memory. Increasing ARC, sure, but ARC max is already set to 200 GB, and I doubt that 400 GB max and something like 300 GB min would bring any benefit.
The thing is, I have a lot of RAM left over from other servers, which is why it's a cheap option for me.
I just think that 1 DIMM per channel (256 GB) but a little faster is better than 512 GB with 2 DIMMs per channel.
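(For context, the 200 GB ARC limit is just set the usual way via the module parameter; the values below are examples of what I mean:)
Code:
# /etc/modprobe.d/zfs.conf -- pin the ARC size in bytes (example values)
options zfs zfs_arc_max=214748364800    # ~200 GiB
options zfs zfs_arc_min=107374182400    # ~100 GiB, optional
# apply at runtime without a reboot:
echo 214748364800 > /sys/module/zfs/parameters/zfs_arc_max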

So in the end I should try:
- recordsize=2M (large_blocks is enabled by default anyway since 2.1 or so)
- logbias=throughput (skips the intent log, power loss = data loss)
- xattr=sa
- dnodesize=auto
- atime=off
- ashift 12/13/14, which I simply need to benchmark.
ZFS benchmarking is just a hard task though; I would at least need to disable primarycache?

Other than that, I'm not sure how to squeeze out more performance. I have a lot of Gen3 PCIe slots, but I'm unaware of a cheap solution.
For sure I won't use Samsung 980/990 Pros or similar consumer crap anymore, after seeing how quickly they die even as a special device with ZFS xD

EDIT:
https://www.reddit.com/r/zfs/comments/z5ojfh/is_there_a_case_against_ashift13_with_large/
According to that, I should simply stick with ashift 12. So that's solved; only recordsize is left.
Well, it seems I'm already at the limit of what's possible with 900 MB/s, and the only options left are either a special device or 2 more SAS drives.

Cheers
 
Last edited:
Even a cheap SATA SSD would boost GC performance by multiple orders of magnitude. And it helps a little bit with everything else too, as all those millions upon millions of chunks cause millions upon millions of metadata reads/writes, which will then hit the SSD instead of the HDDs. And the SSDs only need to be <1% of the capacity of the HDDs, so you usually don't need big, expensive SSDs. But yes, for backup/restore/verify you are usually bandwidth-limited and not IOPS-limited.
It also helps with the problem that the web UI gets very slow, or even fails with a timeout, when the HDDs can't keep up while you browse the backup snapshots of your datastore.

atime=off
PBS needs atime. Disable it and the GC will corrupt your backups. Relatime would be fine.

I would at least need to disable primarycache?
Yes, that would disable the RAM for reads. And use sync writes if you want to benchmark the NAND and not the RAM for writes.
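As a rough sketch of what I mean, on a throwaway test dataset (names are placeholders):
Code:
# RAM read cache off and forced sync writes, so you benchmark the disks, not the RAM
zfs create -o primarycache=none -o sync=always HDD-SAS/bench    # placeholder names
fio --name=seqwrite --directory=/HDD-SAS/bench --rw=write --bs=1M \
    --size=10G --ioengine=psync --fsync=1 --numjobs=1
zfs destroy HDD-SAS/bench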
 
Last edited:
Thanks for the hint about atime, lol.
I didn't know it was important for PBS.
1% is still a lot for a 56/70 TB pool (I will grow it later); that means I would need at least 2 Optane drives, something like 2x 905P with 480 GB, or better 4.
Maybe I'll find a cheap option. Not sure.
 
I didn't know it was important for PBS.
That's how PBS decides which chunks to delete. In step one it walks through every index file and updates the atime of every chunk that is still referenced by some backup snapshot. In step two it reads the atime of all chunks and deletes those that didn't get updated in the last 24 hours and 5 minutes. So with atime disabled, the GC can't update it in step one anymore, and step two will destroy chunks that are still needed.
Millions of chunks means millions of random metadata reads and writes when setting/reading the atime. That's where it helps if the metadata sits on an SSD with 1 million or 50K IOPS instead of an HDD with 100-200 IOPS ;)
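In ZFS terms that just means leaving atime on (or switching to relatime) on the datastore dataset, something like:
Code:
# keep atime updates (the GC needs them), but let ZFS update them lazily
zfs set atime=on HDD-SAS/pbs-datastore       # placeholder dataset name
zfs set relatime=on HDD-SAS/pbs-datastore
zfs get atime,relatime HDD-SAS/pbs-datastore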
 
Last edited:
I found these:
https://geizhals.de/samsung-ssd-pm1735-1-6tb-mzplj1t6hbjr-00007-a2213016.html

Two of them in a mirror for a special vdev is within budget. That's the cheapest option, since I don't need any adapters etc.
However, at 1.6 TB they are bigger than I need.

My understanding issues start now. As I have a ton of memory, all the metadata should be in memory anyway, right?
Sure, it will get written to the drive, but in an async fashion, right? Meaning it gets collected in memory and written to disk in one go.
At least that's how I imagine it.

So the two NVMe drives would actually not give much of a performance boost, because all the metadata should already be in my 200 GB ARC anyway?
I could increase the memory to 512 GB (2 DIMMs/channel).

And the next issue I have: if I go with a 2M recordsize on the pool, those 1.6 TB drives will be a waste, because with a 2M recordsize I can forget about using the NVMe's for special small blocks? The risk that the NVMe's fill up very fast would be too high.

So in the end, if I want to go the special vdev route, I would optimally need either smaller drives, around 800 GB, or at least 3.2 TB drives to also use them for special small blocks?
1.6 TB seems absolutely not optimal for a 56 or 70 TB PBS pool.

Thanks again Dunuin!
 
Forget it, I answered my own question.

I will buy those 1.6 TB drives and set special_small_blocks to 256K.

I currently have around 3.4 TB of backups on the pool:
find /datasets/Backup-HDD-SATA/ -type f -size -256k -print0 | du -ch --files0-from=- | grep total$
--> 104G total

That means, multiplied by 10, around 1 TB of special small blocks for 34 TB of actual backups.
That should be enough in the beginning; if I need more, I will simply buy another 2 NVMe's (and think later about how to expand the special vdev; worst case, I move the data somewhere else temporarily).

That still leaves around 500 GB for metadata, for 34 TB of backups.
Which is optimal.
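(Setting that later is just a dataset property; the name below is a placeholder, and it only affects blocks written after the change:)
Code:
zfs set special_small_blocks=256K HDD-SAS/pbs-datastore    # placeholder dataset
zfs get special_small_blocks,recordsize HDD-SAS/pbs-datastore
# only newly written blocks <= 256K will land on the special vdev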

For the perfect recordsize I found a better approach:

Code:
Actual Backup Usage: 3,4TB:

Around 512k:
find /datasets/Backup-HDD-SATA/ -type f -size +400k -size -600k -print0 | du -ch --files0-from=- | grep total$
120G    total
find /datasets/Backup-HDD-SATA/ -type f -size +400k -size -600k | wc -l
246912

Around 1M:
find /datasets/Backup-HDD-SATA/ -type f -size +900k -size -1100k -print0 | du -ch --files0-from=- | grep total$
78G    total
find /datasets/Backup-HDD-SATA/ -type f -size +900k -size -1100k | wc -l
81125

Around 2M:
find /datasets/Backup-HDD-SATA/ -type f -size +1900k -size -2200k -print0 | du -ch --files0-from=- | grep total$
265G    total
find /datasets/Backup-HDD-SATA/ -type f -size +1900k -size -2200k | wc -l
133254

Around 4M:
find /datasets/Backup-HDD-SATA/ -type f -size +3900k -size -4300k -print0 | du -ch --files0-from=- | grep total$
632G    total
find /datasets/Backup-HDD-SATA/ -type f -size +3900k -size -4300k | wc -l
161337

Bigger than 5M:
find /datasets/Backup-HDD-SATA/ -type f -size +5000k -print0 | du -ch --files0-from=- | grep total$
291G    total
find /datasets/Backup-HDD-SATA/ -type f -size +5000k | wc -l
39043

So I need a middle-ground recordsize to account for a very mixed distribution, something that covers 512K as well as 2M and 4M.
I think a recordsize of 1M is the best middle ground here.

So in conclusion: a 1.6 TB special vdev for small blocks up to 256K and a recordsize of 1M, for 34 TB of actual backups, is the most optimal I can do for PBS specifically.
I will add 2 more SAS drives to the pool anyway, to get at least 200 MB/s more raw speed, so that backups run at 1 GB/s or more.

Thanks Dunuin, Cheers :)
 
My understanding issues start now. As I have a ton of memory, all the metadata should be in memory anyway, right?
Yes. But a special device still helps, as the RAM only read-caches metadata. Half of the GC task consists of metadata writes, and there the RAM won't help; the HDDs' IOPS performance will be the limiting factor.

Sure, it will get written to the drive, but in an async fashion, right? Meaning it gets collected in memory and written to disk in one go.
At least that's how I imagine it.
Yes, but ZFS only write-caches the last ~5 seconds of async writes in RAM. After that, they still have to be written from RAM to HDD. So it's still more IOPS than the HDDs can handle with those millions of random writes.
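(The ~5 seconds is just the transaction group timeout; for illustration, it is visible as a module parameter:)
Code:
# async writes are collected per transaction group and flushed after this many seconds
cat /sys/module/zfs/parameters/zfs_txg_timeout     # default: 5
# ...or earlier, once enough dirty data has accumulated
cat /sys/module/zfs/parameters/zfs_dirty_data_max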


And the next issue I have: if I go with a 2M recordsize on the pool, those 1.6 TB drives will be a waste, because with a 2M recordsize I can forget about using the NVMe's for special small blocks? The risk that the NVMe's fill up very fast would be too high.
Special_small_blocks won't help. I would only use those special devices for metadata.


So in the end, if I want to go the special vdev route, I would optimally need either smaller drives, around 800 GB, or at least 3.2 TB drives to also use them for special small blocks?
1.6 TB seems absolutely not optimal for a 56 or 70 TB PBS pool.
For 70 TB of datastores, a 480 GB SSD could be enough; 800 GB if you want some headroom.

Another thing to keep in mind is that ZFS won't actively move existing metadata from HDD to SSD. You would need to write all the existing data again for its metadata to end up on the SSDs.


So I need a middle-ground recordsize to account for a very mixed distribution, something that covers 512K as well as 2M and 4M.
I think a recordsize of 1M is the best middle ground here.
Recordsize is an "up to" value. If you set it to 4M, a 4 MB chunk will result in a 4M record, a 3 MB chunk in a 4M record, a 1.9 MB chunk in a 2M record, a 0.9 MB chunk in a 1M record, and so on. So it shouldn't hurt to set the recordsize to 4M, even if most chunks will be smaller because of compression.
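You can see that yourself with a quick test like this (names are placeholders, compression off so the numbers aren't skewed):
Code:
zfs create -o recordsize=4M -o compression=off HDD-SAS/rstest   # placeholder names
dd if=/dev/urandom of=/HDD-SAS/rstest/800k.bin bs=1K count=800
sync
du -h /HDD-SAS/rstest/800k.bin    # ~800K on disk, not padded to a full 4M record
zfs destroy HDD-SAS/rstest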
 
Aah, thanks for the recordsize explanation, that makes sense.
For writes you're correct, but for reads it would still open a 4 MB record to get, for example, 800 KB out of it.

But you say that with a 4M recordsize a small file would still be saved as a 1M record?
I don't get it, because that would mean the recordsize is somehow dynamic.
If it splits a file of, say, 5 MB with a recordsize of 4M, would I get a 4 MB and a 1 MB block?

In that case, what's the downside of recordsize=256M at all, for example?
I know 256M is far too much, it's just to illustrate.

I will then use special small blocks of 128K; there are only very few files below 128K, maybe 10 GB for 3.4 TB of backups. So it's a slightly better utilization of the NVMe's and they will never fill up.
I bought 2 more SAS drives and 2x Samsung PM1735 today.
I'll report back how much it improves things.

Thanks Dunuin!
 
But you say that with a 4M recordsize a small file would still be saved as a 1M record?
It's the next 2^X that is bigger than the ashift and not bigger than the recordsize. So with ashift=12 and recordsize=4M it is one of: 4K/8K/16K/32K/64K/128K/256K/512K/1M/2M/4M. Everything bigger than the recordsize gets stored as multiple records.

If it splits a file of, say, 5 MB with a recordsize of 4M, would I get a 4 MB and a 1 MB block?
Yes, bigger files get stored as multiple records. But I'm not sure whether it will create 2x 4M records or a 4M + 1M record.


In that case, what's the downside of recordsize=256M at all, for example?
I know 256M is far too much, it's just to illustrate.
Let's say you've got a database that is stored as a single 10 GB file and you use a 256M recordsize. Then the whole DB is 40x 256M records. Every time you query the database because you want 16K of data, it has to read and cache the full 256M record. That would be pretty bad.
So for lots of small random IO you want a small recordsize, and for lots of big sequential IO a big one.
Ideally you use different datasets for different kinds of workloads and then set the optimal recordsize per dataset.
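For example (dataset names are just placeholders):
Code:
# small random IO (database) vs. big sequential IO (PBS chunks)
zfs create -o recordsize=16K tank/postgres        # placeholder names
zfs create -o recordsize=1M  tank/pbs-datastore
zfs get recordsize tank/postgres tank/pbs-datastore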
 
Last edited:
The database is a great example of where it makes sense to use a smaller recordsize, or any application that reads only a small part of a file.
But PBS reads or writes whole chunk files, so after all this explanation it seems to me like a 4M recordsize should have absolutely no downsides.

Thanks Dunuin as always :)
 
Hey @Dunuin, here is ChatGPT's answer about recordsize:

----
In OpenZFS 2.2, the recordsize property defines the block size used for reading and writing. Larger recordsize values can be counterproductive for databases due to their random access patterns. However, they can improve disk space utilization for larger files and reduce fragmentation. With a recordsize set to 4 MB:
  • An 800 KB file will only occupy 800 KB of actual storage plus minimal metadata, without padding up to 4 MB. It will still occupy less physical space than 4 MB due to ZFS's variable-length blocks.
  • A 5.4 MB file will be split into two blocks. The first 4 MB block will be fully used, and the second will store the remaining 1.4 MB, leaving the rest of the 4 MB block empty but only consuming storage for the 1.4 MB of data plus metadata.

Key Considerations:

  1. Disk Space Utilization: Larger recordsize values can enhance space utilization for large files and reduce fragmentation.
  2. Performance: Larger recordsize values may improve sequential read/write performance for large files but can degrade random read performance.
  3. Compression: If compression is enabled, space savings can be affected because ZFS compresses data based on the recordsize value.
---

That's quite an answer... no one could describe it better, lol.
 
-> I checked the write speed of each disk with a script that runs:
Code:
dd if=/dev/zero of=/mnt/$diskid/1m_test bs=1M count=8000 oflag=direct

-> All disks support 4K and 512B logical block sizes, but they ship with 512B by default.
---> There is absolutely no performance difference between 512B and 4K logical block size on SATA/SAS HDDs (tested).
---> On NVMe's there is definitely a performance difference (tested); see the nvme-cli sketch below.
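For the NVMe's, the gist of checking/switching the LBA format with nvme-cli looks like this (the device name is a placeholder, and formatting erases the namespace):
Code:
# list the supported LBA formats and which one is currently in use
nvme id-ns -H /dev/nvme0n1              # placeholder device
# DANGER: switching the LBA format wipes the namespace
nvme format /dev/nvme0n1 --lbaf=1       # index of the 4K format from the list above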

This is a speed test run of my script (I've attached the script in case others want to use it).
Code:
./danger_testspeed.sh
Test: SATA -> 4TB/sdi/ata-MB4000GCWDC_XXXX? (y/N):
Test: SATA -> 4TB/sdh/ata-MB4000GCWDC_XXXX? (y/N):
Test: SATA -> 4TB/sda/ata-MB4000GCWDC_XXXX? (y/N):
Test: SATA -> 4TB/sdk/ata-MB4000GCWLV_XXXX? (y/N):
Test: SATA -> 4TB/sdl/ata-MB4000GCWLV_XXXX? (y/N):
Test: SATA -> 4TB/sdg/ata-MB4000GCWLV_XXXX? (y/N):
Test: SATA -> 4TB/sdf/ata-MB4000GCWLV_XXXX? (y/N):
Test: SATA -> 4TB/sdd/ata-MB4000GCWLV_XXXX? (y/N):
Test: SATA -> 4TB/sde/ata-MB4000GCWLV_XXXX? (y/N):
Test: SATA -> 4TB/sdj/ata-MB4000GCWLV_XXXX? (y/N):
Test: SATA -> 4TB/sdc/ata-WDC_WD4003FFBX-XXXX? (y/N):
Test: SATA -> 4TB/sdb/ata-WDC_WD4003FFBX-XXXX? (y/N):
Test: SAS/MB014000JWTFD -> 14TB/sdu/scsi-3500XXXX? (y/N): y
sdu -> Read(1m): 246 MB/s | Read(2m): 244 MB/s | Write(1m): 262 MB/s | Write(2m): 117 MB/s
Test: SAS/MB014000JWTFD -> 14TB/sdt/scsi-3500XXXX? (y/N): y
sdt -> Read(1m): 222 MB/s | Read(2m): 223 MB/s | Write(1m): 248 MB/s | Write(2m): 111 MB/s
Test: SAS/EH0300JDYTH -> 300GB/sdm/scsi-3500XXXX? (y/N):
Test: SAS/EH0300JDYTH -> 300GB/sdn/scsi-3500XXXX? (y/N):
Test: SAS/MB014000JWUDB -> 14TB/sdr/scsi-3500XXXX? (y/N): y
sdr -> Read(1m): 251 MB/s | Read(2m): 245 MB/s | Write(1m): 174 MB/s | Write(2m): 110 MB/s
Test: SAS/MB014000JWUDB -> 14TB/sdv/scsi-3500XXXX? (y/N): y
sdv -> Read(1m): 247 MB/s | Read(2m): 239 MB/s | Write(1m): 175 MB/s | Write(2m): 113 MB/s
Test: SAS/MB014000JWUDB -> 14TB/sdq/scsi-3500XXXX? (y/N): y
sdq -> Read(1m): 261 MB/s | Read(2m): 246 MB/s | Write(1m): 150 MB/s | Write(2m): 109 MB/s
Test: SAS/WUH721816AL5204 -> 16TB/sdp/scsi-3500XXXX? (y/N): y
sdp -> Read(1m): 267 MB/s | Read(2m): 261 MB/s | Write(1m): 272 MB/s | Write(2m): 142 MB/s
Test: SAS/WUH721816AL5204 -> 16TB/sdo/scsi-3500XXXX? (y/N): y
sdo -> Read(1m): 272 MB/s | Read(2m): 259 MB/s | Write(1m): 274 MB/s | Write(2m): 141 MB/s
Test: SAS/WUH721414AL5204 -> 14TB/sds/scsi-3500XXXX? (y/N): y
sds -> Read(1m): 263 MB/s | Read(2m): 264 MB/s | Write(1m): 262 MB/s | Write(2m): 135 MB/s

- So my conclusion is that I will prefer a 1M blocksize over 2M, i.e. recordsize=1M.
- I created the RAID10 pool and grouped the mirrors based on the performance of the disks, which should give a little more performance.
---> A RAID10 with 8 disks is a stripe of 4 mirrors. On ZFS the whole pool won't slow down to the slowest mirror, but you'll get slightly unbalanced space usage that varies a little between the mirrors over time. That has no impact other than a slight read-performance degradation, and the pool rebalances itself over time anyway.

So the resulting pool:
Code:
zpool create \
  -o ashift=12 \
  -O special_small_blocks=128k \
  -O xattr=sa \
  -O dnodesize=auto \
  -O recordsize=1M \
  -O mountpoint=/HDD-SAS \
  HDD-SAS \
  mirror /dev/disk/by-id/wwn-0x5000XXXX /dev/disk/by-id/wwn-0x5000XXXX \
  mirror /dev/disk/by-id/wwn-0x5000XXXX /dev/disk/by-id/wwn-0x5000XXXX \
  mirror /dev/disk/by-id/wwn-0x5000XXXX /dev/disk/by-id/wwn-0x5000XXXX \
  mirror /dev/disk/by-id/wwn-0x50000XXXX /dev/disk/by-id/wwn-0x50000XXXX \
  special mirror /dev/disk/by-id/nvme-SAMSUNG_MZPLJ1T6HBJR-00007_XXXX /dev/disk/by-id/nvme-SAMSUNG_MZPLJ1T6HBJR-00007_XXXX


--> Now the issues I have:

- Backup Speed:
Code:
INFO: starting new backup job: vzdump 129 --remove 0 --mode snapshot --notes-template '{{guestname}}' --storage Backup-SAS --node pve-hdr --notification-mode auto
INFO: Starting Backup of VM 129 (qemu)
INFO: Backup started at 2024-05-17 13:15:06
INFO: status = running
INFO: VM Name: blub
INFO: include disk 'virtio0' 'Storage-Default:vm-129-disk-0' 300G
INFO: include disk 'virtio1' 'Storage-Default:vm-129-disk-1' 99G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating Proxmox Backup Server archive 'vm/129/2024-05-17T11:15:06Z'
INFO: issuing guest-agent 'fs-freeze' command
INFO: issuing guest-agent 'fs-thaw' command
INFO: started backup task '5c6007a0-6182-4cee-943f-93b4b2d4009b'
INFO: resuming VM again
INFO: virtio0: dirty-bitmap status: existing bitmap was invalid and has been cleared
INFO: virtio1: dirty-bitmap status: existing bitmap was invalid and has been cleared
INFO:   0% (3.9 GiB of 399.0 GiB) in 3s, read: 1.3 GiB/s, write: 877.3 MiB/s
INFO:   1% (6.8 GiB of 399.0 GiB) in 6s, read: 1014.7 MiB/s, write: 510.7 MiB/s
INFO:   2% (9.4 GiB of 399.0 GiB) in 9s, read: 896.0 MiB/s, write: 605.3 MiB/s
INFO:   3% (12.0 GiB of 399.0 GiB) in 12s, read: 870.7 MiB/s, write: 612.0 MiB/s
INFO:   4% (16.4 GiB of 399.0 GiB) in 19s, read: 650.3 MiB/s, write: 642.3 MiB/s
INFO:   5% (22.1 GiB of 399.0 GiB) in 23s, read: 1.4 GiB/s, write: 619.0 MiB/s
INFO:   6% (24.1 GiB of 399.0 GiB) in 27s, read: 510.0 MiB/s, write: 510.0 MiB/s
INFO:   8% (34.3 GiB of 399.0 GiB) in 31s, read: 2.5 GiB/s, write: 413.0 MiB/s
INFO:  14% (59.3 GiB of 399.0 GiB) in 34s, read: 8.3 GiB/s, write: 0 B/s
INFO:  21% (84.2 GiB of 399.0 GiB) in 37s, read: 8.3 GiB/s, write: 4.0 MiB/s
INFO:  25% (100.3 GiB of 399.0 GiB) in 40s, read: 5.4 GiB/s, write: 209.3 MiB/s
INFO:  26% (103.8 GiB of 399.0 GiB) in 48s, read: 451.0 MiB/s, write: 451.0 MiB/s
INFO:  27% (107.9 GiB of 399.0 GiB) in 57s, read: 461.8 MiB/s, write: 461.8 MiB/s
INFO:  28% (112.0 GiB of 399.0 GiB) in 1m 5s, read: 524.0 MiB/s, write: 491.0 MiB/s
INFO:  29% (116.0 GiB of 399.0 GiB) in 1m 14s, read: 459.1 MiB/s, write: 454.2 MiB/s
INFO:  30% (120.3 GiB of 399.0 GiB) in 1m 22s, read: 546.0 MiB/s, write: 518.5 MiB/s
INFO:  31% (123.9 GiB of 399.0 GiB) in 1m 29s, read: 522.9 MiB/s, write: 522.3 MiB/s
INFO:  32% (127.8 GiB of 399.0 GiB) in 1m 37s, read: 508.0 MiB/s, write: 508.0 MiB/s
INFO:  33% (131.9 GiB of 399.0 GiB) in 1m 45s, read: 525.0 MiB/s, write: 525.0 MiB/s
INFO:  34% (136.0 GiB of 399.0 GiB) in 1m 53s, read: 515.0 MiB/s, write: 510.0 MiB/s
INFO:  35% (140.1 GiB of 399.0 GiB) in 2m 1s, read: 524.0 MiB/s, write: 513.0 MiB/s
INFO:  36% (144.1 GiB of 399.0 GiB) in 2m 9s, read: 519.5 MiB/s, write: 512.0 MiB/s
INFO:  37% (148.1 GiB of 399.0 GiB) in 2m 16s, read: 581.7 MiB/s, write: 544.6 MiB/s
INFO:  38% (152.1 GiB of 399.0 GiB) in 2m 23s, read: 578.3 MiB/s, write: 548.6 MiB/s
INFO:  39% (155.7 GiB of 399.0 GiB) in 2m 30s, read: 528.6 MiB/s, write: 518.9 MiB/s
INFO:  40% (159.7 GiB of 399.0 GiB) in 2m 38s, read: 520.5 MiB/s, write: 518.0 MiB/s
INFO:  41% (163.8 GiB of 399.0 GiB) in 2m 46s, read: 515.5 MiB/s, write: 515.5 MiB/s
INFO:  42% (167.7 GiB of 399.0 GiB) in 2m 53s, read: 574.9 MiB/s, write: 572.6 MiB/s
INFO:  43% (171.9 GiB of 399.0 GiB) in 3m 1s, read: 541.0 MiB/s, write: 527.0 MiB/s
INFO:  44% (175.9 GiB of 399.0 GiB) in 3m 8s, read: 578.9 MiB/s, write: 536.0 MiB/s
INFO:  45% (179.6 GiB of 399.0 GiB) in 3m 15s, read: 545.1 MiB/s, write: 545.1 MiB/s
INFO:  46% (183.8 GiB of 399.0 GiB) in 3m 23s, read: 544.0 MiB/s, write: 541.0 MiB/s
INFO:  47% (188.1 GiB of 399.0 GiB) in 3m 28s, read: 860.8 MiB/s, write: 482.4 MiB/s
INFO:  48% (191.7 GiB of 399.0 GiB) in 3m 34s, read: 621.3 MiB/s, write: 545.3 MiB/s
INFO:  49% (195.6 GiB of 399.0 GiB) in 3m 41s, read: 576.0 MiB/s, write: 511.4 MiB/s
INFO:  50% (199.8 GiB of 399.0 GiB) in 3m 48s, read: 611.4 MiB/s, write: 520.6 MiB/s
INFO:  51% (203.7 GiB of 399.0 GiB) in 3m 54s, read: 661.3 MiB/s, write: 527.3 MiB/s
INFO:  52% (208.8 GiB of 399.0 GiB) in 4m, read: 878.7 MiB/s, write: 412.7 MiB/s
INFO:  57% (229.9 GiB of 399.0 GiB) in 4m 3s, read: 7.0 GiB/s, write: 20.0 MiB/s
INFO:  61% (244.5 GiB of 399.0 GiB) in 4m 6s, read: 4.8 GiB/s, write: 196.0 MiB/s
INFO:  66% (266.5 GiB of 399.0 GiB) in 4m 9s, read: 7.3 GiB/s, write: 0 B/s
INFO:  72% (288.4 GiB of 399.0 GiB) in 4m 12s, read: 7.3 GiB/s, write: 0 B/s
INFO:  77% (310.4 GiB of 399.0 GiB) in 4m 15s, read: 7.3 GiB/s, write: 0 B/s
INFO:  83% (332.4 GiB of 399.0 GiB) in 4m 18s, read: 7.3 GiB/s, write: 0 B/s
INFO:  88% (354.4 GiB of 399.0 GiB) in 4m 21s, read: 7.3 GiB/s, write: 0 B/s
INFO:  94% (376.4 GiB of 399.0 GiB) in 4m 24s, read: 7.3 GiB/s, write: 0 B/s
INFO:  99% (398.3 GiB of 399.0 GiB) in 4m 27s, read: 7.3 GiB/s, write: 0 B/s
INFO: 100% (399.0 GiB of 399.0 GiB) in 4m 28s, read: 672.0 MiB/s, write: 4.0 MiB/s
INFO: backup is sparse: 279.55 GiB (70%) total zero data
INFO: backup was done incrementally, reused 279.74 GiB (70%)
INFO: transferred 399.00 GiB in 268 seconds (1.5 GiB/s)
INFO: adding notes to backup
INFO: Finished Backup of VM 129 (00:04:35)
INFO: Backup finished at 2024-05-17 13:19:41
INFO: Backup job finished successfully
INFO: notified via target `mail-to-root`
TASK OK
This is now a pool with 8 SAS drives and a special vdev, but logbias is set to the default (latency).

- Verification Speed:
Code:
                                                      capacity     operations     bandwidth
pool                                                alloc   free   read  write   read  write
--------------------------------------------------  -----  -----  -----  -----  -----  -----
HDD-SAS                                             1.25T  52.9T     63    586  55.5M   208M
  mirror-0                                           356G  14.2T     17     64  15.5M  57.3M
    wwn-0x5000XXXXXXXXXXXX                              -      -      8     32  7.75M  28.7M
    wwn-0x5000XXXXXXXXXXXX                              -      -      8     32  7.72M  28.7M
  mirror-1                                           304G  12.4T     14     55  13.3M  49.1M
    wwn-0x5000XXXXXXXXXXXX                              -      -      7     27  6.67M  24.5M
    wwn-0x5000XXXXXXXXXXXX                              -      -      7     27  6.66M  24.5M
  mirror-2                                           301G  12.4T     14     54  13.2M  48.5M
    wwn-0x5000XXXXXXXXXXXX                              -      -      7     27  6.58M  24.3M
    wwn-0x5000XXXXXXXXXXXX                              -      -      7     27  6.61M  24.3M
  mirror-3                                           307G  12.4T     14     55  13.4M  49.4M
    wwn-0x50000XXXXXXXXXXX                              -      -      7     27  6.72M  24.7M
    wwn-0x50000XXXXXXXXXXX                              -      -      7     27  6.66M  24.7M
special                                                 -      -      -      -      -      -
  mirror-4                                          8.14G  1.45T      2    356   148K  3.63M
    nvme-SAMSUNG_MZPLJ1T6HBJR-00007_S5XXXXXXXXXXXX      -      -      1    178  74.5K  1.81M
    nvme-SAMSUNG_MZPLJ1T6HBJR-00007_S5XXXXXXXXXXXX      -      -      1    177  73.4K  1.81M
--------------------------------------------------  -----  -----  -----  -----  -----  -----

This is the speed I had previously with the old pool:
Backup speed:
Code:
INFO: starting new backup job: vzdump 150 --storage Backup-SAS --notes-template '{{guestname}}' --remove 0 --mode snapshot --notification-mode auto --node pve-bdr
INFO: Starting Backup of VM 150 (qemu)
INFO: Backup started at 2024-05-03 19:29:43
INFO: status = running
INFO: VM Name: DokuBit2
INFO: include disk 'virtio0' 'Storage-Default:vm-150-disk-0' 100G
INFO: include disk 'virtio1' 'Storage-Default:vm-150-disk-1' 1T
INFO: include disk 'virtio2' 'Storage-Default:vm-150-disk-2' 1T
INFO: include disk 'virtio3' 'Storage-Default:vm-150-disk-3' 1T
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating Proxmox Backup Server archive 'vm/150/2024-05-03T17:29:43Z'
INFO: issuing guest-agent 'fs-freeze' command
INFO: issuing guest-agent 'fs-thaw' command
INFO: started backup task '3fdfd40a-c89a-4adf-bda7-558bfcd26a42'
INFO: resuming VM again
INFO: virtio0: dirty-bitmap status: OK (2.7 GiB of 100.0 GiB dirty)
INFO: virtio1: dirty-bitmap status: OK (44.5 GiB of 1.0 TiB dirty)
INFO: virtio2: dirty-bitmap status: OK (7.5 GiB of 1.0 TiB dirty)
INFO: virtio3: dirty-bitmap status: OK (972.0 MiB of 1.0 TiB dirty)
INFO: using fast incremental mode (dirty-bitmap), 55.7 GiB dirty of 3.1 TiB total
INFO: 3% (1.9 GiB of 55.7 GiB) in 3s, read: 654.7 MiB/s, write: 648.0 MiB/s
INFO: 6% (3.7 GiB of 55.7 GiB) in 6s, read: 597.3 MiB/s, write: 593.3 MiB/s
INFO: 9% (5.1 GiB of 55.7 GiB) in 9s, read: 480.0 MiB/s, write: 480.0 MiB/s
INFO: 13% (7.4 GiB of 55.7 GiB) in 12s, read: 810.7 MiB/s, write: 805.3 MiB/s
INFO: 17% (9.6 GiB of 55.7 GiB) in 15s, read: 728.0 MiB/s, write: 718.7 MiB/s
INFO: 22% (12.4 GiB of 55.7 GiB) in 18s, read: 970.7 MiB/s, write: 786.7 MiB/s
INFO: 26% (14.9 GiB of 55.7 GiB) in 21s, read: 841.3 MiB/s, write: 841.3 MiB/s
INFO: 31% (17.5 GiB of 55.7 GiB) in 24s, read: 900.0 MiB/s, write: 900.0 MiB/s
INFO: 36% (20.1 GiB of 55.7 GiB) in 27s, read: 894.7 MiB/s, write: 894.7 MiB/s
INFO: 40% (22.6 GiB of 55.7 GiB) in 30s, read: 826.7 MiB/s, write: 826.7 MiB/s
INFO: 45% (25.3 GiB of 55.7 GiB) in 33s, read: 928.0 MiB/s, write: 928.0 MiB/s
INFO: 50% (28.0 GiB of 55.7 GiB) in 36s, read: 929.3 MiB/s, write: 929.3 MiB/s
INFO: 55% (30.7 GiB of 55.7 GiB) in 39s, read: 932.0 MiB/s, write: 932.0 MiB/s
INFO: 59% (33.3 GiB of 55.7 GiB) in 42s, read: 885.3 MiB/s, write: 885.3 MiB/s
INFO: 64% (36.1 GiB of 55.7 GiB) in 45s, read: 944.0 MiB/s, write: 944.0 MiB/s
INFO: 69% (38.9 GiB of 55.7 GiB) in 48s, read: 954.7 MiB/s, write: 954.7 MiB/s
INFO: 74% (41.5 GiB of 55.7 GiB) in 51s, read: 890.7 MiB/s, write: 890.7 MiB/s
INFO: 79% (44.1 GiB of 55.7 GiB) in 54s, read: 876.0 MiB/s, write: 876.0 MiB/s
INFO: 84% (46.9 GiB of 55.7 GiB) in 57s, read: 957.3 MiB/s, write: 957.3 MiB/s
INFO: 89% (50.0 GiB of 55.7 GiB) in 1m, read: 1.1 GiB/s, write: 1.1 GiB/s
INFO: 93% (52.3 GiB of 55.7 GiB) in 1m 3s, read: 778.7 MiB/s, write: 776.0 MiB/s
INFO: 97% (54.1 GiB of 55.7 GiB) in 1m 6s, read: 600.0 MiB/s, write: 594.7 MiB/s
INFO: 99% (55.3 GiB of 55.7 GiB) in 1m 9s, read: 426.7 MiB/s, write: 425.3 MiB/s
INFO: 100% (55.7 GiB of 55.7 GiB) in 1m 12s, read: 141.3 MiB/s, write: 141.3 MiB/s
INFO: Waiting for server to finish backup validation...
INFO: backup was done incrementally, reused 3.04 TiB (98%)
INFO: transferred 55.75 GiB in 95 seconds (600.9 MiB/s)
INFO: adding notes to backup
INFO: Finished Backup of VM 150 (00:01:39)
INFO: Backup finished at 2024-05-03 19:31:22
INFO: Backup job finished successfully
INFO: notified via target `mail-to-root`
TASK OK

Conclusion:
- New pool: RAID10 of 8x SAS + special vdev + all the tunings from above, but logbias=latency (default)
- Old pool: RAID10 of 6x SAS + all the tunings from above, but logbias=throughput

1. The special vdev seems absolutely useless to me; I don't see any benefit at all for PBS, not even for verification.
--> It stores metadata + blocks below 128K. I checked zpool iostat during backup/verification: it is barely used, the read/write queue is extremely low and the read/write throughput is not even worth mentioning.
--> The main issue is that nothing finishes faster; there is no backup or verification speed improvement.

2. Even with 2 more SAS drives in the pool, logbias=throughput still outperforms the logbias=latency pool.
-> But I'm going to run some tests this evening comparing the logbias options to see the real difference.

The issue is that I wanted to use logbias=latency (the default), because with throughput the data/metadata gets very fragmented over time, which hurts read speed.
But if the difference between them is that huge, then throughput is the only sensible option for PBS, at least for me.

Cheers :)
 
Let's start with the basic tuning parameters that I use:
Code:
  -o ashift=12 \
  -O special_small_blocks=128k \
  -O xattr=sa \
  -O dnodesize=auto \
  -O recordsize=1M \
That means logbias is at its default (latency) + special vdev.

Tests:
Code:
logbias=latency + special vdev: INFO: Finished Backup of VM 166 (00:15:35)
logbias=throughput + special vdev: INFO: Finished Backup of VM 166 (00:15:37)
logbias=latency / no special vdev: xxxx
logbias=throughput / no special vdev: xxxx

xxxx

Code:
recordsize=1M: INFO: Finished Backup of VM 166 (00:15:35)
recordsize=2M: INFO: Finished Backup of VM 166 (00:15:36)
recordsize=4M: INFO: Finished Backup of VM 166 (00:15:37)

(This post is a work in progress; testing just takes forever, recreating the pool + a 16-minute backup run for each test, etc.)

-------------
Found the ISSUE!!! -> First of all, sorry @Dunuin, I was probably very wrong about the special vdev recommendation.

Okay, further testing makes no sense, as all backup runtimes up to this point were identical.
That explains why I already had 800 MB/s-1 GB/s with just 6 SAS drives.

I have now created a pool with only the 2x Samsung PM1735 NVMe's; I tested those before and they deliver at minimum 2 GB/s write speed.
Code:
zpool create   -o ashift=12   -O xattr=sa   -O dnodesize=auto   -O recordsize=1M   -O mountpoint=/HDD-SAS   HDD-SAS mirror /dev/disk/by-id/nvme-SAMSUNG_MZPLJ1T6HBJR-00007_S55JNCXXXXX /dev/disk/by-id/nvme-SAMSUNG_MZPLJ1T6HBJR-00007_S55JNCXXXX
So in theory that should give 2 GB/s, right?

No, I'm still limited somewhere else to 800 MB/s-1 GB/s; the backup speed is exactly the same as before:
Code:
SAS-POOL: INFO: Finished Backup of VM 166 (00:15:35)
NVME-ONLY: INFO: Finished Backup of VM 166 (00:15:34)

Network? ... Nope.
- Everything here is connected with at least 2x 25G in LACP.
- Migration of VMs between nodes is extremely fast.
- iperf3 between the nodes: 23.5 Gbit/s
- iperf3 between any node and the backup server: 23.5 Gbit/s

The Genoa cluster has Intel E810 NICs, the backup server just a Mellanox ConnectX-4 Lx.
However, it's not a network limitation.
It's not a storage limitation either (the nodes have 8x Micron 7450 Max NVMe's in RAID10, and the backup server has NVMe's that can do at bare minimum 2 GB/s).
It's not a CPU/RAM limitation; all memory channels on all servers are populated with 1 DPC.

The only thing that comes to mind is the interconnect on the backup server, 2x E5-2637 v3.
The network card sits on CPU2 and the NVMe's/SAS (all storage) on CPU1.
But no interconnect on this planet is that slow; we're talking about 800 MB/s-1 GB/s here, not 16 GB/s or anything like that.

So what the hell is my limiting factor?
By the way, the backup server does nothing else besides being PBS, and it does absolutely nothing else while my tests run.
 
Last edited:
Has anyone here reached backup speeds higher than 800 MB/s or 1 GB/s?
Maybe that's some sort of PBS limit.

It's getting weirder!
I created a VM with PBS on another Genoa server; the measured write speed inside the VM is 1.5 GB/s and the read speed around 5 GB/s.
But that's a ZVOL issue I'm aware of; the write speed on the node itself on the ZFS pool is 6-7 GB/s and read around 20 GB/s.

The network is not an issue here, because I migrate VMs at far over 6 GB/s between the Genoa nodes.

And you guessed it: the backup speed tops out at 1 GB/s like before.
Backup server:
INFO: Finished Backup of VM 166 (00:15:35)

PBS in a VM on the second Genoa node:
INFO: Finished Backup of VM 166 (00:15:36)

Local storage (directory on the fast Micron ZFS pool):
INFO: Finished Backup of VM 166 (00:13:21)

Local makes sense: no network stack in between, and the speed went up to 1-1.2 GB/s.

Next step: migrate the VM from ZVOL to directory-based storage...
That indeed improved the read speeds during the backup process to around 20-22 GB/s:
INFO: 6% (69.3 GiB of 1.0 TiB) in 3s, read: 23.1 GiB/s, write: 991.8 MiB/s

BUT, I finally found the ISSUE!!!
It's compression!

ZSTD -> around 800 MB/s to 1 GB/s
LZO -> around 600 MB/s
GZIP -> around 40 MB/s
none -> 4.5 GB/s

What the hell. Could someone from the Proxmox team explain to me why you're not using the multithreaded versions of ZSTD/GZIP/LZO?
ZSTD itself has supported multithreading for years.

But you guys are using it in zstd --threads=1 mode.
Where can we change that?
 
Last edited:
Actually, let's think differently: I could use no compression, because the files get compressed with LZ4 on the backup server anyway, at least via ZFS.

But this is still a bummer. It simply means that 1 GB/s is the maximum backup speed for everyone who doesn't disable compression.

Cheers


EDIT: I'm f..., there is no way to deactivate compression if you want to back up to a PBS.
You can deactivate it only for local storage...
Oh God, it can't get any worse.
 
Last edited:
Okay, it's not ZSTD itself, but a bug somewhere in Proxmox:
https://pve.proxmox.com/pve-docs/chapter-vzdump.html#vzdump_configuration
You can set the "zstd" thread count there.
I did a test backing up to local storage and indeed zstd then runs with 32 threads!

But the backup speed still hits the same limits, no matter whether the target is the PBS VM on another Genoa server, local storage, or my PBS server.
Exactly the same limits everywhere.

If I deactivate compression (for local storage at least), I get a speed bump from 1-1.2 GB/s to almost 5 GB/s.
With zstd at 32 threads or zstd with 1 thread there is no difference :-(
With the switch from GZIP to pigz (32 cores), I get a speed bump from 40 MB/s to 500 MB/s.
But thanks to the Proxmox team that there is a setting for this in vzdump.conf (sorry for the previous post); see the example below.
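For everyone else, the relevant lines in /etc/vzdump.conf look like this (the thread counts are just examples):
Code:
# /etc/vzdump.conf
zstd: 0       # zstd thread count; 0 means use half of the available cores
pigz: 32      # use pigz with 32 threads instead of single-threaded gzip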

So it's something in the compression pipeline that is limiting, not the algorithm itself.
But that's where my knowledge ends; I don't know what it is. And I'm tired, I need to sleep :-(
 
Last edited:
https://bugzilla.proxmox.com/show_bug.cgi?id=5481
I'm not completely sure it's SSL, but there's almost nothing else left. I cannot break the 1 GB/s barrier, no matter which hardware.
And I'm absolutely sure that nobody can, as long as the destination is a PBS.

The alternative that's left is to ditch PBS completely and simply go the standard Linux route, making the server available over iSCSI. That way I can disable compression completely and get the desired speeds of at least 3 GB/s.

Cheers
 
