really slow restore from HDD pool

carl0s · Oct 16, 2024

I have a RAIDZ on PBS with 4x 16Tb SATA HDD.

I am watching a KVM restore on a 10G LAN at 12MiB/s. Target pool is 5x 1.92Tb enterprise SAS 12g ZFS through HBA330 controller.

how can it be so bad? I understand it's many small files, but 12MiB/s ? There must be a way to improve this? I read that SSDs are recommended, but who would use wearout&high-cost SSDs for bulk backup storage?

I initiated the restore from PVE rather than from PBS if that matters. It is restoring to a different node (non-clustered)

carl0s · Oct 17, 2024

That was a restore to an older node (48 x Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz (2 Sockets), target datastore 5x 1.92TB RAIDZ SAS 12G ZFS HBA330 no controller raid.

I am now testing a restore to the original node which is a modern PowerEdge R7615 with the fastest single-core performance I could find at the time (32 x AMD EPYC 9174F 16-Core Processor (1 Socket), 4x Intel 3.84TB on latest PERC12 NVMe controller - H965i, controller RAID5, no ZFS, LVM Thin.).

It seems that the restore performance is double (half as bad) on the newer server, which makes me think this is not IO limited, and certainly not caused by the spinning HDDs on PBS.

It is still not good, but it is half as bad. Any ideas?

Attached are the identical restore jobs - one to older node, one to newer node. Both PVE 8.2 (8.2.7 and 8.2.4)

older node which is 48 x Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz (2 Sockets), target datastore 5x 1.92TB RAIDZ SAS 12G
Older node manages 12 MiB restore (on latter part, second disk especially)
took 1hr
very poor performance for restoring 2nd disk

Newer node is still awfully slow, but half as slow as the other one. Both on the same 10G LAN.
I can see with btop on the pve node that newer one is chugging along at ~30MiB/s over the network while the older one was 12MiB/s.
The strange thing is, it's even really slow to make progress when there is nothing coming over the network and pbs box shows 100% idle in top (around ~36 - 39% of the restore of the second larger disk).

restore to newer node
start 23:35
done at 00:04
30 minutes.

Johannes S · Oct 17, 2024

Even for the newer node the data to restore lays on the RAIDZ HDD datastore right? Then my guess is, that it behaves like expected with the number of small files of PBS. I also recall that in a lot of discussions here RAIDZ was called out for being slow compared to a ZFS mirror setup.
If you can install some SSDs in your backup host it might be worth to setup ZFS with a special device for the metadata to speed up the operations.
But this needs serious consideration and planning since in case of a loss of the special device you can't access your data anymore.

I recall that in the german subforum @Falk R. and other consultants explained their setup for their customers PBS servers at Hetzner:
- Two relative small (and thus affordable) SSDs
- Two 10 TB HDDs

They would be configured like this:
- The HDDs are configured as ZFS mirrors für the datastore
- The SSD are split up in two partitions which are ZFS mirrors as well. One smaller for the PBS operating system and one larger one for usage as special device of the datastore.

A similiar setup should also work with your local PBS, the mirrors take care that one broken disk doesn't lead to a complete data loss.

If you unterstand German (or use Google Translate/Deepl) here is the thread: https://forum.proxmox.com/threads/ist-es-möglich-pbs-mit-aws-s3-blockspeicher-zu-betreiben.152790/

Maybe Falk or some other of the participants can explain it better than me.

One problem in your case could of course be the existing data store, which you may have to save somewhere else before setting up everything with a more performance-oriented approach.

Hope this helps, best regards, Johannes.

carl0s · Oct 17, 2024

That is correct in terms of where the data to restore lays.

Thank you for your pointers and links. I have to go out now but I will do some study over the following week while I am away on holiday.

UdoB · Oct 17, 2024

carl0s said:
I understand it's many small files, but 12MiB/s ?

If "atime" is active on the datastore then each read of a single chunk would require (at least) one additional write. All in all you may need several head movements, (~ 10 ms each) for each and every chunk. And with RaidZ you only have the IOPS of a single drive. Rotating rust is really slow in this usecase.

The hint regarding the "Special Device" from @Johannes S is really important and speeds up the thing drastically.

carl0s · Nov 2, 2024

UdoB said:
If "atime" is active on the datastore then each read of a single chunk would require (at least) one additional write. All in all you may need several head movements, (~ 10 ms each) for each and every chunk. And with RaidZ you only have the IOPS of a single drive. Rotating rust is really slow in this usecase.

The hint regarding the "Special Device" from @Johannes S is really important and speeds up the thing drastically.

I believe atime has to stay active for garbage collection. There is some contradictory discussion on it, but if I understood correctly, it can maybe be changed to relatime but that's all, and it is already mounted as such:

RAIDZ on /mnt/datastore/RAIDZ type zfs (rw,relatime,xattr,noacl,casesensitive)

Anyway how do we account for it taking half as long to restore from the same PBS box, but to the newer PVE node?
When I get more time I will look at adding a special device. Actually I plan to replace the hardware on the PBS machine anyway. The current chassis can only hold 4x 3.5" HDD internally and they are used for this RAIDZ 4x16TB (plus a single SSD for OS boot tucked inside).
I have a PowerEdge T330 that I can repurpose which has 8x 3.5 hot plug bays and twice the RAM.

When I did my initial backups to this PBS box, I was seeing write speeds to the RAIDZ of somewhere in the region of 500MB/s over the 10G LAN. I believe this has since reduced quite a bit, but that's hard to tell because all backups are incremental now. Restores though, they are just awful.

UdoB · Nov 2, 2024

carl0s said:
I was seeing write speeds to the RAIDZ of somewhere in the region of 500MB/s

That bandwidth does not help at all. IOPS is key, not the maximum speed for large files.

I know that the following is not new, has been discussed several times and does not solve your problem, but let me add it here nevertheless as a random datapoint. This is from one machine in my Homelab:

One of my pools consists of four 6TB Western Digital / Seagate drives. And I had the glorious idea to go for RaidZ2 without an adequate "Special Device". I added a simple (read-) cache later, but that does not help at all. To show what this gets me to I utilize fio like this:

Code:

zfs create -o compression=off rpool/fio

Code:

/rpool/fio# fio --name=randrw --ioengine=libaio --direct=1  --rw=randrw --bs=2M --numjobs=1 --iodepth=16 --size=20G  --time_based  --runtime=60
randrw: (g=0): rw=randrw, bs=(R) 2048KiB-2048KiB, (W) 2048KiB-2048KiB, (T) 2048KiB-2048KiB, ioengine=libaio, iodepth=16

  read: IOPS=20, BW=40.9MiB/s (42.9MB/s)(2458MiB/60108msec)
  write: IOPS=21, BW=42.3MiB/s (44.3MB/s)(2542MiB/60108msec); 0 zone resets

That bs=2M was chosen because the PBS chunks are 4M uncompressed. May be there are more 2MB than 4MB. For comparison the same with 4MB:

Code:

:/rpool/fio# fio --name=randrw --ioengine=libaio --direct=1  --rw=randrw --bs=4M --numjobs=1 --iodepth=16 --size=20G  --time_based  --runtime=60
  read: IOPS=16, BW=66.4MiB/s (69.6MB/s)(3984MiB/60014msec)
  write: IOPS=17, BW=70.4MiB/s (73.8MB/s)(4224MiB/60014msec); 0 zone resets

From tests like this I get my totally (not) surprising understanding that rust is slow ;-)

If you really need to utilize classic hdd go for multiple vdevs (use mirrors!) AND add a "Special Device" early in the process. If you add it later you need to send/recv all the data (or copy it once by other means).

In any case those "20 IOPS" are barely useable...

At the same time I can proof (wrongly) to be able to write with 300 MB/s:

Code:

:/rpool/fio# dd if=/dev/urandom bs=2M count=2000 of=4gb.dd status=progress 
4171235328 bytes (4.2 GB, 3.9 GiB) copied, 13 s, 321 MB/s
2000+0 records in
2000+0 records out
4194304000 bytes (4.2 GB, 3.9 GiB) copied, 13.1896 s, 318 MB/s

The above is lying, the proof is this the same command plus an extra "sync":

Code:

:/rpool/fio# time ( dd if=/dev/urandom bs=2M count=2000 of=4gb.dd status=progress ; sync )
4173332480 bytes (4.2 GB, 3.9 GiB) copied, 13 s, 321 MB/s
2000+0 records in
2000+0 records out
4194304000 bytes (4.2 GB, 3.9 GiB) copied, 13.1658 s, 319 MB/s

real    1m6.581s
user    0m0.008s
sys     0m9.233s

It is async writes with normal ZIL. The "dd" ends early after 13 seconds. But the "real" time measures 66 seconds = 5 times longer.
So the bandwidth is more like 60 MB/s than 300 MB/s...

The result gets worse if dd is directly requesting sync writes:

Code:

:/rpool/fio# dd if=/dev/urandom bs=2M count=1000 of=2gb.dd status=progress oflag=sync
2090860544 bytes (2.1 GB, 1.9 GiB) copied, 168 s, 12.4 MB/s
1000+0 records in
1000+0 records out
2097152000 bytes (2.1 GB, 2.0 GiB) copied, 168.575 s, 12.4 MB/s

Now I am down to 12 MB/s. This is what I meant with "performance of a single spindle" and "multiple head movements due to meta data handling etc.".

This was a write-test. For reading data a cache may help, but only for repetitive requests of the same data.

carl0s · Nov 2, 2024

Thank you for the thorough reply! It is no problem for me to offload the data and recreate the volume. I just need to spend some time investigating required size of SSD special devices.

carl0s · Nov 2, 2024

Also, still no explanation why it is 30MiB/s to the newer target PVE node?
admittedly, I have only run each restore once. Perhaps I will run again but in reverse order - second target node first, then the other one.

UdoB · Nov 2, 2024

carl0s said:
Also, still no explanation why it is 30MiB/s to the newer target PVE node?

Yeah, that question is not really answered. So let's take a look at my bad-example-PBS-datastore with that RaidZ2:

How fast can I read some data from the chunk store?

Code:

.chunks# time cat ffe*/*  > /dev/zero 
real    0m31.925s
user    0m0.011s
sys     0m0.277s

How much data is that?

Code:

.chunks# du -k -s ffe* | cut -f1   | paste -sd+ -  | bc
940545

That is KiB, so nearly 1 GB.

Code:

.chunks# echo $(( 940545 / 32 )) 
29392

First result: this is less than 30 MB per seconds.

Just another set of those files:

Code:

.chunks# time cat eee*/*  > /dev/zero 
real    0m31.282s
user    0m0.017s
sys     0m0.260s

Cool! The same duration for reading those chunks.

Code:

.chunks# du -k -s eee* | cut -f1   | paste -sd+ -  | bc
895047
.chunks# echo $(( 895047 / 31 )) 
28872

And a third measurement:

Code:

.chunks# time cat abc*/*  > /dev/zero 
real    0m28.534s

.chunks# du -k -s abc* | cut -f1   | paste -sd+ -  | bc
875068

.chunks# echo $(( 875068 / 28 )) 
31252

Note that I always took a random set of files to make sure the cache is "cold".

My RaidZ2 can read data from a PBS chunk store with a speed of 30 MB per second.

So from my point of view and with the above actual tests your speed is approved to be expected

carl0s · Nov 2, 2024

Right. so 30 MB /sec is to be expected. Fine. but throughout this discussion it was said that the 12 MB /sec that I see when I restore to the older PowerEdge R630 is because of PBS HDD source. and I am saying how can that be, when I see 2.5* that performance when restoring to a newer PowerEdge T7615. It says to me that a key part of the problem is in restore code perhaps single threaded on PBS side ? Older machine has Xeon E5-2670 v3 (PassMark single thread rating 1690). Newer machine has Epyc 9174F (PassMark single thread rating 3318)

If I add a special device and rebuild the PBS datastore with SSDs, am I going to see greater than 12 MB/sec on the slower node or what?

Johannes S · Nov 2, 2024

carl0s said:
If I add a special device and rebuild the PBS datastore with SSDs, am I going to see greater than 12 MB/sec on the slower node or what?

This is what I would expect yes. You will need to rewrite the data with zfs send/receive after adding them otherwise the special device won't be used for the old data.
And of course even with the special device Performance won't be as with a ssd-only store, so don't expect too much

carl0s · Nov 2, 2024

OK. I have some enterprise 200gb SAS SSDs sitting around that I can use, and I can set up another PBS temporarily to shift the data off to and then back.

Johannes S · Nov 2, 2024

Hm I wondered which size would be approate for the device since ss soon as the special device is filled the small files and metadata will again be written to the HDD so I googled and found this in the truenas forums: https://forums.truenas.com/t/special-vdev-svdev-planning-sizing-and-considerations/5086

carl0s · Nov 2, 2024

Johannes S said:
Hm I wondered which size would be approate for the device since ss soon as the special device is filled the small files and metadata will again be written to the HDD so I googled and found this in the truenas forums: https://forums.truenas.com/t/special-vdev-svdev-planning-sizing-and-considerations/5086

Small files is optional isn't it? I plan to only put metadata on. 0.3% of my 36tb is about 108gb if I done my maths right.

carl0s · Nov 2, 2024

I made a small miscalculation.. I had in my head that my datastore is 36TB, but it's 44TB or therabouts (4x16TB RAIDZ). I still don't understand ZFS results in df

but still at 0.3% for metadata it would be 132GB so those 200GB Intel enterprise SAS disks should be fine and I have 4 of them that I just replaced with 1.92TB drives in that spare/older server that I have been talking about (PowerEdge R630)

Johannes S · Nov 2, 2024

carl0s said:
Small files is optional isn't it? I plan to only put metadata on. 0.3% of my 36tb is about 108gb if I done my maths right.

I come to the same conclusion with your formula however since PBS operates with small files I would expect that the special device would still benefit from using it.

I now also found some old threads from here which are quite worth a look:

In the first link Fabian of Proxmox staff explains that one shouldn't expect much performance gains in the verify jobs since this need to read the actual data (which still lies on the HDD). I would expect then, that restore won't have a big speedup either. The garbage collection on the other hand profits a lot since this jobs mostly works with metadata. One user in the thread went from 18 hours for a verify job with a plain hdd setup to 20 minutes with hdd+ssd special device. So for work which mostly works with the metadata the speedup is HUGE.
It also talks about the impact of using the special device just for metadata (needs a certain ZFS setting), it might yield some performance benefits for stuff like garbage collection but will impact other jobs.
Both links say that one should calculate around 0.02 Ratio or 20 GB special device capacity for 1 TB on a hdd. So for 36 TB it would be between 720 and 800 GB.

carl0s · Nov 2, 2024

It seems there's different comments on that space usage. Perhaps more people are doing 'small files' as well as metadata. I will try with what I have available anyway at first and keep an eye on it.

Johannes S · Nov 2, 2024

Please report about your reults, I'm curious about the speedup and used space of of the ssds

carl0s · Nov 3, 2024

My obvious reason for leaving out 'small files' is that, isn't every PBS backup chunk a fixed-size file? I thought they were all fixed at 4mb? none smaller, none larger?

really slow restore from HDD pool

Well-Known Member

Well-Known Member

Attachments

Active Member

Well-Known Member

Distinguished Member

Well-Known Member

Distinguished Member

Well-Known Member

Well-Known Member

Distinguished Member

Well-Known Member

Active Member

Well-Known Member

Active Member

Well-Known Member

Well-Known Member

Active Member

Well-Known Member

Active Member

Well-Known Member