High I/O delay, fluctuating transfer rates

AlpsView

New Member
Apr 1, 2025
I'm testing PBS at the moment to find out whether it will be the solution for my backup needs.
  • PBS is running in a VM on PVE.
  • The VMs are stored on local NVMes.
  • An external USB 3.0 hard disk is passed through to PBS. The filesystem is ext4.
  • Backups are running using snapshots.
  • Backups are encrypted.
During the backup process I can see high to very high iowait on both PBS and PVE, most of the time 40-50%, up to almost 100% at peak.

[screenshot: iowait graph]


On PVE, in the running backup job's log, I can see the read/write rates going up and down in waves, with write rates ranging from over 300 MiB/s down to under 30 MiB/s.

[screenshot: backup job log with fluctuating read/write rates]


I'm trying to understand where the problem comes from. My conclusions so far:
  • Encryption needs CPU power, but the CPU never comes near 100%, so it doesn't seem to be the bottleneck.
  • However, the iowait indicates the CPU is waiting.
  • So the question is: is it waiting to read more data, or waiting to write data?
  • Since read rates are constantly higher than write rates, I conclude the CPU is reading at the highest possible speed, processing whatever it gets, and then trying to write it to the target drive. That write path does not seem to be fast enough, which is why we see the iowait.
  • USB 3.0 (SuperSpeed) should be able to do 5 Gbps ≈ 400 MB/s net. At the very beginning of each transfer, write rates start somewhere near 400 MB/s. I see this as "proof" that the USB drive is indeed operating at SuperSpeed. However, rates drop quickly to a fraction of the initial rate and move in waves after that, never coming near 400 MB/s again.
  • These waves look to me like a faster cache in front of the actual hard disk being filled faster than the disk itself can write. So each time the cache is full, the CPU goes into iowait while the disk drains the cache. I'm wondering whether this cache is on the disk, on the USB controller, or ...
Is my conclusion correct?
Is there anything I can improve in this case? Are there any BIOS settings that might improve the transfer/cache handling? Anything else, other than getting a faster hard disk (I'm aware these external USB drives are rather low-end performance-wise, so maybe this is just what it is)? A different filesystem?

For comparison, I had been doing backups to a second internal NVMe disk, where I saw much faster rates and didn't see these waves. I don't remember whether there was iowait, as I wasn't focusing on it then, but I guess there wasn't, as I would probably have noticed it.
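If it helps anyone reason about this: the cache-fill hypothesis in my bullet points above can be reproduced with a toy simulation. All numbers below (cache size, bus rate, disk rate, resume threshold) are made up; the point is only that a bounded write-back cache in front of a slower disk produces exactly this wave pattern of bursts followed by stalls (iowait):

```python
# Toy model of a fast write-back cache in front of a slow disk.
# All parameters are invented for illustration only.

def simulate(seconds, cache_mb=1024, bus=400, disk=120, low_water=256):
    """Return the host-visible write rate (MB/s) for each second."""
    cache, stalled, rates = 0.0, False, []
    for _ in range(seconds):
        cache = max(0.0, cache - disk)      # platters drain continuously
        if stalled and cache <= low_water:
            stalled = False                 # cache drained enough, resume
        accepted = 0.0
        if not stalled:
            accepted = min(bus, cache_mb - cache)
            cache += accepted
            if cache >= cache_mb:
                stalled = True              # cache full -> host sees iowait
        rates.append(accepted)
    return rates

rates = simulate(60)
print(max(rates), min(rates[5:]))           # bursts at bus speed, stalls at 0.0
```

Long-term, the average accepted rate converges to the disk's sustained rate, which would match the observation that the peaks never return to the initial ~400 MB/s.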
 
Last edited:
This is how the backup process looked on PVE (showing only the last part of the backup run, but it looked the same the whole time):

[screenshot: backup job log]
 
An external USB 3.0 harddisk is passed through to PBS.

That's a brave attempt. It is not clear from your post, but I assume rotating rust.

PBS needs IOPS, IOPS and also IOPS. Sooner or later you'll encounter timeouts and need to fight to get the data read back successfully. My very first PBS was so slow that listing the backups was impossible...

Background in a nutshell: PBS stores each and every backup in chunks of 2-4 MiB (4 MiB minus compression). For each of them the head has to move physically, possibly multiple times, plus further movements for metadata. That's just... slooooow.
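To put rough numbers on that (my own back-of-the-envelope, not from the PBS docs; the 10 ms average seek time is an assumed figure for a desktop-class HDD):

```python
# Back-of-the-envelope: why one head movement per chunk hurts.
# Assumptions: 4 MiB average chunk size, ~10 ms per random seek,
# 1 TiB of backup data. Metadata seeks are ignored, so reality is worse.

TiB = 1024 ** 4
MiB = 1024 ** 2

chunks = TiB // (4 * MiB)        # number of chunks for 1 TiB of data
seek_seconds = chunks * 0.010    # one 10 ms seek per chunk

print(chunks)                    # 262144 chunks
print(round(seek_seconds / 60))  # ~44 minutes spent on seeks alone
```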

If you really need to use a classic HDD, go for "classic" vzdump backups instead of PBS. That creates one single large file, which will work without this drama.


For Reference, the requirements: https://pbs.proxmox.com/docs/installation.html#recommended-server-system-requirements
 
Thanks a lot for your input and the insights, UdoB.

Yes, it's plausible the bottleneck is the hard disk. I know that SATA hard drives are at the low end of performance. I just didn't expect to see these waves, especially these "valleys". Might just be because I have never seen this before and wasn't aware of how much the write pattern influences the number and duration of head movements.

From the amount of data transferred divided by the total time the backup took, it averages out at around 43 MB/s (initial backup run, so a full run without dedup). Given that this is my homelab, I could live with this if it's not going to get much worse over time, especially since the amount of data that changes between two runs is rather small. This is why I wanted to try PBS instead of vzdump: with the latter, only full backups are possible, while PBS works with deduplication, consuming only a small fraction of additional storage in subsequent runs, and since less data has to be written, also completing faster. So even with a rusty hard drive, I thought PBS might be the better solution in terms of duration and storage needed.
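For what it's worth, the arithmetic behind that hope (the 43 MB/s is from my first run; the data size and change rate below are hypothetical example values):

```python
# Illustration of why dedup should keep subsequent runs short.
# avg_rate is the observed average of the initial full run; full_gb
# and changed are hypothetical example values, not measured ones.

avg_rate = 43          # MB/s, observed average of the first full run
full_gb = 500          # hypothetical size of a full backup
changed = 0.01         # hypothetical 1% of chunks new per run

full_minutes = full_gb * 1024 / avg_rate / 60
incr_minutes = full_minutes * changed

print(round(full_minutes))      # ~198 minutes for a full run
print(round(incr_minutes, 1))   # ~2.0 minutes of actual writing per incremental
```

Real incremental runs also pay for reading and hashing the source data, so actual times will be higher than the pure write time.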

I just completed a second run (about 5 h after the first one, with very little changed: some logs, some wiki entries, some emails, ...). This one only took around 8 minutes, and iowait was minimal. However, I'm a bit alarmed that you write that in your experience, in such a setup, even listing the backup metadata failed after a while. Is this because you were holding backups forever, never pruning old ones?
 
However, I'm a bit alarmed that you write that in your experience, in such a setup, even listing the backup metadata failed after a while. Is this because you were holding backups forever, never pruning old ones?

The age is irrelevant. The culprit is the sheer number of chunks.

It was a separate setup with a RAID6 (no ZFS), built only with HDDs. That's of course the worst case, as IOPS are basically identical to a single drive. Nevertheless, I wanted to give it a try while already expecting problems from the beginning, because I wanted to learn the actual behavior on my hardware.

It worked, at first. As the number of stored backups grew, so did the number of chunks. After several weeks(!) and a few (still single-digit, iirc) terabytes stored, the web GUI could no longer list the backups: the first click just got a timeout, showing an empty table as the result. When I clicked a second time, most of the backup listing was already in cache, and this second click worked. This started to happen gradually, depending on cache+buffers; after another two weeks the effect was reproducible every time. It was clear that this was not acceptable :cool:

I do not remember whether an actual restore worked at that time. It probably wasn't necessary, so I did not test it in that situation...

So my solution (better: workaround) was to introduce fast metadata storage. This approach is possible for ZFS (with a fast "special device" added before much data is filled in) and also (for example) for btrfs on a Synology. But not for your single HDD.

Meanwhile I've set up other instances of PBS, all in the low two-digit TB range. Most of them use ZFS with mirrored vdevs plus the mentioned "special device". I never had another problem after the above story.

Restore is still slow with this construct, as the data is still stored on HDDs, but in my setups it works reliably. That said... I think I should run some more restore tests to verify that statement... :-)
 
That's a brave attempt.

LOL. I have one of these!

It started as a joke. I grabbed a couple of large USB external drives from one of our colos and took them back to my office, where we did not have a NAS.
(I think I was going to do passthrough with ESXi and a VM to make a NAS. Then Broadcom screwed us all.)

I stood up Rocky Linux on a mini Dell desktop. Plugged in the drives. Sorted out all the mounting BS so they always work regardless of port. It was fun.
Built out NFS and SMB shares. Got all the perms and stuff sorted. More fun.
Installed KVM. Learned to work with it directly, without PVE. Set up a huge datastore on one of the USB mounts. Lotta fun.
Installed PBS as a guest. (Several times. I was new.) Gave it a big disk on the big USB datastore.

So ya, it runs like cr@p. What would you expect?
It was a test ground for several technologies and still serves a purpose in my office today.
And still runs like cr@p. Keeps workin.

....
BTW... the OP seems both really on point, but also like they don't want to believe themselves.
This is true: "a faster cache in front of the actual hard disk being filled faster than the disk itself can write".
The rest of this post is just discussion. It was all resolved at that sentence.
 
LOL. I have one of these!

Sure. A lot of "special" things do work in a homelab. Until they don't ;-)

The most important aspect of my story is that one should know that some use cases are not "officially supported", even though they may seem to work at first glance.
 
I'm actually trying to differentiate between "is slow", "is not reliable" and "does not work". From a homelab point of view (sorry, I forgot to point this out from the very beginning), "is slow" might be acceptable. "Is not reliable", when it comes to data consistency, is a red line even in the homelab. And "does not work" we don't need to discuss. In the end, for private use the price tag surely plays a role (I'm not working in a DC where enterprise-grade industrial SSDs are available on a self-service buffet à discrétion ;-)) and would justify some trade-off in transfer rates. That's why I was asking @UdoB about his scenario, his experience, and the "root cause" of the problem he observed. If, for example, his backup volume is 20 times mine and the number of chunks is the driver, I might conclude that if his problems started after, say, 2 months, mine would start after 40 months, which might be an acceptable mid-term solution. And if there are tweaks that could improve the situation, a couple of little improvements add up.

So, my takeaways from all of the above for my situation:
- Without a high-IOPS storage system, use full backups on PVE; don't use PBS.
- Better to use SSDs instead of HDDs. If SSDs are not possible, go for faster HDDs.
- USB 3.x as an interface should be fine: with 5 Gbps max it's in the same league as SATA III SSDs.
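On that last point, the bandwidth arithmetic (the 8b/10b encoding cost is part of the USB 3.0 spec; the additional ~20% protocol-overhead factor below is an assumption):

```python
# USB 3.0 SuperSpeed throughput ceiling.
# 5 Gbps line rate with 8b/10b encoding (8 data bits per 10 line bits);
# the additional 20% bulk-protocol overhead is an assumed figure.

line_rate_gbps = 5.0
payload_mb_s = line_rate_gbps * 8 / 10 / 8 * 1000   # after 8b/10b encoding
practical_mb_s = payload_mb_s * 0.8                 # assumed protocol overhead

print(payload_mb_s, practical_mb_s)                 # 500.0 400.0
```

That ~400 MB/s matches the burst rates seen at the start of each transfer earlier in the thread.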

SSDs make me fear the recurring costs. From what I read about wear, enterprise disks are needed. And when doing full backups, quite a bunch of enterprise disks are needed.

Rain and backup problems... not my happiest Sunday :-)
 
From what I read about the wear, enterprise disks are needed.
For PBS? I doubt that. PBS, as a backup system, writes new data only and reads it once in a blue moon (or for scheduled verification). Note that HDD + metadata on SSD is a special case, though. "Enterprise class" is...
  • highly recommended for running PVE with multiple active VMs hammering their virtual drives
  • always recommended for all use cases which require high IOPS and/or write performance
  • usually recommended when using ZFS; it is not a technical requirement, but ZFS simply writes more volume, and more often --> higher wearout, and slow on cheap SSDs without PLP
Said differently: "enterprise class" is nearly always "better", but it is usually not required.

Disclaimer: just from my own experience with no hard evidence --> my opinion; ymmv.

(( In my homelab I use second-hand enterprise-class SATA, because of the price. However, several NVMes here have no PLP and are also not redundant (mini PC, only a single slot). This is not recommended and I do expect trouble sooner or later, but I am prepared for that. ))
 
What's interesting to me is that I/O delay is being measured even when PBS is not running a job. My PBS is in an LXC on TrueNAS. The disks are a RAIDZ2 of HDDs, and I am seeing spikes of 40% in PBS while running an rclone over NFS on the TrueNAS (reading from a Synology, writing to the TrueNAS).

I am also interested in whether the I/O delay matters if it's a general measure like this.
What's the right way to reduce the delay? Is it purely, on a RAIDZ2, adding a metadata special vdev (this machine has some unused Optanes, so I can certainly do that!)?

(This copy job is basically a one-time task; most of the time my system will be lightly loaded.)
 
What's the right way to reduce the delay? Is it purely, on a RAIDZ2, adding a metadata special vdev?
I can't give hard evidence, so... just guessing. In my use case an added special device ("SD") made a big difference overall, but not for all operations: the data part of sync writes is still handled by the "old" disks, just as slowly as without an SD.

One problem I want to mention explicitly: once added, you cannot remove it anymore. Experimenting on a pool which already stores data might be a bad idea if the result doesn't go as expected. And an SD is only used for new data, written in the future, so "old" data does not benefit.

If I were going to add a special device to a pool already in use, I would look at man zpool-checkpoint and use it. (Besides real backups, of course.)
 
once added, you cannot remove it anymore.

If I were going to add a special device to a pool already in use, I would look at man zpool-checkpoint and use it. (Besides real backups, of course.)
Yes, I was aware of the non-removable nature, but it's always worth telling people!

And thanks for giving me a new command to learn. I did a quick scan; it doesn't seem to have any ability to let people revert to checkpoints without the metadata vdev? Or rather, once one has a checkpoint, it looks like one can't add things?
 
I did a quick scan; it doesn't seem to have any ability to let people revert to checkpoints without the metadata vdev? Or rather, once one has a checkpoint, it looks like one can't add things?
There is only one single checkpoint, if any. It is "global": if you rewind, all datasets, all zvols and all snapshots are restored.

I expect that to also include topology changes, like adding a vdev. The man page says "... prohibits the following zpool subcommands: remove, attach, detach, split, and reguid"; it does not list "add special device vdev". (Maybe they mean it by listing "attach".)

But... I never used it for the discussed purpose myself! That's why I said "I would look at it".

Brutal changes like this one should always get tested in a sandbox before touching a productive environment.
 
Brutal changes like this one should always get tested in a sandbox before touching a productive environment.

This made me curious, so I went and verified the actual behavior of a "checkpoint" with regard to changing the topology of a pool, especially by adding a "special device". I did this on a throw-away system, of course:

Code:
root@pnz:~# zpool checkpoint rpool
root@pnz:~# zpool status
  pool: rpool
 state: ONLINE
checkpoint: created Sun May 18 10:08:56 2025, consumes 496K
Code:
root@pnz:~# zpool add rpool special /dev/sdb
Code:
root@pnz:~# zpool list -v
NAME                                           SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
rpool                                         35.5G  1.87G  33.6G    2.23M         -     1%     5%  1.00x    ONLINE  -
  scsi-0QEMU_QEMU_HARDDISK_drive-scsi0-part3  31.5G  1.86G  29.1G    2.23M         -     1%  6.01%      -    ONLINE
special                                           -      -      -        -         -      -      -      -         -
  sdb                                         4.99G  1.41M  4.50G        -         -     0%  0.03%      -    ONLINE

Code:
root@pnz:~# zfs create rpool/dummy-dataset
root@pnz:~# zfs list
NAME                  USED  AVAIL  REFER  MOUNTPOINT
rpool                1.86G  28.2G    96K  /rpool
rpool/ROOT           1.86G  28.2G    96K  /rpool/ROOT
rpool/ROOT/pve-1     1.86G  28.2G  1.86G  /
rpool/data             96K  28.2G    96K  /rpool/data
rpool/dummy-dataset    96K  28.2G    96K  /rpool/dummy-dataset
rpool/var-lib-vz       96K  28.2G    96K  /var/lib/vz

Now, how to remove the SD?

It is clear that this can not work:
Code:
root@pnz:~# zpool remove rpool  /dev/sdb
cannot remove /dev/sdb: checkpoint exists
Code:
root@pnz:~# zpool import --rewind-to-checkpoint
no pools available to import

It CAN NOT work from within the running system with an already imported pool. My first try, "Boot --> Recovery Mode", will not help either, as it has already imported the rpool!

So I booted the installation .iso and selected “Install Proxmox VE (Terminal UI, Debug Mode)“. Then:
  • zpool import - listed the pool in question successfully
  • zpool import --rewind-to-checkpoint rpool -f - does its job. This eliminates all changes made since the checkpoint.
  • zpool export rpool - I just wanted to exit in a clean way. Not a good idea...?
Reboot

This dropped me into the initramfs shell, unexpectedly (for me), probably because I "exported" that pool. So I just did "zpool import rpool" and it worked fine. Exit...

The “real” system boots, all data from before the checkpoint is there, no errors of any kind are reported, and the special device is gone!

Conclusion: works as expected :)

----
Edit: all this effort with booting from another medium goes away if we are talking about a separate/independent ZFS pool. Separating the "operating system" from "data / virtual disks" is still best practice! But in a homelab with mini PCs there are some constraints, so several users will "mis-use" the rpool for VMs too...
 