Can't upload ISOs via PVE (Proxmox 9 ZFS IO stall)

brainsoft

Hello,
I used to be able to upload ISOs from the web UI. This is a basic feature, of course it worked; I never had a problem with PVE 8. Somewhere since PVE 9, maybe 9.1, I can no longer upload ISOs. The system IO stall climbs to around 90 and the upload stops dead. It's pretty consistent: the upload gets to maybe 2.4 GB, slows down for a couple of seconds, reaches 2.6 GB or so, and stops cold. It behaves the same on the 1 GbE and 10 GbE connections, it just gets to 2.6 GB faster before stopping.

I've tried a few things: mounting the temp directory on the NVMe, mounting it as tmpfs in RAM, adding an Optane SLOG, and disabling sync writes on rpool, but nothing has helped. I've split the mirrors too; nothing changed.
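For reference, those attempts looked roughly like this (the device path and tmpfs size below are placeholders, not my exact ones):
Code:
# stage uploads in RAM instead of on the SSDs
mount -t tmpfs -o size=8G tmpfs /var/tmp

# add the Optane as a SLOG to the root pool
zpool add rpool log /dev/disk/by-id/nvme-optane-part4

# disable sync writes on the root pool
zfs set sync=disabled rpool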

Yes, these are cheap consumer-grade SATA SSDs. No, they didn't have a problem before. Other than being the destination for ISO uploads, all they handle is boot and a single-instance nightly guest backup (PBS handles the rest). Configurations have changed over time, but this problem isn't isolated to just this one machine. This machine, 'tiberius', is an old Ryzen 2600 gaming rig stuffed to the gills (PCIe expansion cards like a 10 GbE NIC, an 8-port HBA, a GPU, and additional drives carefully laid out to use 100% of the available IO). The other machine, 'regulus', is just a Dell OptiPlex 3040. Both machines have a ZFS mirrored rpool and an M.2 NVMe for guests, and both are running 9.1.1 or 9.1.2. There is also a third node, an old 4th-gen Intel NUC, also on 9.1.1. It isn't powerful enough to pull full gigabit line speed, but it will take a 300-400 Mbps upload and actually complete it, even when it's writing afterwards over NFS back to tiberius.

Since it affects both machines, and since uploading an ISO is a pretty basic function, any help is appreciated. I've gone as far as I can with LLMs; I can paste in details and ChatGPT goes "AHA, this is the smoking gun! ZFS back pressure from TXG flush activity, metadata, etc.", but I'm not a ZFS master. All I know is: it used to work, so what changed in either ZFS or Proxmox 9 that means I can't upload ISOs anymore? It worked even with this cheap hardware before I upgraded.
All 3 nodes went through the PVE 8-to-9 upgrade, but that was weeks/months ago. In fairness, I haven't tried to upload any ISOs in a while, I've mostly been working with LXC templates, so I can't say exactly when it broke.

Code:
Total DISK READ:         0.00 B/s | Total DISK WRITE:         0.00 B/s
Current DISK READ:       0.00 B/s | Current DISK WRITE:      38.84 M/s
    PID  PRIO  USER     DISK READ DISK WRITE>    COMMAND                                                                                                  
  62348 be/4 www-data      0.00 B      2.64 G pveproxy worker
   4295 be/4 root          0.00 B     23.51 M systemd-journald
   3327 be/4 root        124.00 K      9.77 M kvm -id 108 -name homeassistant-vm,debug-threads=on -n~0=pflash0,pflash1=drive-efidisk0,hpet=off,type=q35+pve0
  23593 be/4 root          0.00 B      9.50 M [kworker/u50:2-kvfree_rcu_reclaim]
     87 be/4 root          0.00 B      7.79 M [kworker/u49:0-events_unbound]
  17630 be/4 root          0.00 B      7.73 M [kworker/u50:1-kvfree_rcu_reclaim]
  39124 be/4 root          0.00 B      5.74 M [kworker/u50:0-kvfree_rcu_reclaim]
  32153 be/4 root          0.00 B      5.38 M [kworker/u50:4-kvfree_rcu_reclaim]
  34590 be/4 root          0.00 B      4.42 M [kworker/u49:3-events_unbound]
   2427 be/4 root        112.00 K      4.03 M rrdcached -g
    400 be/0 root        311.00 K      4.00 M [zvol_tq-1]
  25580 be/4 root          0.00 B      3.09 M [kworker/u49:1-kvfree_rcu_reclaim]
   2959 be/4 root          0.00 B      3.00 M pmxcfs
    401 be/0 root        349.00 K      2.28 M [zvol_tq-2]
   3860 ?dif 100999        6.84 M   1158.00 K pihole-FTL -f
    399 be/0 root         89.00 K   1024.00 K [zvol_tq-0]
  52327 be/4 root          0.00 B    992.00 K [kworker/u49:2-kvfree_rcu_reclaim]
   3628 be/4 100000        0.00 B    504.00 K systemd-journald
   4596 be/4 backup        0.00 B    324.00 K proxmox-backup-proxy
  49834 be/0 root         16.00 K    208.00 K [zvol_tq-1]
   3600 be/4 100000        0.00 B    178.00 K systemd-journald
    637 be/4 root         24.00 K    120.00 K [txg_sync]
   1825 be/4 root          0.00 B    120.00 K [txg_sync]
  61313 be/0 root          0.00 B    114.00 K [zvol_tq-1]
   3157 be/4 100000       20.00 K     22.00 K smbd: cleanupd
   2430 be/4 root          0.00 B     11.00 K smartd -n -q never
  61338 be/0 root          5.00 K      8.00 K [zvol_tq-2]
   3039 be/4 root          0.00 B   1024.00 B nmbd --foreground --no-process-group
   3741 be/4 100000        0.00 B   1024.00 B dhclient -4 -v -i -pf /run/dhclient.eth0.pid -lf /var/~.leases -I -df /var/lib/dhcp/dhclient6.eth0.leases eth0
  63356 be/4 postfix       0.00 B   1024.00 B bounce -z -n defer -t unix -u -c

  keys:  any: refresh  q: quit  i: ionice  o: all  p: threads  a: bandwidth                                                                                
  sort:  r: asc  left: DISK READ  right: COMMAND  home: PID  end: COMMAND                                                                                  
CONFIG_TASK_DELAY_ACCT and kernel.task_delayacct sysctl not enabled in kernel, cannot determine SWAPIN and IO %

`watch -n 1 cat /proc/pressure/io`
Code:
Every 1.0s: cat /proc/pressure/io                                                                                         tiberius: Tue Dec 16 09:48:32 2025

some avg10=87.72 avg60=56.26 avg300=18.42 total=283301742
full avg10=85.36 avg60=54.29 avg300=17.72 total=275486165

`watch -n 1 'date; zpool iostat -v'`
Code:
Every 1.0s: date; zpool iostat -v                                                                               tiberius: Tue Dec 16 09:48:51 2025

Tue Dec 16 09:48:51 AM EST 2025
                                                               capacity     operations     bandwidth
pool                                                         alloc   free   read  write   read  write
-----------------------------------------------------------  -----  -----  -----  -----  -----  -----
nvme_guests                                                  6.42G   226G     10     18   137K   211K
  nvme-Samsung_SSD_970_EVO_Plus_500GB_S58SNM0T907161H-part1  6.42G   226G     10     18   137K   211K
-----------------------------------------------------------  -----  -----  -----  -----  -----  -----
rpool                                                        23.7G  74.8G      2     50  75.8K  3.70M
  mirror-0                                                   23.7G  74.8G      2     50  75.8K  3.70M
    ata-Netac_SSD_120GB_YS581296399139783932-part3               -      -      1     25  34.8K  1.85M
    ata-Netac_SSD_120GB_YS581296399139784728-part3               -      -      1     25  41.0K  1.85M
-----------------------------------------------------------  -----  -----  -----  -----  -----  -----
vault                                                        18.7G  25.5T      0      0  1.18K  2.34K
  mirror-0                                                   9.24G  12.7T      0      0    220    458
    ata-WDC_WUH721414ALE6L4_81G25VVV                             -      -      0      0    110    229
    ata-WDC_WUH721414ALE604_9JGJ4LST                             -      -      0      0    109    229
  mirror-1                                                   9.18G  12.7T      0      0    219    458
    ata-WDC_WUH721414ALE6L4_Y5JMJSUC                             -      -      0      0    109    229
    ata-WDC_WUH721414ALE6L4_81G7NSRV                             -      -      0      0    109    229
special                                                          -      -      -      -      -      -
  mirror-2                                                    327M   111G      0      0    772  1.44K
    ata-INTEL_SSDSC2BB120G4_PHWL522301B2120LGN                   -      -      0      0    420    738
    ata-INTEL_SSDSC2BB120G4_PHWL522301LH120LGN                   -      -      0      0    352    738
-----------------------------------------------------------  -----  -----  -----  -----  -----  -----
 
I am currently running Proxmox 9.1.2 and experiencing the same thing. I have attempted uploading an ISO twice now with the same result: it says it uploaded 100%, but then it shows the dialog below, where the only choices are to click the (X) in the upper right or [Abort] (see image below), both of which end with me back at the ISO images screen with no new file uploaded.

It should be noted that I have plenty of storage space, so that's not the issue. While uploading, I can see the temp file in /var/tmp/ show up and grow in size. But by the time the dialog shows what it does below, the file is gone, and it is unclear what's going on.

Not sure what's going on, but this worked the last time I tried this, which like the OP, was about a month ago when I was still on PVE 8.x. Is this some kind of bug in v9.x?

(attached screenshot: 1765940904520.png)
 
@fseeseink do you also use ZFS?
 
And SATA SSDs?

I've seen a few of these IO stall threads, and a SATA SSD is a common factor in them.

Fabian, if there is anything I can do to help, please let me know. I'm just going to bypass the GUI for uploads entirely for now, but there is definitely something going on, maybe a tweak to ZFS, that has really highlighted the weakness we all know exists in these drives under ZFS.

I haven't tried any ZFS tuning yet; I don't want to start increasing the dirty cache size just to try to get the whole thing into RAM if writing to disk at the end is still going to hit a brick wall. Of course, it always comes back to: it used to work. Even if we could just get the parameters to return to the previous behaviour somehow, that would be a big win. I don't care what the graphs say, I just want to be able to upload ISOs for testing.

I wish the Intel server SATA SSDs weren't part of the special vdev so I could try them as boot disks to see if it makes a difference. The current boot disks are the cheapest SSDs imaginable and I don't expect much from them. But that's homelab: fail safely.
 
No ZFS. Local storage.
And 1TB NVMe SSD.
About as simple as it can get for storage.
This is a single node setup.

As it reaches 100% both times and just sits there, it feels like something has changed in how the final step is performed (moving the uploaded file from /var/tmp/ to its final location for ISO files, which appears to be /var/lib/vz/template/iso). But with no error message and no logs indicating anything, it's a bit of a head scratcher.
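If it happens again, I'll try to catch it in the act with something like this running in a second shell during the upload (just a sketch; I'm not sure which daemon actually performs the final move):
Code:
# follow the PVE daemons' logs live while the upload runs
journalctl -f -u pveproxy -u pvedaemon

# in another terminal, watch the temp file and the ISO directory
watch -n 1 'ls -lh /var/tmp/ /var/lib/vz/template/iso/'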

Mind you, otherwise Proxmox is running fine. The VMs are running and doing their thing, and I have had this setup for nearly 2 years now, I think. The only major change was upgrading from v8 to v9, which went very smoothly when I did it. Otherwise I haven't noticed a thing.
 
Strange. I wanted to see if I could upload another ISO but didn't want to use anything big (takes a bit of time). So from here: https://www.alpinelinux.org/downloads/

pulled down specifically https://dl-cdn.alpinelinux.org/alpine/v3.23/releases/x86_64/alpine-standard-3.23.2-x86_64.iso

Then did the [Upload] and it worked!

Granted, this file is only ~363MB (vs the Bazzite ISO, which is ~7GB). But the Proxmox server has 32GB RAM and most of its 1TB SSD free, so it's not hurting for resources even if it were uploading to a RAM cache first. And I already have several other large ISOs (though admittedly the Bazzite one appears to be the largest, as seen below).

Anyway, just another data point in case it helps. At this point I have the image I needed so should be ok. But if I can help test anything, let me know.

(attached screenshot: 1765976352947.png)
 
I tried increasing the ARC from the 3.2 GB it was at to 8 GB, and I was then able to actually complete the upload. I also tried playing with the dirty cache, up to the max_max of 4 GB, and increasing the txg timeout from 5 seconds to 10; nothing else helped the upload stall, and the ZFS write rate is nowhere near the actual threshold of any of the datasets, like 1 MB/sec instead of 100.

I never looked this hard before, so I'm sure a lot of this was happening before too, but it seems excessive.
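For anyone who wants to try the same, runtime-only tweaks along these lines will do it (a sketch, not necessarily exactly how I set mine; the values are the ones I tried):
Code:
# cap the ARC at 8 GiB (mine was sitting around 3.2 GiB)
echo $((8*1024*1024*1024)) > /sys/module/zfs/parameters/zfs_arc_max

# allow up to 4 GiB of dirty data before ZFS starts throttling writers
echo $((4*1024*1024*1024)) > /sys/module/zfs/parameters/zfs_dirty_data_max

# stretch the txg commit interval from the default 5 s to 10 s
echo 10 > /sys/module/zfs/parameters/zfs_txg_timeout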
 
Workarounds:

On pve server:
adduser yourusernamehere

Use WinSCP to upload ISOs to /home/yourusernamehere and move them using the PVE node Shell (as root) to /var/lib/vz/template/iso/
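Plain scp works too if you don't want WinSCP (hostname and filename below are just examples):
Code:
# from your workstation
scp some-distro.iso yourusernamehere@pve-host:/home/yourusernamehere/

# then in the PVE node Shell, as root
mv /home/yourusernamehere/some-distro.iso /var/lib/vz/template/iso/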

OR chown yourusernamehere /var/lib/vz/template/iso/ (you might have to chown more directories upwards to /var in order to get proper write access for a non-root ID) and use WinSCP to upload them there directly

You might also add /home/yourusernamehere as a Directory in Datacenter / Storage and define it as "ISO image" capable, then there's no need for /var/lib/blahblahblah

https://unix.stackexchange.com/questions/5860/how-do-i-recursively-check-permissions-in-reverse
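If you prefer the CLI for the Directory-storage option above, something like this should do it (the storage ID is just an example):
Code:
pvesm add dir home-isos --path /home/yourusernamehere --content iso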
.
Install MobaXterm on Windows (or use WSL2), then use tar and netcat to upload directly to the PVE server (no scp, faster)
.
^^ REF: https://search.brave.com/search?q=tar+netcat+upload&summary=1&conversation=9d9f5654285bf9e7feb3ed
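Bare-bones version of the tar + netcat approach (port and filename are arbitrary examples; no encryption, so LAN use only, and flags differ slightly between netcat flavours):
Code:
# on the PVE node: listen and unpack straight into the ISO directory
nc -l -p 7777 | tar -x -C /var/lib/vz/template/iso/

# on the client (WSL2 / MobaXterm shell); add -N (OpenBSD nc) or -q 0 (traditional nc) so the sender closes when done
tar -c -f - some-distro.iso | nc pve-host 7777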
.
https://github.com/kneutron/ansitest/tree/master/proxmox

Use the symlink-samba-isos script to share all your ISOs to the cluster / multiple instances over Samba; the script will recurse the share and flat-map the .iso files to /var/lib/vz/template/iso/ -- works well and can be implemented on multiple PVE servers to have a single ISO source. I run it nightly in cron to update.
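The core of the flat-map idea, if you just want a one-liner instead of the full script (the mount point is an example; this is not the actual script from the repo):
Code:
# symlink every .iso found under the Samba mount into the local ISO directory
find /mnt/pve/samba-isos -type f -name '*.iso' -exec ln -sf -t /var/lib/vz/template/iso/ {} +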
 