VM I/O is blocked on ZFS storage

Finn Gerstatt
Feb 25, 2025
Greetings. I have Proxmox 8.3.2 installed on a mirrored zpool with mostly default settings:
code_language.shell:
# zpool list -v
NAME                                          SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
rpool                                        7.27T  69.5G  7.20T        -         -     0%     0%  1.00x    ONLINE  -
  mirror-0                                   7.27T  69.5G  7.20T        -         -     0%  0.93%      -    ONLINE
    ata-HGST_HUH728080ALE604_...-part3  7.28T      -      -        -         -      -      -      -    ONLINE
    ata-HGST_HUH728080ALE604_...-part3  7.28T      -      -        -         -      -      -      -    ONLINE

# zfs list
NAME                       USED  AVAIL  REFER  MOUNTPOINT
rpool                      132G  7.01T   104K  /rpool
rpool/ROOT                32.3G  7.01T    96K  /rpool/ROOT
rpool/ROOT/pve-1          32.3G  17.7G  32.3G  /
rpool/data                99.2G  6.90T    96K  /rpool/data
rpool/data/vm-112-disk-0  7.11G  6.90T  7.02G  -
rpool/data/vm-113-disk-0  15.1G  6.90T  15.0G  -
rpool/data/vm-113-disk-1  15.1G  6.90T  15.0G  -
rpool/data/vm-113-disk-2  15.1G  6.92T    56K  -
rpool/var-lib-vz           104K  50.0G   104K  /var/lib/vz

I also have two guest VMs whose block devices live on zvols.
code_language.shell:
# qm list
      VMID NAME                 STATUS     MEM(MB)    BOOTDISK(GB) PID    
       112 app-disk-test-1      running    512                7.00 1824  
       113 app-disk-test-zvol32k-2 stopped    2048              15.00 0

One guest is running; the other is being restored with dd, which reads an image file and writes it to the VM's zvol. In other words, we read from the host's root FS and write to a VM's zvol, so everything happens within the same mirror. This single dd is enough to make the block devices of the already-running guest completely unresponsive.

The problem seems to manifest once the ARC is exhausted; before that, the running guest can still write to its disk.
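
(For anyone trying to reproduce this: the stall is easy to confirm from inside the running guest with standard tools; the device name below is just the guest's virtual disk.)
code_language.shell:
# inside the guest: the kernel logs hung-task warnings when I/O is stuck for minutes
dmesg -T | grep -i "blocked for more than"

# per-device latency once per second; during the stall %util sits near 100 while
# await grows and almost nothing completes
iostat -x sda 1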

Do you have any idea what I can do to improve parallel writes to the pool? Yes, I know that HDDs are slow, that an SSD cache can improve the situation, that a mirror doesn't give higher IOPS than a single HDD, and that keeping the root FS and the guests on the same mirror isn't the best idea. I can also imagine that something went wrong and I'm seeing write amplification (though I used mostly default settings, shown below, so I doubt it), but what kind of write amplification makes VM disks unresponsive for five minutes straight?

I don't want to mask the problem with thin provisioning, compression, etc., so I turned them off. I also don't want to set the ARC size too high, because in real life you can't dedicate 3/4 of the host's RAM to disk caching. My current ARC size is 6 GiB, but I had more or less the same results with 12 GiB.
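
(For reference, this is roughly how I'm capping the ARC on this host; the value is in bytes, 6 GiB here.)
code_language.shell:
# runtime change (bytes)
echo 6442450944 > /sys/module/zfs/parameters/zfs_arc_max

# persist across reboots (ZFS on root needs the initramfs refreshed)
echo "options zfs zfs_arc_max=6442450944" > /etc/modprobe.d/zfs.conf
update-initramfs -u -k all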

Full history of my zpool:
code_language.shell:
# zpool history
History for 'rpool':
2025-01-16.14:29:53 zpool create -f -o cachefile=none -o ashift=12 rpool mirror /dev/disk/by-id/ata-HGST_HUH728080ALE604_...-part3 /dev/disk/by-id/ata-HGST_HUH728080ALE604_...-part3
2025-01-16.14:29:54 zfs create rpool/ROOT
2025-01-16.14:29:54 zfs create rpool/ROOT/pve-1
2025-01-16.14:29:54 zfs create rpool/data
2025-01-16.14:29:54 zfs create -o mountpoint=/rpool/ROOT/pve-1/var/lib/vz rpool/var-lib-vz
2025-01-16.14:29:54 zfs set atime=on relatime=on rpool
2025-01-16.14:29:54 zfs set compression=on rpool
2025-01-16.14:29:54 zfs set acltype=posix rpool/ROOT/pve-1
2025-01-16.14:29:54 zfs set sync=disabled rpool
2025-01-16.15:08:40 zfs set sync=standard rpool
2025-01-16.15:08:41 zfs set mountpoint=/ rpool/ROOT/pve-1
2025-01-16.15:08:41 zfs set mountpoint=/var/lib/vz rpool/var-lib-vz
2025-01-16.15:08:41 zpool set bootfs=rpool/ROOT/pve-1 rpool
...
2025-01-21.13:09:47 zfs set quota=50g rpool/ROOT/pve-1
2025-01-21.13:09:55 zfs set quota=7168g rpool/data
2025-01-21.13:10:02 zfs set quota=50g rpool/var-lib-vz
...
2025-01-21.14:30:15 zfs set compression=off rpool/data

The HDDs have write caching turned on:
code_language.shell:
# hdparm -W /dev/sda

/dev/sda:
 write-caching =  1 (on)

# hdparm -W /dev/sdb

/dev/sdb:
 write-caching =  1 (on)

The problem persists on both of my ProLiant DL360 Gen10 servers (same hardware configuration: Intel Xeon Silver 4110 CPU, 64 GiB RAM, 2x 8 TB HDDs), so it's not specific to a single machine. The HDDs are connected to an HPE Smart Array controller but are left unconfigured:
hdd-model.png
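
For completeness, a quick way to check how the controller presents the disks to the OS (standard tools, nothing Proxmox-specific):
code_language.shell:
# MODEL/TRAN confirm the drives show up as individual SATA disks, ROTA=1 means spinning media
lsblk -d -o NAME,MODEL,SIZE,ROTA,TRAN

# identify data read straight from the drive (works only if the controller passes ATA commands through)
smartctl -i /dev/sda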

To be precise about what is happening, I ran the following test:
  • The guest VM writes 1 MiB of random data in a loop:
    code_language.shell:
    while(:); do bash -c 'date "+[%T]"; dd if=/dev/urandom of=test.bin bs=1M count=1; '\''rm'\'' -v test.bin'; sleep 1; done
    See vm-writes-1dd.txt in the attachments for the result.
  • The host writes a 15 GiB file to the zvol:
    code_language.shell:
    # dd if=test.bin of=/dev/zvol/rpool/data/vm-113-disk-0 bs=1M status=progress
    16091447296 bytes (16 GB, 15 GiB) copied, 204 s, 78,9 MB/s
    15360+0 records in
    15360+0 records out
    16106127360 bytes (16 GB, 15 GiB) copied, 395,183 s, 40,8 MB/s
    See arcstat-1dd.txt and zpool-iostat-1-1dd.txt in the attachments for the result (a short sketch of how these stats can be collected follows this list).
  • CPU load during test:

    cpu-load-1dd.png
  • Zpool settings: see zfs-get-all.txt and zpool-get-all.txt in the attachments.
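
For completeness, the arcstat and zpool iostat samples referenced above can be collected with the stock OpenZFS tools while the dd runs; roughly:
code_language.shell:
# ARC size and hit/miss rates, one sample per second
arcstat 1 | tee arcstat-1dd.txt

# per-vdev bandwidth and IOPS for the pool, one sample per second
zpool iostat -v rpool 1 | tee zpool-iostat-1-1dd.txt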
 

Forgot to mention that I tried 16k and 32k volblocksize for the zvols without much improvement.
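
(Since volblocksize is fixed at creation time, I had to recreate the test disks after changing it. Roughly, on Proxmox that means setting the block size on the ZFS storage before creating the disk; local-zfs and 16k below are just my storage name and one of the values I tried.)
code_language.shell:
# default volblocksize for new disks on this storage (per-storage "blocksize" option)
pvesm set local-zfs --blocksize 16k

# verify what an existing zvol actually uses (read-only after creation)
zfs get volblocksize rpool/data/vm-113-disk-0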

Also, the VM configs are the following:
code_language.shell:
# cat /etc/pve/qemu-server/112.conf
boot: order=ide2;scsi0
cores: 1
cpu: x86-64-v2-AES
ide2: none,media=cdrom
memory: 512
meta: creation-qemu=9.0.2,ctime=1737472192
name: app-disk-test-1
net0: virtio=BC:24:11:05:B3:11,bridge=vmbr0,firewall=1,tag=100
numa: 0
onboot: 1
ostype: l26
scsi0: local-zfs:vm-112-disk-0,iothread=1,size=7G
scsihw: virtio-scsi-single
smbios1: uuid=104517b7-4e4d-4dae-8803-576874d7ce24
sockets: 1
vmgenid: bc568e9e-b15b-4e2b-8d50-419b6167765a

code_language.shell:
# cat /etc/pve/qemu-server/113.conf
boot: order=scsi0;ide2;net0
cores: 1
cpu: x86-64-v2-AES
ide2: none,media=cdrom
memory: 2048
meta: creation-qemu=9.0.2,ctime=1739975844
name: app-disk-test-zvol32k-2
net0: virtio=BC:24:11:14:B1:3C,bridge=vmbr0,firewall=1
numa: 0
ostype: l26
scsi0: local-zfs:vm-113-disk-0,iothread=1,size=15G
scsi1: local-zfs:vm-113-disk-1,iothread=1,size=15G
scsi2: local-zfs:vm-113-disk-2,iothread=1,size=15G
scsihw: virtio-scsi-single
smbios1: uuid=da0a7799-ddbe-42a5-b00e-bd8320f6fb62
sockets: 1
vmgenid: 3af6ba0e-b1ff-4835-8645-7a7f78e3d0df
 
Indeed, HDDs are slow, ZFS makes them slower still (roughly half the throughput because of double writes and other overhead), and inside a guest it is slower again due to virtualization.
Tuning zfs_dirty_data_max helps to prevent/limit big bursts of writes that otherwise pile up until I/O hangs:
 
Thanks for the reply, @_gabriel. I tried setting different values for zfs_dirty_data_max, but it doesn't help much. Small values like the one in your quote slow I/O down, but it still gets blocked anyway.
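
For reference, I was changing it roughly like this; the 256 MiB figure below is only illustrative, not a recommendation:
code_language.shell:
# runtime change in bytes (256 MiB here); the default is 10% of RAM, capped by zfs_dirty_data_max_max
echo 268435456 > /sys/module/zfs/parameters/zfs_dirty_data_max

# check the value currently in effect
cat /sys/module/zfs/parameters/zfs_dirty_data_max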

What really helped me is setting zvol_request_sync=1.
This effectively limits the queue depth to 1 for each I/O submitter.
After that, I see no more blocked VM I/O. Currently I'm trying to assess whether my setting carries a throughput penalty. Does anyone know whether the zvol queue behaviour can be tuned in a more flexible way than just on/off?
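
For anyone who wants to try the same thing, this is essentially all I changed; the zvol_* tunables available on a given OpenZFS build can be listed from sysfs (on my build there is also zvol_threads, which I haven't explored yet):
code_language.shell:
# runtime: handle zvol requests synchronously instead of dispatching them to taskq threads
echo 1 > /sys/module/zfs/parameters/zvol_request_sync

# persist across reboots
echo "options zfs zvol_request_sync=1" >> /etc/modprobe.d/zfs.conf
update-initramfs -u -k all

# see which other zvol-related knobs this build exposes
ls /sys/module/zfs/parameters/ | grep zvol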
 
> Yes, I know that HDDs are slow, that an SSD cache can improve the situation, that a mirror doesn't give higher IOPS than a single HDD, and that keeping the root FS and the guests on the same mirror isn't the best idea. I can also imagine that something went wrong and I'm seeing write amplification (though I used mostly default settings, shown below, so I doubt it), but what kind of write amplification makes VM disks unresponsive for five minutes straight?

Your best bet is to install a couple of SSDs or NVMe drives, set them up as a separate zpool (mirror), and move the VMs to that. Trying to run everything from a single spinning-disk rpool is going to cause problems, as you have seen.

Separate OS and data, and be sure to have backups.

Note that you're also using HGST spinners, which leads me to believe these drives might be over five years old. Have you run SMART long self-tests and checked the stats lately?
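
Something along these lines with smartmontools (the long test runs in the background and can take several hours on 8 TB drives):
code_language.shell:
# start an extended self-test on each mirror member
smartctl -t long /dev/sda
smartctl -t long /dev/sdb

# check afterwards: self-test log, reallocated/pending sectors, error counters
smartctl -a /dev/sda
smartctl -l selftest /dev/sda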
 