[SOLVED] help me make sense of zpool iostat output

lucius_the

Member
Jan 5, 2023
Hi,
Please help me make sense of `zpool iostat` output.

I have a server with 4 HDDs arranged in a RAID-10-style zpool (2 mirror vdevs).
I am currently running a sync from another PBS over 1 gigabit connection. Transfer speed is about 1 Gbps. PBS UI, iotop and htop are all showing disk writes of around 100 MB/s.

zpool iostat shows 200 MB/s of writes, which I don't understand, and I couldn't find much (or anything) about this on the Internet. Does anyone have an idea what's going on here? Besides transferring chunks from another PBS instance, I'm not doing anything else on this server. It has no VMs/CTs, just a plain fresh PVE installation with PBS alongside it.

Code:
root@kom1:~# zpool iostat bazen -yv 5
                              capacity     operations     bandwidth
pool                        alloc   free   read  write   read  write
--------------------------  -----  -----  -----  -----  -----  -----
bazen                       1.08T  1.09T      0    318      0   199M
  mirror-0                   556G   556G      0    153      0  97.2M
    scsi-35000cca08069e584      -      -      0     76      0  48.6M
    scsi-35000cca08069af94      -      -      0     76      0  48.6M
  mirror-1                   555G   557G      0    165      0   102M
    scsi-35000cca08069ce38      -      -      0     82      0  50.9M
    scsi-35000cca08069e9d8      -      -      0     82      0  50.9M
--------------------------  -----  -----  -----  -----  -----  -----

zpool iostat shows exactly double the bandwidth of the writes actually going to the pool... A mirror is a mirror, and with 2 vdevs the writes should be split between both mirrors (half each). I should be seeing roughly 100 MB/s going to the pool, then about 50 MB/s going to each vdev. But for some reason I'm seeing double here.

Does anyone have an idea what's going on here ?
 
That's because you configured a ZFS mirror, which writes the data twice per vdev.

What do you mean by this... a vdev is not mirrored. I have 2 vdevs, each of which is a mirror. Each vdev gets half the writes from the pool (as there are just 2 vdevs). This is the ZFS equivalent of RAID-10.
 
The problem is that the pool itself gets 200 MB/s of writes, which is... unexpected. I have no idea where those numbers are coming from, as the OS is issuing 100 MB/s of writes.

That the OS is issuing 100 MB/s is something I confirmed by checking with both iotop and htop, and also just from knowing that I'm on a 1 Gbps link, so around 120 MB/s is the theoretical maximum of writes I could get. But zpool iostat still shows 200 MB/s of writes to the pool. THIS I don't understand. The pool itself is getting double the writes.
 
@waltar I think you might be correct actually.
If the pool is showing a cumulative amount of writes that all disks are making together... then the math works. I guess it's possible that zpool iostat doesn't actually show the writes it's getting from the OS, but that's just strange.

Anyway thanks !
 
What do you mean by this... a vdev is not mirrored. I have 2 vdevs, each of which is a mirror. Each vdev gets half the writes from the pool (as there are just 2 vdevs). This is the ZFS equivalent of RAID-10.
Yeah, each vdev gets half of the (application) writes, but as it's a ZFS mirror vdev it writes that data to both disks in the vdev - and the same happens when you have more than 1 mirror vdev (like in a RAID-10).
200 MB/s is exactly what is expected - mirror means the data is written twice !!!
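To make that concrete, here is a quick accounting sketch in plain shell arithmetic, using the figures from this thread (the variable names are just for illustration):

```shell
# Accounting sketch: 100 MB/s of application writes into a pool of
# 2 mirror vdevs with 2 disks each (the layout in this thread).
app_mb=100; vdevs=2; disks_per_vdev=2

per_vdev_unique=$((app_mb / vdevs))        # 50 MB/s of unique data lands on each mirror
per_disk=$per_vdev_unique                  # every disk in a mirror writes a full copy: 50 MB/s
vdev_row=$((per_disk * disks_per_vdev))    # the mirror row sums its disks: 100 MB/s
pool_row=$((vdev_row * vdevs))             # the pool row sums its vdevs: 200 MB/s

echo "per-disk=${per_disk} vdev-row=${vdev_row} pool-row=${pool_row}"
# prints: per-disk=50 vdev-row=100 pool-row=200
```

Which matches the ~50 MB/s per disk, ~100 MB/s per mirror row, and ~200 MB/s pool row in the zpool iostat output above.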
 
Problem is the pool itself gets 200 MB/s of writes
Use htop and press <TAB> once. It will show a list of processes sorted by IO.

Just for illustration:
Code:
    PID USER       IO    DISK R/W▽ DISK READ   DISK WRITE SWPD%  IOD% Command
   2318 root       B4    0.00 B/s    0.00 B/s    0.00 B/s   0.0   0.0 pvestatd
2689200 3000005    B4  269.43 K/s    0.00 B/s  269.43 K/s   0.0   0.0 /usr/sbin/smbd --foreground --no-process-group
   2124 root       B4  315.36 K/s    0.00 B/s  315.36 K/s   0.0   0.0 /usr/bin/pmxcfs
 
Use htop and press <TAB> once. It will show a list of processes sorted by IO.
Yes, of course, that's exactly what I've done. The one process creating writes was proxmox-backup-server, doing roughly 100 MB/s of writes. I also checked with iotop and got the same result (around 100 MB/s of disk writes in total).
 
That's the theory ... but in practice the amount of data seen written is NOT the amount of application data, because:
ZFS mostly has compression enabled, so less data is seen written and read !!
Checksums are additionally written and read (~1-3%).
When using raidz* or draid*, parity is additionally written, but ideally not read unless needed.
So when measuring performance for ZFS you must take a defined amount of data and measure the time - after that do the math yourself, or a benchmark program will do it internally for you !!
So iostat and iotop mostly NEVER show the real amount of data !!
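A minimal sketch of that approach (the path and size here are placeholders - point `target` at a file on the pool instead, and use /dev/urandom rather than /dev/zero so compression doesn't skew the result):

```shell
# Benchmark sketch: write a known amount of incompressible data, time it,
# and do the MB/s math yourself instead of trusting per-layer iostat numbers.
target="/tmp/zfs-bench.bin"   # placeholder - use a file on the pool, e.g. under its mountpoint
size_mb=64                    # amount of data to write, in MiB

start=$(date +%s.%N)
dd if=/dev/urandom of="$target" bs=1M count="$size_mb" conv=fsync status=none
end=$(date +%s.%N)
rm -f "$target"

awk -v s="$start" -v e="$end" -v mb="$size_mb" \
    'BEGIN { printf "wrote %d MiB in %.2f s (%.1f MB/s)\n", mb, e-s, mb/(e-s) }'
```

`conv=fsync` makes dd flush to disk before reporting, so the elapsed time covers the actual writes and not just the page cache.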
 
Thanks. Yes you're right and I do understand compression, metadata and how various redundancy levels cause a different write profile when it all goes to actual drives. Also sync writes I think would be hitting ZIL first (not sure if that means another write afterwards, probably yes) and all that IO happens in the pool.

I might conclude now that the `bandwidth` column from `zpool iostat` output is actually showing cumulatives, or more precisely, the sum of all disk IO happening on the drives beneath.

That aligns with what I'm seeing. When I look at the drives, they are each getting 50 MB/s, which is perfectly correct. Two mirrors, each getting 50 MB/s of unique data, means 100 MB/s in total when striped. So it's just a matter of how `zpool iostat` presents it: it combines everything into a bandwidth figure, which is not the same as "pool writes". Not what I'd expect, but it still makes sense. Bandwidth actually is 200 MB/s, if bandwidth means all IO combined.
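A quick sanity check on the per-disk numbers from my `zpool iostat` output above (summed with awk):

```shell
# Per-disk write bandwidth from the iostat output above, in MB/s.
# Each mirror row is the sum of its two disks, and the pool row is
# the sum of the two mirror rows (199M in the output, after rounding).
echo "48.6 48.6 50.9 50.9" | awk '{
    m0 = $1 + $2    # mirror-0 row
    m1 = $3 + $4    # mirror-1 row
    printf "mirror-0=%.1f mirror-1=%.1f pool=%.1f\n", m0, m1, m0 + m1
}'
# prints: mirror-0=97.2 mirror-1=101.8 pool=199.0
```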

The docs here: https://openzfs.github.io/openzfs-docs/man/master/8/zpool-iostat.8.html actually mention two important details:
Displays logical I/O statistics for the given pools/vdevs. Physical I/O statistics may be observed via iostat(1)
...
Additional I/O may be generated depending on the level of vdev redundancy.
I have to admit the info above was not enough for me to immediately grok it ;) The docs could perhaps be a tad more precise about what the bandwidth column represents, but they do contain a clue - I just failed to interpret it properly at first. The column is called "bandwidth", so I could have figured it out sooner :)

Well, another day and another detail learned !
Now that I've spent some time with ZFS, I actually like it. It does take some time to understand it well enough, though.

Leaving the comment here for others that might possibly be wondering about the same thing. Thanks everyone !
 
