One zfs pool slows down all other zfs pools

Jannoke

Renowned Member
Jul 13, 2016
It might be more of a ZFS question, but maybe someone has had experience with this.

So I have a few ZFS pools on one machine.
One of the pools runs on simple consumer (DRAM-less) NVMe drives (3-disk raidz1, one disk missing). There is only one virtual machine on that specific pool. While filling the disk on that pool's guest machine using:
Bash:
dd if=/dev/urandom of=filltest.bin bs=1M count=95000 status=progress
It starts at 220 MB/s, and within around 5 minutes and 20 GB written it has dropped to around 100-120 MB/s. At the same time, all other guest machines start reporting high IO even though they are on different pools, and verifiably the IO is very sluggish on all of those virtual machines, to the point where some services start reporting themselves as down. If I cancel the dd, it recovers.
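While the dd runs I can watch from the host where the latency builds up. This is only a quick diagnostic sketch, not a fix; the pool name is my degraded NVMe pool, adjust as needed:
Bash:
# per-vdev latency on the degraded NVMe pool, 1-second intervals (-y skips the since-boot summary)
zpool iostat -vly zf_2tbnvme 1
# total throughput and latency across all pools at the same time
zpool iostat -ly 1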

There is plenty of RAM (512 GB) and CPU (dual Xeon E5-2699 v4). The machine is not particularly loaded on either the CPU or disk side before starting the dd, either.

So how can one virtual machine, on a pool unrelated to the other machines, take down the entire server's ZFS storage? What am I missing here? Is it thrashing my ARC? (A quick way to check that is sketched at the end of this post.)
Bash:
root@bla:~# pveversion
pve-manager/9.0.9/117b893e0e6a4fee (running kernel: 6.14.11-2-pve)
root@bla:/etc/modprobe.d# cat zfs.conf 
options zfs zfs_arc_min=10737418240
options zfs zfs_arc_max=64424509440

It happens with both sync=disabled and sync=standard.
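
To check the ARC-thrashing theory, I can watch the ARC size and hit/miss counters while the dd runs. Again only a diagnostic sketch, not a conclusion:
Bash:
# current ARC size, target and hit/miss counters (raw values from the kstat file)
awk '$1 ~ /^(size|c|c_max|hits|misses)$/ {print $1, $3}' /proc/spl/kstat/zfs/arcstats
# or, if the arcstat helper from zfsutils-linux is installed, a rolling 1-second view
arcstat 1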
 
> One of the pools runs on simple consumer (DRAM-less) NVMe drives (3-disk raidz1, one disk missing)

Seriously, you're running a consumer-level 3-disk raidz DEGRADED with 1 disk MISSING, and posting about it here?? Fix your pool first.

If you want better speed, rebuild it as a mirror pool, with enterprise-level SSDs or at least high-TBW-rated drives like the Lexar NM790.
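
For reference, a rebuild as a two-way mirror would look roughly like this; the device IDs below are placeholders, and destroying the pool wipes it, so the VM has to be migrated off first:
Bash:
# destroy the degraded raidz1 pool and recreate it as a mirror (DATA LOSS - move the VM off first)
# nvme-DRIVE_A / nvme-DRIVE_B are placeholders for the real /dev/disk/by-id entries
zpool destroy zf_2tbnvme
zpool create -o ashift=12 zf_2tbnvme mirror /dev/disk/by-id/nvme-DRIVE_A /dev/disk/by-id/nvme-DRIVE_B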
 
> One of the pools runs on simple consumer (DRAM-less) NVMe drives (3-disk raidz1, one disk missing)
>
> Seriously, you're running a consumer-level 3-disk raidz DEGRADED with 1 disk MISSING, and posting about it here?? Fix your pool first.
>
> If you want better speed, rebuild it as a mirror pool, with enterprise-level SSDs or at least high-TBW-rated drives like the Lexar NM790.
Hey ChatGPT - it's always grumpy people like you who come to comment. The question is about what happens when a situation like this arises. Throwing out stuff like "make it a mirror" is useless - I know how to make pools and what the different RAID levels are. Also, this is not a hardware purchase thread, so why are we recommending some cheap Lexars, which are just another variant of consumer drive? Why are you even talking about high-TBW drives if you don't even know what I have? High TBW does not mean anything on ZFS, especially if you are not going to use it. These drives are currently 1600 TBW drives at 1 TB size, if that makes you feel better.

I don't want to fix the problem blindly - I want to learn from it. This pool is not broken because a drive died; I removed the drive myself. It's no problem for me to swap these out for some enterprise Samsung drives like the ones in the other pools.

But that's not the problem -

It was about what I saw happening to the other pools on the same machine when one unrelated pool loses a drive and is still being written to. A disk can be lost at any time, but the question is why it bogs down other unrelated pools.

Next time someone asks a question and you don't have an answer, just move on and go comment to the mirror. If you don't know about this specific matter, stop forcing your "one-size-fits-all" solutions. You can either be a problem solver or a part changer, and obviously you are the latter.
 
> I don't want to fix the problem blindly - I want to learn from it. This pool is not broken because a drive died; I removed the drive myself.
That wasn't obvious from your first post. A lot of users come here with the lowest-end hardware and/or semi-damaged systems, wondering why they have problems.

> but the question is why it bogs down other unrelated pools
Good question!

I have seen similar behavior - a single heavy task slowing down a whole system. I have also had systems with a classic system load of >700 where the foreground processes carried on fine, completely unimpressed.

Are the disks of the unrelated pools on the same controller? Maybe the controller is busy handling the missing drive again and again, and that slows down every additional command because the command queue is full. Sorry, no good answer from me...
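
A quick way to check which disks actually share a controller or PCIe path, just as a sketch:
Bash:
# transport and model per physical disk
lsblk -d -o NAME,TRAN,MODEL,SIZE
# PCIe parent of each NVMe controller - different addresses mean different controllers
for n in /sys/class/nvme/nvme*; do echo "$n -> $(readlink -f "$n/device")"; done
# SATA/SAS disks behind the same HBA show up under the same SCSI host number
ls -l /sys/block/sd*/device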
 
All good and constructive questions, @UdoB and @chrcoluk. All pools are INDEPENDENT, as stated. Here is the zpool info for verification:

Bash:
root@bla:~# zpool status
  pool: zf_2tbnvme
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
        invalid.  Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-4J
  scan: scrub repaired 0B in 00:23:09 with 0 errors on Sun Sep 14 00:47:10 2025
config:

        NAME                                                STATE     READ WRITE CKSUM
        zf_2tbnvme                                          DEGRADED     0     0     0
          raidz1-0                                          DEGRADED     0     0     0
            nvme-GIGABYTE_GP-GSM2NE3100TNTD_SN194208902062  ONLINE       0     0     0
            nvme-GIGABYTE_GP-GSM2NE3100TNTD_SN195108903686  ONLINE       0     0     0
            9499347251494022038                             UNAVAIL      0     0     0  was /dev/disk/by-id/nvme-GIGABYTE_GP-GSM2NE3100TNTD_SN194208902052-part1

errors: No known data errors

  pool: zf_24g_raidz2
 state: ONLINE
  scan: scrub repaired 0B in 12:50:47 with 0 errors on Sun Oct 12 13:14:48 2025
config:

        NAME                                 STATE     READ WRITE CKSUM
        zf_24g_raidz2                        ONLINE       0     0     0
          raidz2-0                           ONLINE       0     0     0
            ata-ST8000VN004-2M2101_WSD4JCVQ  ONLINE       0     0     0
            ata-ST8000VN004-2M2101_WSD4JK5D  ONLINE       0     0     0
            ata-ST8000VN004-2M2101_WSD4JCTC  ONLINE       0     0     0
            ata-ST8000VN004-2M2101_WSD4JJ34  ONLINE       0     0     0
            ata-ST8000VN004-2M2101_WSD4JCC1  ONLINE       0     0     0

errors: No known data errors

  pool: zf_32g_raidz1
 state: ONLINE
  scan: scrub repaired 0B in 1 days 07:11:14 with 0 errors on Mon Oct 13 07:35:23 2025
config:

        NAME                                   STATE     READ WRITE CKSUM
        zf_32g_raidz1                          ONLINE       0     0     0
          raidz1-0                             ONLINE       0     0     0
            ata-ST16000NM001G-2KK103_ZL2D4PZZ  ONLINE       0     0     0
            ata-ST16000NM001G-2KK103_ZL2K3GH1  ONLINE       0     0     0
            ata-ST16000NM001G-2KK103_ZL2K3DRY  ONLINE       0     0     0

errors: No known data errors

  pool: zf_samsung
 state: ONLINE
  scan: scrub repaired 0B in 01:39:15 with 0 errors on Sun Oct 12 02:03:26 2025
config:

        NAME                                                STATE     READ WRITE CKSUM
        zf_samsung                                          ONLINE       0     0     0
          mirror-0                                          ONLINE       0     0     0
            nvme-SAMSUNG_MZ1LB1T9HALS-00007_S436NA1R114234  ONLINE       0     0     0
            nvme-SAMSUNG_MZ1LB1T9HALS-00007_S436NA1R114237  ONLINE       0     0     0

errors: No known data errors

  pool: zf_samsung2
 state: ONLINE
  scan: scrub repaired 0B in 00:01:00 with 0 errors on Sun Oct 12 00:25:12 2025
config:

        NAME                                                  STATE     READ WRITE CKSUM
        zf_samsung2                                           ONLINE       0     0     0
          mirror-0                                            ONLINE       0     0     0
            nvme-Samsung_SSD_980_500GB_S64DNX0RC58998J-part4  ONLINE       0     0     0
            nvme-Samsung_SSD_980_500GB_S64DNX0RC59038P-part4  ONLINE       0     0     0

errors: No known data errors

As these are NVMe, they are NOT on the same controller; they are direct PCIe. All of the "ata-*" disks are on a single LSI 3008 controller.

ANY of them can suffer from high IO caused by IO on the Gigabyte NVMe zpool. I can replace those drives and I will. But as you can see from the pool set, it's not a financial issue; I'm more interested in how and why it's possible to bog down every zpool on the same server with a single slow disk/pool. I have been working through tests with ChatGPT 5/premium that actually make sense, but in the end there is no clear answer, so I'm looking for human thoughts and real-life experiences. If it helps with debugging, I can re-add the 3rd disk to revive the array.
zf_samsung2 is unused and is just waiting to be removed.
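
If anyone wants to dig in with me, my plan is to repeat the dd while watching all pools plus the per-pool transaction-group activity at once. This is only a sketch built on the assumption that something host-global (ARC, the write throttle, or the NVMe controller) is involved, not a conclusion:
Bash:
# queue depths and throughput for every pool, 1-second intervals, while dd runs in the guest
zpool iostat -qy 1
# per-pool txg history - much longer otime/qtime/stime values on zf_2tbnvme compared to
# zf_samsung would show where writes are piling up
watch -n 1 'tail -n 5 /proc/spl/kstat/zfs/zf_2tbnvme/txgs /proc/spl/kstat/zfs/zf_samsung/txgs'
# dirty-data limit used by the ZFS write throttle (module-wide tunable, in bytes)
cat /sys/module/zfs/parameters/zfs_dirty_data_max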