Help with ZFS Mirror

Yaya48

New Member
Apr 2, 2023
Hello, I've set up a ZFS RAID 1 with 2x 4 TB SSDs, one Crucial and one WD, to store my VMs on. It's a machine with 64 GB of RAM and an AMD Ryzen 9, and I'm running into some issues.

[Screenshot attachment: 1729102715128.png]

For testing I used a Samba share. When uploading a large file it works fine (around 1 GB/s here), but after some time it basically hangs and the bandwidth drops for 30-60 s, then goes back to ~1 GB/s as shown in the screenshot. When the drop happens, every VM freezes and the IO delay climbs to around 50%. I read up on the ARC cache and set it to 12 GB of RAM; that fixed the speed but not the IO delay/hang issue.
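(For reference, a typical way to cap the ARC on Proxmox is the zfs_arc_max module parameter; the snippet below is just an illustration of that approach, not the exact config used here.)

Code:
# /etc/modprobe.d/zfs.conf -- illustrative sketch of a 12 GiB ARC cap
options zfs zfs_arc_max=12884901888   # 12 GiB in bytes
# apply it with: update-initramfs -u -k all, then reboot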
 
Please show zpool list -v and zpool status,
and also lsblk.
Did you tune any ZFS config settings and/or set the ZFS ARC cache?
 
Please show zpool list -v and zpool status,
and also lsblk.
Did you tune any ZFS config settings and/or set the ZFS ARC cache?
I'm really new to ZFS since I just got the drives for this, so far only the ARC cache has been set, to 12 GB.

I also have another mirror for the Proxmox OS on two 500 GB NVMe drives (but I haven't had any issues with it).
 

Attachments

Code:
NAME                                                  SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
rpool                                                 464G  6.53G   457G        -         -     0%     1%  1.00x    ONLINE  -
  mirror-0                                            464G  6.53G   457G        -         -     0%  1.40%      -    ONLINE
    nvme-eui.e8238fa6bf530001001b444a417da772-part3   465G      -      -        -         -      -      -      -    ONLINE
    nvme-CT500P3SSD8_2405470C794F-part3               465G      -      -        -         -      -      -      -    ONLINE
vm-data                                              3.62T   685G  2.96T        -         -     1%    18%  1.00x    ONLINE  -
  mirror-0                                           3.62T   685G  2.96T        -         -     1%  18.5%      -    ONLINE
    ata-CT4000BX500SSD1_2413E8A23B33                 3.64T      -      -        -         -      -      -      -    ONLINE
    ata-WD_Blue_SA510_2.5_4TB_2426R7D00569           3.64T      -      -        -         -      -      -      -    ONLINE

  pool: rpool
 state: ONLINE
  scan: scrub repaired 0B in 00:00:02 with 0 errors on Sun Oct 13 00:24:03 2024
config:

    NAME                                                 STATE     READ WRITE CKSUM
    rpool                                                ONLINE       0     0     0
      mirror-0                                           ONLINE       0     0     0
        nvme-eui.e8238fa6bf530001001b444a417da772-part3  ONLINE       0     0     0
        nvme-CT500P3SSD8_2405470C794F-part3              ONLINE       0     0     0

errors: No known data errors

  pool: vm-data
 state: ONLINE
  scan: scrub repaired 0B in 00:20:31 with 0 errors on Sun Oct 13 00:44:33 2024
config:

    NAME                                        STATE     READ WRITE CKSUM
    vm-data                                     ONLINE       0     0     0
      mirror-0                                  ONLINE       0     0     0
        ata-CT4000BX500SSD1_2413E8A23B33        ONLINE       0     0     0
        ata-WD_Blue_SA510_2.5_4TB_2426R7D00569  ONLINE       0     0     0

errors: No known data errors

NAME        MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
sda           8:0    0   3.6T  0 disk
├─sda1        8:1    0   3.6T  0 part
└─sda9        8:9    0     8M  0 part
sdb           8:16   0   3.6T  0 disk
├─sdb1        8:17   0   3.6T  0 part
└─sdb9        8:25   0     8M  0 part
sr0          11:0    1  1024M  0 rom 
zd0         230:0    0    32G  0 disk
├─zd0p1     230:1    0   487M  0 part
├─zd0p2     230:2    0     1K  0 part
└─zd0p5     230:5    0  31.5G  0 part
zd16        230:16   0     4M  0 disk
zd32        230:32   0    32G  0 disk
├─zd32p1    230:33   0   100M  0 part
├─zd32p2    230:34   0    16M  0 part
├─zd32p3    230:35   0  31.3G  0 part
└─zd32p4    230:36   0   569M  0 part
zd48        230:48   0    62G  0 disk
├─zd48p1    230:49   0   487M  0 part
├─zd48p2    230:50   0     1K  0 part
└─zd48p5    230:53   0  61.5G  0 part
zd64        230:64   0    32G  0 disk
├─zd64p1    230:65   0   100M  0 part
├─zd64p2    230:66   0    16M  0 part
├─zd64p3    230:67   0  31.3G  0 part
└─zd64p4    230:68   0   569M  0 part
zd80        230:80   0     4M  0 disk
zd96        230:96   0   400G  0 disk
zd112       230:112  0     1M  0 disk
zd128       230:128  0   600G  0 disk
zd144       230:144  0    32G  0 disk
├─zd144p1   230:145  0   487M  0 part
├─zd144p2   230:146  0     1K  0 part
└─zd144p5   230:149  0  31.5G  0 part
zd160       230:160  0     1M  0 disk
zd176       230:176  0    32G  0 disk
├─zd176p1   230:177  0   487M  0 part
├─zd176p2   230:178  0     1K  0 part
└─zd176p5   230:181  0  31.5G  0 part
zd192       230:192  0   100G  0 disk
nvme0n1     259:0    0 465.8G  0 disk
├─nvme0n1p1 259:1    0  1007K  0 part
├─nvme0n1p2 259:2    0     1G  0 part
└─nvme0n1p3 259:3    0 464.8G  0 part
nvme1n1     259:4    0 465.8G  0 disk
├─nvme1n1p1 259:5    0  1007K  0 part
├─nvme1n1p2 259:6    0     1G  0 part
└─nvme1n1p3 259:7    0 464.8G  0 part

The first thing I would do is try to reproduce this without the mirror. These are two quite different drives.

NB: I still can't quite tell what nvme-eui.e8238fa6bf530001001b444a417da772-part3 is.
 
The first thing I would do is try to reproduce this without the mirror. These are two quite different drives.

NB: I still can't quite tell what nvme-eui.e8238fa6bf530001001b444a417da772-part3 is.
So I should do a RAID 0? If yes, is it possible to switch without having to format everything?
nvme-eui.e8238fa6bf530001001b444a417da772-part3 should be a WD too.
 
Hi,
if I interpret the usage of your pool correctly, the rust-based "vm-data" is for the VM images. If I am right about this, I would suggest using a small part of the NVMes as a (mirrored) log device. This would increase the IOPS from about 50-100 to 500-1000. For several VMs, 50-100 is just not enough, and overloads may occur from time to time. You may use pveperf to check your basic performance data.
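A rough sketch of what that could look like is below. The -part4 partitions are purely hypothetical; they would first have to be created on unused space of the NVMe drives, and note that a separate log device only accelerates synchronous writes.

Code:
# Hypothetical: add a mirrored log (SLOG) vdev to vm-data from spare NVMe partitions.
# /dev/nvme0n1p4 and /dev/nvme1n1p4 do not exist on this system -- illustrative only.
zpool add vm-data log mirror /dev/nvme0n1p4 /dev/nvme1n1p4
# Verify, then check basic fsync performance:
zpool status vm-data
pveperf /vm-data   # assumes the pool's default mountpoint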

Regards
 
So I should do a RAID 0? If yes, is it possible to switch without having to format everything?

I really like to avoid these terms with ZFS; Proxmox uses them incorrectly in the installer [1].

ZFS has:
- stripes (if you do not specify anything, it stripes)
- mirrors
- RAIDZx, dRAID, etc.

What you are asking is whether you can take a disk out of a 2-device vdev (a mirror) to make it a ... single-device vdev.

For the purpose of testing, you can just take it away; if you do, your pool will be "degraded". Just ... unplug the disk. :) If this is not hot-plug, turn the system off, then back on. Otherwise you would be looking at detach / attach [2], but I would NOT mess with that for now.

EDIT: If you do not want to physically take it out, that's what offline [3] is for (a short sketch follows below), but again, if you can, it is easier to unplug it.

NB: Be careful with the ZFS CLI; some commands do not even ask for confirmation, e.g. it is possible to destroy a pool with a single command, no questions asked.
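A minimal sketch of that offline route, using the device names from your zpool status output (double-check the pool and device names before running anything):

Code:
# Take one leg of the vm-data mirror offline for testing (the pool will show DEGRADED):
zpool offline vm-data ata-CT4000BX500SSD1_2413E8A23B33
# ... repeat the large-file upload test ...
# Bring it back; ZFS resilvers only what changed in the meantime:
zpool online vm-data ata-CT4000BX500SSD1_2413E8A23B33
zpool status vm-data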

nvme-eui.e8238fa6bf530001001b444a417da772-part3 should be a WD too.

I meant more like the specific model.

[1] https://openzfs.github.io/openzfs-docs/man/master/7/zpoolconcepts.7.html
[2] https://openzfs.github.io/openzfs-docs/man/v2.2/8/zpool-detach.8.html
https://openzfs.github.io/openzfs-docs/man/v2.2/8/zpool-attach.8.html
[3] https://openzfs.github.io/openzfs-docs/man/v2.2/8/zpool-offline.8.html
 
Hi,
if I interpret the usage of your pool correctly, the rust-based "vm-data" is for the VM images.

I really believe it would be good to clarify the hardware first. :)

The way I read that pool, there's:

- WD Blue SA510 (SATA)
- Crucial BX500 (SATA)

 
ZFS RAID 1 with 2x 4 TB SSDs

This is what most people read first in your post, but it is quite confusing. I believe you have (and hope you wanted to have):

2 mirrors: one consisting of 2 SATA SSDs and another made of 2 NVMe drives (a Crucial P3 and who knows what).

BTW, it's only a matter of time until someone tells you how these SSDs are not meant for a hypervisor, etc.

My personal take is that ZFS is not suitable for these SSDs. They might work just fine for your use on e.g. LVM with ext4.
 
Well, I have been using ZFS much longer (about a decade) than Proxmox. Maybe it does not perform well on NVMe/SSD, but it is still good enough to take advantage of them. When used for special tasks, such as logging, ZFS and NVMe/SSD are a dream team.

Mirrors are what ZFS prefers, since ZFS makes use of the increased read performance. But it fails if the mirror is not symmetric, at least it did quite a while ago. Consumer hard disks tend to "develop" critical spots, and then we have the described situation: short drops in performance and frozen VMs. Log devices help to overcome this situation.

It's just a suggestion from my side.
I totally agree that ZFS commands are dangerous if you do not know what you're doing...
 
Well, I have been using ZFS much longer (about a decade) than Proxmox. Maybe it does not perform well on NVMe/SSD, but it is still good enough to take advantage of them. When used for special tasks, such as logging, ZFS and NVMe/SSD are a dream team.

I really do not want to hijack the OP's thread with arguments for/against ZFS; I figure that on this forum I could be doing just that full-time.

I will just say this: the Crucial P3 500G ... if I got that right from a quick search, it has no DRAM and a TBW of 110 TB.

You know what will happen some time soon with that SSD on PVE on ZFS, right?
 
Hi,
if I interpret the usage of your pool correctly, the rust-based "vm-data" is for the VM images. If I am right about this, I would suggest using a small part of the NVMes as a (mirrored) log device. This would increase the IOPS from about 50-100 to 500-1000. For several VMs, 50-100 is just not enough, and overloads may occur from time to time. You may use pveperf to check your basic performance data.

Regards
OK, I'll check it out.
 
I really like to avoid these terms with ZFS; Proxmox uses them incorrectly in the installer [1]. [...]

For the purpose of testing, you can just take it away; if you do, your pool will be "degraded". Just ... unplug the disk. :) If this is not hot-plug, turn the system off, then back on. [...]

I meant more like the specific model.
For the NVMe models it's a WD SN580 and a Crucial P3; for SATA it's a Crucial BX500 and a WD SA510. I'll try to turn one disk off.
 
I will just say this: the Crucial P3 500G ... if I got that right from a quick search, it has no DRAM and a TBW of 110 TB.

You know what will happen some time soon with that SSD on PVE on ZFS, right?

When you intend to use ZFS this way, you should go ahead and check:

smartctl -a /dev/...

and look at the amount of data written so far. The endurance of that P3 is 110 TB, and ZFS has write amplification, as do PVE workflows and existing bugs. You should watch how quickly you are wearing those drives out. Decide for yourself.
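For example (the device paths below are illustrative; adjust them to your system):

Code:
# NVMe: check "Data Units Written" (1 unit = 512,000 bytes) and "Percentage Used"
smartctl -a /dev/nvme0
# SATA SSD: lifetime writes are usually SMART attribute 241, "Total_LBAs_Written"
smartctl -a /dev/sda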

Or see more, e.g.:
https://forum.proxmox.com/threads/nvm-ssd-extreme-high-wearout.143823/#post-647255
https://forum.proxmox.com/threads/etc-pve-pmxcfs-amplification-inefficiencies.154074
 
When you intend to use ZFS this way, you should go ahead and check:

smartctl -a /dev/...

and look at the amount of data written so far. The endurance of that P3 is 110 TB, and ZFS has write amplification, as do PVE workflows and existing bugs. You should watch how quickly you are wearing those drives out. Decide for yourself.

Or see more, e.g.:
https://forum.proxmox.com/threads/nvm-ssd-extreme-high-wearout.143823/#post-647255
https://forum.proxmox.com/threads/etc-pve-pmxcfs-amplification-inefficiencies.154074
I just tried to send the same file with only one disk, the WD one. No issues: the IO delay hasn't moved from 0.5%, and no VM freezes/hangs.

About the P3, what do you advise me to do? Send it back?
 
I just tried to send the same file with only one disk, the WD one. No issues: the IO delay hasn't moved from 0.5%, and no VM freezes/hangs.

About the P3, what do you advise me to do? Send it back?

I actually advise people mostly NOT to use ZFS for most (home) use cases. It is certainly true you can probably get a better bang for the buck than you did with the P3, but the requirements the official Proxmox docs set for SSDs are absurd for home users, e.g. PLP with high TBW. In practical terms, because the software behaves the way it does (especially on ZFS), you are supposed to, according to them, get a datacentre-grade SSD, think double the price per GB. At the end of the day, you really need something like 1,000 TB+ of endurance on SSDs to play with this. You can get that with larger-capacity consumer drives nowadays (e.g. the WD SN700, if I remember correctly). They also happen to have DRAM. This is really more about researching SSDs and the technologies.

It's really your call. But consider that if you were e.g. running Ubuntu on this P3, NOT on ZFS, it would probably be just fine for a regular person's home PC. E.g. I have a 10-year-old 128 GB SSD from back in that era with <30 TB written in total as an OS drive. But it was never used for ZFS.
 
I just tried to send the same file with only one disk, the WD one. No issues: the IO delay hasn't moved from 0.5%, and no VM freezes/hangs.

About the P3, what do you advise me to do? Send it back?

BTW, if this is not obvious: this does not necessarily mean the P3 is that much worse. You can test the opposite case and keep only the P3 in. You get the idea ...
 
