
Personally I would say: do you REALLY need a SLOG? If so, why are you sharing your SLOG with other workloads? Are you running time-critical databases? If you use the safe write-back cache options on the VMs, you generally won't see the need; it cuts into your RAM, and because you use your SSD for other purposes it may actually end up slowing them down. Read up on how ZFS works. I would use the SSDs as caches if you really feel the need and don't have a lot of RAM, because that is where most workloads benefit.
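For illustration, a minimal sketch of setting that cache mode on an existing VM disk with qm; the VMID (100), storage name (local-zfs) and disk volume are placeholders, not values from this thread:

Code:
# flush-honouring write-back cache on an existing SCSI disk of VM 100
qm set 100 --scsi0 local-zfs:vm-100-disk-0,cache=writeback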

As far as "partitions" go - NVMe drives have namespaces, which is why an NVMe partition is named something like nvme0n1p2: n1 is the namespace, p2 the partition. Add a namespace (not sure if the tools are on the Proxmox boot disk; the tool is named nvme) and your NVMe drive will split into two (or 3 or 4) NVMe devices, which will appear properly in the Proxmox setup. And your controller should still do wear leveling across the entire flash; the namespace split is just between the controller and the host.
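If a drive does support multiple namespaces, the rough workflow with the nvme tool looks something like the sketch below, assuming there is unallocated capacity left on the controller; the device path, block counts, namespace ID and controller ID are placeholders you would take from your own id-ctrl output:

Code:
# check how many namespaces the controller supports (nn) and its controller id
nvme id-ctrl /dev/nvme0 | grep -E "^nn|cntlid"
# create a namespace of a given size in blocks, then attach it to the controller
nvme create-ns /dev/nvme0 --nsze=97656250 --ncap=97656250 --flbas=0
nvme attach-ns /dev/nvme0 --namespace-id=2 --controllers=0
# rescan so /dev/nvme0n2 shows up
nvme ns-rescan /dev/nvme0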
 
it cuts into your RAM
That is L2ARC that consumes RAM, not the SLOG. All sync writes are written to RAM + disk anyway. Difference is that without a SLOG the sync writes will be written to the ZIL on your data disks instead of a dedicated disk for it.
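For reference, this is how the two look on the command line; the pool name "tank" and the by-id device paths are placeholders:

Code:
# dedicated SLOG: sync writes go here instead of the in-pool ZIL
zpool add tank log /dev/disk/by-id/nvme-slog-device
# L2ARC read cache: this is the one whose headers also consume RAM
zpool add tank cache /dev/disk/by-id/nvme-cache-device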
 
Hi
I don't intend to share my SLOG with other workloads. I'm only partitioning it so I can vary the areas written to from time to time.
I like the idea of namespaces, but how are they different from plain partitions? I think the utility on Proxmox is nvme-cli; it is not installed by default, but I can install it.
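Installing it on a Proxmox host is just the usual Debian way (assuming the standard repositories are configured):

Code:
apt update
apt install nvme-cli
# quick sanity check: list NVMe controllers and namespaces
nvme list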
 
Namespaces are at the NVMe level, so the controller is "partitioning" the NVMe into multiple sub-devices, primarily intended for things like virtualization or setups with shared NVMe fabrics (similar to how a spinning-disk RAID array could present many SCSI LUNs back in the day). But it is neat if you don't want to mess with a custom installation on appliance-like systems like Proxmox or TrueNAS and just want to use the GUI.

I don't understand why you would need to "vary the areas written to". Are you talking about wear-leveling? Any SSD should do that already across the entire device, regardless of partitions. The controller doesn't care whether you have partitions - it doesn't even know about them; it just relies on the OS telling it which blocks are "free" (the TRIM command). If your SSD does not have wear-leveling or TRIM, throw it out, as it is potentially untrustworthy for any type of use.
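If you want to confirm TRIM is happening on a ZFS pool, something like this should do it; "tank" is a placeholder pool name:

Code:
# run a manual TRIM and check its progress
zpool trim tank
zpool status -t tank
# or let ZFS trim freed blocks automatically
zpool set autotrim=on tank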
 
You need a "special" vdev first. And make sure to mirror it, as losing it means ALL data is lost. Metadata alone can't be moved. You will have to rewrite all the data as well. See:
https://forum.level1techs.com/t/zfs-metadata-special-device-z/159954
I just had a chance to look at the link you gave. You also said, "Metadata alone can't be moved. You will have to rewrite all the data as well".
But there is this in the link (screenshot attached below). Please explain a bit more. Thanks.

[Attached screenshot from the linked Level1Techs article "ZFS Metadata Special Device Z"]
 
You can add a special device later (but not remove it without destroying the pool). But old metadata will still be stored on those HDDs and ZFS can't move that metadata to the special device SSDs. You will have to remove and add all your data again (or move files between datasets) for the metadata to end up on those newly added special device SSDs. Only the metadata of NEWLY written data will be stored on the special devices. So adding special device SSDs won't give you any direct performance increase at first and you will only see a difference slowly over time or once you rewrite all existing data.
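For what it's worth, one way to force that rewrite in place is to send each dataset into a new one and swap the names afterwards; the dataset names here are placeholders, and you should verify the copy before destroying anything:

Code:
# rewrite tank/data so its metadata lands on the new special vdev
zfs snapshot tank/data@migrate
zfs send tank/data@migrate | zfs receive tank/data_new
# after verifying, swap the datasets and remove the old copy
zfs rename tank/data tank/data_old
zfs rename tank/data_new tank/data
zfs destroy -r tank/data_old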
 
Aha... I get the namespaces thing now.

I checked my Western Digital SN730 drives, and they only support 1 namespace, so I will have to make do with just partitions for this build.

Code:
:~# nvme id-ctrl /dev/nvme2 | grep ^nn
nn        : 1

Later purchases will take this into account though. Maybe a pair of Samsung 990 Pros; then I can use namespaces to distribute and mirror the metadata, SLOG and cache. Cheers.
 
Thank you. That is now very clear to me. I'm grateful, as I haven't created many VMs yet, so I'm happy to start all over with encryption and SSH unlock, then create the pools with a special device added in mirror configuration on the two 512GB SN730s, a SLOG on a KIOXIA 256GB NVMe, and a mirrored cache on two further 128GB NVMe drives to be bought tomorrow. Interesting days ahead.
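A rough sketch of what that layout could look like at pool-creation time. The pool name, the mirrored HDD data vdev and all disk IDs are assumptions; also note that ZFS cannot actually mirror L2ARC cache devices, so two cache NVMe drives are simply added side by side:

Code:
zpool create -o ashift=12 -O encryption=on -O keyformat=passphrase tank \
  mirror /dev/disk/by-id/ata-HDD_A /dev/disk/by-id/ata-HDD_B \
  special mirror /dev/disk/by-id/nvme-WDC_SN730_A /dev/disk/by-id/nvme-WDC_SN730_B \
  log /dev/disk/by-id/nvme-KIOXIA_256GB \
  cache /dev/disk/by-id/nvme-128GB_A /dev/disk/by-id/nvme-128GB_B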
 
You don't need to mirror your read cache. If you are paranoid you could mirror the SLOG, in case the SLOG SSD fails during a power outage.
 
Yeah, I picked that up earlier in one of your replies, but it seems I have so much spare NVMe (I don't have anything under 256GB) that I might as well use it up.
 
