Proxmox with ZFS + SSDs: Built-in TRIM cron job vs zfs autotrim?

Sep 1, 2022
Hello,

Using a general "how to get the best performance/life out of SSDs on ZFS" guide, I set `autotrim=on` on both my rpool and my vmStore1 pool.

Then I became aware that Proxmox includes a cron job to do this monthly:
Code:
/etc/cron.d# cat zfsutils-linux
PATH=/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin

# TRIM the first Sunday of every month.
24 0 1-7 * * root if [ $(date +\%w) -eq 0 ] && [ -x /usr/lib/zfs-linux/trim ]; then /usr/lib/zfs-linux/trim; fi

# Scrub the second Sunday of every month.
24 0 8-14 * * root if [ $(date +\%w) -eq 0 ] && [ -x /usr/lib/zfs-linux/scrub ]; then /usr/lib/zfs-linux/scrub; fi

Question: Do I need to have autotrim enabled with this cron job in place? If I don't need it, and have it enabled, does that cause problems/excess disk wear?
Currently: I think I can disable autotrim because Proxmox is doing both TRIM and SCRUB monthly? But I don't know how to read cron jobs entirely, and I'm not sure if it's TRIMming/SCRUBbing on all pools or just rpool.


On Arch-based systems I use `fstrim.timer` to set up automatic SSD TRIM. I've not seen it done with a cron job before, and none of those systems use ZFS, so I'd really appreciate any advice.

Thanks!
 
Do I need to have autotrim enabled with this cron job in place?
No, you don't. Autotrim is a synchronous trim that is issued after a block has been deleted on your pool. That is an unpredictable event for your pool, compared with a regular trim at a defined point in time, so your pool can be slower under heavier loads if you enable autotrim. It was the same with ext4 and its two trimming options: "external" trimming is always preferred, because its I/O workload is planned.

But I don't know how to read cron jobs entirely, and I'm not sure if it's TRIMming/SCRUBbing on all pools or just rpool.
To understand it further, you need to look at the actual scripts that are run, /usr/lib/zfs-linux/trim and /usr/lib/zfs-linux/scrub. There you will see that all pools that are not already being trimmed (or scrubbed) get trimmed (or scrubbed).
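
For reference, the cron schedule above works out to 00:24 on the first Sunday of every month: minute 24, hour 0, any day from the 1st to the 7th, with the date +\%w check only letting Sundays through. If you go with that scheduled trim instead of autotrim, a minimal sketch of the commands involved could look like this (pool names are the ones from this thread; note that autotrim is a pool property, so it's zpool, not zfs):
Bash:
# Check whether autotrim is currently enabled
zpool get autotrim rpool vmStore1

# Turn the synchronous autotrim off
zpool set autotrim=off rpool
zpool set autotrim=off vmStore1

# An "external" trim at a time of your choosing, e.g. by hand or from your own cron entry
zpool trim rpool
The monthly cron job essentially does that last step for you, subject to the property discussed further down the thread.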
 
I believe this answer is missing one bit of information. The pool has to have a special flag set; see the Debian ZFS docs. The script under /usr/lib/zfs-linux/trim will skip a pool if org.debian:periodic-trim is disabled.

Here is the relevant part of that script:
Bash:
zpool list -H -o health,name 2>&1 | \
    awk -F'\t' '$1 == "ONLINE" {print $2}' | \
while read -r pool
do
    # read user-defined config
    ret=$(get_property "${pool}") || continue
    case "${ret}" in
        disable);;  # <--- if org.debian:periodic-trim is disabled, the pool is skipped without doing anything
        enable)    trim_if_not_already_trimming "${pool}" ;;
        -|auto)    if pool_is_nvme_only "${pool}"; then trim_if_not_already_trimming "${pool}"; fi ;;  # <--- if the property is not set, the pool is only trimmed when all of its drives are NVMe
        *)    cat > /dev/stderr <<EOF
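
If you want to see today what that logic would decide for a given pool, rather than waiting for the first Sunday, something like this should do it (rpool is just the example pool; the lsblk column mirrors what the script's pool_is_nvme_only check appears to look at, i.e. the transport type):
Bash:
# The property the script reads; "-" means unset, which lands in the "-|auto" branch
zfs get -H -o value org.debian:periodic-trim rpool

# Transport and discard limit per disk; TRAN has to be "nvme" for the auto branch to trim
lsblk -d -o NAME,TRAN,DISC-MAX

# Or just trigger a trim by hand, independent of the property
zpool trim rpool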
 
That script is a bit over my head for the moment, but I appreciate the extra information.

It sounds like I need to leave auto-trim off, but make sure periodic trim is enabled on a per-dataset basis (because it's not implemented at the pool level yet)?
(Also, that Debian ZFS doc page was one of the clearest explanations of any ZFS thing I've ever read. That was nice. :) )

This is a low power system with a single consumer NVME for boot and VM store for speed. The VMs back up to a mirrored enterprise SATA SSD pair. So I wanted to make sure I wasn't killing the NVME any faster than necessary.
 
No, you can set auto-trim to whatever you want, but for your SATA SSDs it may be easier to leave it off. Periodic trim, on the other hand, WILL run on your NVMe drive but will NOT run on your SATA SSDs unless you set:
Code:
zfs set org.debian:periodic-trim=enable <your zfs tank>

I figured that out today while turning auto-trim off and looking for the periodic trim, because on my consumer-grade SSD/NVMe RAIDZ2 mix autotrim created too much I/O wait, with a constantly high amount of writes per second (between 3-20 MB/s).
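
If anyone wants to see that kind of I/O wait on their own pool, a rough way to watch it while a trim runs or while autotrim is on (nothing here is specific to my pools; iostat comes from the sysstat package):
Bash:
# Per-vdev ZFS I/O, refreshed every 5 seconds
zpool iostat -v rpool 5

# Per-disk utilisation and wait times, refreshed every 5 seconds
iostat -xm 5

# Whether a trim is currently running on the pool and how far along it is
zpool status -t rpool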
 
Thanks for the clarification. I'll update my "how to set up new PVE node ZFS" document today. :)

I assume at some point periodic trim will get implemented at the pool level...
 
Quick followup:

On this box, I've only got 2 enterprise SATA SSDs that I'm using.
(There's an NVME in there, but it's completely unused.)

Is there any reason not to run zfs set org.debian:periodic-trim=enable rpool/data/<myDataset>, or even set it on rpool/data?
Poking at rpool at all makes me nervous, but I'd prefer to set this correctly as high on the inheritance tree as possible. :p
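
In the meantime, here's what I plan to run to see where the property currently lives before I change anything (just a sketch with the same names as above; the zfs inherit line is only there in case I need to undo a local setting on a child later):
Bash:
# Value and source (local / inherited / "-" when unset) for the pool and every child dataset
zfs get -r org.debian:periodic-trim rpool

# Remove a local setting from a child so it falls back to inheriting
zfs inherit org.debian:periodic-trim rpool/data/<myDataset>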
 
Hello,

It is me again. I'm re-setting up a new PVE node, and it's been long enough since I've needed to do this (and also I'm now using a node with a single NVME mirror rpool to house the OS and VM data) that I wanted to double-check myself.

Looking at the scripts actually run by the cron jobs, I'm realizing I don't quite understand the syntax of what the case statements are doing. Here's the one in the trim script, for example:

Code:
  53   │ # TRIM all healthy pools that are not already trimming as per their configs.
  54   │ zpool list -H -o health,name 2>&1 | \
  55   │     awk -F'\t' '$1 == "ONLINE" {print $2}' | \
  56   │ while read -r pool
  57   │ do
  58   │     # read user-defined config
  59   │     ret=$(get_property "${pool}") || continue
  60   │     case "${ret}" in
  61   │         disable);;
  62   │         enable) trim_if_not_already_trimming "${pool}" ;;
  63   │         -|auto) if pool_is_nvme_only "${pool}"; then trim_if_not_already_trimming "${pool}"; fi ;;
  64   │         *)  cat > /dev/stderr <<EOF
  65   │ $0: [WARNING] illegal value "${ret}" for property "${PROPERTY_NAME}" of ZFS dataset "${pool}".
  66   │ $0: Acceptable choices for this property are: auto, enable, disable. The default is auto.
  67   │ EOF
  68   │     esac
  69   │ done

I'm reading line 60 onward as follows (a tiny standalone example of the case syntax is below the list): look at the value of the periodic-trim/scrub property, and:
  • IF disabled, THEN do nothing; ELSE
  • IF enabled, THEN trim if not already trimming; ELSE
  • IF auto or "-" [the default value on fresh install], THEN trim if not already trimming but only if this is an NVME only pool; ELSE
  • Illegal value, print to standard error output.
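
Here's the tiny standalone example I wrote for myself to make sure I had the case syntax right (the values are made up, only the pattern mirrors the script):
Bash:
ret="auto"   # pretend this came from get_property

case "${ret}" in
    disable) ;;                                             # empty branch: do nothing; ";;" ends the branch
    enable)  echo "would trim" ;;
    -|auto)  echo "would trim if the pool is NVMe-only" ;;  # "|" is an OR: matches "-" or "auto"
    *)       echo "illegal value: ${ret}" >&2 ;;            # "*" catches everything else; warn on stderr
esac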
This looks like the actual logic for @lexxxel 's statement above that "The periodic trim on the other side WILL RUN on your NVME drive, BUT WILL NOT run on your SATA SSDs if you do not set [the periodic-trim property]."

So in a system where everything lives on rpool and rpool is an NVME mirror, the correct answer is still "everything will TRIM and SCRUB correctly by default, do nothing."


Is that right?
 
Yes.
Also, you copied the same code from the same file that I did. I only trimmed the end.
Oh, good! My bumbling posts elsewhere aside, I really do feel like I'm getting a better handle on ZFS the last couple months.

I did repost your code, yes. I apologize if it looked like I was claiming to have found it and figured it out myself. Mostly, I just wanted to be able to see it with line numbers while I was typing/thinking through what it was doing.
 
That script's default behaviour is somewhat poorly thought out; the "auto" setting doesn't recognize SATA/SAS SSDs.

It simply doesn't work for most, probably all, SAS- or SATA-based SSDs.
Something like a Samsung PM893 or 870 EVO/QVO, etc. will not be trimmed by this script!

The reason is simple: the script checks the transport layer and only triggers on "nvme", but the transport layer of almost all SATA/SAS-based SSDs is reported as sata or sas.
So in the end it never gets triggered for them.

So you have four options:
- 1. Exchange the lsblk -dnr -o TRAN "$dev" line inside the get_transp function of /usr/lib/zfs-linux/trim for:
if [ "$(lsblk -dnr -o DISC-MAX "$dev")" != "0B" ]; then echo nvme; fi
(a sketch of this edit follows the list)
- 2. Simply add a cron job that runs zpool trim YOURPOOL once a month.
- 3. zfs set org.debian:periodic-trim=enable YOURPOOL
- 4. Enable autotrim. Autotrim was broken and crappy, yes, but it has been fixed and should be fine since recent OpenZFS releases (don't ask me which version fixed it, that would need digging, but I know it's fixed; I'm using it without issues).
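
For option 1, roughly what the edited function might end up looking like, purely as a sketch (only the lsblk line is taken from above; the real shape of get_transp in /usr/lib/zfs-linux/trim may differ, and a zfsutils-linux update will likely overwrite a hand-edited file under /usr/lib):
Bash:
# Hypothetical replacement for get_transp(): instead of reporting the real transport
# (sata/sas/nvme), report "nvme" for any disk that advertises discard support at all,
# so the "auto" branch also trims SATA/SAS SSDs.
get_transp () {
    dev="$1"
    # lsblk reports DISC-MAX as "0B" when the device does not support discard/TRIM
    if [ "$(lsblk -dnr -o DISC-MAX "$dev")" != "0B" ]; then
        echo nvme
    fi
}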

Cheers
 
I only meant my note about the copied code as a clarification, not as an attack, sorry.
 
That script's default behaviour is somewhat poorly thought out; the "auto" setting doesn't recognize SATA/SAS SSDs.

There are reasons behind that behavior. Even the Linux kernel has a special list of SATA SSDs that do not implement discard properly. To my knowledge, only NVMe SSDs can be fully trusted to implement the spec well enough.
About your option 4: I had a very bad experience with a SanDisk Plus and a SanDisk Ultra 3D SSD. Both drives needed hundreds of milliseconds or even seconds for discard operations in a RAIDZ2 with 4 drives. This is why I started to look into scheduled trimming in the first place. So do check your iostats when turning autotrim on with SATA SSDs.
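
If you want to see what a particular SATA SSD actually advertises before trusting it, a few stock commands help (the device name is just an example; hdparm and smartmontools may need installing first):
Bash:
# Discard granularity/limits the kernel exposes for the disk; all zeros means no usable TRIM
lsblk -D /dev/sda

# What the drive itself reports; look for "Data Set Management TRIM supported"
hdparm -I /dev/sda | grep -i trim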
 

That is as intended. By default, Proxmox/ZFS considers non-enterprise SATA/SAS SSDs not to implement TRIM correctly, or to implement a TRIM slow enough to drag down your system, so it doesn't turn periodic TRIM on for them by default. NVMe, however, is assumed to be well capable of the periodic trim controlled by the special org.debian:periodic-trim property.

From up-thread (this is a long thread): https://wiki.debian.org/ZFS#Periodic_TRIM
Periodic TRIM

On Debian systems, since Bullseye release (or 2.0.3-9 in buster-backports), periodic TRIM is implemented using a custom per-pool property: org.debian:periodic-trim

By default, these TRIM jobs are scheduled on the first Sunday of every month. The completion speed depends on the disks size, disk speed and workload pattern. Cheap QLC disks could take considerable more time than very expensive enterprise graded NVMe disks.

That page includes a helpful chart and footnotes that I'm reproducing here for others who find this thread. tl;dr: automatic periodic-trim for SATA 3.0 or lower may be a synchronous operation that blocks all other I/O (very bad on an active PVE node, especially as PVE is really meant for commercial environments where downtime loses money). SATA 3.1 SSDs should implement queued TRIM, which is non-blocking, but not all disks implement it correctly even when they claim to, which in a production environment is worse than not implementing it at all. And mixed pools won't be auto-trimmed until SATA 3.1 support is handled to the satisfaction of the Debian maintainers. (A quick way to check which SATA revision your own disks report is sketched after the footnotes.)

Footnote 3 paints a pretty dire situation re: proper SATA 3.1 support, so it makes sense that it's not enabled and probably will never be enabled.

You can override these settings via the zfs set command, but as explained in the linked article and the excerpts below, that could cause serious issues if your SSDs are lying about what they support. ZFS itself is designed to avoid the issues TRIM is meant to correct, to some extent, so leaving it disabled on SSDs that might not implement it correctly is not as big a deal as it would be on an ext4 system.

I wouldn't even consider overriding the defaults unless I was using trustworthy-brand used enterprise SSDs with clear spec-sheets that are known to be telling the truth.
[Chart from the Debian ZFS wiki summarizing, for each org.debian:periodic-trim value, whether periodic TRIM runs on NVMe-only, SATA, and mixed pools]
  1. When org.debian:periodic-trim is not present in pool, or the property is present but value is empty/invalid, they are treated as auto.
  2. SATA SSD with protocol version 3.0 or lower handles TRIM (UNMAP) in synchronous manner which could block all other I/O on the disk immediately until the command is finished, this could lead to severe interruption. In such case, pool trim is only recommended in scheduled maintenance period.
  3. SATA SSD with protocol version >=3.1 may perform TRIM in a queued manner, making the operation not blocking. Enabling TRIM on these disks is planned by the Debian ZFS maintainers (990871), but yet to be implemented because there are issues to be considered - for example some disks advertise the ability of doing Queued TRIM although the implementation is known broken. Users can enable the pool trim by setting the property to enable after checking carefully.
  4. When the >=3.1 support is properly implemented, pool with a mixed types of SSDs will be measured by whether all disks are of the recommended types. Users can enable the pool trim by setting the property to enable after checking all disks in pool carefully.
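
Since those footnotes hinge on which SATA revision a disk speaks, here's the quick check I'd use (smartmontools; /dev/sda is a placeholder for your actual disk):
Bash:
# Prints a line like: SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
smartctl -i /dev/sda | grep -i 'sata version'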
 
May I then simply say that Samsung SSD 870 EVOs do TRIM correctly? I'm using them in some servers with periodic trim enabled and have no issues.
Just looking for a simple "confirmation" that these drives are okay with TRIM.
Same with the PM893.

I didn't test any other drives.
 
I would not say they do it correctly. As far as I know, this whole generation is blacklisted in the kernel's ATA driver:
C:
{ "Samsung SSD 870*",        NULL,    ATA_HORKAGE_NO_NCQ_TRIM |
                        ATA_HORKAGE_ZERO_AFTER_TRIM |
                        ATA_HORKAGE_NO_NCQ_ON_ATI },
Meaning they do TRIM, but not the way the spec would require. Periodic trim will be better than autotrim, that's for sure.
https://github.com/torvalds/linux/blob/master/drivers/ata/libata-core.c
 
Samsung PM 893 SATA SSDs are data center SSDs. I'd expect them to implement TRIM correctly and be rather shocked if they didn't.
Likewise, SAS SSDs all tend to be enterprise/data center disks, and should implement TRIM correctly.

(This is my opinion given how heavily enterprise SSD marketing is focused on data integrity and standards compliance. You should still confirm whether whatever the disk is actually implements the standard correctly.)

Samsung EVOs are consumer SATA disks. Consumer disks shouldn't be assumed to implement TRIM correctly (...or correctly report their real sector size). That doesn't mean the 870s aren't great (I have a pair I'm using as a destination for PBS storage), but they do have known limitations in standards support and performance (though they're plenty fast for home server use).
 
I would not say they do it correctly. As far as I know, this whole generation is blacklisted in the kernel's ATA driver.
Good point, thanks for digging. Then I'll disable periodic trim. Even if it has been working fine for 3-4 months now, there seems to be a risk.
 
