HowTo defrag a ZFS pool?

Hi,
I have a ZFS pool with an MSSQL VM which changes a lot of data. I use ZFS for disaster recovery - sending snapshots with pve-zsync to another cluster node and with znapzend to a remote host.
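For reference, those two jobs are set up roughly like this (VM ID, target host and pool names here are just placeholders, not the exact configuration):
Code:
# pve-zsync: keep the VM's disks in sync with another cluster node
pve-zsync create --source 200 --dest 192.168.1.2:pve02pool --maxsnap 7 --verbose
# znapzend: per-dataset retention plan for the remote host
znapzendzetup create SRC '7d=>6h' pve02pool/vm-200-disk-2 DST:offsite '30d=>1d' remotehost:backuppool/vm-200-disk-2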

After a short time of use, the pool shows high fragmentation:
Code:
zpool get capacity,size,health,fragmentation
NAME       PROPERTY       VALUE   SOURCE
pve02pool  capacity       73%     -
pve02pool  size           1.73T   -
pve02pool  health         ONLINE  -
pve02pool  fragmentation  40%     -
I have read that defragmentation isn't possible on ZFS. Is this still valid?

And the REFER values blow up, which doesn't fit with the snapshots:
Code:
zfs list -t snapshot
NAME                                                      USED  AVAIL  REFER  MOUNTPOINT
pve02pool/vm-200-disk-2@rep_default_2018-04-11_11:45:01   476M      -   605G  -
pve02pool/vm-200-disk-2@2018-04-11-180000                62.6M      -   685G  -
pve02pool/vm-200-disk-2@2018-04-12-000000                30.9M      -   684G  -
pve02pool/vm-200-disk-2@2018-04-12-060000                11.3M      -   684G  -
pve02pool/vm-200-disk-2@2018-04-12-120000                94.2M      -   684G  -
pve02pool/vm-200-disk-2@rep_default_2018-04-12_17:45:16  2.01M      -   684G  -
pve02pool/vm-200-disk-2@2018-04-12-180000                   2M      -   684G  -
pve02pool/vm-200-disk-2@2018-04-13-000000                 142G      -   880G  -
pve02pool/vm-200-disk-2@rep_default_2018-04-13_05:45:07  2.45M      -   727G  -
pve02pool/vm-200-disk-2@2018-04-13-060000                2.42M      -   727G  -
pve02pool/vm-200-disk-2@2018-04-13-120000                91.7M      -   727G  -
pve02pool/vm-200-disk-2@rep_default_2018-04-13_13:30:05  46.7M      -   727G  -
If the only way is storage migration, it's not really usable - if I migrate the 1TB volume with storage migration, the volume uses the full space after that. Meaning: migrate away, migrate back, write zeros to free space.
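For completeness, the "write zeros" step happens inside the guest; a rough sketch for a Linux guest (a Windows/MSSQL guest would use Sysinternals' sdelete -z instead):
Code:
# fill the free space with zeros; with compression enabled ZFS stores all-zero blocks as holes
dd if=/dev/zero of=/zerofill bs=1M; sync; rm /zerofill
# with virtio-scsi and discard=on on the VM disk, a plain trim avoids the zero-writing entirely:
fstrim -av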
And, if I see it right, a new ZFS sync (with pve-zsync and znapzend) will then transmit the whole VM disk again, because it's a new one (and 600GB over a WAN connection takes some days).

Any hints?

Udo
 
Hi,

I have read that defragmentation isn't possible on ZFS. Is this still valid?
AFAIK, and as the git log says, there is no defrag option.

But the fragmentation level only tells you how new data will be written (i.e. how fragmented the free space is), not how fragmented the already written data is.
Normally, if you stay under 70% pool usage, you get no performance problems.

Do you have any performance problems with 73%?

And, if I see it right, a new ZFS sync (with pve-zsync and znapzend) will then transmit the whole VM disk again,
Yes, this is true.
 
Hi,
AFAIK, and as the git log says, there is no defrag option.
bad...
But the fragmentation level only tells you how new data will be written (i.e. how fragmented the free space is), not how fragmented the already written data is.
Normally, if you stay under 70% pool usage, you get no performance problems.

Do you have any performance problems with 73%?
Yesterday the monitoring showed some strange things during heavy IO on this pool.
The biggest problem is that 70% is reached after two weeks!
I had to learn before that at 93% nothing works anymore... So I removed this one VM, and after migrating it back it has worked until now, but it doesn't look like I should run it this way for another week...

zfs list shows that the volume uses 1.01T, but the REFER of 727G plus all snapshots is much less...
But usedbysnapshots shows a higher value:
Code:
zfs get used,usedbydataset,usedbysnapshots pve02pool/vm-200-disk-2
NAME                     PROPERTY         VALUE     SOURCE
pve02pool/vm-200-disk-2  used             1.01T     -
pve02pool/vm-200-disk-2  usedbydataset    647G      -
pve02pool/vm-200-disk-2  usedbysnapshots  382G      -
This doesn't fit with zfs list -t snapshot:
Code:
zfs list -t snapshot | grep 200-disk-2
pve02pool/vm-200-disk-2@rep_default_2018-04-11_11:45:01   476M      -   605G  -
pve02pool/vm-200-disk-2@2018-04-11-180000                62.6M      -   685G  -
pve02pool/vm-200-disk-2@2018-04-12-000000                30.9M      -   684G  -
pve02pool/vm-200-disk-2@2018-04-12-060000                11.3M      -   684G  -
pve02pool/vm-200-disk-2@2018-04-12-120000                94.2M      -   684G  -
pve02pool/vm-200-disk-2@rep_default_2018-04-12_17:45:16  2.01M      -   684G  -
pve02pool/vm-200-disk-2@2018-04-12-180000                   2M      -   684G  -
pve02pool/vm-200-disk-2@2018-04-13-000000                 142G      -   880G  -
pve02pool/vm-200-disk-2@rep_default_2018-04-13_05:45:07  2.45M      -   727G  -
pve02pool/vm-200-disk-2@2018-04-13-060000                2.42M      -   727G  -
pve02pool/vm-200-disk-2@2018-04-13-120000                 110M      -   727G  -
pve02pool/vm-200-disk-2@rep_default_2018-04-13_14:45:06  36.9M      -   647G  -
Perhaps it has something to do with the blocksize?
Code:
zfs get volblocksize pve02pool/vm-200-disk-2
NAME                     PROPERTY      VALUE     SOURCE
pve02pool/vm-200-disk-2  volblocksize  8K        default
Udo
 
Do you use thin provisioning?
What RAID level do you use?
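For reference, both can be checked directly (standard commands; pool and dataset names taken from above):
Code:
zpool status pve02pool                     # shows the vdev layout: mirror, raidz, ...
zfs get refreservation,volblocksize,compression pve02pool/vm-200-disk-2
# refreservation=none usually means the zvol is sparse (thin provisioned)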
 
zfs list shows that the volume uses 1.01T, but the REFER of 727G plus all snapshots is much less...
But usedbysnapshots shows a higher value:
Code:
zfs get used,usedbydataset,usedbysnapshots pve02pool/vm-200-disk-2
NAME                     PROPERTY         VALUE     SOURCE
pve02pool/vm-200-disk-2  used             1.01T     -
pve02pool/vm-200-disk-2  usedbydataset    647G      -
pve02pool/vm-200-disk-2  usedbysnapshots  382G      -
This doesn't fit with zfs list -t snapshot:
Code:
zfs list -t snapshot | grep 200-disk-2
pve02pool/vm-200-disk-2@rep_default_2018-04-11_11:45:01   476M      -   605G  -
pve02pool/vm-200-disk-2@2018-04-11-180000                62.6M      -   685G  -
....
pve02pool/vm-200-disk-2@2018-04-13-120000                 110M      -   727G  -
pve02pool/vm-200-disk-2@rep_default_2018-04-13_14:45:06  36.9M      -   647G  -

the 'used' value of one snapshot just tells you how much space is used only in that snapshot (or, in other words, how much space you are guaranteed to get back if you delete it). data that is stored in more than one snapshot is not counted in any individual snapshot's 'used' value, but only in 'usedbysnapshots' of the dataset itself.

man zfs said:
The used space of a snapshot (see the Snapshots section) is space that is referenced exclusively by this snapshot. If this snapshot is destroyed, the amount of used space will be freed. Space that is shared by multiple snapshots isn't accounted for in this metric. When a snapshot is destroyed, space that was previously shared with this snapshot can become unique to snapshots adjacent to it, thus changing the used space of those snapshots. The used space of the latest snapshot can also be affected by changes in the file system. Note that the used space of a snapshot is a subset of the written space of the snapshot.
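So in the listing above, the per-snapshot USED values only add up to roughly 143G (the 142G snapshot plus about 1G from all the small ones), while usedbysnapshots is 382G; the remaining ~239G is data shared by two or more snapshots. A quick way to cross-check this (standard zfs options; -H/-p just give machine-readable, exact byte values):
Code:
# sum the per-snapshot USED values and compare with the dataset's usedbysnapshots
zfs list -Hp -r -t snapshot -o used pve02pool/vm-200-disk-2 \
    | awk '{sum+=$1} END {printf "%.1f G\n", sum/2^30}'
zfs get -Hp -o value usedbysnapshots pve02pool/vm-200-disk-2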
 
the 'used' value of one snapshot just tells you how much space is used only in that snapshot (or, in other words, how much space you are guaranteed to get back if you delete it). data that is stored in more than one snapshot is not counted in any individual snapshot's 'used' value, but only in 'usedbysnapshots' of the dataset itself.
Hi Fabian,
thanks for this info.

Now, after deleting some older snapshots (and writing zeros to free space), it looks better capacity-wise, except for the fragmentation:
Code:
zpool get capacity,size,health,fragmentation
NAME       PROPERTY       VALUE   SOURCE
pve02pool  capacity       56%     -
pve02pool  size           1.73T   -
pve02pool  health         ONLINE  -
pve02pool  fragmentation  36%     -
Udo
 
Hi Fabian,
thanks for this info.

Now, after deleting some older snapshots (and writing zeros to free space), it looks better capacity-wise, except for the fragmentation:
Code:
zpool get capacity,size,health,fragmentation
NAME       PROPERTY       VALUE   SOURCE
pve02pool  capacity       56%     -
pve02pool  size           1.73T   -
pve02pool  health         ONLINE  -
pve02pool  fragmentation  36%     -
Udo

the fragmentation refers to free space. if you are on SSDs, I wouldn't worry about 36% fragmentation.
 
I just saw this and have a thought: which SSDs are you using?

When I was first introduced to ZFS, I thought that running on any fast consumer SSD like the Samsung 950 EVO and Pro was good enough; we had several 1TB disks in RAID 1. We found out the hard way that every SSD we owned was not good enough for our workload with containers.

They were slow to start with, but in a very short time the disks got much slower, in some cases way below the performance of regular HDDs. I tried everything possible: changed the server, changed the HBA, added up to 64 GB of RAM. Nothing worked.

Then one guy here at the forums pointed out that consumer devices are not meant for ZFS.
Today we have changed all the disks on all our nodes to run in RAID 1 with Intel DC S3710s. In our tests we see insane speed improvements, even when fragmentation is at 61%.
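If anyone wants to test their own SSDs for this, a small synchronous 4k write test with fio shows the difference between consumer and datacenter drives quite clearly (test file path and size are just examples):
Code:
# sustained 4k sync writes - the pattern that hurts consumer SSDs most under ZFS
fio --name=synctest --filename=/pve02pool/fio-test --size=1G \
    --rw=write --bs=4k --ioengine=psync --sync=1 \
    --numjobs=1 --iodepth=1 --runtime=60 --time_based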
 
When should you worry? I have 40% on SSDs. Any way to know, before it gets us in a pinch?

unless you really want to dive into ZFS internals, you don't need to worry about metaslab fragmentation directly. you'll notice when it becomes worrisome because your performance will drop. that will only happen if ZFS can't write bigger sequential chunks and has to split them up into smaller ones. on (proper) SSDs this is not that much of a problem, since they handle smaller and random writes much better than spinning disks.
 
unless you really want to dive into ZFS internals, you don't need to worry about metaslab fragmentation directly. you'll notice when it becomes worrisome because your performance will drop. that will only happen if ZFS can't write bigger sequential chunks and has to split them up into smaller ones. on (proper) SSDs this is not that much of a problem, since they handle smaller and random writes much better than spinning disks.

Good to know. That said, is there a certain percentage of frag that I should monitor for, that way I don't have to find out when customers start complaining about performance issues?
 
Good to know. That said, is there a certain percentage of frag that I should monitor for, that way I don't have to find out when customers start complaining about performance issues?

not really. the FRAG value just tells you how large the segments of free space are on average. if you have lots of big, sequential writes, you want little fragmentation in order to not lose too much performance. but it depends a lot on your workload. you can set a notification for something like 70% to check for performance degradation then. but it probably makes more sense to monitor performance metrics directly instead ;)
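A minimal sketch of such a notification check, e.g. run from cron (threshold, pool name and mail address are placeholders):
Code:
#!/bin/sh
POOL=pve02pool
FRAG=$(zpool get -H -o value fragmentation "$POOL" | tr -d '%')
CAP=$(zpool get -H -o value capacity "$POOL" | tr -d '%')
[ "$FRAG" -ge 70 ] && echo "$POOL fragmentation at ${FRAG}%" | mail -s "zpool warning" admin@example.com
[ "$CAP" -ge 70 ] && echo "$POOL capacity at ${CAP}%" | mail -s "zpool warning" admin@example.com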
 
