How to defrag a ZFS pool?

udo
Distinguished Member (Ahrensburg, Germany)
Hi,
I have a ZFS pool with an MSSQL VM which changes a lot of data. I use ZFS for disaster recovery: I send snapshots with pve-zsync to another cluster node and with znapzend to a remote host.
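A typical pve-zsync invocation for this kind of job looks roughly like this; the target host, pool, and job name below are just placeholders, not necessarily the ones used here:
Code:
# replicate the ZFS-backed disks of VM 200 to another node, keeping the last 7 snapshots
pve-zsync sync --source 200 --dest 192.168.1.10:pve02pool --name dr --maxsnap 7 --verbose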

After a short time of use, the pool shows high fragmentation:
Code:
zpool get capacity,size,health,fragmentation
NAME       PROPERTY       VALUE   SOURCE
pve02pool  capacity       73%     -
pve02pool  size           1.73T   -
pve02pool  health         ONLINE  -
pve02pool  fragmentation  40%     -
I have read that defragmentation isn't possible on ZFS. Is this still the case?

And the REFER values blow up, which doesn't fit the snapshots:
Code:
zfs list -t snapshot
NAME                                                      USED  AVAIL  REFER  MOUNTPOINT
pve02pool/vm-200-disk-2@rep_default_2018-04-11_11:45:01   476M      -   605G  -
pve02pool/vm-200-disk-2@2018-04-11-180000                62.6M      -   685G  -
pve02pool/vm-200-disk-2@2018-04-12-000000                30.9M      -   684G  -
pve02pool/vm-200-disk-2@2018-04-12-060000                11.3M      -   684G  -
pve02pool/vm-200-disk-2@2018-04-12-120000                94.2M      -   684G  -
pve02pool/vm-200-disk-2@rep_default_2018-04-12_17:45:16  2.01M      -   684G  -
pve02pool/vm-200-disk-2@2018-04-12-180000                   2M      -   684G  -
pve02pool/vm-200-disk-2@2018-04-13-000000                 142G      -   880G  -
pve02pool/vm-200-disk-2@rep_default_2018-04-13_05:45:07  2.45M      -   727G  -
pve02pool/vm-200-disk-2@2018-04-13-060000                2.42M      -   727G  -
pve02pool/vm-200-disk-2@2018-04-13-120000                91.7M      -   727G  -
pve02pool/vm-200-disk-2@rep_default_2018-04-13_13:30:05  46.7M      -   727G  -
If the only way is storage migration, that's not really usable: if I migrate the 1TB volume with storage migration, the volume uses the full space afterwards. That means migrate away, migrate back, and write zeros to the free space (see the sketch below).
And, if I see it right, a new ZFS sync (with pve-zsync and znapzend) will then transmit the whole VM disk again, because it is a new one (and 600GB over a WAN connection takes some days).
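For completeness, the workaround described above would look roughly like this; the disk name scsi0 and the target storage name are assumptions, and sdelete assumes a Windows guest:
Code:
# move the disk to another storage and back again - this rewrites the data sequentially
qm move_disk 200 scsi0 otherstorage --delete 1
qm move_disk 200 scsi0 pve02pool --delete 1
# afterwards zero the free space inside the guest (e.g. sdelete -z on Windows, fstrim on Linux);
# the zeroed blocks only free space on the zvol if compression or discard is in effect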

Any hints?

Udo
 
Hi,

I have read that defragmentation isn't possible on ZFS. Is this still the case?
AFAIK, and as the git log says, there is no defrag option.

But the fragmentation level only tells you how new data will be written, not how the already-written data is fragmented.
Normally, if you stay under 70% pool usage, you won't see performance problems.
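If you want to see where that fragmentation sits, the per-vdev breakdown is visible with (pool name taken from your output above):
Code:
# shows CAP and FRAG for the pool and for each vdev
zpool list -v pve02pool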

Do you have any performance problems with 73%?

And, if I see it right, a new ZFS sync (with pve-zsync and znapzend) will then transmit the whole VM disk again,
Yes, this is true.
 
Hi,
AFAIK, and as the git log says, there is no defrag option.
bad...
But the fragmentation level only tells you how new data will be written, not how the already-written data is fragmented.
Normally, if you stay under 70% pool usage, you won't see performance problems.

Do you have any performance problems with 73%?
Yesterday the monitoring showed some strange things during heavy IO on this pool.
The biggest problem is that 70% is reached after two weeks!
I learned the hard way before that with 93% nothing works anymore... so I removed this one VM, and since migrating it back everything has worked so far, but it doesn't look like I should keep running it like this for another week...

zfs list shows that the volume uses 1.01T, but the REFER of 727G plus all the snapshots is much less...
Yet usedbysnapshots shows a higher value:
Code:
zfs get used,usedbydataset,usedbysnapshots pve02pool/vm-200-disk-2
NAME                     PROPERTY         VALUE     SOURCE
pve02pool/vm-200-disk-2  used             1.01T     -
pve02pool/vm-200-disk-2  usedbydataset    647G      -
pve02pool/vm-200-disk-2  usedbysnapshots  382G      -
This doesn't fit with the output of zfs list -t snapshot:
Code:
zfs list -t snapshot | grep 200-disk-2
pve02pool/vm-200-disk-2@rep_default_2018-04-11_11:45:01   476M      -   605G  -
pve02pool/vm-200-disk-2@2018-04-11-180000                62.6M      -   685G  -
pve02pool/vm-200-disk-2@2018-04-12-000000                30.9M      -   684G  -
pve02pool/vm-200-disk-2@2018-04-12-060000                11.3M      -   684G  -
pve02pool/vm-200-disk-2@2018-04-12-120000                94.2M      -   684G  -
pve02pool/vm-200-disk-2@rep_default_2018-04-12_17:45:16  2.01M      -   684G  -
pve02pool/vm-200-disk-2@2018-04-12-180000                   2M      -   684G  -
pve02pool/vm-200-disk-2@2018-04-13-000000                 142G      -   880G  -
pve02pool/vm-200-disk-2@rep_default_2018-04-13_05:45:07  2.45M      -   727G  -
pve02pool/vm-200-disk-2@2018-04-13-060000                2.42M      -   727G  -
pve02pool/vm-200-disk-2@2018-04-13-120000                 110M      -   727G  -
pve02pool/vm-200-disk-2@rep_default_2018-04-13_14:45:06  36.9M      -   647G  -
Perhaps it has something to do with the block size?
Code:
zfs get volblocksize pve02pool/vm-200-disk-2
NAME                     PROPERTY      VALUE     SOURCE
pve02pool/vm-200-disk-2  volblocksize  8K        default
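Just a note on the block size: volblocksize is fixed when the zvol is created, so changing it means creating a new volume and copying the data over. A minimal sketch, with a hypothetical volume name and size:
Code:
# volblocksize cannot be changed on an existing zvol
zfs create -V 100G -o volblocksize=16k pve02pool/vm-200-disk-3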
Udo
 
Do you use thin provisioning?
What RAID level do you use?
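Both can be checked directly on the node, for example:
Code:
# a thin-provisioned zvol has refreservation set to none
zfs get refreservation pve02pool/vm-200-disk-2
# the pool layout (mirror/raidz) is visible in the status output
zpool status pve02pool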
 
zfs list shows that the volume uses 1.01T, but the REFER of 727G plus all the snapshots is much less...
Yet usedbysnapshots shows a higher value:
Code:
zfs get used,usedbydataset,usedbysnapshots pve02pool/vm-200-disk-2
NAME                     PROPERTY         VALUE     SOURCE
pve02pool/vm-200-disk-2  used             1.01T     -
pve02pool/vm-200-disk-2  usedbydataset    647G      -
pve02pool/vm-200-disk-2  usedbysnapshots  382G      -
This doesn't fit with the output of zfs list -t snapshot:
Code:
zfs list -t snapshot | grep 200-disk-2
pve02pool/vm-200-disk-2@rep_default_2018-04-11_11:45:01   476M      -   605G  -
pve02pool/vm-200-disk-2@2018-04-11-180000                62.6M      -   685G  -
....
pve02pool/vm-200-disk-2@2018-04-13-120000                 110M      -   727G  -
pve02pool/vm-200-disk-2@rep_default_2018-04-13_14:45:06  36.9M      -   647G  -

the 'used' value of one snapshot just tells you how much space is used only in that snapshot (or, in other words, how much space you are guaranteed to get back if you delete it). data that is stored in more than one snapshot is not counted in any individual snapshot's 'used' value, but only in 'usedbysnapshots' of the dataset itself.

man zfs said:
The used space of a snapshot (see the Snapshots section) is space that is referenced exclusively by this snapshot. If this snapshot is destroyed, the amount of used space will be freed. Space that is shared by multiple snapshots isn't accounted for in this metric. When a snapshot is destroyed, space that was previously shared with this snapshot can become unique to snapshots adjacent to it, thus changing the used space of those snapshots. The used space of the latest snapshot can also be affected by changes in the file system. Note that the used space of a snapshot is a subset of the written space of the snapshot.
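To see how much space a whole range of snapshots would give back (including the blocks they share), a dry-run destroy over a snapshot range helps; for example, with two snapshot names taken from the listing above:
Code:
# -n = dry run, -v = print how much space would be reclaimed
zfs destroy -nv pve02pool/vm-200-disk-2@rep_default_2018-04-11_11:45:01%2018-04-13-000000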
 
the 'used' value of one snapshot just tells you how much space is used only in that snapshot (or, in other words, how much space you are guaranteed to get back if you delete it). data that is stored in more than one snapshot is not counted in any individual snapshot's 'used' value, but only in 'usedbysnapshots' of the dataset itself.
Hi Fabian,
thanks for this info.

Now, after deleting some older snapshots (e.g. the ones containing the written zeros), it looks better capacity-wise, except for the fragmentation:
Code:
zpool get capacity,size,health,fragmentation
NAME       PROPERTY       VALUE   SOURCE
pve02pool  capacity       56%     -
pve02pool  size           1.73T   -
pve02pool  health         ONLINE  -
pve02pool  fragmentation  36%     -
Udo
 
Hi Fabian,
thanks for this info.

Now, after deleting some older snapshots (e.g. the ones containing the written zeros), it looks better capacity-wise, except for the fragmentation:
Code:
zpool get capacity,size,health,fragmentation
NAME       PROPERTY       VALUE   SOURCE
pve02pool  capacity       56%     -
pve02pool  size           1.73T   -
pve02pool  health         ONLINE  -
pve02pool  fragmentation  36%     -
Udo

the fragmentation refers to free space. if you are on SSDs, I wouldn't worry about 36% fragmentation.
 
I just saw this and have a thought: which SSDs are you using?

When I was first introduced to ZFS, I thought that running on any fast consumer SSD like the Samsung 950 EVO and Pro was good enough; we had several 1TB disks in RAID 1. We found out the hard way that none of the SSDs we owned were good enough for our workload with containers.

They were slow to start with, but in a very short time the disks got much slower, in some cases well below the performance of regular HDDs. I tried everything possible: changed the server and the HBA, added up to 64 GB of RAM. Nothing worked.

Then someone here on the forums pointed out that consumer devices are not meant for ZFS.
Today all our nodes run RAID 1 with Intel DC S3710 drives. In our tests we see insane speed improvements, even with 61% fragmentation.
 
When should you worry? I have 40% on SSDs. Any way to know, before it gets us in a pinch?

unless you really want to dive into ZFS internals, you don't need to worry about metaslab fragmentation directly. you'll notice when it becomes worrisome because your performance will drop. that will only happen if ZFS can't write bigger sequential chunks and has to split them up into smaller ones. on (proper) SSDs this is not that much of a problem, since they handle smaller and random writes much better than spinning disks.
 
unless you really want to dive into ZFS internals, you don't need to worry about metaslab fragmentation directly. you'll notice when it becomes worrisome because your performance will drop. that will only happen if ZFS can't write bigger sequential chunks and has to split them up into smaller ones. on (proper) SSDs this is not that much of a problem, since they handle smaller and random writes much better than spinning disks.

Good to know. That said, is there a certain percentage of frag that I should monitor for, so that I don't have to find out only when customers start complaining about performance issues?
 
Good to know. That said, is there a certain percentage of frag that I should monitor for, so that I don't have to find out only when customers start complaining about performance issues?

not really. the FRAG value just tells you how large the segments of free space are on average. if you have lots of big, sequential writes, you want little fragmentation in order not to lose too much performance. but it depends a lot on your workload. you can set a notification for something like 70% to check for performance degradation then. but it probably makes more sense to monitor performance metrics directly instead ;)
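As a starting point for monitoring performance directly, zpool iostat can be left running; the trailing 5 is just a sample interval in seconds:
Code:
# throughput and IOPS per vdev, refreshed every 5 seconds
zpool iostat -v pve02pool 5
# per-vdev request latencies (needs a reasonably recent ZFS on Linux)
zpool iostat -l pve02pool 5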
 
