fstrim Increases IO Wait, Containers Fail

May 18, 2019
I have a slower NVMe SSD in this machine, on which the most IO-demanding containers would choke during backups. I had to set ionice to 8 and bwlimit to 51200 to get these containers to perform without errors during backups.

fstrim (via the script below) ran weekly on this SSD, and it was never a problem for these three IO-demanding containers on the slower SSD.

Code:
/usr/sbin/pct list | awk '/^[0-9]/ {print $1}' | while read ct; do /usr/sbin/pct fstrim ${ct} && NOW=$(date +"%Y-%m-%d-%R") && echo -e "$NOW\tTrimming ${ct}"; done
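For completeness, a hypothetical cron entry for the weekly run (the script path and the time slot are placeholders, not necessarily what I use):

Code:
# /etc/cron.d/pct-fstrim -- hypothetical weekly schedule for the one-liner above
0 23 * * 6 root /usr/local/bin/pct-fstrim.sh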

In order to run the backups faster, I added another disk. This is a much faster disk (one of the fastest). I moved the three IO-demanding containers to it. I've been able to increase bwlimit to 179200 (and I am still increasing it daily; I might not need a limit after all), and there have been no problems during backup.
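For reference, a sketch of where these limits would live if set globally, assuming /etc/vzdump.conf rather than per-job options:

Code:
# /etc/vzdump.conf -- global backup limits (values from this thread)
bwlimit: 179200
ionice: 8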

But last night, the first time the weekly fstrim cron job ran on the new disk, the three containers failed to perform (they didn't do their job for a good part of the roughly 20 minutes it takes fstrim to run on every container).

There were no configuration changes on the machine or the containers. I am running kernel 5.3.13-2, with everything else current, apart from a pending reboot to switch to kernel 5.3.13-3.

I am wondering if this has something to do with how I set up LVM-thin on the new SSD. As far as I remember, part of the setup for the old SSD was done via the CLI. On the new SSD, I used the GUI for everything, so the two SSDs have some differences. nvme0n1 is the older/slower disk:

[screenshots: LVM-thin storage configuration of both SSDs]

Here's CPU & IO wait during fstrim and backups (the backup, as always, goes to spinning disks in RAID 10):

[screenshot: CPU & IO wait graph during fstrim and backups]
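For anyone who wants the same view from the CLI, a rough equivalent using iostat from the sysstat package (device names as in this thread):

Code:
# Watch CPU iowait and per-device utilization every 2 seconds while fstrim runs
iostat -x 2 nvme0n1 nvme1n1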


lsblk

[screenshot: lsblk output]


Why would a much faster disk fail to keep up during fstrim when the slower disk had no problems? Do I need to change something in the setup of the second disk?
 
Hi,

fstrim is not free. It produces many writes (deleting blocks).
Also, it depends on how the trim command is implemented in the disk firmware.
With NVMe, it is the Deallocate attribute of the Dataset Management command.
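As a hedged aside: one way to check whether a drive advertises the Dataset Management command at all, assuming nvme-cli is installed, is to decode its ONCS capability bits:

Code:
# Look for "Data Set Management" in the decoded ONCS bits
nvme id-ctrl /dev/nvme0n1 -H | grep -i "data set management"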
 
Hi Wolfgang. So if the differences in disk setup can't possibly be to blame, the takeaway is that this drive's fstrim is implemented in such a way that it is more taxing on IO? This is possible, but unlikely to explain the situation here, since IO also suffered while fstrim was running on the containers on the OLD drive. The containers on the new drive failed from 23:50 to 00:08, yet fstrim ran on the new drive three times (once for each container) for a total of only 5.5 minutes.
 
So if the differences in disk setup can't possibly be to blame, the takeaway is that this drive's fstrim is implemented in such a way that it is more taxing on IO?
Could be, but this is easily tested:
allocate a thin LV and write 10 GB of random data to it. Then erase the data and run fstrim.
Take the time and compare the disks.
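A minimal sketch of that test, assuming a scratch thin volume on the pool under test (VG/pool names follow the thread, sizes are placeholders):

Code:
# Create a scratch thin volume on the pool being tested
lvcreate -V 20G --thin -n trimtest lvmt_containers2/lvmt_containers2
mkfs.ext4 /dev/lvmt_containers2/trimtest
mkdir -p /mnt/trimtest && mount /dev/lvmt_containers2/trimtest /mnt/trimtest
# Write 10 GB of random data, delete it, then time the trim
dd if=/dev/urandom of=/mnt/trimtest/random.bin bs=1M count=10240
rm /mnt/trimtest/random.bin
time fstrim -v /mnt/trimtest
# Clean up
umount /mnt/trimtest && lvremove -y lvmt_containers2/trimtest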

This is possible, but unlikely to explain the situation here, since IO also suffered while fstrim was running on the containers on the OLD drive.
If one IO is hanging, it can also block other IO. Don't forget there is only one kernel handling the IO.

But you can send me the output of these commands so we can see if there is a configuration difference.

Code:
pvs -o all /dev/nvme0n1
pvs -o all /dev/nvme1n1
lvs -o all /dev/mapper/lvmt_containers1-lvmt_containers1*
lvs -o all /dev/mapper/lvmt_containers2-lvmt_containers2*
smartctl -Ha /dev/nvme0n1
smartctl -Ha /dev/nvme1n1
 
Thanks, I'm PMing you the diag report.
 
The main difference between them is the metadata size.
This could explain the effects.
The old disk has only 120 MiB of metadata.
That is very dangerous if you use snapshots.
That is the reason why we increased the default, so your new disk got about 16 GB of metadata.
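A quick way to compare the two pools yourself (pool names as in the commands above):

Code:
# Show data/metadata sizes and usage for both thin pools
lvs -a -o lv_name,lv_size,lv_metadata_size,data_percent,metadata_percent lvmt_containers1 lvmt_containers2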
 
Ok, but I have used snapshots on the old disk and didn't have a problem.
There are no problems until no metadata space is left.

So a smaller metadata allocation reduces the IO impact of fstrim?
I can't prove this in a quick lab test.
The man page says:

Code:
The discard behavior of a thin pool LV determines how discard requests are handled.  Enabling discard under a file system may adversely affect the file system performance (see the section on fstrim for an alternative.)
You can get more information in man lvmthin.
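For reference, a sketch of inspecting and changing the pool's discard mode described there (pool name assumed; some mode changes require the pool to be inactive):

Code:
# Show the current discard mode (ignore | nopassdown | passdown)
lvs -o lv_name,discards lvmt_containers2/lvmt_containers2
# Stop passing discards down to the device, for example
lvchange --discards nopassdown lvmt_containers2/lvmt_containers2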
 
Code:
# lvresize --poolmetadatasize +16G lvmt_containers1/lvmt_containers1
  Insufficient free space: 4096 extents needed, but only 0 available

Do I need to lvreduce before extending poolmetadatasize?
Can I set `thin_pool_autoextend_threshold` to 70 to grow it automatically? Or won't it grow because of the same lack of space?

PS: Didn't Linux 4.7 implement async discard? How do I enable it?
 
Do I need to lvreduce before extending poolmetadatasize?
If no space is free, you have to reduce an LV.
But be careful with reducing LVs; this can end in a corrupt LV.
So it is good to have a backup before reducing the size.
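A hedged sketch of that sequence, assuming a regular (non-pool) LV in the same VG can spare 16 GB; "some_regular_lv" is a placeholder, and a backup comes first:

Code:
# RISKY: shrink filesystem and LV together to free extents in the VG
lvreduce -L -16G --resizefs lvmt_containers1/some_regular_lv
# Grow the pool metadata into the freed extents
lvresize --poolmetadatasize +16G lvmt_containers1/lvmt_containers1
# Verify the remaining free space
vgs -o vg_name,vg_free lvmt_containers1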

Can I set `thin_pool_autoextend_threshold` to 70 to grow it automatically?

We decided not to use this parameter because you must also have free extents.
If you have no free space, it will fail anyway.
So I don't see any benefit in setting this parameter.
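For context, this is how the setting in question would look in /etc/lvm/lvm.conf (the 70% threshold is the value asked about, the 20% step is a placeholder):

Code:
activation {
    # Auto-extend a thin pool once it is 70% full, by 20% each time;
    # this only works while the VG still has free extents
    thin_pool_autoextend_threshold = 70
    thin_pool_autoextend_percent = 20
}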

PS: Didn't Linux 4.7 implement async discard? How do I enable it?
As far as I know, the application/FS must support this.
 
I switched the new NVMe disk from one with a Silicon Motion controller to one with the latest Phison controller (E16). I am still having a complete lock-up/CPU spike during fstrim.

So I ran
Code:
lsblk --discard
and saw a possible reason for the system lock-up during fstrim on the faster drive and not on the slower drive:

DISC-MAX is 16G for all LVM-thin container filesystems.
But while DISC-GRAN is 1M for each FS on the slower disk, it is 64K for the FSes on the faster disk.

How do I change/set DISC-GRAN to 1M on the new disk? I can't find an option in the GUI. As far as I recall, the LVM-thin pool on the older disk was created manually.

UPDATE: Debian sets the discard granularity to the block size (for a thin pool, that is its chunk size). So the question is: how do I set the block size for an LVM thin volume? I know I will have to empty out the disk and recreate the pool. Can this only be done via the CLI?
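For the record, a hedged sketch of recreating the pool with a 1M chunk size via the CLI (DISC-GRAN follows the pool's chunk size; names follow this thread, sizes are placeholders, and this destroys the pool, so the containers must be moved off first):

Code:
# DESTRUCTIVE: removes the existing pool and everything on it
lvremove -y lvmt_containers2/lvmt_containers2
# Recreate the thin pool with a 1M chunk size and 16G of metadata
lvcreate --type thin-pool --chunksize 1m -l 90%FREE --poolmetadatasize 16G -n lvmt_containers2 lvmt_containers2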
 