fstrim Increases IO Wait, Containers Fail

May 18, 2019
I have a slower NVMe SSD in this machine, on which the most IO-demanding containers would choke during backups. I had to set ionice to 8 and bwlimit to 51200 to get these containers to perform without errors during backups.

fstrim (via the script below) ran weekly on this SSD, and it was never a problem for these three IO-demanding containers on the slower SSD.

Code:
/usr/sbin/pct list | awk '/^[0-9]/ {print $1}' | while read ct; do /usr/sbin/pct fstrim ${ct} && NOW=$(date +"%Y-%m-%d-%R") && echo -e "$NOW\tTrimming ${ct}"; done
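For completeness, a hypothetical cron entry for the weekly run (the script path and the time slot are placeholders, not necessarily what I use):

Code:
# /etc/cron.d/pct-fstrim -- hypothetical weekly schedule for the one-liner above
0 23 * * 6 root /usr/local/bin/pct-fstrim.sh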

In order to run the backups faster, I added another disk. This is a much faster disk (one of the fastest). I moved the three IO-demanding containers to it. I've been able to increase bwlimit to 179200 (and I am still increasing it daily; I might not need a limit after all), and there have been no problems during backup.
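For reference, a sketch of where these limits would live if set globally, assuming /etc/vzdump.conf rather than per-job options:

Code:
# /etc/vzdump.conf -- global backup limits (values from this thread)
bwlimit: 179200
ionice: 8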

But last night, the first time the weekly fstrim cron job ran on the new disk, the three containers failed to perform (they didn't do their job for a good part of the roughly 20 minutes it takes fstrim to run on every container).

There were no configuration changes on the machine or the containers. I am running kernel 5.3.13-2, with everything else current, apart from a pending reboot to switch to kernel 5.3.13-3.

I am wondering if this has something to do with how I set up LVM-thin on the new SSD. As far as I remember, part of the setup for the old SSD was done via the CLI. On the new SSD, I used the GUI for everything, so the two SSDs have some differences. nvme0n1 is the older/slower disk:

[screenshots: LVM-thin storage configuration of both SSDs]

Here's CPU & IO wait during fstrim and backups (the backup, as always, goes to spinning disks in RAID 10):

[screenshot: CPU & IO wait graph during fstrim and backups]
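For anyone who wants the same view from the CLI, a rough equivalent using iostat from the sysstat package (device names as in this thread):

Code:
# Watch CPU iowait and per-device utilization every 2 seconds while fstrim runs
iostat -x 2 nvme0n1 nvme1n1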


lsblk

[screenshot: lsblk output]


Why would a much faster disk fail to keep up during fstrim when the slower disk had no problems? Do I need to change something in the setup of the second disk?
 
Hi,

fstrim is not free. It produces many writes (deleting blocks).
Also, it depends on how the trim command is implemented in the disk firmware.
With NVMe, it is the Deallocate attribute of the Dataset Management command.
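As a hedged aside: one way to check whether a drive advertises the Dataset Management command at all, assuming nvme-cli is installed, is to decode its ONCS capability bits:

Code:
# Look for "Data Set Management" in the decoded ONCS bits
nvme id-ctrl /dev/nvme0n1 -H | grep -i "data set management"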
 
Hi Wolfgang. So if the differences in disk setup can't possibly be to blame, the takeaway is that this drive's fstrim is implemented in such a way that it is more taxing on IO? This is possible, but unlikely to explain the situation here, since IO also suffered while fstrim was running on the containers on the OLD drive. The containers on the new drive failed from 23:50 to 00:08, yet fstrim ran on the new drive three times (once for each container) for a total of only 5.5 minutes.
 
So if the differences in disk setup can't possibly be to blame, the takeaway is that this drive's fstrim is implemented in such a way that it is more taxing on IO?
Could be, but this is easily tested:
allocate a thin LV and write 10 GB of random data to it. Then erase the data and run fstrim.
Take the time and compare the disks.
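A minimal sketch of that test, assuming a scratch thin volume on the pool under test (VG/pool names follow the thread, sizes are placeholders):

Code:
# Create a scratch thin volume on the pool being tested
lvcreate -V 20G --thin -n trimtest lvmt_containers2/lvmt_containers2
mkfs.ext4 /dev/lvmt_containers2/trimtest
mkdir -p /mnt/trimtest && mount /dev/lvmt_containers2/trimtest /mnt/trimtest
# Write 10 GB of random data, delete it, then time the trim
dd if=/dev/urandom of=/mnt/trimtest/random.bin bs=1M count=10240
rm /mnt/trimtest/random.bin
time fstrim -v /mnt/trimtest
# Clean up
umount /mnt/trimtest && lvremove -y lvmt_containers2/trimtest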

This is possible, but unlikely to explain the situation here, since IO also suffered while fstrim was running on the containers on the OLD drive.
If one IO is hanging, it can also block other IO. Don't forget there is only one kernel handling the IO.

But you can send me the output of these commands so we can see if there is a configuration difference.

Code:
pvs -o all /dev/nvme0n1
pvs -o all /dev/nvme1n1
lvs -o all /dev/mapper/lvmt_containers1-lvmt_containers1*
lvs -o all /dev/mapper/lvmt_containers2-lvmt_containers2*
smartctl -Ha /dev/nvme0n1
smartctl -Ha /dev/nvme1n1
 
Thanks, I'm PMing you the diag report.
 
The main difference between them is the metadata size.
This could explain the effects.
The old disk has only 120 MiB of metadata.
That is very dangerous if you use snapshots.
That is the reason why we increased the default, so your new disk got about 16 GB of metadata.
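A quick way to compare the two pools yourself (pool names as in the commands above):

Code:
# Show data/metadata sizes and usage for both thin pools
lvs -a -o lv_name,lv_size,lv_metadata_size,data_percent,metadata_percent lvmt_containers1 lvmt_containers2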
 
Ok, but I have used snapshots on the old disk and didn't have a problem.
There are no problems until no metadata space is left.

So a smaller metadata allocation reduces the IO impact of fstrim?
I can't prove this in a quick lab test.
The man page says:

Code:
The discard behavior of a thin pool LV determines how discard requests are handled.  Enabling discard under a file system may adversely affect the file system performance (see the section on fstrim for an alternative.)
You can get more information in man lvmthin.
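For reference, a sketch of inspecting and changing the pool's discard mode described there (pool name assumed; some mode changes require the pool to be inactive):

Code:
# Show the current discard mode (ignore | nopassdown | passdown)
lvs -o lv_name,discards lvmt_containers2/lvmt_containers2
# Stop passing discards down to the device, for example
lvchange --discards nopassdown lvmt_containers2/lvmt_containers2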
 
Code:
# lvresize --poolmetadatasize +16G lvmt_containers1/lvmt_containers1
  Insufficient free space: 4096 extents needed, but only 0 available

Do I need to lvreduce before extending poolmetadatasize?
Can I set `thin_pool_autoextend_threshold` to 70 to grow it automatically? Or won't it grow because of the same lack of space?

PS: Didn't Linux 4.7 implement async discard? How do I enable it?
 
Do I need to lvreduce before extending poolmetadatasize?
If no space is free, you have to reduce an LV.
But be careful with reducing LVs; this can end in a corrupt LV.
So it is good to have a backup before reducing the size.
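A hedged sketch of that sequence, assuming a regular (non-pool) LV in the same VG can spare 16 GB; "some_regular_lv" is a placeholder, and a backup comes first:

Code:
# RISKY: shrink filesystem and LV together to free extents in the VG
lvreduce -L -16G --resizefs lvmt_containers1/some_regular_lv
# Grow the pool metadata into the freed extents
lvresize --poolmetadatasize +16G lvmt_containers1/lvmt_containers1
# Verify the remaining free space
vgs -o vg_name,vg_free lvmt_containers1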

Can I set `thin_pool_autoextend_threshold` to 70 to grow it automatically?

We decided not to use this parameter because you must also have free extents.
If you have no free space, it will fail anyway.
So I don't see any benefit in setting this parameter.
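For context, this is how the setting in question would look in /etc/lvm/lvm.conf (the 70% threshold is the value asked about, the 20% step is a placeholder):

Code:
activation {
    # Auto-extend a thin pool once it is 70% full, by 20% each time;
    # this only works while the VG still has free extents
    thin_pool_autoextend_threshold = 70
    thin_pool_autoextend_percent = 20
}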

PS: Didn't Linux 4.7 implement async discard? How do I enable it?
As far as I know, the application/FS must support this.
 
I switched the new NVMe disk from one with a Silicon Motion controller to one with the latest Phison controller (E16). I am still having a complete lock-up/CPU spike during fstrim.

So I ran
Code:
lsblk --discard
and saw a possible reason for the system lock-up during fstrim on the faster drive and not on the slower drive:

DISC-MAX is 16G for all LVM-thin container filesystems.
But while DISC-GRAN is 1M for each FS on the slower disk, it is 64K for the FSes on the faster disk.

How do I change/set DISC-GRAN to 1M on the new disk? I can't find an option in the GUI. As far as I recall, the LVM-thin pool on the older disk was created manually.

UPDATE: Debian sets the discard granularity to the block size (for a thin pool, that is its chunk size). So the question is: how do I set the block size for an LVM thin volume? I know I will have to empty out the disk and recreate the pool. Can this only be done via the CLI?
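For the record, a hedged sketch of recreating the pool with a 1M chunk size via the CLI (DISC-GRAN follows the pool's chunk size; names follow this thread, sizes are placeholders, and this destroys the pool, so the containers must be moved off first):

Code:
# DESTRUCTIVE: removes the existing pool and everything on it
lvremove -y lvmt_containers2/lvmt_containers2
# Recreate the thin pool with a 1M chunk size and 16G of metadata
lvcreate --type thin-pool --chunksize 1m -l 90%FREE --poolmetadatasize 16G -n lvmt_containers2 lvmt_containers2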
 