Proxmox writing about 40 GB/h to my SSDs for no reason

drbeam

Apr 5, 2021
Hi all,

I have a problem with my Proxmox setup.
I'm running two nodes, each with a 512 GB NVMe SSD, an i5-8500T and 32 GB of memory. On these two machines I'm running 4 VMs: a Docker host with ~15 containers, a mail server (Mailcow Dockerized), a VPN server and a proxy. All of these VMs are replicated to the other node through ZFS every minute.
One disk has 1700 hours and 67 TB written, the one in the other host has 3300 hours and 66 TB written. What is causing these excessive writes? For the first drive that's about 40 GB/h, or 11 MB/s continuously.
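For reference, the wear figures come from SMART; something like this should show them on an NVMe drive (the device name and exact field names depend on the model):
Code:
# "Data Units Written" on NVMe is in units of 512,000 bytes; smartctl
# also prints a converted value in TB next to it.
smartctl -a /dev/nvme0n1 | grep -Ei 'power on hours|data units written'
# 67 TB over 1700 h is roughly 67,000 GB / 1700 h ≈ 39 GB/h ≈ 11 MB/s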

This is the output of iostat:
Code:
Device             tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
nvme0n1         267.57         0.50         6.00     643034    7728174
I don't trust this iostat output because the MB_wrtn/s value doesn't change anymore; it's stuck at 6.00 MB/s.

Even with replication disabled and no VM or container running on the host, it's still more than 2 MB/s:
Code:
Device             tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
nvme0n1         162.22         0.55         2.61        543       2598

I read about write amplification here: https://forum.proxmox.com/threads/high-ssd-wear-after-a-few-days.24840/#post-124488 but I have no idea how to check for it or how to find the cause.

As a first attempt I set the replication schedule to every two hours, but that doesn't solve my problem at all... Please help :(
 
It is normal for ZFS to write a lot to SSDs. That's why you should always buy enterprise-grade SSDs that can handle some petabytes of writes.
Some things that will cause write amplification:
- sync writes (especially if your SSDs don't have power-loss protection and therefore can't use the internal cache to optimize writes)
- a lot of small random writes instead of a few big writes
- bad padding (for example, a volblocksize that is not high enough when using raidz)
- mixed block sizes (LBA size, ashift, volblocksize, virtio SCSI controller block size, block size of the guest's filesystem) - see the commands sketched below this list
- copy-on-write filesystems (like ZFS)
- virtualization overhead in general
- raid
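A rough sketch of how to look these values up (pool, dataset and device names here are just examples, adjust them to your setup):
Code:
zpool get ashift rpool                              # 12 means 4K sectors
zfs get volblocksize rpool/data/vm-100-disk-0       # zvol block size of a VM disk
zfs get recordsize,sync,compression rpool/data      # dataset-level settings
smartctl -a /dev/nvme0n1 | grep -iE 'sector size|lba size'   # LBA size reported by the drive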

It isn't that hard to write 6 MB/s to a drive. My home server sits at 24 MB/s here while idling: around 500 kB/s of writes from the guests is amplified to 8 MB/s on the host, and amplified again from 8 MB/s to 24 MB/s from the host to the SSD's NAND cells.
This is what my iostat looks like (sdh + sdi are drives just for storing logs and metrics in a MySQL / MongoDB / Elasticsearch database):

Code:
Device             tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sdc              49.24       171.93       321.76  180092960  337042572
sda               3.50        94.24        26.68   98711400   27945477
sdb               3.44        93.09        26.68   97512511   27945476
md0               0.00         0.01         0.04       5386      44691
md1               2.05         1.49        26.52    1559636   27778984
sde              46.43       173.12       322.36  181341928  337666996
sdf              46.33       173.05       322.37  181269840  337677812
sdd              46.40       173.17       322.23  181399756  337534292
sdg              46.03       173.09       322.40  181309180  337713072
sdh              55.87       316.26      3095.16  331287032 3242176332
sdi              55.73       315.73      3095.16  330724592 3242176264
dm-0              2.05         1.49        26.50    1557180   27763124
dm-1              1.98         1.44        27.51    1509748   28820420
dm-2              0.06         0.04         0.20      46232     210076

The last amplification can't be seen by iostat because it all happens inside the SSD. If you want to monitor the drive's internal write amplification you need to look at the SMART stats (if the SSD's firmware supports that). My SSD, for example, reports "Host_Writes_32MiB" and "NAND_Writes_32MiB". The first is the amount of data the host sent to the SSD to write, and the second is what the SSD actually wrote to the flash. My "NAND_Writes_32MiB" is 2.5 to 3.5 times higher than my "Host_Writes_32MiB", so I have an additional internal write amplification of around factor 3 that you won't see if you only look at the stats reported by the host.
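If your firmware exposes those attributes, something like this shows them (the device name is just an example, and the attribute names are the ones from my drive; yours may differ):
Code:
smartctl -A /dev/sda | grep -Ei 'host_writes|nand_writes'
# internal write amplification ≈ NAND_Writes_32MiB / Host_Writes_32MiB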

Let's say you have a write amplification from guest to host of factor 10. That way 600 kB/s of guest writes results in 6 MB/s of writes on the host, and 600 kB/s is something my home server easily creates by itself just for storing metrics and logs.
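If you want to estimate your own guest-to-host amplification factor, one rough way is to run iostat over the same time window inside the guests and on the host and divide the write rates (device names below are just examples):
Code:
# inside each guest, over a 10 minute window:
iostat -d -k 600 2 sda
# on the Proxmox host, over the same window:
iostat -d -k 600 2 nvme0n1
# amplification ≈ host kB_wrtn/s / sum of the guests' kB_wrtn/s (use the second table)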

Edit:
And iostat shows you the average since boot. If you want to see what happened in the last hour you should try something like iostat 3600 2 and wait an hour for the command to finish. After an hour it will print a second table that represents only what happened during that last hour.

If you want to see whether sync writes are a big problem, you could use "zfs set sync=disabled" on the dataset for testing. That way every write is forced to be an async write.
You could also try increasing zfs_txg_timeout so small writes don't cause as much write amplification.
But all of that sacrifices data safety for a lower write amplification.
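A rough sketch of both, for testing only (rpool/data is a placeholder for your own pool/dataset):
Code:
zfs set sync=disabled rpool/data    # treat every sync write as async
zfs get sync rpool/data             # verify the setting
zfs set sync=standard rpool/data    # revert after testing

# raise the transaction group flush interval from the default 5s to e.g. 15s
# (not persistent across reboots unless set in /etc/modprobe.d):
echo 15 > /sys/module/zfs/parameters/zfs_txg_timeout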

I gave up on optimizing my write amplification and replaced all my consumer SSDs with enterprise SSDs that can handle such a high write rate.
 
Hi Dunuin,

I tried "zfs set sync=disabled" and it helped a bit. Can you please explain what that means and why I should not use it in production?

Replacing the drives in the long term is definitely an option; I'm looking at the IronWolf 510 SSDs with 480 or 960 GB, but for now I would like to minimize the writes so I have a bit more time.
 
I tried "zfs set sync=disabled" and it helped a bit. Can you please explain what that means and why I should not use it in production?
Google what the difference between synchronous and asynchronous writes is. If you set "sync=disabled", every sync write is handled as an async write. In short:
If a program needs to make sure that a write is safely stored, because it's an important one, it tells the OS not to cache it and waits for the OS to confirm that the file was successfully written. Only when that answer is received does it start writing the second file. Each file is written one after another, not in parallel, and directly to the persistent drive, not to a volatile cache. The point is that you can only lose a single file, and the program will know that and can plan accordingly, so nothing really gets screwed up. Keep in mind that everything cached in RAM is lost when a power outage, OS crash, or CPU/mainboard/RAM/PSU failure occurs. If you set "sync=disabled", your OS will lie and tell the program that the file was successfully saved when in reality it is not; it's just cached in RAM. If a power outage occurs at that point, those files are lost.
Replacing the drives in the long term is definitely an option; I'm looking at the IronWolf 510 SSDs with 480 or 960 GB, but for now I would like to minimize the writes so I have a bit more time.
Those are still only prosumer drives. They're not enterprise grade, have no power-loss protection, and so on. If you really want M.2 (and not U.2 with M.2-to-U.2 cables) there are drives like the DC1000B, but U.2 drives (like the DC1000M) would be a far better option. SATA (like the Intel S4610) should be totally fine too if you don't need high-end performance.
 
The Seagate IronWolf 110 SSD series has power-loss protection, and they are cheaper, but offer a lower DWPD. I've been using them in one host, and they are OK.
 
Because he wants to buy IronWolf 510 SSDs (M.2) and you are talking about the IronWolf 110 (SATA). Just wanted to clarify that the M.2 SSD doesn't have that.
 
Yes, for the 510 it is really strange: on reviewers' sites you can see power-loss protection listed, but on the Seagate site there is no mention of it. Strange.
On the German Seagate site there is a table where they clearly state "Powerloss Protection: No" for the IronWolf 510 and IronWolf 125 series, and "Powerloss Protection: Yes" for the IronWolf Pro 125 and IronWolf 110.
The main problem here is that M.2 simply has far too small a footprint. There is just not enough room to add the capacitors for power-loss protection or to add more flash chips for better durability and performance. That's why there are only a few M.2 enterprise SSDs out there and why the U.2 drives are always better. With U.2 you can use the M.2 slot for the connection, but the SSD sits in a normal 2.5" or 3.5" case, so there is enough space.
 
