Another SSD-related question about health and storage options

dkong

New Member
Aug 13, 2017
Dear Proxmox community,

I have yet another question about Proxmox and SSDs. I apologize if the following is too redundant, but I was not able to find exact answers and I'm hoping the people here can provide some perspective. I would be very thankful, and I will share my experiences and results here in the future if I'm able to get Proxmox working for my scenario.

A while back I installed Proxmox on my old HP ML110 G7. The only drive in the box is an older Kingston 120GB consumer SSD. After letting it run for about a day, even without any VMs running, I noticed that the health status of the SSD decreased rather fast. Reading some posts on this forum, I discovered that Proxmox, at least since version 4, can kill consumer SSDs in a heartbeat.

I've been using ESXi at home for a couple of years. I just install ESXi on a USB stick and then add a couple of consumer SSDs (Samsung Pro series) as datastores. There has never been a problem with that setup.
I'd like to move from ESXi to Proxmox, but it would kind of suck if that forced me to spend way more on enterprise-class SSDs. I want to do some testing first on the ML110 with two or three nested Proxmox instances on top of a bare-metal Proxmox install, mainly to learn about and test DRBD and Ceph configurations, without any serious workloads on top. I am going to get a Samsung PM863a (240GB) for that, which should be up to it. After that experience I would like to move my main box to Proxmox.

Would it be possible to use an enterprise-class SSD for the OS and then use consumer SSDs (the Samsung 960 Pro NVMe, for example) for the equivalent of ESXi datastores (no idea yet what that would be called in Proxmox land)? The idea, of course, is to be able to use cheaper but very fast SSDs for the VMs without killing the drives. Would the drives still get killed because they are mounted under the root of the Proxmox OS filesystem? I've never run much Linux in my VMs on ESXi; I suppose more Linux in that setup might have killed my consumer SSDs as well. Who can shed some light on this?

Kind regards
 
The problem is PVE, not Linux in general. PVE writes to disk all the time, even with no VMs running. True, it is just a few bytes every 3-4 seconds, but any write-op means at least 8 kB is written. Moreover, TRIM is not supported on ZFS (on Linux), so every block is quickly marked as "used", and then any small write generates multiple read-modify-write operations.

But this should affect only the PVE system disk. If you put the VMs on a different disk, it should not "eat" the SSD quickly. Of course, some tweaks in the VMs are necessary (e.g. "noatime" to minimise writes). Then your suggestion might work (using an enterprise-class SSD for PVE and consumer SSDs for the VMs)...
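If you want to see this for yourself, here is a rough Python sketch (my own, not something from PVE) that samples /proc/diskstats over an interval and extrapolates the idle write rate; the device name and sampling interval are assumptions you would adjust for your box:

Code:
import time

DEVICE = "sda"          # assumed device name; adjust for your system
INTERVAL_S = 600        # sample over 10 minutes

def sectors_written(dev):
    # /proc/diskstats: field 10 of each line (index 9 after splitting)
    # is "sectors written"; a sector is 512 bytes
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            if fields[2] == dev:
                return int(fields[9])
    raise ValueError(f"device {dev} not found")

start = sectors_written(DEVICE)
time.sleep(INTERVAL_S)
written_bytes = (sectors_written(DEVICE) - start) * 512

per_day_gb = written_bytes * (86400 / INTERVAL_S) / 1e9
print(f"{written_bytes / 1e6:.1f} MB written in {INTERVAL_S}s "
      f"(~{per_day_gb:.1f} GB/day at this rate)")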
 
Thanks a lot for your reply. I might just grab an extra second-hand SAS card and a cheap 10k or 15k RPM SAS disk and use that for the OS. That saves some money and should be good enough.
 
Is a tmpfs volume a solution for this problem?
(I've got SuperDOMs (Supermicro 64GB SATA DOM SSDs) in every node for the OS.)

El Tebe, the Supermicro DOMs would be a perfect solution if they don't get killed. I have no idea about their durability. Could you provide some insight into how many nodes you run, their storage setup, how long the DOMs have been running, and what their lifetime indication is?
About the tmpfs remark: do you mean placing just that directory on another storage target that can handle the amount of IO?
 
Moreover, TRIM is not supported on ZFS (on Linux), so every block is quickly marked as "used"...

This is wrong in my opinion. ZFS is a CoW system, so it does not modify any existing data block on disk.

I have several servers that use consumer SSDs (including 120 GB Kingstons) for ZFS cache and ZIL, and all are still usable. I also have some Proxmox nodes with consumer SSDs for the Proxmox OS, but /tmp and /var/log are symlinked to ZFS datasets (spinning HDDs), and after half a year of usage the health status is 99.0%.
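For anyone copying this setup, here is a small hypothetical Python check (paths are just the ones mentioned above) that the symlinked directories really end up on a different block device than the root filesystem:

Code:
import os

root_dev = os.stat("/").st_dev
for path in ("/tmp", "/var/log"):
    target = os.path.realpath(path)            # follow the symlink
    same = os.stat(target).st_dev == root_dev  # same device as / ?
    print(f"{path} -> {target}: "
          f"{'still on the root device!' if same else 'on a separate device'}")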
 
  • Like
Reactions: dkong
Thanks a lot for the added insight, guletz. Could you please clarify one more thing, as I think I am missing something right now? The topic that Rhinox mentioned (found here) talks about PVE writing a lot to the /etc/pve location. Placing /tmp and /var/log on spinning disks would not fully alleviate the problems, if I'm following along correctly. What PVE version are you running?

One more question about your setup (I'm new to PVE and I've yet to figure out a sensible storage setup, etc.): you mention using ZFS datasets on spinning disks, but with cache and ZIL on SSDs. Is that ZFS construction where you keep your VMs? And if so, do you host the datasets on the same physical boxes that are also running PVE with the VMs, or do you have separate storage boxes for your setup?
 
Thanks a lot for the added insight, guletz. Could you please clarify one more thing, as I think I am missing something right now? The topic that Rhinox mentioned (found here) talks about PVE writing a lot to the /etc/pve location. Placing /tmp and /var/log on spinning disks would not fully alleviate the problems, if I'm following along correctly.

Not fully, but it will help a lot.


I use a dedicated dataset for the VMs, which is on the same Proxmox node.
 
/etc/pve is a FUSE filesystem; any writes happening there are actually writes to an SQLite DB in /var/lib/pve-cluster.
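If you are curious, here is a quick sketch (assuming the default DB path, and best tried on a test node) that opens that database read-only and just lists its tables, since the exact schema depends on your PVE version:

Code:
import sqlite3

DB = "/var/lib/pve-cluster/config.db"   # assumed default location on a PVE node

# open strictly read-only so nothing is written to the backing store
con = sqlite3.connect(f"file:{DB}?mode=ro", uri=True)
tables = con.execute(
    "SELECT name FROM sqlite_master WHERE type='table'").fetchall()
print("tables:", [t[0] for t in tables])
con.close()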
 
/etc/pve is a FUSE filesystem; any writes happening there are actually writes to an SQLite DB in /var/lib/pve-cluster.


So we do not use a hammer when we need to put a nail in a wall (a file); we could do a better job if we use a bulldozer (FUSE + a database = a filesystem?) ;)
How bad can a hammer be?
 
So we do not use a hammer when we need to put a nail in a wall (a file); we could do a better job if we use a bulldozer (FUSE + a database = a filesystem?) ;)
How bad can a hammer be?
The point of using a database is to treat each change as a transaction, which can be (safely) synced to disk and over the network to the other cluster nodes, so we can be sure that every node in the cluster has a consistent and safe view of the content.

The FUSE wrapper is there to make it easy to edit and view the configurations.
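As a generic illustration of that transactional behaviour (not pmxcfs's actual schema, just plain sqlite3 in memory with a made-up table): a change either commits fully or is rolled back, so a half-finished write never becomes visible.

Code:
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE config (name TEXT PRIMARY KEY, data TEXT)")

try:
    with con:   # BEGIN ... COMMIT; rolls back automatically on error
        con.execute("INSERT INTO config VALUES "
                    "('qemu-server/100.conf', 'memory: 2048')")
        raise RuntimeError("simulated crash before commit")
except RuntimeError:
    pass

# the half-finished change never became visible
print(con.execute("SELECT count(*) FROM config").fetchone()[0])   # -> 0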
 
This is wrong in my opinion. ZFS is a CoW system, so it does not modify any existing data block on disk.
For the SSD controller it does not matter what filesystem you have. Even if your OS wants to write to some particular sector, the SSD decides on its own where the data will actually be written, for the benefit of wear-leveling.

Moreover, an SSD cannot change content on a per-sector basis, because "erase" on an SSD works on whole blocks, not individual sectors. Even if you only want to write a single sector, if that block is marked as "used", the whole block must be read, erased, and rewritten...
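A little back-of-envelope sketch of that worst case, with purely illustrative sector and erase-block sizes (real drives vary):

Code:
SECTOR      = 512                 # what the OS thinks it writes, in bytes
ERASE_BLOCK = 512 * 1024          # assumed flash erase-block size (512 KiB)

# If the erase block is already marked "used", changing one sector means
# the whole block is read, erased and rewritten:
amplification = ERASE_BLOCK / SECTOR
print(f"worst case: {amplification:.0f}x write amplification "
      f"({ERASE_BLOCK // 1024} KiB rewritten for a {SECTOR} B update)")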
 
The point of using a database is to treat each change as a transaction, which can be (safely) synced to disk and over the network to the other cluster nodes, so we can be sure that every node in the cluster has a consistent and safe view of the content.

The FUSE wrapper is there to make it easy to edit and view the configurations.

I think this is not the best solution. Maybe Consul could be a better one; Consul is used in many cluster systems for orchestration. In my opinion, it is surely better than a DB/FUSE filesystem.
 
For the SSD controller it does not matter what filesystem you have. Even if your OS wants to write to some particular sector, the SSD decides on its own where the data will actually be written, for the benefit of wear-leveling.


This is correct. I missed the SSD firmware/controller layer. My fault.
 
/etc/pve is a FUSE filesystem; any writes happening there are actually writes to an SQLite DB in /var/lib/pve-cluster.
So, to alleviate this problem of PVE constantly writing and killing SSD drives, would one move /etc/pve and /var/lib/pve-cluster to a SATA drive?
 
So, to alleviate this problem of PVE constantly writing and killing SSD drives, would one move /etc/pve and /var/lib/pve-cluster to a SATA drive?

Moving /etc/pve does not make sense; it is a mount point, and anything below it is just provided by FUSE, not actual files on disk. Moving /var/lib/pve-cluster to a spinning disk will prevent pmxcfs from wearing out your SSD, yes.
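For the record, a rough sketch of how such a move could look, assuming /mnt/hdd is the spinning disk's mount point and that you are comfortable stopping the cluster filesystem briefly; this is not an official procedure, so try it on a non-production node first:

Code:
import os, shutil, subprocess

OLD = "/var/lib/pve-cluster"
NEW = "/mnt/hdd/pve-cluster"      # assumed mount point on the SATA/spinning disk

subprocess.run(["systemctl", "stop", "pve-cluster"], check=True)   # stop pmxcfs first

shutil.copytree(OLD, NEW)          # copy config.db (and any journal files)
shutil.move(OLD, OLD + ".bak")     # keep the original until everything checks out
os.symlink(NEW, OLD)               # /var/lib/pve-cluster now points at the HDD

subprocess.run(["systemctl", "start", "pve-cluster"], check=True)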
 
Moving /etc/pve does not make sense; it is a mount point, and anything below it is just provided by FUSE, not actual files on disk. Moving /var/lib/pve-cluster to a spinning disk will prevent pmxcfs from wearing out your SSD, yes.
Thank you.
 
El Tebe, the Supermicro DOMs would be a perfect solution if they don't get killed. I have no idea about their durability. Could you provide some insight into how many nodes you run, their storage setup, how long the DOMs have been running, and what their lifetime indication is?
About the tmpfs remark: do you mean placing just that directory on another storage target that can handle the amount of IO?

Supermicro SATA DOM endurance: 1 drive write per day (DWPD).
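As a quick bit of arithmetic (the 5-year period is my assumption, not a quoted spec), 1 DWPD on a 64 GB DOM works out roughly like this:

Code:
CAPACITY_GB = 64
DWPD        = 1
YEARS       = 5                       # assumed endurance/warranty period

tbw = CAPACITY_GB * DWPD * 365 * YEARS / 1000
print(f"~{tbw:.0f} TB of total writes budgeted "
      f"({CAPACITY_GB * DWPD} GB/day for {YEARS} years)")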
 
I think this is not the best solution. Maybe Consul could be a better one; Consul is used in many cluster systems for orchestration. In my opinion, it is surely better than a DB/FUSE filesystem.

Agree
Consul is made exactly for this purpose
 
