Another SSD-related question about health and storage options

dkong

New Member
Aug 13, 2017
Dear Proxmox community,

I have yet another question about Proxmox and SSDs. I apologize if the following is too redundant, but I was not able to find exact answers and I'm hoping the people here can provide some perspective. I would be very thankful, and I will share my experiences and results here in the future if I'm able to get Proxmox working for my scenario.

A while back I installed Proxmox on my old HP ML110 G7. The only drive in the box is an older Kingston 120GB consumer SSD. After letting it run for about a day, even without any VMs running, I noticed that the health status of the SSD decreased rather fast. Reading some posts on this forum, I discovered that Proxmox, at least since version 4, can kill consumer SSDs in a heartbeat.

I've been using ESXi at home for a couple of years. I just install ESXi on a USB stick and then add a couple of consumer SSDs (Samsung Pro series) as datastores. There has never been a problem with that setup.
I'd like to move from ESXi to Proxmox, but it would kind of suck if that forced me to spend way more on enterprise-class SSDs. I want to do some testing first on the ML110 with two or three nested Proxmox instances on top of a bare-metal Proxmox install, mainly to learn about and test DRBD and Ceph configurations, without any serious workloads on top. I am going to get a Samsung PM863a (240GB) for that, which should be up to it. After that experience I would like to move my main box to Proxmox.

Would it be possible to use an enterprise-class SSD for the OS and then use consumer SSDs (the Samsung 960 Pro NVMe, for example) for the equivalent of ESXi datastores (no idea yet what that would be called in Proxmox land)? The idea, of course, is to be able to use cheaper but very fast SSDs for the VMs without killing the drives. Would the drives still get killed because they are mounted under the root of the Proxmox OS filesystem? I've never run much Linux in my VMs on ESXi; I suppose more Linux in that setup might have killed my consumer SSDs as well. Who can shed some light on this?

Kind regards
 
The problem is PVE, not Linux in general. PVE writes to disk all the time, even with no VMs running. True, it is just a few bytes every 3-4 seconds, but any write-op means at least 8 kB is written. Moreover, TRIM is not supported on ZFS (on Linux), so every block is quickly marked as "used", and then any small write generates multiple read-modify-write operations.

But this should affect only the PVE system disk. If you put the VMs on a different disk, it should not "eat" the SSD quickly. Of course, some tweaks in the VMs are necessary (e.g. "noatime" to minimise writes). Then your suggestion might work (using an enterprise-class SSD for PVE and consumer SSDs for the VMs)...
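If you want to see this for yourself, here is a rough Python sketch (my own, not something from PVE) that samples /proc/diskstats over an interval and extrapolates the idle write rate; the device name and sampling interval are assumptions you would adjust for your box:

Code:
import time

DEVICE = "sda"          # assumed device name; adjust for your system
INTERVAL_S = 600        # sample over 10 minutes

def sectors_written(dev):
    # /proc/diskstats: field 10 of each line (index 9 after splitting)
    # is "sectors written"; a sector is 512 bytes
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            if fields[2] == dev:
                return int(fields[9])
    raise ValueError(f"device {dev} not found")

start = sectors_written(DEVICE)
time.sleep(INTERVAL_S)
written_bytes = (sectors_written(DEVICE) - start) * 512

per_day_gb = written_bytes * (86400 / INTERVAL_S) / 1e9
print(f"{written_bytes / 1e6:.1f} MB written in {INTERVAL_S}s "
      f"(~{per_day_gb:.1f} GB/day at this rate)")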
 
Thanks a lot for your reply. I might just grab an extra second-hand SAS card and a cheap 10k or 15k RPM SAS disk and use that for the OS. That saves some money and should be good enough.
 
Is a tmpfs volume a solution for this problem?
(I've got SuperDOMs (Supermicro 64GB SATA DOM SSDs) in every node for the OS.)

El Tebe, the Supermicro DOMs would be a perfect solution if they don't get killed. I have no idea about their durability. Could you provide some insight into how many nodes you run, their storage setup, how long the DOMs have been running, and what their lifetime indication is?
About the tmpfs remark: do you mean placing just that directory on another storage target that can handle the amount of IO?
 
Moreover, TRIM is not supported on ZFS (on Linux), so every block is quickly marked as "used"...

This is wrong in my opinion. ZFS is a CoW system, so it does not modify any existing data block on disk.

I have several servers that use consumer SSDs (including 120 GB Kingstons) for ZFS cache and ZIL, and all are still usable. I also have some Proxmox nodes with consumer SSDs for the Proxmox OS, but /tmp and /var/log are symlinked to ZFS datasets (spinning HDDs), and after half a year of usage the health status is 99.0%.
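For anyone copying this setup, here is a small hypothetical Python check (paths are just the ones mentioned above) that the symlinked directories really end up on a different block device than the root filesystem:

Code:
import os

root_dev = os.stat("/").st_dev
for path in ("/tmp", "/var/log"):
    target = os.path.realpath(path)            # follow the symlink
    same = os.stat(target).st_dev == root_dev  # same device as / ?
    print(f"{path} -> {target}: "
          f"{'still on the root device!' if same else 'on a separate device'}")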
 
  • Like
Reactions: dkong
Thanks a lot for the added insight, guletz. Could you please clarify one more thing, as I think I am missing something right now? The topic that Rhinox mentioned (found here) talks about PVE writing a lot to the /etc/pve location. Placing /tmp and /var/log on spinning disks would not fully alleviate the problems, if I'm following along correctly. What PVE version are you running?

One more question about your setup (I'm new to PVE and I've yet to figure out a sensible storage setup, etc.): you mention using ZFS datasets on spinning disks, but with cache and ZIL on SSDs. Is that ZFS construction where you keep your VMs? And if so, do you host the datasets on the same physical boxes that are also running PVE with the VMs, or do you have separate storage boxes for your setup?
 
Thanks a lot for the added insight, guletz. Could you please clarify one more thing, as I think I am missing something right now? The topic that Rhinox mentioned (found here) talks about PVE writing a lot to the /etc/pve location. Placing /tmp and /var/log on spinning disks would not fully alleviate the problems, if I'm following along correctly.

Not fully, but it will help a lot.


I use a dedicated dataset for the VMs, which is on the same Proxmox node.
 
/etc/pve is a FUSE filesystem; any writes happening there are actually writes to an SQLite DB in /var/lib/pve-cluster.
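If you are curious, here is a quick sketch (assuming the default DB path, and best tried on a test node) that opens that database read-only and just lists its tables, since the exact schema depends on your PVE version:

Code:
import sqlite3

DB = "/var/lib/pve-cluster/config.db"   # assumed default location on a PVE node

# open strictly read-only so nothing is written to the backing store
con = sqlite3.connect(f"file:{DB}?mode=ro", uri=True)
tables = con.execute(
    "SELECT name FROM sqlite_master WHERE type='table'").fetchall()
print("tables:", [t[0] for t in tables])
con.close()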
 
/etc/pve is a FUSE filesystem; any writes happening there are actually writes to an SQLite DB in /var/lib/pve-cluster.


So we do not use a hammer when we need to put a nail in a wall (a file); we could do a better job if we use a bulldozer (FUSE + a database = a filesystem?) ;)
How bad can a hammer be?
 
So we do not use a hammer when we need to put a nail in a wall (a file); we could do a better job if we use a bulldozer (FUSE + a database = a filesystem?) ;)
How bad can a hammer be?
The point of using a database is to treat each change as a transaction, which can be (safely) synced to disk and over the network to the other cluster nodes, so we can be sure that every node in the cluster has a consistent and safe view of the content.

The FUSE wrapper is there to make it easy to edit and view the configurations.
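As a generic illustration of that transactional behaviour (not pmxcfs's actual schema, just plain sqlite3 in memory with a made-up table): a change either commits fully or is rolled back, so a half-finished write never becomes visible.

Code:
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE config (name TEXT PRIMARY KEY, data TEXT)")

try:
    with con:   # BEGIN ... COMMIT; rolls back automatically on error
        con.execute("INSERT INTO config VALUES "
                    "('qemu-server/100.conf', 'memory: 2048')")
        raise RuntimeError("simulated crash before commit")
except RuntimeError:
    pass

# the half-finished change never became visible
print(con.execute("SELECT count(*) FROM config").fetchone()[0])   # -> 0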
 
This is wrong in my opinion. ZFS is a CoW system, so it does not modify any existing data block on disk.
For the SSD controller it does not matter what filesystem you have. Even if your OS wants to write to some particular sector, the SSD decides on its own where the data will actually be written, for the benefit of wear-leveling.

Moreover, an SSD cannot change content on a per-sector basis, because "erase" on an SSD works on whole blocks, not individual sectors. Even if you only want to write a single sector, if that block is marked as "used", the whole block must be read, erased, and rewritten...
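A little back-of-envelope sketch of that worst case, with purely illustrative sector and erase-block sizes (real drives vary):

Code:
SECTOR      = 512                 # what the OS thinks it writes, in bytes
ERASE_BLOCK = 512 * 1024          # assumed flash erase-block size (512 KiB)

# If the erase block is already marked "used", changing one sector means
# the whole block is read, erased and rewritten:
amplification = ERASE_BLOCK / SECTOR
print(f"worst case: {amplification:.0f}x write amplification "
      f"({ERASE_BLOCK // 1024} KiB rewritten for a {SECTOR} B update)")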
 
The point of using a database is to treat each change as a transaction, which can be (safely) synced to disk and over the network to the other cluster nodes, so we can be sure that every node in the cluster has a consistent and safe view of the content.

The FUSE wrapper is there to make it easy to edit and view the configurations.

I think this is not the best solution. Maybe Consul could be a better one; Consul is used in many cluster systems for orchestration. In my opinion, it is surely better than a DB/FUSE filesystem.
 
For the SSD controller it does not matter what filesystem you have. Even if your OS wants to write to some particular sector, the SSD decides on its own where the data will actually be written, for the benefit of wear-leveling.


This is correct. I missed the SSD firmware/controller layer. My fault.
 
/etc/pve is a FUSE filesystem; any writes happening there are actually writes to an SQLite DB in /var/lib/pve-cluster.
So, to alleviate this problem of PVE constantly writing and killing SSD drives, would one move /etc/pve and /var/lib/pve-cluster to a SATA drive?
 
So, to alleviate this problem of PVE constantly writing and killing SSD drives, would one move /etc/pve and /var/lib/pve-cluster to a SATA drive?

Moving /etc/pve does not make sense; it is a mount point, and anything below it is just provided by FUSE, not actual files on disk. Moving /var/lib/pve-cluster to a spinning disk will prevent pmxcfs from wearing out your SSD, yes.
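For the record, a rough sketch of how such a move could look, assuming /mnt/hdd is the spinning disk's mount point and that you are comfortable stopping the cluster filesystem briefly; this is not an official procedure, so try it on a non-production node first:

Code:
import os, shutil, subprocess

OLD = "/var/lib/pve-cluster"
NEW = "/mnt/hdd/pve-cluster"      # assumed mount point on the SATA/spinning disk

subprocess.run(["systemctl", "stop", "pve-cluster"], check=True)   # stop pmxcfs first

shutil.copytree(OLD, NEW)          # copy config.db (and any journal files)
shutil.move(OLD, OLD + ".bak")     # keep the original until everything checks out
os.symlink(NEW, OLD)               # /var/lib/pve-cluster now points at the HDD

subprocess.run(["systemctl", "start", "pve-cluster"], check=True)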
 
Moving /etc/pve does not make sense; it is a mount point, and anything below it is just provided by FUSE, not actual files on disk. Moving /var/lib/pve-cluster to a spinning disk will prevent pmxcfs from wearing out your SSD, yes.
Thank you.
 
El Tebe, the Supermicro DOMs would be a perfect solution if they don't get killed. I have no idea about their durability. Could you provide some insight into how many nodes you run, their storage setup, how long the DOMs have been running, and what their lifetime indication is?
About the tmpfs remark: do you mean placing just that directory on another storage target that can handle the amount of IO?

Supermicro SATA DOM endurance: 1 drive write per day (DWPD).
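As a quick bit of arithmetic (the 5-year period is my assumption, not a quoted spec), 1 DWPD on a 64 GB DOM works out roughly like this:

Code:
CAPACITY_GB = 64
DWPD        = 1
YEARS       = 5                       # assumed endurance/warranty period

tbw = CAPACITY_GB * DWPD * 365 * YEARS / 1000
print(f"~{tbw:.0f} TB of total writes budgeted "
      f"({CAPACITY_GB * DWPD} GB/day for {YEARS} years)")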
 
I think this is not the best solution. Maybe Consul could be a better one; Consul is used in many cluster systems for orchestration. In my opinion, it is surely better than a DB/FUSE filesystem.

Agree
Consul is made exactly for this purpose
 
