"pmxcfs" writing to disk all the time...

Rhinox · Jul 20, 2017

I noticed this: when I start PVE 5.0, the process "pmxcfs" keeps writing something to disk all the time.

Every 3-4 seconds there is a spike in write ops. This goes on and on forever.

According to wiki, "The Proxmox Cluster file system (“pmxcfs”) is a database-driven file system for storing configuration files, replicated in real time to all cluster nodes using corosync. We use this to store all PVE related configuration files".

This is a standalone host with just local storage (no cluster). No VM and no LXC container is running. No config file is being edited. The whole host is basically idle, except for the steady stream of writes caused by the pmxcfs process. Why???

This is probably one reason why consumer-SSD gets eaten up so quickly, and if installed on usb-stick, it gets destroyed within a few days (this happened to me some time ago even with industry-level SLC-based usb-stick!)....
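To put a number on an observation like this, one option is to sample the kernel's per-process I/O counters. The following is a minimal sketch, assuming a Linux host where /proc/<pid>/io is readable (usually only as root). The function names read_write_bytes and writes_during are made up for this example and are not part of PVE.

```shell
# Print the cumulative "write_bytes" counter from a /proc/<pid>/io style file.
# The exact match on the first field skips "cancelled_write_bytes:".
read_write_bytes() {
    awk '$1 == "write_bytes:" { print $2 }' "$1"
}

# Print how many bytes were added to the counter during an interval.
# $1 = path to the io file (e.g. /proc/1234/io), $2 = interval in seconds
writes_during() {
    before=$(read_write_bytes "$1")
    sleep "$2"
    after=$(read_write_bytes "$1")
    echo $((after - before))
}

# Example call on a PVE host (not executed here):
#   writes_during "/proc/$(pidof pmxcfs)/io" 10
```

The counter reflects bytes the process caused to be sent to the storage layer, so it says nothing about what the SSD does internally with those writes.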

dcsapak · Jul 21, 2017

even on an empty standalone node, the pve-ha-lrm daemon periodically writes its status into /etc/pve (this is not optimal and is on our improvement list)
but it is a rather small write

Rhinox said:
This is probably one reason why consumer-SSD gets eaten up so quickly, and if installed on usb-stick, it gets destroyed within a few days (this happened to me some time ago even with industry-level SLC-based usb-stick!)....

your observation was right (pmxcfs writes periodically), but your conclusion is wrong
on a host with running vms, there will be more written by collecting the rrd stats, the guest file system (when on the same disk), the logs from systemd, kernel etc. so this is not the fault of the pmxcfs

Rhinox · Jul 21, 2017

There is no problem with writing status (if it is necessary), but doing it every 3-4 seconds seems to me to be overkill. Especially if host is configured as standalone (I suppose that "ha" has something to do with high availability; can I disable pve-ha-lrm/pve-ha-crm on stand-alone host?).

Concerning my conclusion: I tried to start single VM, waited to boot and "settle" a little, and let it run idle. And guess what? Surprisingly, most of the writes were again those of "pmxcfs" (but I'm collecting logs on different box). As you confirmed, those are small writes, but short-periodic.

This is actually worse for an SSD than writing big data, because even when writing a few bytes the "page" (size can be anything between 2kB and 16kB) is marked as used and the next write must go to the next free page. And what's even worse, "pages" cannot be modified individually, only in "blocks" (128 or 256 "pages"). So once every "page" has been used at least once, all subsequent writes must be done using "read-modify-write" for the whole block. "pmxcfs" might send just a few bytes, but ultimately a few MB must be written. That's a terribly high "write amplification factor"! And because zfs (on linux) does not support "trim", it reaches this scenario very quickly.

I still think exactly this (a lot of small writes) is the reason why the TBW counter on consumer SSDs is spinning so quickly. Previously I used an 850/pro, and ~25% of its TBW was expended in just a few months (btw, I did not see this with other hypervisors I tested). I'm using an industry-level SSD now so it does not bother me so much, but I still think this issue is worth investigating. When you check this forum, there are a few topics with users complaining that the TBW of their SSDs is being used up very quickly (especially with PVE 5.x)...
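The "few bytes become a few MB" figure can be checked with simple arithmetic. This is a minimal sketch of the worst case under the model described in this post, where every small write forces a rewrite of one whole block. The 100-byte payload is a hypothetical value chosen for illustration, and worst_case_waf is a made-up helper name. Real SSD controllers remap and batch writes, so the result is an upper bound for this model, not a measurement.

```shell
# Worst-case write amplification when each small write rewrites a full block.
# $1 = page size in bytes, $2 = pages per block, $3 = payload bytes per write
worst_case_waf() {
    block_bytes=$(( $1 * $2 ))
    echo $(( block_bytes / $3 ))
}

# Largest values quoted in the post: 16 kB pages, 256 pages per block.
block=$(( 16384 * 256 ))
echo "block size: $block bytes"                                # 4194304 bytes (4 MiB)
echo "amplification: $(worst_case_waf 16384 256 100)x"         # 41943x for a 100-byte write
```

With the smallest values from the post (2 kB pages, 128 pages per block) the same 100-byte write would give 2621x, so the model is very sensitive to the assumed geometry.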

dcsapak · Jul 21, 2017

Rhinox said:
There is no problem with writing status (if it is necessary), but doing it every 3-4 seconds seems to me to be overkill. Especially if host is configured as standalone (I suppose that "ha" has something to do with high availability; can I disable pve-ha-lrm/pve-ha-crm on stand-alone host?).

yes, if you do not enable ha you can disable the pve-ha-lrm/pve-ha-crm

Rhinox said:
This is actually worse for an SSD than writing big data, because even when writing a few bytes the "page" (size can be anything between 2kB and 16kB) is marked as used and the next write must go to the next free page. And what's even worse, "pages" cannot be modified individually, only in "blocks" (128 or 256 "pages"). So once every "page" has been used at least once, all subsequent writes must be done using "read-modify-write" for the whole block. "pmxcfs" might send just a few bytes, but ultimately a few MB must be written. That's a terribly high "write amplification factor"! And because zfs (on linux) does not support "trim", it reaches this scenario very quickly.

I still think exactly this (a lot of small writes) is the reason why the TBW counter on consumer SSDs is spinning so quickly. Previously I used an 850/pro, and ~25% of its TBW was expended in just a few months (btw, I did not see this with other hypervisors I tested). I'm using an industry-level SSD now so it does not bother me so much, but I still think this issue is worth investigating. When you check this forum, there are a few topics with users complaining that the TBW of their SSDs is being used up very quickly (especially with PVE 5.x)...

this depends very much on the ssd, and even then, every other part of the system (rrd logging, logs from kernel etc) does the same thing, especially on zfs
i have here for example a crucial mx200 as root disk, and after 1.5 years the "percent lifetime used" (crucial wearout indicator) stands at 1%

Rhinox · Jul 21, 2017

Yes, I can agree. It just seems in this case that those other parts of the system do it less frequently. And a log server can be configured not to write every message separately, but to fill a predefined buffer first (accepting the risk of losing some logs in case of sudden system failure)...

I'll try to disable pve-ha-lrm/pve-ha-crm and see if it changes anything. Thanks for the reply.

Rhinox · Jul 21, 2017

A few more observations:

1. It is not possible to disable a service from the web interface, only stop it (or start/restart). That means it is running again after a reboot. Would it be possible to add "disable" to the "system" tab (in addition to "start", "stop", "restart" that are already there)?

2. I logged to console and used "systemctl" to stop & disable pve-ha-lrm/pve-ha-crm. Now the process "pmxcfs" is not writing to disk all the time. Idle pve-host is really idle, there is no i/o activity.

I tried to stop "pve-cluster" too, but that is probably not a good idea even on a standalone host, as syslog immediately started writing error messages at the same rate as "pmxcfs" did previously...

dcsapak · Jul 24, 2017

Rhinox said:
I tried to stop "pve-cluster" too, but that is probably not a good idea even on a standalone host, as syslog immediately started writing error messages at the same rate as "pmxcfs" did previously...

yes, you should not do that, pve-cluster is the service which starts pmxcfs, and we need it for many things

guletz · Aug 14, 2017

dcsapak said:
even on an empty standalone node, the pve-ha-lrm daemon periodically writes its status into /etc/pve (this is not optimal and is on our improvement list)
but it is a rather small write

This is a mistake (in my own opinion): /etc should contain only static configuration files, not status info. The place for that is /tmp or /var/tmp. pmxcfs is the first case of this that I have seen (as far as I know/remember) in more than 10 years of Linux.

Maybe a symlink could fix this? Or a ram tmp ?

fabian · Aug 14, 2017

guletz said:
This is a mistake (in my own opinion): /etc should contain only static configuration files, not status info. The place for that is /tmp or /var/tmp. pmxcfs is the first case of this that I have seen (as far as I know/remember) in more than 10 years of Linux.

Maybe a symlink could fix this? Or a ram tmp ?

/etc/pve is not a real directory stored on disk, it is a FUSE filesystem backed by an sqlite DB stored in /var/lib/pve-cluster
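One way to see this on a running host is to look up the filesystem type of the mount point. This is a minimal sketch, assuming the usual /proc/mounts layout (device, mount point, type, options). The helper name fstype_of is made up for this example, and the optional second argument exists only so the function can be tried against a sample file.

```shell
# Print the filesystem type of a mount point from a /proc/mounts style file.
# $1 = mount point, $2 = mounts file (defaults to /proc/mounts)
# If the mount point is listed more than once, the last entry wins.
fstype_of() {
    awk -v mp="$1" '$2 == mp { t = $3 } END { if (t != "") print t }' "${2:-/proc/mounts}"
}

# Example calls on a PVE host (not executed here):
#   fstype_of /etc/pve            # a FUSE mount reports a type starting with "fuse"
#   ls -l /var/lib/pve-cluster/   # the directory holding the backing DB
```

If the function prints nothing, the path is not a mount point at all, which would mean pmxcfs is not running.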

guletz · Aug 14, 2017

Very nice

Rhinox · Aug 14, 2017

fabian said:
/etc/pve is not a real directory stored on disk, it is a FUSE filesystem backed by an sqlite DB stored in /var/lib/pve-cluster

Be that as it may, it does not conform to FHS or LSB. Status is a kind of log message, so it should be written somewhere in /var/log. So that db-based fuse should be mounted somewhere in /var/log...

fabian · Aug 16, 2017

Rhinox said:
Be that as it may, it does not conform to FHS or LSB. Status is a kind of log message, so it should be written somewhere in /var/log. So that db-based fuse should be mounted somewhere in /var/log...

no. status != log, and status in this case is akin to state anyway. pmxcfs is primarily used for synchronized configuration and state storage. which means that /var/lib is the right place to store the (non-user accessible) DB:

5.8. /var/lib : Variable state information

State information is generally used to preserve the condition of an application (or a group of inter-related applications) between invocations and between different instances of the same application. State information should generally remain valid after a reboot, should not be logging output, and should not be spooled data.

An application (or a group of inter-related applications) must use a subdirectory of /var/lib for its data. There is one required subdirectory, /var/lib/misc, which is intended for state files that don't need a subdirectory; the other subdirectories should only be present if the application in question is included in the distribution.

/var/lib/<name> is the location that must be used for all distribution packaging support. Different distributions may use different names, of course.

and /etc/pve is the right place to expose it to the user, because the primary interaction with the mounted pmxcfs (by the user) is for configuration purposes. there is some extra information exposed via the mounted FUSE file system that is actually not backed by the DB, but transparently generated, but overall the existing scheme is by far the best fit in the standard file system hierarchy.

tuonoazzurro · Jul 26, 2018

So. In 5.2 this problem is still there. Is there any solution other than disabling pve-ha-lrm/pve-ha-crm?

Temtaime · Jan 23, 2020

Still no solution?
My standalone proxmox installation writes 20G every day when idle.

Also is a path to rrdcached directory hardcoded?
I moved /var/lib/rrdcached to another disk and changed paths in /etc/default/rrdcached appropriately, but got errors:
Jan 23 21:19:42 pve pmxcfs[2853]: [status] notice: RRD update error /var/lib/rrdcached/db/pve2-vm/113: opening '/var/lib/rrdcached/db/pve2-vm/113': No such file or directory

emox · Jan 31, 2020

Temtaime said:
Still no solution?
My standalone proxmox installation writes 20G every day when idle.

Also is a path to rrdcached directory hardcoded?
I moved /var/lib/rrdcached to another disk and changed paths in /etc/default/rrdcached appropriately, but got errors:
Jan 23 21:19:42 pve pmxcfs[2853]: [status] notice: RRD update error /var/lib/rrdcached/db/pve2-vm/113: opening '/var/lib/rrdcached/db/pve2-vm/113': No such file or directory

I just posted this - maybe it will be useful:
https://forum.proxmox.com/threads/reducing-rrdcached-writes.64473/

20G/day seems like a lot for just rrd data!
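For a sense of scale, the daily figure can be converted to an average rate. This is a minimal sketch that assumes "20G" means 20 GiB and that the writes are spread evenly over the day, neither of which the post states. avg_rate is a made-up helper name.

```shell
# Average write rate in bytes per second (integer division).
# $1 = total bytes written, $2 = period in seconds
avg_rate() {
    echo $(( $1 / $2 ))
}

day=$(( 24 * 60 * 60 ))                  # 86400 seconds
total=$(( 20 * 1024 * 1024 * 1024 ))     # assumption: "20G" = 20 GiB
echo "$(avg_rate "$total" "$day") bytes/s"   # 248551 bytes/s, about 243 KiB/s
```

That is a sustained average of roughly a quarter of a megabyte per second on an idle host.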

justr23 · Feb 6, 2021

Is there any way to prevent these two services from being started? Otherwise one has to stop them every time the system starts. The two services don't seem to be started via init.d.

ph0x · Feb 7, 2021

They're handled by systemd. You disable them by typing:

Bash:

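# stop the running HA services (only on nodes that do not use HA)
systemctl stop pve-ha-lrm.service
systemctl stop pve-ha-crm.service
# keep them from being started again at boot
systemctl disable pve-ha-lrm.service
systemctl disable pve-ha-crm.service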

ultima · May 14, 2021

Temtaime said:
Still no solution?
My standalone proxmox installation writes 20G every day when idle.

Also is a path to rrdcached directory hardcoded?
I moved /var/lib/rrdcached to another disk and changed paths in /etc/default/rrdcached appropriately, but got errors:
Jan 23 21:19:42 pve pmxcfs[2853]: [status] notice: RRD update error /var/lib/rrdcached/db/pve2-vm/113: opening '/var/lib/rrdcached/db/pve2-vm/113': No such file or directory

Did you ever succeed in redirecting rrdcached? Did you restart the service/reboot before you received the error message?

Temtaime · May 14, 2021

ultima said:
Did you ever succeed in redirecting rrdcached? Did you restart the service/reboot before you received the error message?

Yes, I did. I don't know exactly what I did, but it worked. In the end, though, I didn't want to fight windmills and decided to install Proxmox on a dedicated SSD.

ultima · May 14, 2021

Thanks. Does the "yes" refer to the first part of my question or the second?
