Proxmox VE management GUI doing a fair amount of writing to disk

isus

New Member
May 19, 2021
We have been testing PVE on a couple of servers for a couple of weeks now. It's a great product. Nevertheless, there is something that has been greatly puzzling us.

The two servers exhibit exactly the same behavior. Specs of both servers: EPYC 7702P, 256GB RAM, three mirrored SAS SSDs (a ZFS mirror created by the Proxmox installer itself), Proxmox VE 6.4-1. The two servers are being tested independently of each other (no clustering, that is). The installation and configuration are completely standard in both cases, except for two things:

1- We disabled pve-ha-crm, pve-ha-lrm, pvesr and corosync.
2- We disabled RRDcached journaling, and increased WRITE_TIMEOUT and FLUSH_TIMEOUT.
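For reference, points 1 and 2 boiled down to commands and config along these lines (service names as on our PVE 6.4 installs; the rrdcached timeout values here are illustrative, not the exact ones we used):

```shell
# 1- disable the HA, replication, and cluster services (single-node test boxes)
systemctl disable --now pve-ha-crm pve-ha-lrm corosync
systemctl disable --now pvesr.timer

# 2- in /etc/default/rrdcached: disable journaling and batch writes harder
#    (illustrative values; larger timeouts mean fewer, larger flushes)
#      WRITE_TIMEOUT=3600
#      FLUSH_TIMEOUT=7200
#      JOURNAL_PATH=        # empty/commented out = journaling off
systemctl restart rrdcached
```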

If all VMs and containers are stopped, and there is no network traffic whatsoever to and from any of the interfaces (other than management related), and we are not connected to the management GUI (https://a.b.c.d:8006), either server writes about 10MB/hour (ten megabytes per hour) to disk. Mostly logs, we assume. Pretty reasonable.

However, under exactly the same conditions, as soon as we connect (open a browser session) to the management port and do nothing other than stare at the browser window, either server starts writing puzzling amounts of unknown (to us) data to disk at a rate of about 400MB/hour (400 megabytes per hour). The instant we disconnect (close the browser session) from port 8006, both servers revert to their previous behavior and start writing only about 10MB/hour. That's a more than considerable difference.

We have used all sorts of Unix tools to try to find out what is going on, but to no avail. What could be causing the servers to write 40 times as much data to disk as soon as a connection to management port 8006 is established, even though they are doing nothing but chug along in the most minimal of states? Can that behavior be modified?

Thanks.
 
the only thing an open GUI session does is poll certain status API endpoints, which leads to additional logging (API access log mostly). you haven't described where you get your I/O measurements from, so it's a bit hard to tell whether anything else is going on (like write amplification).

~100 KB/s is not that much though (and most likely dwarfed by whatever your guests will be doing); my mostly idle workstation averages around 650 KB/s as measured by iostat on the block device ;)
 
Thanks for your response, Fabian.

We got our data the same way as you, via iostat. We did many runs on two different servers and invariably got the same results. In fact, all the writing happens at exactly 5-second intervals, which is rather strange to us. If it were just regular logging, one would expect the data to be written at irregular times, not exactly every 5 seconds. All that periodic writing smells of something other than mere logging to us.
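Roughly how we measure, for anyone wanting to reproduce it (a sketch, not our exact invocation; the device name is a placeholder):

```shell
# per-device throughput once a second; on ZFS you should see a write
# burst roughly every 5 seconds (one transaction group being synced)
iostat -d -k 1

# or read the cumulative counter directly: field 10 of /proc/diskstats
# is sectors written, so multiply by 512 for bytes
awk '$3 == "sda" { print $10 * 512, "bytes written since boot" }' /proc/diskstats
```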

The thing is that 400 MB/hour (9.6 GB per day) is indeed a lot for servers that are doing nothing at all (other than keeping a connection to port 8006 open) while writing all that data. It really is puzzling to us what PVE might be doing that requires all that writing just because one starts (and keeps open) a management session on port 8006. It is also strange that all the writing stops the very moment we close the connection to port 8006 (by simply closing the browser window). What information must be written to disk only while one is connected to port 8006? As I mentioned in my previous post, if one is not connected to the management GUI, the writing shrinks to a mere, reasonable, and very healthy 10 MB/hour (240 MB per day).

The thing is that the kind of solid-state drives we use (high-end enterprise) cost a pretty penny per GB. Every byte written counts. We always keep writing to a strict minimum on all of our servers, a practice that has given us excellent results over several decades of looking after many and sundry servers. Even hard drives live longer the less you write to them; that is why the good ones are always rated for a specific yearly write volume. We learned that the hard way many years ago.
 
Thanks for your response, Fabian.

We got our data the same way as you, via iostat. We did many runs on two different servers and invariably got the same results. In fact, all the writing happens at exactly 5-second intervals, which is rather strange to us. If it were just regular logging, one would expect the data to be written at irregular times, not exactly every 5 seconds. All that periodic writing smells of something other than mere logging to us.

that's probably just ZFS syncing out the async writes (the default maximum time for a ZFS transaction group is exactly 5 seconds). it is possible to adjust this (at the risk of losing more of that not-yet-written-out data in case of a crash/power loss/..).
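Concretely, for anyone who wants to look at or change this (a standard OpenZFS module parameter; raising it means up to that many seconds of async writes can be lost on a crash or power failure):

```shell
# current transaction group timeout, in seconds (default: 5)
cat /sys/module/zfs/parameters/zfs_txg_timeout

# raise it at runtime (reverts on reboot)
echo 30 > /sys/module/zfs/parameters/zfs_txg_timeout

# make it persistent across reboots
echo "options zfs zfs_txg_timeout=30" >> /etc/modprobe.d/zfs.conf
```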

The thing is that 400 MB/hour (9.6 GB per day) is indeed a lot for servers that are doing nothing at all (other than keeping a connection to port 8006 open) while writing all that data. It really is puzzling to us what PVE might be doing that requires all that writing just because one starts (and keeps open) a management session on port 8006. It is also strange that all the writing stops the very moment we close the connection to port 8006 (by simply closing the browser window). What information must be written to disk only while one is connected to port 8006? As I mentioned in my previous post, if one is not connected to the management GUI, the writing shrinks to a mere, reasonable, and very healthy 10 MB/hour (240 MB per day).

the thing is - while you are connected to the GUI the server is "not doing nothing" - it gets queried and responds every X seconds with various status information ;) at least the /cluster/resources and /cluster/tasks endpoints every 3 seconds, potentially more if you have specific views like the dashboard, guest details, .. open.
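For what it's worth, those two endpoints can be queried from the node's shell as well, to see exactly what the GUI keeps fetching:

```shell
# the same status calls the GUI issues every few seconds while it is open
pvesh get /cluster/resources
pvesh get /cluster/tasks
```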

The thing is that the kind of solid-state drives we use (high-end enterprise) cost a pretty penny per GB. Every byte written counts. We always keep writing to a strict minimum on all of our servers, a practice that has given us excellent results over several decades of looking after many and sundry servers. Even hard drives live longer the less you write to them; that is why the good ones are always rated for a specific yearly write volume. We learned that the hard way many years ago.
any enterprise SSD should handle at least 1 DWPD for > 3 years, so writing 10GB/day is just a tiny drop in the bucket (e.g., even on the lowest end with 500GB and only 1 DWPD for 3 years, that's less than 2% of your wearout budget over the whole expected lifetime, and that is only with the GUI open 24/7). but now you know why we strongly advise people not to use consumer or prosumer SSDs with PVE and ZFS..
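The 2% figure checks out; spelled out as arithmetic (drive size, DWPD rating, and write rate are the numbers from this thread):

```shell
# 500 GB drive rated 1 DWPD for 3 years vs. 10 GB/day of GUI-induced writes
awk 'BEGIN {
  budget = 500 * 365 * 3   # total rated writes over the warranty, in GB
  writes = 10  * 365 * 3   # 10 GB/day, GUI open 24/7, for the same 3 years
  printf "%.1f%% of endurance budget\n", 100 * writes / budget
}'
# prints: 2.0% of endurance budget
```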
 
that's probably just ZFS syncing out the async writes (the default maximum time for a ZFS transaction group is exactly 5 seconds). it is possible to adjust this (at the risk of losing more of that not-yet-written-out data in case of a crash/power loss/..).

We know that. The problem is that there would not be any syncing every 5 seconds unless there were data that needed to be written out to disk every 5 seconds. We just find it unusual that, if it were mere logging, the syncing never happens after, say, 10 or 15 seconds instead. But, obviously, there is a lot more than logging going on.

the thing is - while you are connected to the GUI the server is "not doing nothing" - it gets queried and responds every X seconds with various status information ;) at least the /cluster/resources and /cluster/tasks endpoints every 3 seconds, potentially more if you have specific views like the dashboard, guest details, .. open.

Fair enough, but, as I mentioned in my first post, we disabled all the clustering (including all related services) and, while connected to the management GUI, we do not need to be looking at any specific part of the dashboard for all the considerable writing to occur. That is, we are not looking at the "summary" pages or anything like that. In fact, as soon as we authenticate and log into the dashboard, PVE goes into its mysterious data-writing fit. That is, we do not need to go beyond the "No valid subscription" window (no further than that, allow me to emphasize) for the writing fit to start. Forgive me for insisting, but there is nothing too useful about writing whatever PVE is writing while all VMs and containers are fully stopped, there is no network activity at all, and we, the only PVE administrators around, are doing nothing other than staring at a little window that says "No valid subscription".

any enterprise SSD should handle at least 1 DWPD for > 3 years, so writing 10GB/day is just a tiny drop in the bucket (e.g., even on the lowest end with 500GB and only 1 DWPD for 3 years, that's less than 2% of your wearout budget over the whole expected lifetime, and that is only with the GUI open 24/7). but now you know why we strongly advise people not to use consumer or prosumer SSDs with PVE and ZFS..

In our line of business (as in all profitable lines of business, I would imagine) resources are lovingly looked after and used in a fully productive manner. If you do not do that, your company is out of business and you are out of a job. As I mentioned before, every byte, just like every penny, counts. Is there any way for us to modify the described behavior (and considerably reduce all the writing) on our test PVE servers?
 
Fair enough, but, as I mentioned in my first post, we disabled all the clustering (including all related services) and, while connected to the management GUI, we do not need to be looking at any specific part of the dashboard for all the considerable writing to occur. That is, we are not looking at the "summary" pages or anything like that. In fact, as soon as we authenticate and log into the dashboard, PVE goes into its mysterious data-writing fit. That is, we do not need to go beyond the "No valid subscription" window (no further than that, allow me to emphasize) for the writing fit to start. Forgive me for insisting, but there is nothing too useful about writing whatever PVE is writing while all VMs and containers are fully stopped, there is no network activity at all, and we, the only PVE administrators around, are doing nothing other than staring at a little window that says "No valid subscription".

like I said - if you open the GUI, two API calls WILL be made every 3s: /cluster/resources and /cluster/tasks . those are needed for refreshing the tree on the left side (which guests/storages/pools/nodes/.. exist, and what is their last known status) and the task information on the bottom panel. the API calls are fairly cheap (else we wouldn't make them that often obviously). opening a dashboard/summary/... panel might cause additional API calls to be made (once, or periodically, e.g., if you open a system log panel, or task log viewer for an ongoing task, or ...), those might not be as cheap but are only done when triggered by user action.

In our line of business (as in all profitable lines of business, I would imagine) resources are lovingly looked after and used in a fully productive manner. If you do not do that, your company is out of business and you are out of a job. As I mentioned before, every byte, just like every penny, counts. Is there any way for us to modify the described behavior (and considerably reduce all the writing) on our test PVE servers?

yes, being diligent about resource usage is a good thing. any kind of production usage will absolutely dwarf anything we are talking about here though, so there is a point where adding additional complexity (or reducing consistency guarantees, debugging options, ...) is not worth it.

things you can do:
- add additional monitoring to find out which files exactly cause the IO (likely candidates: the pmxcfs DB, pveproxy access log)
- move things like logs into RAM only (or in some kind of RAM-but-synced-every-hour thing, to have some level of persistency)
- turn on relatime to reduce metadata writes
- increase the txg timeout to coalesce more writes to reduce write-amplification

I wouldn't recommend any of that except maybe relatime on a production system though.
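A sketch of the first three suggestions (dataset and mount names are the PVE installer defaults to my understanding; verify on your system, and note tmpfs contents vanish on reboot):

```shell
# 1. watch which files are actually being written (needs fatrace installed;
#    run for a minute with the GUI open, -f W filters to write events)
fatrace -f W

# 2. relatime on the root dataset cuts access-time-only metadata writes
zfs set relatime=on rpool/ROOT/pve-1

# 3. example of RAM-only logs: keep the pveproxy access log on tmpfs,
#    via an /etc/fstab entry such as:
#      tmpfs /var/log/pveproxy tmpfs defaults,size=64m,mode=0755 0 0
```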
 
That is quite unfortunate, then, because every little written byte does count, indeed. The fact of the matter (your boundless faith in enterprise SSD technology notwithstanding) is that even top-notch enterprise SSDs fail and, when they do, things can get very nasty. One can only hope that the Proxmox developers come to realize, sooner rather than later, how much of a problem this sort of thing actually is. I am sure nobody wants Proxmox to earn the reputation of an "SSD killer".

On to testing XCP-ng.

Thanks, anyway.
 
That is quite unfortunate, then, because every little written byte does count, indeed. The fact of the matter (your boundless faith in enterprise SSD technology notwithstanding) is that even top-notch enterprise SSDs fail and, when they do, things can get very nasty. One can only hope that the Proxmox developers come to realize, sooner rather than later, how much of a problem this sort of thing actually is. I am sure nobody wants Proxmox to earn the reputation of an "SSD killer".

On to testing XCP-ng.

Thanks, anyway.
I wonder if this is exclusive to Proxmox, or whether you're going to see the same or similar results with other hypervisor systems whilst the GUI is open. I'd be interested to hear whether your other experiments show increased disk writes too.
 
the only thing an open GUI session does is poll certain status API endpoints, which leads to additional logging (API access log mostly)

I agree it is a drop in the bucket but to not turn this into a philosophical discussion on SSD longevity maybe one could boil it down to this: Can you disable that "additional logging (API access log mostly)"? If possible without breaking the system (and at their own risk) that seems to make everyone happy, no? :)
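The kind of unsupported workaround I mean would be something like the following (path from a standard PVE install; pveproxy or logrotate may well recreate the file, so this is fragile and entirely at your own risk):

```shell
# discard pveproxy's access log instead of persisting it to disk
systemctl stop pveproxy
ln -sf /dev/null /var/log/pveproxy/access.log
systemctl start pveproxy
```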
 
