pmxcfs lock-up causing I/O issues

jmatkinson

New Member
Nov 12, 2022
Hi all. I've been using Proxmox successfully for about a year and a half now. I have two boxes, and I use VMs, containers, and general services like Samba on both. They are not in an HA cluster; they operate stand-alone. I also do not use the subscription repositories. However, I have a semi-recurring issue that tends to lock up my entire server, forcing a restart.

One of the services I run is a Plex server, which runs in a VM. There seems to be a very strong correlation between watching movies on this Plex server and this issue occurring. It does not always happen when watching movies on Plex, but when the issue does occur, it almost always starts while a movie is playing.

The issue is that pmxcfs seems to lock up. Back in August, this was visible in syslog (about 10-30 minutes into my movie) as:
Code:
Aug 23 21:38:17 mir pmxcfs[3787]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-node/mir: -1
Aug 23 21:38:17 mir pmxcfs[3787]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/102: -1
Aug 23 21:38:17 mir pmxcfs[3787]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/108: -1
Aug 23 21:38:17 mir pmxcfs[3787]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/100: -1
Aug 23 21:38:17 mir pmxcfs[3787]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/107: -1
Aug 23 21:38:17 mir pmxcfs[3787]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/101: -1
Aug 23 21:38:17 mir pmxcfs[3787]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/110: -1
Aug 23 21:38:17 mir pmxcfs[3787]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/106: -1
Aug 23 21:38:17 mir pmxcfs[3787]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/105: -1
Aug 23 21:38:17 mir pmxcfs[3787]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/104: -1
Aug 23 21:38:17 mir pmxcfs[3787]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/103: -1
Aug 23 21:38:17 mir pmxcfs[3787]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/mir/local-zfs: -1
Aug 23 21:38:17 mir pmxcfs[3787]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/mir/local: -1
Aug 23 21:38:22 mir pve-ha-crm[3960]: loop take too long (78 seconds)
Aug 23 21:38:22 mir pve-ha-lrm[3969]: loop take too long (84 seconds)
Code:
Aug 23 22:06:19 mir pvescheduler[1013120]: jobs: cfs-lock 'file-jobs_cfg' error: got lock request timeout

And this first line seems to be the critical error:
Code:
Aug 23 22:12:07 mir pmxcfs[3787]: [database] crit: commit transaction failed: database or disk is full#010
Aug 23 22:12:10 mir pmxcfs[3787]: [database] crit: rollback transaction failed: cannot rollback - no transaction is active#010
Aug 23 22:12:10 mir pve-ha-lrm[3969]: unable to write lrm status file - unable to delete old temp file: Input/output error
Aug 23 22:12:23 mir pve-ha-lrm[3969]: unable to write lrm status file - unable to delete old temp file: Input/output error
Aug 23 22:12:23 mir pvescheduler[1127531]: jobs: cfs-lock 'file-jobs_cfg' error: got lock request timeout
Aug 23 22:12:23 mir pve-ha-lrm[3969]: unable to write lrm status file - unable to delete old temp file: Input/output error
Aug 23 22:12:23 mir pvescheduler[1127530]: replication: cfs-lock 'file-replication_cfg' error: got lock request timeout
Aug 23 22:12:23 mir pve-ha-lrm[3969]: unable to write lrm status file - unable to delete old temp file: Input/output error
The lrm status file lines continued until I resolved it the next morning. They are also interspersed with:
Code:
Aug 23 23:25:13 mir pvescheduler[1781478]: jobs: cfs-lock 'file-jobs_cfg' error: got lock request timeout
Aug 23 23:25:13 mir pvescheduler[1781477]: replication: cfs-lock 'file-replication_cfg' error: got lock request timeout
Restarting also resolves the issue.

I had this issue again in the last few days:

Code:
Nov 10 18:34:00 mir pmxcfs[3264]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-node/mir: -1
Nov 10 18:34:00 mir pmxcfs[3264]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/108: -1
Nov 10 18:34:00 mir pmxcfs[3264]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/100: -1
Nov 10 18:34:00 mir pmxcfs[3264]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/107: -1
Nov 10 18:34:00 mir pmxcfs[3264]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/101: -1
Nov 10 18:34:00 mir pmxcfs[3264]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/102: -1
Nov 10 18:34:00 mir pmxcfs[3264]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/110: -1
Nov 10 18:34:00 mir pmxcfs[3264]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/105: -1
Nov 10 18:34:00 mir pmxcfs[3264]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/106: -1
Nov 10 18:34:00 mir pmxcfs[3264]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/104: -1
Nov 10 18:34:00 mir pmxcfs[3264]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/103: -1
Nov 10 18:34:00 mir pmxcfs[3264]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/mir/local-zfs: -1
Nov 10 18:34:00 mir pmxcfs[3264]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/mir/local: -1
Nov 10 18:34:17 mir pvescheduler[668614]: jobs: cfs-lock 'file-jobs_cfg' error: got lock request timeout

Code:
Nov 10 19:07:37 mir pmxcfs[3264]: [database] crit: commit transaction failed: database or disk is full#010
Nov 10 19:07:37 mir pmxcfs[3264]: [database] crit: rollback transaction failed: cannot rollback - no transaction is active#010

Code:
Nov 12 14:56:39 mir pve-ha-lrm[3384]: unable to write lrm status file - unable to open file '/etc/pve/nodes/mir/lrm_status.tmp.3384' - Input/output error
Nov 12 14:56:44 mir pve-ha-lrm[3384]: unable to write lrm status file - unable to open file '/etc/pve/nodes/mir/lrm_status.tmp.3384' - Input/output error
Nov 12 14:56:48 mir pvestatd[3325]: authkey rotation error: cfs-lock 'authkey' error: got lock request timeout
Nov 12 14:56:48 mir pvestatd[3325]: status update time (9.078 seconds)

For information's sake:
Code:
Linux mir 5.15.53-1-pve #1 SMP PVE 5.15.53-1 (Fri, 26 Aug 2022 16:53:52 +0200) x86_64 GNU/Linux

Current uptime 46 days. Happy to provide any other useful info.

The impact of this is severe. It prevents the web UI from working unless the workaround in this thread is applied, and, most concerningly, it sometimes pauses all VMs (though not containers). When paused, the VMs show up in the UI with a yellow triangle saying "IO error", which appears to be why they are paused. Sometimes the VMs don't pause, but network traffic to them is interrupted and I cannot access them until I restart.

sudo qm resume <vmid> is required to get the VMs going again, and only after the actions in the above thread are taken.
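For what it's worth, this is roughly the loop I've resorted to once those steps are done and several VMs are paused at once. It's only a rough sketch (run as root or prefix with sudo), and I believe qm status --verbose reports the paused state as qmpstatus:
Code:
# rough sketch: resume every VM that QEMU reports as paused
for vmid in $(qm list | awk 'NR>1 {print $1}'); do
    if qm status "$vmid" --verbose | grep -q 'qmpstatus: paused'; then
        qm resume "$vmid"
    fi
done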

This thread has similar symptoms, but that user seems to have a genuine disk issue, so I don't think it's quite the same:
https://forum.proxmox.com/threads/cant-unlock-vm.68335/

The high-level consensus from what I have found is that the pmxcfs database is overflowing, but there seems to be no way to predict this or reliably avoid it other than scheduling a restart every so often, which I would rather not do.

I have confirmed we haven't run out of root disk space yet. I can still access SSH and files over Samba; it just seems to be the web console and some VMs that are affected.
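For the record, this is roughly how I've been checking that (assuming the default paths of a standard install):
Code:
# free space on the root filesystem and on the pmxcfs mount
df -h / /etc/pve
# size of the pmxcfs backing database (default location, as far as I know)
ls -lh /var/lib/pve-cluster/config.db*
# ZFS view of the root dataset
zfs list -o name,used,avail rpool/ROOT/pve-1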

This issue seems to happen randomly, but consistently while watching movies with Plex. It's not frequent, but it's totally derailing when it does occur.

Is there anything that can be done?
 
Code:
Nov 10 19:07:37 mir pmxcfs[3264]: [database] crit: commit transaction failed: database or disk is full#010
I have confirmed we haven't run out of root disk space yet.
This issue seems to happen randomly, but consistently while watching movies with Plex.

I am no expert, therefore maybe a wild theory:

Is the Plex (root) vdisk thin-provisioned and on the same disk as your PVE host's root?
If yes, would the total free disk space be exceeded if the Plex (root) vdisk filled up?
If that is also yes, and we assume that your physical disk has no problems, the first thing that came to my mind was that maybe the disk fills up with the temporary transcoding files from Plex?

Transcoder temporary directory
Directory to use when transcoding temporary files for streaming. This is useful if your primary drive has limited space. The directory used (whether default or not) needs sufficient free space, roughly equal to the size of the source file of the transcode plus 100MB.

This setting does not affect where transcoded files are stored for background transcoding tasks (sync or Media Optimizer).
https://support.plex.tv/articles/200250347-transcoder
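If that turns out to be the cause, one idea (completely untested from my side, and the storage name and paths below are only placeholders) would be to put that transcoder temporary directory on storage that is not shared with your PVE host's root, for example:
Code:
# VM: attach an extra virtual disk from other storage and mount it inside the guest
qm set <vmid> -scsi1 otherstorage:32
# container: bind-mount a directory from another dataset into the guest
pct set <ctid> -mp0 /tank/plex-transcode,mp=/transcode
# then point Plex's "Transcoder temporary directory" at that mount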

I hope it is understandable what I mean. Not even sure if I should have posted this with all those assumptions... Sorry if this makes absolutely no sense. :confused:
 

Hi Neobin,

No, I totally understand, and I think you're on to something.

I tried an experiment.

I filled the disk nearly to the brim by copying a file over from my larger disks:
Code:
NAME                           USED  AVAIL     REFER  MOUNTPOINT
rpool/ROOT/pve-1              25.0G  1.25G     25.0G  /
rpool/data/subvol-103-disk-0  34.6G  1.25G     7.45G  /rpool/data/subvol-103-disk-0

On the actual Plex container (the nominal size of this container disk is 20GB):
Code:
Filesystem                    Size  Used Avail Use% Mounted on
rpool/data/subvol-103-disk-0  8.8G  7.5G  1.3G  86% /
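(As an aside, comparing the configured size with what ZFS actually enforces on the subvol can be done with something like this, if I understand the quota handling correctly:)
Code:
pct config 103 | grep rootfs
zfs get quota,refquota rpool/data/subvol-103-disk-0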

What's interesting is that I got these errors, along with some connectivity errors (like 503s from my web servers), while copying the file over:
Code:
Nov 16 19:52:17 mir pmxcfs[3160]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-node/mir: -1
Nov 16 19:52:17 mir pmxcfs[3160]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/101: -1
...
Nov 16 19:52:17 mir pmxcfs[3160]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/mir/local: -1
Nov 16 19:52:17 mir pmxcfs[3160]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/mir/local-zfs: -1
I wonder if there could be an issue when high volumes of data are being processed. This SSD is quite old, which I'll come back to below.

Anyhow, then I played a movie. I tried to play the one that broke it earlier, but Plex refused to play it due to lack of disk space.

I watched the Plex logs and got many repeated warnings like:

Code:
[Req#60299/Transcode/2jv6eyx0yeh0wfyywpp6ued1/7969ebe4-c864-4d13-8a9a-55ff2a87bef4] Low disk space: 785.81MB source file, 8.7GB capacity, 1.12MB available on "/var/lib/plexmediaserver/Library/Application Support/Plex Media Server/Cache/Transcode/Sessions/plex-transcode-2jv6eyx0yeh0wfyywpp6ued1-7969ebe4-c864-4d13-8a9a-55ff2a87bef4"

I also had a watch running on the disk space. Basically, it quickly got down to around 100MB free and then seemed to hover around there. I left it running for a few minutes, and Plex always cleared enough disk space in time.
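The watch was roughly this (dataset names are from my pool; adjust as needed):
Code:
watch -n 5 'zfs list -o name,used,avail rpool/ROOT/pve-1 rpool/data/subvol-103-disk-0'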

To force the disk to run out of space, I started another copy so a second stream of data was coming in. This got the disk completely full, as you can see in the Plex log above.

This promptly crashed Plex with "Conversion failed. The transcoder exited due to an error." Normally Plex doesn't crash, even though the issue breaks other aspects of my system.

I noticed these errors in the syslog:
Code:
Nov 16 20:08:42 mir pvescheduler[2160000]: replication: close (rename) atomic file '/var/lib/pve-manager/pve-replication-state.json' failed: No space left on device
...
Nov 16 20:09:03 mir nmbd[2167918]:   messaging_reinit() failed: NT_STATUS_DISK_FULL

To see if it was a full disk causing it, I filled the rest of the disk with another file. I watched syslog for an hour and got a few of these:
Code:
Nov 16 20:57:02 mir pmxcfs[3160]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/103: -1
Nov 16 20:57:02 mir pmxcfs[3160]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/105: -1
But it didn't seem to crash pmxcfs.

I'm wondering if it's a timing thing. It could be that when the disk is close to full, Plex fills it up completely, and while it stays full it eventually conflicts with pmxcfs and causes it to crash. When the Plex movie finishes, the transcoder drops all that data, so afterwards it never looks as though the disk was full.

I might try somehow limiting the container disk to a specific size to ensure there is always enough space left on the disk for the OS.
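A rough sketch of what I have in mind, completely untested so far (the sizes are just placeholders):
Code:
# cap what the Plex container's dataset can consume
zfs set quota=20G rpool/data/subvol-103-disk-0
# and/or guarantee the host root filesystem some headroom
zfs set reservation=5G rpool/ROOT/pve-1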

The other thing to note is that the SSD is, at this point, quite old; it's from circa 2012. I'm wondering if aged flash memory could also be playing a part.
 
