Hi all. I've been using Proxmox successfully for about a year and a half now - I have two boxes, and I use VMs, containers, and general services like Samba on both. They are not in an HA cluster; they operate stand-alone. I also do not use the subscription repositories. However, I have a semi-recurring issue that tends to lock up my entire server, forcing a restart.
One of the services I run is a Plex server, which runs in a VM. There seems to be a very strong correlation between watching movies on this Plex server and triggering this issue. It does not happen every time I watch a movie on Plex, but when the issue does occur, it almost always starts while a movie is playing.
The issue is that pmxcfs seems to lock up. Back in August, this was visible in the syslog (about 10-30 minutes into my movie) as:
Code:
Aug 23 21:38:17 mir pmxcfs[3787]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-node/mir: -1
Aug 23 21:38:17 mir pmxcfs[3787]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/102: -1
Aug 23 21:38:17 mir pmxcfs[3787]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/108: -1
Aug 23 21:38:17 mir pmxcfs[3787]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/100: -1
Aug 23 21:38:17 mir pmxcfs[3787]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/107: -1
Aug 23 21:38:17 mir pmxcfs[3787]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/101: -1
Aug 23 21:38:17 mir pmxcfs[3787]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/110: -1
Aug 23 21:38:17 mir pmxcfs[3787]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/106: -1
Aug 23 21:38:17 mir pmxcfs[3787]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/105: -1
Aug 23 21:38:17 mir pmxcfs[3787]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/104: -1
Aug 23 21:38:17 mir pmxcfs[3787]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/103: -1
Aug 23 21:38:17 mir pmxcfs[3787]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/mir/local-zfs: -1
Aug 23 21:38:17 mir pmxcfs[3787]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/mir/local: -1
Aug 23 21:38:22 mir pve-ha-crm[3960]: loop take too long (78 seconds)
Aug 23 21:38:22 mir pve-ha-lrm[3969]: loop take too long (84 seconds)
Code:
Aug 23 22:06:19 mir pvescheduler[1013120]: jobs: cfs-lock 'file-jobs_cfg' error: got lock request timeout
And the first line of this next block seems to be the critical error:
Code:
Aug 23 22:12:07 mir pmxcfs[3787]: [database] crit: commit transaction failed: database or disk is full#010
Aug 23 22:12:10 mir pmxcfs[3787]: [database] crit: rollback transaction failed: cannot rollback - no transaction is active#010
Aug 23 22:12:10 mir pve-ha-lrm[3969]: unable to write lrm status file - unable to delete old temp file: Input/output error
Aug 23 22:12:23 mir pve-ha-lrm[3969]: unable to write lrm status file - unable to delete old temp file: Input/output error
Aug 23 22:12:23 mir pvescheduler[1127531]: jobs: cfs-lock 'file-jobs_cfg' error: got lock request timeout
Aug 23 22:12:23 mir pve-ha-lrm[3969]: unable to write lrm status file - unable to delete old temp file: Input/output error
Aug 23 22:12:23 mir pvescheduler[1127530]: replication: cfs-lock 'file-replication_cfg' error: got lock request timeout
Aug 23 22:12:23 mir pve-ha-lrm[3969]: unable to write lrm status file - unable to delete old temp file: Input/output error
The lrm status file lines continue until I resolved it the next morning. They are also interspersed with:
Code:
Aug 23 23:25:13 mir pvescheduler[1781478]: jobs: cfs-lock 'file-jobs_cfg' error: got lock request timeout
Aug 23 23:25:13 mir pvescheduler[1781477]: replication: cfs-lock 'file-replication_cfg' error: got lock request timeout
Restarting also resolves the issue.
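(In case it's relevant: my understanding is that pmxcfs runs under the pve-cluster service, so a service-level restart, rather than a full reboot, would look roughly like the line below - though I can't say yet whether that alone clears this particular state.)
Code:
# untested here - pve-cluster is, as far as I know, the unit that runs pmxcfs
sudo systemctl restart pve-cluster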
I had this issue again in the last few days:
Code:
Nov 10 18:34:00 mir pmxcfs[3264]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-node/mir: -1
Nov 10 18:34:00 mir pmxcfs[3264]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/108: -1
Nov 10 18:34:00 mir pmxcfs[3264]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/100: -1
Nov 10 18:34:00 mir pmxcfs[3264]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/107: -1
Nov 10 18:34:00 mir pmxcfs[3264]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/101: -1
Nov 10 18:34:00 mir pmxcfs[3264]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/102: -1
Nov 10 18:34:00 mir pmxcfs[3264]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/110: -1
Nov 10 18:34:00 mir pmxcfs[3264]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/105: -1
Nov 10 18:34:00 mir pmxcfs[3264]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/106: -1
Nov 10 18:34:00 mir pmxcfs[3264]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/104: -1
Nov 10 18:34:00 mir pmxcfs[3264]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/103: -1
Nov 10 18:34:00 mir pmxcfs[3264]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/mir/local-zfs: -1
Nov 10 18:34:00 mir pmxcfs[3264]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/mir/local: -1
Nov 10 18:34:17 mir pvescheduler[668614]: jobs: cfs-lock 'file-jobs_cfg' error: got lock request timeout
Code:
Nov 10 19:07:37 mir pmxcfs[3264]: [database] crit: commit transaction failed: database or disk is full#010
Nov 10 19:07:37 mir pmxcfs[3264]: [database] crit: rollback transaction failed: cannot rollback - no transaction is active#010
Code:
Nov 12 14:56:39 mir pve-ha-lrm[3384]: unable to write lrm status file - unable to open file '/etc/pve/nodes/mir/lrm_status.tmp.3384' - Input/output error
Nov 12 14:56:44 mir pve-ha-lrm[3384]: unable to write lrm status file - unable to open file '/etc/pve/nodes/mir/lrm_status.tmp.3384' - Input/output error
Nov 12 14:56:48 mir pvestatd[3325]: authkey rotation error: cfs-lock 'authkey' error: got lock request timeout
Nov 12 14:56:48 mir pvestatd[3325]: status update time (9.078 seconds)
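(All of the excerpts above are straight from the node's journal. If anyone wants the same view of just the pmxcfs lines on their own node, something along these lines should work, assuming pmxcfs runs under the default pve-cluster unit:)
Code:
# filter the journal for pmxcfs critical/database messages (unit name assumed)
journalctl -u pve-cluster | grep -iE 'crit|database'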
For information's sake:
Code:
Linux mir 5.15.53-1-pve #1 SMP PVE 5.15.53-1 (Fri, 26 Aug 2022 16:53:52 +0200) x86_64 GNU/Linux
Current uptime 46 days. Happy to provide any other useful info.
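If fuller version information helps, I can also post the standard package listing, e.g.:
Code:
# full Proxmox VE package versions
pveversion -v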
The impacts of this are severe. It prevents the web UI from working unless the workaround in this thread is applied, and, most concerningly, it sometimes pauses all VMs (though not containers). The paused VMs show up in the UI with a yellow triangle saying "IO error", which appears to be caused by this issue. Sometimes the VMs don't pause, but network traffic to them is interrupted and I cannot access them until I restart.
sudo qm resume <vmid>
is required to get the VMs going again, and only after the actions in the above thread are taken.
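Doing that for every guest by hand gets tedious, so something like this rough loop over the affected VMIDs (taken from the logs above) would save some typing - hedged, since I have not checked how qm resume behaves on a VM that is already running:
Code:
# resume each VM on this node one by one (VMIDs from the logs above; adjust as needed)
for vmid in 100 101 102 103 104 105 106 107 108 110; do
    sudo qm resume "$vmid"
done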
Similar symptoms here, but this guy seems to have a genuine disk issue, so I don't think it's quite the same:
https://forum.proxmox.com/threads/cant-unlock-vm.68335/
The high-level consensus from what I have found is that the pmxcfs database is overflowing, but there seems to be no way to predict this, or reliably avoid it, without just scheduling a restart every so often, which I would rather not do.
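To at least have a chance of seeing it coming, I am thinking of sampling the size of the backing database over time from cron or a systemd timer, roughly like this (assuming the default path /var/lib/pve-cluster/config.db - this is only a sketch):
Code:
# append a timestamped size sample for the pmxcfs backing database (default path assumed);
# run hourly from cron or a systemd timer to build up a history
stat -c '%y %s' /var/lib/pve-cluster/config.db >> /var/log/pmxcfs-size.log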
I have confirmed we haven't run out of root disk space yet. I can access SSH and files over Samba. It just seems to be the web console and some VMs that are affected.
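As I understand it, /etc/pve is a separate FUSE mount with its own (much smaller) size limit, backed by the pmxcfs database, so free space on the root disk may not tell the whole story. The checks I plan to run the next time it happens are roughly:
Code:
# how full is the pmxcfs mount itself?
df -h /etc/pve
# and how big is the backing database? (default path, as far as I know)
ls -lh /var/lib/pve-cluster/config.db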
The issue seems to strike at random, but consistently while watching movies with Plex. It's not frequent, but it's totally derailing when it does occur.
Is there anything that can be done?