Hi,
we have one node in our cluster that turns "red" every 1-3 days. When that happens, we can't open "Summary" or "System" for that node. The pmxcfs process sits at 100% CPU, and we have to restart the whole node to fix the issue temporarily.
Linux bondsir003-74050-bl10 4.15.18-14-pve #1 SMP PVE 4.15.18-38 (Tue, 30 Apr 2019 10:51:33 +0200) x86_64 GNU/Linux
Syslog:
May 21 16:20:45 bondsir003-74050-bl10 systemd[1]: Started Session 4309 of user root.
May 21 16:25:41 bondsir003-74050-bl10 systemd[1]: Started Session 4315 of user root.
May 21 16:25:57 bondsir003-74050-bl10 systemd[1]: Stopping The Proxmox VE cluster filesystem...
May 21 16:26:01 bondsir003-74050-bl10 cron[1840]: (*system*vzdump) CAN'T OPEN SYMLINK (/etc/cron.d/vzdump)
May 21 16:26:08 bondsir003-74050-bl10 systemd[1]: pve-cluster.service: State 'stop-sigterm' timed out. Killing.
May 21 16:26:08 bondsir003-74050-bl10 systemd[1]: pve-cluster.service: Killing process 1788 (pmxcfs) with signal SIGKILL.
May 21 16:26:08 bondsir003-74050-bl10 systemd[1]: pve-cluster.service: Main process exited, code=killed, status=9/KILL
May 21 16:26:08 bondsir003-74050-bl10 pvesr[33277]: error with cfs lock 'file-replication_cfg': no quorum!
May 21 16:26:08 bondsir003-74050-bl10 pve-ha-lrm[1988]: unable to write lrm status file - closing file '/etc/pve/nodes/bondsir003-74050-bl10/lrm_status.tmp.1988' failed - Software caused connection abort
May 21 16:26:08 bondsir003-74050-bl10 systemd[1]: Stopped The Proxmox VE cluster filesystem.
May 21 16:26:08 bondsir003-74050-bl10 systemd[1]: pve-cluster.service: Unit entered failed state.
May 21 16:26:08 bondsir003-74050-bl10 systemd[1]: pve-cluster.service: Failed with result 'timeout'.
May 21 16:26:08 bondsir003-74050-bl10 systemd[1]: Stopping Corosync Cluster Engine...
May 21 16:26:08 bondsir003-74050-bl10 systemd[1]: pvesr.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
May 21 16:26:08 bondsir003-74050-bl10 systemd[1]: Failed to start Proxmox VE replication runner.
May 21 16:26:08 bondsir003-74050-bl10 systemd[1]: pvesr.service: Unit entered failed state.
May 21 16:26:08 bondsir003-74050-bl10 systemd[1]: pvesr.service: Failed with result 'exit-code'.
May 21 16:26:08 bondsir003-74050-bl10 systemd[1]: Starting Proxmox VE replication runner...
May 21 16:26:08 bondsir003-74050-bl10 pve-firewall[1896]: status update error: Connection refused
May 21 16:26:08 bondsir003-74050-bl10 pve-firewall[1896]: firewall update time (972.065 seconds)
May 21 16:26:08 bondsir003-74050-bl10 pve-firewall[1896]: status update error: Connection refused
May 21 16:26:08 bondsir003-74050-bl10 pveproxy[63193]: ipcc_send_rec[1] failed: Connection refused
May 21 16:26:08 bondsir003-74050-bl10 pveproxy[63193]: ipcc_send_rec[2] failed: Connection refused
May 21 16:26:08 bondsir003-74050-bl10 pveproxy[63193]: ipcc_send_rec[3] failed: Connection refused
May 21 16:26:08 bondsir003-74050-bl10 pvestatd[1892]: ipcc_send_rec[4] failed: Connection refused
May 21 16:26:08 bondsir003-74050-bl10 pvesr[35287]: ipcc_send_rec[1] failed: Connection refused
May 21 16:26:08 bondsir003-74050-bl10 pvesr[35287]: ipcc_send_rec[2] failed: Connection refused
May 21 16:26:08 bondsir003-74050-bl10 pvesr[35287]: ipcc_send_rec[3] failed: Connection refused
May 21 16:26:08 bondsir003-74050-bl10 pvesr[35287]: Unable to load access control list: Connection refused
May 21 16:26:08 bondsir003-74050-bl10 systemd[1]: pvesr.service: Main process exited, code=exited, status=111/n/a
May 21 16:26:08 bondsir003-74050-bl10 systemd[1]: Failed to start Proxmox VE replication runner.
May 21 16:26:08 bondsir003-74050-bl10 systemd[1]: pvesr.service: Unit entered failed state.
May 21 16:26:08 bondsir003-74050-bl10 systemd[1]: pvesr.service: Failed with result 'exit-code'.
May 21 16:26:08 bondsir003-74050-bl10 pveproxy[63194]: ipcc_send_rec[1] failed: Connection refused
May 21 16:26:08 bondsir003-74050-bl10 pveproxy[63194]: ipcc_send_rec[2] failed: Connection refused
May 21 16:26:08 bondsir003-74050-bl10 pveproxy[63194]: ipcc_send_rec[3] failed: Connection refused
May 21 16:26:08 bondsir003-74050-bl10 pvestatd[1892]: ipcc_send_rec[4] failed: Connection refused
May 21 16:26:08 bondsir003-74050-bl10 pvestatd[1892]: ipcc_send_rec[4] failed: Connection refused
May 21 16:26:08 bondsir003-74050-bl10 pvestatd[1892]: ipcc_send_rec[4] failed: Connection refused
May 21 16:26:08 bondsir003-74050-bl10 pvestatd[1892]: ipcc_send_rec[4] failed: Connection refused
May 21 16:26:08 bondsir003-74050-bl10 pvestatd[1892]: status update time (977.179 seconds)
May 21 16:26:08 bondsir003-74050-bl10 pvestatd[1892]: ipcc_send_rec[1] failed: Connection refused
May 21 16:26:08 bondsir003-74050-bl10 pvestatd[1892]: ipcc_send_rec[2] failed: Connection refused
May 21 16:26:08 bondsir003-74050-bl10 pvestatd[1892]: ipcc_send_rec[3] failed: Connection refused
May 21 16:26:08 bondsir003-74050-bl10 pvestatd[1892]: ipcc_send_rec[4] failed: Connection refused
May 21 16:26:08 bondsir003-74050-bl10 pvestatd[1892]: status update error: Connection refused
May 21 16:26:09 bondsir003-74050-bl10 pveproxy[63192]: ipcc_send_rec[1] failed: Connection refused
May 21 16:26:09 bondsir003-74050-bl10 pveproxy[63192]: ipcc_send_rec[2] failed: Connection refused
May 21 16:26:09 bondsir003-74050-bl10 pveproxy[63192]: ipcc_send_rec[3] failed: Connection refused
May 21 16:26:13 bondsir003-74050-bl10 pve-ha-lrm[1988]: loop take too long (1228 seconds)
May 21 16:26:13 bondsir003-74050-bl10 pve-ha-lrm[1988]: updating service status from manager failed: Connection refused
May 21 16:26:14 bondsir003-74050-bl10 pve-ha-crm[1941]: loop take too long (976 seconds)
May 21 16:26:17 bondsir003-74050-bl10 pveproxy[63194]: ipcc_send_rec[1] failed: Connection refused
May 21 16:26:17 bondsir003-74050-bl10 pveproxy[63194]: ipcc_send_rec[2] failed: Connection refused
May 21 16:26:17 bondsir003-74050-bl10 pveproxy[63194]: ipcc_send_rec[3] failed: Connection refused
May 21 16:26:18 bondsir003-74050-bl10 pve-ha-lrm[1988]: updating service status from manager failed: Connection refused
Binary file (standard input) matches
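In case it helps narrow down when the hang actually starts: several daemons report how long their last loop was blocked (e.g. "status update time (977.179 seconds)"), so subtracting that duration from the log timestamp gives a rough start time for the stall. Below is a minimal Python sketch of that calculation. The function name `stall_starts` and the regular expression are my own, syslog lines carry no year so 2019 is assumed, and it only covers the message shapes seen in the log above.

```python
import re
from datetime import datetime, timedelta

# Matches timing lines such as "status update time (977.179 seconds)" or
# "loop take too long (1228 seconds)". Only these "(N seconds)" shapes are handled.
STALL_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+\S+\s+"
    r"(?P<proc>[\w-]+)\[\d+\]:.*\((?P<secs>\d+(?:\.\d+)?) seconds\)"
)


def stall_starts(lines, year=2019):
    """Return (process, estimated stall start) for every timing line.

    The year is an assumption, because syslog timestamps omit it.
    """
    results = []
    for line in lines:
        m = STALL_RE.match(line)
        if not m:
            continue  # not a timing line
        logged = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
        start = logged - timedelta(seconds=float(m["secs"]))
        # Drop sub-second precision; the estimate is rough anyway.
        results.append((m["proc"], start.replace(microsecond=0)))
    return results


if __name__ == "__main__":
    # Hypothetical usage: read the node's syslog and print the estimates.
    with open("/var/log/syslog", errors="replace") as fh:
        for proc, start in stall_starts(fh):
            print(f"{proc}: blocked since about {start}")
```

Applied to the lines above, the estimates fall between roughly 16:05 and 16:10, so whatever blocks pmxcfs seems to have started about 15 to 20 minutes before we tried to stop the service.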
Any ideas?