Hi forum,
we run a six-node PVE/Ceph cluster with two corosync rings.
Yesterday we had a segfault on all nodes at the same time. Every node rebooted.
Feb 25 13:16:01 node1 systemd[1]: Started Proxmox VE replication runner.
Feb 25 13:16:01 node1 pve-ha-crm[4427]: service 'vm:3064' without node
Feb 25 13:16:04 node1 kernel: [2668121.457529] server[3363]: segfault at 7efe5d546c79 ip 000055f06e82b7b9 sp 00007efe53ffea10 error 4 in pmxcfs[55f06e80e000+2b000]
Feb 25 13:16:04 node1 systemd[1]: pve-cluster.service: Main process exited, code=killed, status=11/SEGV
Feb 25 13:16:04 node1 systemd[1]: pve-cluster.service: Unit entered failed state.
Feb 25 13:16:04 node1 systemd[1]: pve-cluster.service: Failed with result 'signal'.
Feb 25 13:16:05 node1 pve-ha-lrm[6471]: updating service status from manager failed: Connection refused
Feb 25 13:16:05 node1 pve-ha-lrm[6471]: lost lock 'ha_agent_node1_lock - can't create '/etc/pve/priv/lock' (pmxcfs not mounted?)
Feb 25 13:16:06 node1 pveproxy[188716]: ipcc_send_rec[1] failed: Connection refused
Feb 25 13:16:06 node1 pveproxy[188716]: ipcc_send_rec[2] failed: Connection refused
Feb 25 13:16:06 node1 pveproxy[188716]: ipcc_send_rec[3] failed: Connection refused
Feb 25 13:16:06 node1 pveproxy[195960]: ipcc_send_rec[1] failed: Connection refused
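In case it helps with the analysis, the kernel segfault line can be decoded a bit further. This is only a rough sketch in Python: the function name decode_segfault and the regular expression are my own and not part of any Proxmox tool. It assumes the usual x86 format of the message, where the error code is printed in hex and its low three bits mean page present / write access / user mode. It computes the offset of the faulting instruction inside pmxcfs, which could be passed to a tool like addr2line if matching debug symbols are installed.

```python
import re

# Pattern for the x86 kernel segfault message (assumed format, all numbers in hex).
SEGFAULT_RE = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) "
    r"error (?P<err>[0-9a-f]+) in (?P<obj>[^\[\s]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def decode_segfault(line):
    """Return a dict describing the fault, or None if the line does not match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    ip = int(m["ip"], 16)
    base = int(m["base"], 16)
    err = int(m["err"], 16)
    return {
        "object": m["obj"],
        # Offset of the faulting instruction relative to the mapping start.
        "offset": ip - base,
        # x86 page-fault error code bits: 0 = present, 1 = write, 2 = user mode.
        "page_present": bool(err & 1),
        "access": "write" if err & 2 else "read",
        "user_mode": bool(err & 4),
    }


# The kernel line from node1, joined into one string.
LINE = (
    "Feb 25 13:16:04 node1 kernel: [2668121.457529] server[3363]: "
    "segfault at 7efe5d546c79 ip 000055f06e82b7b9 sp 00007efe53ffea10 "
    "error 4 in pmxcfs[55f06e80e000+2b000]"
)

info = decode_segfault(LINE)
print(info["object"], hex(info["offset"]), info["access"], info["page_present"])
```

For the line from node1 this gives offset 0x1d7b9 in pmxcfs and a user-mode read of a page that was not mapped.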
We see a lot of messages similar to this in the syslog:
Feb 25 13:16:01 node1 pve-ha-crm[4427]: service 'vm:3064' without node
The HA manager doesn't know resource 3064, but we see it in /etc/pve/ha/resources.cfg as ignored.
We have no network problems on the corosync NICs and switches.
Does anybody know this problem?