Hi,
My cluster suddenly rebooted in the morning and all hardware nodes went down: VMs could not start, the master node was dead, and the LRM services were stuck in wait_for_agent_lock.
VM status: fence, and they cannot be started.
Firstly, may I know what can lead to this problem? Did the VMs fail to start because the master was dead (see attachment)?
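If it helps with diagnosis, this is the kind of output I can collect and post; the command list below is only my guess at what is relevant here:

# Current HA master, LRM state of every node and state of each HA resource
ha-manager status

# State of the HA services themselves on one node
systemctl status pve-ha-crm pve-ha-lrm

# Quorum / membership as corosync sees it from this node
pvecm status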
I tried shutting down all nodes except one (node19) and changing the totem bindnetaddr back to a specific IP address instead of the network address:
Original (config_version 34):

totem {
  cluster_name: clustername
  config_version: 34
  interface {
    bindnetaddr: 10.10.30.0
    ringnumber: 0
  }
  interface {
    bindnetaddr: 10.20.30.0
    ringnumber: 1
  }
  ip_version: ipv4
  rrp_mode: passive
  secauth: on
  version: 2
}
Changed on node19 (config_version 35):

totem {
  cluster_name: clustername
  config_version: 35
  interface {
    bindnetaddr: 10.10.30.169
    ringnumber: 0
  }
  interface {
    bindnetaddr: 10.20.30.169
    ringnumber: 1
  }
  ip_version: ipv4
  rrp_mode: passive
  secauth: on
  version: 2
}
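For context, my understanding of the documented way to apply such a totem change cluster-wide is roughly the sequence below (the working-copy path is just an example); please correct me if I did this part wrong:

# Work on a copy, never edit the live file in place
cp /etc/pve/corosync.conf /root/corosync.conf.new
nano /root/corosync.conf.new      # change bindnetaddr and bump config_version

# Copying it back lets pmxcfs distribute the new version to all nodes
cp /root/corosync.conf.new /etc/pve/corosync.conf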
After editing and rebooting node19, pvecm status only shows that one node. I then started the remaining nodes back up, but they cannot see node19 either, so it looks like node19 is now in a separate cluster.
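To show the split, I can gather something like the following from node19 and from one of the other nodes, if that is useful (just my guess at the relevant commands):

# Membership and quorum as each node sees it
pvecm status

# Status of ring0 and ring1 on this node
corosync-cfgtool -s

# Member list from the corosync runtime database
corosync-cmapctl | grep members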
Secondly, is there any way to bring node19 back into this cluster? I tried copying a backup of corosync.conf over /etc/pve/corosync.conf (after pvecm e 1), but it did not work.
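Would the correct way to put the old config back on node19 be something along these lines? This is only my reading of the docs, and the backup path is just an example:

# Stop the cluster filesystem and corosync on node19
systemctl stop pve-cluster corosync

# Start pmxcfs in local mode so /etc/pve is writable without quorum
pmxcfs -l

# Restore the known-good config for both pmxcfs and corosync
cp /root/corosync.conf.backup /etc/pve/corosync.conf
cp /root/corosync.conf.backup /etc/corosync/corosync.conf

# Leave local mode and start the normal services again
killall pmxcfs
systemctl start pve-cluster corosync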
After nearly 10 hours of downtime I had to disable all ha-manager configuration and start the VPSes manually. I think this cluster has a cluster-wide problem (we run nearly 10 clusters, but this one has an incident every month).
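By "disable all ha-manager config" I mean removing the resources from HA and starting the guests by hand, along these lines (the IDs are examples):

# Take a guest out of HA management
ha-manager remove vm:100

# Then start it manually
qm start 100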
The ring1 network (multicast enabled) always shows this error:
systemctl status corosync.service
● corosync.service - Corosync Cluster Engine
   Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
   Active: active (running) since Wed 2018-10-17 17:42:38 +07; 4 days ago
     Docs: man:corosync
           man:corosync.conf
           man:corosync_overview
 Main PID: 3514 (corosync)
    Tasks: 2 (limit: 23347)
   Memory: 66.6M
      CPU: 1h 44min 33.832s
   CGroup: /system.slice/corosync.service
           └─3514 /usr/sbin/corosync -f
Oct 21 22:23:55 node109 corosync[3514]: notice [TOTEM ] Retransmit List: 1aeccce
Oct 21 22:23:55 node109 corosync[3514]: [TOTEM ] Retransmit List: 1aeccce
Oct 22 01:11:10 node109 corosync[3514]: error [TOTEM ] Marking ringid 1 interface 10.20.30.159 FAULTY
Oct 22 01:11:10 node109 corosync[3514]: [TOTEM ] Marking ringid 1 interface 10.20.30.159 FAULTY
Oct 22 01:11:11 node109 corosync[3514]: notice [TOTEM ] Automatically recovered ring 1
Oct 22 01:11:11 node109 corosync[3514]: [TOTEM ] Automatically recovered ring 1
Oct 22 01:12:51 node109 corosync[3514]: error [TOTEM ] Marking ringid 1 interface 10.20.30.159 FAULTY
Oct 22 01:12:51 node109 corosync[3514]: [TOTEM ] Marking ringid 1 interface 10.20.30.159 FAULTY
Oct 22 01:12:52 node109 corosync[3514]: notice [TOTEM ] Automatically recovered ring 1
Oct 22 01:12:52 node109 corosync[3514]: [TOTEM ] Automatically recovered ring 1
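Since ring1 keeps being marked FAULTY, I assume I should test multicast on that network. Is omping the right tool for this? Something like the following, run on all nodes of the ring1 network at the same time (hostnames are examples):

# Any packet loss here would point at a multicast / switch problem on ring1
omping -c 10000 -i 0.001 -F -q node19-ring1 node20-ring1 node21-ring1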