Hello all,
Before I begin, I want to thank you all for your great posts and replies here. We've been using Proxmox for over a year and have never needed to post, since we've always found the answers by searching (and that may be the case here too, but I'm currently on vacation and my wife will kill me if I spend 8+ hours on a computer). We're a large web host based in the US, currently building out our own datacenter and using Proxmox to power our virtualization. I cannot say enough good things about the software and the community behind it.
With that out of the way: about four hours ago Zabbix notified us of a failed interface/link on one of our switches, immediately followed by alerts that one of the Proxmox nodes in our 4-node cluster was unreachable. I initially did a graceful reboot of the failed node to see if that made any difference (it didn't). On further investigation it appears to be a failed optic or cable, which we are in the process of getting replaced. In the meantime, though, we do have a redundant cluster link in place, but the failover link(s) don't appear to be kicking in.
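For reference, these are the checks I plan to run on the failed node to see whether corosync even notices the second (ring1) link; I'm assuming corosync-cfgtool is the right way to inspect the running knet state:

Code:
# On the failed node: local ring/link status as corosync sees it
corosync-cfgtool -s

# Per-node knet link state for the whole cluster (corosync 3.x)
corosync-cfgtool -n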
Here's the syslog on the failed node at the time of the failure; see the line at Oct 16 10:33:21 (I've replaced the hostname with "failed-node", which is nodeid 2 in the cluster):
root@failed-node:/etc/pve# grep "Oct 16 10:3" /var/log/syslog
Code:
Oct 16 10:31:06 failed-node pmxcfs[2900]: [status] notice: received log
Oct 16 10:32:00 failed-node systemd[1]: Starting Proxmox VE replication runner...
Oct 16 10:32:00 failed-node systemd[1]: pvesr.service: Succeeded.
Oct 16 10:32:00 failed-node systemd[1]: Started Proxmox VE replication runner.
Oct 16 10:33:00 failed-node systemd[1]: Starting Proxmox VE replication runner...
Oct 16 10:33:00 failed-node systemd[1]: pvesr.service: Succeeded.
Oct 16 10:33:00 failed-node systemd[1]: Started Proxmox VE replication runner.
Oct 16 10:33:21 failed-node kernel: [2493654.770594] i40e 0000:25:00.0 ens2f0: NIC Link is Down
Oct 16 10:33:23 failed-node corosync[3013]: [KNET ] link: host: 1 link: 0 is down
Oct 16 10:33:23 failed-node corosync[3013]: [KNET ] link: host: 4 link: 0 is down
Oct 16 10:33:23 failed-node corosync[3013]: [KNET ] link: host: 3 link: 0 is down
Oct 16 10:33:23 failed-node corosync[3013]: [KNET ] host: host: 1 (passive) best link: 1 (pri: 1)
Oct 16 10:33:23 failed-node corosync[3013]: [KNET ] host: host: 4 (passive) best link: 1 (pri: 1)
Oct 16 10:33:23 failed-node corosync[3013]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
Oct 16 10:33:23 failed-node corosync[3013]: [KNET ] host: host: 3 has no active links
Oct 16 10:33:29 failed-node corosync[3013]: [TOTEM ] Token has not been received in 5514 ms
Oct 16 10:33:33 failed-node corosync[3013]: [TOTEM ] Token has not been received in 9815 ms
Oct 16 10:33:39 failed-node corosync[3013]: [QUORUM] Sync members[1]: 2
Oct 16 10:33:39 failed-node corosync[3013]: [QUORUM] Sync left[3]: 1 3 4
Oct 16 10:33:39 failed-node corosync[3013]: [TOTEM ] A new membership (2.298) was formed. Members left: 1 3 4
Oct 16 10:33:39 failed-node corosync[3013]: [TOTEM ] Failed to receive the leave message. failed: 1 3 4
Oct 16 10:33:39 failed-node pmxcfs[2900]: [dcdb] notice: members: 2/2900
Oct 16 10:33:39 failed-node pmxcfs[2900]: [status] notice: members: 2/2900
Oct 16 10:33:39 failed-node corosync[3013]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
Oct 16 10:33:39 failed-node corosync[3013]: [QUORUM] Members[1]: 2
Oct 16 10:33:39 failed-node corosync[3013]: [MAIN ] Completed service synchronization, ready to provide service.
Oct 16 10:33:39 failed-node pmxcfs[2900]: [status] notice: node lost quorum
Oct 16 10:33:39 failed-node pmxcfs[2900]: [dcdb] crit: received write while not quorate - trigger resync
Oct 16 10:33:39 failed-node pmxcfs[2900]: [dcdb] crit: leaving CPG group
Oct 16 10:33:39 failed-node pve-ha-lrm[3106]: unable to write lrm status file - closing file '/etc/pve/nodes/failed-node/lrm_status.tmp.3106' failed - Operation not permitted
Oct 16 10:33:42 failed-node corosync[3013]: [TOTEM ] Token has not been received in 3226 ms
Oct 16 10:33:47 failed-node corosync[3013]: [TOTEM ] Token has not been received in 7527 ms
Oct 16 10:33:53 failed-node corosync[3013]: [QUORUM] Sync members[1]: 2
Oct 16 10:33:53 failed-node corosync[3013]: [TOTEM ] A new membership (2.2a4) was formed. Members
Oct 16 10:33:53 failed-node corosync[3013]: [QUORUM] Members[1]: 2
Oct 16 10:33:53 failed-node corosync[3013]: [MAIN ] Completed service synchronization, ready to provide service.
Oct 16 10:33:53 failed-node pmxcfs[2900]: [dcdb] notice: start cluster connection
Oct 16 10:33:53 failed-node pmxcfs[2900]: [dcdb] crit: cpg_join failed: 14
Oct 16 10:33:53 failed-node pmxcfs[2900]: [dcdb] crit: can't initialize service
Oct 16 10:33:53 failed-node pve-ha-crm[3096]: lost lock 'ha_manager_lock - cfs lock update failed - Device or resource busy
Oct 16 10:33:53 failed-node pve-ha-crm[3096]: status change master => lost_manager_lock
Oct 16 10:33:53 failed-node pve-ha-crm[3096]: watchdog closed (disabled)
Oct 16 10:33:53 failed-node pve-ha-crm[3096]: status change lost_manager_lock => wait_for_quorum
Oct 16 10:33:56 failed-node corosync[3013]: [TOTEM ] Token has not been received in 3226 ms
Oct 16 10:34:00 failed-node systemd[1]: Starting Proxmox VE replication runner...
Oct 16 10:34:00 failed-node pvesr[51807]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 16 10:34:01 failed-node corosync[3013]: [TOTEM ] Token has not been received in 7526 ms
Oct 16 10:34:01 failed-node pvesr[51807]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 16 10:34:02 failed-node pvesr[51807]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 16 10:34:03 failed-node pvesr[51807]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 16 10:34:04 failed-node pvesr[51807]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 16 10:34:05 failed-node pvesr[51807]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 16 10:34:06 failed-node pvesr[51807]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 16 10:34:07 failed-node corosync[3013]: [QUORUM] Sync members[1]: 2
Oct 16 10:34:07 failed-node corosync[3013]: [TOTEM ] A new membership (2.2b0) was formed. Members
Oct 16 10:34:07 failed-node corosync[3013]: [QUORUM] Members[1]: 2
Oct 16 10:34:07 failed-node corosync[3013]: [MAIN ] Completed service synchronization, ready to provide service.
Oct 16 10:34:07 failed-node pmxcfs[2900]: [dcdb] notice: members: 2/2900
Oct 16 10:34:07 failed-node pmxcfs[2900]: [dcdb] notice: all data is up to date
Oct 16 10:34:07 failed-node pvesr[51807]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 16 10:34:08 failed-node pvesr[51807]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 16 10:34:09 failed-node pvesr[51807]: cfs-lock 'file-replication_cfg' error: no quorum!
Oct 16 10:34:09 failed-node systemd[1]: pvesr.service: Main process exited, code=exited, status=13/n/a
Oct 16 10:34:09 failed-node systemd[1]: pvesr.service: Failed with result 'exit-code'.
Oct 16 10:34:09 failed-node systemd[1]: Failed to start Proxmox VE replication runner.
Oct 16 10:34:10 failed-node corosync[3013]: [TOTEM ] Token has not been received in 3226 ms
Oct 16 10:34:14 failed-node corosync[3013]: [TOTEM ] Token has not been received in 7527 ms
Oct 16 10:34:21 failed-node corosync[3013]: [QUORUM] Sync members[1]: 2
Oct 16 10:34:21 failed-node corosync[3013]: [TOTEM ] A new membership (2.2bc) was formed. Members
Oct 16 10:34:21 failed-node corosync[3013]: [QUORUM] Members[1]: 2
Oct 16 10:34:21 failed-node corosync[3013]: [MAIN ] Completed service synchronization, ready to provide service.
Oct 16 10:34:24 failed-node corosync[3013]: [TOTEM ] Token has not been received in 3226 ms
Oct 16 10:34:28 failed-node corosync[3013]: [TOTEM ] Token has not been received in 7527 ms
Oct 16 10:34:34 failed-node corosync[3013]: [QUORUM] Sync members[1]: 2
Oct 16 10:34:34 failed-node corosync[3013]: [TOTEM ] A new membership (2.2c8) was formed. Members
Oct 16 10:34:34 failed-node corosync[3013]: [QUORUM] Members[1]: 2
Oct 16 10:34:34 failed-node corosync[3013]: [MAIN ] Completed service synchronization, ready to provide service.
Oct 16 10:34:38 failed-node corosync[3013]: [TOTEM ] Token has not been received in 3226 ms
Oct 16 10:34:42 failed-node corosync[3013]: [TOTEM ] Token has not been received in 7527 ms
Oct 16 10:34:48 failed-node corosync[3013]: [QUORUM] Sync members[1]: 2
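What stands out to me in the log above is that hosts 1 and 4 report "best link: 1" while host 3 ends up with no active links at all, yet the token never comes back and the node just keeps re-forming a single-member membership from 10:33:39 onward. To watch just those knet/totem transitions I've been filtering the same syslog like this:

Code:
# Just the corosync knet/totem state changes from the same window
grep "corosync" /var/log/syslog | grep -E "KNET|TOTEM" | grep "Oct 16 10:3"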
root@failed-node:/etc/pve# systemctl status pve-cluster corosync
Code:
● pve-cluster.service - The Proxmox VE cluster filesystem
     Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled; vendor preset: enabled)
     Active: active (running) since Sat 2021-10-16 13:46:13 CDT; 1h 30min ago
    Process: 7423 ExecStart=/usr/bin/pmxcfs (code=exited, status=0/SUCCESS)
   Main PID: 7433 (pmxcfs)
      Tasks: 7 (limit: 9830)
     Memory: 60.1M
     CGroup: /system.slice/pve-cluster.service
             └─7433 /usr/bin/pmxcfs

Oct 16 15:16:30 failed-node pmxcfs[7433]: [status] notice: cpg_send_message retried 100 times
Oct 16 15:16:30 failed-node pmxcfs[7433]: [status] crit: cpg_send_message failed: 6
Oct 16 15:16:31 failed-node pmxcfs[7433]: [status] notice: cpg_send_message retry 10
Oct 16 15:16:31 failed-node pmxcfs[7433]: [status] notice: cpg_send_message retried 14 times
Oct 16 15:16:31 failed-node pmxcfs[7433]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/102: -1
Oct 16 15:16:31 failed-node pmxcfs[7433]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/106: -1
Oct 16 15:16:31 failed-node pmxcfs[7433]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/failed-node/local-lvm: -1
Oct 16 15:16:31 failed-node pmxcfs[7433]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/failed-node/nvme-raid-10: -1
Oct 16 15:16:31 failed-node pmxcfs[7433]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/failed-node/local: -1
Oct 16 15:16:31 failed-node pmxcfs[7433]: [status] notice: RRD update error /var/lib/rrdcached/db/pve2-storage/failed-node/local: /var/lib/r

● corosync.service - Corosync Cluster Engine
     Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
     Active: active (running) since Sat 2021-10-16 13:46:14 CDT; 1h 30min ago
       Docs: man:corosync
             man:corosync.conf
             man:corosync_overview
   Main PID: 7582 (corosync)
      Tasks: 9 (limit: 9830)
     Memory: 197.2M
     CGroup: /system.slice/corosync.service
             └─7582 /usr/sbin/corosync -f

Oct 16 15:16:45 failed-node corosync[7582]: [TOTEM ] A new membership (2.3d99) was formed. Members
Oct 16 15:16:45 failed-node corosync[7582]: [QUORUM] Members[1]: 2
Oct 16 15:16:45 failed-node corosync[7582]: [MAIN ] Completed service synchronization, ready to provide service.
Oct 16 15:16:48 failed-node corosync[7582]: [TOTEM ] Token has not been received in 3226 ms
Oct 16 15:16:52 failed-node corosync[7582]: [TOTEM ] Token has not been received in 7527 ms
Oct 16 15:16:58 failed-node corosync[7582]: [QUORUM] Sync members[1]: 2
Oct 16 15:16:58 failed-node corosync[7582]: [TOTEM ] A new membership (2.3da5) was formed. Members
Oct 16 15:16:58 failed-node corosync[7582]: [QUORUM] Members[1]: 2
Oct 16 15:16:58 failed-node corosync[7582]: [MAIN ] Completed service synchronization, ready to provide service.
Oct 16 15:17:02 failed-node corosync[7582]: [TOTEM ] Token has not been received in 3226 ms
root@failed-node:/etc/pve# cat /etc/corosync/corosync.conf
Code:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: healthy-node-20
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.10.1.20
    ring1_addr: 10.10.2.20
  }
  node {
    name: healthy-node-21
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 10.10.1.21
    ring1_addr: 10.10.2.21
  }
  node {
    name: healthy-node-22
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.1.22
    ring1_addr: 10.10.2.22
  }
  node {
    name: failed-node
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.10.1.23
    ring1_addr: 10.10.2.23
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: Cloud-300
  config_version: 9
  interface {
    linknumber: 0
  }
  interface {
    linknumber: 1
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}
root@healthy-node-22:~# cat /etc/pve/corosync.conf
Code:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: healthy-node-20
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.10.1.20
    ring1_addr: 10.10.2.20
  }
  node {
    name: healthy-node-21
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 10.10.1.21
    ring1_addr: 10.10.2.21
  }
  node {
    name: healthy-node-22
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.1.22
    ring1_addr: 10.10.2.22
  }
  node {
    name: failed-node
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.10.1.23
    ring1_addr: 10.10.2.23
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: Cloud-300
  config_version: 9
  interface {
    linknumber: 0
  }
  interface {
    linknumber: 1
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}
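One thing I'm wondering about with the config above: we only ever set linknumber on the two interfaces and never any explicit knet link priorities. If that could be part of the problem, I'd be open to adding something like the sketch below (not applied yet, and I'm honestly not sure it's required when link_mode is passive); if we go that route we'd edit /etc/pve/corosync.conf and bump config_version as usual:

Code:
totem {
  cluster_name: Cloud-300
  config_version: 10       # bumped from 9 before applying
  interface {
    linknumber: 0
    knet_link_priority: 2  # prefer the 10.10.1.x network when it's healthy
  }
  interface {
    linknumber: 1
    knet_link_priority: 1  # fall back to the 10.10.2.x network
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}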
root@failed-node:/etc/pve# pvecm status
Code:
Cluster information
-------------------
Name:             Cloud-300
Config Version:   9
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Sat Oct 16 15:23:51 2021
Quorum provider:  corosync_votequorum
Nodes:            1
Node ID:          0x00000002
Ring ID:          2.3f15
Quorate:          No

Votequorum information
----------------------
Expected votes:   4
Highest expected: 4
Total votes:      1
Quorum:           3 Activity blocked
Flags:

Membership information
----------------------
    Nodeid      Votes Name
0x00000002          1 10.10.1.23 (local)
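Finally, in case the redundant link itself is part of the problem, this is the basic reachability check I was going to run from the failed node over both corosync networks (addresses taken from the nodelist above):

Code:
# ring0 (10.10.1.x) - the network behind the failed optic/cable
ping -c 3 10.10.1.20; ping -c 3 10.10.1.21; ping -c 3 10.10.1.22

# ring1 (10.10.2.x) - the link corosync should be failing over to
ping -c 3 10.10.2.20; ping -c 3 10.10.2.21; ping -c 3 10.10.2.22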
Any help, feedback, or suggestions are greatly appreciated. And if this has been answered here before, feel free to flame me =>.