Good Morning
So I have been having this problem consistently with various nodes within my cluster for about a month now.
I have 4 nodes in my cluster (px1, px2, px3, px5).
Here are the specs for PX1, PX2, and PX3:
(2) 500GB SATA drives
(2) AMD Opteron 6168
64GB DDR3
(2) Intel 8257EB Gigabit Ethernet
(2) Intel 82576 Gigabit Ethernet
Here are the specs for PX5:
(2) 500GB SATA drives
(2) AMD Opteron 6234
64GB DDR3
(4) Intel 82576 Gigabit Ethernet
The problem is that one or two nodes will randomly lose quorum. corosync appears to be running normally when it suddenly logs a FAILED TO RECEIVE error, and the node drops out of the cluster.
In this example, PX3 has lost quorum.
Here are the logs:
Code:
Aug 20 09:19:05 px3 corosync[3677]: [TOTEM ] Retransmit List: a1c a1d
Aug 20 09:19:05 px3 corosync[3677]: [TOTEM ] Retransmit List: a1e a1f
Aug 20 09:19:05 px3 corosync[3677]: [TOTEM ] Retransmit List: a1c a1d
Aug 20 09:19:05 px3 corosync[3677]: [TOTEM ] Retransmit List: a1e a1f
Aug 20 09:19:05 px3 corosync[3677]: [TOTEM ] FAILED TO RECEIVE
Aug 20 09:19:29 px3 pmxcfs[3520]: [quorum] crit: quorum_dispatch failed: 2
Aug 20 09:19:29 px3 dlm_controld[3750]: cluster is down, exiting
Aug 20 09:19:29 px3 dlm_controld[3750]: daemon cpg_dispatch error 2
Aug 20 09:19:29 px3 fenced[3731]: cluster is down, exiting
Aug 20 09:19:29 px3 fenced[3731]: daemon cpg_dispatch error 2
Aug 20 09:19:29 px3 pmxcfs[3520]: [libqb] warning: epoll_ctl(del): Bad file descriptor (9)
Aug 20 09:19:29 px3 pmxcfs[3520]: [confdb] crit: confdb_dispatch failed: 2
Aug 20 09:19:31 px3 kernel: dlm: closing connection to node 1
Aug 20 09:19:31 px3 kernel: dlm: closing connection to node 4
Aug 20 09:19:31 px3 kernel: dlm: closing connection to node 3
Aug 20 09:19:32 px3 pmxcfs[3520]: [status] crit: cpg_send_message failed: 2
Aug 20 09:19:32 px3 pmxcfs[3520]: [status] crit: cpg_send_message failed: 2
Aug 20 09:19:34 px3 pmxcfs[3520]: [status] crit: cpg_send_message failed: 2
Aug 20 09:19:34 px3 pmxcfs[3520]: [status] crit: cpg_send_message failed: 2
Aug 20 09:19:35 px3 pmxcfs[3520]: [libqb] warning: epoll_ctl(del): Bad file descriptor (9)
Aug 20 09:19:35 px3 pmxcfs[3520]: [dcdb] crit: cpg_dispatch failed: 2
Aug 20 09:19:36 px3 pmxcfs[3520]: [status] crit: cpg_send_message failed: 2
Aug 20 09:19:36 px3 pmxcfs[3520]: [status] crit: cpg_send_message failed: 2
Aug 20 09:19:37 px3 pmxcfs[3520]: [dcdb] crit: cpg_leave failed: 2
Aug 20 09:19:38 px3 pmxcfs[3520]: [status] crit: cpg_send_message failed: 2
Aug 20 09:19:38 px3 pmxcfs[3520]: [status] crit: cpg_send_message failed: 2
Aug 20 09:19:39 px3 pmxcfs[3520]: [libqb] warning: epoll_ctl(del): Bad file descriptor (9)
Aug 20 09:19:39 px3 pmxcfs[3520]: [dcdb] crit: cpg_dispatch failed: 2
Aug 20 09:19:40 px3 pmxcfs[3520]: [status] crit: cpg_send_message failed: 2
Aug 20 09:19:40 px3 pmxcfs[3520]: [status] crit: cpg_send_message failed: 2
Aug 20 09:19:42 px3 pmxcfs[3520]: [dcdb] crit: cpg_leave failed: 2
Aug 20 09:19:44 px3 pmxcfs[3520]: [status] crit: cpg_send_message failed: 2
Aug 20 09:19:44 px3 pmxcfs[3520]: [status] crit: cpg_send_message failed: 2
Aug 20 09:19:46 px3 pmxcfs[3520]: [libqb] warning: epoll_ctl(del): Bad file descriptor (9)
Aug 20 09:19:46 px3 pmxcfs[3520]: [quorum] crit: quorum_initialize failed: 6
Aug 20 09:19:46 px3 pmxcfs[3520]: [quorum] crit: can't initialize service
Aug 20 09:19:46 px3 pmxcfs[3520]: [confdb] crit: confdb_initialize failed: 6
Aug 20 09:19:46 px3 pmxcfs[3520]: [quorum] crit: can't initialize service
Aug 20 09:19:46 px3 pmxcfs[3520]: [dcdb] notice: start cluster connection
Aug 20 09:19:46 px3 pmxcfs[3520]: [dcdb] crit: cpg_initialize failed: 6
Aug 20 09:19:46 px3 pmxcfs[3520]: [quorum] crit: can't initialize service
Aug 20 09:19:48 px3 pmxcfs[3520]: [status] crit: cpg_send_message failed: 2
Aug 20 09:19:48 px3 pmxcfs[3520]: [status] crit: cpg_send_message failed: 2
Aug 20 09:19:48 px3 pmxcfs[3520]: [dcdb] notice: start cluster connection
Aug 20 09:19:48 px3 pmxcfs[3520]: [dcdb] crit: cpg_initialize failed: 6
Aug 20 09:19:48 px3 pmxcfs[3520]: [quorum] crit: can't initialize service
Aug 20 09:19:48 px3 pmxcfs[3520]: [status] crit: cpg_send_message failed: 9
Aug 20 09:19:48 px3 pmxcfs[3520]: [status] crit: cpg_send_message failed: 9
Aug 20 09:19:48 px3 pmxcfs[3520]: [status] crit: cpg_send_message failed: 9
Aug 20 09:19:48 px3 pmxcfs[3520]: [status] crit: cpg_send_message failed: 9
Aug 20 09:19:48 px3 pmxcfs[3520]: [status] crit: cpg_send_message failed: 9
Aug 20 09:19:48 px3 pmxcfs[3520]: [status] crit: cpg_send_message failed: 9
Aug 20 09:19:48 px3 pmxcfs[3520]: [status] crit: cpg_send_message failed: 9
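If it would help, the next time a node drops out I can capture the cluster state from the affected node right away; I am assuming these are the most useful status commands to post:
Code:
pvecm status          # quorum and membership as seen by this node
clustat               # cman/rgmanager membership view
corosync-cfgtool -s   # totem ring status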
Also, here are the logs from PX1 for the same period, while PX3 had lost quorum:
Code:
Aug 20 09:19:29 px1 corosync[3824]: [TOTEM ] Process pause detected for 12061 ms, flushing membership messages.
Aug 20 09:19:29 px1 corosync[3824]: [TOTEM ] Process pause detected for 12103 ms, flushing membership messages.
Aug 20 09:19:29 px1 corosync[3824]: [TOTEM ] Process pause detected for 12131 ms, flushing membership messages.
Aug 20 09:19:29 px1 corosync[3824]: [TOTEM ] Process pause detected for 12201 ms, flushing membership messages.
Aug 20 09:19:29 px1 corosync[3824]: [CLM ] CLM CONFIGURATION CHANGE
Aug 20 09:19:29 px1 corosync[3824]: [CLM ] New Configuration:
Aug 20 09:19:29 px1 corosync[3824]: [CLM ] #011r(0) ip(10.10.12.230)
Aug 20 09:19:29 px1 corosync[3824]: [CLM ] #011r(0) ip(10.10.12.233)
Aug 20 09:19:29 px1 corosync[3824]: [CLM ] Members Left:
Aug 20 09:19:29 px1 corosync[3824]: [CLM ] Members Joined:
Aug 20 09:19:29 px1 corosync[3824]: [CLM ] CLM CONFIGURATION CHANGE
Aug 20 09:19:29 px1 corosync[3824]: [CLM ] New Configuration:
Aug 20 09:19:29 px1 corosync[3824]: [CLM ] #011r(0) ip(10.10.12.230)
Aug 20 09:19:29 px1 corosync[3824]: [CLM ] #011r(0) ip(10.10.12.233)
Aug 20 09:19:29 px1 corosync[3824]: [CLM ] Members Left:
Aug 20 09:19:29 px1 corosync[3824]: [CLM ] Members Joined:
Aug 20 09:19:29 px1 corosync[3824]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Aug 20 09:19:29 px1 corosync[3824]: [CPG ] chosen downlist: sender r(0) ip(10.10.12.230) ; members(old:2 left:0)
Aug 20 09:19:29 px1 corosync[3824]: [MAIN ] Completed service synchronization, ready to provide service.
Aug 20 09:19:46 px1 pvedaemon[134410]: <root@pam> successful auth for user 'root@pam'
Aug 20 09:21:11 px1 pvedaemon[134803]: starting vnc proxy UPID:px1:00020E93:00C7F417:503263F7:vncshell::root@pam:
Aug 20 09:21:11 px1 pvedaemon[134803]: launch command: /usr/bin/vncterm -rfbport 5901 -timeout 10 -authpath /nodes/px3 -perm Sys.Console -c /usr/bin/ssh -c blowfish-cbc -t 10.10.12.232 /bin/bash -l
Aug 20 09:21:11 px1 pvedaemon[134410]: <root@pam> starting task UPID:px1:00020E93:00C7F417:503263F7:vncshell::root@pam:
Aug 20 09:21:11 px1 pvedaemon[134687]: <root@pam> successful auth for user 'root@pam'
Aug 20 09:21:12 px1 pvedaemon[134687]: <root@pam> successful auth for user 'root@pam'
Aug 20 09:21:52 px1 pvedaemon[134410]: <root@pam> successful auth for user 'root@pam'
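I can also enable corosync debug logging cluster-wide if more verbose logs would be useful. My understanding, which is an assumption on my part based on the cman documentation, is that on Proxmox 2.x this is done through /etc/pve/cluster.conf:
Code:
# assumption: add the following inside the <cluster> element of
# /etc/pve/cluster.conf, and increment config_version so it propagates:
#     <logging debug="on"/>
# then validate the edited config before activating it:
ccs_config_validate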
We are a school district and upgraded from Proxmox 1.9 to 2.1 over the summer. We were having similar problems in 1.9, though not exactly the same ones, and it was much more stable. We had hoped 2.1 would resolve many of the issues we were having; sometimes it runs smoothly, but most of the time this is the type of activity we are seeing.
Please let me know if there is any other information from the nodes that I should post; I would be glad to grab it.
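In the meantime, I was planning to test multicast between all four nodes with omping, since I have read that corosync's FAILED TO RECEIVE can be caused by multicast traffic being dropped (for example by IGMP snooping on a switch); I am assuming that is a sensible check here and can post the results:
Code:
# run at the same time on px1, px2, px3 and px5 (about 10 minutes of probes)
omping -c 600 -i 1 -q px1 px2 px3 px5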
Thank you in advance for any help. This is an urgent situation, as we continue to have random downtime during school hours due to these issues.
Best Regards,
Jared Planter
I.T. Director
Escondido Charter High School