Hello,
I have been slowly migrating my containers from the old-style OpenVZ containers to LXC, and while things got off to a rocky start, they had been working better lately - until today.
This morning I went to look at a node in the cluster, ran "pct list", and it completely froze the terminal - nothing would interrupt or suspend it. To my dismay, I discovered that only a single cluster member could still run pct or qm, or serve its web interface! That machine seems to believe everything is perfectly normal and happy, but it cannot, of course, communicate with the other machines in the cluster (I get a "Connection Refused (595)" error when I try).
After some quick looking around, I found that on each of the hung nodes I cannot read the contents of /etc/pve/nodes/[local-node-name]. I can see the correct information for all the other nodes, but I cannot restart the pve-cluster daemon (it hangs indefinitely, just like pct and qm).
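For reference, this is roughly what I was trying on the hung nodes (standard systemd units; I didn't save the exact output, so take this as a sketch of what hung):
Code:
# on one of the hung nodes - all of these either hung or never returned
ls /etc/pve/nodes/[local-node-name]   # hangs on the local node's own directory
systemctl restart pve-cluster         # hangs indefinitely, same as pct/qm
pct list                              # hangs, cannot be interrupted or suspended
qm list                               # same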
I can see from /var/log/daemon.log that it thinks a node failed (the one still working):
Code:
Feb 3 03:09:39 node-b corosync[3917]: [TOTEM ] A processor failed, forming new configuration.
Feb 3 03:09:45 node-b corosync[3917]: [TOTEM ] A new membership (10.0.0.8:188) was formed. Members left: 6
Feb 3 03:09:45 node-b corosync[3917]: [TOTEM ] Failed to receive the leave message. failed: 6
At the same time, on the node that everyone else thought had failed, I see this:
Code:
Feb 3 03:09:49 node-f corosync[16522]: [MAIN ] Corosync main process was not scheduled for 14964.5771 ms (threshold is 3400.0000 ms). Consider token timeout increase.
Feb 3 03:10:21 node-f corosync[16522]: [TOTEM ] A processor failed, forming new configuration.
Feb 3 03:10:21 node-f corosync[16522]: [MAIN ] Corosync main process was not scheduled for 32056.3379 ms (threshold is 3400.0000 ms). Consider token timeout increase.
Feb 3 03:10:21 node-f pvestatd[3961]: status update time (37.547 seconds)
Feb 3 03:10:21 node-f pve-firewall[3958]: firewall update time (43.401 seconds)
Feb 3 03:10:21 node-f corosync[16522]: [TOTEM ] A new membership (10.0.0.8:192) was formed. Members joined: 7 5 1 4 3 2 left: 7 5 1 4 3 2
Feb 3 03:10:21 node-f corosync[16522]: [TOTEM ] Failed to receive the leave message. failed: 7 5 1 4 3 2
Feb 3 03:10:21 node-f corosync[16522]: [QUORUM] Members[7]: 7 6 5 1 4 3 2
Feb 3 03:10:21 node-f corosync[16522]: [MAIN ] Completed service synchronization, ready to provide service.
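Since corosync itself suggests "Consider token timeout increase", one thing I'm considering is raising the totem token timeout in /etc/pve/corosync.conf (editing on a quorate node and bumping config_version). The 10000 ms below is just a number I picked, not something I've tested:
Code:
# /etc/pve/corosync.conf - only the relevant part; keep everything else as-is
totem {
  # ... existing settings (version, cluster_name, interface, etc.) ...
  token: 10000    # token timeout in ms; example value only
}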
I eventually rebooted one of the cluster nodes (one that wasn't responding), and when it finally went down (it hung for ~45 minutes on several PVE-related messages, which I thought I had captured but had not), the other machines could all run pct & qm again and /etc/pve/nodes/ is back to normal; however, their web interfaces still do not work.
The web interfaces on the machine that stayed up and on the freshly rebooted machine both work; however, they both get "Connection Refused (595)" messages when trying to talk to any other nodes.
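Now that pmxcfs is responsive again, my plan is to try restarting the proxy/daemon services on the nodes whose web interface is still dead, rather than rebooting them; something like this (assuming the standard PVE 4.x service names):
Code:
# on each node whose web UI still refuses connections
systemctl restart pvedaemon
systemctl restart pveproxy
systemctl restart spiceproxy
Though I'm not sure whether that will clear the stuck pveproxy/spiceproxy start and stop processes visible in the process list below.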
I have production machines on all the other nodes, so while I can move them around and reboot them, that obviously isn't a long-term solution.
I did capture the list of stuck processes before the reboot snapped everything back into a mostly working state:
Code:
node-e# ps auxf | grep -E ' [DR]'
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 28507 0.0 0.1 240680 60296 pts/0 D+ 12:16 0:00 | \_ /usr/bin/perl -T /usr/sbin/pct list
root 31204 0.0 0.1 240668 60296 pts/2 DN 12:21 0:00 \_ /usr/bin/perl -T /usr/sbin/pct
root 8506 0.0 0.0 19760 3008 pts/2 R+ 13:42 0:00 \_ ps auxf
root 2523 0.0 0.1 239812 63544 ? Ds 06:25 0:00 /usr/bin/perl -T /usr/bin/pveproxy stop
root 7400 0.0 0.1 239788 63628 ? Ds 06:32 0:00 /usr/bin/perl -T /usr/bin/pveproxy start
root 9872 0.0 0.1 239232 63240 ? Ds 06:37 0:00 /usr/bin/perl -T /usr/bin/spiceproxy stop
root 12982 0.0 0.1 239784 63780 ? Ds 06:44 0:00 /usr/bin/perl -T /usr/bin/pveproxy start
root 15525 0.0 0.1 239244 63084 ? Ds 06:49 0:00 /usr/bin/perl -T /usr/bin/spiceproxy start
root 6113 0.1 0.1 224316 54808 ? D 13:38 0:00 /usr/bin/perl /usr/sbin/qm list
FWIW, the cluster is a mix of
pve-manager/4.4-5/c43015a5 (running kernel: 4.4.35-1-pve)
and
pve-manager/4.4-5/c43015a5 (running kernel: 4.4.35-2-pve)
but several of the machines that were not responding were running the newer kernel.
Any thoughts on what went wrong, or suggestions for where to go from here? I'd like to get the web interface responding on all the nodes, and of course avoid this situation in the future!
Thanks in advance!