Nodes going red

I still think it's network related; all signs so far point to it.

You could set up detailed network monitoring on all hosts / switches / VMs and see what is happening at the time your nodes go red.
 
Detailed monitoring is a good idea. Currently I use monit and our own log-check scripts.
Around the time of the last red issue, the mail and IMAP servers reported an LDAP connection problem for less than 5 seconds; then those systems recovered.

So what caused the issue in the first place is unknown.

Could you suggest a system to use for detailed monitoring?

Also - any clues on why sys5 would be able to write to /etc/pve and the others not?
 
zenoss, OMD-Labs, openItMonitor, ... - any of them would do the job.

I've started trying to set up different monitoring systems over the years, but as a one-person shop with other things going on, the time needed to set one up was not feasible.

Could you narrow the suggestion list down (free or not - I do not believe in getting something for nothing) to the easiest to get up and running? (Unless that would get you in trouble.)
 
A free and easy test would be to run a ping between every host and log the output:
Code:
ping host1 > /tmp/ping_host1.dump
ping host2 > /tmp/ping_host2.dump
......
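Plain ping output carries no wall-clock timestamps, which makes it hard to line a drop up with the moment a node went red. A small sketch that stamps each line before logging it (host names are placeholders, as above):

```shell
# Timestamp every ping line so an outage can be matched against syslog.
# "host1" and "host2" are placeholders for your actual node names/IPs.
for h in host1 host2; do
  ping "$h" | while read -r line; do
    printf '%s %s\n' "$(date '+%F %T')" "$line"
  done > "/tmp/ping_${h}.dump" &
done
```

Each loop runs in the background, so one shell can watch all hosts at once.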
 
So I deleted sys5 and added s5 using the same hardware.

When I added s5, it got stuck at "waiting for quorum...":
Code:
root@10.1.10.181's password: 
copy corosync auth key
stopping pve-cluster service
backup old database
waiting for quorum...

On another node, syslog suggests the node add worked:
Code:
Dec  1 14:49:35 dell1 pmxcfs[28948]: [dcdb] notice: wrote new corosync config '/etc/corosync/corosync.conf' (version = 17)
Dec  1 14:49:35 dell1 corosync[7779]:  [CFG   ] Config reload requested by node 3
Dec  1 14:49:35 dell1 pmxcfs[28948]: [status] notice: update cluster info (cluster name  cluster-v4, version = 17)
Dec  1 14:49:40 dell1 corosync[7779]:  [TOTEM ] A new membership (10.1.10.19:12312) was formed. Members joined: 4
Dec  1 14:49:40 dell1 corosync[7779]:  [QUORUM] Members[4]: 4 2 1 3
Dec  1 14:49:40 dell1 corosync[7779]:  [MAIN  ] Completed service synchronization, ready to provide service.

more info:
Code:
dell1  ~ # pvecm nodes

Membership information
----------------------
    Nodeid      Votes Name
         4          1 s5
         2          1 sys4
         1          1 sys3
         3          1 dell1 (local)

dell1  ~ # pvecm status
Quorum information
------------------
Date:             Tue Dec  1 14:58:49 2015
Quorum provider:  corosync_votequorum
Nodes:            4
Node ID:          0x00000003
Ring ID:          12312
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   4
Highest expected: 4
Total votes:      4
Quorum:           3  
Flags:            Quorate 

Membership information
----------------------
    Nodeid      Votes Name
0x00000004          1 10.1.10.19
0x00000002          1 10.1.10.41
0x00000001          1 10.1.10.42
0x00000003          1 10.1.10.181 (local)

Node s5 is still stuck at 'waiting for quorum...'.

On the PVE web pages of sys3, sys4 and dell1, s5 shows red and all other nodes green.

On the s5 web page, only s5 is shown - no other nodes.

Any suggestions to fix this?
 
The issue is probably related to SSH known_hosts. When I try from dell1 to ssh to s5:

Code:
# ssh 10.1.10.19
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@    WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!     @

I suppose I should have deleted that host key beforehand.

Anyway, do you think a reinstall is needed?

Or can I press Ctrl-C at 'waiting for quorum...' and proceed?


Also, in the s5 syslog:
Code:
Dec  1 15:08:33 s5 pveproxy[6487]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/HTTPServer.pm line 1631.
Dec  1 15:08:33 s5 pveproxy[5248]: EV: error in callback (ignoring): Can't call method "push_write" on an undefined value at /usr/share/perl5/PVE/HTTPServer.pm line 295.
Dec  1 15:08:34 s5 pveproxy[5248]: problem with client 10.1.10.42; rsa_padding_check_pkcs1_type_1: block type is not 01
 
I'll just reinstall... and delete the known-host entry before adding the node,

i.e. do ssh-keygen -f "/etc/ssh/ssh_known_hosts" -R 10.1.10.19 before adding...
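For reference, a stale key can live in root's personal known_hosts as well as the system-wide file, so it may be worth clearing both. A sketch using the IP from this thread (paths are the usual OpenSSH locations):

```shell
# Remove any stale host key for the readded node (10.1.10.19) from both
# the system-wide and root's personal known_hosts files, if they exist.
for f in /etc/ssh/ssh_known_hosts /root/.ssh/known_hosts; do
  if [ -f "$f" ]; then
    ssh-keygen -f "$f" -R 10.1.10.19
  fi
done
```

ssh-keygen -R leaves a backup copy of each edited file as "$f.old".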

OK sys5 reinstalled and all green.
 
Did 4.1 fix this?

I still think it's due to jitter.

The corosync config now has a multicast key; before 4.1 that key did not exist. It is possible that the lack of a totem.interface.0.mcastaddr key interfered with multicast.
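A quick way to check whether a node's config carries that key (a sketch; /etc/corosync/corosync.conf is the standard path on a Proxmox node):

```shell
# Report whether corosync.conf defines an explicit multicast address.
conf=/etc/corosync/corosync.conf
if grep -q 'mcastaddr' "$conf"; then
  grep 'mcastaddr' "$conf"
else
  echo "no mcastaddr key found in $conf"
fi
```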

Time will tell if the red-node issue is solved.
 
