Nodes going red

I still think it's network related; all signs so far point to it.

You could set up detailed network monitoring on all hosts / switches / VMs and see what is happening at the time your nodes go red.
 
Detailed monitoring is a good idea. Currently I use monit and our own log-check scripts.
Around the time of the last red issue, the mail and IMAP servers reported an LDAP connection problem for less than 5 seconds; then those systems recovered.

So what caused the issue in the first place is unknown.

Could you suggest a system to use for detailed monitoring?

Also - any clues on why sys5 would be able to write to /etc/pve and the others not?
 
zenoss, OMD-Labs, openItMonitor, ... - any of them would do the job.

I've started trying to set up different monitoring systems over the years, but as a one-person shop with other things going on, the time needed to set one up was not feasible.

Could you narrow the suggestion list down (free or not - I do not believe in getting something for nothing) to the easiest to get up and running? (Unless that would get you in trouble.)
 
A free and easy test would be to run a ping between every host and log the output:
Code:
ping host1 > /tmp/ping_host1.dump
ping host2 > /tmp/ping_host2.dump
......
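Plain ping output carries no wall-clock timestamps, which makes it hard to line a drop up with the moment a node went red. A small sketch that stamps each line before logging it (host names are placeholders, as above):

```shell
# Timestamp every ping line so an outage can be matched against syslog.
# "host1" and "host2" are placeholders for your actual node names/IPs.
for h in host1 host2; do
  ping "$h" | while read -r line; do
    printf '%s %s\n' "$(date '+%F %T')" "$line"
  done > "/tmp/ping_${h}.dump" &
done
```

Each loop runs in the background, so one shell can watch all hosts at once.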
 
So I deleted sys5 and added s5 using the same hardware.

When I added s5, it got stuck at "waiting for quorum...":
Code:
root@10.1.10.181's password: 
copy corosync auth key
stopping pve-cluster service
backup old database
waiting for quorum...

On another node, syslog suggests the node add worked:
Code:
Dec  1 14:49:35 dell1 pmxcfs[28948]: [dcdb] notice: wrote new corosync config '/etc/corosync/corosync.conf' (version = 17)
Dec  1 14:49:35 dell1 corosync[7779]:  [CFG   ] Config reload requested by node 3
Dec  1 14:49:35 dell1 pmxcfs[28948]: [status] notice: update cluster info (cluster name  cluster-v4, version = 17)
Dec  1 14:49:40 dell1 corosync[7779]:  [TOTEM ] A new membership (10.1.10.19:12312) was formed. Members joined: 4
Dec  1 14:49:40 dell1 corosync[7779]:  [QUORUM] Members[4]: 4 2 1 3
Dec  1 14:49:40 dell1 corosync[7779]:  [MAIN  ] Completed service synchronization, ready to provide service.

more info:
Code:
dell1  ~ # pvecm nodes

Membership information
----------------------
    Nodeid      Votes Name
         4          1 s5
         2          1 sys4
         1          1 sys3
         3          1 dell1 (local)

dell1  ~ # pvecm status
Quorum information
------------------
Date:             Tue Dec  1 14:58:49 2015
Quorum provider:  corosync_votequorum
Nodes:            4
Node ID:          0x00000003
Ring ID:          12312
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   4
Highest expected: 4
Total votes:      4
Quorum:           3  
Flags:            Quorate 

Membership information
----------------------
    Nodeid      Votes Name
0x00000004          1 10.1.10.19
0x00000002          1 10.1.10.41
0x00000001          1 10.1.10.42
0x00000003          1 10.1.10.181 (local)

Node s5 is still stuck at 'waiting for quorum...'.

On the PVE web pages of sys3, sys4 and dell1, s5 shows red and all other nodes green.

On the s5 web page, only s5 is shown - no other nodes.

Any suggestions to fix this?
 
The issue is probably related to SSH known_hosts. When I try from dell1 to ssh to s5:

Code:
# ssh 10.1.10.19
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@    WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!     @

I suppose I should have deleted that host key beforehand.

Anyway, do you think a reinstall is needed?

Or can I press Ctrl-C at 'waiting for quorum...' and proceed?


Also, in the s5 syslog:
Code:
Dec  1 15:08:33 s5 pveproxy[6487]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/HTTPServer.pm line 1631.
Dec  1 15:08:33 s5 pveproxy[5248]: EV: error in callback (ignoring): Can't call method "push_write" on an undefined value at /usr/share/perl5/PVE/HTTPServer.pm line 295.
Dec  1 15:08:34 s5 pveproxy[5248]: problem with client 10.1.10.42; rsa_padding_check_pkcs1_type_1: block type is not 01
 
I'll just reinstall... and delete the known-host entry before adding the node,

i.e. do ssh-keygen -f "/etc/ssh/ssh_known_hosts" -R 10.1.10.19 before adding...
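For reference, a stale key can live in root's personal known_hosts as well as the system-wide file, so it may be worth clearing both. A sketch using the IP from this thread (paths are the usual OpenSSH locations):

```shell
# Remove any stale host key for the readded node (10.1.10.19) from both
# the system-wide and root's personal known_hosts files, if they exist.
for f in /etc/ssh/ssh_known_hosts /root/.ssh/known_hosts; do
  if [ -f "$f" ]; then
    ssh-keygen -f "$f" -R 10.1.10.19
  fi
done
```

ssh-keygen -R leaves a backup copy of each edited file as "$f.old".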

OK sys5 reinstalled and all green.
 
Did 4.1 fix this?

I still think it's due to jitter.

The corosync config now has a multicast key; before 4.1 that key did not exist. It is possible that the lack of a totem.interface.0.mcastaddr key interfered with multicast.
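A quick way to check whether a node's config carries that key (a sketch; /etc/corosync/corosync.conf is the standard path on a Proxmox node):

```shell
# Report whether corosync.conf defines an explicit multicast address.
conf=/etc/corosync/corosync.conf
if grep -q 'mcastaddr' "$conf"; then
  grep 'mcastaddr' "$conf"
else
  echo "no mcastaddr key found in $conf"
fi
```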

Time will tell if the red-node issue is solved.
 
