cluster pve 5.4 network strangeness

halley80

Feb 13, 2019
Hi,

My cluster consists of 4 nodes (Dell R440).
Networks:
192.168.38.0/24 (private network for the VMs)
193.49.x.y (public network for the VMs)

The VMs on the first three nodes communicate perfectly with an external physical server (public network).

But the physical server fails to ping a VM on the fourth node (each node has one public network card on 193.49.x.y).
The fourth node is configured in the same way as the first three.

However, when I ping the physical server from a VM on the fourth node, the connection is established only with difficulty
(see the response times):

********************************************
root@nas-pve-prod:~# ping 193.49.x.y
PING 193.49.x.y (193.49.x.y) 56(84) bytes of data.
64 bytes from 193.49.201.164: icmp_seq=4 ttl=64 time=2112 ms
64 bytes from 193.49.201.164: icmp_seq=5 ttl=64 time=1088 ms
64 bytes from 193.49.201.164: icmp_seq=6 ttl=64 time=64.2 ms

********************************************

The ping then works from the physical server to the VM on the fourth node.

Have you encountered this problem before?
Thanking you for your answers.
Yours sincerely,
 
The ping then works from the physical server to the VM on the fourth node.

Does it stay that way then or revert back to being broken after a while?

It sounds like an ARP issue, if I had to guess. Try running arp -a / ip neigh on the physical server both when it's not working and when it is, then compare the results. tcpdump -i any arp could also be helpful.
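The comparison suggested above can be sketched as follows — a minimal shell example using made-up neighbour-table captures (the IPs, MAC addresses, and interface name are sample values standing in for real ip neigh output, not data from this cluster):

```shell
# Sketch: diff the neighbour cache between a broken and a working capture.
# In practice you would save 'ip neigh' output at both moments, e.g.:
#   ip neigh > /tmp/neigh-broken.txt    (while the ping fails)
#   ip neigh > /tmp/neigh-working.txt   (while the ping works)
# The entries below are sample data mimicking that output.

broken=$(mktemp)
working=$(mktemp)

cat > "$broken" <<'EOF'
193.49.201.188 dev ens18 FAILED
193.49.201.164 dev ens18 lladdr 00:1e:0b:c1:d3:02 REACHABLE
EOF

cat > "$working" <<'EOF'
193.49.201.188 dev ens18 lladdr 52:54:00:12:34:56 REACHABLE
193.49.201.164 dev ens18 lladdr 00:1e:0b:c1:d3:02 REACHABLE
EOF

# Lines present in only one capture reveal which entry flips state.
changed=$(diff "$broken" "$working" | grep '^[<>]')
echo "$changed"

rm -f "$broken" "$working"
```

An entry stuck in FAILED/incomplete in the broken capture, or one that resolves to an unexpected MAC, points at where the ARP problem lies.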
 
Hi

Thank you for your response.
Indeed it becomes "broken" again after a while.

The tests with arp:
arp -a 193.49.201.188
? (193.49.201.188) at <incomplete> on ens18
arp -a 193.49.201.164
sakai.univ-littoral.fr (193.49.201.164) at 00:1e:0b:c1:d3:02 [ether] on ens18
arp -a 193.49.201.188
? (193.49.201.188) at <incomplete> on ens18
arp -a 193.49.201.164
sakai.univ-littoral.fr (193.49.201.164) at 00:1e:0b:c1:d3:02 [ether] on ens18

I think I have a problem with the known_hosts file on the fourth node (/etc/pve/priv/known_hosts):

Every time I connect to the third node, it asks me to accept the SSH host key:
root@ipmpve6:~# ssh root@ipmpve5
Warning: the RSA host key for 'ipmpve5' differs from the key for the IP address '192.168.38.105'.
Offending key for IP in /etc/ssh/ssh_known_hosts:8
Matching host key in /etc/ssh/ssh_known_hosts:11
Are you sure you want to continue connecting (yes/no)?

If I delete the line in question on this node, will the "pve" cluster be impacted?
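For reference, removing a single entry by its reported line number would look like this — sketched on a throwaway sample file with made-up entries, not on the live /etc/ssh/ssh_known_hosts:

```shell
# Sketch: deleting the offending known_hosts entry by line number, as
# reported in the SSH warning ("Offending key for IP in ...:8").
# The file and keys below are made-up sample data.

f=$(mktemp)
cat > "$f" <<'EOF'
ipmpve4 ssh-rsa SAMPLEKEY1
192.168.38.105 ssh-rsa STALEKEY
ipmpve5 ssh-rsa SAMPLEKEY2
EOF

cp "$f" "$f.bak"     # keep a backup before touching the file
sed -i '2d' "$f"     # drop line 2 (the stale entry for the IP)

remaining=$(cat "$f")
echo "$remaining"

rm -f "$f" "$f.bak"
```

Note that on a Proxmox cluster /etc/pve is the cluster filesystem (pmxcfs), so files under /etc/pve/priv are shared across nodes; hand-editing them is riskier than on a standalone host.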

This cluster is in production with 50 VMs ...

Thank you for your answers.
Sincerely
 
Try running 'pvecm updatecerts' to update the known_hosts file.

The thing is, however, that /etc/pve/* should be the same on *all* nodes in the cluster, at all times. If you only experience this on one node (i.e. all other nodes can connect to the third node), your cluster is broken already.

What does 'pvecm status' report? Anything in journalctl? (on all nodes)
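One way to check whether the cluster filesystem really diverges is to compare checksums of the same file on every node — for instance, ssh to each node and run sha256sum on /etc/pve/priv/known_hosts (the loop in the comment uses placeholder node names). A minimal local sketch of the comparison itself:

```shell
# Sketch: detecting a diverged file by comparing checksums.
# On the real cluster you would gather one checksum per node, e.g.:
#   for n in nodeA nodeB nodeC nodeD; do    # placeholder node names
#       ssh root@$n sha256sum /etc/pve/priv/known_hosts
#   done
# Demonstrated here on two local files standing in for two nodes' copies.

a=$(mktemp)
b=$(mktemp)
printf 'same content\n' > "$a"
printf 'different content\n' > "$b"

sum_a=$(sha256sum "$a" | cut -d' ' -f1)
sum_b=$(sha256sum "$b" | cut -d' ' -f1)

if [ "$sum_a" = "$sum_b" ]; then
    echo "files match"
else
    echo "files differ -> cluster filesystem out of sync"
fi

rm -f "$a" "$b"
```

If the checksums differ between nodes, /etc/pve is not being replicated, which matches the "broken cluster" scenario described above.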

Another wild guess here, but could it be an IP conflict? That could explain the wrong SSH host key. Or did you maybe remove a node from the cluster at one point? Or reinstall one?
 
Yes, you're right.

I had to reinstall nodes 5 & 6 because I encountered routing problems, which are now fixed.

I'm the system administrator of the cluster, but not the network administrator (it's a university).

And yes, the known_hosts files are not the same on all nodes ...

If I run "pvecm updatecerts" on all nodes, will there be an impact on the production VMs or on the nodes?

Thanks again for your answers
Yours sincerely,
 

Attachments

I had to reinstall nodes 5 & 6 because I encountered routing problems, which are now fixed.

The status you posted only shows 4 nodes in total? Which nodes are you referring to here?

If I run "pvecm updatecerts" on all nodes, will there be an impact on the production VMs or on the nodes?

The VMs will not be affected, especially if you're not using HA. However, if the files currently differ between nodes, it won't help: differing files mean a broken cluster, and no update command can reach nodes that are not in the quorum.
 
Yes, excuse me, those are the PVE hostnames, but I have four nodes.

I don't use HA (for the moment), but I don't understand.
Can running "pvecm updatecerts" on each node fix the problem?

Thanks again,
 
If you don't use HA, even if your cluster breaks entirely, your VMs will keep running, yes.
 
All right. Thank you.
I'll run the command 'pvecm updatecerts'.
I'll be sure to keep the community informed.

In case it can be of use to other admins.
Yours sincerely,
 
