cluster pve 5.4 network strangeness

halley80

Feb 13, 2019
Hi,

My cluster consists of 4 nodes (Dell R440).
Networks:
192.168.38.0/24 (private network for the VMs)
193.49.x.y (public network for the VMs)

The VMs on the first three nodes communicate perfectly with an external physical server (public network).

But the physical server fails to ping a VM on the fourth node (each node has one public network card on 193.49.x.y).
The fourth node is configured in the same way as the first three.

However, when I ping the physical server from a VM on the fourth node, the connection is established only with difficulty
(see the response times):

********************************************
root@nas-pve-prod:~# ping 193.49.x.y
PING 193.49.x.y (193.49.x.y) 56(84) bytes of data.
64 bytes from 193.49.201.164: icmp_seq=4 ttl=64 time=2112 ms
64 bytes from 193.49.201.164: icmp_seq=5 ttl=64 time=1088 ms
64 bytes from 193.49.201.164: icmp_seq=6 ttl=64 time=64.2 ms

********************************************

The ping then works from the physical server to the VM on the fourth node.

Have you encountered this problem before?
Thanking you for your answers.
Yours sincerely,
 
The ping then works from the physical server to the VM on the fourth node.

Does it stay that way then or revert back to being broken after a while?

It sounds like an ARP issue, if I had to guess. Try running arp -a / ip neigh on the physical server both when it's not working and when it is, then compare the results. tcpdump -i any arp could also be helpful.
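The comparison suggested above can be sketched as follows — a minimal shell example using made-up neighbour-table captures (the IPs, MAC addresses, and interface name are sample values standing in for real ip neigh output, not data from this cluster):

```shell
# Sketch: diff the neighbour cache between a broken and a working capture.
# In practice you would save 'ip neigh' output at both moments, e.g.:
#   ip neigh > /tmp/neigh-broken.txt    (while the ping fails)
#   ip neigh > /tmp/neigh-working.txt   (while the ping works)
# The entries below are sample data mimicking that output.

broken=$(mktemp)
working=$(mktemp)

cat > "$broken" <<'EOF'
193.49.201.188 dev ens18 FAILED
193.49.201.164 dev ens18 lladdr 00:1e:0b:c1:d3:02 REACHABLE
EOF

cat > "$working" <<'EOF'
193.49.201.188 dev ens18 lladdr 52:54:00:12:34:56 REACHABLE
193.49.201.164 dev ens18 lladdr 00:1e:0b:c1:d3:02 REACHABLE
EOF

# Lines present in only one capture reveal which entry flips state.
changed=$(diff "$broken" "$working" | grep '^[<>]')
echo "$changed"

rm -f "$broken" "$working"
```

An entry stuck in FAILED/incomplete in the broken capture, or one that resolves to an unexpected MAC, points at where the ARP problem lies.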
 
Hi

Thank you for your response.
Indeed it becomes "broken" again after a while.

The tests with arp:
arp -a 193.49.201.188
? (193.49.201.188) at <incomplete> on ens18
arp -a 193.49.201.164
sakai.univ-littoral.fr (193.49.201.164) at 00:1e:0b:c1:d3:02 [ether] on ens18
arp -a 193.49.201.188
? (193.49.201.188) at <incomplete> on ens18
arp -a 193.49.201.164
sakai.univ-littoral.fr (193.49.201.164) at 00:1e:0b:c1:d3:02 [ether] on ens18

I think I have a problem with the known_hosts file on the fourth node (/etc/pve/priv/known_hosts):

Every time I connect to the third node, it asks me to accept the SSH host key:
root@ipmpve6:~# ssh root@ipmpve5
Warning: the RSA host key for 'ipmpve5' differs from the key for the IP address '192.168.38.105'.
Offending key for IP in /etc/ssh/ssh_known_hosts:8
Matching host key in /etc/ssh/ssh_known_hosts:11
Are you sure you want to continue connecting (yes/no)?

If I delete the line in question on this node, will the "pve" cluster be impacted?
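For reference, removing a single entry by its reported line number would look like this — sketched on a throwaway sample file with made-up entries, not on the live /etc/ssh/ssh_known_hosts:

```shell
# Sketch: deleting the offending known_hosts entry by line number, as
# reported in the SSH warning ("Offending key for IP in ...:8").
# The file and keys below are made-up sample data.

f=$(mktemp)
cat > "$f" <<'EOF'
ipmpve4 ssh-rsa SAMPLEKEY1
192.168.38.105 ssh-rsa STALEKEY
ipmpve5 ssh-rsa SAMPLEKEY2
EOF

cp "$f" "$f.bak"     # keep a backup before touching the file
sed -i '2d' "$f"     # drop line 2 (the stale entry for the IP)

remaining=$(cat "$f")
echo "$remaining"

rm -f "$f" "$f.bak"
```

Note that on a Proxmox cluster /etc/pve is the cluster filesystem (pmxcfs), so files under /etc/pve/priv are shared across nodes; hand-editing them is riskier than on a standalone host.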

This cluster is in production with 50 VMs ...

Thank you for your answers.
Sincerely
 
Try running 'pvecm updatecerts' to update the known_hosts file.

The thing is, however, that /etc/pve/* should be the same on *all* nodes in the cluster, at all times. If you only experience this on one node (i.e. all other nodes can connect to the third node), your cluster is broken already.

What does 'pvecm status' report? Anything in journalctl? (on all nodes)
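One way to check whether the cluster filesystem really diverges is to compare checksums of the same file on every node — for instance, ssh to each node and run sha256sum on /etc/pve/priv/known_hosts (the loop in the comment uses placeholder node names). A minimal local sketch of the comparison itself:

```shell
# Sketch: detecting a diverged file by comparing checksums.
# On the real cluster you would gather one checksum per node, e.g.:
#   for n in nodeA nodeB nodeC nodeD; do    # placeholder node names
#       ssh root@$n sha256sum /etc/pve/priv/known_hosts
#   done
# Demonstrated here on two local files standing in for two nodes' copies.

a=$(mktemp)
b=$(mktemp)
printf 'same content\n' > "$a"
printf 'different content\n' > "$b"

sum_a=$(sha256sum "$a" | cut -d' ' -f1)
sum_b=$(sha256sum "$b" | cut -d' ' -f1)

if [ "$sum_a" = "$sum_b" ]; then
    echo "files match"
else
    echo "files differ -> cluster filesystem out of sync"
fi

rm -f "$a" "$b"
```

If the checksums differ between nodes, /etc/pve is not being replicated, which matches the "broken cluster" scenario described above.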

Another wild guess here, but could it be an IP conflict? That could explain the wrong SSH host key. Or did you maybe remove a node from the cluster at one point? Or reinstall one?
 
Yes, you're right.

I had to reinstall nodes 5 & 6 because I encountered routing problems, which are now fixed.

I'm the system administrator of the cluster, but not the network administrator (it's a university).

And yes, the known_hosts files are not the same on all nodes ...

If I run "pvecm updatecerts" on all nodes, will there be an impact on the production VMs or on the nodes?

Thanks again for your answers
Yours sincerely,
 

Attachments

I had to reinstall nodes 5 & 6 because I encountered routing problems, which are now fixed.

The status you posted only shows 4 nodes in total? Which nodes are you referring to here?

If I run "pvecm updatecerts" on all nodes, will there be an impact on the production VMs or on the nodes?

The VMs will not be affected, especially if you're not using HA. However, if the files currently differ between nodes, it won't help: differing files mean a broken cluster, and no update command can reach nodes that are not in the quorum.
 
Yes, excuse me, those are the PVE hostnames, but I have four nodes.

I don't use HA (for the moment), but I don't understand.
Can running "pvecm updatecerts" on each node fix the problem?

Thanks again,
 
If you don't use HA, even if your cluster breaks entirely, your VMs will keep running, yes.
 
All right. Thank you.
I'll run the command 'pvecm updatecerts'.
I'll be sure to keep the community informed.

In case it can be of use to other admins.
Yours sincerely,
 
