Cluster fails with 1500 byte MTU, works with 1400

CRCinAU

Hi all,

This one's got me kinda stumped... I'm trying to build a 5-node cluster, but I'm getting stuck at three nodes with issues I can't really explain.

So far, I have 3 machines with the following IPs:
192.168.51.1
192.168.133.2
192.168.133.4

If I leave these all set to the default MTU of 1500, then the nodes seem to have intermittent connectivity between them... Logs appear in journalctl as follows:
Code:
Oct 06 13:04:40 cly-pm-tmp corosync[1101]:   [KNET  ] pmtud: possible MTU misconfiguration detected. kernel is reporting MTU: 1500 bytes for host 2 link 0 but the other node is not acknowledging packets of this size. 
Oct 06 13:04:40 cly-pm-tmp corosync[1101]:   [KNET  ] pmtud: This can be caused by this node interface MTU too big or a network device that does not support or has been misconfigured to manage MTU of this size, or packet loss. knet will c 
Oct 06 13:04:41 cly-pm-tmp pmxcfs[959]: [dcdb] notice: cpg_send_message retry 40 
Oct 06 13:04:41 cly-pm-tmp pmxcfs[959]: [dcdb] notice: cpg_send_message retry 40 
Oct 06 13:04:42 cly-pm-tmp pmxcfs[959]: [dcdb] notice: cpg_send_message retry 50 
Oct 06 13:04:42 cly-pm-tmp pmxcfs[959]: [dcdb] notice: cpg_send_message retry 50 
Oct 06 13:04:43 cly-pm-tmp pmxcfs[959]: [dcdb] notice: cpg_send_message retry 60 
Oct 06 13:04:44 cly-pm-tmp pmxcfs[959]: [status] notice: cpg_send_message retry 70 
Oct 06 13:04:44 cly-pm-tmp pmxcfs[959]: [status] notice: cpg_send_message retry 70 
Oct 06 13:04:45 cly-pm-tmp pmxcfs[959]: [dcdb] notice: cpg_send_message retry 80 
Oct 06 13:04:45 cly-pm-tmp pmxcfs[959]: [dcdb] notice: cpg_send_message retry 80 
Oct 06 13:04:46 cly-pm-tmp pmxcfs[959]: [dcdb] notice: cpg_send_message retry 90 
Oct 06 13:04:46 cly-pm-tmp pmxcfs[959]: [dcdb] notice: cpg_send_message retry 90 
Oct 06 13:04:47 cly-pm-tmp pmxcfs[959]: [status] notice: cpg_send_message retry 100 
Oct 06 13:04:47 cly-pm-tmp pmxcfs[959]: [status] notice: cpg_send_message retried 100 times 
Oct 06 13:04:47 cly-pm-tmp pvesr[1673]: error during cfs-locked 'file-replication_cfg' operation: got lock request timeout 
Oct 06 13:04:47 cly-pm-tmp pmxcfs[959]: [status] crit: cpg_send_message failed: 6 
Oct 06 13:04:47 cly-pm-tmp systemd[1]: pvesr.service: Main process exited, code=exited, status=16/n/a 
Oct 06 13:04:47 cly-pm-tmp systemd[1]: pvesr.service: Failed with result 'exit-code'. 
Oct 06 13:04:47 cly-pm-tmp systemd[1]: Failed to start Proxmox VE replication runner. 
Oct 06 13:04:47 cly-pm-tmp pve-firewall[1119]: firewall update time (10.042 seconds) 
Oct 06 13:04:48 cly-pm-tmp pmxcfs[959]: [dcdb] notice: cpg_send_message retry 10 
Oct 06 13:04:48 cly-pm-tmp pmxcfs[959]: [status] notice: cpg_send_message retry 10 
Oct 06 13:04:49 cly-pm-tmp pmxcfs[959]: [dcdb] notice: cpg_send_message retry 20 
Oct 06 13:04:49 cly-pm-tmp pmxcfs[959]: [status] notice: cpg_send_message retry 20 
Oct 06 13:04:50 cly-pm-tmp pmxcfs[959]: [dcdb] notice: cpg_send_message retry 30 
Oct 06 13:04:50 cly-pm-tmp pmxcfs[959]: [status] notice: cpg_send_message retry 30
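
(While this is happening, corosync-cfgtool -s on each node reports whether knet thinks the links to the other nodes are up - I can post that output too if it would help; the exact format varies a bit between corosync versions.)
Code:
# knet's view of the link state to each other node; run on every
# node while the cpg_send_message retries are going on and compare
corosync-cfgtool -s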

corosync.conf is as follows:
Code:
logging { 
  debug: off 
  to_syslog: yes 
} 
 
nodelist { 
  node { 
    name: cbd-pm-2 
    nodeid: 3 
    quorum_votes: 1 
    ring0_addr: 192.168.133.2 
  } 
  node { 
    name: cly-pm-1 
    nodeid: 2 
    quorum_votes: 1 
    ring0_addr: 192.168.51.1 
  } 
  node { 
    name: cly-pm-tmp 
    nodeid: 1 
    quorum_votes: 1 
    ring0_addr: 192.168.133.4 
  } 
} 
 
quorum { 
  provider: corosync_votequorum 
} 
 
totem { 
  cluster_name: proxmoxcluster 
  config_version: 4 
  interface { 
    linknumber: 0 
  } 
  ip_version: ipv4-6 
  link_mode: passive 
  secauth: on 
  version: 2 
}

With the MTU set to 1400 across the board, the nodes talk to each other fine - but this seems to cause issues for running CTs / VMs that still want to use an MTU of 1500 bytes...
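
(For completeness, the way I'd normally persist a change like this is an mtu line in /etc/network/interfaces - roughly as below; the interface names and address are placeholders for whatever the cluster link actually uses.)
Code:
auto eno1
iface eno1 inet manual
        # on older ifupdown the plain 'mtu' option can be ignored for
        # 'manual' stanzas, hence the explicit pre-up as a fallback
        pre-up ip link set eno1 mtu 1400

auto vmbr0
iface vmbr0 inet static
        address 192.168.133.4/24
        bridge-ports eno1
        bridge-stp off
        bridge-fd 0
        mtu 1400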

So, with 1400 bytes:
Code:
# pvecm status 
Cluster information 
------------------- 
Name:             proxmoxcluster 
Config Version:   4 
Transport:        knet 
Secure auth:      on 
 
Quorum information 
------------------ 
Date:             Tue Oct  6 14:24:03 2020 
Quorum provider:  corosync_votequorum 
Nodes:            3 
Node ID:          0x00000001 
Ring ID:          1.1ab6 
Quorate:          Yes 
 
Votequorum information 
---------------------- 
Expected votes:   3 
Highest expected: 3 
Total votes:      3 
Quorum:           2   
Flags:            Quorate  
 
Membership information 
---------------------- 
    Nodeid      Votes Name 
0x00000001          1 192.168.133.4 (local) 
0x00000002          1 192.168.51.1 
0x00000003          1 192.168.133.2

Everything on the cluster side seems happy....

Node #2 is over an IPSec VPN - so it's possible that a lower MTU is required for transit across it.

Pinging across the VPN to non-proxmox hosts, I see:
Code:
$ ping -M do -s 1411 192.168.133.1 
PING 192.168.133.1 (192.168.133.1) 1411(1439) bytes of data. 
ping: local error: Message too long, mtu=1438 
ping: local error: Message too long, mtu=1438 
ping: local error: Message too long, mtu=1438
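
(If it's useful, a quick sweep takes the guesswork out of finding the exact path MTU - the target and range here are just the values from the test above, and it assumes the first size in the range actually gets through.)
Code:
#!/bin/bash
# walk the ICMP payload size upward until a DF-bit ping fails;
# last working payload + 8 (ICMP) + 20 (IPv4) = path MTU
target=192.168.133.1
for size in $(seq 1400 1480); do
    if ! ping -M do -c 1 -W 1 -s "$size" "$target" > /dev/null 2>&1; then
        echo "largest working payload: $((size - 1)) -> path MTU $((size - 1 + 28))"
        break
    fi
done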

So, is it possible to figure out why corosync's PMTUd says 1484 bytes when I'm seeing something different?

Can I pin the cluster at a lower MTU but leave the ethernet interfaces at 1500? We've only seen this happen whilst using Proxmox... Normal web traffic etc across the same VPN has been operating fine for years...
 
Thanks - I've proved that there's an issue and our network guy is looking into it now. The following debugging on hosts with 1500-byte MTUs seems to show an issue one way, but not the other:

Going from 192.168.51.x -> 192.168.133.x with:
Code:
ping -M do -s 1411 192.168.133.1
gives me:

Code:
PING 192.168.133.1 (192.168.133.1) 1411(1439) bytes of data.
ping: local error: Message too long, mtu=1438

Which is what I'd expect...

Going the other way, however, just gives me 100% packet loss - no fragmentation advice or similar - so I'm guessing something is 'helpfully' dropping those packets instead of replying properly.
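
(One way to confirm that: run the DF-bit pings from the 192.168.133.x side while watching on that same host for the ICMP "fragmentation needed" replies that should be coming back - if nothing shows up, whatever sits in the middle is eating them. The interface name is a placeholder.)
Code:
# ICMP type 3 (unreachable), code 4 (fragmentation needed and DF set)
tcpdump -ni eth0 'icmp[icmptype] = icmp-unreach and icmp[icmpcode] = 4'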

In the meantime, setting Corosync to use an MTU of 1400 and setting the ethernet MTU back to 1500 will probably be a good enough workaround for now...
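
(For reference, the on-the-fly change I mention below is just the standard iproute2 call - eno1 is a placeholder for whichever NIC the corosync link rides on, and it doesn't persist across reboots unless it also goes into /etc/network/interfaces.)
Code:
ip link set dev eno1 mtu 1400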
 
I also note that when I change the MTU via 'ip' on the console, I get the following logs from Corosync:
Code:
Oct 06 17:09:28 cbd-pm-2 corosync[9158]:   [KNET  ] rx: host: 1 link: 0 is up 
Oct 06 17:09:28 cbd-pm-2 corosync[9158]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1) 
Oct 06 17:09:28 cbd-pm-2 corosync[9158]:   [KNET  ] rx: host: 2 link: 0 is up 
Oct 06 17:09:28 cbd-pm-2 corosync[9158]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1) 
Oct 06 17:09:28 cbd-pm-2 corosync[9158]:   [KNET  ] pmtud: PMTUD link change for host: 1 link: 0 from 1285 to 1397

I set the MTU in corosync.conf as follows:
Code:
totem { 
  cluster_name: proxmoxcluster 
  config_version: 6 
  interface { 
    linknumber: 0 
  } 
  ip_version: ipv4-6 
  link_mode: passive 
  secauth: on 
  version: 2 
  netmtu: 1200 
}

It seems setting netmtu: 1400 caused the same problem to occur - so I'm not sure what's going on there right now, or if there are other overheads within the corosync setup that would add to the number of bytes lost...
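
My rough reading of the numbers, with the caveat that the knet overhead figure is a guess on my part:
Code:
# path MTU across the IPSec tunnel (from the ping tests):  1438 bytes
# minus IPv4 header (20) + UDP header (8):                 1410 bytes left for knet
# knet then adds its own on-wire header plus crypto/HMAC
# overhead with secauth on - roughly 100+ bytes, which would
# line up with the 1285 / 1397 PMTU values in the logs sitting
# ~100-115 below link MTUs of 1400 / 1500.
# So netmtu: 1400 can still overshoot what the tunnel carries,
# while netmtu: 1200 leaves comfortable headroom.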
 
