All nodes rebooted when moving one node to another rack

rndlinux

New Member
Dec 5, 2019
I have 18 nodes in a Proxmox cluster across two racks: rack A (2 switches for ring0 and ring1) and rack B (2 switches for ring0 and ring1). Ring0 and ring1 run on two VLANs.
I moved node04 from rack A to rack C (2 switches for ring0 and ring1). After turning node04 back on, all 18 nodes rebooted.

Has anyone encountered this issue before? Please help.
Some logs:
Code:
Dec  4 14:20:17  node04 corosync[2420]:   [SERV  ] Service engine loaded: corosync watchdog service [7]
Dec  4 14:20:17  node04 corosync[2420]:   [QUORUM] Using quorum provider corosync_votequorum
Dec  4 14:20:17  node04 corosync[2420]:   [SERV  ] Service engine loaded: corosync vote quorum service v1.0 [5]
Dec  4 14:20:17  node04 corosync[2420]:   [QB    ] server name: votequorum
Dec  4 14:20:17  node04 corosync[2420]:   [SERV  ] Service engine loaded: corosync cluster quorum service v0.1 [3]
Dec  4 14:20:17  node04 corosync[2420]:   [QB    ] server name: quorum
Dec  4 14:20:17  node04 corosync[2420]:   [KNET  ] host: host: 5 (passive) best link: 0 (pri: 1)
Dec  4 14:20:17  node04 corosync[2420]:   [TOTEM ] A new membership (7.7e4) was formed. Members joined: 7
Dec  4 14:20:17  node04 corosync[2420]:   [KNET  ] host: host: 13 (passive) best link: 0 (pri: 1)
Dec  4 14:20:17  node04 corosync[2420]:   [KNET  ] host: host: 13 has no active links
Dec  4 14:20:17  node04 corosync[2420]:   [KNET  ] host: host: 13 (passive) best link: 0 (pri: 1)
Dec  4 14:20:17  node04 corosync[2420]:   [KNET  ] host: host: 13 has no active links
Dec  4 14:20:17  node04 corosync[2420]:   [CPG   ] downlist left_list: 0 received
Dec  4 14:20:17  node04 systemd[1]: Started Corosync Cluster Engine.
Dec  4 14:20:17  node04 corosync[2420]:   [KNET  ] host: host: 15 (passive) best link: 0 (pri: 0)
Dec  4 14:20:17  node04 corosync[2420]:   [KNET  ] host: host: 15 has no active links
Dec  4 14:20:17  node04 corosync[2420]:   [KNET  ] host: host: 15 (passive) best link: 0 (pri: 1)
Dec  4 14:20:17  node04 corosync[2420]:   [KNET  ] host: host: 15 has no active links
Dec  4 14:20:17  node04 corosync[2420]:   [QUORUM] Members[1]: 7
Dec  4 14:20:17  node04 corosync[2420]:   [MAIN  ] Completed service synchronization, ready to provide service.
Dec  4 14:20:17  node04 corosync[2420]:   [KNET  ] host: host: 1 has no active links


Code:
Dec  4 14:20:17 node04 corosync[2420]:   [KNET  ] host: host: 2 has no active links
Dec  4 14:20:17 node04 corosync[2420]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Dec  4 14:20:17 node04 corosync[2420]:   [KNET  ] host: host: 2 has no active links
Dec  4 14:20:17 node04 corosync[2420]:   [KNET  ] host: host: 5 (passive) best link: 0 (pri: 1)
Dec  4 14:20:17 node04 corosync[2420]:   [KNET  ] host: host: 5 has no active links
Dec  4 14:20:18 node04 pve-firewall[2438]: starting server
Dec  4 14:20:18 node04 pvestatd[2439]: starting server
Dec  4 14:20:18 node04 systemd[1]: Started PVE Status Daemon.
Dec  4 14:20:18 node04 systemd[1]: Started Proxmox VE firewall.
Dec  4 14:20:18 node04 pvefw-logger[1082]: received terminate request (signal)
Dec  4 14:20:18 node04 pvefw-logger[1082]: stopping pvefw logger
Dec  4 14:20:18 node04 systemd[1]: Stopping Proxmox VE firewall logger...
Dec  4 14:20:18 node04 systemd[1]: pvefw-logger.service: Succeeded.
Dec  4 14:20:18 node04 systemd[1]: Stopped Proxmox VE firewall logger.
Dec  4 14:20:18 node04 systemd[1]: Starting Proxmox VE firewall logger...
Dec  4 14:20:18 node04 pvefw-logger[2482]: starting pvefw logger
Dec  4 14:20:18 node04 systemd[1]: Started Proxmox VE firewall logger.
Dec  4 14:20:18 node04 kernel: [   17.121760] tg3 0000:01:00.0 eno1: Link is up at 1000 Mbps, full duplex
Dec  4 14:20:18 node04 kernel: [   17.121767] tg3 0000:01:00.0 eno1: Flow control is on for TX and on for RX
Dec  4 14:20:18 node04 kernel: [   17.121769] tg3 0000:01:00.0 eno1: EEE is disabled
Dec  4 14:20:18 node04 kernel: [   17.121795] IPv6: ADDRCONF(NETDEV_CHANGE): eno1: link becomes ready
Dec  4 14:20:18 node04 pvedaemon[2485]: starting server
Dec  4 14:20:18 node04 pvedaemon[2485]: starting 3 worker(s)
Dec  4 14:20:18 node04 pvedaemon[2485]: worker 2486 started
Dec  4 14:20:18 node04 pvedaemon[2485]: worker 2487 started
Dec  4 14:20:18 node04 pvedaemon[2485]: worker 2488 started
Dec  4 14:20:18 node04 systemd[1]: Started PVE API Daemon.
Dec  4 14:20:18 node04 systemd[1]: Starting PVE Cluster Resource Manager Daemon...
Dec  4 14:20:18 node04 systemd[1]: Starting PVE API Proxy Server...
Dec  4 14:20:18 node04 kernel: [   17.520089] tg3 0000:01:00.1 eno2: Link is up at 1000 Mbps, full duplex
Dec  4 14:20:18 node04 kernel: [   17.520098] tg3 0000:01:00.1 eno2: Flow control is on for TX and on for RX
Dec  4 14:20:18 node04 kernel: [   17.520100] tg3 0000:01:00.1 eno2: EEE is disabled
Dec  4 14:20:18 node04 kernel: [   17.520127] IPv6: ADDRCONF(NETDEV_CHANGE): eno2: link becomes ready
Dec  4 14:20:19 node04 kernel: [   17.613462] tg3 0000:02:00.0 eno3: Link is up at 1000 Mbps, full duplex
Dec  4 14:20:19 node04 kernel: [   17.613470] tg3 0000:02:00.0 eno3: Flow control is on for TX and on for RX
Dec  4 14:20:19 node04 kernel: [   17.613471] tg3 0000:02:00.0 eno3: EEE is disabled
Dec  4 14:20:19 node04 kernel: [   17.613497] IPv6: ADDRCONF(NETDEV_CHANGE): eno3: link becomes ready
Dec  4 14:20:19 node04 pve-ha-crm[2492]: starting server


Code:
Dec  4 14:20:19 node04 pve-ha-crm[2492]: starting server
Dec  4 14:20:19 node04 pve-ha-crm[2492]: status change startup => wait_for_quorum
Dec  4 14:20:19 node04 systemd[1]: Started PVE Cluster Resource Manager Daemon.
Dec  4 14:20:19 node04 systemd[1]: Starting PVE Local HA Resource Manager Daemon...
Dec  4 14:20:19 node04 pveproxy[2494]: starting server
Dec  4 14:20:19 node04 pveproxy[2494]: starting 3 worker(s)
Dec  4 14:20:19 node04 pveproxy[2494]: worker 2495 started
Dec  4 14:20:19 node04 pveproxy[2494]: worker 2496 started
Dec  4 14:20:19 node04 pveproxy[2494]: worker 2497 started
Dec  4 14:20:19 node04 systemd[1]: Started PVE API Proxy Server.
Dec  4 14:20:19 node04 systemd[1]: Starting PVE SPICE Proxy Server...
Dec  4 14:20:19 node04 kernel: [   18.033322] tg3 0000:02:00.1 eno4: Link is up at 1000 Mbps, full duplex
Dec  4 14:20:19 node04 kernel: [   18.033330] tg3 0000:02:00.1 eno4: Flow control is on for TX and on for RX
Dec  4 14:20:19 node04 kernel: [   18.033331] tg3 0000:02:00.1 eno4: EEE is disabled
Dec  4 14:20:19 node04 kernel: [   18.033358] IPv6: ADDRCONF(NETDEV_CHANGE): eno4: link becomes ready
Dec  4 14:20:19 node04 spiceproxy[2499]: starting server
Dec  4 14:20:19 node04 spiceproxy[2499]: starting 1 worker(s)
Dec  4 14:20:19 node04 spiceproxy[2499]: worker 2500 started
Dec  4 14:20:19 node04 systemd[1]: Started PVE SPICE Proxy Server.
Dec  4 14:20:19 node04 pve-ha-lrm[2501]: starting server
Dec  4 14:20:19 node04 pve-ha-lrm[2501]: status change startup => wait_for_agent_lock
Dec  4 14:20:19 node04 systemd[1]: Started PVE Local HA Resource Manager Daemon.
Dec  4 14:20:19 node04 systemd[1]: Starting PVE guests...
Dec  4 14:20:20 node04 pve-guests[2502]: <root@pam> starting task UPID:d104:000009C7:0000075C:5DE75E34:startall::root@pam:
Dec  4 14:20:20 node04 pvesh[2502]: waiting for quorum ...
Dec  4 14:20:22 node04 pmxcfs[2286]: [status] notice: update cluster info (cluster name  cluster1, version = 55)
Dec  4 14:20:22 node04 pmxcfs[2286]: [dcdb] notice: members: 7/2286
Dec  4 14:20:22 node04 pmxcfs[2286]: [dcdb] notice: all data is up to date
Dec  4 14:20:22 node04 pmxcfs[2286]: [status] notice: members: 7/2286
Dec  4 14:20:22 node04 pmxcfs[2286]: [status] notice: all data is up to date
Dec  4 14:20:33 node04 pvestatd[2439]: got timeout
Dec  4 14:20:35 node04 pvestatd[2439]: storage 'backup' is not online
Dec  4 14:20:40 node04 pvestatd[2439]: got timeout
Dec  4 14:20:40 node04 pvestatd[2439]: status update time (12.174 seconds)
Dec  4 14:20:42 node04 pvestatd[2439]: storage 'backup' is not online
Dec  4 14:20:47 node04 pvestatd[2439]: got timeout
Dec  4 14:20:50 node04 kernel: [   48.658610] mlx4_en: enp4s0: Link Down
Dec  4 14:20:52 node04 pvestatd[2439]: got timeout

Code:
Dec  4 14:21:00 node04 corosync[2420]:   [TOTEM ] A new membership (1.7f0) was formed. Members joined: 1 2 3 4 5 6 8 9 10 11 12 13 14 15 16 17 19
Dec  4 14:21:00 node04 corosync[2420]:   [CPG   ] downlist left_list: 0 received
Dec  4 14:21:00 node04 corosync[2420]:   [CPG   ] downlist left_list: 0 received
Dec  4 14:21:00 node04 pmxcfs[2286]: [dcdb] notice: members: 1/2340, 2/2302, 3/2332, 4/2654, 5/2352, 6/2355, 7/2286, 8/2141, 9/2418, 10/2246, 11/2190, 12/2343, 13/2345, 14/2351, 15/2366, 16/2265, 17/4790, 19/2282
Dec  4 14:21:00 node04 pmxcfs[2286]: [dcdb] notice: starting data syncronisation
Dec  4 14:21:00 node04 pmxcfs[2286]: [status] notice: members: 1/2340, 2/2302, 3/2332, 4/2654, 5/2352, 6/2355, 7/2286, 8/2141, 9/2418, 10/2246, 11/2190, 12/2343, 13/2345, 14/2351, 15/2366, 16/2265, 17/4790, 19/2282
Dec  4 14:21:00 node04 pmxcfs[2286]: [status] notice: starting data syncronisation
Dec  4 14:21:00 node04 corosync[2420]:   [QUORUM] This node is within the primary component and will provide service.
Dec  4 14:21:00 node04 corosync[2420]:   [QUORUM] Members[18]: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 19
Dec  4 14:21:00 node04 corosync[2420]:   [MAIN  ] Completed service synchronization, ready to provide service.
Dec  4 14:21:00 node04 pmxcfs[2286]: [status] notice: node has quorum
Dec  4 14:21:00 node04 corosync[2420]:   [KNET  ] pmtud: PMTUD link change for host: 12 link: 0 from 469 to 1397
Dec  4 14:21:00 node04 corosync[2420]:   [KNET  ] pmtud: PMTUD link change for host: 12 link: 1 from 469 to 1397
Dec  4 14:21:00 node04 corosync[2420]:   [KNET  ] pmtud: PMTUD link change for host: 10 link: 0 from 469 to 1397
Dec  4 14:21:00 node04 corosync[2420]:   [KNET  ] pmtud: PMTUD link change for host: 10 link: 1 from 469 to 1397
Dec  4 14:21:00 node04 corosync[2420]:   [KNET  ] pmtud: PMTUD link change for host: 6 link: 0 from 469 to 1397
 
Please post the output of pveversion -v
 
Hi Tim,

I was using Proxmox 6.0 when this issue occurred. On 06/12/2019 I upgraded all nodes to Proxmox 6.1 (pve-manager/6.1-3/37248ce6). After that, the cluster was stable.

Yesterday I rebooted node11. After node11 came back online, all the other nodes rebooted over and over and would not stop rebooting.

I had to power off all nodes and turn them back on one by one (from node01 to node20), and then the cluster was stable. Can you give me your advice about this issue?
 
Hi Spirit,

Thank you, I have attached the logs. Could you please check them?
 

Attachments

  • node01.txt
    53.7 KB
My cluster was upgraded from 5.4 to 6.1, and PVE 5.4 used multicast. Can anyone help me check whether my cluster uses unicast or multicast with Proxmox 6.1?
 
Please post the output of:
# pveversion -v

And additionally post your corosync.conf from /etc/pve/corosync.conf
 
Hi Tim,

Code:
proxmox-ve: 6.1-2 (running kernel: 5.3.10-1-pve)
pve-manager: 6.1-3 (running version: 6.1-3/37248ce6)
pve-kernel-5.3: 6.0-12
pve-kernel-helper: 6.0-12
pve-kernel-5.0: 6.0-11
pve-kernel-5.3.10-1-pve: 5.3.10-1
pve-kernel-5.0.21-5-pve: 5.0.21-10
pve-kernel-5.0.21-2-pve: 5.0.21-7
pve-kernel-5.0.15-1-pve: 5.0.15-1
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.2-pve4
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.13-pve1
libpve-access-control: 6.0-5
libpve-apiclient-perl: 3.0-2
libpve-common-perl: 6.0-9
libpve-guest-common-perl: 3.0-3
libpve-http-server-perl: 3.0-3
libpve-storage-perl: 6.1-2
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve3
lxc-pve: 3.2.1-1
lxcfs: 3.0.3-pve60
novnc-pve: 1.1.0-1
openvswitch-switch: 2.10.0+2018.08.28+git.8ca7c82b7d+ds1-12+deb10u1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.1-1
pve-cluster: 6.1-2
pve-container: 3.0-14
pve-docs: 6.1-3
pve-edk2-firmware: 2.20191002-1
pve-firewall: 4.0-9
pve-firmware: 3.0-4
pve-ha-manager: 3.0-8
pve-i18n: 2.0-3
pve-qemu-kvm: 4.1.1-2
pve-xtermjs: 3.13.2-1
qemu-server: 6.1-2
smartmontools: 7.0-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.2-pve2

This is my corosync.conf:

Code:
# cat /etc/pve/corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: node01
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 192.168.186.151
    ring1_addr: 192.168.187.151
  }
  node {
    name: node02
    nodeid: 5
    quorum_votes: 1
    ring0_addr: 192.168.186.152
    ring1_addr: 192.168.187.152
  }
  node {
    name: node03
    nodeid: 6
    quorum_votes: 1
    ring0_addr: 192.168.186.153
    ring1_addr: 192.168.187.153
  }
  node {
    name: node04
    nodeid: 7
    quorum_votes: 1
    ring0_addr: 192.168.186.154
    ring1_addr: 192.168.187.154
  }
  node {
    name: node05
    nodeid: 10
    quorum_votes: 1
    ring0_addr: 192.168.186.155
    ring1_addr: 192.168.187.155
  }
  node {
    name: node06
    nodeid: 16
    quorum_votes: 1
    ring0_addr: 192.168.186.156
    ring1_addr: 192.168.187.156
  }
  node {
    name: node07
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 192.168.186.159
    ring1_addr: 192.168.187.159
  }
  node {
    name: node08
    nodeid: 9
    quorum_votes: 1
    ring0_addr: 192.168.186.160
    ring1_addr: 192.168.187.160
  }
  node {
    name: node09
    nodeid: 18
    quorum_votes: 1
    ring0_addr: 192.168.186.161
    ring1_addr: 192.168.187.161
  }
  node {
    name: node10
    nodeid: 19
    quorum_votes: 1
    ring0_addr: 192.168.186.162
    ring1_addr: 192.168.187.162
  }
  node {
    name: node11
    nodeid: 12
    quorum_votes: 1
    ring0_addr: 192.168.186.163
    ring1_addr: 192.168.187.163
  }
  node {
    name: node12
    nodeid: 13
    quorum_votes: 1
    ring0_addr: 192.168.186.164
    ring1_addr: 192.168.187.164
  }
  node {
    name: node13
    nodeid: 14
    quorum_votes: 1
    ring0_addr: 192.168.186.165
    ring1_addr: 192.168.187.165
  }
  node {
    name: node14
    nodeid: 15
    quorum_votes: 1
    ring0_addr: 192.168.186.166
    ring1_addr: 192.168.187.166
  }
  node {
    name: node15
    nodeid: 20
    quorum_votes: 1
    ring0_addr: 192.168.186.167
    ring1_addr: 192.168.187.167
  }
  node {
    name: node16
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.186.168
    ring1_addr: 192.168.187.168
  }
  node {
    name: node17
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 192.168.186.170
    ring1_addr: 192.168.187.170
  }
  node {
    name: node18
    nodeid: 8
    quorum_votes: 1
    ring0_addr: 192.168.186.171
    ring1_addr: 192.168.187.171
  }
  node {
    name: node19
    nodeid: 11
    quorum_votes: 1
    ring0_addr: 192.168.186.172
    ring1_addr: 192.168.187.172
  }
  node {
    name: node20
    nodeid: 17
    quorum_votes: 1
    ring0_addr: 192.168.186.173
    ring1_addr: 192.168.187.173
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: Cloud-02
  config_version: 60
  interface {
    bindnetaddr: 192.168.186.0
    ringnumber: 0
  }
  interface {
    bindnetaddr: 192.168.187.0
    ringnumber: 1
  }
  ip_version: ipv4
  rrp_mode: passive
  secauth: on
  version: 2
}
 
Your corosync.conf isn't the same version as reported in your first log:
"Dec 4 14:20:22 node04 pmxcfs[2286]: [status] notice: update cluster info (cluster name cluster1, version = 55)" vs 60 now

Did you add nodes or anything else?

The config still uses some deprecated parameters, this isn't a problem right now, but you should clean it up in the future.
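
For reference, the interface blocks with bindnetaddr/ringnumber and the rrp_mode setting are corosync 2 style options that are ignored with corosync 3 / kronosnet. A cleaned-up totem section would look roughly like this (a sketch only; keep your own values and remember to increase config_version whenever you edit the file):

Code:
totem {
  cluster_name: Cloud-02
  config_version: 61
  interface {
    linknumber: 0
  }
  interface {
    linknumber: 1
  }
  ip_version: ipv4
  secauth: on
  version: 2
}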

From what I can see, there was an issue with one of your links in the initial logs you provided: Dec 4 14:20:50 node04 kernel: [ 48.658610] mlx4_en: enp4s0: Link Down

Can you share your network config as well?
Do you have HA enabled?
 
Hi Tim,

"Dec 4 14:20:22 node04 pmxcfs[2286]: [status] notice: update cluster info (cluster name cluster1, version = 55)" vs 60 now

The version was updated when I deleted a node and added another node to the cluster, so it is now 60.

Can you share your network config as well?
Let me share my cluster network config, hoping you can help.

WAN (bond): eno1, eno2 (active-backup), trunk port
- ring0
LAN (bond1): eno3, eno4 (active-backup), trunk port
- ring1
SAN (bond2): active-backup, two ports
- migration

So, my cluster rings run on the WAN and LAN networks, and migration uses the SAN.
From what I can see, there was an issue with one of your links in the initial logs you provided: Dec 4 14:20:50 node04 kernel: [ 48.658610] mlx4_en: enp4s0: Link Down
That is not related.

Do you have HA enabled?
Yes, it is enabled.

Can you help me check whether my cluster is running multicast or unicast?
 
The messages:

"Process pause detected for 6353 ms, flushing membership messages."

seem bad.

What is the latency on your different rings? (Can you send the result of "ping -f ...." between 2 nodes?)
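
For example, something like this between two nodes (an illustrative flood ping using node02's ring0 address from your corosync.conf; flood ping needs root):

Code:
# flood-ping node02's ring0 address for 10 seconds, then read the min/avg/max latency from the summary
ping -f -w 10 192.168.186.152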

I have tested corosync3 with 20 nodes on a 10Gb network with low latency without problems, but since you use gigabit links it would be good to see the latency. (It depends on the ASIC of the switch. What is your switch model?)

Maybe you can try to increase the token timeout in corosync.conf (token: 10000 in the totem section).
 
Hi spirit,

if you want to be sure, you can "tcpdump port 5404", and check if you see a multicast address. (2XX.XXX.XXX.XXX)
Port 5404 is not listed or open on my cluster. Only ports 5405 (ring0) and 5406 (ring1) are open, one for each ring.
corosync3 don't use multicast anymore, only unicast

I understood that unicast is only used for clusters of up to 4 nodes. If a cluster has more than 4 nodes, must it use multicast? Did you use multicast for your 20-node cluster?

What are your switch model

I use Cisco 4948 switches.

What is the latency on your different rings? (Can you send the result of "ping -f ...." between 2 nodes?)
Latency is 0.2 ms between 2 nodes.

Maybe you can try to increase the token timeout in corosync.conf (token: 10000 in the totem section)

Could you please let me know how to check the token timeout on the cluster? I do not see it in corosync.conf.

Thank you for your help.
 

I understood that unicast is only used for clusters of up to 4 nodes. If a cluster has more than 4 nodes, must it use multicast? Did you use multicast for your 20-node cluster?
Corosync3 uses a new protocol, faster than before, and it uses unicast. (You can't do multicast with corosync3.)
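
If you want to double-check this on a running node, a capture on the ports you saw open (5405 for ring0, 5406 for ring1) should only show host-to-host UDP traffic; for example (an illustrative command):

Code:
# corosync 3 / knet sends unicast UDP between the node addresses
tcpdump -ni any 'udp port 5405 or udp port 5406'
# multicast traffic would instead show destinations in the 224.0.0.0/4 range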


I use Cisco 4948 switches.

Latency is 0.2 ms between 2 nodes.

I'm running 20-node clusters, but with much lower latencies (0.026 ms).
I'm not sure it's possible to have a 100% stable cluster with 0.150 ms latency and 20 nodes without corosync tuning.
I'm also not sure how old your Cisco 4948 and its ASIC are (the first release was about 15 years ago).


Could you please let me know how to check the token timeout on the cluster? I do not see it in corosync.conf.
The value doesn't exist by default.
Simply add it in the totem section and increase config_version.

Code:
totem {
  ......
  token: 10000
}


(Normally the token value is auto-computed, but I'm not sure how that works when a node goes offline and comes back later; maybe it takes a bit too much time to compute the correct value.)
 
Corosync3 uses a new protocol, faster than before, and it uses unicast. (You can't do multicast with corosync3.)
I read the documentation, and it seems possible to enable multicast, but I do not know how to do this. Can you help?


Also, my cluster does not open port 5404. Can you give me your advice about this port?
 
Hi @spirit,

do you see any delay when you type a VMID in the Proxmox search box on your 20-node cluster?
No.
It shouldn't be related anyway, as the cluster config (/etc/pve) is replicated locally on all nodes.

What could be a little bit slower is an update of a VM configuration or of the VM stats.
 
Hi Spirit,

I use OVS for link0 and link1 on my cluster. I have now tested multicast with omping and I see packet loss. Can you give me your advice on how to fix this issue?
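
For reference, an omping test of this kind is usually started on every node at about the same time, for example (illustrative hostnames; each node lists all the nodes taking part):

Code:
# 600 probes, one per second, between the listed nodes; high loss usually points at IGMP snooping / multicast handling on the switches
omping -c 600 -i 1 node01 node02 node04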
 
Hi Spirit,

I tested multicast with omping and, using tcpdump, I see the multicast address 232.43.211.234.

After that, I checked my cluster with tcpdump on the ring0 interface (the active link) and do not see any multicast addresses, as shown below:
Code:
09:53:17.117362 IP Cloud-node01.5405 > Cloud-node02.5405: UDP, length 128
09:53:17.117603 IP Cloud-node02.5405 > Cloud-node05.5405: UDP, length 128
09:53:17.120727 IP Cloud-node01.5405 > Cloud-node02.5405: UDP, length 128
09:53:17.120965 IP Cloud-node02.5405 > Cloud-node05.5405: UDP, length 128
09:53:17.124060 IP Cloud-node01.5405 > Cloud-node02.5405: UDP, length 128
09:53:17.124303 IP Cloud-node02.5405 > Cloud-node05.5405: UDP, length 128
09:53:17.127409 IP Cloud-node01.5405 > Cloud-node02.5405: UDP, length 128
09:53:17.127667 IP Cloud-node02.5405 > Cloud-node05.5405: UDP, length 128
09:53:17.131271 IP Cloud-node01.5405 > Cloud-node02.5405: UDP, length 128
09:53:17.131526 IP Cloud-node02.5405 > Cloud-node05.5405: UDP, length 128
09:53:17.134755 IP Cloud-node01.5405 > Cloud-node02.5405: UDP, length 128
09:53:17.134991 IP Cloud-node02.5405 > Cloud-node05.5405: UDP, length 128
09:53:17.138125 IP Cloud-node01.5405 > Cloud-node02.5405: UDP, length 128
09:53:17.138372 IP Cloud-node02.5405 > Cloud-node05.5405: UDP, length 128
09:53:17.141759 IP Cloud-node03.5405 > Cloud-node02.5405: UDP, length 1472
09:53:17.141778 IP Cloud-node03.5405 > Cloud-node02.5405: UDP, length 1264
09:53:17.141834 IP Cloud-node01.5405 > Cloud-node02.5405: UDP, length 128
09:53:17.142159 IP Cloud-node02.5405 > Cloud-node05.5405: UDP, length 128
09:53:17.147456 IP Cloud-node01.5405 > Cloud-node02.5405: UDP, length 128
09:53:17.147697 IP Cloud-node02.5405 > Cloud-node05.5405: UDP, length 128
09:53:17.151329 IP Cloud-node01.5405 > Cloud-node02.5405: UDP, length 128
09:53:17.151582 IP Cloud-node02.5405 > Cloud-node05.5405: UDP, length 128
09:53:17.154619 IP Cloud-node01.5405 > Cloud-node02.5405: UDP, length 128
09:53:17.154857 IP Cloud-node02.5405 > Cloud-node05.5405: UDP, length 128
09:53:17.157820 IP Cloud-node01.5405 > Cloud-node02.5405: UDP, length 128
09:53:17.158054 IP Cloud-node02.5405 > Cloud-node05.5405: UDP, length 128
09:53:17.161117 IP Cloud-node01.5405 > Cloud-node02.5405: UDP, length 128
09:53:17.161359 IP Cloud-node02.5405 > Cloud-node05.5405: UDP, length 128
09:53:17.163791 IP Cloud-node04.5405 > Cloud-node02.5405: UDP, length 656
09:53:17.164656 IP Cloud-node01.5405 > Cloud-node02.5405: UDP, length 128
09:53:17.164893 IP Cloud-node02.5405 > Cloud-node05.5405: UDP, length 128
09:53:17.167939 IP Cloud-node01.5405 > Cloud-node02.5405: UDP, length 128
09:53:17.168187 IP Cloud-node02.5405 > Cloud-node05.5405: UDP, length 128
09:53:17.172687 IP Cloud-node01.5405 > Cloud-node02.5405: UDP, length 128
09:53:17.172923 IP Cloud-node02.5405 > Cloud-node05.5405: UDP, length 128
09:53:17.176293 IP Cloud-node01.5405 > Cloud-node02.5405: UDP, length 128
09:53:17.176533 IP Cloud-node02.5405 > Cloud-node05.5405: UDP, length 128
09:53:17.179724 IP Cloud-node01.5405 > Cloud-node02.5405: UDP, length 128
09:53:17.179959 IP Cloud-node02.5405 > Cloud-node05.5405: UDP, length 128
09:53:17.183379 IP Cloud-node01.5405 > Cloud-node02.5405: UDP, length 128
09:53:17.183636 IP Cloud-node02.5405 > Cloud-node05.5405: UDP, length 128
09:53:17.187063 IP Cloud-node01.5405 > Cloud-node02.5405: UDP, length 128
09:53:17.187301 IP Cloud-node02.5405 > Cloud-node05.5405: UDP, length 128
09:53:17.190698 IP Cloud-node01.5405 > Cloud-node02.5405: UDP, length 128
09:53:17.190938 IP Cloud-node02.5405 > Cloud-node05.5405: UDP, length 128
09:53:17.194207 IP Cloud-node01.5405 > Cloud-node02.5405: UDP, length 128
09:53:17.194441 IP Cloud-node02.5405 > Cloud-node05.5405: UDP, length 128
09:53:17.197747 IP Cloud-node01.5405 > Cloud-node02.5405: UDP, length 128
09:53:17.197983 IP Cloud-node02.5405 > Cloud-node05.5405: UDP, length 128
09:53:17.201202 IP Cloud-node01.5405 > Cloud-node02.5405: UDP, length 128
09:53:17.201436 IP Cloud-node02.5405 > Cloud-node05.5405: UDP, length 128
09:53:17.204540 IP Cloud-node01.5405 > Cloud-node02.5405: UDP, length 128
09:53:17.204781 IP Cloud-node02.5405 > Cloud-node05.5405: UDP, length 128
09:53:17.208158 IP Cloud-node01.5405 > Cloud-node02.5405: UDP, length 128
09:53:17.208398 IP Cloud-node02.5405 > Cloud-node05.5405: UDP, length 128
09:53:17.211636 IP Cloud-node01.5405 > Cloud-node02.5405: UDP, length 128
09:53:17.211880 IP Cloud-node02.5405 > Cloud-node05.5405: UDP, length 128
09:53:17.215044 IP Cloud-node01.5405 > Cloud-node02.5405: UDP, length 128
09:53:17.215284 IP Cloud-node02.5405 > Cloud-node05.5405: UDP, length 128
09:53:17.218450 IP Cloud-node01.5405 > Cloud-node02.5405: UDP, length 128
09:53:17.218692 IP Cloud-node02.5405 > Cloud-node05.5405: UDP, length 128
09:53:17.222259 IP Cloud-node01.5405 > Cloud-node02.5405: UDP, length 128
09:53:17.222497 IP Cloud-node02.5405 > Cloud-node05.5405: UDP, length 128
09:53:17.225553 IP Cloud-node01.5405 > Cloud-node02.5405: UDP, length 128
09:53:17.225794 IP Cloud-node02.5405 > Cloud-node05.5405: UDP, length 128

I also used netstat -an to check for multicast addresses and did not see anything.

=> So my cluster is running with unicast? I read the docs, which say unicast works well with only 4 nodes.

So, could you please let me know how to run a 20-node cluster stably in my case? Hoping you can help!
 
