Cluster reset when one node can only be reached over corosync ring0 -- configuration problem?

mensinck

Hey All,

I got a complete cluster reset (a watchdog-based reset of all nodes) in the following scenario.

We have a cluster of 7 hosts.

corosync has 2 rings:

ring0: network 192.168.xx.n/24 using a dedicated copper switch
ring1: network 192.168.yy.n/24 using a VLAN on a 10G fiber link

Here is the relevant part of corosync.conf:

Code:
nodelist {
  node {
    name: pve40
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.xx.40
    ring1_addr: 192.168.yy.40
  }
  node {
    name: pve52
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 192.168.xx.52
    ring1_addr: 192.168.yy.52
  }

  ... more nodes ...
}

quorum {
  provider: corosync_votequorum
}
totem {
  cluster_name: HA-Cluster-A
  config_version: 15
  interface {
    knet_link_priority: 10
    linknumber: 0
  }
  interface {
    knet_link_priority: 20
    linknumber: 1
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}

Now one node (nn) was down to get a new fiber card. It came back online with only ring0 connected and ring1 unplugged.

As soon as node nn became reachable over ring0, no host got any heartbeat on the LRM any more. Once the watchdog timeout (60 s) was reached, each host was reset.

I cannot find my error in this setup.
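
(For reference, this is roughly how the HA and watchdog state can be inspected on a node; a minimal sketch using the standard Proxmox/systemd tools, the timestamps are just examples.)

Code:
# current CRM/LRM view of the HA stack
ha-manager status
# state of the watchdog multiplexer and the HA services
systemctl status watchdog-mux pve-ha-lrm pve-ha-crm
# journal of the HA services around the incident (adjust the time window)
journalctl -u watchdog-mux -u pve-ha-lrm -u pve-ha-crm --since "2024-03-06 10:10" --until "2024-03-06 10:20"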

The last log lines I see before the reset are:

Code:
2024-03-06T10:16:28.080135+01:00 pve57 corosync[1532]:   [QUORUM] Sync members[6]: 1 2 3 6 7 8
2024-03-06T10:16:28.080844+01:00 pve57 corosync[1532]:   [TOTEM ] A new membership (1.1653) was formed. Members
2024-03-06T10:16:28.096054+01:00 pve57 corosync[1532]:   [QUORUM] Members[6]: 1 2 3 6 7 8
2024-03-06T10:16:28.096288+01:00 pve57 corosync[1532]:   [MAIN  ] Completed service synchronization, ready to provide service.
2024-03-06T10:16:28.099871+01:00 pve57 watchdog-mux[1076]: exit watchdog-mux with active connections
2024-03-06T10:16:28.100344+01:00 pve57 pve-ha-crm[2729]: loop take too long (63 seconds)
2024-03-06T10:16:28.114142+01:00 pve57 systemd[1]: watchdog-mux.service: Deactivated successfully.
2024-03-06T10:16:28.114414+01:00 pve57 kernel: [ 1142.459534] watchdog: watchdog0: watchdog did not stop!
2024-03-06T10:16:28.168392+01:00 pve57 pmxcfs[1402]: [status] notice: cpg_send_message retried 99 times
2024-03-06T10:16:28.222159+01:00 pve57 pmxcfs[1402]: [status] notice: received log


Any hints on how to fix this?

The PVE version is 8.1.4:

Code:
proxmox-ve: 8.1.0 (running kernel: 6.5.13-1-pve)
pve-manager: 8.1.4 (running version: 8.1.4/ec5affc9e41f1d79)
proxmox-kernel-helper: 8.1.0
pve-kernel-5.11: 7.0-10
proxmox-kernel-6.5.13-1-pve-signed: 6.5.13-1
proxmox-kernel-6.5: 6.5.13-1
proxmox-kernel-6.5.11-8-pve-signed: 6.5.11-8
proxmox-kernel-6.2.16-20-pve: 6.2.16-20
proxmox-kernel-6.2: 6.2.16-20
pve-kernel-5.11.22-7-pve: 5.11.22-12
pve-kernel-5.11.22-4-pve: 5.11.22-9
ceph: 18.2.1-pve2
ceph-fuse: 18.2.1-pve2
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx8
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-4
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.0
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.3
libpve-access-control: 8.1.1
libpve-apiclient-perl: 3.3.1
libpve-common-perl: 8.1.0
libpve-guest-common-perl: 5.0.6
libpve-http-server-perl: 5.0.5
libpve-network-perl: 0.9.5
libpve-rs-perl: 0.8.8
libpve-storage-perl: 8.0.5
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 5.0.2-4
lxcfs: 5.0.3-pve4
novnc-pve: 1.4.0-3
proxmox-backup-client: 3.1.4-1
proxmox-backup-file-restore: 3.1.4-1
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.2.3
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.5
proxmox-widget-toolkit: 4.1.3
pve-cluster: 8.0.5
pve-container: 5.0.8
pve-docs: 8.1.3
pve-edk2-firmware: 4.2023.08-4
pve-firewall: 5.0.3
pve-firmware: 3.9-2
pve-ha-manager: 4.0.3
pve-i18n: 3.2.0
pve-qemu-kvm: 8.1.5-3
pve-xtermjs: 5.3.0-3
qemu-server: 8.0.10
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.2-pve2

Regards, LM
 
Please post the logs from a few minutes before the reset, and from all nodes.
 
Dear Fabian,

Thank you for your reply.

All of the hosts have these log entries in syslog:

On the [QUORUM] lines, member 9 is missing.



Code:
2024-03-06T10:15:35.484881+01:00 pve40 corosync[1561]:   [QUORUM] Sync members[6]: 1 2 3 6 7 8
2024-03-06T10:15:35.485344+01:00 pve40 corosync[1561]:   [TOTEM ] A new membership (1.1637) was formed. Members
2024-03-06T10:15:36.297808+01:00 pve40 systemd[1]: check-mk-agent@8-1268-997.service: Deactivated successfully.
2024-03-06T10:15:36.297949+01:00 pve40 systemd[1]: check-mk-agent@8-1268-997.service: Consumed 2.810s CPU time.
2024-03-06T10:15:36.658054+01:00 pve40 pvedaemon[1969]: <root@pam> successful auth for user 'cmk@pve'
2024-03-06T10:15:37.659021+01:00 pve40 pmxcfs[1468]: [status] notice: cpg_send_message retry 10
2024-03-06T10:15:38.659882+01:00 pve40 pmxcfs[1468]: [status] notice: cpg_send_message retry 20
2024-03-06T10:15:39.660814+01:00 pve40 pmxcfs[1468]: [status] notice: cpg_send_message retry 30
2024-03-06T10:15:40.661798+01:00 pve40 pmxcfs[1468]: [status] notice: cpg_send_message retry 40
2024-03-06T10:15:41.662853+01:00 pve40 pmxcfs[1468]: [status] notice: cpg_send_message retry 50
2024-03-06T10:15:42.663580+01:00 pve40 pmxcfs[1468]: [status] notice: cpg_send_message retry 60
2024-03-06T10:15:42.997063+01:00 pve40 corosync[1561]:   [QUORUM] Sync members[6]: 1 2 3 6 7 8
2024-03-06T10:15:42.997184+01:00 pve40 corosync[1561]:   [TOTEM ] A new membership (1.163b) was formed. Members
2024-03-06T10:15:43.664512+01:00 pve40 pmxcfs[1468]: [status] notice: cpg_send_message retry 70
2024-03-06T10:15:44.665415+01:00 pve40 pmxcfs[1468]: [status] notice: cpg_send_message retry 80
2024-03-06T10:15:45.666278+01:00 pve40 pmxcfs[1468]: [status] notice: cpg_send_message retry 90
2024-03-06T10:15:46.667301+01:00 pve40 pmxcfs[1468]: [status] notice: cpg_send_message retry 100
2024-03-06T10:15:46.667453+01:00 pve40 pmxcfs[1468]: [status] notice: cpg_send_message retried 100 times
2024-03-06T10:15:46.667520+01:00 pve40 pmxcfs[1468]: [status] crit: cpg_send_message failed: 6
2024-03-06T10:15:47.668859+01:00 pve40 pmxcfs[1468]: [dcdb] notice: cpg_send_message retry 10
2024-03-06T10:15:47.670001+01:00 pve40 pmxcfs[1468]: [status] notice: cpg_send_message retry 10
2024-03-06T10:15:48.669809+01:00 pve40 pmxcfs[1468]: [dcdb] notice: cpg_send_message retry 20
2024-03-06T10:15:48.670846+01:00 pve40 pmxcfs[1468]: [status] notice: cpg_send_message retry 20
2024-03-06T10:15:49.670667+01:00 pve40 pmxcfs[1468]: [dcdb] notice: cpg_send_message retry 30
2024-03-06T10:15:49.671762+01:00 pve40 pmxcfs[1468]: [status] notice: cpg_send_message retry 30
2024-03-06T10:15:50.515548+01:00 pve40 corosync[1561]:   [QUORUM] Sync members[6]: 1 2 3 6 7 8
2024-03-06T10:15:50.536882+01:00 pve40 corosync[1561]:   [TOTEM ] A new membership (1.163f) was formed. Members
2024-03-06T10:15:50.671429+01:00 pve40 pmxcfs[1468]: [dcdb] notice: cpg_send_message retry 40
2024-03-06T10:15:50.672548+01:00 pve40 pmxcfs[1468]: [status] notice: cpg_send_message retry 40
2024-03-06T10:15:51.672395+01:00 pve40 pmxcfs[1468]: [dcdb] notice: cpg_send_message retry 50
2024-03-06T10:15:51.673380+01:00 pve40 pmxcfs[1468]: [status] notice: cpg_send_message retry 50
2024-03-06T10:15:52.673367+01:00 pve40 pmxcfs[1468]: [dcdb] notice: cpg_send_message retry 60
2024-03-06T10:15:52.674248+01:00 pve40 pmxcfs[1468]: [status] notice: cpg_send_message retry 60
2024-03-06T10:15:53.674378+01:00 pve40 pmxcfs[1468]: [dcdb] notice: cpg_send_message retry 70
2024-03-06T10:15:53.674992+01:00 pve40 pmxcfs[1468]: [status] notice: cpg_send_message retry 70
2024-03-06T10:15:54.675102+01:00 pve40 pmxcfs[1468]: [dcdb] notice: cpg_send_message retry 80
2024-03-06T10:15:54.675833+01:00 pve40 pmxcfs[1468]: [status] notice: cpg_send_message retry 80

A bit later I see the following in the logs, and I also see that the Ceph OSDs are no longer reachable:

Code:
2024-03-06T10:16:28.819062+01:00 pve40 pvestatd[1920]: status update time (50.834 seconds)
2024-03-06T10:16:28.907906+01:00 pve40 pmxcfs[1468]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/410021: -1
2024-03-06T10:16:29.371129+01:00 pve40 pmxcfs[1468]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/pve40/xxxxxx: -1
.....
2024-03-06T10:16:29.465613+01:00 pve40 pmxcfs[1468]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/pve52/XXXXXX: -1
2024-03-06T10:16:29.465844+01:00 pve40 pmxcfs[1468]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/pve52/yyyyyyy-iso: -1
.....
2024-03-06T10:16:29.467342+01:00 pve40 pmxcfs[1468]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/pve52/xxx/: -1
2024-03-06T10:16:29.467511+01:00 pve40 pmxcfs[1468]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/pve52/cephfs: -1
2024-03-06T10:16:31.744636+01:00 pve40 pmxcfs[1468]: [status] notice: received log
2024-03-06T10:16:33.087470+01:00 pve40 systemd[1]: Started check-mk-agent@9-1268-997.service - Checkmk agent (PID 1268/UID 997).
2024-03-06T10:16:34.307817+01:00 pve40 pveproxy[11487]: proxy detected vanished client connection
2024-03-06T10:16:36.504673+01:00 pve40 pmxcfs[1468]: [status] notice: received log
2024-03-06T10:16:36.709528+01:00 pve40 pmxcfs[1468]: [status] notice: received log
2024-03-06T10:16:36.756993+01:00 pve40 systemd[1]: check-mk-agent@9-1268-997.service: Deactivated successfully.
2024-03-06T10:16:36.757126+01:00 pve40 systemd[1]: check-mk-agent@9-1268-997.service: Consumed 2.789s CPU time.
2024-03-06T10:16:37.163331+01:00 pve40 pvedaemon[1968]: <root@pam> successful auth for user 'cmk@pve'
2024-03-06T10:16:40.805471+01:00 pve40 corosync[1561]:   [KNET  ] link: host: 2 link: 1 is down
2024-03-06T10:16:40.805704+01:00 pve40 corosync[1561]:   [KNET  ] link: host: 8 link: 1 is down
2024-03-06T10:16:40.805748+01:00 pve40 corosync[1561]:   [KNET  ] link: host: 7 link: 1 is down
2024-03-06T10:16:40.805784+01:00 pve40 corosync[1561]:   [KNET  ] link: host: 6 link: 1 is down
2024-03-06T10:16:40.805825+01:00 pve40 corosync[1561]:   [KNET  ] link: host: 3 link: 1 is down
2024-03-06T10:16:40.805860+01:00 pve40 corosync[1561]:   [KNET  ] host: host: 2 (passive) best link: 1 (pri: 20)
2024-03-06T10:16:40.805894+01:00 pve40 corosync[1561]:   [KNET  ] host: host: 2 has no active links
2024-03-06T10:16:40.805943+01:00 pve40 corosync[1561]:   [KNET  ] host: host: 8 (passive) best link: 1 (pri: 20)
2024-03-06T10:16:40.805977+01:00 pve40 corosync[1561]:   [KNET  ] host: host: 8 has no active links
2024-03-06T10:16:40.806502+01:00 pve40 corosync[1561]:   [KNET  ] host: host: 7 (passive) best link: 1 (pri: 20)
2024-03-06T10:16:40.806565+01:00 pve40 corosync[1561]:   [KNET  ] host: host: 7 has no active links
2024-03-06T10:16:40.807216+01:00 pve40 corosync[1561]:   [KNET  ] host: host: 6 (passive) best link: 1 (pri: 20)
2024-03-06T10:16:40.807313+01:00 pve40 corosync[1561]:   [KNET  ] host: host: 6 has no active links
2024-03-06T10:16:40.807758+01:00 pve40 corosync[1561]:   [KNET  ] host: host: 3 (passive) best link: 1 (pri: 20)
2024-03-06T10:16:40.807843+01:00 pve40 corosync[1561]:   [KNET  ] host: host: 3 has no active links
2024-03-06T10:16:42.778745+01:00 pve40 corosync[1561]:   [TOTEM ] Token has not been received in 4687 ms
2024-03-06T10:16:43.652576+01:00 pve40 pvestatd[1920]: got timeout
2024-03-06T10:16:44.341666+01:00 pve40 corosync[1561]:   [TOTEM ] A processor failed, forming new configuration: token timed out (6250ms), waiting 7500ms for consensus.
2024-03-06T10:16:46.070177+01:00 pve40 pvestatd[1920]: got timeout
2024-03-06T10:16:46.256288+01:00 pve40 pvestatd[1920]: status update time (7.700 seconds)
2024-03-06T10:16:50.551481+01:00 pve40 pvestatd[1920]: got timeout
2024-03-06T10:16:51.842619+01:00 pve40 corosync[1561]:   [QUORUM] Sync members[1]: 1
2024-03-06T10:16:51.842874+01:00 pve40 corosync[1561]:   [QUORUM] Sync left[5]: 2 3 6 7 8
2024-03-06T10:16:51.842931+01:00 pve40 corosync[1561]:   [TOTEM ] A new membership (1.1657) was formed. Members left: 2 3 6 7 8
2024-03-06T10:16:51.842970+01:00 pve40 corosync[1561]:   [TOTEM ] Failed to receive the leave message. failed: 2 3 6 7 8
2024-03-06T10:16:51.843280+01:00 pve40 pmxcfs[1468]: [dcdb] notice: members: 1/1468
2024-03-06T10:16:51.843401+01:00 pve40 pmxcfs[1468]: [status] notice: members: 1/1468
2024-03-06T10:16:51.843469+01:00 pve40 corosync[1561]:   [QUORUM] This node is within the non-primary component and will NOT provide any services.
2024-03-06T10:16:51.843535+01:00 pve40 corosync[1561]:   [QUORUM] Members[1]: 1
2024-03-06T10:16:51.843599+01:00 pve40 pmxcfs[1468]: [status] notice: node lost quorum
2024-03-06T10:16:51.843676+01:00 pve40 corosync[1561]:   [MAIN  ] Completed service synchronization, ready to provide service.
2024-03-06T10:16:51.843732+01:00 pve40 pmxcfs[1468]: [dcdb] crit: received write while not quorate - trigger resync
2024-03-06T10:16:51.843771+01:00 pve40 pmxcfs[1468]: [dcdb] crit: leaving CPG group
....

2024-03-06T10:16:51.843989+01:00 pve40 pve-ha-crm[2255]: lost lock 'ha_manager_lock - cfs lock update failed - Operation not permitted
2024-03-06T10:16:51.844067+01:00 pve40 pve-ha-crm[2255]: status change master => lost_manager_lock
2024-03-06T10:16:51.844113+01:00 pve40 pve-ha-crm[2255]: watchdog closed (disabled)
2024-03-06T10:16:51.844149+01:00 pve40 pve-ha-crm[2255]: status change lost_manager_lock => wait_for_quorum
2024-03-06T10:16:52.288005+01:00 pve40 pmxcfs[1468]: [dcdb] notice: start cluster connection
2024-03-06T10:16:52.288129+01:00 pve40 pmxcfs[1468]: [dcdb] crit: cpg_join failed: 14
2024-03-06T10:16:52.288224+01:00 pve40 pmxcfs[1468]: [dcdb] crit: can't initialize service
2024-03-06T10:16:52.288580+01:00 pve40 pve-ha-lrm[2434]: lost lock 'ha_agent_pve40_lock - cfs lock update failed - Device or resource busy
2024-03-06T10:16:52.288667+01:00 pve40 pve-ha-lrm[2434]: status change active => lost_agent_lock
2024-03-06T10:16:55.748807+01:00 pve40 pvestatd[1920]: got timeout
2024-03-06T10:16:55.921517+01:00 pve40 pvestatd[1920]: status update time (7.665 seconds)
2024-03-06T10:16:58.294858+01:00 pve40 pmxcfs[1468]: [dcdb] notice: members: 1/1468
2024-03-06T10:16:58.295113+01:00 pve40 pmxcfs[1468]: [dcdb] notice: all data is up to date
2024-03-06T10:16:59.150889+01:00 pve40 ceph-osd[2202]: 2024-03-06T10:16:59.146+0100 7464994c86c0 -1 osd.4 119768 heartbeat_check: no reply from 192.168.25.59:6845 osd.6 since back 2024-03-06T10:16:32.609829+0100 front 2024-03-06T10:16:32.609714+0100 (oldest deadline 2024-03-06T10:16:58.509809+0100)
2024-03-06T10:16:59.151064+01:00 pve40 ceph-osd[2202]: 2024-03-06T10:16:59.146+0100 7464994c86c0 -1 osd.4 119768 heartbeat_check: no reply from 192.168.25.52:6849 osd.7 since back 2024-03-06T10:16:32.610022+0100 front 2024-03-06T10:16:32.609941+0100 (oldest deadline 2024-03-06T10:16:58.509809+0100)

All of this 192.168.25.0/24 networking runs over the fiber connection, and the missing host is not reachable over that network.

I see the same on all the other nodes:

Code:
2024-03-06T10:15:27.975811+01:00 pve52 corosync[1172]:   [KNET  ] link: Resetting MTU for link 0 because host 9 joined
2024-03-06T10:15:27.976321+01:00 pve52 corosync[1172]:   [KNET  ] host: host: 9 (passive) best link: 0 (pri: 10)
2024-03-06T10:15:28.173064+01:00 pve52 corosync[1172]:   [KNET  ] pmtud: Global data MTU changed to: 1397
2024-03-06T10:15:35.485227+01:00 pve52 corosync[1172]:   [QUORUM] Sync members[6]: 1 2 3 6 7 8
2024-03-06T10:15:35.485409+01:00 pve52 corosync[1172]:   [TOTEM ] A new membership (1.1637) was formed. Members
2024-03-06T10:15:37.073265+01:00 pve52 pmxcfs[1102]: [status] notice: cpg_send_message retry 10
....

2024-03-06T10:15:46.082806+01:00 pve52 pmxcfs[1102]: [status] notice: cpg_send_message retried 100 times
....
2024-03-06T10:15:50.515973+01:00 pve52 corosync[1172]:   [TOTEM ] A new membership (1.163f) was formed. Members
....
2024-03-06T10:16:10.132608+01:00 pve52 pmxcfs[1102]: [dcdb] notice: cpg_send_message retried 100 times
2024-03-06T10:16:10.132683+01:00 pve52 pmxcfs[1102]: [dcdb] crit: cpg_send_message failed: 6
2024-03-06T10:16:10.133855+01:00 pve52 pvescheduler[14299]: jobs: cfs-lock 'file-jobs_cfg' error: got lock request timeout
....
2024-03-06T10:16:20.567571+01:00 pve52 corosync[1172]:   [QUORUM] Sync members[6]: 1 2 3 6 7 8
2024-03-06T10:16:20.567678+01:00 pve52 corosync[1172]:   [TOTEM ] A new membership (1.164f) was formed. Members
2024-03-06T10:16:20.722725+01:00 pve52 watchdog-mux[783]: client watchdog expired - disable watchdog updates
....
2024-03-06T10:16:26.188861+01:00 pve52 pmxcfs[1102]: [status] notice: cpg_send_message retry 100
2024-03-06T10:16:26.189051+01:00 pve52 pmxcfs[1102]: [status] notice: cpg_send_message retried 100 times
2024-03-06T10:16:26.189133+01:00 pve52 pmxcfs[1102]: [status] crit: cpg_send_message failed: 6
2024-03-06T10:16:27.190883+01:00 pve52 pmxcfs[1102]: [status] notice: cpg_send_message retry 10
2024-03-06T10:16:28.080496+01:00 pve52 corosync[1172]:   [QUORUM] Sync members[6]: 1 2 3 6 7 8
2024-03-06T10:16:28.080757+01:00 pve52 corosync[1172]:   [TOTEM ] A new membership (1.1653) was formed. Members
2024-03-06T10:16:28.097706+01:00 pve52 corosync[1172]:   [QUORUM] Members[6]: 1 2 3 6 7 8
2024-03-06T10:16:28.097810+01:00 pve52 corosync[1172]:   [MAIN  ] Completed service synchronization, ready to provide service.
2024-03-06T10:16:28.103836+01:00 pve52 watchdog-mux[783]: exit watchdog-mux with active connections
2024-03-06T10:16:28.104253+01:00 pve52 pve-ha-crm[1961]: loop take too long (65 seconds)
2024-03-06T10:16:28.112271+01:00 pve52 kernel: [ 1088.567679] watchdog: watchdog0: watchdog did not stop!
2024-03-06T10:16:28.113070+01:00 pve52 systemd[1]: watchdog-mux.service: Deactivated successfully.
2024-03-06T10:16:28.191956+01:00 pve52 pmxcfs[1102]: [status] notice: cpg_send_message retry 20
2024-03-06T10:16:28.192197+01:00 pve52 pmxcfs[1102]: [status] notice: cpg_send_message retried 20 times
2024-03-06T10:16:28.224320+01:00 pve52 pmxcfs[1102]: [status] notice: received log
2024-03-06T10:16:28.288817+01:00 pve52 pve-firewall[1598]: firewall update time (12.037 seconds)
2024-03-06T10:16:28.914912+01:00 pve52 pmxcfs[1102]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/410021: -1
2024-03-06T10:16:28.950254+01:00 pve52 pvestatd[1606]: status update time (52.881 seconds)
2024-03-06T10:16:29.029343+01:00 pve52 corosync[1172]:   [KNET  ] link: host: 9 link: 0 is down
2024-03-06T10:16:29.029623+01:00 pve52 corosync[1172]:   [KNET  ] host: host: 9 (passive) best link: 0 (pri: 10)
2024-03-06T10:16:29.029695+01:00 pve52 corosync[1172]:   [KNET  ] host: host: 9 has no active links
2024-03-06T10:16:29.374124+01:00 pve52 pmxcfs[1102]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/pve40/xxxxxx: -1
2024-03-06T10:16:29.375095+01:00 pve52 pmxcfs[1102]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/pve40/cephfs: -1
2024-03-06T10:16:29.375392+01:00 pve52 pmxcfs[1102]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/pve40/local-lvm: -1
....
2024-03-06T10:16:29.467283+01:00 pve52 pmxcfs[1102]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/pve52/yyyyyy: -1
2024-03-06T10:16:29.467645+01:00 pve52 pmxcfs[1102]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/pve52/xxxx: -1
2024-03-06T10:16:29.467963+01:00 pve52 pmxcfs[1102]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/pve52/cephfs: -1
2024-03-06T10:16:31.745216+01:00 pve52 pvedaemon[1670]: <root@pam> successful auth for user 'cmk@pve'

This host was reset at this time.

Does this help, or do you need more logs from other nodes? I could attach them as files.

Regards LM
 
Yes, please attach logs from all nodes as I asked, and include a few minutes before the reset.
 
Thank you for your time, Fabian.

Here are the logs; the node name is part of the filename.

Node pve58 was the one that was switched off and rebooted without ring1 and the Ceph network. I changed some storage names.
 


Could you also post the corosync.conf of each node?

The logs look rather strange: pve58 only has link 0 as up (and only 2 minutes after starting corosync!), while the others only have link 1 as up. Either something was very confused and asymmetric on the network level, or your corosync configs were mismatched.
 
Thank you for your reply.

I attached the corosync.conf. It is the same on all nodes; /etc/pve/corosync.conf has the same content as /etc/corosync/corosync.conf, as expected.

That only link0 was up on pve58 was expected: it got a new network card for "link1" and vlan.425 for Ceph, which was not configured yet at that time. The plan was to boot the node with the new network card, log in over link0's address, and reconfigure the interfaces.

Link0 and link1 should be up on all other nodes.

Two minutes after pve58 came up, link1 on the other nodes "died".

I should mention that the switch for link0 is isolated from all other networks and serves only the link0 network.

Now, with all hosts up, link0 and link1 are active on all hosts.
This is also the case when one node is completely down.

Code:
 corosync-cfgtool -s
Local node ID 3, transport knet
LINK ID 0 udp
        addr    = 192.168.24.52
        status:
                nodeid:          1:     disconnected
                nodeid:          2:     connected
                nodeid:          3:     localhost
                nodeid:          6:     connected
                nodeid:          7:     connected
                nodeid:          8:     connected
                nodeid:          9:     connected
LINK ID 1 udp
        addr    = 192.168.25.52
        status:
                nodeid:          1:     connected
                nodeid:          2:     connected
                nodeid:          3:     localhost
                nodeid:          6:     connected
                nodeid:          7:     connected
                nodeid:          8:     connected
                nodeid:          9:     connected

It looks to me as if link1 went down on all the other nodes at the time pve58 came up with only link0 active.
Thanks for your review.

Lukas
 


Here's an example from your logs:

Code:
2024-03-06T10:16:40.805471+01:00 pve40 corosync[1561]:   [KNET  ] link: host: 2 link: 1 is down
2024-03-06T10:16:40.805704+01:00 pve40 corosync[1561]:   [KNET  ] link: host: 8 link: 1 is down
2024-03-06T10:16:40.805748+01:00 pve40 corosync[1561]:   [KNET  ] link: host: 7 link: 1 is down
2024-03-06T10:16:40.805784+01:00 pve40 corosync[1561]:   [KNET  ] link: host: 6 link: 1 is down
2024-03-06T10:16:40.805825+01:00 pve40 corosync[1561]:   [KNET  ] link: host: 3 link: 1 is down
2024-03-06T10:16:40.805860+01:00 pve40 corosync[1561]:   [KNET  ] host: host: 2 (passive) best link: 1 (pri: 20)
2024-03-06T10:16:40.805894+01:00 pve40 corosync[1561]:   [KNET  ] host: host: 2 has no active links
2024-03-06T10:16:40.805943+01:00 pve40 corosync[1561]:   [KNET  ] host: host: 8 (passive) best link: 1 (pri: 20)
2024-03-06T10:16:40.805977+01:00 pve40 corosync[1561]:   [KNET  ] host: host: 8 has no active links
2024-03-06T10:16:40.806502+01:00 pve40 corosync[1561]:   [KNET  ] host: host: 7 (passive) best link: 1 (pri: 20)
2024-03-06T10:16:40.806565+01:00 pve40 corosync[1561]:   [KNET  ] host: host: 7 has no active links
2024-03-06T10:16:40.807216+01:00 pve40 corosync[1561]:   [KNET  ] host: host: 6 (passive) best link: 1 (pri: 20)
2024-03-06T10:16:40.807313+01:00 pve40 corosync[1561]:   [KNET  ] host: host: 6 has no active links
2024-03-06T10:16:40.807758+01:00 pve40 corosync[1561]:   [KNET  ] host: host: 3 (passive) best link: 1 (pri: 20)
2024-03-06T10:16:40.807843+01:00 pve40 corosync[1561]:   [KNET  ] host: host: 3 has no active links
2024-03-06T10:16:42.778745+01:00 pve40 corosync[1561]:   [TOTEM ] Token has not been received in 4687 ms

Here you can see that link 0 was not (properly) up for the rest of the cluster (no logs about link 0, and link 1 going down resulting in no more active links). Maybe earlier logs of the corosync unit can give you more information.
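
(For example, the corosync unit's journal for the window just before the reset could be pulled like this -- a sketch, adjust the time range to your incident:)

Code:
journalctl -u corosync --since "2024-03-06 10:00" --until "2024-03-06 10:20"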

Code:
2024-03-06T10:15:27.055043+01:00 pve58 corosync[1483]:   [KNET  ] rx: host: 2 link: 0 is up
2024-03-06T10:15:27.055276+01:00 pve58 corosync[1483]:   [KNET  ] link: Resetting MTU for link 0 because host 2 joined
2024-03-06T10:15:27.055391+01:00 pve58 corosync[1483]:   [KNET  ] rx: host: 7 link: 0 is up
2024-03-06T10:15:27.055511+01:00 pve58 corosync[1483]:   [KNET  ] link: Resetting MTU for link 0 because host 7 joined
2024-03-06T10:15:27.055625+01:00 pve58 corosync[1483]:   [KNET  ] rx: host: 8 link: 0 is up
2024-03-06T10:15:27.055754+01:00 pve58 corosync[1483]:   [KNET  ] link: Resetting MTU for link 0 because host 8 joined
2024-03-06T10:15:27.055863+01:00 pve58 corosync[1483]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 10)
2024-03-06T10:15:27.055967+01:00 pve58 corosync[1483]:   [KNET  ] host: host: 7 (passive) best link: 0 (pri: 10)
2024-03-06T10:15:27.056073+01:00 pve58 corosync[1483]:   [KNET  ] rx: host: 6 link: 0 is up
2024-03-06T10:15:27.056178+01:00 pve58 corosync[1483]:   [KNET  ] link: Resetting MTU for link 0 because host 6 joined
2024-03-06T10:15:27.056280+01:00 pve58 corosync[1483]:   [KNET  ] rx: host: 3 link: 0 is up
2024-03-06T10:15:27.056394+01:00 pve58 corosync[1483]:   [KNET  ] link: Resetting MTU for link 0 because host 3 joined
2024-03-06T10:15:27.056497+01:00 pve58 corosync[1483]:   [KNET  ] host: host: 8 (passive) best link: 0 (pri: 10)
2024-03-06T10:15:27.056596+01:00 pve58 corosync[1483]:   [KNET  ] host: host: 6 (passive) best link: 0 (pri: 10)
2024-03-06T10:15:27.056703+01:00 pve58 corosync[1483]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 10)
2024-03-06T10:15:27.215940+01:00 pve58 corosync[1483]:   [KNET  ] pmtud: PMTUD link change for host: 2 link: 0 from 469 to 1397
2024-03-06T10:15:27.218546+01:00 pve58 corosync[1483]:   [KNET  ] pmtud: PMTUD link change for host: 8 link: 0 from 469 to 1397
2024-03-06T10:15:27.221126+01:00 pve58 corosync[1483]:   [KNET  ] pmtud: PMTUD link change for host: 7 link: 0 from 469 to 1397
2024-03-06T10:15:27.223651+01:00 pve58 corosync[1483]:   [KNET  ] pmtud: PMTUD link change for host: 6 link: 0 from 469 to 1397
2024-03-06T10:15:27.226167+01:00 pve58 corosync[1483]:   [KNET  ] pmtud: PMTUD link change for host: 3 link: 0 from 469 to 1397
2024-03-06T10:15:27.226334+01:00 pve58 corosync[1483]:   [KNET  ] pmtud: Global data MTU changed to: 1397

This up state is not reflected on the other nodes. Since a link being marked as up entails UDP packets flowing back and forth, this is very strange and points to some network issue if the configs were matching.
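
One way to cross-check this (a sketch, using the placeholder ring0 addresses from above -- pve58's is assumed to be 192.168.xx.58) is to verify on every node that the configs really are identical and that plain traffic passes on the ring0 network with the expected MTU:

Code:
# run on every node: the config hashes should be identical cluster-wide
sha256sum /etc/corosync/corosync.conf /etc/pve/corosync.conf
# probe the ring0 path to the rejoined node with the don't-fragment bit set
# (1472 bytes of payload corresponds to a standard 1500-byte MTU)
ping -M do -s 1472 -c 3 192.168.xx.58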
 
Since we had this reset twice, I analyzed the log from one node in more detail:

I see

Code:
2024-03-06T10:00:12.396742+01:00 pve52 corosync[1172]:   [KNET  ] link: host: 9 link: 0 is down
2024-03-06T10:00:12.396906+01:00 pve52 corosync[1172]:   [KNET  ] host: host: 9 (passive) best link: 0 (pri: 10)
2024-03-06T10:00:12.396989+01:00 pve52 corosync[1172]:   [KNET  ] host: host: 9 has no active links
-- ok, host 9 is down indeed ---


Code:
2024-03-06T10:00:21.202742+01:00 pve52 corosync[1172]:   [KNET  ] rx: host: 2 link: 0 is up
2024-03-06T10:00:21.202876+01:00 pve52 corosync[1172]:   [KNET  ] link: Resetting MTU for link 0 because host 2 joined
2024-03-06T10:00:21.202958+01:00 pve52 corosync[1172]:   [KNET  ] host: host: 2 (passive) best link: 1 (pri: 20)
2024-03-06T10:00:21.217849+01:00 pve52 corosync[1172]:   [KNET  ] pmtud: PMTUD link change for host: 2 link: 0 from 469 to 1397
... here I see link 0 and link 1 up, for example for node 2

Later -- host 9 was started -- and there is no additional log line on pve52:

Code:
2024-03-06T10:15:27.975811+01:00 pve52 corosync[1172]:   [KNET  ] link: Resetting MTU for link 0 because host 9 joined
2024-03-06T10:15:27.976321+01:00 pve52 corosync[1172]:   [KNET  ] host: host: 9 (passive) best link: 0 (pri: 10)
2024-03-06T10:15:28.173064+01:00 pve52 corosync[1172]:   [KNET  ] pmtud: Global data MTU changed to: 1397
2024-03-06T10:15:35.485227+01:00 pve52 corosync[1172]:   [QUORUM] Sync members[6]: 1 2 3 6 7 8

... and then, the next log entries in syslog are:
Code:
2024-03-06T10:15:35.485227+01:00 pve52 corosync[1172]:   [QUORUM] Sync members[6]: 1 2 3 6 7 8
2024-03-06T10:15:35.485409+01:00 pve52 corosync[1172]:   [TOTEM ] A new membership (1.1637) was formed. Members
2024-03-06T10:15:37.073265+01:00 pve52 pmxcfs[1102]: [status] notice: cpg_send_message retry 10
2024-03-06T10:15:38.074518+01:00 pve52 pmxcfs[1102]: [status] notice: cpg_send_message retry 20
2024-03-06T10:15:39.075291+01:00 pve52 pmxcfs[1102]: [status] notice: cpg_send_message retry 30
2024-03-06T10:15:40.076195+01:00 pve52 pmxcfs[1102]: [status] notice: cpg_send_message retry 40
2024-03-06T10:15:41.077063+01:00 pve52 pmxcfs[1102]: [status] notice: cpg_send_message retry 50
2024-03-06T10:15:42.078026+01:00 pve52 pmxcfs[1102]: [status] notice: cpg_send_message retry 60
2024-03-06T10:15:42.997537+01:00 pve52 corosync[1172]:   [QUORUM] Sync members[6]: 1 2 3 6 7 8
2024-03-06T10:15:42.997674+01:00 pve52 corosync[1172]:   [TOTEM ] A new membership (1.163b) was formed. Members
2024-03-06T10:15:43.079037+01:00 pve52 pmxcfs[1102]: [status] notice: cpg_send_message retry 70
2024-03-06T10:15:44.080108+01:00 pve52 pmxcfs[1102]: [status] notice: cpg_send_message retry 80
2024-03-06T10:15:45.081176+01:00 pve52 pmxcfs[1102]: [status] notice: cpg_send_message retry 90
2024-03-06T10:15:46.082631+01:00 pve52 pmxcfs[1102]: [status] notice: cpg_send_message retry 100
2024-03-06T10:15:46.082806+01:00 pve52 pmxcfs[1102]: [status] notice: cpg_send_message retried 100 times

So what I see is: as soon as host 9 comes up with link 0, the cpg_send_message retries start, but host 9 never joins the quorum.
But why are the messages on link 1 no longer transmitted?

So do you think changing the switch on link 0 could solve the problem?

Regards

Lukas
 
It seems to me that something with link0 is broken, but I can't tell you what. Any weird routes in place? MTU/VLAN settings that are out of sync?
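
(A few quick things that could be checked for that -- a sketch; the address is the placeholder ring0 IP of pve58 and interface details depend on your setup:)

Code:
# routing decision for the rejoined node's ring0 address
ip route get 192.168.xx.58
# MTU and VLAN details of all interfaces (compare across nodes)
ip -d link show
# per-link connectivity as seen by corosync on each node
corosync-cfgtool -s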
 
