3-node cluster cold start: "TASK ERROR: cluster not ready - no quorum?" for "Start at Boot=Yes"

fdcastel
I have a 3-node cluster running PVE 8.2.4 and recently observed an intermittent issue.

SOMETIMES after a power outage, SOME LXC containers with "Start at Boot=Yes" fail to start.

Upon reviewing the logs, I found that a task called "Bulk start VMs and Containers" had failed with: TASK ERROR: cluster not ready - no quorum?

I understand that this behavior may (and likely would) occur during the cluster startup process.

However, shouldn't Proxmox automatically retry as soon as the other nodes have joined the cluster?

Better yet: wouldn't it be more effective to delay the startup of VMs and containers until the cluster is confirmed to be fully ready?
 
if you want to ensure a guest is running, you need to set up HA. if you regularly do cluster cold starts, you can also configure a delay for guests started on boot.
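
for example, something like this (a sketch - ct:214 is one of the affected containers from this thread, and the startall-onboot-delay option should be double-checked against your PVE version):

Code:
# make the container HA-managed so the cluster (re)starts it once quorate
ha-manager add ct:214 --state started

# or: delay the bulk "startall" on boot, e.g. by 60 seconds,
# via /etc/pve/datacenter.cfg:
startall-onboot-delay: 60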

the "startall" task actually waits for quorum - but it aborts if it is lost before the last guest has been started. could you maybe post the full output of that task and the journal of all nodes covering bootup until that task has finished?
 
I had 2 power outages in the last few days. In both of them, one or more LXC containers failed to start.

I don't intend to keep restarting the cluster, but I want the assurance that all services will come back online after a power failure.

Adding a delay seems to be a workaround. I wish to bring all services online as soon as possible.


The task log shows (screenshot attached):


The output of the failed startall tasks is just:


(DEC 1 22:12:48)
Code:
Starting CT 214


TASK ERROR: cluster not ready - no quorum?



(DEC 3 19:38:26)
Code:
Starting CT 210
Starting CT 211
Starting CT 212
Starting CT 213
Starting CT 215


TASK ERROR: cluster not ready - no quorum?



Attached to this message I'm sending the output of journalctl -b -1 for the 3 nodes.
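
That is, on each node (the output file name is just illustrative):

Code:
journalctl -b -1 > journal-$(hostname).txt   # -b -1 = previous boot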




pvecm status
Code:
Cluster information
-------------------
Name:             local-group
Config Version:   3
Transport:        knet
Secure auth:      on


Quorum information
------------------
Date:             Wed Dec  4 14:07:41 2024
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          0x00000003
Ring ID:          1.16d
Quorate:          Yes


Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2
Flags:            Quorate


Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 192.168.10.202
0x00000002          1 192.168.10.203
0x00000003          1 192.168.10.204 (local)
 


filtered the logs a bit:
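(the leading numbers are line numbers within each journal; something along these lines produces the view below - the exact pattern is a guess:)

Code:
grep -nE 'corosync|pmxcfs|pve-guests|e1000e|vmbr0' journal-antares.txt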

a (antares):

Code:
1427:Dec 01 22:12:38 antares corosync[1654]:   [QUORUM] Members[3]: 1 2 3
1428:Dec 01 22:12:38 antares corosync[1654]:   [MAIN  ] Completed service synchronization, ready to provide service.

1445:Dec 01 22:12:39 antares corosync[1654]:   [KNET  ] pmtud: PMTUD link change for host: 3 link: 0 from 469 to 1397
1446:Dec 01 22:12:39 antares corosync[1654]:   [KNET  ] pmtud: Global data MTU changed to: 1397

initially, quorum is established and pmxcfs synced up.

now the service that starts all guests configured for bootup begins its tasks:
Code:
1447:Dec 01 22:12:39 antares pve-guests[1799]: <root@pam> starting task UPID:antares:00000708:0000093D:674D0987:startall::root@pam:
1449:Dec 01 22:12:39 antares pve-guests[1800]: <root@pam> starting task UPID:antares:00000709:0000093E:674D0987:vzstart:210:root@pam:
1450:Dec 01 22:12:39 antares pve-guests[1801]: starting CT 210: UPID:antares:00000709:0000093E:674D0987:vzstart:210:root@pam:
1453:Dec 01 22:12:39 antares kernel: e1000e 0000:00:1f.6 enp0s31f6: NIC Link is Down    <======
1454:Dec 01 22:12:39 antares kernel: vmbr0: port 1(enp0s31f6) entered disabled state
1455:Dec 01 22:12:39 antares kernel: vmbr0v40: port 1(enp0s31f6.40) entered disabled state
1465:Dec 01 22:12:40 antares pve-guests[1800]: <root@pam> starting task UPID:antares:000007A1:000009A3:674D0988:vzstart:211:root@pam:
1466:Dec 01 22:12:40 antares pve-guests[1953]: starting CT 211: UPID:antares:000007A1:000009A3:674D0988:vzstart:211:root@pam:
1591:Dec 01 22:12:41 antares corosync[1654]:   [KNET  ] link: host: 3 link: 0 is down    <======
1592:Dec 01 22:12:41 antares corosync[1654]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
1593:Dec 01 22:12:41 antares corosync[1654]:   [KNET  ] host: host: 3 has no active links
1595:Dec 01 22:12:41 antares pve-guests[1800]: <root@pam> starting task UPID:antares:0000090B:00000A07:674D0989:vzstart:212:root@pam:
1596:Dec 01 22:12:41 antares pve-guests[2315]: starting CT 212: UPID:antares:0000090B:00000A07:674D0989:vzstart:212:root@pam:
1706:Dec 01 22:12:41 antares corosync[1654]:   [KNET  ] link: host: 2 link: 0 is down  <======
1707:Dec 01 22:12:41 antares corosync[1654]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
1708:Dec 01 22:12:41 antares corosync[1654]:   [KNET  ] host: host: 2 has no active links
1710:Dec 01 22:12:41 antares corosync[1654]:   [TOTEM ] Token has not been received in 2737 ms
1717:Dec 01 22:12:42 antares pve-guests[1800]: <root@pam> starting task UPID:antares:00000A85:00000A6C:674D098A:vzstart:213:root@pam:
1718:Dec 01 22:12:42 antares pve-guests[2693]: starting CT 213: UPID:antares:00000A85:00000A6C:674D098A:vzstart:213:root@pam:
1829:Dec 01 22:12:42 antares corosync[1654]:   [TOTEM ] A processor failed, forming new configuration: token timed out (3650ms), waiting 4380ms for consensus.

but while that is going on, the corosync link goes down! (see lines marked with <=====)

the link does come up basically right away, but it takes time for corosync to re-establish quorum:
Code:
1835:Dec 01 22:12:43 antares kernel: e1000e 0000:00:1f.6 enp0s31f6: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
1841:Dec 01 22:12:43 antares pve-guests[1800]: <root@pam> starting task UPID:antares:00000BC5:00000AD0:674D098B:vzstart:215:root@pam:
1842:Dec 01 22:12:43 antares pve-guests[3013]: starting CT 215: UPID:antares:00000BC5:00000AD0:674D098B:vzstart:215:root@pam:
1953:Dec 01 22:12:47 antares corosync[1654]:   [QUORUM] Sync members[1]: 1
1954:Dec 01 22:12:47 antares corosync[1654]:   [QUORUM] Sync left[2]: 2 3
1955:Dec 01 22:12:47 antares corosync[1654]:   [TOTEM ] A new membership (1.14c) was formed. Members left: 2 3
1956:Dec 01 22:12:47 antares corosync[1654]:   [TOTEM ] Failed to receive the leave message. failed: 2 3
1957:Dec 01 22:12:47 antares corosync[1654]:   [QUORUM] This node is within the non-primary component and will NOT provide any services.
1958:Dec 01 22:12:47 antares corosync[1654]:   [QUORUM] Members[1]: 1
1959:Dec 01 22:12:47 antares corosync[1654]:   [MAIN  ] Completed service synchronization, ready to provide service.
1960:Dec 01 22:12:47 antares pmxcfs[1553]: [status] notice: node lost quorum
1961:Dec 01 22:12:47 antares pmxcfs[1553]: [dcdb] notice: members: 1/1553
1962:Dec 01 22:12:47 antares pmxcfs[1553]: [status] notice: members: 1/1553
1963:Dec 01 22:12:47 antares pmxcfs[1553]: [dcdb] crit: received write while not quorate - trigger resync     
1964:Dec 01 22:12:47 antares pmxcfs[1553]: [dcdb] crit: leaving CPG group
1978:Dec 01 22:12:47 antares pmxcfs[1553]: [dcdb] notice: start cluster connection
1979:Dec 01 22:12:47 antares pmxcfs[1553]: [dcdb] crit: cpg_join failed: 14
1980:Dec 01 22:12:47 antares pmxcfs[1553]: [dcdb] crit: can't initialize service
1981:Dec 01 22:12:48 antares pve-guests[1800]: cluster not ready - no quorum?
1983:Dec 01 22:12:48 antares pve-guests[1799]: <root@pam> end task UPID:antares:00000708:0000093D:674D0987:startall::root@pam: cluster not ready - no quorum?
1984:Dec 01 22:12:48 antares systemd[1]: Finished pve-guests.service - PVE guests.

and as a result the task fails once the next quorum check fails..
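
as an aside, the timeouts in these logs line up with corosync's defaults for a 3-node cluster (if I read corosync.conf(5) correctly):

Code:
token     = 3000ms + (nodes - 2) * 650ms = 3650ms   # base + token_coefficient
consensus = 1.2 * token                  = 4380ms

so from the last received token it takes token + consensus, roughly 8 seconds, until the new single-node membership is formed and quorum is declared lost - which matches the link dropping at 22:12:39/41 and the quorum loss only showing up at 22:12:47.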
 
b (betelgeuse):

situation on betelgeuse is similar - first, quorum is established, then the link goes down
Code:
1460:Dec 01 22:12:38 betelgeuse corosync[1662]:   [QUORUM] Sync members[3]: 1 2 3
1461:Dec 01 22:12:38 betelgeuse corosync[1662]:   [QUORUM] Sync joined[1]: 3
1462:Dec 01 22:12:38 betelgeuse corosync[1662]:   [TOTEM ] A new membership (1.148) was formed. Members joined: 3
1463:Dec 01 22:12:38 betelgeuse pmxcfs[1562]: [dcdb] notice: members: 1/1553, 2/1562, 3/1035
1464:Dec 01 22:12:38 betelgeuse pmxcfs[1562]: [dcdb] notice: starting data syncronisation
1465:Dec 01 22:12:38 betelgeuse pmxcfs[1562]: [status] notice: members: 1/1553, 2/1562, 3/1035
1466:Dec 01 22:12:38 betelgeuse pmxcfs[1562]: [status] notice: starting data syncronisation
1467:Dec 01 22:12:38 betelgeuse corosync[1662]:   [QUORUM] Members[3]: 1 2 3
1468:Dec 01 22:12:38 betelgeuse corosync[1662]:   [MAIN  ] Completed service synchronization, ready to provide service.
1486:Dec 01 22:12:39 betelgeuse kernel: e1000e 0000:00:1f.6 enp0s31f6: NIC Link is Down
1488:Dec 01 22:12:39 betelgeuse pve-guests[1760]: <root@pam> starting task UPID:betelgeuse:0000078B:00000833:674D0987:vzstart:217:root@pam:
1489:Dec 01 22:12:39 betelgeuse pve-guests[1931]: starting CT 217: UPID:betelgeuse:0000078B:00000833:674D0987:vzstart:217:root@pam:
1610:Dec 01 22:12:39 betelgeuse kernel: vmbr0v40: port 1(enp0s31f6.40) entered disabled state
1617:Dec 01 22:12:40 betelgeuse pve-guests[1760]: <root@pam> starting task UPID:betelgeuse:000008D8:00000898:674D0988:vzstart:218:root@pam:
1618:Dec 01 22:12:40 betelgeuse pve-guests[2264]: starting CT 218: UPID:betelgeuse:000008D8:00000898:674D0988:vzstart:218:root@pam:
1622:Dec 01 22:12:40 betelgeuse corosync[1662]:   [KNET  ] link: host: 1 link: 0 is down
1623:Dec 01 22:12:40 betelgeuse corosync[1662]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
1624:Dec 01 22:12:40 betelgeuse corosync[1662]:   [KNET  ] host: host: 1 has no active links
1626:Dec 01 22:12:41 betelgeuse corosync[1662]:   [KNET  ] link: host: 3 link: 0 is down
1627:Dec 01 22:12:41 betelgeuse corosync[1662]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
1628:Dec 01 22:12:41 betelgeuse corosync[1662]:   [KNET  ] host: host: 3 has no active links
1635:Dec 01 22:12:41 betelgeuse pve-guests[1760]: <root@pam> starting task UPID:betelgeuse:000009DD:000008FC:674D0989:vzstart:219:root@pam:
1636:Dec 01 22:12:41 betelgeuse pve-guests[2525]: starting CT 219: UPID:betelgeuse:000009DD:000008FC:674D0989:vzstart:219:root@pam:
1748:Dec 01 22:12:41 betelgeuse corosync[1662]:   [TOTEM ] Token has not been received in 2737 ms
1757:Dec 01 22:12:42 betelgeuse pve-guests[1760]: <root@pam> starting task UPID:betelgeuse:00000B2E:00000961:674D098A:vzstart:220:root@pam:
1758:Dec 01 22:12:42 betelgeuse pve-guests[2862]: starting CT 220: UPID:betelgeuse:00000B2E:00000961:674D098A:vzstart:220:root@pam:
1759:Dec 01 22:12:42 betelgeuse corosync[1662]:   [TOTEM ] A processor failed, forming new configuration: token timed out (3650ms), waiting 4380ms for consensus.
1764:Dec 01 22:12:42 betelgeuse kernel: e1000e 0000:00:1f.6 enp0s31f6: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
1769:Dec 01 22:12:46 betelgeuse corosync[1662]:   [QUORUM] Sync members[1]: 2
1770:Dec 01 22:12:46 betelgeuse corosync[1662]:   [QUORUM] Sync left[2]: 1 3
1771:Dec 01 22:12:46 betelgeuse corosync[1662]:   [TOTEM ] A new membership (2.14c) was formed. Members left: 1 3
1772:Dec 01 22:12:46 betelgeuse corosync[1662]:   [TOTEM ] Failed to receive the leave message. failed: 1 3
1773:Dec 01 22:12:46 betelgeuse corosync[1662]:   [QUORUM] This node is within the non-primary component and will NOT provide any services.
1774:Dec 01 22:12:46 betelgeuse corosync[1662]:   [QUORUM] Members[1]: 2
1775:Dec 01 22:12:46 betelgeuse corosync[1662]:   [MAIN  ] Completed service synchronization, ready to provide service.
1776:Dec 01 22:12:46 betelgeuse pmxcfs[1562]: [status] notice: node lost quorum
1777:Dec 01 22:12:46 betelgeuse pmxcfs[1562]: [dcdb] notice: members: 2/1562
1778:Dec 01 22:12:46 betelgeuse pmxcfs[1562]: [status] notice: members: 2/1562
1779:Dec 01 22:12:46 betelgeuse pmxcfs[1562]: [dcdb] crit: received write while not quorate - trigger resync
1780:Dec 01 22:12:46 betelgeuse pmxcfs[1562]: [dcdb] crit: leaving CPG group
1906:Dec 01 22:12:47 betelgeuse pve-guests[1758]: <root@pam> end task UPID:betelgeuse:000006E0:000007CD:674D0986:startall::root@pam: OK
1907:Dec 01 22:12:47 betelgeuse systemd[1]: Finished pve-guests.service - PVE guests.
1909:Dec 01 22:12:47 betelgeuse pmxcfs[1562]: [dcdb] notice: start cluster connection
1910:Dec 01 22:12:47 betelgeuse pmxcfs[1562]: [dcdb] crit: cpg_join failed: 14
1911:Dec 01 22:12:47 betelgeuse pmxcfs[1562]: [dcdb] crit: can't initialize service
1919:Dec 01 22:12:47 betelgeuse systemd[1]: Startup finished in 3.611s (kernel) + 25.828s (userspace) = 29.439s.
1920:Dec 01 22:12:51 betelgeuse corosync[1662]:   [KNET  ] rx: host: 1 link: 0 is up
1921:Dec 01 22:12:51 betelgeuse corosync[1662]:   [KNET  ] link: Resetting MTU for link 0 because host 1 joined

but in this case, the timing happens to be just right, so the startall task doesn't even notice..

c (canopus):

Code:
1298:Dec 01 22:11:52 canopus systemd[1]: Started corosync.service - Corosync Cluster Engine.
1316:Dec 01 22:11:57 canopus pmxcfs[1035]: [status] notice: update cluster info (cluster name  local-group, version = 3)
1317:Dec 01 22:11:57 canopus pmxcfs[1035]: [dcdb] notice: members: 3/1035
1318:Dec 01 22:11:57 canopus pmxcfs[1035]: [dcdb] notice: all data is up to date
1319:Dec 01 22:11:57 canopus pmxcfs[1035]: [status] notice: members: 3/1035
1320:Dec 01 22:11:57 canopus pmxcfs[1035]: [status] notice: all data is up to date
1321:Dec 01 22:12:23 canopus pvecm[1188]: got timeout when trying to ensure cluster certificates and base file hierarchy is set up - no quorum (yet) or hung pmxcfs?
1338:Dec 01 22:12:24 canopus systemd[1]: Starting pve-guests.service - PVE guests...
1339:Dec 01 22:12:24 canopus pve-guests[1389]: <root@pam> starting task UPID:canopus:00000572:00000F4F:674D0978:startall::root@pam:
1341:Dec 01 22:12:33 canopus kernel: e1000e 0000:00:1f.6 enp0s31f6: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
1342:Dec 01 22:12:33 canopus kernel: vmbr0: port 1(enp0s31f6) entered blocking state
1343:Dec 01 22:12:33 canopus kernel: vmbr0: port 1(enp0s31f6) entered forwarding state
1344:Dec 01 22:12:33 canopus kernel: vmbr0v40: port 1(enp0s31f6.40) entered blocking state
1345:Dec 01 22:12:33 canopus kernel: vmbr0v40: port 1(enp0s31f6.40) entered forwarding state
1346:Dec 01 22:12:38 canopus corosync[1133]:   [KNET  ] link: Resetting MTU for link 0 because host 2 joined
1347:Dec 01 22:12:38 canopus corosync[1133]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
1348:Dec 01 22:12:38 canopus corosync[1133]:   [KNET  ] pmtud: PMTUD link change for host: 2 link: 0 from 469 to 1397
1349:Dec 01 22:12:38 canopus corosync[1133]:   [KNET  ] pmtud: Global data MTU changed to: 1397
1350:Dec 01 22:12:38 canopus corosync[1133]:   [KNET  ] rx: host: 1 link: 0 is up
1351:Dec 01 22:12:38 canopus corosync[1133]:   [KNET  ] link: Resetting MTU for link 0 because host 1 joined
1352:Dec 01 22:12:38 canopus corosync[1133]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
1353:Dec 01 22:12:38 canopus corosync[1133]:   [QUORUM] Sync members[3]: 1 2 3
1354:Dec 01 22:12:38 canopus corosync[1133]:   [QUORUM] Sync joined[2]: 1 2
1355:Dec 01 22:12:38 canopus corosync[1133]:   [TOTEM ] A new membership (1.148) was formed. Members joined: 1 2
1356:Dec 01 22:12:38 canopus pmxcfs[1035]: [dcdb] notice: members: 1/1553, 2/1562, 3/1035
1357:Dec 01 22:12:38 canopus pmxcfs[1035]: [dcdb] notice: starting data syncronisation
1358:Dec 01 22:12:38 canopus pmxcfs[1035]: [status] notice: members: 1/1553, 2/1562, 3/1035
1359:Dec 01 22:12:38 canopus pmxcfs[1035]: [status] notice: starting data syncronisation
1360:Dec 01 22:12:38 canopus corosync[1133]:   [QUORUM] This node is within the primary component and will provide service.
1361:Dec 01 22:12:38 canopus corosync[1133]:   [QUORUM] Members[3]: 1 2 3
1362:Dec 01 22:12:38 canopus corosync[1133]:   [MAIN  ] Completed service synchronization, ready to provide service.
1363:Dec 01 22:12:38 canopus pmxcfs[1035]: [status] notice: node has quorum
1364:Dec 01 22:12:38 canopus pmxcfs[1035]: [dcdb] notice: received sync request (epoch 1/1553/00000003)
1365:Dec 01 22:12:38 canopus pmxcfs[1035]: [status] notice: received sync request (epoch 1/1553/00000003)
1366:Dec 01 22:12:38 canopus pmxcfs[1035]: [dcdb] notice: received all states
1367:Dec 01 22:12:38 canopus pmxcfs[1035]: [dcdb] notice: leader is 1/1553
1368:Dec 01 22:12:38 canopus pmxcfs[1035]: [dcdb] notice: synced members: 1/1553, 2/1562
1369:Dec 01 22:12:38 canopus pmxcfs[1035]: [dcdb] notice: waiting for updates from leader
1370:Dec 01 22:12:38 canopus pmxcfs[1035]: [status] notice: received all states
1371:Dec 01 22:12:38 canopus pmxcfs[1035]: [status] notice: all data is up to date
1372:Dec 01 22:12:38 canopus pmxcfs[1035]: [status] notice: dfsm_deliver_queue: queue length 3
1373:Dec 01 22:12:38 canopus pmxcfs[1035]: [status] notice: received log
1374:Dec 01 22:12:38 canopus pmxcfs[1035]: [main] notice: ignore insert of duplicate cluster log
1375:Dec 01 22:12:38 canopus pmxcfs[1035]: [dcdb] notice: update complete - trying to commit (got 7 inode updates)
1376:Dec 01 22:12:38 canopus pmxcfs[1035]: [dcdb] notice: all data is up to date
1377:Dec 01 22:12:39 canopus kernel: e1000e 0000:00:1f.6 enp0s31f6: NIC Link is Down
1378:Dec 01 22:12:39 canopus kernel: vmbr0: port 1(enp0s31f6) entered disabled state
1379:Dec 01 22:12:39 canopus kernel: vmbr0v40: port 1(enp0s31f6.40) entered disabled state
1381:Dec 01 22:12:39 canopus pve-guests[1389]: <root@pam> end task UPID:canopus:00000572:00000F4F:674D0978:startall::root@pam: OK
1382:Dec 01 22:12:39 canopus systemd[1]: Finished pve-guests.service - PVE guests.
1384:Dec 01 22:12:40 canopus corosync[1133]:   [KNET  ] link: host: 2 link: 0 is down
1385:Dec 01 22:12:40 canopus corosync[1133]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
1386:Dec 01 22:12:40 canopus corosync[1133]:   [KNET  ] host: host: 2 has no active links
1387:Dec 01 22:12:41 canopus corosync[1133]:   [KNET  ] link: host: 1 link: 0 is down
1388:Dec 01 22:12:41 canopus corosync[1133]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
1389:Dec 01 22:12:41 canopus corosync[1133]:   [KNET  ] host: host: 1 has no active links
1390:Dec 01 22:12:41 canopus corosync[1133]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
1391:Dec 01 22:12:41 canopus corosync[1133]:   [KNET  ] host: host: 1 has no active links
1392:Dec 01 22:12:41 canopus corosync[1133]:   [TOTEM ] Token has not been received in 2737 ms
1393:Dec 01 22:12:42 canopus corosync[1133]:   [TOTEM ] A processor failed, forming new configuration: token timed out (3650ms), waiting 4380ms for consensus.
1394:Dec 01 22:12:42 canopus kernel: e1000e 0000:00:1f.6 enp0s31f6: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None

I suspect the startall task here logged that it was waiting for quorum - but this node had no guests to start?

in any case, your corosync link seems to have an issue (at least during bootup - is the switch connected to the same power supply as the servers and maybe resetting itself or interrupting the links while it is booting?)
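
one thing you could check is whether the links also flap during normal operation (a sketch, using the NIC name from your logs):

Code:
# kernel link messages for the corosync NIC, current boot
journalctl -k | grep -i 'enp0s31f6.*Link is'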
 
Thank you, Fabian, for the valuable information. I have now a better understanding of the startup process.

Indeed, the C node does not have any guests. It is a less powerful node (i3-13100T with 8GB RAM) primarily designated for backups (PBS) and as a third vote for Corosync quorum. I might consider using it for a lightweight LXC container in the future.

It is now clear that the Corosync link is failing during the cold start process. But I can't imagine WHY this is happening. This would be a hard one to debug:

- All three nodes are connected to the same switch, in the same VLAN, using 1 Gbps Ethernet.

- The switch is a USW Unifi Pro 48 PoE, which takes some time to boot up (like most managed switches; exact duration not measured). However, once operational, it should not cause any interruptions during data transfer.

- The nodes are relatively new Dell OptiPlex machines (two with an i5-13500 and 32GB RAM, and one with an i3-13100T and 8GB RAM), all equipped with Intel I219-LM network adapters, as reported by lshw.
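
(for reference, a command along these lines lists the adapters:)

Code:
lshw -class network -short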

Thank you once again for the valuable tips. If you have any additional insights on how I might debug this issue, I would greatly appreciate it.
 
the switch is also powering up "in parallel"? then it's most likely that the ports get reset as part of that..
 
Sorry if I'm not fully understanding your meaning, but: yes. Once power is restored, all components start simultaneously, including the switch, router, WAN gateway, and the three nodes.

To the best of my understanding, the switch takes some time to start forwarding Ethernet packets. But once it has started, it should not interrupt the flow anymore (nor reset any ports). Please correct me if I'm wrong.
 
it maybe should, but from the logs I think it doesn't. or your network setup is unstable in general (if you see similar link down/up events happening regularly during operations, that might be the case). it could also be that the switch is not resetting the ports per se, but that there is too much traffic during the cold start to stay within corosync's latency requirements, and that the "link down" is not the physical link going away, but a heartbeat packet not being acknowledged (in time) because of high load.
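
to narrow that down, you could watch the link state and knet statistics during normal operation (a sketch - the exact stats key names may differ between corosync versions):

Code:
# link status as corosync/knet currently sees it
corosync-cfgtool -s

# per-link knet counters, e.g. how often a link was seen going down
corosync-cmapctl -m stats | grep -E 'down_count|latency_ave'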
 
