[SOLVED] Cluster with redundant Corosync networks reboots as soon as I join a new node

if you disarm HA, I'd do it on all nodes of the cluster.
 
Ok, thanks. I'll do it by stopping the daemons just like yesterday (pve-ha-lrm first and pve-ha-crm after) on all nodes.

Once I'm sure corosync stays up, I can re-enable them.

Right?
 
yes, if corosync is stable you can start them again.
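For reference, the disarm and re-arm steps discussed above could be sketched as two small shell functions, to be run as root on every node. The stop order (pve-ha-lrm first, pve-ha-crm after) comes from the post above. The start order is not stated in the thread and is only an assumption, and `RUN` is a hypothetical dry-run hook added for illustration.

```shell
#!/bin/sh
# Sketch only: run as root on EVERY node of the cluster.
# RUN is a hypothetical dry-run hook (not a Proxmox feature):
# set RUN=echo to print the commands instead of executing them.
RUN="${RUN:-}"

ha_disarm() {
    # Stop order as described in the thread: LRM first, then CRM.
    $RUN systemctl stop pve-ha-lrm
    $RUN systemctl stop pve-ha-crm
}

ha_rearm() {
    # Assumption: start in the reverse order of the stop sequence.
    # The thread only says the daemons can be started again once
    # corosync is stable; it does not prescribe an order.
    $RUN systemctl start pve-ha-crm
    $RUN systemctl start pve-ha-lrm
}
```

Calling `ha_disarm` with `RUN=echo` set first prints the two stop commands without touching the services, which is a cheap way to double-check the order before running it for real.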
 
Hi @fabian ,

I successfully added the new node to the cluster with HA disarmed. Attached are the corosync logs from an existing node (proxnode01) and from the new node (proxnode18).

A few minutes have passed and I don't see any new retransmit list entries, so it looks OK to me. Do you notice any critical issues? Can I try re-enabling HA?
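A quick way to check for retransmits without reading the whole log could look like the sketch below. It assumes corosync reports them in lines containing the string "Retransmit List"; the function name `count_retransmits` is made up for this example.

```shell
#!/bin/sh
# Sketch only: counts log lines on stdin that mention a retransmit list.
# Assumes such events are logged with the text "Retransmit List".
count_retransmits() {
    # grep -c prints the number of matching lines; it exits non-zero
    # when there are none, so "|| true" keeps the function exit status 0.
    grep -c 'Retransmit List' || true
}
```

On a node, this could be fed from the journal, e.g. `journalctl -u corosync --since "10 minutes ago" --no-pager | count_retransmits`. A result of 0 over a few minutes after the join matches what is described above.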

Thank you
 


you can verify with pvecm status and by attempting a write on /etc/pve (e.g., a simple touch /etc/pve/datacenter.cfg should return immediately), but yeah, the logs look okay AFAICT.
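The write test suggested above could be wrapped in a small function so that a hanging /etc/pve does not block the shell. This is only a sketch: the function name `check_pmxcfs_write` and the 5-second limit are assumptions for illustration, and it relies on the coreutils `timeout` command.

```shell
#!/bin/sh
# Sketch only: touch a file and report whether the write came back in time.
#   $1  file to touch (defaults to the path suggested in the thread)
#   $2  seconds to wait before giving up (5 is an arbitrary assumption)
check_pmxcfs_write() {
    target="${1:-/etc/pve/datacenter.cfg}"
    limit="${2:-5}"
    if timeout "$limit" touch "$target"; then
        echo "write OK"
    else
        # Either touch failed (e.g. read-only filesystem) or it was
        # killed by timeout because it did not return in time.
        echo "write failed or hung"
        return 1
    fi
}
```

Running `pvecm status` first and then `check_pmxcfs_write` with no arguments covers both checks mentioned above.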
 
Hi @fabian , sorry for the late reply. I confirm everything went fine. I'm marking the thread as [SOLVED].

Thank you very much again and again! :)
 
@fabian just one last question, I promise! :)

I compared the old logs (from 10/19 and 10/20) with the most recent ones, and I'm pretty sure I can now add new nodes without problems, even without the extra precaution of disabling and re-enabling HA.

Can you confirm?

Thank you.
 
disarming HA doesn't really hurt: joining a node is a quick operation, and you are in front of the machine monitoring it, so the danger of not having HA for that period is small compared to a bug breaking something so fundamental that *all* of the cluster has an outage like you previously experienced. but of course, most cluster joins just work, with or without HA disarmed, so the choice is up to you.
 