Hello folks,
the situation is as follows:
Existing cluster with 3 nodes - has been running for years - recently upgraded to the latest Debian/Proxmox release.
Now the following happened:
We added 4 new nodes in another geographical area, connected via a dark fibre VPN. Everything was still working like a charm. Every node has a dedicated management interface, a dedicated SAN interface and one for VM traffic. Everything is fine here; every server can reach every other server via ping, ssh etc.
Then I migrated some VMs from one of the "old" nodes to the new nodes to spread them out - everything still worked.
And here is the problem:
I shut down one of the older nodes because we need to swap the network cards there. Of course I migrated all VMs off it before doing that. But suddenly the whole cluster became unresponsive, the VMs were shut down and could not be started again until we powered the shut-down node back on.
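For completeness, the maintenance steps were roughly the following (the VM ID and target node name are just placeholder examples, not the actual ones):

Code:
# live-migrate each VM off the node that will be shut down
qm migrate 101 dus2 --online
# once the node is empty, power it off for the card swap
shutdown -h now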
The only real error I got:
Code:
2023-11-09T14:06:57.070785+01:00 dus1 corosync[2182]: [KNET ] link: host: 1 link: 0 is down
2023-11-09T14:06:57.071127+01:00 dus1 corosync[2182]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
2023-11-09T14:06:57.071198+01:00 dus1 corosync[2182]: [KNET ] host: host: 1 has no active links
2023-11-09T14:06:58.287294+01:00 dus1 pve-ha-crm[2238]: status change slave => wait_for_quorum
2023-11-09T14:07:01.450967+01:00 dus1 pve-ha-lrm[2249]: lost lock 'ha_agent_dus1_lock - cfs lock update failed - Permission denied
2023-11-09T14:07:06.454437+01:00 dus1 pve-ha-lrm[2249]: status change active => lost_agent_lock
2023-11-09T14:07:09.548265+01:00 dus1 pvescheduler[300289]: jobs: cfs-lock 'file-jobs_cfg' error: no quorum!
2023-11-09T14:07:09.548806+01:00 dus1 pvescheduler[300288]: replication: cfs-lock 'file-replication_cfg' error: no quorum!
2023-11-09T14:07:52.296775+01:00 dus1 watchdog-mux[939]: client watchdog expired - disable watchdog updates
What am I missing in my config? I mean, what's the point of a cluster if I can't shut down one node for maintenance - or in an emergency?
Did I miss something regarding quorum?
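My (possibly wrong) understanding of the quorum math, assuming the default of one vote per node: with 7 nodes the cluster needs floor(7/2) + 1 = 4 votes, so the remaining 6 nodes should still be quorate after shutting one down. To double-check I would compare the vote counts before and after on one of the remaining nodes, e.g.:

Code:
# show expected votes, total votes and quorum state
pvecm status
# roughly the same information straight from corosync
corosync-quorumtool -s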
Thanks for any clarification.