[SOLVED] Cluster broken after changing the IP of 1 of 2 nodes - VMs not starting due to no quorum - please help

Cheffabbio

New Member
Jun 4, 2023
Dear all,
big mess in Chinatown after changing the IP of one node in a 2-node cluster: the cluster is broken.

Cluster: node "faxmox" (the one where I changed IP) + node "famoxout"

A bit of context:
- amended the corosync.conf files on both nodes to reflect the new IP (roughly along the lines sketched just below this list)
- after some unsuccessful attempts, copied the corosync directories and config files from the IP-untouched node (faxmoxout) to the IP-modified node (faxmox)
- network-wise the servers can reach each other
- on the IP-untouched node (faxmoxout), oddly enough, the web-UI shell (noVNC) is not working ("code 1006"); VMs won't start due to missing quorum; SSH works fine
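For the record, this is the procedure I was trying to follow; a minimal sketch based on my understanding of the docs, using my own node names and IPs:

Code:
# With a quorate cluster: edit the cluster-wide copy, never /etc/corosync directly.
# Change ring0_addr for the node and bump config_version by 1, then save;
# pmxcfs distributes the file and corosync reloads it on all nodes.
nano /etc/pve/corosync.conf

# With quorum already lost, /etc/pve is read-only. The workaround I found:
systemctl stop pve-cluster corosync
pmxcfs -l                      # start the cluster filesystem in local mode
nano /etc/pve/corosync.conf    # now writable: fix ring0_addr, bump config_version
killall pmxcfs
systemctl start corosync pve-cluster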



*) node "faxmox" (192.168.70.2)

systemctl status corosync.service

Code:
systemctl status corosync.service
● corosync.service - Corosync Cluster Engine
     Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
     Active: active (running) since Sun 2023-06-04 00:52:49 CEST; 7min ago
       Docs: man:corosync
             man:corosync.conf
             man:corosync_overview
   Main PID: 1035 (corosync)
      Tasks: 9 (limit: 9298)
     Memory: 131.6M
        CPU: 2.939s
     CGroup: /system.slice/corosync.service
             └─1035 /usr/sbin/corosync -f

Jun 04 00:55:34 faxmox corosync[1035]:   [QUORUM] Sync joined[1]: 2
Jun 04 00:55:34 faxmox corosync[1035]:   [TOTEM ] A new membership (1.23e) was formed. Members joined: 2
Jun 04 00:55:34 faxmox corosync[1035]:   [QUORUM] Sync members[1]: 1
Jun 04 00:55:34 faxmox corosync[1035]:   [QUORUM] Sync left[1]: 2
Jun 04 00:55:34 faxmox corosync[1035]:   [TOTEM ] A new membership (1.242) was formed. Members left: 2
Jun 04 00:55:34 faxmox corosync[1035]:   [QUORUM] Members[1]: 1
Jun 04 00:55:34 faxmox corosync[1035]:   [MAIN  ] Completed service synchronization, ready to provide service.
Jun 04 00:55:35 faxmox corosync[1035]:   [KNET  ] link: host: 2 link: 0 is down
Jun 04 00:55:35 faxmox corosync[1035]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Jun 04 00:55:35 faxmox corosync[1035]:   [KNET  ] host: host: 2 has no active links

pvecm status

Code:
pvecm status
perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
    LANGUAGE = (unset),
    LC_ALL = (unset),
    LC_CTYPE = "UTF-8",
    LANG = "en_US.UTF-8"
    are supported and installed on your system.
perl: warning: Falling back to a fallback locale ("en_US.UTF-8").
Cluster information
-------------------
Name:             FaxProxCluster1
Config Version:   6
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Sun Jun  4 01:04:31 2023
Quorum provider:  corosync_votequorum
Nodes:            1
Node ID:          0x00000001
Ring ID:          1.242
Quorate:          No

Votequorum information
----------------------
Expected votes:   2
Highest expected: 2
Total votes:      1
Quorum:           2 Activity blocked
Flags:           

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 192.168.70.2 (local)
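
Reading that output: expected votes is 2 but only 1 is present, so all activity is blocked. The one workaround I'm aware of for this state (it deliberately overrides the quorum safety check, so use with care on a 2-node cluster) is telling votequorum that a single vote is enough:

Code:
pvecm expected 1    # node becomes quorate alone; resets when membership changes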


/etc/pve/corosync.conf

Code:
root@faxmox:~# more /etc/pve/corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: FaxmoxOUT
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 192.168.60.6
  }
  node {
    name: faxmox
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.70.2
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: FaxProxCluster1
  config_version: 6
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}
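
Side note: faxmox reports Config Version 6 above, while FaxmoxOUT reports Config Version 2 further down. As far as I know corosync will not merge nodes running mismatched configs, so comparing both copies is a quick sanity check:

Code:
# Run on both nodes; config_version and the file contents must match
grep config_version /etc/corosync/corosync.conf
sha256sum /etc/corosync/corosync.conf /etc/corosync/authkey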



*) node "faxmoxout" (192.168.60.6)


systemctl status corosync.service:

Code:
root@FaxmoxOUT:~# systemctl status corosync.service
● corosync.service - Corosync Cluster Engine
     Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
     Active: failed (Result: exit-code) since Sun 2023-06-04 00:55:35 CEST; 12min ago
       Docs: man:corosync
             man:corosync.conf
             man:corosync_overview
    Process: 920 ExecStart=/usr/sbin/corosync -f $COROSYNC_OPTIONS (code=exited, status=0/SUCCESS)
    Process: 987 ExecStop=/usr/sbin/corosync-cfgtool -H --force (code=exited, status=1/FAILURE)
   Main PID: 920 (code=exited, status=0/SUCCESS)
        CPU: 115ms

Jun 04 00:55:34 FaxmoxOUT corosync[920]:   [QB    ] withdrawing server sockets
Jun 04 00:55:34 FaxmoxOUT corosync[920]:   [SERV  ] Service engine unloaded: corosync cluster quorum service v0.1
Jun 04 00:55:34 FaxmoxOUT corosync[920]:   [SERV  ] Service engine unloaded: corosync profile loading service
Jun 04 00:55:34 FaxmoxOUT corosync[920]:   [SERV  ] Service engine unloaded: corosync resource monitoring service
Jun 04 00:55:34 FaxmoxOUT corosync[920]:   [SERV  ] Service engine unloaded: corosync watchdog service
Jun 04 00:55:35 FaxmoxOUT corosync[920]:   [KNET  ] link: Resetting MTU for link 0 because host 1 joined
Jun 04 00:55:35 FaxmoxOUT corosync[920]:   [KNET  ] link: Resetting MTU for link 0 because host 2 joined
Jun 04 00:55:35 FaxmoxOUT corosync[920]:   [MAIN  ] Corosync Cluster Engine exiting normally
Jun 04 00:55:35 FaxmoxOUT systemd[1]: corosync.service: Control process exited, code=exited, status=1/FAILURE
Jun 04 00:55:35 FaxmoxOUT systemd[1]: corosync.service: Failed with result 'exit-code'.
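
So corosync on FaxmoxOUT exited and never came back, which also explains the CMAP error below. The obvious next step would be restarting it and watching the journal while it starts:

Code:
systemctl restart corosync
journalctl -u corosync -b --no-pager | tail -n 50   # look for bind/authkey/config errors
corosync-cfgtool -s                                 # link status once it is running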

pvecm status

Code:
root@FaxmoxOUT:~# pvecm status
perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
    LANGUAGE = (unset),
    LC_ALL = (unset),
    LC_CTYPE = "UTF-8",
    LANG = "en_US.UTF-8"
    are supported and installed on your system.
perl: warning: Falling back to a fallback locale ("en_US.UTF-8").
Cluster information
-------------------
Name:             FaxProxCluster1
Config Version:   2
Transport:        knet
Secure auth:      on

Cannot initialize CMAP service
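
"Cannot initialize CMAP service" just means pvecm cannot reach a running corosync daemon, i.e. it is a symptom of the failed service above rather than a separate problem. A quick check covering both cluster services at once:

Code:
systemctl status corosync pve-cluster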

After several hours of failures, trying to find some solace here among experts!
How can I make the nodes talk to each other again and restore the cluster?

Thanks in advance!

cheers
 
Update:
I've rm'ed /etc/pve/corosync.conf and /etc/corosync/* on one of the two nodes (FaxmoxOUT), and now at least both nodes' shells are working again, as this restored local web-GUI functionality and VM operations.
It also restored quorum.

But the nodes still don't see each other, and I now get "Connection error 401: No ticket" when accessing the other node from a node's web GUI.
So I assume I now "just" need to re-create the cluster, or somehow restore the conditions under which the nodes were seeing each other.
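In case it helps others: my understanding is that the 401 "No ticket" shows up because the inter-node SSL certificates and auth keys went out of sync after all the manual file copying, which is exactly what pvecm updatecerts repairs (see the solution below); a sketch:

Code:
pvecm updatecerts --force    # regenerate/redistribute node certificates
systemctl restart pveproxy   # make the web GUI pick them up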

Thanks in advance for any support.

Cheers

"
 
Last edited:
SOLVED:

*) copied /etc/pve/corosync.conf and /etc/corosync/* from the other node to the node where I had deleted those assets
*) ran "pvecm updatecerts --force" on both nodes
*) rebooted both nodes
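Putting the whole fix together as shell steps (a sketch of what worked for me; 192.168.70.2 is faxmox in my setup, adjust to yours):

Code:
# On FaxmoxOUT: pull config and keys back from the healthy node
scp root@192.168.70.2:/etc/corosync/* /etc/corosync/
cp /etc/corosync/corosync.conf /etc/pve/corosync.conf

# On both nodes
pvecm updatecerts --force
reboot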
 