[SOLVED] Cluster broken after changing 1 of 2 nodes IP - VMs not starting for no quorum - please help

Cheffabbio

Jun 4, 2023
Hi all,
I've made a big mess: after changing the IP address of one node in a 2-node cluster, the cluster is broken.

Cluster: node "faxmox" (the one whose IP I changed) + node "faxmoxout"

A bit of context:
- updated the corosync.conf files on both nodes with the new IP
- after some unsuccessful attempts, copied the corosync dirs and conf files from the node whose IP was not changed (faxmoxout) to the node whose IP was changed (faxmox)
- network-wise, the servers can reach each other
- on the node whose IP was not changed (faxmoxout), oddly enough, the web UI shell (noVNC) is not working ("code 1006"). VMs won't start because there is no quorum. SSH works fine



*) node "faxmox" (192.168.70.2)

systemctl status corosync.service

Code:
systemctl status corosync.service
● corosync.service - Corosync Cluster Engine
     Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
     Active: active (running) since Sun 2023-06-04 00:52:49 CEST; 7min ago
       Docs: man:corosync
             man:corosync.conf
             man:corosync_overview
   Main PID: 1035 (corosync)
      Tasks: 9 (limit: 9298)
     Memory: 131.6M
        CPU: 2.939s
     CGroup: /system.slice/corosync.service
             └─1035 /usr/sbin/corosync -f

Jun 04 00:55:34 faxmox corosync[1035]:   [QUORUM] Sync joined[1]: 2
Jun 04 00:55:34 faxmox corosync[1035]:   [TOTEM ] A new membership (1.23e) was formed. Members joined: 2
Jun 04 00:55:34 faxmox corosync[1035]:   [QUORUM] Sync members[1]: 1
Jun 04 00:55:34 faxmox corosync[1035]:   [QUORUM] Sync left[1]: 2
Jun 04 00:55:34 faxmox corosync[1035]:   [TOTEM ] A new membership (1.242) was formed. Members left: 2
Jun 04 00:55:34 faxmox corosync[1035]:   [QUORUM] Members[1]: 1
Jun 04 00:55:34 faxmox corosync[1035]:   [MAIN  ] Completed service synchronization, ready to provide service.
Jun 04 00:55:35 faxmox corosync[1035]:   [KNET  ] link: host: 2 link: 0 is down
Jun 04 00:55:35 faxmox corosync[1035]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Jun 04 00:55:35 faxmox corosync[1035]:   [KNET  ] host: host: 2 has no active links

pvecm status

Code:
pvecm status
perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
    LANGUAGE = (unset),
    LC_ALL = (unset),
    LC_CTYPE = "UTF-8",
    LANG = "en_US.UTF-8"
    are supported and installed on your system.
perl: warning: Falling back to a fallback locale ("en_US.UTF-8").
Cluster information
-------------------
Name:             FaxProxCluster1
Config Version:   6
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Sun Jun  4 01:04:31 2023
Quorum provider:  corosync_votequorum
Nodes:            1
Node ID:          0x00000001
Ring ID:          1.242
Quorate:          No

Votequorum information
----------------------
Expected votes:   2
Highest expected: 2
Total votes:      1
Quorum:           2 Activity blocked
Flags:           

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 192.168.70.2 (local)
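
For anyone reading the numbers above: "Expected votes: 2", "Total votes: 1" and "Quorum: 2" are consistent with a plain majority rule, where a cluster needs floor(expected / 2) + 1 votes. Below is a minimal sketch of that arithmetic, assuming no special votequorum options (such as two_node) are set, which matches the corosync.conf shown in this post. The function names are made up for illustration and are not part of any Proxmox or corosync tool.

```shell
#!/bin/sh
# Votes needed for quorum under a simple majority: floor(expected / 2) + 1.
quorum_needed() {
    expected=$1
    echo $(( expected / 2 + 1 ))
}

# Prints "quorate" when the votes present reach the majority, else "blocked".
quorum_state() {
    expected=$1
    present=$2
    if [ "$present" -ge "$(quorum_needed "$expected")" ]; then
        echo "quorate"
    else
        echo "blocked"
    fi
}

quorum_state 2 1   # one of two nodes visible, as in the output above
quorum_state 2 2   # both nodes visible to each other
```

With two expected votes the majority is two, so a single node on its own can never be quorate in this setup, which is why VM start is blocked as soon as the nodes lose sight of each other.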


/etc/pve/corosync.conf

Code:
root@faxmox:~# more /etc/pve/corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: FaxmoxOUT
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 192.168.60.6
  }
  node {
    name: faxmox
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.70.2
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: FaxProxCluster1
  config_version: 6
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}
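
After an IP change it can help to list which address each node entry in the file actually carries and compare that with the address each node really has. The following is only a rough sketch: list_ring_addrs is a hypothetical helper name, and the parsing assumes the simple layout shown above (one "node {" line per entry, one key per line).

```shell
#!/bin/sh
# Prints "name ring0_addr" for every node { } block in a corosync.conf-style file.
list_ring_addrs() {
    awk '
        $1 == "node" && $2 == "{" { in_node = 1; name = ""; addr = ""; next }
        in_node && $1 == "name:"       { name = $2 }
        in_node && $1 == "ring0_addr:" { addr = $2 }
        in_node && $1 == "}"           { print name, addr; in_node = 0 }
    ' "$1"
}

# Demo on a scratch copy of the nodelist shown above.
conf=$(mktemp)
cat > "$conf" <<'EOF'
nodelist {
  node {
    name: FaxmoxOUT
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 192.168.60.6
  }
  node {
    name: faxmox
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.70.2
  }
}
EOF
list_ring_addrs "$conf"
rm -f "$conf"
```

On a real node the function would be pointed at /etc/pve/corosync.conf or /etc/corosync/corosync.conf instead of the scratch file.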



*) node "faxmoxout" (192.168.60.6)


systemctl status corosync.service:

Code:
root@FaxmoxOUT:~# systemctl status corosync.service
● corosync.service - Corosync Cluster Engine
     Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
     Active: failed (Result: exit-code) since Sun 2023-06-04 00:55:35 CEST; 12min ago
       Docs: man:corosync
             man:corosync.conf
             man:corosync_overview
    Process: 920 ExecStart=/usr/sbin/corosync -f $COROSYNC_OPTIONS (code=exited, status=0/SUCCESS)
    Process: 987 ExecStop=/usr/sbin/corosync-cfgtool -H --force (code=exited, status=1/FAILURE)
   Main PID: 920 (code=exited, status=0/SUCCESS)
        CPU: 115ms

Jun 04 00:55:34 FaxmoxOUT corosync[920]:   [QB    ] withdrawing server sockets
Jun 04 00:55:34 FaxmoxOUT corosync[920]:   [SERV  ] Service engine unloaded: corosync cluster quorum service v0.1
Jun 04 00:55:34 FaxmoxOUT corosync[920]:   [SERV  ] Service engine unloaded: corosync profile loading service
Jun 04 00:55:34 FaxmoxOUT corosync[920]:   [SERV  ] Service engine unloaded: corosync resource monitoring service
Jun 04 00:55:34 FaxmoxOUT corosync[920]:   [SERV  ] Service engine unloaded: corosync watchdog service
Jun 04 00:55:35 FaxmoxOUT corosync[920]:   [KNET  ] link: Resetting MTU for link 0 because host 1 joined
Jun 04 00:55:35 FaxmoxOUT corosync[920]:   [KNET  ] link: Resetting MTU for link 0 because host 2 joined
Jun 04 00:55:35 FaxmoxOUT corosync[920]:   [MAIN  ] Corosync Cluster Engine exiting normally
Jun 04 00:55:35 FaxmoxOUT systemd[1]: corosync.service: Control process exited, code=exited, status=1/FAILURE
Jun 04 00:55:35 FaxmoxOUT systemd[1]: corosync.service: Failed with result 'exit-code'.

pvecm status

Code:
root@FaxmoxOUT:~# pvecm status
perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
    LANGUAGE = (unset),
    LC_ALL = (unset),
    LC_CTYPE = "UTF-8",
    LANG = "en_US.UTF-8"
    are supported and installed on your system.
perl: warning: Falling back to a fallback locale ("en_US.UTF-8").
Cluster information
-------------------
Name:             FaxProxCluster1
Config Version:   2
Transport:        knet
Secure auth:      on

Cannot initialize CMAP service
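
One detail visible in the two outputs: faxmox reports "Config Version: 6" while FaxmoxOUT reports "Config Version: 2", which suggests the two nodes are not working from the same corosync.conf. A small sketch for checking this on copies of the files follows; config_version and compare_versions are hypothetical helper names, and the parsing assumes the "config_version: N" line format shown earlier in this post.

```shell
#!/bin/sh
# Prints the config_version value found in a corosync.conf-style file.
config_version() {
    awk '$1 == "config_version:" { print $2; exit }' "$1"
}

# Reports whether two copies of the file carry the same config_version.
compare_versions() {
    v1=$(config_version "$1")
    v2=$(config_version "$2")
    if [ "$v1" = "$v2" ]; then
        echo "match: $v1"
    else
        echo "mismatch: $v1 vs $v2"
    fi
}

# Demo with two scratch files carrying the versions reported by the two nodes.
left=$(mktemp)
right=$(mktemp)
printf 'totem {\n  config_version: 6\n}\n' > "$left"
printf 'totem {\n  config_version: 2\n}\n' > "$right"
compare_versions "$left" "$right"
rm -f "$left" "$right"
```

In practice the two arguments would be the corosync.conf from each node, for example after fetching the remote copy to a temporary location.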

After several hours of failures, I'm hoping to find some solace here among the experts!
How can I make the nodes talk to each other again and restore the cluster?

Thanks in advance!

cheers
 
Update:
I've rm'ed /etc/pve/corosync.conf and /etc/corosync/* on one of the two nodes (Faxmoxout), and now at least the shells on both nodes are working again: this restored LOCAL web-GUI functionality and VM operations.
AND it restored quorum.

But they still don't see each other, and I now get "Connection error 401: No ticket" when accessing the other node from one node's web-GUI.
So I assume I now "just" need to re-create the cluster, OR somehow restore the conditions under which the nodes were seeing each other.

Thanks in advance for any support.

Cheers

SOLVED:

*) copied /etc/pve/corosync.conf and /etc/corosync/* from the other node to the node where I had deleted those files
*) ran "pvecm updatecerts --force" on both nodes
*) rebooted
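
The steps above can be sketched as a dry run that only prints the commands. This is a hedged illustration, not the exact commands used: the post does not say how the files were copied, so scp over SSH as root is an assumption, and print_recovery_plan is a hypothetical helper name. The source node is the one that still had its corosync files (faxmox, 192.168.70.2, in this thread).

```shell
#!/bin/sh
# Dry run: prints the commands for the steps above instead of executing them.
# $1 is the node that still has its corosync files (assumed reachable via SSH as root).
print_recovery_plan() {
    good_node=$1
    echo "scp root@${good_node}:/etc/pve/corosync.conf /etc/pve/corosync.conf"
    echo "scp 'root@${good_node}:/etc/corosync/*' /etc/corosync/"
    echo "pvecm updatecerts --force   # run on both nodes"
    echo "reboot"
}

print_recovery_plan 192.168.70.2
```

Printing the plan first makes it easy to review paths and the source address before anything is overwritten on the node that lost its files.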
 