[SOLVED] Yes, I changed hostnames... almost fixed. One question.

fuzzyduck

I bought a tiny SBC as a qdevice, so I thought: heck, now that I'm going for it, I might as well set this up right. I haven't tried the qdevice yet, but I changed hostnames (yes, stupid, I learned) on the two nodes. I'm now left with a working but weird-looking setup:

What I did thus far (Proxmox 8.04 on both boxes):
1. Changed hostnames on both nodes, in /etc/hosts and /etc/hostname.
2. Rebooted node2 only first, which broke the cluster (obvious with hindsight) with all sorts of web GUI problems and such, and I panicked. SSH worked.
3. Changed back /etc/hosts and /etc/hostname on both nodes and rebooted node2 again.
4. Cluster came back online and stuff worked, including green ticks saying all is OK. Node1 has not had a reboot as of yet.
5. BUT while the SSH prompt of node2 was correct like I'm used to, the prompt of node1 was weird looking: root@192:~# I checked both hosts files and, as far as I know, I never made a typo. Though it almost feels like a file is parsed incorrectly?
6. To keep calm, I decided to postpone the reboot of node1.
7. Now, 2 days later, I suddenly see a ghost node pop up in my datacenter when I log in via node1. And the green OK ticks on the 2 nodes are gone. Or was it always there and I didn't see it? Node1 is my default login, so I think it's new.

Screenshot 2023-11-19 at 10-48-12 192 - Proxmox Virtual Environment.png

Logging in through node2 gives me the correct stats:
Screenshot 2023-11-19 at 12-03-13 pve2 - Proxmox Virtual Environment.png

Remember, I have not rebooted node1 as of yet, but now I'm even more scared to do so :oops:

The question... Should I reboot node1, or should I double-check something first?


Code:
root@192:~# cat /etc/hosts
127.0.0.1 localhost.localdomain localhost
192.168.1.2 pve.local pve

# The following lines are desirable for IPv6 capable hosts

::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts

root@192:~# cat /etc/hostname
pve

root@192:~# pvecm status
Cluster information
-------------------
Name:             pve-cluster
Config Version:   6
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Sun Nov 19 11:08:14 2023
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          0x00000001
Ring ID:          1.328
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   2
Highest expected: 2
Total votes:      2
Quorum:           2
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 192.168.1.2 (local)
0x00000002          1 192.168.1.3
 
On your node1 (pve -> 192.168.1.2) run:
# hostnamectl set-hostname pve

Then go and check the contents of the nodes directory for leftovers:
Code:
# cd /etc/pve/nodes
# ls -la

You will likely have the weird 192 directory there, which you can delete:
# rm -rf 192

After that, reload the GUI in the browser.

Before doing more reboots, do you mind showing the content of corosync.conf:
# cat /etc/pve/corosync.conf
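
Before deleting anything, it may help to see which directories under /etc/pve/nodes have no counterpart in the corosync.conf nodelist. What follows is a minimal read-only sketch, not an official Proxmox tool. The function name list_stale_node_dirs is made up for illustration, and it assumes node directories are named after the name: entries in the nodelist.

```shell
#!/usr/bin/env bash
# List directories under the nodes directory that have no matching
# "name:" entry in the corosync.conf nodelist. Read-only: nothing is deleted.
# list_stale_node_dirs is a hypothetical helper, not a Proxmox command.
list_stale_node_dirs() {
    local nodes_dir="${1:-/etc/pve/nodes}"
    local conf="${2:-/etc/pve/corosync.conf}"
    local known dir name

    # Only lines whose first field is exactly "name:" are node names;
    # "cluster_name:" in the totem section does not match.
    known="$(awk '$1 == "name:" { print $2 }' "$conf")"

    for dir in "$nodes_dir"/*/; do
        [ -d "$dir" ] || continue
        name="$(basename "$dir")"
        # Print the directory name only if it is not an exact node name.
        if ! printf '%s\n' "$known" | grep -qxF -- "$name"; then
            echo "$name"
        fi
    done
}
```

Called without arguments, list_stale_node_dirs reads /etc/pve/nodes and /etc/pve/corosync.conf. Anything it prints is only a candidate for cleanup, so check the directory for guest configs before removing it.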
 
Thank you so much for taking the time to read and respond to my question. Much appreciated! Especially because I did something that shouldn't be done in the first place lol.

So before I go and maybe break more by setting the hostname once again, as in your command, this is the output of some commands I found online:

Code:
root@192:/# hostname
192.168.1.2

root@192:/# hostnamectl
Static hostname: pve
Transient hostname: 192.168.1.2
...etc...

root@192:/# cat /proc/sys/kernel/hostname
192.168.1.2

root@192:/# cat /etc/hostname
pve
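
These outputs suggest why the prompt reads root@192: the static hostname in /etc/hostname is pve, but the running (transient) hostname is 192.168.1.2, and bash's \h prompt escape prints the hostname only up to the first dot. A small sketch of that comparison follows. The function name check_hostname is hypothetical, and the file paths are optional parameters only so it can be tried on copies.

```shell
#!/usr/bin/env bash
# Compare the static hostname (from a file) with the hostname the kernel
# currently uses. check_hostname is a hypothetical helper; the defaults are
# the two files shown in the post.
check_hostname() {
    local static_file="${1:-/etc/hostname}"
    local kernel_file="${2:-/proc/sys/kernel/hostname}"
    local static_name kernel_name

    static_name="$(tr -d '[:space:]' < "$static_file")"
    kernel_name="$(tr -d '[:space:]' < "$kernel_file")"

    # Bash's \h prompt escape shows the hostname up to the first dot.
    echo "prompt shows: ${kernel_name%%.*}"

    if [ "$static_name" = "$kernel_name" ]; then
        echo "OK: static and running hostname are both '$static_name'"
    else
        echo "MISMATCH: static='$static_name' running='$kernel_name'"
        return 1
    fi
}
```

Running check_hostname with no arguments on node1 in the state shown would be expected to report a mismatch between pve and 192.168.1.2.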

I DO see the weird 192 directory in /etc/pve/nodes, which might indeed be safe to delete.

Contents of corosync.conf. No weird node called 192 there...

Code:
root@192:/# cat /etc/pve/corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: pve
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.1.2
  }
  node {
    name: pve2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 192.168.1.3
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: pve-cluster
  config_version: 6
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}

I haven't taken any steps to fix this yet.
 
Thank you so much for taking the time to read and respond to my question. Much appreciated! Especially because I did something that shouldn't be done in the first place lol.

No worries. :)

root@192:/# hostname

Being on a host 192 makes this feel like some sort of a Bond movie. ;)

...
root@192:/# cat /etc/hostname pve

The hostname outputs are also entertaining. I believe you would be safe just rebooting after you get it back, before which...

I DO see the weird 192 directory in /etc/pve/nodes which might be safe to delete indeed,

Just do it. :) It really only fixes the zombie in the GUI, but why not do it before the reboot.

Contents of corosync. No weird node called 192 there...

So this is good: you do have the right IPs there, the aliases are fine, and it even states you have quorum (mind you, while you are rebooting either one of the two nodes you by definition have no quorum, so you need to wait for that node to come back up to see a happy cluster again).
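
The quorum remark can be checked with a bit of arithmetic. This sketch assumes a plain majority rule (half the total votes, rounded down, plus one) with one vote per member, which matches the "Expected votes: 2 / Quorum: 2" shown by pvecm status earlier in the thread; it does not model special votequorum options. The function names are made up for illustration.

```shell
#!/usr/bin/env bash
# Votes needed for quorum under a simple majority rule: floor(total / 2) + 1.
# Assumption: default majority behaviour, one vote per member.
quorum_threshold() {
    local total_votes="$1"
    echo $(( total_votes / 2 + 1 ))
}

# Succeeds if the votes still online reach the threshold.
is_quorate() {
    local total_votes="$1" online_votes="$2"
    [ "$online_votes" -ge "$(quorum_threshold "$total_votes")" ]
}

# Compare a 2-vote cluster with a 3-vote cluster when one member is offline.
for total in 2 3; do
    if is_quorate "$total" $(( total - 1 )); then
        echo "$total votes, one offline: still quorate"
    else
        echo "$total votes, one offline: quorum lost"
    fi
done
```

With two votes the threshold is 2, so rebooting either node drops the cluster below quorum. A third vote, such as a qdevice, keeps the threshold at 2 while allowing one member to be offline.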

I haven't taken any steps to fix this yet.

Time to reboot. Report the aftermath. Afterwards I would click around in the GUI of the zombie node and try to access the shell of the other node, the VMs there, etc.
 
Right now it's evening and the wife is watching TV through the server, so I'll report back tomorrow!

What I DID do was remove the /etc/pve/nodes/192 dir, which existed on both nodes (while I see the error only when logging in through node1).

service pveproxy restart did not resolve the ghost node yet, though.

EDIT: I just found out that I cannot enter LXCs or VMs via the web console or the terminal.

Code:
root@192:~# pct enter 102
Configuration file 'nodes/192/lxc/102.conf' does not exist
root@192:~#
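
The error message hints that pct builds the config path from the running hostname (nodes/192/...), which would explain why the container cannot be found while the hostname is wrong. A read-only way to see where a guest's config actually lives is sketched here. The function name find_guest_config is hypothetical; the lxc and qemu-server subdirectories are the usual layout under /etc/pve/nodes/<node>/.

```shell
#!/usr/bin/env bash
# Print the path of the config file for a container (lxc) or VM (qemu-server)
# with the given ID, searching every node directory. Read-only lookup.
# find_guest_config is a hypothetical helper, not a Proxmox command.
find_guest_config() {
    local vmid="$1"
    local nodes_dir="${2:-/etc/pve/nodes}"
    local conf found=1

    for conf in "$nodes_dir"/*/lxc/"$vmid".conf \
                "$nodes_dir"/*/qemu-server/"$vmid".conf; do
        # Unmatched globs stay literal and fail this test, so they are skipped.
        [ -f "$conf" ] || continue
        echo "$conf"
        found=0
    done
    # Returns 0 if at least one config was found, 1 otherwise.
    return "$found"
}
```

For example, find_guest_config 102 would show which node directory holds 102.conf, which can then be compared with the hostname the shell prompt reports.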


The 2-node cluster is what I was trying to fix in the first place by adding a qdevice, but I ended up here on the forum LOL

Thx so far
 
The aftermath is pretty hopeful!
Booted without errors to the GUI. Zombie node is gone, BUT the /etc/pve/nodes/192 zombie node dir got recreated. I deleted it once more and am monitoring the situation. So maybe a file somewhere gets parsed every now and then, making the cluster think there is another node?

At least I have access to the LXC and VM again because my prompt is as it should be.
Code:
root@pve:~#

I'm keeping this topic unsolved for a couple more days to monitor.
 
The aftermath is pretty hopeful!
Booted without errors to the GUI. Zombie node is gone, BUT the /etc/pve/nodes/192 zombie node dir got recreated. I deleted it once more and am monitoring the situation. So maybe a file somewhere gets parsed every now and then, making the cluster think there is another node?

In your case (renaming back and forth), it might actually have been best to reboot both nodes at the same time.

The interesting part is, you should be able to run corosync purely on IPs, so your names could be anything really.

At least I have access to the LXC and VM again because my prompt is as it should be.
Code:
root@pve:~#

Ok good!

I'm keeping this topic unsolved for a couple more days to monitor.

Keep us posted. Also, if the 192 directory gets recreated, have a look at what's in the corosync files at that very time.
 
One more thing ... how is the qdevice working for you so far? :) SBC device is ... ? :)
Haha, thx for asking. It's a Single Board Computer. In my case:

http://www.orangepi.org/html/hardWare/computerAndMicrocontrollers/details/Orange-Pi-Zero-LTS.html

10 euros/dollars second hand, with NIC.

I wanted one of my 4 routers/access points to be the qdevice, since they run OpenWrt firmware (custom Linux), which opens them up for more than just routing (you get an SSH prompt). But since I cannot compile on it (no know-how), I went this route, which has plain vanilla Debian on it.

I already have the latest corosync-qnetd installed, but ran into setup issues. That's another story, and not for this topic I guess.
 
Haha, thx for asking. It's a Single Board Computer. In my case:

http://www.orangepi.org/html/hardWare/computerAndMicrocontrollers/details/Orange-Pi-Zero-LTS.html

10 euros/dollars second hand, with NIC.

That's a Cortex A7, it sounds like a great candidate for just that.

I wanted one of my 4 routers/access points to be the qdevice, since they run OpenWrt firmware (custom Linux), which opens them up for more than just routing (you get an SSH prompt). But since I cannot compile on it (no know-how), I went this route, which has plain vanilla Debian on it.

You could even run it outside of the network (if you already run a VM or CT out there); it's the one thing that does NOT require low latency to the rest of the cluster.

I already have the latest corosync-qnetd installed, but ran into setup issues. That's another story, and not for this topic I guess.

Fair enough, feel free to open a new one. ;)
 
After the reboot of node1 and monitoring the 2-node cluster for a day, I didn't see the ghost 192 dir return. Even the qdevice setup is now working and I have 3 votes for HA. Happy days.

Thank you tempacc375924 for helping me out and kicking me forward where I had stalled. ;)
 