[SOLVED] Reinstall node in cluster

floh

Active Member
Jul 19, 2018
Good morning Proxmox Community!

I'm currently planning to reinstall one node in a working cluster because of a completely new disk setup. (I'll be swapping all of the node's disks.)

As far as I've researched, there are two workable approaches:

1. As documented here: https://pve.proxmox.com/wiki/Upgrade_from_5.x_to_6.0#New_installation
So I would back up all files and reinstall (without leaving the cluster). The question is: do I then need to rejoin the cluster, or will the other nodes think the node that was newly installed with all the old config files restored was only shut down and is now online again?

2. The second way I thought of:
Remove the node from the cluster (as documented here: https://pve.proxmox.com/wiki/Cluster_Manager#_remove_a_cluster_node).
Then reinstall the node (with the same name and IP address, so I would need to swap the certificates and the SSH fingerprint after rejoining).


Has anyone done this before? Is there a better way to do it?


Best regards,
Floh



PS: After reinstalling this one node, I'll have to reinstall every node, one after another.
 
Hi,

So I would back up all files and reinstall (without leaving the cluster). The question is: do I then need to rejoin the cluster, or will the other nodes think the node that was newly installed with all the old config files restored was only shut down and is now online again?

If you restore the whole /etc/corosync directory, including its authkey and the local corosync.conf, the node can join directly, as it never left. Copy the corosync.conf to /etc/pve as well, then restart both corosync and the pve-cluster service in one go: systemctl restart corosync pve-cluster
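A minimal sketch of that restore, assuming the backup was unpacked under /root/node-backup (that path is an assumption; adjust it to wherever your backup lives). It is guarded so that on a machine that is not a PVE node it only prints what it would do:

```shell
#!/bin/sh
# Sketch only: restore the corosync config on the freshly installed node.
# BACKUP is an assumed location for the unpacked backup.
BACKUP=/root/node-backup

if [ -d /etc/pve ] && [ -d "$BACKUP/etc/corosync" ]; then
    # put the cluster config (including the authkey) back in place
    cp -a "$BACKUP/etc/corosync/." /etc/corosync/
    cp "$BACKUP/etc/corosync/corosync.conf" /etc/pve/corosync.conf
    # restart both services in one go so pmxcfs picks up the cluster config
    systemctl restart corosync pve-cluster
else
    echo "dry run: would restore corosync config and restart corosync pve-cluster"
fi
```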

If that works, the configs from /etc/pve are all restored, but other local configs and disks are not.

Additionally, it would be good to restore the root SSH key pair under /root/.ssh/id*, as otherwise the node's identity changes and migration and the like will give you "SSH host identity changed" errors. Alternatively, you could run a "pvecm updatecerts" after the node has rejoined the cluster.

Has anyone done this before? Is there a better way to do it?

Both work fine. You could try it out with a virtual node or cluster (PVE nodes as PVE VMs) and play the scenario through. That gives you a feel for it and a chance to ask about problems, if any arise, without doing the change in production.
 
Update:
I reinstalled the node and restored: /etc/pve/* + /etc/corosync/* + /etc/passwd + /root/.ssh/*
Then I had to update the certs with "pvecm updatecerts", and everything seems to be working fine.

The only thing is that the subscription key isn't working anymore. Status: Invalid: Invalid Server ID.

Is there a way to restore the subscription key from the old backup files? (I have a backup of /etc from the node before the reinstall.)
Or is the only way to reactivate the key at shop.maurer-it.com?
 
Just to clarify:
(1) backup ~/.ssh, /etc/passwd, /etc/shadow, /etc/pve and /etc/corosync
(2) reinstall without leaving the cluster
(3) install re-activated license key
(4) restore from (1)
(5) reboot

is this correct?
 
is this correct?
shadow and passwd are normally not required, at least if you did not add extra users you'd like to preserve.
I'd also recommend avoiding cluster-wide changes during the reinstallation, or at least ensuring that you really only restore the configuration files in /etc/pve that are relevant to the affected node (most of it gets synced anyway after the reboot). In other words, backing up the /etc/pve part is mostly done as a safety net in case anything goes wrong with rejoining or the like.
So .ssh and corosync are the most important directories to restore.
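The backup step could be sketched like this (a rough sketch only: it skips any path that is missing or unreadable, so the same command sequence also runs on a non-PVE test box; the /tmp target is an assumption, copy the archive off the node before reinstalling):

```shell
#!/bin/sh
# Sketch: archive the node-local config before wiping the disks.
BACKUP="/tmp/node-backup-$(date +%F).tar.gz"
FILES=""
for f in /root/.ssh /etc/corosync /etc/pve /etc/passwd /etc/shadow; do
    # skip paths that are missing or unreadable (e.g., on a test machine)
    [ -r "$f" ] && FILES="$FILES $f"
done
tar -czf "$BACKUP" $FILES 2>/dev/null
tar -tzf "$BACKUP"    # list the archive contents as a quick sanity check
```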

Also ensure the VM disks aren't on local storage (e.g., local-lvm), which gets destroyed if its disk is selected as a target in the ISO installer during reinstallation.
 
I moved all VMs to another node.
I installed anew, activated the license, and restored /etc/pve and /etc/corosync.
(passwd and shadow were restored too, but are irrelevant at the moment.)

After reboot

systemctl status corosync

says:

Bash:
● corosync.service - Corosync Cluster Engine
     Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
     Active: active (running) since Tue 2022-02-08 11:52:34 CET; 35min ago
       Docs: man:corosync
             man:corosync.conf
             man:corosync_overview
   Main PID: 1327 (corosync)
      Tasks: 9 (limit: 56116)
     Memory: 116.6M
        CPU: 16.359s
     CGroup: /system.slice/corosync.service
             └─1327 /usr/sbin/corosync -f

Feb 08 11:52:38 qonos corosync[1327]:   [QUORUM] Sync joined[1]: 3
Feb 08 11:52:38 qonos corosync[1327]:   [TOTEM ] A new membership (2.2d0) was formed. Members joined: 3
Feb 08 11:52:38 qonos corosync[1327]:   [QUORUM] This node is within the primary component and will provide service.
Feb 08 11:52:38 qonos corosync[1327]:   [QUORUM] Members[2]: 2 3
Feb 08 11:52:38 qonos corosync[1327]:   [MAIN  ] Completed service synchronization, ready to provide service.
Feb 08 11:52:39 qonos corosync[1327]:   [KNET  ] rx: host: 3 link: 0 is up
Feb 08 11:52:39 qonos corosync[1327]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Feb 08 11:52:54 qonos corosync[1327]:   [KNET  ] pmtud: PMTUD link change for host: 3 link: 0 from 469 to 1397
Feb 08 11:52:54 qonos corosync[1327]:   [KNET  ] pmtud: PMTUD link change for host: 3 link: 1 from 469 to 1397
Feb 08 11:52:54 qonos corosync[1327]:   [KNET  ] pmtud: Global data MTU changed to: 1397

but

systemctl status pve-cluster.service

gives (among others)

Bash:
Feb 08 12:24:58 qonos pveproxy[1362]: ipcc_send_rec[1] failed: Connection refused

Any idea?
 
This is the output of journalctl -b -u pve-cluster (the same block repeats over and over; only the counter increases)

Code:
Feb 08 12:24:51 qonos systemd[1]: pve-cluster.service: Scheduled restart job, restart counter is at 3.
Feb 08 12:24:51 qonos systemd[1]: Stopped The Proxmox VE cluster filesystem.
Feb 08 12:24:51 qonos systemd[1]: Starting The Proxmox VE cluster filesystem...
Feb 08 12:24:51 qonos pmxcfs[1706]: fuse: mountpoint is not empty
Feb 08 12:24:51 qonos pmxcfs[1706]: fuse: if you are sure this is safe, use the 'nonempty' mount option
Feb 08 12:24:51 qonos pmxcfs[1706]: [main] crit: fuse_mount error: File exists
Feb 08 12:24:51 qonos pmxcfs[1706]: [main] notice: exit proxmox configuration filesystem (-1)
Feb 08 12:24:51 qonos pmxcfs[1706]: [main] crit: fuse_mount error: File exists
Feb 08 12:24:51 qonos pmxcfs[1706]: [main] notice: exit proxmox configuration filesystem (-1)
Feb 08 12:24:51 qonos systemd[1]: pve-cluster.service: Control process exited, code=exited, status=255/EXCEPTION
Feb 08 12:24:51 qonos systemd[1]: pve-cluster.service: Failed with result 'exit-code'.
Feb 08 12:24:51 qonos systemd[1]: Failed to start The Proxmox VE cluster filesystem.
Feb 08 12:24:52 qonos systemd[1]: pve-cluster.service: Scheduled restart job, restart counter is at 4.
Feb 08 12:24:52 qonos systemd[1]: Stopped The Proxmox VE cluster filesystem.
Feb 08 12:24:52 qonos systemd[1]: Starting The Proxmox VE cluster filesystem...
Feb 08 12:24:52 qonos pmxcfs[1708]: fuse: mountpoint is not empty
Feb 08 12:24:52 qonos pmxcfs[1708]: fuse: if you are sure this is safe, use the 'nonempty' mount option
Feb 08 12:24:52 qonos pmxcfs[1708]: [main] crit: fuse_mount error: File exists
Feb 08 12:24:52 qonos pmxcfs[1708]: [main] crit: fuse_mount error: File exists
Feb 08 12:24:52 qonos pmxcfs[1708]: [main] notice: exit proxmox configuration filesystem (-1)
Feb 08 12:24:52 qonos pmxcfs[1708]: [main] notice: exit proxmox configuration filesystem (-1)
Feb 08 12:24:52 qonos systemd[1]: pve-cluster.service: Control process exited, code=exited, status=255/EXCEPTION
Feb 08 12:24:52 qonos systemd[1]: pve-cluster.service: Failed with result 'exit-code'.
Feb 08 12:24:52 qonos systemd[1]: Failed to start The Proxmox VE cluster filesystem.
Feb 08 12:24:52 qonos systemd[1]: pve-cluster.service: Scheduled restart job, restart counter is at 5.
Feb 08 12:24:52 qonos systemd[1]: Stopped The Proxmox VE cluster filesystem.
Feb 08 12:24:52 qonos systemd[1]: pve-cluster.service: Start request repeated too quickly.
Feb 08 12:24:52 qonos systemd[1]: pve-cluster.service: Failed with result 'exit-code'.
Feb 08 12:24:52 qonos systemd[1]: Failed to start The Proxmox VE cluster filesystem.
 
Feb 08 12:24:51 qonos pmxcfs[1706]: fuse: mountpoint is not empty
Feb 08 12:24:51 qonos pmxcfs[1706]: fuse: if you are sure this is safe, use the 'nonempty' mount option
The above is the error: the backup was apparently unpacked into /etc/pve while pve-cluster (i.e., pmxcfs) wasn't running, so it can no longer mount there now.

Bash:
mv /etc/pve /etc/pve-OLD
mkdir /etc/pve
chown root:www-data /etc/pve
systemctl restart pve-cluster
 
Thanks, that got me a good deal further. However, the node still shows a grey question mark in the cluster.
Did I miss something?
 
Did I miss something?
That can mean the pvestatd service is not running (anymore) or has hung up; it might be worth trying to restart it:

systemctl restart pvestatd
 
Thanks, I had already thought of that.
Surprisingly, after what felt like the fifth attempt, it actually restarted and updated the status.
Thanks a lot!
I'll add this to our documentation right away so it doesn't happen again.
 
I have a similar problem. One of the servers in the cluster is dead (hardware failure). I now want to replace it, and I'd like to reuse the old server's IP and hostname. What's the best way to go about it?

No data needs to be taken over from the old server except the IP. Backups exist of the VMs/CTs that used to be on that server; they have already been restored elsewhere.

a) Throw the old server out of the cluster
b) Install the new server and join it to the cluster normally?
Does that work? Or is the old server's key/fingerprint stored somewhere in connection with its IP?
 
Does that work? Or is the old server's key/fingerprint stored somewhere in connection with its IP?
The server's public key is stored in /etc/ssh/ssh_known_hosts, which is a link into the cluster filesystem (/etc/pve/priv/known_hosts), but when a node joins, the key file is merged, and public keys for the same hostname are always overwritten by the joining node.

So yes, that approach should work.
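The merge/overwrite behaviour can be illustrated with a throwaway file (purely a toy model of what the join does; on a real cluster the file is /etc/pve/priv/known_hosts, and "oldnode" plus the key material here are made up):

```shell
#!/bin/sh
# Toy model: on join, the entry for the same hostname is replaced.
KH=$(mktemp)
printf 'oldnode ssh-rsa AAAAOLDKEY\nothernode ssh-rsa AAAAOTHER\n' > "$KH"

# drop the stale entry for "oldnode", then append the joining node's key
grep -v '^oldnode ' "$KH" > "$KH.tmp" && mv "$KH.tmp" "$KH"
echo 'oldnode ssh-rsa AAAANEWKEY' >> "$KH"

grep '^oldnode ' "$KH"    # only the new key remains for that hostname
```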
 
