Adding new PVE 6 node to a cluster which is being upgraded from PVE 5.4 to PVE 6

carles89

Hello,

We're in the process of upgrading a 4-node cluster to PVE 6, but we would like to know whether it's possible to add a new PVE 6 node to the cluster while we still have one node on 5.4. Here is our environment:

ENVIRONMENT:
node0: PVE 5.4 (with Corosync 3) -> with VMs running on local-lvm
node1: PVE 6.2 (already upgraded) -> NO VMs running
node2: PVE 6.2 (already upgraded) -> with VMs running on local-lvm
node3: PVE 6.2 (already upgraded) -> with VMs running on local-lvm

All nodes use a HW RAID controller and local storage (LVM-Thin).

Our upgrade procedure consists of moving all VMs off a node before upgrading it, and our intention was to move the VMs from node0 to node1 and then finish the cluster upgrade by upgrading node0.
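
For reference, this is roughly what we run on each node before moving VMs off it and upgrading (a minimal sketch; pve5to6 is the official 5.x-to-6.x checklist script shipped with PVE 5.4):

Code:
# Confirm the package versions on the node about to be upgraded
pveversion -v

# Run the official 5.x -> 6.x upgrade checklist (available on PVE 5.4)
pve5to6

# Check cluster membership and quorum before touching anything
pvecm status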

THE PROBLEM:
The problem is that node1 has started to fail. Last week we saw these errors in the syslog (without any other failure on the server). We sent the logs to our provider, and the technicians at the datacenter replaced the offending ECC DIMM:

Code:
Jul 05 20:23:41 node1 kernel: {4099}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
Jul 05 20:23:41 node1 kernel: {4099}[Hardware Error]: It has been corrected by h/w and requires no further action
Jul 05 20:23:41 node1 kernel: {4099}[Hardware Error]: event severity: corrected
Jul 05 20:23:41 node1 kernel: {4099}[Hardware Error]: Error 0, type: corrected
Jul 05 20:23:41 node1 kernel: {4099}[Hardware Error]: fru_text: DIMM ??
Jul 05 20:23:41 node1 kernel: {4099}[Hardware Error]: section_type: memory error
Jul 05 20:23:41 node1 kernel: {4099}[Hardware Error]: error_status: 0x0000000000000400
Jul 05 20:23:41 node1 kernel: {4099}[Hardware Error]: node: 0
Jul 05 20:23:41 node1 kernel: ghes_edac: Internal error: Can't find EDAC structure
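
Since the errors were reported as corrected, we've also been keeping an eye on the ECC counters. A rough sketch of what we use, assuming rasdaemon is installed (it is a plain Debian package, not part of PVE, and the per-DIMM mapping may be incomplete when ghes_edac fails as above):

Code:
# Log ECC/MCE events persistently
apt install rasdaemon
systemctl enable --now rasdaemon

# Summary of memory-controller errors recorded so far
ras-mc-ctl --summary

# Corrected/uncorrected counts per DIMM (depends on EDAC working)
ras-mc-ctl --error-count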

After that, a couple of days ago, we saw two reboots overnight during a backup. We checked whether there was anything in the syslog before the reboot, and all we found were these strange ^@ characters:

Code:
Jul 10 23:42:02 node1 systemd[1]: pvesr.service: Succeeded.
Jul 10 23:42:02 node1 systemd[1]: Started Proxmox VE replication runner.
Jul 10 23:43:00 node1 systemd[1]: Starting Proxmox VE replication runner...
Jul 10 23:43:01 node1 systemd[1]: pvesr.service: Succeeded.
Jul 10 23:43:01 node1 systemd[1]: Started Proxmox VE replication runner.
Jul 10 23:44:00 node1 systemd[1]: Starting Proxmox VE replication runner...
Jul 10 23:44:01 node1 pvesr[8417]: trying to acquire cfs lock 'file-replication_cfg' ...
Jul 10 23:44:02 node1 pvesr[8417]: trying to acquire cfs lock 'file-replication_cfg' ...
Jul 10 23:44:03 node1 systemd[1]: pvesr.service: Succeeded.
Jul 10 23:44:03 node1 systemd[1]: Started Proxmox VE replication runner.
Jul 10 23:45:00 node1 systemd[1]: Starting Proxmox VE replication runner...
Jul 10 23:45:01 node1 systemd[1]: pvesr.service: Succeeded.
Jul 10 23:45:01 node1 systemd[1]: Started Proxmox VE replication runner.
Jul 10 23:45:01 node1 CRON[8776]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
^@^@Jul 10 23:49:51 node1 systemd-modules-load[568]: Inserted module 'vhost_net'
Jul 10 23:49:51 node1 systemd[1]: Starting Flush Journal to Persistent Storage...
Jul 10 23:49:51 node1 systemd[1]: Started Flush Journal to Persistent Storage.
Jul 10 23:49:51 node1 systemd[1]: Started udev Coldplug all Devices.
Jul 10 23:49:51 node1 systemd[1]: Starting Helper to synchronize boot up for ifupdown...
Jul 10 23:49:51 node1 systemd[1]: Starting udev Wait for Complete Device Initialization...
Jul 10 23:49:51 node1 systemd-udevd[641]: Using default interface naming scheme 'v240'.
Jul 10 23:49:51 node1 systemd-udevd[621]: Using default interface naming scheme 'v240'.
Jul 10 23:49:51 node1 systemd-udevd[641]: link_config: autonegotiation is unset or enabled, the speed and duplex are not writable.
Jul 10 23:49:51 node1 systemd-udevd[621]: link_config: autonegotiation is unset or enabled, the speed and duplex are not writable.
Jul 10 23:49:51 node1 systemd-udevd[664]: link_config: autonegotiation is unset or enabled, the speed and duplex are not writable.
Jul 10 23:49:51 node1 systemd[1]: Found device MR9271-4i swap-sda3.
Jul 10 23:49:51 node1 systemd[1]: Activating swap /dev/sda3...
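
As far as we understand, the ^@ characters are just the log file being zero-padded because the machine was reset while writing, so what we actually inspect is the tail of the previous boot in the journal. A minimal sketch of the commands we use:

Code:
# List recorded boots; the previous one has index -1
journalctl --list-boots

# Last lines logged before the crash/reboot
journalctl -b -1 -n 200 --no-pager

# Kernel-only messages from the previous boot, filtered for hardware errors
journalctl -b -1 -k | grep -iE 'mce|hardware error'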

We did some memory tests while recording the IPMI screen output, and the first one crashed with this error, followed by a reboot:
[Attached screenshot: Error n1b.png]
Since then, we've been unable to reproduce the issue (we're continuously running CPU and RAM tests and nothing crashes).
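
In case it helps, these are the kinds of tests we keep running (a sketch only; memtester and stress-ng are plain Debian packages, and the sizes/durations here are examples, not our exact values):

Code:
# Lock and test 8 GiB of RAM for 3 passes
memtester 8G 3

# Combined CPU and memory load for a few hours
stress-ng --cpu 0 --vm 4 --vm-bytes 75% --timeout 4h --metrics-brief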

TL;DR - THE QUESTION:
Since our intention was already to replace node1 once the cluster upgrade was done (we don't trust this machine anymore because of other issues we've had), is it possible to add a new server running PVE 6.2 to the cluster, even with one node still on PVE 5.4? We could then move the VMs from node0 to this new node and upgrade node0.

If I connect to a node running PVE 6.2, I can click on the Join Information button, so I assume it should work, but I would like to confirm with Proxmox staff that it's possible before messing anything up.
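
For reference, what I would run on the new PVE 6.2 server is roughly the following, i.e. the CLI equivalent of the GUI join (a sketch only; 192.0.2.10 is a placeholder for one of the existing cluster members):

Code:
# On the new, empty PVE 6.2 node: join the existing cluster
pvecm add 192.0.2.10

# Afterwards, verify membership and quorum
pvecm status
pvecm nodes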

Thanks in advance for your help!

Carles
 
Hi,
We're in the process of upgrading a 4-node cluster to PVE 6, but we would like to know whether it's possible to add a new PVE 6 node to the cluster while we still have one node on 5.4. Here is our environment:

and

is it possible to add a new server running PVE 6.2 to the cluster, even with one node still on PVE 5.4?

It may be possible, but it is not supported; that is why our docs state the following:
It is not possible to mix Proxmox VE 6.x and earlier with Proxmox VE 5.X cluster nodes.[1]

[1] https://pve.proxmox.com/pve-docs/chapter-pvecm.html
 
Hi Moayad,

Thanks for your answer. The idea is to join the new PVE 6 node to the cluster, move the VMs from the 5.4 node to the new one, and immediately upgrade the 5.4 node. We have to do it this way because node1, which has already been upgraded to PVE 6, started to fail after we began the cluster upgrade process.

What would you recommend?

a) Add the new PVE 6 node, move the VMs from the 5.4 node to it, and then upgrade the 5.4 node, leaving all 4 nodes on PVE 6.
b) Wait for node1 to be fixed (the datacenter technicians are replacing its motherboard); it already runs PVE 6. Then move the VMs from the 5.4 node to it, hoping the hardware doesn't fail again.

I'm just asking because I don't know the risks involved in adding a new node in this scenario.
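
For completeness, moving the VMs off the 5.4 node would be done roughly like this (a sketch; 100 and newnode are placeholders, and --with-local-disks is needed because the disks live on local-lvm):

Code:
# On the PVE 5.4 node (node0): live-migrate a VM together with its local disks
qm migrate 100 newnode --online --with-local-disks --targetstorage local-lvm

# For VMs that can be shut down, an offline migration works as well
qm migrate 100 newnode --with-local-disks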

Thanks again
 
Hi,

It is safer to use node2 or node3 to hold the VMs/LXC containers until node0 is upgraded to PVE 6.2 and node1 is fixed.

 
Hi,

node2 and node3 cannot take any more VMs (no resources available). That's why I asked about adding another node when node1 failed.

Right now, the technicians are replacing node1's motherboard. Do you think it's better to move the VMs to node1 once it's fixed, instead of adding a new one? (We'll replace it anyway when the migration finishes.)

Thanks
 
