Adding new PVE 6 node to a cluster which is being upgraded from PVE 5.4 to PVE 6

carles89 · Jul 12, 2020

Hello,

We're in the process of upgrading a 4-node cluster to PVE 6, and we would like to know whether it's possible to add a new PVE 6 node to the cluster while one node is still on 5.4. Here is our environment:

ENVIRONMENT:
node0: PVE 5.4 (with Corosync 3) -> with VMs running on local-lvm
node1: PVE 6.2 (already upgraded) -> NO VMs running
node2: PVE 6.2 (already upgraded) -> with VMs running on local-lvm
node3: PVE 6.2 (already upgraded) -> with VMs running on local-lvm

All nodes are working with HW RAID controller and local storage (LVM Thin).

Our upgrade procedure consists of moving all VMs off a node before upgrading it. Our intention was to move the VMs from node0 to node1 and finish the cluster upgrade by upgrading node0.

THE PROBLEM:
The problem is that node1 has started to fail. Last week we saw these errors in the syslog (without any other failure on the server). We sent the logs to our provider, and the techs at the datacenter replaced the offending ECC DIMM:

Code:

Jul 05 20:23:41 node1 kernel: {4099}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
Jul 05 20:23:41 node1 kernel: {4099}[Hardware Error]: It has been corrected by h/w and requires no further action
Jul 05 20:23:41 node1 kernel: {4099}[Hardware Error]: event severity: corrected
Jul 05 20:23:41 node1 kernel: {4099}[Hardware Error]: Error 0, type: corrected
Jul 05 20:23:41 node1 kernel: {4099}[Hardware Error]: fru_text: DIMM ??
Jul 05 20:23:41 node1 kernel: {4099}[Hardware Error]: section_type: memory error
Jul 05 20:23:41 node1 kernel: {4099}[Hardware Error]: error_status: 0x0000000000000400
Jul 05 20:23:41 node1 kernel: {4099}[Hardware Error]: node: 0
Jul 05 20:23:41 node1 kernel: ghes_edac: Internal error: Can't find EDAC structure
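For anyone triaging similar output, here is a minimal Python sketch that groups `[Hardware Error]` kernel lines by their `{sequence}` number and collects the `key: value` fields. The function name `summarize_hw_errors` is made up for this illustration, and the parsing only assumes the line layout visible in the excerpt above; it is not an official tool.

```python
import re
from collections import defaultdict

# Matches e.g. "{4099}[Hardware Error]: event severity: corrected"
LINE_RE = re.compile(r"\{(\d+)\}\[Hardware Error\]:\s*(.*)")

def summarize_hw_errors(lines):
    """Group '[Hardware Error]' lines by sequence id and collect 'key: value' fields.

    Lines without a 'key: value' payload (free-text messages) are skipped.
    """
    records = defaultdict(dict)
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue  # not a hardware error record line
        seq, text = match.group(1), match.group(2)
        key, sep, value = text.partition(": ")
        if sep:
            records[seq][key.strip()] = value.strip()
    return dict(records)

if __name__ == "__main__":
    sample = [
        "Jul 05 20:23:41 node1 kernel: {4099}[Hardware Error]: event severity: corrected",
        "Jul 05 20:23:41 node1 kernel: {4099}[Hardware Error]: fru_text: DIMM ??",
        "Jul 05 20:23:41 node1 kernel: {4099}[Hardware Error]: section_type: memory error",
    ]
    for seq, fields in summarize_hw_errors(sample).items():
        print(seq, fields)
```

Counting how many distinct sequence ids report `section_type: memory error` over time can help show whether corrected errors keep appearing after a DIMM swap.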

After that, a couple of days ago, we saw two reboots overnight, during a backup. We checked whether there was anything in the syslog before the reboots, and we only found these strange ^@ characters:

Code:

Jul 10 23:42:02 node1 systemd[1]: pvesr.service: Succeeded.
Jul 10 23:42:02 node1 systemd[1]: Started Proxmox VE replication runner.
Jul 10 23:43:00 node1 systemd[1]: Starting Proxmox VE replication runner...
Jul 10 23:43:01 node1 systemd[1]: pvesr.service: Succeeded.
Jul 10 23:43:01 node1 systemd[1]: Started Proxmox VE replication runner.
Jul 10 23:44:00 node1 systemd[1]: Starting Proxmox VE replication runner...
Jul 10 23:44:01 node1 pvesr[8417]: trying to acquire cfs lock 'file-replication_cfg' ...
Jul 10 23:44:02 node1 pvesr[8417]: trying to acquire cfs lock 'file-replication_cfg' ...
Jul 10 23:44:03 node1 systemd[1]: pvesr.service: Succeeded.
Jul 10 23:44:03 node1 systemd[1]: Started Proxmox VE replication runner.
Jul 10 23:45:00 node1 systemd[1]: Starting Proxmox VE replication runner...
Jul 10 23:45:01 node1 systemd[1]: pvesr.service: Succeeded.
Jul 10 23:45:01 node1 systemd[1]: Started Proxmox VE replication runner.
Jul 10 23:45:01 node1 CRON[8776]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
^@^@Jul 10 23:49:51 node1 systemd-modules-load[568]: Inserted module 'vhost_net'
Jul 10 23:49:51 node1 systemd[1]: Starting Flush Journal to Persistent Storage...
Jul 10 23:49:51 node1 systemd[1]: Started Flush Journal to Persistent Storage.
Jul 10 23:49:51 node1 systemd[1]: Started udev Coldplug all Devices.
Jul 10 23:49:51 node1 systemd[1]: Starting Helper to synchronize boot up for ifupdown...
Jul 10 23:49:51 node1 systemd[1]: Starting udev Wait for Complete Device Initialization...
Jul 10 23:49:51 node1 systemd-udevd[641]: Using default interface naming scheme 'v240'.
Jul 10 23:49:51 node1 systemd-udevd[621]: Using default interface naming scheme 'v240'.
Jul 10 23:49:51 node1 systemd-udevd[641]: link_config: autonegotiation is unset or enabled, the speed and duplex are not writable.
Jul 10 23:49:51 node1 systemd-udevd[621]: link_config: autonegotiation is unset or enabled, the speed and duplex are not writable.
Jul 10 23:49:51 node1 systemd-udevd[664]: link_config: autonegotiation is unset or enabled, the speed and duplex are not writable.
Jul 10 23:49:51 node1 systemd[1]: Found device MR9271-4i swap-sda3.
Jul 10 23:49:51 node1 systemd[1]: Activating swap /dev/sda3...
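The `^@` runs are how many viewers display NUL bytes, which often end up in a log file when a machine goes down before buffered writes reach the disk. As a rough sketch (the helper names `parse_ts` and `find_nul_gaps` are invented for this example, and the year is an assumption because classic syslog timestamps omit it), the following Python locates such runs and reports the last timestamp before and the first timestamp after each one:

```python
from datetime import datetime

def parse_ts(line, year=2020):
    """Parse a leading 'Mon DD HH:MM:SS' syslog timestamp; return None if absent."""
    try:
        return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
    except ValueError:
        return None

def find_nul_gaps(lines):
    """Return (last_ts_before, first_ts_after) pairs around runs of NUL bytes."""
    gaps = []
    last_ts = None
    pending = False  # a NUL run was seen and no timestamp has followed it yet
    for raw in lines:
        if "\x00" in raw:
            pending = True
        # NULs may be glued to the front of the first line written after reboot
        ts = parse_ts(raw.replace("\x00", ""))
        if ts is not None:
            if pending and last_ts is not None:
                gaps.append((last_ts, ts))
            pending = False
            last_ts = ts
    return gaps

if __name__ == "__main__":
    sample = [
        "Jul 10 23:45:01 node1 CRON[8776]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)",
        "\x00" * 74,
        "\x00\x00Jul 10 23:49:51 node1 systemd-modules-load[568]: Inserted module 'vhost_net'",
    ]
    for before, after in find_nul_gaps(sample):
        print(f"gap from {before} to {after}: {(after - before).total_seconds():.0f} s")
```

To run it on a real file, pass an open file object instead of `sample`, for example `open("/var/log/syslog", encoding="utf-8", errors="replace")`. The gap only bounds when the machine stopped logging; it does not say why.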

We ran some memory tests while recording the IPMI screen output, and the first one crashed with an error, followed by a reboot.

Since then, we have been unable to reproduce the issue (we're continuously running CPU and RAM tests and nothing crashes).

TL;DR - THE QUESTION:
Since our intention was already to replace node1 once the cluster upgrade was done (we don't trust this machine anymore because of other issues we had), is it possible to add a new server with PVE 6.2 to the cluster, even with one node still on PVE 5.4? Then we could move the VMs from node0 to this new node and upgrade node0.

If I connect to a node with PVE 6.2, I can click on the Join Information button, so I assume it should work, but I would like to confirm with Proxmox staff that it's possible before messing anything up.

Thanks in advance for your help!

Carles

carles89 · Jul 13, 2020

Anyone?

Thank you!

Moayad · Jul 13, 2020

Hi,

carles89 said:
We're in the process of upgrading a 4-node cluster to PVE 6, and we would like to know whether it's possible to add a new PVE 6 node to the cluster while one node is still on 5.4. Here is our environment:

and

carles89 said:
is it possible to add a new server with PVE 6.2 to the cluster, even having one node still with PVE 5.4?

It may work, but it is not supported, which is why our docs state:
It is not possible to mix Proxmox VE 6.x and earlier with Proxmox VE 5.X cluster nodes.[1]

[1] https://pve.proxmox.com/pve-docs/chapter-pvecm.html

carles89 · Jul 13, 2020

Hi Moayad,

Thanks for your answer. The idea is to join the new PVE6 node to the cluster, move VMs from the 5.4 node to the new one, and immediately upgrade the 5.4 node. We have to do it that way because node1, which already has been updated to PVE6, started to fail after starting the cluster upgrade process.

What would you recommend?

a) Add the new PVE6 node, move the VMs from the 5.4 node to it and then upgrade the 5.4 node, leaving all 4 nodes with PVE6.
b) Wait for node1, which already has PVE6, to be fixed (techs at the datacenter are replacing its motherboard), and move the VMs from the 5.4 node to it, hoping the hardware doesn't fail.
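Either way, the cluster stays mixed until node0 is upgraded. As a small illustration of that state, this Python sketch takes the node versions listed in the first post and reports which nodes are still below a target major version. The helper names (`major`, `nodes_below`, `is_mixed`) are hypothetical, and the version strings are typed in by hand here rather than read from any Proxmox tool.

```python
def major(version):
    """Return the major component of a version string such as '6.2'."""
    return int(version.split(".")[0])

def nodes_below(versions, target_major):
    """List nodes whose major version is lower than target_major."""
    return sorted(name for name, ver in versions.items() if major(ver) < target_major)

def is_mixed(versions):
    """True if the nodes do not all share the same major version."""
    return len({major(ver) for ver in versions.values()}) > 1

if __name__ == "__main__":
    # Versions as described in the first post of this thread.
    cluster = {"node0": "5.4", "node1": "6.2", "node2": "6.2", "node3": "6.2"}
    print("mixed cluster:", is_mixed(cluster))
    print("still to upgrade:", nodes_below(cluster, 6))
```

A check like this only describes the version spread; it says nothing about whether joining a node in that state is supported.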

I'm only asking because I don't know the risks involved in adding the new node in this scenario.

Thanks again

Moayad · Jul 13, 2020

Hi,

It is safer to use node2 or node3 to hold the VMs/LXC containers until node0 is upgraded to PVE 6.2 and node1 is fixed.

carles89 · Jul 13, 2020

Hi,

Nodes 2 and 3 cannot take any more VMs (no resources available). That's why I asked about adding another node when node1 failed.

Right now, technicians are replacing node1's motherboard. Do you think it's better to move the VMs to node1 once it's fixed instead of adding a new node? (We'll do that anyway when the migration finishes.)

Thanks
