After joining a cluster, internal storage is lost

SergioRius

I already have a Proxmox server. I've set up a new Proxmox server and, before adding any service to it, I'm trying to add it as a node to the first one. So I copy the join info, launch the join, and it just sits there. After some time I can see that a message saying "Connection Error" has appeared on the cluster info screen in the background. But the join process seems to have finished, so I close the window and the node appears to have been added to the cluster: it shows the other server, and it shows up in the main server's GUI.

The only problem is that it has lost its main storage. It says:
Code:
could not activate storage 'local-zfs', zfs error: cannot import 'rpool': no such pool available (500)

And, logically, adding any VM fails. Does anyone know what is happening here? It is a brand new Proxmox instance (7.1-8).

Code:
# pvecm nodes

Membership information
----------------------
    Nodeid      Votes Name
         1          1 deathshadow (local)
         2          1 core
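
A note for anyone hitting the same error: it simply means that no ZFS pool called 'rpool' can be found on the node. Whether any pool exists there at all can be checked on that node's console, for example with:
Code:
# zpool list
# zpool import
zpool list shows the pools that are currently imported, and zpool import scans the disks for pools that could be imported but are not; if both come back empty, the pool really does not exist on that machine.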
 
the storage.cfg of the joining node and the cluster were different, but only the pre-existing one from the cluster remains after joining.

you can add the missing storage entry over the GUI (Datacenter -> Storage), and limit both the newly added and any pre-existing local storages to their respective nodes.
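
For illustration only (the exact properties depend on how the storage was originally defined), a node-restricted entry in /etc/pve/storage.cfg ends up looking roughly like this, using the node names from the pvecm output above:
Code:
zfspool: local-zfs
        pool rpool/data
        content rootdir,images
        sparse 1
        nodes deathshadow
The 'nodes' line is what the GUI's node restriction writes; without it, every node in the cluster is expected to provide the storage.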
 
Hi Fabian, thanks for your reply. And excuse me for my lack of experience. I'm trying to move from another hypervisor solution and still don't know Proxmox well enough.

What do you mean by the storage.cfg files being different? Being on different machines, aren't they supposed to be different? (I imagine them as a machine-specific config, like fstab?)

> Even when it can't connect to the storage, it still appears under Datacenter -> Storage and under the node in the tree. But clicking on its contents brings up the error.
Investigating on the console shows that rpool/data (local-zfs) does not exist and there are no pools.

> Could it be that the Proxmox installer formats this new node with ext4+LVM by default, and I should change it to ZFS?

I'll try to add that missing storage as you say, but I don't know if I'll be able to if there isn't a selector or pre-existing choices.
By limiting it locally, do you mean simply not checking 'shared'?
> I haven't been able to do anything. No ZFS pools available.

There are two circumstances that could cause this problem... One is that the existing server is a "network" server: it hosts a pfSense VM, DNS and some other related things. The router is configured as a router-on-a-stick, so it has a trunk interface and its "main" VLAN is, say, 3.
The new node will be an automation server and it is untagged on VLAN 5. I didn't think that should be related, because the first node is the gateway and for the moment there aren't any rules on the firewall. Unless the fact that the gateway is a VM has something to do with it.

The second point is that node1 is at version 6.4 and the new one is at 7.1-8. I saw several threads saying that there would be enough compatibility between those versions, although it is not recommended for production.
I have full VM backups on internal and external storage, but I'm so, so scared to upgrade the first node, it being the main gateway.
 
> What do you mean by the storage.cfg files being different? Being on different machines, aren't they supposed to be different?

I mean, before the join operation, you had two storage.cfg files:
  • the shared one of the existing cluster (which might only have consisted of a single node ;))
  • the one of the joining node
the storage config is shared across a cluster - it's a single file in /etc/pve, and each entry can (optionally) have an option set to tell PVE that this storage is not valid on all nodes, but only on a subset.

> By limiting it locally, do you mean simply not checking 'shared'?
the 'shared' flag tells PVE that this storage has the same contents on all nodes where it is available (e.g., an NFS storage, or a distributed storage like Ceph). a local storage should never have this flag set, as it will confuse PVE and cause issues. what I meant is: if you previously had a local-zfs storage on the cluster and your new node doesn't have that storage, you need to edit it and set the 'nodes' limit to the nodes where it actually is available. the same goes for the previous storages of the new node when re-adding them.
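
If the GUI route is awkward, the same 'nodes' limit can also be set from the shell with pvesm (node names as used in this thread; this only changes the storage definition in /etc/pve/storage.cfg and does not touch any data):
Code:
# pvesm set local-zfs --nodes deathshadow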

> The second point is that node1 is at version 6.4 and the new one is at 7.1-8.

I am not sure what the question is here - but running different major versions is definitely not recommended, except for a short period of time while upgrading.
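
For reference, the exact version each node is running can be checked with:
Code:
# pveversion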
 

I don't know what I'm supposed to do. On a brand new installation on core, this is what is shown:
[screenshot: 1639997186380.png]
And I can't change anything because they are read-only.

core (node2):
Code:
# cat /etc/pve/storage.cfg
dir: local
        path /var/lib/vz
        content iso,vztmpl,backup

lvmthin: local-lvm
        thinpool data
        vgname pve
        content rootdir,images

deathshadow (node1):
Code:
# cat /etc/pve/storage.cfg
dir: local
        path /var/lib/vz
        content vztmpl,iso,backup

zfspool: local-zfs
        pool rpool/data
        content rootdir,images
        sparse 1

dir: bulk
        path /mnt/bulk
        content vztmpl,iso,backup
        prune-backups keep-all=1
        shared 0
        is_mountpoint 1
        mkdir 0

When joining:
[screenshot: 1639997873894.png]
Both nodes have IP and DNS ping resolution. If I close the window now, even the SSH keys have been shared.

After the join, the storage.cfg file seems to be replaced with the node1 one. Restoring it doesn't have any effect. Trying to recreate the storage from the GUI is impossible because it says there are no unused disks.

I don't know what to do.
 

you need to follow the steps that I gave you in my previous answer:
- edit the local-zfs storage and limit it to 'deathshadow'
- add a new LVM-thin storage on 'core' and limit it to 'core'

both should be doable over the GUI using Datacenter -> Storage (for the second step you need to be connected to 'core', otherwise scanning the local LVM will not be possible)
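
For the second step, in case the GUI dialog gives trouble, a rough command-line equivalent (run on 'core'; the storage name and options here simply mirror the default local-lvm definition shown earlier) would be:
Code:
# pvesm add lvmthin local-lvm --thinpool data --vgname pve --content rootdir,images --nodes core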
 

In the end my main server's storage was lost. So I installed Proxmox 7.1-8 on it and recovered from backups. The process took hours because the installation kept getting stuck (at 3%, creating LVs, for example).

After that, I finally understood what you told me and restricted the local-lvm (and USB backup device) storage to this server only. But I couldn't do anything with "local" (the ISO etc. storage) as it's read-only.
Then I reinstalled node2 with the same version, restricted local-lvm to node2 and added it to the cluster I had previously created on node1.

The process took a while and ended in:
Code:
permission denied - invalid PVE ticket (401)
Connection lost.

And the node2 storage was lost again.
Both machines are at the same 7.1-8 version.
Both machines have time in sync.
Both machines can ping each other by IP and name.
All attempts to create/add/recover a local-lvm are futile, as it doesn't detect any available space nor pick up the old one.
This is not working, I would say.

Any steps on how to cleanly remove this node2 without breaking node1 again?
Any steps on how to add this node to the cluster, diagnose the fault or make it work?
Thanks.

Edit: the join is replacing storage.cfg on node2 with the config from node1 even when the above steps are taken:
[screenshot: 1640025751132.png]
And it removes its storage.

Node1 has the correct mixed configuration.
 
OK, it seems that after editing the config file underneath and waiting for it to sync, I've managed to unlock Datacenter -> Storage and add the node2 storage.
I still don't know if it's operational, but I'm now gonna take a nap :)

Thank you very much @fabian for your help.
 
it seems you have misunderstood me - the steps I described should be done AFTER joining. joining will ALWAYS OVERWRITE the configs of the joining node - not just storage.cfg, but everything that is cluster-wide. if you can't add the LVM storage, you are likely attempting to do so while connected to the wrong node (so the scan won't find any local LVM), or the cluster is not quorate (if you get 'permission denied' or similar messages).
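
A quick, purely illustrative way to check both conditions before retrying, run on 'core':
Code:
# pvecm status
# vgs
# lvs
pvecm status should report the cluster as quorate, and vgs/lvs should still list the 'pve' volume group with its 'data' thin pool; if they do, adding the LVM-thin storage as described above should work.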
 
