[SOLVED] Remove Node -> Join Node & Update packages.

CyberGuy

Member
Dec 20, 2021
Hi guys,

I need your help. I am Patryk; during the past month I was working as a developer, but as a hobby I was also doing the sysadmin/hardware work for our company. The main sysadmin left the company, so it is now on my shoulders... I was left alone with what I think is a not-so-good configuration. I may need help, as I am not experienced with touching production configurations, but I have a plan which I would like to ask you to check - maybe some of you have similar experience and can comment on it?

I added a new node to the cluster. The join was fine, but after I installed Ceph something seems to be not right: when I go to the Ceph interface on that node I get a timeout. So the plan would be to remove the node, re-install it, and add it again.
We use Virtual Environment 6.4. The problem is that we are running out of space... I am left with around 300 GB of space on Ceph with replica 2, so I am super stressed about everything working fine.


I created a list of what I need and I hope I did it right. I just need a bit of cheering up before I start, I believe :)


1) Process of removing the node from the cluster:
1) As I understand it, the node that needs to be removed/re-installed has to be shut down first?
2) On another node I should run this command: pvecm delnode [node]
3) Check that the node was properly deleted: pvecm status
*** Because I would like to re-add the broken node again and keep the same hostname/IP, should I:
  • remove some config files from /etc/pve related to the removed node?
  • the broken node was not part of Ceph yet - no OSD, no monitor, no manager. Should I care when shutting this node down, and which locations should I check? In the config files under /etc/pve I could not find anything related to this machine.
4) /etc/hosts - since I would like to reuse this node, should I remove its entry for the moment? (A command sketch for the removal is below.)
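
A minimal sketch of the removal, assuming the broken node is called pve8 (a placeholder name) and has already been shut down:

Bash:
# run on any remaining cluster node, after the broken node is powered off
pvecm delnode pve8   # remove the node from the cluster membership
pvecm status         # verify it no longer shows up and the cluster still has quorum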


2) Process of joining the new node:
1) Install Proxmox Virtual Environment 6.4, use the same repositories as the other nodes in the cluster to get the same packages, and update the system.
The problem is that the other nodes in the cluster have ceph: 15.2.14-pve1~bpo10, but the new node will have ceph: 15.2.15-pve1~bpo10.
Will I be able to join the cluster, install Ceph, and create OSDs, monitors, and managers on the new node so that the other nodes will see it? This is quite important, because we have reached full capacity; I did not expect it to have such a big impact.
2) We use the same 40 Gbit/s QSFP+ network (non-blocking switches) for both the Corosync cluster network and the Ceph storage network. Is that an issue or will it be fine?
3) The join process should be simple, as I already know how to configure the network cards, hosts, time, and locale.
4) Should I SSH from the new node to the other servers? From what I understood the nodes communicate via SSH. Or do I need to do anything more than checking pvecm status? (A join sketch follows below.)
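
A minimal sketch of the join itself, assuming the new node can reach an existing cluster member at 192.168.200.241 (the address is only taken from the Ceph config posted further down as an example):

Bash:
# run on the freshly installed node, once networking, /etc/hosts, time and repositories are set up
pvecm add 192.168.200.241   # join the existing cluster via one of its members
pvecm status                # confirm the new node appears and the cluster has quorum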

3) Process of installing and configuring Ceph:
1) On the new node I install Ceph via the web interface. When I do this, it should directly connect to the Ceph storage cluster, since the config file is shared via /etc/pve and Corosync should synchronize the Ceph config and auth key to the new node?
2) Will I be able to add new OSDs and make this node a monitor and manager as well, even though the new node has a newer package version while the other nodes in the cluster still run the older one? Will they be able to see the new node, its OSDs, managers, and monitors?
3) When I add a new OSD, the process of scrubbing starts - do I need to add the drives one by one, or all at the same time from the new machine? What would be the best thing to do? (A CLI sketch follows below.)
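
For reference, a rough sketch of the CLI equivalent of the web-interface steps (the disk name is a placeholder; the GUI does the same under the hood):

Bash:
# on the new node, same as Node -> Ceph -> Install in the GUI
pveceph install               # installs the Ceph packages; make sure the same Ceph release as the rest of the cluster (Octopus here) is configured
pveceph mon create            # only if this node should also run a monitor
pveceph osd create /dev/sdX   # replace /dev/sdX with an actual empty disk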


4) Process of updating packages - hopefully no reboot required
1) I would like to update the packages on the other nodes in the cluster so that they match the new node and there won't be any issues. If I understand correctly, the update process won't interrupt the cluster - I hope no reboot is needed at all?
2) As it is, we have plenty of VMs and each node is part of the cluster. What do you guys do to keep the same versions everywhere? Maybe a stupid question, but what happens when something goes wrong during an update?
3) Updating the Ceph packages - OSDs, monitors, managers. Right now I have 34 OSDs, 6 managers, and 6 monitors, with 1 manager and 1 monitor active. As I understand it there can be only one active monitor and manager, so I will update them one by one and see. The problem could be what happens when the active manager has to move to another node - the others are in standby mode, so one should be picked up automatically by the system and used?
4) Everything should work just fine at this step?


So this is my case. I hope you guys understand it, and if you are happy to help I will be super glad!!!

Patryk
 
Should I care when shutting this node down, and which locations should I check? In the config files under /etc/pve I could not find anything related to this machine.
Shut it down before removing it from the cluster. If you run a hyperconverged setup with PVE and Ceph you are actually running two clusters. PVE is its own cluster as is Ceph, which gets installed and managed by PVE.

Then look under /etc/pve/nodes to see if there is a directory with the name of the node you want to remove.
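
For example (pve8 is just a placeholder for the removed node's name):

Bash:
ls /etc/pve/nodes
# if a directory for the removed node is still there and you no longer need its VM configs:
rm -r /etc/pve/nodes/pve8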

4) /etc/hosts - since I would like to reuse this node, should I remove its entry for the moment?
You have a list of all the nodes in the cluster and their IP addresses in there? It should be okay to keep, as long as you keep the same IPs.

1) Install Proxmox Virtual Environment 6.4, use the same repositories as the other nodes in the cluster to get the same packages, and update the system.
The problem is that the other nodes in the cluster have ceph: 15.2.14-pve1~bpo10, but the new node will have ceph: 15.2.15-pve1~bpo10.
Will I be able to join the cluster, install Ceph, and create OSDs, monitors, and managers on the new node so that the other nodes will see it? This is quite important, because we have reached full capacity; I did not expect it to have such a big impact.
Minor version differences should be okay. You could also install the latest updates on the existing cluster first, rebooting one node at a time and live-migrating VMs between the nodes if they need to keep running.

2) We use the same 40 Gbit/s QSFP+ network (non-blocking switches) for both the Corosync cluster network and the Ceph storage network. Is that an issue or will it be fine?
If that is the only network for Corosync, you could run into issues. The recommendation is to have at least one physical network just for Corosync itself, so that no other service can use up all the bandwidth and cause high latency for the Corosync packets. Ceph is such a service. If you have more networks available, consider configuring Corosync to use those as well: https://pve.proxmox.com/pve-docs-6/pve-admin-guide.html#pvecm_redundancy
If a network becomes unusable it will switch to another one by itself, if you have it configured.
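
A sketch of what an additional Corosync link looks like in /etc/pve/corosync.conf (the node name and the 10.10.10.x addresses are made up; follow the linked documentation for the full procedure, which also requires increasing config_version):

Code:
nodelist {
  node {
    name: pve1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.200.241   # existing shared 40G network
    ring1_addr: 10.10.10.241      # additional dedicated Corosync link
  }
  # add the extra ring1_addr to every node entry
}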

4) Should I SSH from the new node to the other servers? From what I understood the nodes communicate via SSH. Or do I need to do anything more than checking pvecm status?
SSH is used in a few instances, but most of the time the API is used (port 8006) and Corosync as well, to sync the /etc/pve directory which is also used for the HA stack and a few other things.

1) On the new node I install Ceph via the web interface. When I do this, it should directly connect to the Ceph storage cluster, since the config file is shared via /etc/pve and Corosync should synchronize the Ceph config and auth key to the new node?
Yes.

2) Will I be able to add new OSDs and make this node a monitor and manager as well, even though the new node has a newer package version while the other nodes in the cluster still run the older one? Will they be able to see the new node, its OSDs, managers, and monitors?
Should work fine. A minor version difference is absolutely okay.

3) When I add a new OSD, the process of scrubbing starts - do I need to add the drives one by one, or all at the same time from the new machine? What would be the best thing to do?
What will happen is that Ceph will rebalance the data. Add them all at the same time, as each new OSD will cause a new rebalance; better to only have it once.
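
A minimal sketch of adding all OSDs on the new node in one go (the device names are only examples; check them with lsblk first):

Bash:
# on the new node, create one OSD per empty disk, back to back
for disk in /dev/sdb /dev/sdc /dev/sdd /dev/sde; do
    pveceph osd create "$disk"
done
ceph -s   # then watch the rebalance progress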

1) I would like to update the packages on the other nodes in the cluster so that they match the new node and there won't be any issues. If I understand correctly, the update process won't interrupt the cluster - I hope no reboot is needed at all?
Reboots are needed if a new kernel is installed, in order to boot into it. If you plan to destroy those nodes anyway, you could skip the reboot. If there is a new Ceph version, you will see yellow icons indicating that you should restart these services (mon, mgr, osd, ...) so they start with the new version. You can do so one at a time and there should not be any issues.
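
For example, restarting the Ceph daemons on one node from the shell (the node name pve1 and the OSD ID are placeholders; the restart buttons in the GUI do the same):

Bash:
systemctl restart ceph-mon@pve1.service   # the monitor on this node, if it has one
systemctl restart ceph-mgr@pve1.service   # the manager on this node, if it has one
systemctl restart ceph-osd@0.service      # one OSD at a time
ceph -s                                   # wait for HEALTH_OK before restarting the next one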

2) As it is, we have plenty of VMs and each node is part of the cluster. What do you guys do to keep the same versions everywhere? Maybe a stupid question, but what happens when something goes wrong during an update?
Migrate VMs to other nodes, update and potentially reboot that node, wait for it to be back up, migrate VMs back. Continue with the next node until you are done. Live migration from an older to a newer version should always work. We do not guarantee that a live migration from a newer version to an older one will always work.
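
A rough per-node sequence from the CLI (VM ID 100 and the target node pve2 are just examples; the same can be done via the GUI or bulk migration):

Bash:
qm migrate 100 pve2 --online     # live-migrate each running VM off the node to be updated
apt update && apt dist-upgrade   # update the now-empty node
reboot                           # only needed if a new kernel was installed
# once the node is back up, migrate the VMs back and continue with the next node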

3) Updating the Ceph packages - OSDs, monitors, managers. Right now I have 34 OSDs, 6 managers, and 6 monitors, with 1 manager and 1 monitor active. As I understand it there can be only one active monitor and manager, so I will update them one by one and see. The problem could be what happens when the active manager has to move to another node - the others are in standby mode, so one should be picked up automatically by the system and used?
How many nodes do you have in your cluster? Unless we are talking about a lot, there is no need for 6 MONs; 3 should be plenty. You should also have an odd number, since they work on a majority voting principle (as does PVE itself), and with an even number you could theoretically run into a split-brain situation.
If the active MGR is down, one of the standbys will take over. But again, 6 looks like a lot: you need at least 2, should the first one reboot or fail, and 3 might be a good idea to have enough redundancy in place.
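
If you decide to reduce the number of monitors/managers, a sketch of the relevant commands (pve6 is a placeholder ID; only remove one at a time and only while the cluster is healthy):

Bash:
ceph mon stat              # shows how many monitors exist and which are in quorum
pveceph mon destroy pve6   # remove the monitor with that ID
pveceph mgr destroy pve6   # remove a standby manager the same way
ceph -s                    # confirm the cluster is still healthy afterwards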



I hope this helps you a bit. If you have the resources, try to set up a virtual PVE cluster and try the procedure there before you attempt it in the production one.
 
Aaron that is amazing, thank you :)

I really appreciate your time :)

I removed the node successfully and did not get any errors, but it was not fully configured yet and was missing packages. It was just better to reinstall it.

The other problem I have with Ceph itself is the PG count.

We have 34 OSDs with PG = 512 right now, while the optimal is 1024. I will need to dig into the documentation to find out all the details, because I read that it is better to change the PG count before adding the new OSDs, but we are running out of space and I would like to prevent data loss.
 
We have 34 OSDs with PG = 512 right now, while the optimal is 1024. I will need to dig into the documentation to find out all the details, because I read that it is better to change the PG count before adding the new OSDs, but we are running out of space and I would like to prevent data loss.
Ideally you have around 100 PGs per OSD; you can check it with ceph osd df tree.

This can also show you whether the PGs and data usage are spread somewhat evenly across your OSDs, or whether a few OSDs have much more data and PGs than others - which is a hint that you don't have enough PGs to split the data very evenly.

You can use the PG calculator ( https://old.ceph.com/pgcalc/ ) to get an idea of how many PGs you need to define for the pool. Choose "All in one" in the drop-down at the top, set the number of OSDs, and add pools if you have more than one pool using the same OSDs (or if you have separate device types like HDD and SSD).

There is also the autoscaler, which can take care of this, but it will only become active if the ideal number of PGs is off from the current one by a factor of 3. Meaning, if you now have 512 PGs and ideally you would have 1024, it will warn you, but you will have to set it manually. If you had 256 PGs and should have 1024, then the autoscaler would change that automatically.
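
To check the suggestion and, if needed, set the PG count manually (the pool name is a placeholder; use the names shown by ceph osd pool ls):

Bash:
ceph osd pool autoscale-status            # what the autoscaler considers optimal per pool
ceph osd pool ls                          # list the pool names
ceph osd pool set <poolname> pg_num 1024  # raise the PG count of a pool manually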

I read that it is better to change the PG count before adding the new OSDs, but we are running out of space and I would like to prevent data loss.
Enough PGs make it easier for Ceph to split large data chunks into smaller ones, which are easier to spread over the OSDs. Having too many will result in more resources needed for accounting, hence the rule of thumb of around 100 PGs per OSD.
 
Hi Aaron,

Thank you for all the information :)

Rejoining the node was successful and Ceph is installed; we have not added the OSDs to the Ceph storage yet.

I am wondering whether I should increase the number before adding the new OSDs. Stupid question, but when extending manually, can we just increase the number we have right now from 512 -> 1024 and later from 1024 -> 2048 to support future expansion? We will have more hard drives in production.


Below is our global Ceph configuration:

Bash:
cluster_network = 192.168.200.241/24
fsid = XXXXXXXXXXXXXXXXXXXXXXXX
mon_allow_pool_delete = true
mon_host = 192.168.200.241 192.168.200.242 192.168.200.243 192.168.200.244 192.168.200.245 192.168.200.246
osd_pool_default_min_size = 1
osd_pool_default_size = 2
public_network = 192.168.200.241/24

It was initialized by the previous sysadmin.


I would like to understand how long this process might take and how big the impact on our infrastructure will be; I believe it will increase CPU, RAM, and network usage.

My main concern is: is it safe with the configuration we have? I want to prevent data loss and any downtime.

Patryk
 
Can you post the output of ceph osd df tree?

Are those spinning HDDs?
 
Hi Aaron,

I think I have a bigger issue:
When I run ceph -s on the new node I get this error:

Error initializing cluster client: ObjectNotFound('RADOS object not found (error calling conf_read_file)')


But the web interface works. The problem is that I have no Ceph cluster configuration file on that node, even though the init was done and the installation was successful...

Here is the output of the command you wanted :)

Code:
ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP META AVAIL %USE VAR PGS STATUS TYPE NAME
-1 61.85950 - 62 TiB 36 TiB 36 TiB 315 MiB 86 GiB 25 TiB 58.93 1.00 - root default
-3 7.27759 - 7.3 TiB 4.3 TiB 4.3 TiB 16 MiB 9.8 GiB 3.0 TiB 58.66 1.00 - host pve1
0 ssd 1.81940 1.00000 1.8 TiB 1.2 TiB 1.2 TiB 2 MiB 2.7 GiB 629 GiB 66.21 1.12 35 up osd.0
1 ssd 1.81940 1.00000 1.8 TiB 1.1 TiB 1.1 TiB 2.1 MiB 2.6 GiB 733 GiB 60.65 1.03 37 up osd.1
2 ssd 1.81940 1.00000 1.8 TiB 878 GiB 876 GiB 8.6 MiB 2.0 GiB 985 GiB 47.13 0.80 30 up osd.2
3 ssd 1.81940 1.00000 1.8 TiB 1.1 TiB 1.1 TiB 3.5 MiB 2.6 GiB 733 GiB 60.64 1.03 35 up osd.3
-5 7.27759 - 7.3 TiB 4.5 TiB 4.5 TiB 23 MiB 10 GiB 2.8 TiB 61.70 1.05 - host pve2
4 ssd 1.81940 1.00000 1.8 TiB 949 GiB 947 GiB 1.4 MiB 2.2 GiB 914 GiB 50.95 0.86 30 up osd.4
5 ssd 1.81940 1.00000 1.8 TiB 1.4 TiB 1.4 TiB 4.4 MiB 3.1 GiB 400 GiB 78.51 1.33 47 up osd.5
6 ssd 1.81940 1.00000 1.8 TiB 1.3 TiB 1.3 TiB 8.2 MiB 2.9 GiB 554 GiB 70.29 1.19 39 up osd.6
7 ssd 1.81940 1.00000 1.8 TiB 876 GiB 874 GiB 9.2 MiB 2.2 GiB 987 GiB 47.05 0.80 29 up osd.7
-7 7.27759 - 7.3 TiB 4.5 TiB 4.5 TiB 81 MiB 11 GiB 2.8 TiB 62.04 1.05 - host pve3
8 ssd 1.81940 1.00000 1.8 TiB 1.1 TiB 1.1 TiB 3.6 MiB 2.5 GiB 772 GiB 58.58 0.99 32 up osd.8
9 ssd 1.81940 1.00000 1.8 TiB 1.6 TiB 1.6 TiB 3.0 MiB 4.3 GiB 187 GiB 89.96 1.53 49 up osd.9
10 ssd 1.81940 1.00000 1.8 TiB 1.2 TiB 1.2 TiB 4.1 MiB 2.7 GiB 591 GiB 68.29 1.16 39 up osd.10
11 ssd 1.81940 1.00000 1.8 TiB 584 GiB 582 GiB 71 MiB 1.6 GiB 1.2 TiB 31.33 0.53 20 up osd.11
-9 7.27759 - 7.3 TiB 4.4 TiB 4.4 TiB 88 MiB 11 GiB 2.9 TiB 60.57 1.03 - host pve4
12 ssd 1.81940 1.00000 1.8 TiB 837 GiB 835 GiB 7.2 MiB 1.9 GiB 1.0 TiB 44.95 0.76 28 up osd.12
13 ssd 1.81940 1.00000 1.8 TiB 1.3 TiB 1.3 TiB 69 MiB 3.0 GiB 514 GiB 72.44 1.23 41 up osd.13
14 ssd 1.81940 1.00000 1.8 TiB 1.1 TiB 1.1 TiB 5.7 MiB 2.8 GiB 736 GiB 60.48 1.03 35 up osd.14
15 ssd 1.81940 1.00000 1.8 TiB 1.2 TiB 1.2 TiB 5.8 MiB 3.0 GiB 663 GiB 64.41 1.09 35 up osd.15
-11 10.91638 - 11 TiB 5.8 TiB 5.8 TiB 31 MiB 14 GiB 5.1 TiB 53.27 0.90 - host pve5
16 ssd 1.81940 1.00000 1.8 TiB 767 GiB 765 GiB 5.2 MiB 1.9 GiB 1.1 TiB 41.14 0.70 25 up osd.16
17 ssd 1.81940 1.00000 1.8 TiB 1.0 TiB 1.0 TiB 4.2 MiB 2.6 GiB 802 GiB 56.95 0.97 33 up osd.17
18 ssd 1.81940 1.00000 1.8 TiB 985 GiB 982 GiB 2.4 MiB 2.6 GiB 878 GiB 52.87 0.90 34 up osd.18
19 ssd 1.81940 1.00000 1.8 TiB 1.1 TiB 1.1 TiB 7.4 MiB 2.6 GiB 729 GiB 60.85 1.03 34 up osd.19
24 ssd 1.81940 1.00000 1.8 TiB 1.0 TiB 1.0 TiB 5.7 MiB 2.4 GiB 804 GiB 56.86 0.96 32 up osd.24
25 ssd 1.81940 1.00000 1.8 TiB 949 GiB 947 GiB 5.7 MiB 2.2 GiB 914 GiB 50.95 0.86 29 up osd.25
-13 10.91638 - 11 TiB 6.4 TiB 6.4 TiB 39 MiB 15 GiB 4.5 TiB 58.72 1.00 - host pve6
20 ssd 1.81940 1.00000 1.8 TiB 1.1 TiB 1.1 TiB 3.3 MiB 2.5 GiB 768 GiB 58.76 1.00 34 up osd.20
21 ssd 1.81940 1.00000 1.8 TiB 1.4 TiB 1.4 TiB 5.2 MiB 3.0 GiB 438 GiB 76.49 1.30 42 up osd.21
22 ssd 1.81940 1.00000 1.8 TiB 1.0 TiB 1.0 TiB 8.9 MiB 2.4 GiB 806 GiB 56.72 0.96 32 up osd.22
23 ssd 1.81940 1.00000 1.8 TiB 1.1 TiB 1.1 TiB 6.9 MiB 2.6 GiB 769 GiB 58.72 1.00 36 up osd.23
26 ssd 1.81940 1.00000 1.8 TiB 911 GiB 909 GiB 6.3 MiB 2.2 GiB 952 GiB 48.90 0.83 28 up osd.26
27 ssd 1.81940 1.00000 1.8 TiB 982 GiB 980 GiB 8.5 MiB 2.3 GiB 881 GiB 52.70 0.89 31 up osd.27
-15 10.91638 - 11 TiB 6.5 TiB 6.5 TiB 37 MiB 15 GiB 4.4 TiB 59.96 1.02 - host pve7
28 ssd 1.81940 1.00000 1.8 TiB 1020 GiB 1018 GiB 6.2 MiB 2.3 GiB 843 GiB 54.76 0.93 30 up osd.28
29 ssd 1.81940 1.00000 1.8 TiB 875 GiB 873 GiB 5.0 MiB 2.0 GiB 988 GiB 46.94 0.80 28 up osd.29
30 ssd 1.81940 1.00000 1.8 TiB 981 GiB 978 GiB 2.6 MiB 2.2 GiB 882 GiB 52.63 0.89 33 up osd.30
31 ssd 1.81940 1.00000 1.8 TiB 1.4 TiB 1.4 TiB 9.3 MiB 3.1 GiB 441 GiB 76.30 1.29 42 up osd.31
32 ssd 1.81940 1.00000 1.8 TiB 1.0 TiB 1.0 TiB 6.0 MiB 2.4 GiB 805 GiB 56.80 0.96 31 up osd.32
33 ssd 1.81940 1.00000 1.8 TiB 1.3 TiB 1.3 TiB 8.2 MiB 2.9 GiB 515 GiB 72.33 1.23 39 up osd.33
TOTAL 62 TiB 36 TiB 36 TiB 315 MiB 86 GiB 25 TiB 58.93
MIN/MAX VAR: 0.53/1.53 STDDEV: 11.77
 
But the web interface works. The problem is that I have no Ceph cluster configuration file on that node, even though the init was done and the installation was successful...
You did install it via the web GUI right? Do you have the /etc/pve/ceph.conf file on that node with the same contents as on the other nodes? Is there a /etc/ceph/ceph.conf file which is a symlink to the one in /etc/pve?

Code:
ls -l /etc/ceph/ceph.conf

Some of the OSDs are rather full, while others are quite empty. OSD 9 is close to 90% which should trigger some warnings.
With that in mind, I think I would rather opt to add more OSDs on the new node (once it is working properly). This will cause Ceph to rebalance the data and hopefully will just move data off the very full OSDs. Then increase the PG num manually. You can check what the autoscaler suggests and what the PG calculator suggests.
The reason why I would do that after you have more OSDs is to give Ceph enough space to rearrange and split up the current PGs.

I hope it all will work fine, but with some OSDs that full, and especially OSD 9, you can only hope. It might take quite some time, depending on the load and network. Though, Ceph usually tries to keep the rebalance IO a bit slower than what could be possible, to not impact the operational IO too much.

Considering that the pool is set to a size/min-size of 2/1, you should consider, once the cluster is in a safer state, adding more OSDs or replacing them with larger ones, because you really should use a size/min-size of 3/2 to keep your data safe. That does need more space though, and you also want to have the OSDs empty enough so they can suffer the loss of a node, or more if you need that.
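
Once there is enough free capacity, the change itself is just two pool settings (the pool name is a placeholder; expect a rebalance, since a third copy of all data has to be created):

Bash:
ceph osd pool set <poolname> size 3       # keep three replicas of every object
ceph osd pool set <poolname> min_size 2   # require at least two replicas for client IO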
 
Hmm, from the PVE side the new node is there and all correct? You see it in the GUI and can access it fine, and it shows up in the Cluster overview as well as when you run pvecm status?

You also installed Ceph on that node? Check by selecting the Node in the GUI and then the Ceph menu for that node. If it is not installed, it should prompt you with the installation wizard.
 
Aaron, I fixed it with the command: ln -s /etc/pve/ceph.conf /etc/ceph/ceph.conf. It works fine now.

pvecm status shows fine.

Now I can run ceph -s on the new node and see everything.
 
I hope it all will work fine, but with some OSDs that full, and especially OSD 9, you can only hope. It might take quite some time, depending on the load and network. Though, Ceph usually tries to keep the rebalance IO a bit slower than what could be possible, to not impact the operational IO too much.
So you mean that it can be critical to run it and not safe? If I understand correctly, it can cause data loss?
 
You will not lose data, unless you actually lose OSDs. But if an OSD gets too full, IO might be blocked for the clients.
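
To keep an eye on how close the OSDs get to the fullness limits during the rebalance (the default thresholds are usually 85% nearfull, 90% backfillfull, and 95% full, but check your own values):

Bash:
ceph df                                                                  # overall and per-pool usage
ceph osd dump | grep -E 'nearfull_ratio|backfillfull_ratio|full_ratio'   # configured thresholds
ceph health detail                                                       # lists nearfull/full OSD warnings explicitly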
 
Hi Aaron,

Sorry for the late reply. I will mark it as solved and write a new thread, as I have a new topic.

Merry Christmas :)
 
