[SOLVED] Ceph Luminous to Nautilus upgrade issues

troycarpenter

I have upgraded my 6 node cluster (3 ceph-only plus 3 compute-only nodes) from 5.4 to 6. The Ceph config was created on the Luminous release and I am following the upgrade instructions provided at https://pve.proxmox.com/wiki/Ceph_Luminous_to_Nautilus. During the upgrade the OSDs were "adapted" from ceph-disk to ceph-volume and the old SSD DB partitions (1GB each) were revealed to be too small. As outlined in the thread https://forum.proxmox.com/threads/bluefs-spillover-detected-on-30-osd-s.56230/ I figured the only way to fix the problem was to remove and re-add the OSDs one server at a time, adopting the new LVM method for Ceph with much larger (around 150GB) db partitions on the SSD.

I've done two of the three Ceph servers (each one has 10 OSDs), but each time, while the cluster was recovering, all VMs lost connection to their disk images. The compute nodes where the VMs run all had empty /etc/ceph directories, and needless to say the VMs all hung. Oddly, the fix was to reboot the storage server on which the OSDs had just been re-added: once it went offline for the reboot, all disk images on Ceph storage became available again. Everything kept working after that Ceph node came back up and the recovery continued. Unfortunately, all the VMs had to be restarted to regain access to their disk images.

I'm wondering if anyone else has seen this. It may not necessarily be related to the upgrade itself but rather to the follow-up work of deleting and re-adding OSDs one server at a time. I still have one more storage node in this cluster to upgrade, and then another identical cluster that will need the same upgrade and disk conversion, and I want to avoid this happening again.
 
Can you please describe the cluster layout and hardware?
 
All hardware was recommended by SMC engineering after consultation. I have two clusters, each cluster contains:

Ceph Storage Node Hardware: SMC SSG-6029P-E1CR12L (x3)
24 x Xeon Gold 6128 CPU@3.40GHz
192GB memory
2x Samsung SM863a 1.9TB SSD (one for system, one for CephDB)
10x Toshiba MG04SCA40EE 3.5 4TB 7200RPM SAS
4 10GbE copper ports

Compute Node Hardware: SYS-1029P-WTR (x5)
80 x Xeon Gold 6230 CPU@2.10GHz
392GB memory
Single Intel SSD for OS
4 10GbE copper ports
4 10GbE SFP+ ports
2 1GbE ports

Networking for all nodes:
1 10GbE for network connectivity of VMs (vmbr0)
1 10GbE for cluster communication
2 x 10GbE LAG for Ceph network

I am currently waiting on 2-port 100Gb NICs from SMC for all servers, to replace the 20Gb LAGs with 200Gb LAGs.

All servers are currently running the latest 6.0-4 code and are part of the Proxmox cluster. The Ceph nodes do NOT have any storage enabled (neither RBD nor CephFS), so no VMs can be created on or migrated to the storage nodes.
-----

All nodes were converted from 5.4 to 6.0 using the online instructions: first Corosync 3, then Proxmox, and finally Ceph. All OSDs were kept as they were from Luminous, but when originally created they only had 1GB DB partitions on the SSD and threw spillover warnings after the upgrade. Starting with a healthy cluster, I deleted all ten OSDs on one node and recreated them with 120GB DB partitions, waiting for the system to recover before doing the next node.
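
For reference, the per-OSD CLI steps looked roughly like this. The OSD id, device name, and the VG name ceph-db are placeholders, not my exact layout, and the destroy step is just what the GUI button calls anyway:
Code:
# drain the OSD and make sure it can be removed
ceph osd out 12
ceph osd safe-to-destroy osd.12        # repeat until it reports safe

# stop and destroy it (same as the GUI stop/destroy buttons)
systemctl stop ceph-osd@12
pveceph osd destroy 12 --cleanup

# carve a bigger DB logical volume out of the SSD's volume group
lvcreate -L 120G -n osd-db-12 ceph-db

# recreate the OSD with the HDD as data and the new LV as block.db
ceph-volume lvm create --bluestore --data /dev/sdc --block.db ceph-db/osd-db-12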

I found that each time I did that procedure, all nodes would lose connectivity to the Ceph storage for roughly the first 30 minutes of the rebalancing period. After rebuilding the OSDs on the second node, I learned to schedule time to shut down all VMs in the cluster before deleting and recreating the OSDs. If I didn't, the VMs lost connection with their disk images and data became corrupt, especially on any database VMs.

I tested by using an "rbd ls" command on all nodes. When there was a problem, that command would hang. When I could issue that command and it consistently gave me the listing of the VM images, then I knew I could finally restart the VMs while the rebalancing continued.
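
If anyone wants to script that check, something like this run from each node tells you quickly whether the cluster is answering (vm-storage is my RBD pool; adjust the timeout to taste):
Code:
# succeeds only if rbd answers within 10 seconds
if timeout 10 rbd -p vm-storage ls >/dev/null 2>&1; then
    echo "rbd OK"
else
    echo "rbd ls hung or failed"
fi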

I also found that during this period where Ceph wasn't available, if I did a reset on any newly recreated OSD, the system would work until the OSD came back online. This behavior was consistent each time I did the OSD recreation on a node.

Tonight I will be starting the same procedure on the second cluster. I have VM downtime scheduled for tonight and the next two nights so I can do all three storage nodes.
 
Ceph Storage Node Hardware: SMC SSG-6029P-E1CR12L (x3)
...
4 10GbE copper ports

Compute Node Hardware: SYS-1029P-WTR (x5)
...
4 10GbE copper ports
4 10GbE SFP+ ports
2 1GbE ports

Networking for all nodes:
1 10GbE for network connectivity of VMs (vmbr0)
1 10GbE for cluster communication
2 x 10GbE LAG for Ceph network
Why 1x 10 GbE for vmbr0 if the ceph nodes do not host VM/CT?
As the compute nodes do have 4x 10 GbE SFP+ and 2x 1 GbE more, how are they used?
It might be possible that the network is the bottleneck and the Ceph clients can't connect anymore because the bandwidth is used up.

Other things I noticed:
2x Samsung SM863a 1.9TB SSD (one for system, one for CephDB)
10x Toshiba MG04SCA40EE 3.5 4TB 7200RPM SAS
The 10:1 ratio is high; this will tax the SSD heavily and it may also be a bottleneck. Best to split the HDDs into two groups and use both SSDs for the DBs. The OS could easily live on a small SSD of its own, but for performance it has to be an SSD, as the MON DB flushes very often.

I am currently waiting on 2-port 100Gb NICs from SMC for all servers, to replace the 20Gb LAGs with 200Gb LAGs.
I recommend not using LAGs; they introduce an extra layer that raises latency and complexity. With the 100 GbE NICs there is plenty of bandwidth and low latency for those 8 nodes. You could separate the Ceph traffic (public/cluster) and provision 100 GbE for the Ceph OSD traffic alone.

And something from the config side: what is the value of osd_max_backfills (ceph daemon osd.X config show | grep osd_max_backfills)?
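
For reference, it can be checked and changed at runtime like this; osd.0 is just an example id, and the injectargs values only last until the OSDs restart:
Code:
# current value on a running OSD (run on the node that hosts osd.0)
ceph daemon osd.0 config show | grep osd_max_backfills

# change it on all OSDs at runtime, e.g. to throttle recovery
ceph tell osd.* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 1'

# Nautilus can also persist it in the MON config database
ceph config set osd osd_max_backfills 1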
 
Why 1x 10 GbE for vmbr0 if the ceph nodes do not host VM/CT?
That's an easy one...the ceph nodes have 10GbE built into the motherboard. There are no NICs less than 10GbE in those 3 servers. And all my infrastructure switches are 10GbE and up.

As the compute nodes do have 4x 10 GbE SFP+ and 2x 1 GbE more, how are they used?
It might be possible that the network is the bottleneck and the Ceph clients can't connect anymore because the bandwidth is used up.
I'm not using the SFP+ connections. I think that network card will need to be replaced by the 100GbE when those cards arrive next week so I didn't want to use them unless I had to.

The 10:1 ratio is high; this will tax the SSD heavily and it may also be a bottleneck. Best to split the HDDs into two groups and use both SSDs for the DBs. The OS could easily live on a small SSD of its own, but for performance it has to be an SSD, as the MON DB flushes very often.
That may be why SMC spec'd the two identical SSDs... they really didn't provide feedback after the sale to tell me what they were thinking with that. I will look at moving the OS to another SSD and using both of the Samsungs for DB. However, is there a procedure that lets me move the DB without recreating the OSDs? This would mean putting the DBs for some OSDs on the newly freed SSD and increasing the size of the DB LVs for the remaining ones to fill up both drives.

I recommend not using LAGs; they introduce an extra layer that raises latency and complexity. With the 100 GbE NICs there is plenty of bandwidth and low latency for those 8 nodes. You could separate the Ceph traffic (public/cluster) and provision 100 GbE for the Ceph OSD traffic alone.
The LAGs are actually more for redundancy than for performance. I have duplicated all the network connections with stacked switches to allow for outages (accidental cable pulls, switch upgrades, etc.).

And something from the config side: what is the value of osd_max_backfills (ceph daemon osd.X config show | grep osd_max_backfills)?
Just about everything is defaulted from when it was created in Luminous. That's part of why I'm recreating the OSDs in Nautilus. Pools:
Code:
Name                       size   min_size     pg_num     %-used                 used
dmz.cephfs_data               3          2        128       0.00         109093847040
dmz.cephfs_metadata           3          2         32       0.00              4128768
vm-storage                    3          2       1024       0.01         503304290484

The cephfs pool was added after the fact, so I just used the Proxmox GUI to create it.

"osd_max_backfills": "1". I may have changed that some time last year to reduce IO lag for VMs. Back then I had a more hyperconverged cluster allowing VMs to run on the ceph nodes as well, before getting this new hardware, that is.
 
Just out of curiosity, why are you replacing your 10G network with a 100G one? Considering your OSDs are HDDs, not only are you not going to get much benefit, but at that cost (NICs plus switches) you would get MUCH more bang for your buck replacing your HDDs with SSDs...
 
That's an easy one...the ceph nodes have 10GbE built into the motherboard. There are no NICs less than 10GbE in those 3 servers. And all my infrastructure switches are 10GbE and up.
You misunderstood me: why leave that 10 GbE unused if there will never be any VM/CT on those nodes?

I'm not using the SFP+ connections. I think that network card will need to be replaced by the 100GbE when those cards arrive next week so I didn't want to use them unless I had to.
Understood, but since you put them into the compute nodes, I am a bit confused about the network setup.

That may be why SMC spec'd the two identical SSDs... they really didn't provide feedback after the sale to tell me what they were thinking with that. I will look at moving the OS to another SSD and using both of the Samsungs for DB. However, is there a procedure that lets me move the DB without recreating the OSDs? This would mean putting the DBs for some OSDs on the newly freed SSD and increasing the size of the DB LVs for the remaining ones to fill up both drives.
As it is easy to destroy and re-create OSDs, I would go this route. But it is possible to move/expand the DB with the ceph-bluestore-tool.
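
A rough sketch of the ceph-bluestore-tool route, with osd.12 and the LV names as placeholders; the OSD has to be stopped while you do this:
Code:
systemctl stop ceph-osd@12

# after growing the DB LV with lvextend, let BlueFS use the new space
ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-12

# or move the DB to a different device/LV (Nautilus)
ceph-bluestore-tool bluefs-bdev-migrate --path /var/lib/ceph/osd/ceph-12 \
    --devs-source /var/lib/ceph/osd/ceph-12/block.db \
    --dev-target /dev/ceph-db/osd-db-12

systemctl start ceph-osd@12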

The LAGs are actually more for redundancy than for performance. I have duplicated all the network connections with stacked switches to allow for outages (accidental cable pulls, switch upgrades, etc.).
I understand. As said above, I am in general struggling to understand the network setup. Maybe you would like to check out Ceph's network reference guide, as Ceph's network setup has implications for the whole cluster.
http://docs.ceph.com/docs/nautilus/rados/configuration/network-config-ref/
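
For reference, the public/cluster separation mentioned above is just two options in ceph.conf (on PVE that is /etc/pve/ceph.conf); the subnets here are only examples:
Code:
[global]
    public_network  = 10.10.10.0/24   # MON and client traffic
    cluster_network = 10.10.20.0/24   # OSD replication/recovery traffic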

"osd_max_backfills": "1". I may have changed that some time last year to reduce IO lag for VMs. Back then I had a more hyperconverged cluster allowing VMs to run on the ceph nodes as well, before getting this new hardware, that is.
This is the default in new Ceph clusters.
 
You misunderstood me: why leave that 10 GbE unused if there will never be any VM/CT on those nodes?
I'm afraid I still don't understand the question. The compute nodes have two 10Gb connections (the LAG) to the Ceph public network, plus one for the Proxmox cluster and one for regular network traffic. The Ceph nodes have vmbr0 because that's what Proxmox put there and I never bothered to remove that part of the config. I think what really happened is that these nodes were ordered with more NICs than necessary.

Also, I don't see any performance issues on the Ceph cluster with this new hardware. Before, I used to see slow transactions all the time. The system was rock solid until the upgrade from 5.4 to 6.0, and even that wasn't an issue until the Ceph portion of the upgrade.

Understood, but since you put them into the compute nodes, I am a bit confused about the network setup.
I understand. As said above, I am in general struggling to understand the network setup. Maybe you would like to check out Ceph's network reference guide, as Ceph's network setup has implications for the whole cluster.
http://docs.ceph.com/docs/nautilus/rados/configuration/network-config-ref/
The network diagram at the top of that page is what I have, minus the Ceph cluster network. When I've used the term "cluster network", I've meant the Proxmox Cluster network.

As it is easy to destroy and re-create OSDs, I would go this route. But it is possible to move/expand the DB with the ceph-bluestore-tool.
I will look into the ceph-bluestore-tool. That sounds like what I need for the OSDs that I've already converted to the LVM-type in Nautilus.

Thanks for all your help. It is appreciated.
 
You're welcome. All in all, I am just trying to find out whether the recovery traffic interfered with Ceph's client traffic.
 
Final follow-up... this isn't a solution per se, but I did want to describe what I did, and to clarify that I now understand the original problem wasn't the upgrade from Luminous to Nautilus itself, but the fact that I needed to convert the old OSDs created under Luminous to the new method used in Nautilus.

The method that caused the problem was to take all the OSDs offline, then out, then destroy, then recreate them. I did the down/out/destroy via the GUI, but recreated with the CLI. When I did things this way, I lost access to the data on the Ceph servers; I think I saw an error message about some PGs not being available for a short while.

The method I've used since then is to take an OSD down / out / destroy / recreate one at a time, all through the GUI to force myself to go slower, especially on the first OSD, which kicks off a lot of activity and raises a few alarms that need to clear. I waited a short time (a minute or so) before doing the next one. With that process I never lost access to the data on the Ceph servers and all the VMs maintained their connections to their images.
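
For anyone who wants to do the same from the CLI, a stricter version of my one-minute wait would be to let peering finish before touching the next OSD, something like:
Code:
# wait until no PGs are reported inactive/peering before the next OSD
while ceph health detail | grep -q PG_AVAILABILITY; do
    echo "PGs still peering, waiting..."
    sleep 30
done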
 
