Problem when using pveceph createosd with live VMs

Nick Lawrence

Hi there, we are running Proxmox in our staging environment on new hardware with 4 nodes, namely:

Dell R640 - Dual 3.5 GHz 8-core CPUs
192 GB RAM
6 OSDs (all SSD)
10 Gbps networking.
1 VLAN for networking
1 VLAN for Ceph networking

Software :
Proxmox 5.2, latest kernel 4.15.17-2-pve
Latest Ceph from the Proxmox repos: ceph version 12.2.5 (dfcb7b53b2e4fcd2a5af0240d4975adc711ab96e) luminous (stable)
Ceph Monitors are installed on all 4 nodes.

We have a 4-node Dell configuration with replication size 3 and min_size 2 for IO.

We use the Proxmox defaults for the crushmap; we just change the device class from hdd to ssd.
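For anyone wanting to reproduce the setup, the commands involved look roughly like this (just a sketch - the pool name and OSD id are placeholders, and on Luminous the device class is normally detected automatically):

Code:
# replication: size 3, min_size 2 (pool name is a placeholder)
ceph osd pool set mypool size 3
ceph osd pool set mypool min_size 2

# correct the device class of an OSD that was detected as hdd
ceph osd crush rm-device-class osd.0
ceph osd crush set-device-class ssd osd.0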

Since install we have been using Ceph, and the short answer is: it's really good. Love it, in fact. The self-healing etc. is brilliant, and failover works as long as you don't use grsec etc.

We have created virtual machines that are currently running many different services.

Now..

If we destroy one node (6x OSD) for testing (since we want to roll this out to live), the cluster recovers fine with no IO problems (plenty of space left - we don't use a huge amount of space).

We then zap the disks and recreate the OSDs. They get added back into Ceph and recovery happens again, but the cluster then starts to report "slow requests are blocked > 32 sec (REQUEST_SLOW)".
If I go into a virtual machine, I see problems doing anything that involves the hard drive, like installing a package.
This carries on, with more and more requests becoming blocked.

The only way to fix this is either to stop one of the 6 OSDs (it doesn't matter which one; tested with multiple) - then everything is fine - wait until the rebalance has finished and then add that disk back in, or to reweight one of the 6 OSDs (again, it doesn't matter which one; tested with multiple) to 0.1, wait until it is done, and then re-add it.
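To make the workaround concrete, the reweight variant is roughly the following (a sketch - osd.12 stands in for whichever OSD is picked, and the final weight is whatever ceph osd tree showed before):

Code:
# temporarily lower the CRUSH weight of one of the six re-added OSDs
ceph osd crush reweight osd.12 0.1

# watch recovery until the cluster is healthy again
ceph -s

# then restore the original CRUSH weight (value from 'ceph osd tree')
ceph osd crush reweight osd.12 0.87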

I have tried this on 2 of the 4 nodes and played with multiple OSDs. The problem is repeatable over and over again. It doesn't matter whether we change the backfill settings to make recovery faster or leave them at the defaults. We run SNMP monitoring on the 10 Gbps switches, so we can see the traffic is absolutely fine (when we first remove the disks, the 10 Gbps network is extremely fast).

This feels like a bug in ceph...

The question I have is: why do the placement groups recover with no problems when the disks are removed (even if we raise the backfill and max recovery settings and it runs at speeds of 700 MB/s), but we get problems when adding them back in?

Any help would be appreciated.



Thanks

Nick
 
6 OSDs (all SSD)
How are the OSDs connected (RAID, HBA)? What SSDs are these?

10 Gbps networking.
1 VLAN for networking
1 VLAN for Ceph networking
This is a bad idea, because the Ceph traffic will max out the 10GbE (especially on recovery), which in turn will interfere with the client traffic, backup traffic and corosync. In the worst case (with HA activated), the nodes fence themselves and reset (the cluster dies).

The question I have is: why do the placement groups recover with no problems when the disks are removed (even if we raise the backfill and max recovery settings and it runs at speeds of 700 MB/s), but we get problems when adding them back in?
The three remaining nodes can cope with the traffic on recovery. The other way around, three nodes send data to one. You will see in the Ceph logs which OSDs are involved when the slow requests appear.

As a very simplified example: on node failure, 6x 100 GB needs to be redistributed by the three other nodes, i.e. 200 GB per node. When adding a node back, suddenly 600 GB has to be written to one node alone (although it can be read from 3 nodes simultaneously).

This problem will appear more frequently as more load is put on the cluster.
 
"How are the OSDs connected (RAID, HBA)? What SSDs are these?"

The Dells are connected via a RAID card (H740P) - it's the only way to configure them. The 2 OS drives are RAID1, and the 6 OSD drives are each configured as JBOD.

-----

"This is a bad idea, because the ceph traffic will max out the 10GbE (esp. on recovery), this in turn will interfere with the client, backup traffic and the corosync. So in worst case (when HA activated), the nodes fence themselves and reset (the cluster died)."



Apologies - I should have given more info about this config. Each node has 4x 10 Gbps fibre connections: 2 are used for Ceph and 2 for networking, configured as bond0 for networking (used by vmbr0) and bond1 for Ceph. So if the SSDs max out the 20 Gbps LACP bond for Ceph, the regular network traffic is on different ports.

--------

"The three remaining nodes can cope with the traffic on recovery. The other way around, three nodes send data to one. You will see in the ceph logs which OSDs are involved, when the slow requests appear.

As very simplified example: On node failure, 6x 100GB need to be distributed by three other nodes, is 200 GB per Node. While adding a Node, suddenly 600GB have to be distributed to one node alone (can be read from 3 nodes simultaneously).

This problem will appear more frequent, as more load will be on the cluster."



This makes sense. What doesn't make sense to me is that the recovery defaults don't let the cluster recover at full speed, and the problem happens even if the cluster is only recovering to one node at 25 MB/s. What commands or debug commands can give me more info? In ceph.log I'm only seeing the slow request and backfilling information, rather than which OSD is having a problem.

Another question if possible - is there a standard practice for adding disks to Ceph? Like adding with a specific weight, or not too many at a time, etc.?
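To illustrate what I mean, something like this (just my guess at a "gentle add" pattern, e.g. combined with osd crush initial weight = 0 in ceph.conf - not something we have tested):

Code:
# raise the CRUSH weight of a new OSD in steps, letting recovery settle in between
ceph osd crush reweight osd.12 0.3
ceph osd crush reweight osd.12 0.6
ceph osd crush reweight osd.12 0.87   # the disk's full weight from 'ceph osd tree'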




Cheers
 
The Dells are connected via a RAID card (H740P) - it's the only way to configure them. The 2 OS drives are RAID1, and the 6 OSD drives are each configured as JBOD.
This can already be the reason why the Ceph cluster shows slow requests. Ceph needs to know about its OSDs/disks, and a RAID controller masks or even lies about the disks. Most controllers leave the cache and its algorithms in place, even when used as JBOD. Use HBAs for Ceph (this counts for ZFS too).

E.g. Ceph tries to write to an OSD disk, the RAID controller cache fills up, and then the next write to a different OSD disk on that controller is blocked because the cache is still full and Ceph doesn't know about it. With an HBA, Ceph can throttle its writes, so blocked requests are less likely to happen (though other causes certainly exist).

What SSDs are you using (Model)?

Apologies - I should have given more info about this config. Each node has 4x 10 Gbps fibre connections: 2 are used for Ceph and 2 for networking, configured as bond0 for networking (used by vmbr0) and bond1 for Ceph. So if the SSDs max out the 20 Gbps LACP bond for Ceph, the regular network traffic is on different ports.
This looks better, but it still leaves the corosync traffic on a shared port. What I have written in the post above still holds true. I recommend that you use a dedicated physical network link for corosync and additionally add a second corosync ring for redundancy.
https://pve.proxmox.com/pve-docs/chapter-pvecm.html#_cluster_network
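As a rough sketch, a second ring in /etc/pve/corosync.conf looks like this (addresses are placeholders, see the linked chapter for the full procedure):

Code:
totem {
  ...
  rrp_mode: passive
  interface {
    ringnumber: 0
    bindnetaddr: 10.10.10.0
  }
  interface {
    ringnumber: 1
    bindnetaddr: 10.10.20.0
  }
}

nodelist {
  node {
    name: staging-proxmox-l01
    ring0_addr: 10.10.10.11
    ring1_addr: 10.10.20.11
    ...
  }
  ...
}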

For Ceph, it is possible (depending also on the hash algorithm used) that by chance only one 10GbE link in bond1 is used for traffic. But this is more a point for optimization than for problem solving.

This makes sense. What doesn't make sense to me is that the recovery defaults don't let the cluster recover at full speed, and the problem happens even if the cluster is only recovering to one node at 25 MB/s. What commands or debug commands can give me more info? In ceph.log I'm only seeing the slow request and backfilling information, rather than which OSD is having a problem.
Every OSD/MON/MGR has its own log, and you will find the information in them. It needs a little bit of correlation to find the culprit. With the additional information on the RAID controller, it may well already be that problem.
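As a concrete starting point (assuming the default log locations), something along these lines shows which OSDs and peers are involved:

Code:
# search all OSD logs on a node for slow/blocked requests
grep -i "slow request" /var/log/ceph/ceph-osd.*.log

# ask a specific OSD (run on the host where it lives) about its slowest recent ops
ceph daemon osd.3 dump_historic_ops
ceph daemon osd.3 dump_ops_in_flight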
 
"This can be already the reason, why the ceph cluster shows slow requests. Ceph needs to know about its OSDs/disks. A raid controller masks or even lies about the disks. Most controller leave the cache and its algorithms in place, even when used as JBOD. Use HBAs for ceph (counts even for zfs).

E.g. Ceph tries to write to an OSD disk, the RAID controller cache fills up, and then the next write to a different OSD disk on that controller is blocked because the cache is still full and Ceph doesn't know about it. With an HBA, Ceph can throttle its writes, so blocked requests are less likely to happen (though other causes certainly exist).

What SSDs are you using (Model)?

The SSDs are Samsung MZ7KM960HMJP0D3 enterprise drives.

------------------------

"This looks better but still leaves the corosync traffic on a shared port. What I have written in the post above still holds true. I recommend that you use a dedicated physical network link for corosync and additional add a second corosync ring for redundancy.
https://pve.proxmox.com/pve-docs/chapter-pvecm.html#_cluster_network

For Ceph, it is possible (depending also on the hash algorithm used) that by chance only one 10GbE link in bond1 is used for traffic. But this is more a point for optimization than for problem solving."



I agree a physical link for corosync is better, but we don't have RJ45 Ethernet ports, and using 10 Gbps just for corosync is overkill.
We have switch redundancy with MLAG, and our network traffic is not a bottleneck by far, so this satisfies our requirements.

The network config originally was only using 1 link on bond1, but we have changed that to use layer 3+4 hashing, so it is now balancing correctly.
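For completeness, the relevant part of our /etc/network/interfaces now looks roughly like this (interface names and address are placeholders):

Code:
auto bond1
iface bond1 inet static
    address 10.10.10.11
    netmask 255.255.255.0
    bond-slaves ens2f0 ens2f1
    bond-mode 802.3ad
    bond-miimon 100
    bond-xmit-hash-policy layer3+4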

------------

Every OSD/MON/MGR has its own log, and you will find the information in them. It needs a little bit of correlation to find the culprit. With the additional information on the RAID controller, it may well already be that problem.



Going through the logs now to see if they give more detail on when the blocked requests happen.

When running "ceph health detail" after the 6 OSDs (osd.12 - osd.17) are put back in, I get these blocked requests:

REQUEST_SLOW 16 slow requests are blocked > 32 sec
1 ops are blocked > 131.072 sec
12 ops are blocked > 65.536 sec
3 ops are blocked > 32.768 sec
osds 7,9,21 have blocked requests > 65.536 sec
osd.3 has blocked requests > 131.072 sec

So it seems I have a read issue rather than a write issue, or writes are being blocked because of reads. Still investigating.

Any insight would be great

Cheers

Nick
 
The SSDs are Samsung MZ7KM960HMJP0D3 enterprise drives.
Samsung SM863a - they have good performance for their price.

When running "ceph health detail" after the 6 OSDs (osd.12 - osd.17) are put back in, I get these blocked requests:

REQUEST_SLOW 16 slow requests are blocked > 32 sec
1 ops are blocked > 131.072 sec
12 ops are blocked > 65.536 sec
3 ops are blocked > 32.768 sec
osds 7,9,21 have blocked requests > 65.536 sec
osd.3 has blocked requests > 131.072 sec

So it seems I have a read issue rather than a write issue, or writes are being blocked because of reads. Still investigating.
The logs are a little bit tricky to read. You should see that the named OSDs are waiting on their peers to finish an operation. I suppose those OSDs are from the node added to the cluster.

Code:
#: ceph osd perf
osd commit_latency(ms) apply_latency(ms)
  3                 13                13
  1                  0                 0
  0                  0                 0
With this command you can see the latency of the OSDs in the cluster. There you might see big delays, going into seconds. That would be a good indication that the OSDs are starving and Ceph is not able to throttle.
 
Samsung SM863a - they have good performance for their price.


The logs are a little bit tricky to read. You should see that the named OSDs are waiting on their peers to finish an operation. I suppose those OSDs are from the node added to the cluster.

Code:
#: ceph osd perf
osd commit_latency(ms) apply_latency(ms)
  3                 13                13
  1                  0                 0
  0                  0                 0
With this command you can see the latency of the OSDs in the cluster. There you might see big delays, going into seconds. That would be a good indication that the OSDs are starving and Ceph is not able to throttle.



I appreciate the reply - I have just removed the disks (osd.12 to osd.17) again from the 4th node, zapped them, unmounted the RocksDB partition and then recreated them with pveceph createosd.
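For reference, the zap/recreate cycle was along these lines (a sketch - the device name is a placeholder):

Code:
# wipe one of the former OSD disks
ceph-disk zap /dev/sdc

# recreate the OSD through Proxmox
pveceph createosd /dev/sdc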

Within 5 seconds of recovery starting I get blocked requests, and the OSDs with the blocked requests are not any of the newly added disks (12-17):

  cluster:
    id:     8899df0e-0b20-4360-b8c5-44d3f0724489
    health: HEALTH_WARN
            3323/833691 objects misplaced (0.399%)
            Reduced data availability: 41 pgs inactive
            Degraded data redundancy: 211932/833691 objects degraded (25.421%), 783 pgs degraded, 780 pgs undersized
            22 slow requests are blocked > 32 sec

  services:
    mon: 4 daemons, quorum staging-proxmox-l01,staging-proxmox-l02,staging-proxmox-l03,staging-proxmox-l04
    mgr: staging-proxmox-l01(active), standbys: staging-proxmox-l02, staging-proxmox-l03, staging-proxmox-l04
    osd: 24 osds: 24 up, 24 in; 792 remapped pgs

  data:
    pools:   1 pools, 1024 pgs
    objects: 271k objects, 1080 GB
    usage:   2628 GB used, 18819 GB / 21447 GB avail
    pgs:     4.004% pgs not active
             211932/833691 objects degraded (25.421%)
             3323/833691 objects misplaced (0.399%)
             739 active+undersized+degraded+remapped+backfill_wait
             229 active+clean
             37  activating+undersized+degraded+remapped
             7   active+remapped+backfill_wait
             4   active+undersized+degraded+remapped+backfilling
             4   activating+remapped
             3   active+recovery_wait+degraded
             1   active+remapped+backfilling

  io:
    client:   13175 B/s wr, 0 op/s rd, 0 op/s wr
    recovery: 50085 kB/s, 12 objects/s

osd commit_latency(ms) apply_latency(ms)
 17                  0                 0
 16                  4                 4
 15                  4                 4
 14                  4                 4
 13                  5                 5
 12                  3                 3
 23                  0                 0
 22                  0                 0
 21                  0                 0
 20                  0                 0
 19                  0                 0
 18                  0                 0
  9                  0                 0
  8                  0                 0
  7                  0                 0
  6                  0                 0
  5                  0                 0
  4                  0                 0
  0                  0                 0
  1                  0                 0
  2                  0                 0
  3                  0                 0
 10                  0                 0
 11                  0                 0


REQUEST_SLOW 23 slow requests are blocked > 32 sec
9 ops are blocked > 131.072 sec
11 ops are blocked > 65.536 sec
3 ops are blocked > 32.768 sec
osd.0 has blocked requests > 65.536 sec
osds 5,9,21 have blocked requests > 131.072 sec


I'm confused. I agree that Ceph can get upset with non-passthrough RAID controllers, but this doesn't seem like that is the issue. If it was, wouldn't I see huge spikes in latency? The most I see is 10 ms...

I'm running watch -n1 'ceph osd perf'.
 
Could you maybe post the ceph.log and logs from two OSDs involved (eg. osd.12 & osd.5; different hosts)?

I'm confused. I agree that Ceph can get upset with non-passthrough RAID controllers, but this doesn't seem like that is the issue. If it was, wouldn't I see huge spikes in latency? The most I see is 10 ms...
On filestore backends I would say definitely yes. With these OSDs it could still be sitting in RAM, so it might not show up as a spike. The only good way to test is to replace the RAID controller with an HBA, or maybe it is possible to set it to IT mode.
 
Apologies, I have not been able to post those logs. Paternity leave and all that.

I have left my test case in staging for all this time, and just for anyone else having this issue: upgrading to the latest 12.2.7 has solved it.
I can no longer reproduce the problem when zapping 6 OSDs on a host and adding them back in again, as long as I'm on the newest version.

Setting ceph tell 'osd.*' injectargs '--osd-recovery-max-active 4'
and
ceph tell 'osd.*' injectargs '--osd-max-backfills 16'
recovery runs at over 5 Gbps - and we still have no issues.
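Once recovery has finished, the values can be dropped back down again; as far as I know the Luminous defaults are 1 and 3 respectively (worth double-checking for your version):

Code:
ceph tell 'osd.*' injectargs '--osd-max-backfills 1'
ceph tell 'osd.*' injectargs '--osd-recovery-max-active 3'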

Not sure what exactly was fixed, but it is looking good so far.
 
To add to this thread in case it is useful for someone else.

Using the above hardware for the past 5-6 months has been flawless - it's a brilliant system.

Another useful piece of info for Alwin: Dell has finally seen the light, and I see that they have a new firmware update for the H740P RAID controller.

"PERC H740P Mini/ H740P Adapter/ H840 Adapter RAID Controllers firmware version 50.5.0-1750"

Enhancements:

1. Enhanced HBA mode (eHBA) added to support non-RAID disks on H740P PERC controllers.

Specifically, this is for use with SDS architectures.
 
