Hi there, we are running Proxmox in our staging environment on new hardware with 4 nodes, namely:
Dell R640 - dual 3.5 GHz 8-core CPUs
192 GB RAM
6 OSDs (all SSD)
10 Gbps networking
1 VLAN for general networking
1 VLAN for Ceph networking
Software:
Proxmox 5.2, latest kernel 4.15.17-2-pve
Ceph latest from the Proxmox repos: ceph version 12.2.5 (dfcb7b53b2e4fcd2a5af0240d4975adc711ab96e) luminous (stable)
Ceph Monitors are installed on all 4 nodes.
We have a 4-node Dell config with a replication size of 3 and a min_size of 2 for IO.
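For reference, the pools are set up roughly like this (<pool> is a placeholder for our actual pool name):

  ceph osd pool set <pool> size 3       # keep 3 replicas of every object
  ceph osd pool set <pool> min_size 2   # still serve IO with only 2 replicas up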
We use the Proxmox defaults for the crushmap; we just change the device class from hdd to ssd.
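That change is just the standard device-class commands run per OSD (osd.N stands in for each of ours):

  ceph osd crush rm-device-class osd.N        # clear the detected class
  ceph osd crush set-device-class ssd osd.N   # re-tag the OSD as ssd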
Since install we have been using Ceph, and the short answer is: it's really good. Love it, in fact. Self-healing etc. is brilliant, and failover works as long as you don't use grsec and the like.
We have created virtual machines that are currently running many different services.
Now...
If we destroy one node (6x OSDs) for testing - since we want to roll this out to live - the cluster recovers fine with no IO problems (plenty of space left; we don't use a huge amount of it).
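To be clear about what "destroy" means here, per OSD on that node we do roughly the equivalent of (N is each OSD's ID):

  ceph osd out N                            # stop mapping new data to it
  systemctl stop ceph-osd@N                 # stop the daemon
  ceph osd purge N --yes-i-really-mean-it   # remove it from the crush map, auth and osd map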
We then zap the disks and recreate the OSDs. They add back into Ceph and recovery happens again. The cluster then starts reporting "slow requests are blocked > 32 sec (REQUEST_SLOW)".
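The zap/recreate step is roughly this for each disk (/dev/sdX is a placeholder for the actual device):

  ceph-disk zap /dev/sdX       # wipe the old OSD metadata off the disk
  pveceph createosd /dev/sdX   # recreate the OSD through Proxmox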
If I go into a virtual machine, I see problems doing anything that involves disk work, like installing a package.
This carries on - more and more requests become blocked.
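We watch this happen with the usual status commands:

  ceph health detail   # shows the REQUEST_SLOW warnings and the OSDs involved
  ceph osd perf        # per-OSD commit/apply latency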
The only way to fix this is one of two things. Either stop one disk out of the 6 (doesn't matter which one - tested multiple) and everything is fine; wait till the rebalance has finished and then add that disk back in. Or reweight one OSD out of the 6 (also doesn't matter which one - tested multiple) to 0.1, wait till it's done, then re-add it at full weight.
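The reweight workaround is just (N being whichever of the six OSDs we pick):

  ceph osd reweight N 0.1   # drain most placement groups off that OSD
  # wait for the rebalance to finish, then
  ceph osd reweight N 1.0   # bring it back to full weight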
I have tried this on 2 nodes out of the 4 and played with multiple OSDs. The problem is repeatable over and over again. It doesn't matter if we change the backfill settings to make recovery faster or leave them on defaults. We run SNMP monitoring on the 10 Gbps switches, so we can see the traffic is absolutely fine (when we first remove the disks, the 10 Gbps network is extremely fast).
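By changing the backfill settings I mean tuning along these lines (the values are examples of what we tried, not a recommendation):

  ceph tell osd.* injectargs '--osd-max-backfills 4 --osd-recovery-max-active 8'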
This feels like a bug in Ceph...
The question I have is: why do the placement groups recover with no problems when the disks are removed (even when we raise the backfill and max recovery settings and see speeds of 700 MB/s), but we get problems when adding them back in?
Any help would be appreciated.
Thanks
Nick