[SOLVED] Ceph timeouts and lost disks

Discussion in 'Proxmox VE: Installation and configuration' started by kaltsi, Jun 26, 2018.

  1. kaltsi

    kaltsi Member

    Joined:
    Mar 22, 2013
    Messages:
    32
    Likes Received:
    2
    Hi all,
    I often get "timeout" errors from Ceph.

    Some VMs lost their disks and cannot boot from them anymore.

    I haven't analyzed the logs yet, but I'll update this post.

    Add:
    The Ceph log file contains a lot of entries like this:
    check update: 59 slow requests are blocked > 32 sec
     
    #1 kaltsi, Jun 26, 2018
    Last edited: Jun 26, 2018
  2. Alwin

    Alwin Proxmox Staff Member
    Staff Member

    Joined:
    Aug 1, 2017
    Messages:
    1,571
    Likes Received:
    138
    Elaborate more on your Ceph setup: hardware, configuration, status. For now, I can only say that you have pending requests from Ceph clients; these have been in the queue for more than 32 seconds already.
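
    The usual commands to gather that overview on a PVE node would be (a sketch; the output of all of them helps):
    Code:
    pveversion -v        # PVE package versions
    ceph -s              # overall Ceph status
    ceph osd tree        # OSD layout per host
    ceph health detail   # shows which OSDs currently report slow requests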
     
  3. kaltsi

    kaltsi Member

    Joined:
    Mar 22, 2013
    Messages:
    32
    Likes Received:
    2
    Thanks for your reply.

    The system has a 10G optical network, therefore I don't think there is a network issue.
    The average speed between the nodes is 418.4 MB/s (including disk IO).
    But I don't know where the problem may be.

    Could you help me with how I can use this to get more details about these errors?
    https://github.com/ceph/ceph/pull/21750
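    For example, is the OSD admin socket the right place to look? A sketch of what I have in mind (osd.0 is just an example):
    Code:
    # run on the node that hosts the OSD; osd.0 is just an example
    ceph daemon osd.0 dump_blocked_ops    # requests currently blocked > 32 sec
    ceph daemon osd.0 dump_historic_ops   # slowest recently completed operations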

    And I have another question: which is better, lots of smaller disks or a few bigger disks?
     
    #3 kaltsi, Jun 28, 2018
    Last edited: Jun 28, 2018
  4. Alwin

    Alwin Proxmox Staff Member
    Staff Member

    Joined:
    Aug 1, 2017
    Messages:
    1,571
    Likes Received:
    138
  5. kaltsi

    kaltsi Member

    Joined:
    Mar 22, 2013
    Messages:
    32
    Likes Received:
    2
    We have 4 identical nodes.
    CPU(s) 48 x Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz (2 Sockets)
    Mem: 256GB/node

    Code:
    root@vCNT-host-4:~# ceph -s
      cluster:
        id:     fb926dd9-17b9-42fb-88d6-27f4944fd554
        health: HEALTH_OK
    
      services:
        mon: 4 daemons, quorum vCNT-host-1,vCNT-host-2,vCNT-host-3,vCNT-host-4
        mgr: vCNT-host-3(active), standbys: vCNT-host-4, vCNT-host-1, vCNT-host-2
        osd: 16 osds: 16 up, 16 in
    
      data:
        pools:   1 pools, 512 pgs
        objects: 1102k objects, 3339 GB
        usage:   8299 GB used, 4994 GB / 13294 GB avail
        pgs:     512 active+clean
    
      io:
        client:   15524 kB/s rd, 6982 kB/s wr, 765 op/s rd, 267 op/s wr
    
    Code:
    create linked clone of drive sata0 (vCNT-Storage_vm:base-279-disk-1)
    trying to aquire cfs lock 'storage-vCNT-Storage_vm' ...
    trying to aquire cfs lock 'storage-vCNT-Storage_vm' ...
    trying to aquire cfs lock 'storage-vCNT-Storage_vm' ...
    trying to aquire cfs lock 'storage-vCNT-Storage_vm' ...
    trying to aquire cfs lock 'storage-vCNT-Storage_vm' ...
    trying to aquire cfs lock 'storage-vCNT-Storage_vm' ...
    trying to aquire cfs lock 'storage-vCNT-Storage_vm' ...
    trying to aquire cfs lock 'storage-vCNT-Storage_vm' ...
    trying to aquire cfs lock 'storage-vCNT-Storage_vm' ...
    TASK ERROR: clone failed: error with cfs lock 'storage-vCNT-Storage_vm': got lock request timeout
    The Ceph status was "overall HEALTH_OK" at that time.
    In the syslog:
    Code:
    Jun 28 15:07:52 vCNT-host-4 ceph-osd[10770]: 2018-06-28 15:07:52.761655 7ff73acb5700 -1 osd.13 pg_epoch: 1421 pg[2.174( v 1421'355128 (1421'353606,1421'355128] local-lis/les=938/939 n=2187 ec=99/72 lis/c 938/938 les/c/f 939/939/0 105/938/938) [1,13,8] r=1 lpr=938 luod=0'0 crt=1421'355128 lcod 1421'355127 active] _scan_snaps no head for 2:2eae9ed9:::rbd_data.1008782eb141f2.000000000000412b:155 (have MIN)
    
    Thanks for the URLs you sent, I'll read them.
     
  6. Alwin

    Alwin Proxmox Staff Member
    Staff Member

    Joined:
    Aug 1, 2017
    Messages:
    1,571
    Likes Received:
    138
    Is access to the storage working? Does the PVE cluster have quorum?

    More than 3 MONs/MGRs are not needed; only when you go into the thousands of Ceph nodes/clients may additional MONs be required. At least two MGRs, in case one dies.
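
    If you want to drop the extra MON, that can be done with pveceph; a sketch (verify the exact syntax with 'pveceph help destroymon' on your version):
    Code:
    # a sketch; vCNT-host-2 is only an example, use the monitor you want to remove
    pveceph destroymon vCNT-host-2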
     
  7. kaltsi

    kaltsi Member

    Joined:
    Mar 22, 2013
    Messages:
    32
    Likes Received:
    2
    Yes, the cluster is working well.
    Code:
    root@vCNT-host-4:~# pvecm status
    Quorum information
    ------------------
    Date:             Thu Jun 28 17:25:36 2018
    Quorum provider:  corosync_votequorum
    Nodes:            4
    Node ID:          0x00000002
    Ring ID:          1/436
    Quorate:          Yes
    
    Votequorum information
    ----------------------
    Expected votes:   4
    Highest expected: 4
    Total votes:      4
    Quorum:           3
    Flags:            Quorate
    
    Membership information
    ----------------------
        Nodeid      Votes Name
    0x00000001          1 10.102.166.130
    0x00000004          1 10.102.166.131
    0x00000003          1 10.102.166.132
    0x00000002          1 10.102.166.133 (local)
    
    Access to the storage usually works, but sometimes there is a timeout.
    The next try usually succeeds.
    And some VMs lost their disk data (the disk was not bootable anymore), but after a rollback they work again.

    For example, I converted a VM to a template and there was a timeout while the system converted the disk from the "vm" to the "base" type. The VM configuration became a template, but the disk stayed in vm format. In this situation the "template: 1" line has to be removed from the VM configuration by hand, and the next try usually works.
     
  8. Alwin

    Alwin Proxmox Staff Member
    Staff Member

    Joined:
    Aug 1, 2017
    Messages:
    1,571
    Likes Received:
    138
    How is the Ceph health then? I assume the slow requests appear.

    What does a 'rados -p <POOL> bench 60 write' give you?

    Are there orphan images on the rbd pool?
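
    For the orphan check, a simple sketch would be to compare the pool contents against the VM configs, e.g.:
    Code:
    # list all images in the pool, then look up a given image name in the PVE configs
    rbd -p <POOL> ls -l
    grep -r 'vm-<VMID>-disk' /etc/pve/nodes/*/qemu-server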

    Hardware details are still missing, like the disk models and whether RAID or HBA is used.
     
  9. kaltsi

    kaltsi Member

    Joined:
    Mar 22, 2013
    Messages:
    32
    Likes Received:
    2
    It is an HP Apollo server. The hosts have a Smart Array P440 RAID controller, but all disks are in RAID 0 for Ceph and RAID 1 for the OS.
    The Ceph health is usually HEALTH_OK, but there are a lot of slow request events in the Ceph log file.
    For example:
    Code:
    cluster [WRN] Health check failed: 6 slow requests are blocked > 32 sec (REQUEST_SLOW)
    cluster [WRN] Health check update: 5 slow requests are blocked > 32 sec (REQUEST_SLOW)
    cluster [WRN] Health check cleared: REQUEST_SLOW (was: 5 slow requests are blocked > 32 sec)
    cluster [INF]Cluster is now healthy
    
    Code:
    root@vCNT-host-4:~# rados -p vCNT-Storage bench 60 write
    hints = 1
    Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304 for up to 60 seconds or 0 objects
    Object prefix: benchmark_data_vCNT-host-4_3583797
      sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
        0       0         0         0         0         0           -           0
        1      16       137       121   483.963       484   0.0160639   0.0582168
        2      16       137       121   241.972         0           -   0.0582168
        3      16       137       121   161.316         0           -   0.0582168
        4      16       137       121   120.987         0           -   0.0582168
        5      16       161       145   115.987        24   0.0161211    0.100628
        6      16       245       229    152.65       336   0.0177876    0.261065
        7      16       377       361   206.263       528   0.0177685    0.284036
        8      16       377       361    180.48         0           -    0.284036
        9      16       377       361   160.427         0           -    0.284036
       10      16       377       361   144.384         0           -    0.284036
       11      16       377       361   131.258         0           -    0.284036
       12      16       437       421   140.318        48   0.0363651    0.360888
       13      16       562       546   167.982       500   0.0134725    0.284943
       14      16       562       546   155.983         0           -    0.284943
       15      16       649       633   168.781       174   0.0132016    0.344774
       16      16       649       633   158.233         0           -    0.344774
       17      16       649       633   148.925         0           -    0.344774
       18      16       654       638   141.763   6.66667   0.0232407    0.365441
       19      16       703       687   144.616       196   0.0152398    0.352621
    2018-06-29 09:26:59.014971 min lat: 0.0130459 max lat: 8.11459 avg lat: 0.352621
      sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
       20      16       703       687   137.385         0           -    0.352621
       21      16       723       707   134.652        40   0.0238429    0.387419
       22      16       801       785   142.712       312   0.0166144    0.351341
       23      16       831       815   141.724       120   0.0153901    0.339042
       24      16       903       887   147.817       288   0.0154528    0.401815
       25      16       909       893   142.864        24   0.0134731    0.399216
       26      16       909       893    137.37         0           -    0.399216
       27      16       929       913   135.245        40   0.0151433    0.407127
       28      16       929       913   130.415         0           -    0.407127
       29      16       930       914   126.056         2     5.29795    0.412478
       30      16      1016      1000   133.319       344   0.0126818    0.391834
       31      16      1016      1000   129.019         0           -    0.391834
       32      16      1035      1019   127.362        38   0.0163367    0.409573
       33      16      1035      1019   123.502         0           -    0.409573
       34      16      1042      1026   120.693        14   0.0167332     0.41071
       35      16      1042      1026   117.245         0           -     0.41071
       36      16      1113      1097   121.876       142   0.0153936    0.418399
       37      16      1113      1097   118.583         0           -    0.418399
       38      16      1113      1097   115.462         0           -    0.418399
       39      16      1146      1130   115.886        44   0.0129354    0.411341
    2018-06-29 09:27:19.016944 min lat: 0.0126818 max lat: 16.8512 avg lat: 0.414792
      sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
       40      16      1172      1156   115.588       104   0.0224196    0.414792
       41      16      1205      1189   115.988       132    0.121636    0.404011
       42      16      1205      1189   113.226         0           -    0.404011
       43      16      1244      1228   114.221        78   0.0140761    0.425268
       44      16      1244      1228   111.625         0           -    0.425268
       45      16      1358      1342   119.277       228   0.0169134    0.412128
       46      16      1591      1575   136.943       932   0.0804751    0.466864
       47      16      1800      1784   151.814       836   0.0211982    0.420438
       48      16      1940      1924   160.317       560   0.0529476    0.398188
       49      16      2112      2096   171.084       688   0.0412349    0.373047
       50      16      2188      2172   173.742       304   0.0255911     0.36615
       51      16      2257      2241   175.747       276    0.022596    0.362043
       52      16      2329      2313   177.905       288   0.0273724    0.357763
       53      16      2380      2364   178.397       204    0.361083    0.356179
       54      16      2449      2433   180.204       276     0.35389    0.353385
       55      16      2487      2471    179.69       152   0.0724898    0.352563
       56      16      2533      2517   179.767       184    0.630852    0.353883
       57      16      2610      2594   182.016       308   0.0455911    0.350184
       58      16      2728      2712   187.015       472   0.0278062    0.340587
       59      16      2788      2772   187.912       240   0.0206987    0.338436
    2018-06-29 09:27:39.019146 min lat: 0.0126818 max lat: 30.0243 avg lat: 0.339645
      sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
       60      16      2821      2805    186.98       132     1.32027    0.339645
    Total time run:         60.819601
    Total writes made:      2822
    Write size:             4194304
    Object size:            4194304
    Bandwidth (MB/sec):     185.598
    Stddev Bandwidth:       219.711
    Max bandwidth (MB/sec): 932
    Min bandwidth (MB/sec): 0
    Average IOPS:           46
    Stddev IOPS:            54
    Max IOPS:               233
    Min IOPS:               0
    Average Latency(s):     0.344473
    Stddev Latency(s):      1.58766
    Max latency(s):         30.0243
    Min latency(s):         0.0126818
    Cleaning up (deleting benchmark objects)
    Removed 2822 objects
    Clean up completed and total clean up time :12.508660
    

    I found some interesting disks.
    For example:
    VM-170
    Code:
    root@vCNT-host-1:~# rbd ls vCNT-Storage | grep 170
    vm-170-disk-1
    vm-170-disk-2
    vm-170-disk-3
    
    Code:
    root@vCNT-host-1:~# cat /etc/pve/nodes/vCNT-host-2/qemu-server/170.conf
    balloon: 4096
    bootdisk: sata0
    cores: 2
    memory: 6144
    name: RobertH-TNL-v3.16-1-FED01
    net0: e1000=3E:4A:AB:E3:6A:AE,bridge=vmbr1,tag=3908
    net1: e1000=BA:95:70:35:68:BC,bridge=vmbr1,tag=3908
    net2: e1000=8E:18:93:8F:15:1A,bridge=vmbr1,tag=3908
    net3: e1000=6A:86:6E:0C:C2:42,bridge=vmbr1,tag=3908
    numa: 0
    ostype: l26
    parent: Init_snapshot
    sata0: vCNT-Storage_vm:base-791-disk-1/vm-170-disk-3,size=70G
    smbios1: uuid=17229829-3cda-467b-8d92-84204b61677f
    sockets: 1
    
    [Init_snapshot]
    #Init_comment
    balloon: 4096
    bootdisk: sata0
    cores: 2
    memory: 6144
    name: RobertH-TNL-v3.16-1-FED01
    net0: e1000=3E:4A:AB:E3:6A:AE,bridge=vmbr1,tag=3908
    net1: e1000=BA:95:70:35:68:BC,bridge=vmbr1,tag=3908
    net2: e1000=8E:18:93:8F:15:1A,bridge=vmbr1,tag=3908
    net3: e1000=6A:86:6E:0C:C2:42,bridge=vmbr1,tag=3908
    numa: 0
    ostype: l26
    sata0: vCNT-Storage_vm:base-791-disk-1/vm-170-disk-3,size=70G
    smbios1: uuid=17229829-3cda-467b-8d92-84204b61677f
    snaptime: 1529666334
    sockets: 1
    
     
  10. Alwin

    Alwin Proxmox Staff Member
    Staff Member

    Joined:
    Aug 1, 2017
    Messages:
    1,571
    Likes Received:
    138
    Put the controller into HBA (IT-mode).
    Code:
       4      16       137       121   120.987         0           -   0.0582168
       5      16       161       145   115.987        24   0.0161211    0.100628
       6      16       245       229    152.65       336   0.0177876    0.261065
       7      16       377       361   206.263       528   0.0177685    0.284036
       8      16       377       361    180.48         0           -    0.284036
       9      16       377       361   160.427         0           -    0.284036
    
    This is an indication that the OSDs are starving, as the RAID controller interferes with the writes.

    In general, avoid RAID and use an HBA with Ceph (or ZFS).
     
  11. kaltsi

    kaltsi Member

    Joined:
    Mar 22, 2013
    Messages:
    32
    Likes Received:
    2
    Thanks for your hints.

    I reconfigured the whole system.
    All RAID controllers now use HBA mode.

    But it does not seem to be any better.

    Code:
    2018-07-06 07:00:00.000157 mon.vCNT-host-1 mon.0 10.102.166.130:6789/0 62593 : cluster [WRN] overall HEALTH_WARN 1 nearfull osd(s); 1 pool(s) nearfull
    2018-07-06 08:00:00.000151 mon.vCNT-host-1 mon.0 10.102.166.130:6789/0 63480 : cluster [WRN] overall HEALTH_WARN 1 nearfull osd(s); 1 pool(s) nearfull
    2018-07-06 09:00:00.000164 mon.vCNT-host-1 mon.0 10.102.166.130:6789/0 64406 : cluster [WRN] overall HEALTH_WARN 1 nearfull osd(s); 1 pool(s) nearfull
    2018-07-06 09:07:04.909589 mon.vCNT-host-1 mon.0 10.102.166.130:6789/0 64505 : cluster [WRN] Health check failed: 1 slow requests are blocked > 32 sec (REQUEST_SLOW)
    2018-07-06 09:07:11.207034 mon.vCNT-host-1 mon.0 10.102.166.130:6789/0 64510 : cluster [INF] Health check cleared: REQUEST_SLOW (was: 1 slow requests are blocked > 32 sec)
    2018-07-06 09:07:25.117456 mon.vCNT-host-1 mon.0 10.102.166.130:6789/0 64511 : cluster [WRN] Health check failed: 2 slow requests are blocked > 32 sec (REQUEST_SLOW)
    2018-07-06 09:07:31.177075 mon.vCNT-host-1 mon.0 10.102.166.130:6789/0 64513 : cluster [WRN] Health check update: 4 slow requests are blocked > 32 sec (REQUEST_SLOW)
    2018-07-06 09:07:41.281063 mon.vCNT-host-1 mon.0 10.102.166.130:6789/0 64517 : cluster [INF] Health check cleared: REQUEST_SLOW (was: 4 slow requests are blocked > 32 sec)
    2018-07-06 09:26:01.027166 mon.vCNT-host-1 mon.0 10.102.166.130:6789/0 65186 : cluster [WRN] Health check failed: 10 slow requests are blocked > 32 sec (REQUEST_SLOW)
    2018-07-06 09:26:10.533493 mon.vCNT-host-1 mon.0 10.102.166.130:6789/0 65192 : cluster [INF] Health check cleared: REQUEST_SLOW (was: 7 slow requests are blocked > 32 sec)
    
    Code:
    root@vCNT-host-4:~# rados -p vCNT-Storage bench 60 write
    hints = 1
    Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304 for up to 60 seconds or 0 objects
    Object prefix: benchmark_data_vCNT-host-4_2695674
      sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
        0       0         0         0         0         0           -           0
        1      16        42        26   103.994       104    0.442838    0.460067
        2      16        63        47   93.9896        84    0.372876    0.520925
        3      16        85        69   91.9913        88    0.310366    0.567248
        4      16       123       107   106.988       152    0.400993    0.556787
        5      16       148       132   105.588       100    0.402739    0.553861
        6      16       170       154   102.655        88    0.639312    0.571264
        7      16       202       186   106.274       128    0.487974    0.574715
        8      16       232       216   107.988       120    0.525585    0.563723
        9      16       261       245   108.876       116    0.479361    0.563683
       10      16       298       282   112.787       148    0.415036    0.558593
       11      16       327       311   113.078       116    0.917769    0.556129
       12      16       350       334    111.32        92    0.412973    0.547002
       13      16       384       368   113.218       136     0.48994    0.552062
       14      16       409       393   112.273       100    0.130571    0.563136
       15      16       436       420   111.987       108    0.334692    0.556066
       16      16       469       453   113.236       132    0.283018    0.552478
       17      16       501       485   114.104       128    0.314325     0.54894
       18      16       533       517   114.876       128     0.37995    0.548436
       19      16       561       545   114.724       112    0.360336    0.544479
    2018-07-06 09:44:53.157006 min lat: 0.128848 max lat: 1.43031 avg lat: 0.54302
      sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
       20      16       592       576   115.187       124    0.182168     0.54302
       21      16       615       599   114.082        92    0.234216    0.551228
       22      16       644       628   114.168       116    0.752318    0.548768
       23      16       670       654   113.726       104    0.683905    0.551409
       24      16       701       685   114.153       124    0.277923    0.555179
       25      16       730       714   114.226       116    0.374056    0.553792
       26      16       759       743   114.294       116    0.220685    0.552543
       27      16       783       767   113.616        96    0.338427    0.552144
       28      16       811       795   113.558       112    0.308912    0.553524
       29      16       843       827   114.055       128    0.138018    0.557699
       30      16       862       846   112.787        76    0.236558      0.5596
       31      16       894       878   113.277       128    0.529105    0.560271
       32      16       916       900   112.487        88    0.368218    0.561754
       33      16       945       929   112.593       116    0.565941    0.563599
       34      16       976       960   112.928       124    0.226921    0.561286
       35      16      1004       988   112.901       112    0.454622     0.56244
       36      16      1033      1017   112.987       116    0.537357    0.563696
       37      16      1053      1037   112.095        80     1.15554    0.565556
       38      16      1079      1063   111.882       104    0.412322    0.566183
       39      16      1108      1092   111.987       116    0.730968    0.567856
    2018-07-06 09:45:13.159330 min lat: 0.127764 max lat: 1.49225 avg lat: 0.569049
      sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
       40      16      1131      1115   111.487        92    0.515967    0.569049
       41      16      1163      1147   111.889       128    0.468317    0.567324
       42      16      1185      1169    111.32        88    0.484569    0.569685
       43      16      1217      1201   111.708       128    0.519618    0.570298
       44      16      1237      1221   110.987        80    0.761683    0.571563
       45      16      1251      1235   109.765        56    0.554703    0.573311
       46      16      1276      1260   109.552       100    0.526327    0.579003
       47      16      1303      1287   109.519       108    0.469738    0.580984
       48      16      1329      1313   109.404       104    0.743499    0.579879
       49      16      1356      1340   109.375       108    0.295312    0.580171
       50      16      1379      1363   109.027        92    0.471967    0.580474
       51      16      1407      1391   109.085       112    0.773703    0.584233
       52      16      1434      1418   109.064       108    0.154409    0.583459
       53      16      1462      1446   109.119       112    0.238878      0.5841
       54      16      1487      1471    108.95       100    0.558933    0.583237
       55      16      1524      1508    109.66       148    0.317236    0.580367
       56      16      1555      1539   109.916       124    0.544948    0.580141
       57      16      1578      1562   109.601        92    0.524091    0.579395
       58      16      1609      1593   109.849       124    0.504469    0.580712
       59      16      1635      1619    109.75       104    0.534606     0.57997
    2018-07-06 09:45:33.161760 min lat: 0.127764 max lat: 1.69057 avg lat: 0.578658
      sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
       60      16      1671      1655    110.32       144    0.873459    0.578658
    Total time run:         60.960629
    Total writes made:      1672
    Write size:             4194304
    Object size:            4194304
    Bandwidth (MB/sec):     109.71
    Stddev Bandwidth:       19.0954
    Max bandwidth (MB/sec): 152
    Min bandwidth (MB/sec): 56
    Average IOPS:           27
    Stddev IOPS:            4
    Max IOPS:               38
    Min IOPS:               14
    Average Latency(s):     0.581926
    Stddev Latency(s):      0.252402
    Max latency(s):         1.69057
    Min latency(s):         0.127764
    Cleaning up (deleting benchmark objects)
    Removed 1672 objects
    Clean up completed and total clean up time :8.814471
    
    Code:
    root@vCNT-host-4:~# ceph -s
      cluster:
        id:     fb926dd9-17b9-42fb-88d6-27f4944fd554
        health: HEALTH_WARN
                1 nearfull osd(s)
                1 pool(s) nearfull
    
      services:
        mon: 3 daemons, quorum vCNT-host-1,vCNT-host-3,vCNT-host-4
        mgr: vCNT-host-4(active), standbys: vCNT-host-3, vCNT-host-1
        osd: 12 osds: 12 up, 12 in
    
      data:
        pools:   1 pools, 512 pgs
        objects: 977k objects, 2881 GB
        usage:   8311 GB used, 2865 GB / 11176 GB avail
        pgs:     511 active+clean
                 1   active+clean+scrubbing+deep
    
      io:
        client:   26351 B/s rd, 1240 kB/s wr, 90 op/s rd, 141 op/s wr
    
    Code:
    root@vCNT-host-4:~# ceph osd tree
    ID CLASS WEIGHT   TYPE NAME              STATUS REWEIGHT PRI-AFF
    -1       10.91510 root default
    -3        2.72878     host vCNT-host-1
     0   hdd  0.90959         osd.0              up  1.00000 1.00000
     1   hdd  0.90959         osd.1              up  1.00000 1.00000
     2   hdd  0.90959         osd.2              up  1.00000 1.00000
    -5        2.72878     host vCNT-host-2
     3   hdd  0.90959         osd.3              up  1.00000 1.00000
     4   hdd  0.90959         osd.4              up  1.00000 1.00000
     5   hdd  0.90959         osd.5              up  1.00000 1.00000
    -7        2.72878     host vCNT-host-3
     8   hdd  0.90959         osd.8              up  1.00000 1.00000
     9   hdd  0.90959         osd.9              up  1.00000 1.00000
    10   hdd  0.90959         osd.10             up  1.00000 1.00000
    -9        2.72878     host vCNT-host-4
    12   hdd  0.90959         osd.12             up  1.00000 1.00000
    13   hdd  0.90959         osd.13             up  1.00000 1.00000
    14   hdd  0.90959         osd.14             up  1.00000 1.00000
    

    Could you please suggest anything else I can do to improve the performance?
     
  12. Datanat

    Datanat New Member
    Proxmox VE Subscriber

    Joined:
    Apr 11, 2018
    Messages:
    16
    Likes Received:
    3
    Hello @kaltsi, it's a tough one.

    Slow requests can be caused by network issues, disk issues, or even controller issues, as stated by @Alwin.

    It looks like you have 12x 7200 RPM SATA drives. Do you use filestore or bluestore?

    In the lab we tested filestore with 12 SATA drives too, and it was not good until we put the journal on SSD.

    And then there is this:

    If the drives are full, the performance drop is catastrophic: SATA drive performance can fall to a few IOPS with indecent latency when near full.

    So assuming a 3/2 pool and 12x 70 nominal IOPS, but here more like 20-30 IOPS per drive with 30+ ms latency, you could end up with the results you have.

    Rule of thumb: (12x30/3/3) = between 27 and 40 large IOPS issued.
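
    A rough reading of that calculation (my assumptions: 3 replicas and roughly 3 backend writes per client write):
    Code:
    # 12 drives at 20-30 effective IOPS each, divided by replicas and backend write overhead
    echo $(( 12 * 20 / 3 / 3 ))   # ~26 large client IOPS at 20 IOPS per drive
    echo $(( 12 * 30 / 3 / 3 ))   # 40 large client IOPS at 30 IOPS per drive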

    Regards,
     
  13. Alwin

    Alwin Proxmox Staff Member
    Staff Member

    Joined:
    Aug 1, 2017
    Messages:
    1,571
    Likes Received:
    138
    I would say the contrary: it got better, as your rados bench now writes at a stable rate. But I agree with @Datanat that you will need faster disks (incl. SSD/NVMe). You also need to consider that you can't fill up your cluster, as you need free space in Ceph if a host fails and its data has to be replicated to the other nodes.
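
    To keep an eye on that, Ceph reports the usage per pool and per OSD:
    Code:
    ceph df           # pool usage vs. raw capacity
    ceph osd df tree  # per-OSD fill level, to spot the nearfull OSD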
     
  14. kaltsi

    kaltsi Member

    Joined:
    Mar 22, 2013
    Messages:
    32
    Likes Received:
    2
    Hi all,

    Thank you for your helpful answers.

    The system has 12x 1TB 7.2k SATA HDDs (4 nodes x 3 disks for Ceph).
    Yes, you are right, the system is better than before. But the disks are very, very slow.
    The system uses bluestore OSDs, therefore I cannot use an SSD for the journal.

    I moved a 10 GB file with scp between two nodes and between a node and a VM.
    Code:
    10GB 306.9MB/s   00:33 from node OS disk to node disk
    10GB  11.5MB/s   14:47 from node OS disk to Ceph VM disk
    10GB  46.4MB/s   03:4 from node OS disk to Disk VM disk
    I can transfer data between two nodes at more than 1 GB/s, so in this case there is no network issue.
    (attached graphs: Network-1.PNG, Network-2.PNG)
    But the network is barely used.

    I looked at some performance reports on the internet. They show 4 times better performance with hardware similar to mine.

    Code:
    avg-cpu:  %user   %nice %system %iowait  %steal   %idle
               4.10    0.00    1.68    0.69    0.00   93.53
    
    Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
    sdb              80.73         3.20         2.91    2895238    2633180
    sdc              82.50         3.14         2.89    2839708    2611683
    sdd              84.38         3.39         3.00    3060710    2705882
    


    It seems I have to find another solution.
     
  15. Alwin

    Alwin Proxmox Staff Member
    Staff Member

    Joined:
    Aug 1, 2017
    Messages:
    1,571
    Likes Received:
    138
    Bluestore allows you, at OSD creation time, to put the DB/WAL on different devices (pveceph help createosd).
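    A sketch of what that could look like (the exact option names may differ between versions; 'pveceph help createosd' lists them):
    Code:
    # a sketch: create an OSD on /dev/sdX with its bluestore DB on a faster /dev/sdY
    pveceph createosd /dev/sdX --journal_dev /dev/sdY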

    Depending on the available cache, the results may look very different on each run. It is better to use fio to test the IO performance.
    Look at this thread for more information: https://forum.proxmox.com/threads/proxmox-ve-ceph-benchmark-2018-02.41761/

    Please check with iperf what bandwidth you have between the links, to verify the graphs above.
     
  16. kaltsi

    kaltsi Member

    Joined:
    Mar 22, 2013
    Messages:
    32
    Likes Received:
    2
    Hi,

    iperf
    Code:
    [ ID] Interval       Transfer     Bandwidth
    [  6]  0.0-10.0 sec  2.73 GBytes  2.35 Gbits/sec
    [  4]  0.0-10.0 sec  2.73 GBytes  2.35 Gbits/sec
    [  5]  0.0-10.0 sec  2.73 GBytes  2.35 Gbits/sec
    [  3]  0.0-10.0 sec  2.73 GBytes  2.35 Gbits/sec
    [SUM]  0.0-10.0 sec  10.9 GBytes  9.39 Gbits/sec
    
    fio
    4ktest
    Code:
    fio --filename=/dev/sdd --direct=1 --rw=randrw --refill_buffers --norandommap --randrepeat=0 --ioengine=libaio --bs=4k --rwmixread=100 --iodepth=16 --numjobs=16 --runtime=60 --group_reporting --name=4ktest
    4ktest: (g=0): rw=randrw, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=16
    ...
    fio-2.16
    Starting 16 processes
    Jobs: 16 (f=16): [r(16)] [100.0% done] [920KB/0KB/0KB /s] [230/0/0 iops] [eta 00m:00s]
    4ktest: (groupid=0, jobs=16): err= 0: pid=1373324: Mon Jul 16 16:21:11 2018
      read : io=56844KB, bw=959234B/s, iops=234, runt= 60682msec
        slat (usec): min=1, max=1016.5K, avg=61651.56, stdev=216010.11
        clat (msec): min=9, max=2878, avg=1025.78, stdev=413.95
         lat (msec): min=14, max=2951, avg=1087.44, stdev=466.56
        clat percentiles (msec):
         |  1.00th=[   46],  5.00th=[  490], 10.00th=[  791], 20.00th=[  840],
         | 30.00th=[  865], 40.00th=[  889], 50.00th=[  914], 60.00th=[  938],
         | 70.00th=[  979], 80.00th=[ 1074], 90.00th=[ 1778], 95.00th=[ 1844],
         | 99.00th=[ 1958], 99.50th=[ 1991], 99.90th=[ 2540], 99.95th=[ 2606],
         | 99.99th=[ 2868]
        lat (msec) : 10=0.01%, 20=0.10%, 50=1.06%, 100=2.14%, 250=1.44%
        lat (msec) : 500=0.27%, 750=1.75%, 1000=66.63%, 2000=26.21%, >=2000=0.39%
      cpu          : usr=0.01%, sys=0.00%, ctx=2526, majf=0, minf=148
      IO depths    : 1=0.1%, 2=0.2%, 4=0.5%, 8=0.9%, 16=98.3%, 32=0.0%, >=64=0.0%
         submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
         complete  : 0=0.0%, 4=99.9%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
         issued    : total=r=14211/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
         latency   : target=0, window=0, percentile=100.00%, depth=16
    
    Run status group 0 (all jobs):
       READ: io=56844KB, aggrb=936KB/s, minb=936KB/s, maxb=936KB/s, mint=60682msec, maxt=60682msec
    
    Disk stats (read/write):
      sdd: ios=14343/0, merge=0/0, ticks=8086248/0, in_queue=8090536, util=99.87%
    
    8k7030test
    Code:
    fio --filename=/dev/sdd --direct=1 --rw=randrw --refill_buffers --norandommap --randrepeat=0 --ioengine=libaio --bs=8k --rwmixread=70 --iodepth=16 --numjobs=16 --runtime=60 --group_reporting --name=8k7030test
    8k7030test: (g=0): rw=randrw, bs=8K-8K/8K-8K/8K-8K, ioengine=libaio, iodepth=16
    ...
    fio-2.16
    Starting 16 processes
    Jobs: 14 (f=14): [f(2),_(1),f(12),_(1)] [100.0% done] [1712KB/536KB/0KB /s] [214/67/0 iops] [eta 00m:00s]
    8k7030test: (groupid=0, jobs=16): err= 0: pid=1375632: Mon Jul 16 16:24:58 2018
      read : io=69456KB, bw=1140.1KB/s, iops=142, runt= 60879msec
        slat (usec): min=1, max=1186.1K, avg=72396.31, stdev=252419.78
        clat (msec): min=13, max=3253, avg=1182.40, stdev=478.54
         lat (msec): min=13, max=3526, avg=1254.80, stdev=543.08
        clat percentiles (msec):
         |  1.00th=[   49],  5.00th=[  237], 10.00th=[  914], 20.00th=[  971],
         | 30.00th=[ 1004], 40.00th=[ 1037], 50.00th=[ 1057], 60.00th=[ 1090],
         | 70.00th=[ 1139], 80.00th=[ 1237], 90.00th=[ 2057], 95.00th=[ 2147],
         | 99.00th=[ 2278], 99.50th=[ 2343], 99.90th=[ 3064], 99.95th=[ 3130],
         | 99.99th=[ 3261]
      write: io=29104KB, bw=489536B/s, iops=59, runt= 60879msec
        slat (usec): min=1, max=1182.2K, avg=68616.06, stdev=247002.88
        clat (msec): min=19, max=4203, avg=1195.66, stdev=479.54
         lat (msec): min=19, max=4203, avg=1264.28, stdev=534.50
        clat percentiles (msec):
         |  1.00th=[   61],  5.00th=[  865], 10.00th=[  922], 20.00th=[  971],
         | 30.00th=[ 1012], 40.00th=[ 1037], 50.00th=[ 1057], 60.00th=[ 1090],
         | 70.00th=[ 1139], 80.00th=[ 1221], 90.00th=[ 2057], 95.00th=[ 2147],
         | 99.00th=[ 2409], 99.50th=[ 3097], 99.90th=[ 3326], 99.95th=[ 3490],
         | 99.99th=[ 4228]
        lat (msec) : 20=0.02%, 50=0.92%, 100=1.73%, 250=1.95%, 500=0.28%
        lat (msec) : 750=0.40%, 1000=22.09%, 2000=59.75%, >=2000=12.86%
      cpu          : usr=0.01%, sys=0.01%, ctx=2133, majf=0, minf=164
      IO depths    : 1=0.1%, 2=0.3%, 4=0.5%, 8=1.0%, 16=98.1%, 32=0.0%, >=64=0.0%
         submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
         complete  : 0=0.0%, 4=99.9%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
         issued    : total=r=8682/w=3638/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
         latency   : target=0, window=0, percentile=100.00%, depth=16
    
    Run status group 0 (all jobs):
       READ: io=69456KB, aggrb=1140KB/s, minb=1140KB/s, maxb=1140KB/s, mint=60879msec, maxt=60879msec
      WRITE: io=29104KB, aggrb=478KB/s, minb=478KB/s, maxb=478KB/s, mint=60879msec, maxt=60879msec
    
    Disk stats (read/write):
      sdd: ios=8777/3637, merge=0/0, ticks=5571560/2499196, in_queue=8108924, util=99.89%
    
     
  17. Alwin

    Alwin Proxmox Staff Member
    Staff Member

    Joined:
    Aug 1, 2017
    Messages:
    1,571
    Likes Received:
    138
    When you run the test from page three of the benchmark paper and compare the results, they are at the level of the Seagate Constellation. There is not really much more to get out of those 7.2k SATA disks.

    For faster results, you need either more SATA disks or SSDs/NVMe. With the latter, you can use a separate pool to get a high-speed pool alongside the high-capacity one.
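
    With device classes (Luminous and later), a separate fast pool could be built roughly like this (a sketch; rule name, pool name and PG count are only examples):
    Code:
    # SSD/NVMe OSDs normally get their device class assigned automatically, see 'ceph osd tree'
    ceph osd crush rule create-replicated ssd-rule default host ssd
    ceph osd pool create fast-pool 128 128 replicated ssd-rule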
     
  18. kaltsi

    kaltsi Member

    Joined:
    Mar 22, 2013
    Messages:
    32
    Likes Received:
    2
    Thank you for your help and hints.
    I replaced some SATA disks with SAS 10k disks, and I added one SSD to every node to handle the Ceph DB.
    Ceph is now about 10 times faster. This configuration is enough for my needs.
    It was not easy, because I could not shut down the cluster.
     
    AlexLup and Alwin like this.