I have a 3-way Proxmox VE cluster with Ceph as shared storage.
I wanted to run some Ceph benchmarks (rados bench) on my pool. On one of the monitor nodes (prox3), the random read test runs fine:
Code:
root@prox3:~# rados bench -p ceph 20 rand
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
    0       0         0         0         0         0            -           0
    1      16       533       517   2067.65      2068   0.00483917   0.0290085
    2      15      1056      1041   2080.05      2096    0.0240834   0.0290321
    3      15      1578      1563   2082.66      2088    0.0188987   0.0291045
    4      15      2113      2098   2095.24      2140    0.0176386   0.0290918
    5      16      2634      2618   2091.72      2080    0.0183882   0.0290534
    6      15      3141      3126    2081.7      2032    0.0469602   0.0293062
    7      15      3681      3666   2091.98      2160     0.010834   0.0291488
    8      15      4214      4199   2096.95      2132    0.0201944   0.0291226
    9      15      4754      4739   2103.55      2160    0.0136696   0.0290463
   10      15      5230      5215   2083.58      1904    0.0618851   0.0293151
   11      15      5758      5743   2086.13      2112      0.04911   0.0292812
   12      15      6299      6284   2092.58      2164    0.0574348   0.0292058
   13      15      6815      6800   2090.36      2064    0.0421002   0.0292674
   14      15      7342      7327   2091.61      2108    0.0134611     0.02924
   15      16      7862      7846    2090.2      2076    0.0365688   0.0292554
   16      15      8396      8381   2093.29      2140     0.075214   0.0292107
   17      15      8936      8921   2097.19      2160    0.0139684   0.0291771
   18      16      9484      9468   2102.21      2188   0.00950573   0.0290996
   19      15     10017     10002   2103.73      2136   0.00912706   0.0290682
2016-11-22 10:48:23.772890 min lat: 0.00340854 max lat: 0.136083 avg lat: 0.0290562
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
   20      13     10539     10526   2102.85      2096    0.0193621   0.0290562
Total time run:       20.045966
Total reads made:     10539
Read size:            4194304
Bandwidth (MB/sec):   2102.97
Average IOPS:         525
Average Latency(s):   0.0291114
Max latency(s):       0.136083
Min latency(s):       0.00340854
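For context, as far as I understand it, the rand and seq modes of rados bench read back objects left behind by an earlier write benchmark run with --no-cleanup, so the full sequence I assume is needed looks roughly like this (the pool name ceph and the 20-second duration are just the values I used):
Code:
# write benchmark objects and keep them so the read tests have data to read
rados bench -p ceph 20 write --no-cleanup
# read benchmarks against those objects
rados bench -p ceph 20 seq
rados bench -p ceph 20 rand
# remove the benchmark objects again afterwards (if the cleanup subcommand is available in this version)
rados -p ceph cleanup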
But on the other two monitor nodes (prox1 and prox2), the same random read test fails like this:
Code:
root@prox2:~# rados bench -p ceph 20 rand
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
    0       0         0         0         0         0            -           0
read got -2
error during benchmark: -5
error 5: (5) Input/output error
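If it helps with reading that output: the -2 returned by the read looks like ENOENT (object not found), and error 5 is EIO. This is what I would run on prox2 to check whether any benchmark objects are even visible from that node (assuming they carry a "benchmark" prefix in their names):
Code:
root@prox2:~# rados -p ceph ls | grep -i benchmark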
A sequential read test fails too, this time with a segfault on both prox1 and prox2, while it again works on prox3:
Code:
root@prox1:~# rados bench -p ceph 10 seq
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
    0       0         0         0         0         0            -           0
*** Caught signal (Segmentation fault) **
in thread 7fa7a1636780
ceph version 0.94.9 (fe6d859066244b97b24f09d46552afc2071e6f90)
1: rados() [0x4e77d3]
2: (()+0xf8d0) [0x7fa79ea578d0]
3: (()+0x147460) [0x7fa79d426460]
4: (ObjBencher::seq_read_bench(int, int, int, int, bool)+0x1011) [0x4daf51]
5: (ObjBencher::aio_bench(int, int, int, int, int, bool, char const*, bool)+0x2df) [0x4dfbdf]
6: (main()+0xa664) [0x4bf4e4]
7: (__libc_start_main()+0xf5) [0x7fa79d300b45]
8: rados() [0x4c3947]
2016-11-22 10:41:51.496241 7fa7a1636780 -1 *** Caught signal (Segmentation fault) **
in thread 7fa7a1636780
ceph version 0.94.9 (fe6d859066244b97b24f09d46552afc2071e6f90)
1: rados() [0x4e77d3]
2: (()+0xf8d0) [0x7fa79ea578d0]
3: (()+0x147460) [0x7fa79d426460]
4: (ObjBencher::seq_read_bench(int, int, int, int, bool)+0x1011) [0x4daf51]
5: (ObjBencher::aio_bench(int, int, int, int, int, bool, char const*, bool)+0x2df) [0x4dfbdf]
6: (main()+0xa664) [0x4bf4e4]
7: (__libc_start_main()+0xf5) [0x7fa79d300b45]
8: rados() [0x4c3947]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
--- begin dump of recent events ---
(CUT HERE DUE TO POST LENGTH RESTRICTION)
--- end dump of recent events ---
Segmentation fault
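Following the NOTE in the crash dump, this is roughly what I would run to produce the disassembly needed to interpret the frame addresses (e.g. 0x4daf51 in ObjBencher::seq_read_bench); /usr/bin/rados is my assumption about where the binary lives on a standard Debian/Proxmox install:
Code:
# dump the rados binary with relocations and source (as the crash note asks),
# then look up the listed frame addresses in the output
objdump -rdS /usr/bin/rados > rados-objdump.txt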
Ceph otherwise seems to be working fine; the status is HEALTH_OK:
Code:
root@prox1:~# ceph -s
    cluster ab9b66eb-4363-4fca-85dd-e67e47aef05f
     health HEALTH_OK
     monmap e3: 3 mons at {0=10.15.15.50:6789/0,1=10.15.15.51:6789/0,2=10.15.15.52:6789/0}
            election epoch 64, quorum 0,1,2 0,1,2
     osdmap e106: 6 osds: 6 up, 6 in
      pgmap v318719: 450 pgs, 1 pools, 620 GB data, 155 kobjects
            1236 GB used, 2211 GB / 3448 GB avail
                 450 active+clean
  client io 18715 kB/s rd, 7539 kB/s wr, 459 op/s
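In case the package versions matter: this is what I would use to verify that all three nodes run the same Ceph version (assuming the root SSH access that Proxmox sets up between cluster members):
Code:
# print the installed ceph client version on each node
for h in prox1 prox2 prox3; do ssh $h ceph -v; done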
Any idea what could be causing this?