I have a 3-way Proxmox VE cluster with Ceph as shared storage.
I wanted to run some Ceph benchmarks (rados bench) on my pool. On one of the monitor nodes (prox3), the random read test runs fine:
Code:
root@prox3:~# rados bench -p ceph 20 rand
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
    0       0         0         0         0         0            -           0
    1      16       533       517   2067.65      2068   0.00483917   0.0290085
    2      15      1056      1041   2080.05      2096    0.0240834   0.0290321
    3      15      1578      1563   2082.66      2088    0.0188987   0.0291045
    4      15      2113      2098   2095.24      2140    0.0176386   0.0290918
    5      16      2634      2618   2091.72      2080    0.0183882   0.0290534
    6      15      3141      3126    2081.7      2032    0.0469602   0.0293062
    7      15      3681      3666   2091.98      2160     0.010834   0.0291488
    8      15      4214      4199   2096.95      2132    0.0201944   0.0291226
    9      15      4754      4739   2103.55      2160    0.0136696   0.0290463
   10      15      5230      5215   2083.58      1904    0.0618851   0.0293151
   11      15      5758      5743   2086.13      2112      0.04911   0.0292812
   12      15      6299      6284   2092.58      2164    0.0574348   0.0292058
   13      15      6815      6800   2090.36      2064    0.0421002   0.0292674
   14      15      7342      7327   2091.61      2108    0.0134611     0.02924
   15      16      7862      7846    2090.2      2076    0.0365688   0.0292554
   16      15      8396      8381   2093.29      2140     0.075214   0.0292107
   17      15      8936      8921   2097.19      2160    0.0139684   0.0291771
   18      16      9484      9468   2102.21      2188   0.00950573   0.0290996
   19      15     10017     10002   2103.73      2136   0.00912706   0.0290682
2016-11-22 10:48:23.772890 min lat: 0.00340854 max lat: 0.136083 avg lat: 0.0290562
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
   20      13     10539     10526   2102.85      2096    0.0193621   0.0290562
Total time run:       20.045966
Total reads made:     10539
Read size:            4194304
Bandwidth (MB/sec):   2102.97
Average IOPS:         525
Average Latency(s):   0.0291114
Max latency(s):       0.136083
Min latency(s):       0.00340854
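For context, as far as I understand it, the rand and seq modes of rados bench read back objects left behind by an earlier write benchmark run with --no-cleanup, so the full sequence I assume is needed looks roughly like this (the pool name ceph and the 20-second duration are just the values I used):
Code:
# write benchmark objects and keep them so the read tests have data to read
rados bench -p ceph 20 write --no-cleanup
# read benchmarks against those objects
rados bench -p ceph 20 seq
rados bench -p ceph 20 rand
# remove the benchmark objects again afterwards (if the cleanup subcommand is available in this version)
rados -p ceph cleanup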
But on the other two monitor nodes (prox1 and prox2), the same random read test fails like this:
Code:
root@prox2:~# rados bench -p ceph 20 rand
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
    0       0         0         0         0         0            -           0
read got -2
error during benchmark: -5
error 5: (5) Input/output error
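If it helps with reading that output: the -2 returned by the read looks like ENOENT (object not found), and error 5 is EIO. This is what I would run on prox2 to check whether any benchmark objects are even visible from that node (assuming they carry a "benchmark" prefix in their names):
Code:
root@prox2:~# rados -p ceph ls | grep -i benchmark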
A sequential read test fails too, this time with a segfault on both prox1 and prox2, while it again works on prox3:
Code:
root@prox1:~# rados bench -p ceph 10 seq
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
    0       0         0         0         0         0            -           0
*** Caught signal (Segmentation fault) **
in thread 7fa7a1636780
ceph version 0.94.9 (fe6d859066244b97b24f09d46552afc2071e6f90)
1: rados() [0x4e77d3]
2: (()+0xf8d0) [0x7fa79ea578d0]
3: (()+0x147460) [0x7fa79d426460]
4: (ObjBencher::seq_read_bench(int, int, int, int, bool)+0x1011) [0x4daf51]
5: (ObjBencher::aio_bench(int, int, int, int, int, bool, char const*, bool)+0x2df) [0x4dfbdf]
6: (main()+0xa664) [0x4bf4e4]
7: (__libc_start_main()+0xf5) [0x7fa79d300b45]
8: rados() [0x4c3947]
2016-11-22 10:41:51.496241 7fa7a1636780 -1 *** Caught signal (Segmentation fault) **
in thread 7fa7a1636780
ceph version 0.94.9 (fe6d859066244b97b24f09d46552afc2071e6f90)
1: rados() [0x4e77d3]
2: (()+0xf8d0) [0x7fa79ea578d0]
3: (()+0x147460) [0x7fa79d426460]
4: (ObjBencher::seq_read_bench(int, int, int, int, bool)+0x1011) [0x4daf51]
5: (ObjBencher::aio_bench(int, int, int, int, int, bool, char const*, bool)+0x2df) [0x4dfbdf]
6: (main()+0xa664) [0x4bf4e4]
7: (__libc_start_main()+0xf5) [0x7fa79d300b45]
8: rados() [0x4c3947]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
--- begin dump of recent events ---
(CUT HERE DUE TO POST LENGTH RESTRICTION)
--- end dump of recent events ---
Segmentation fault
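Following the NOTE in the crash dump, this is roughly what I would run to produce the disassembly needed to interpret the frame addresses (e.g. 0x4daf51 in ObjBencher::seq_read_bench); /usr/bin/rados is my assumption about where the binary lives on a standard Debian/Proxmox install:
Code:
# dump the rados binary with relocations and source (as the crash note asks),
# then look up the listed frame addresses in the output
objdump -rdS /usr/bin/rados > rados-objdump.txt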
Ceph otherwise seems to be working fine; the status is HEALTH_OK:
Code:
root@prox1:~# ceph -s
    cluster ab9b66eb-4363-4fca-85dd-e67e47aef05f
     health HEALTH_OK
     monmap e3: 3 mons at {0=10.15.15.50:6789/0,1=10.15.15.51:6789/0,2=10.15.15.52:6789/0}
            election epoch 64, quorum 0,1,2 0,1,2
     osdmap e106: 6 osds: 6 up, 6 in
      pgmap v318719: 450 pgs, 1 pools, 620 GB data, 155 kobjects
            1236 GB used, 2211 GB / 3448 GB avail
                 450 active+clean
  client io 18715 kB/s rd, 7539 kB/s wr, 459 op/s
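In case the package versions matter: this is what I would use to verify that all three nodes run the same Ceph version (assuming the root SSH access that Proxmox sets up between cluster members):
Code:
# print the installed ceph client version on each node
for h in prox1 prox2 prox3; do ssh $h ceph -v; done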
Any idea what could be causing this?