Hi,
Yesterday I upgraded a cluster to the current Proxmox version (4.4-87); Ceph is at 10.2.7-1~bpo80+1.
It is a three-node Ceph cluster with one OSD per node and a pool size/min_size of 3/2 (data is replicated to all three nodes, and at least two of the three nodes must be up for the cluster to remain accessible).
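For reference, this is how I check those replication settings (the pool name is just a placeholder for my actual pool):
ceph osd lspools
ceph osd pool get <poolname> size
ceph osd pool get <poolname> min_size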
Today I got a Ceph health warning: 128 pgs active+undersized+degraded.
I can see that the OSD on one node is down and out: osd.2.
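I verified the OSD state with the usual commands, nothing special:
ceph osd tree
systemctl status ceph-osd@2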
Digging deeper into the osd.2 log shows "leveldb: Compaction error: Corruption: bad entry in block". Please see the detailed log below.
I never had this before; the cluster had been up and running without problems for more than a year...
Any ideas?
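If the leveldb of osd.2 is really corrupted, my current plan would be to remove osd.2 and recreate it so it backfills from the other two replicas, roughly like this (please correct me if that is the wrong approach; /dev/sdX stands for the OSD disk):
systemctl stop ceph-osd@2
ceph osd out osd.2
ceph osd crush remove osd.2
ceph auth del osd.2
ceph osd rm osd.2
pveceph createosd /dev/sdX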
ceph status:
cluster 42ca65c7-716e-4357-802e-44178a1a0c03
health HEALTH_WARN
128 pgs degraded
128 pgs stuck degraded
128 pgs stuck unclean
128 pgs stuck undersized
128 pgs undersized
recovery 40654/121962 objects degraded (33.333%)
monmap e3: 3 mons at {0=10.0.99.82:6789/0,1=10.0.99.81:6789/0,2=10.0.99.83:6789/0}
election epoch 332, quorum 0,1,2 1,0,2
osdmap e505: 3 osds: 2 up, 2 in; 30 remapped pgs
flags sortbitwise,require_jewel_osds
pgmap v4364407: 128 pgs, 1 pools, 157 GB data, 40654 objects
318 GB used, 680 GB / 999 GB avail
40654/121962 objects degraded (33.333%)
128 active+undersized+degraded
client io 28669 B/s wr, 0 op/s rd, 9 op/s wr
osd.2.log:
2017-05-11 17:25:53.051623 7f995c629700 1 leveldb: Compacting 4@0 + 5@1 files
2017-05-11 17:25:53.102394 7f995c629700 1 leveldb: compacted to: files[ 4 5 6 0 0 0 0 ]
2017-05-11 17:25:53.102410 7f995c629700 1 leveldb: Compaction error: Corruption: bad entry in block
2017-05-11 17:25:53.156461 7f99617a1700 0 filestore(/var/lib/ceph/osd/ceph-2) error (1) Operation not permitted not handled on operation 0x5616f4aac000 (51655477.0.0, or op 0, counting from 0)
2017-05-11 17:25:53.156471 7f99617a1700 0 filestore(/var/lib/ceph/osd/ceph-2) EPERM suggests file(s) in osd data dir not owned by ceph user, or leveldb corruption
2017-05-11 17:25:53.156473 7f99617a1700 0 filestore(/var/lib/ceph/osd/ceph-2) transaction dump:
{
"ops": [
{
"op_num": 0,
"op_name": "omap_setkeys",
"collection": "0.10_head",
"oid": "#0:08000000::::head#",
"attr_lens": {
"0000000485.00000000000002022936": 183,
"_info": 863
}
}
]
}
2017-05-11 17:25:53.174231 7f99617a1700 -1 os/filestore/FileStore.cc: In function 'void FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t, int, ThreadPool::TPHandle*)' thread 7f99617a1700 time 2017-05-11 17:25:53.172674
os/filestore/FileStore.cc: 2920: FAILED assert(0 == "unexpected error")
ceph version 10.2.7 (50e863e0f4bc8f4b9e31156de690d765af245185)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x82) [0x5616d902a202]
2: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned long, int, ThreadPool::TPHandle*)+0xed4) [0x5616d8cca1e4]
3: (FileStore::_do_transactions(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, unsigned long, ThreadPool::TPHandle*)+0x3b) [0x5616d8cd052b]
4: (FileStore::_do_op(FileStore::OpSequencer*, ThreadPool::TPHandle&)+0x2c6) [0x5616d8cd0826]
5: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa9f) [0x5616d901ae3f]
6: (ThreadPool::WorkThread::entry()+0x10) [0x5616d901bd70]
7: (()+0x8064) [0x7f996f59b064]
8: (clone()+0x6d) [0x7f996d69c62d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
--- begin dump of recent events ---
-10000> 2017-05-11 17:22:15.803795 7f994ddfd700 5 -- op tracker -- seq: 874420, time: 2017-05-11 17:22:15.803794, event: reached_pg,
...
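Since the log also mentions EPERM and file ownership, I will double-check that everything under the OSD data dir is owned by the ceph user (the omap directory is where I expect the leveldb files with FileStore):
ls -ln /var/lib/ceph/osd/ceph-2
ls -ln /var/lib/ceph/osd/ceph-2/current/omap | head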