Proxmox 4.2 less stable than 3.4

hybrid512

Hi,

I was running Proxmox 3.4 on a 9-node cluster with Ceph quite happily for more than a year and it was very stable.

I reinstalled this cluster (and even added nodes) with Proxmox 4.2, updated to the current release as of today, and not a week goes by without a kernel panic, a down OSD, or some other Ceph problem (MONs go down frequently for no reason).

It is definitely less stable than before, yet this is the same hardware.
I have no clue where the problems are coming from, which is a shame because there are plenty of new features I really like in this release; only the stability has been greatly degraded.

Any idea what information I could provide to help debug this situation? For example, I could gather and post something like the sketch below.
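Just a rough sketch using the standard Proxmox/Ceph command-line tools; the time window is only an example:

Code:
# Proxmox package and kernel versions on each node
pveversion -v

# Cluster-wide Ceph health and which OSDs/MONs are currently down
ceph -s
ceph health detail
ceph osd tree

# Logs around the time a daemon went down (time window is just an example)
journalctl --since "2016-07-18 12:00" --until "2016-07-18 13:00"
journalctl -k --since "2016-07-18 12:00" --until "2016-07-18 13:00"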

Best regards.
 
Well... I would think this is somewhat related to the kernel/Ceph (maybe KRBD), because most of the time the origin of my crashes is Ceph-related.
OSDs or MONs go down for no reason, at any moment, on any node.
My Ceph storage volume is configured with KRBD rather than the qemu (librbd) driver. I read somewhere that this puts less overhead on the node and probably brings some performance gains.
I didn't notice any real gain, and I don't know whether it is more or less stable than the qemu driver; I can't even be sure it is related... I'm just trying to figure out where these problems are coming from. For reference, the relevant storage definition is sketched below.
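A minimal sketch of what I mean, assuming the usual /etc/pve/storage.cfg RBD syntax; the storage id, pool name and monitor addresses here are placeholders, not my real values:

Code:
rbd: ceph-vm
        monhost 10.10.10.1 10.10.10.2 10.10.10.3
        pool rbd
        username admin
        content images
        krbd 1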

It is probably not my hard drives: if it were always the same OSDs going down, I could accept that, but these are drives that were working flawlessly with the previous setup (Proxmox 3.4) and they go down randomly; it is never the same disk on the same node (I have 13 nodes with 39 OSDs).

Here is the backtrace I got today when one of my OSDs went down... in case that helps:

Code:
Jul 18 12:38:14 pc2-px03 kernel: [1107987.885749] sd 0:2:1:0: [sdb] tag#0 FAILED Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK
Jul 18 12:38:14 pc2-px03 kernel: [1107987.885779] sd 0:2:1:0: [sdb] tag#0 CDB: Read(10) 28 00 01 f0 5b 10 00 01 00 00
Jul 18 12:38:14 pc2-px03 kernel: [1107987.885782] blk_update_request: I/O error, dev sdb, sector 32529168
Jul 18 12:38:14 pc2-px03 kernel: [1107987.925681] sd 0:2:1:0: [sdb] tag#0 FAILED Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK
Jul 18 12:38:14 pc2-px03 kernel: [1107987.925707] sd 0:2:1:0: [sdb] tag#0 CDB: Read(10) 28 00 01 f0 5b 38 00 00 08 00
Jul 18 12:38:14 pc2-px03 kernel: [1107987.925710] blk_update_request: I/O error, dev sdb, sector 32529208
Jul 18 12:38:14 pc2-px03 kernel: [1107987.965706] sd 0:2:1:0: [sdb] tag#0 FAILED Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK
Jul 18 12:38:14 pc2-px03 kernel: [1107987.965734] sd 0:2:1:0: [sdb] tag#0 CDB: Read(10) 28 00 01 f0 5b 38 00 00 08 00
Jul 18 12:38:14 pc2-px03 kernel: [1107987.965737] blk_update_request: I/O error, dev sdb, sector 32529208
Jul 18 12:38:14 pc2-px03 bash[3892]: os/FileStore.cc: In function 'virtual int FileStore::read(coll_t, const ghobject_t&, uint64_t, size_t, ceph::bufferlist&, uint32_t, bool)' thread 7fc2b07ac700 time 2016-07-18 12:38:14.144530
Jul 18 12:38:14 pc2-px03 bash[3892]: os/FileStore.cc: 2854: FAILED assert(allow_eio || !m_filestore_fail_eio || got != -5)
Jul 18 12:38:14 pc2-px03 bash[3892]: ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432)
Jul 18 12:38:14 pc2-px03 bash[3892]: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x76) [0xc0e746]
Jul 18 12:38:14 pc2-px03 bash[3892]: 2: (FileStore::read(coll_t, ghobject_t const&, unsigned long, unsigned long, ceph::buffer::list&, unsigned int, bool)+0xcc2) [0x9108a2]
Jul 18 12:38:14 pc2-px03 bash[3892]: 3: (ReplicatedBackend::be_deep_scrub(hobject_t const&, unsigned int, ScrubMap::object&, ThreadPool::TPHandle&)+0x31c) [0xa21cfc]
Jul 18 12:38:14 pc2-px03 bash[3892]: 4: (PGBackend::be_scan_list(ScrubMap&, std::vector<hobject_t, std::allocator<hobject_t> > const&, bool, unsigned int, ThreadPool::TPHandle&)+0x2ca) [0x8d2c9a]
Jul 18 12:38:14 pc2-px03 bash[3892]: 5: (PG::build_scrub_map_chunk(ScrubMap&, hobject_t, hobject_t, bool, unsigned int, ThreadPool::TPHandle&)+0x1fa) [0x7dfc2a]
Jul 18 12:38:14 pc2-px03 bash[3892]: 6: (PG::replica_scrub(MOSDRepScrub*, ThreadPool::TPHandle&)+0x4ae) [0x7e03de]
Jul 18 12:38:14 pc2-px03 bash[3892]: 7: (OSD::RepScrubWQ::_process(MOSDRepScrub*, ThreadPool::TPHandle&)+0xb3) [0x6ce513]
Jul 18 12:38:14 pc2-px03 bash[3892]: 8: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa77) [0xbfecf7]
Jul 18 12:38:14 pc2-px03 bash[3892]: 9: (ThreadPool::WorkThread::entry()+0x10) [0xbffdc0]
Jul 18 12:38:14 pc2-px03 bash[3892]: 10: (()+0x80a4) [0x7fc2d39d00a4]
Jul 18 12:38:14 pc2-px03 bash[3892]: 11: (clone()+0x6d) [0x7fc2d1f2b87d]
Jul 18 12:38:14 pc2-px03 bash[3892]: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Jul 18 12:38:14 pc2-px03 bash[3892]: 2016-07-18 12:38:14.190616 7fc2b07ac700 -1 os/FileStore.cc: In function 'virtual int FileStore::read(coll_t, const ghobject_t&, uint64_t, size_t, ceph::bufferlist&, uint32_t, bool)' thread 7fc2b07ac700 time 2016-07-18 12:38:14.144530
Jul 18 12:38:14 pc2-px03 bash[3892]: os/FileStore.cc: 2854: FAILED assert(allow_eio || !m_filestore_fail_eio || got != -5)
Jul 18 12:38:14 pc2-px03 bash[3892]: ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432)
Jul 18 12:38:14 pc2-px03 bash[3892]: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x76) [0xc0e746]
Jul 18 12:38:14 pc2-px03 bash[3892]: 2: (FileStore::read(coll_t, ghobject_t const&, unsigned long, unsigned long, ceph::buffer::list&, unsigned int, bool)+0xcc2) [0x9108a2]
Jul 18 12:38:14 pc2-px03 bash[3892]: 3: (ReplicatedBackend::be_deep_scrub(hobject_t const&, unsigned int, ScrubMap::object&, ThreadPool::TPHandle&)+0x31c) [0xa21cfc]
Jul 18 12:38:14 pc2-px03 bash[3892]: 4: (PGBackend::be_scan_list(ScrubMap&, std::vector<hobject_t, std::allocator<hobject_t> > const&, bool, unsigned int, ThreadPool::TPHandle&)+0x2ca) [0x8d2c9a]
Jul 18 12:38:14 pc2-px03 bash[3892]: 5: (PG::build_scrub_map_chunk(ScrubMap&, hobject_t, hobject_t, bool, unsigned int, ThreadPool::TPHandle&)+0x1fa) [0x7dfc2a]
...
(stack trace is too long for the forum, just ask if you need the expanded log)

I restarted the node and, guess what? Every OSD is up and running and everything runs smoothly... until the next time, I suppose.

So, if anyone has a clue, that would be really great.
 
hybrid512 said:
Well... I would think this is somewhat related to the kernel/Ceph (maybe KRBD), because most of the time the origin of my crashes is Ceph-related.
OSDs or MONs go down for no reason, at any moment, on any node.

Hi,
I don't use KRBD, but PVE 4.2 with Ceph works well for me (though I only have the MONs on the PVE hosts; the OSDs are on separate OSD nodes, without PVE).

hybrid512 said:
...
Here is the backtrace I got today when one of my OSDs went down... in case that helps:

Code:
Jul 18 12:38:14 pc2-px03 kernel: [1107987.885749] sd 0:2:1:0: [sdb] tag#0 FAILED Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK
Jul 18 12:38:14 pc2-px03 kernel: [1107987.885779] sd 0:2:1:0: [sdb] tag#0 CDB: Read(10) 28 00 01 f0 5b 10 00 01 00 00
Jul 18 12:38:14 pc2-px03 kernel: [1107987.885782] blk_update_request: I/O error, dev sdb, sector 32529168
Jul 18 12:38:14 pc2-px03 kernel: [1107987.925681] sd 0:2:1:0: [sdb] tag#0 FAILED Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK
...
This does not look good: your OSD disk has a hardware issue! Check it with smartctl ("apt-get install smartmontools"), for example as sketched below, and replace the disk sooner rather than later...
Even though this OSD is already back up and running, I would not expect less trouble from it...
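A quick sketch; /dev/sdb is taken from your log, so adjust as needed, and if the disk sits behind a RAID controller you may need smartctl's -d option:

Code:
# install the SMART tools and query the disk that threw the I/O errors
apt-get install smartmontools
smartctl -H /dev/sdb    # quick overall health verdict
smartctl -a /dev/sdb    # full report: reallocated/pending sectors, error log
# if the disk is behind a RAID controller, something like:
# smartctl -a -d megaraid,0 /dev/sdb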

Udo
 
That's what it looks like, but it is not... the SMART state is good and, since I restarted the node, it has been running smoothly with no errors so far.
As I said, I don't think this is a hardware issue but rather a software one, though I don't really know where.
And as I said before, I was running the same hardware with Proxmox 3 continuously for more than a year and it was extremely stable.
Since my migration to Proxmox 4, I have had many issues of this type, completely at random (it is never the same OSD or node that fails).
 
The only problem I have with the PVE 4.x series is with LXC containers, which in some cases cause a reboot of the node they run on. But that is a problem related to LXC itself (which I only discovered yesterday), not to PVE 4.x... (in the PVE 3.x series, LXC didn't exist).
 
