Proxmox 4.2 less stable than 3.4

hybrid512

Hi,

I was running Proxmox 3.4 on a 9-node cluster with Ceph quite happily for more than a year and it was very stable.

I reinstalled this cluster (and even added nodes) with Proxmox 4.2, updated to the current release as of today, and not a week goes by without a kernel panic, a down OSD, or some other Ceph problem (MONs go down frequently for no reason).

It is definitely less stable than before, yet this is the same hardware.
I have no clue where the problems are coming from, which is a shame because there are plenty of new features I really like in this release; only the stability has been greatly degraded.

Any idea what information I could provide to help debug this situation? For example, I could gather and post something like the sketch below.
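Just a rough sketch using the standard Proxmox/Ceph command-line tools; the time window is only an example:

Code:
# Proxmox package and kernel versions on each node
pveversion -v

# Cluster-wide Ceph health and which OSDs/MONs are currently down
ceph -s
ceph health detail
ceph osd tree

# Logs around the time a daemon went down (time window is just an example)
journalctl --since "2016-07-18 12:00" --until "2016-07-18 13:00"
journalctl -k --since "2016-07-18 12:00" --until "2016-07-18 13:00"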

Best regards.
 
Well... I would think this is somewhat related to the kernel/Ceph (maybe KRBD), because most of the time the origin of my crashes is Ceph-related.
OSDs or MONs go down for no reason, at any moment, on any node.
My Ceph storage volume is configured with KRBD rather than the qemu (librbd) driver. I read somewhere that this puts less overhead on the node and probably brings some performance gains.
I didn't notice any real gain, and I don't know whether it is more or less stable than the qemu driver; I can't even be sure it is related... I'm just trying to figure out where these problems are coming from. For reference, the relevant storage definition is sketched below.
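A minimal sketch of what I mean, assuming the usual /etc/pve/storage.cfg RBD syntax; the storage id, pool name and monitor addresses here are placeholders, not my real values:

Code:
rbd: ceph-vm
        monhost 10.10.10.1 10.10.10.2 10.10.10.3
        pool rbd
        username admin
        content images
        krbd 1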

It is probably not my hard drives: if it were always the same OSDs going down, I could accept that, but these are drives that were working flawlessly with the previous setup (Proxmox 3.4) and they go down randomly; it is never the same disk on the same node (I have 13 nodes with 39 OSDs).

Here is the backtrace I got today when one of my OSDs went down... in case that helps:

Code:
Jul 18 12:38:14 pc2-px03 kernel: [1107987.885749] sd 0:2:1:0: [sdb] tag#0 FAILED Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK
Jul 18 12:38:14 pc2-px03 kernel: [1107987.885779] sd 0:2:1:0: [sdb] tag#0 CDB: Read(10) 28 00 01 f0 5b 10 00 01 00 00
Jul 18 12:38:14 pc2-px03 kernel: [1107987.885782] blk_update_request: I/O error, dev sdb, sector 32529168
Jul 18 12:38:14 pc2-px03 kernel: [1107987.925681] sd 0:2:1:0: [sdb] tag#0 FAILED Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK
Jul 18 12:38:14 pc2-px03 kernel: [1107987.925707] sd 0:2:1:0: [sdb] tag#0 CDB: Read(10) 28 00 01 f0 5b 38 00 00 08 00
Jul 18 12:38:14 pc2-px03 kernel: [1107987.925710] blk_update_request: I/O error, dev sdb, sector 32529208
Jul 18 12:38:14 pc2-px03 kernel: [1107987.965706] sd 0:2:1:0: [sdb] tag#0 FAILED Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK
Jul 18 12:38:14 pc2-px03 kernel: [1107987.965734] sd 0:2:1:0: [sdb] tag#0 CDB: Read(10) 28 00 01 f0 5b 38 00 00 08 00
Jul 18 12:38:14 pc2-px03 kernel: [1107987.965737] blk_update_request: I/O error, dev sdb, sector 32529208
Jul 18 12:38:14 pc2-px03 bash[3892]: os/FileStore.cc: In function 'virtual int FileStore::read(coll_t, const ghobject_t&, uint64_t, size_t, ceph::bufferlist&, uint32_t, bool)' thread 7fc2b07ac700 time 2016-07-18 12:38:14.144530
Jul 18 12:38:14 pc2-px03 bash[3892]: os/FileStore.cc: 2854: FAILED assert(allow_eio || !m_filestore_fail_eio || got != -5)
Jul 18 12:38:14 pc2-px03 bash[3892]: ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432)
Jul 18 12:38:14 pc2-px03 bash[3892]: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x76) [0xc0e746]
Jul 18 12:38:14 pc2-px03 bash[3892]: 2: (FileStore::read(coll_t, ghobject_t const&, unsigned long, unsigned long, ceph::buffer::list&, unsigned int, bool)+0xcc2) [0x9108a2]
Jul 18 12:38:14 pc2-px03 bash[3892]: 3: (ReplicatedBackend::be_deep_scrub(hobject_t const&, unsigned int, ScrubMap::object&, ThreadPool::TPHandle&)+0x31c) [0xa21cfc]
Jul 18 12:38:14 pc2-px03 bash[3892]: 4: (PGBackend::be_scan_list(ScrubMap&, std::vector<hobject_t, std::allocator<hobject_t> > const&, bool, unsigned int, ThreadPool::TPHandle&)+0x2ca) [0x8d2c9a]
Jul 18 12:38:14 pc2-px03 bash[3892]: 5: (PG::build_scrub_map_chunk(ScrubMap&, hobject_t, hobject_t, bool, unsigned int, ThreadPool::TPHandle&)+0x1fa) [0x7dfc2a]
Jul 18 12:38:14 pc2-px03 bash[3892]: 6: (PG::replica_scrub(MOSDRepScrub*, ThreadPool::TPHandle&)+0x4ae) [0x7e03de]
Jul 18 12:38:14 pc2-px03 bash[3892]: 7: (OSD::RepScrubWQ::_process(MOSDRepScrub*, ThreadPool::TPHandle&)+0xb3) [0x6ce513]
Jul 18 12:38:14 pc2-px03 bash[3892]: 8: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa77) [0xbfecf7]
Jul 18 12:38:14 pc2-px03 bash[3892]: 9: (ThreadPool::WorkThread::entry()+0x10) [0xbffdc0]
Jul 18 12:38:14 pc2-px03 bash[3892]: 10: (()+0x80a4) [0x7fc2d39d00a4]
Jul 18 12:38:14 pc2-px03 bash[3892]: 11: (clone()+0x6d) [0x7fc2d1f2b87d]
Jul 18 12:38:14 pc2-px03 bash[3892]: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Jul 18 12:38:14 pc2-px03 bash[3892]: 2016-07-18 12:38:14.190616 7fc2b07ac700 -1 os/FileStore.cc: In function 'virtual int FileStore::read(coll_t, const ghobject_t&, uint64_t, size_t, ceph::bufferlist&, uint32_t, bool)' thread 7fc2b07ac700 time 2016-07-18 12:38:14.144530
Jul 18 12:38:14 pc2-px03 bash[3892]: os/FileStore.cc: 2854: FAILED assert(allow_eio || !m_filestore_fail_eio || got != -5)
Jul 18 12:38:14 pc2-px03 bash[3892]: ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432)
Jul 18 12:38:14 pc2-px03 bash[3892]: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x76) [0xc0e746]
Jul 18 12:38:14 pc2-px03 bash[3892]: 2: (FileStore::read(coll_t, ghobject_t const&, unsigned long, unsigned long, ceph::buffer::list&, unsigned int, bool)+0xcc2) [0x9108a2]
Jul 18 12:38:14 pc2-px03 bash[3892]: 3: (ReplicatedBackend::be_deep_scrub(hobject_t const&, unsigned int, ScrubMap::object&, ThreadPool::TPHandle&)+0x31c) [0xa21cfc]
Jul 18 12:38:14 pc2-px03 bash[3892]: 4: (PGBackend::be_scan_list(ScrubMap&, std::vector<hobject_t, std::allocator<hobject_t> > const&, bool, unsigned int, ThreadPool::TPHandle&)+0x2ca) [0x8d2c9a]
Jul 18 12:38:14 pc2-px03 bash[3892]: 5: (PG::build_scrub_map_chunk(ScrubMap&, hobject_t, hobject_t, bool, unsigned int, ThreadPool::TPHandle&)+0x1fa) [0x7dfc2a]
...
(stack trace is too long for the forum, just ask if you need the expanded log)

I restarted the node and, guess what? Every OSD is up and running and everything runs smoothly... until the next time, I suppose.

So, if anyone has a clue, that would be really great.
 
hybrid512 said:
Well... I would think this is somewhat related to the kernel/Ceph (maybe KRBD), because most of the time the origin of my crashes is Ceph-related.
OSDs or MONs go down for no reason, at any moment, on any node.

Hi,
I don't use KRBD, but PVE 4.2 with Ceph works well for me (though I only have the MONs on the PVE hosts; the OSDs are on separate OSD nodes, without PVE).

hybrid512 said:
...
Here is the backtrace I got today when one of my OSDs went down... in case that helps:

Code:
Jul 18 12:38:14 pc2-px03 kernel: [1107987.885749] sd 0:2:1:0: [sdb] tag#0 FAILED Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK
Jul 18 12:38:14 pc2-px03 kernel: [1107987.885779] sd 0:2:1:0: [sdb] tag#0 CDB: Read(10) 28 00 01 f0 5b 10 00 01 00 00
Jul 18 12:38:14 pc2-px03 kernel: [1107987.885782] blk_update_request: I/O error, dev sdb, sector 32529168
Jul 18 12:38:14 pc2-px03 kernel: [1107987.925681] sd 0:2:1:0: [sdb] tag#0 FAILED Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK
...
This does not look good: your OSD disk has a hardware issue! Check it with smartctl ("apt-get install smartmontools"), for example as sketched below, and replace the disk sooner rather than later...
Even though this OSD is already back up and running, I would not expect less trouble from it...
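A quick sketch; /dev/sdb is taken from your log, so adjust as needed, and if the disk sits behind a RAID controller you may need smartctl's -d option:

Code:
# install the SMART tools and query the disk that threw the I/O errors
apt-get install smartmontools
smartctl -H /dev/sdb    # quick overall health verdict
smartctl -a /dev/sdb    # full report: reallocated/pending sectors, error log
# if the disk is behind a RAID controller, something like:
# smartctl -a -d megaraid,0 /dev/sdb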

Udo
 
That's what it looks like, but it is not... the SMART state is good and, since I restarted the node, it has been running smoothly with no errors so far.
As I said, I don't think this is a hardware issue but rather a software one, though I don't really know where.
And as I said before, I was running the same hardware with Proxmox 3 continuously for more than a year and it was extremely stable.
Since my migration to Proxmox 4, I have had many issues of this type, completely at random (it is never the same OSD or node that fails).
 
The only problem I have with the PVE 4.x series is with LXC containers, which in some cases cause a reboot of the node they run on. But that is a problem related to LXC itself (which I only discovered yesterday), not to PVE 4.x... (in the PVE 3.x series, LXC didn't exist).
 
