Ceph issue after 7.4 to 8.0 migration

kifeo · Jun 28, 2023

Hi

I've recently upgraded from 7.4 to 8.0.
I have a cluster of 5 nodes, and 2 have the below failed condition. I mention that these two are HP N54L.

All ceph-related process are failing like this :

Code:

0> 2023-06-28T12:19:08.096+0200 7f5efbaf3a00 -1 *** Caught signal (Illegal instruction) **
 in thread 7f5efbaf3a00 thread_name:ceph-mon

 ceph version 17.2.6 (810db68029296377607028a6c6da1ec06f5a2b27) quincy (stable)
 1: /lib/x86_64-linux-gnu/libc.so.6(+0x3bf90) [0x7f5efc193f90]
 2: gf_init_hard()
 3: gf_init_easy()
 4: galois_init_default_field()
 5: jerasure_init()
 6: __erasure_code_init()
 7: (ceph::ErasureCodePluginRegistry::load(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::ErasureCodePl
ugin**, std::ostream*)+0x2b5) [0x55a42d3fb605]
 8: (ceph::ErasureCodePluginRegistry::preload(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::ostream*)+0
x9f) [0x55a42d3fbbaf]
 9: (global_init_preload_erasure_code(ceph::common::CephContext const*)+0x7c2) [0x55a42cea8f92]
 10: main()
 11: /lib/x86_64-linux-gnu/libc.so.6(+0x2718a) [0x7f5efc17f18a]
 12: __libc_start_main()
 13: _start()
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

the same happens for a monitor or a osd, and also after a reboot.

Any hint on how I could resolve this ?

Thanks

fiona · Jun 28, 2023

Hi,
it's likely that the Ceph build is not supported on these old CPUs anymore: https://forum.proxmox.com/threads/proxmox-ve-8-0-released.129320/post-567968

EDIT: thread with suggestions for how to debug/diagnose further: https://forum.proxmox.com/threads/p...no-longer-working-on-amd-opteron-2427.129613/

kifeo · Jun 28, 2023

Thanks !
would you advice to get back to 7.4 ? (if supported? )

fiona · Jun 28, 2023

Proxmox VE is based on Debian and unfortunately, Debian (package management) is not designed to allow downgrades across major releases. And running 7.4 nodes in the same cluster as 8.0 nodes for a long time is not recommended either. So if you really plan to go back, it might be easiest to re-install the nodes with 7.4 one-by-one. I really can't tell if there will be a build of Ceph that supports these old CPUs, would depend on what the actual issue is.

Minimons · Aug 1, 2023

This is REALLY bad! Many of us have million dollars hardware. We cannot just change our whole infrastructure just because someone do not want to compile two different binaries.
This was happening to Microsoft, then VMware and now also Proxmox.
This is REALLY a hit to the trust in Proxmox. I understand that some people are installing this on $500 laptops but serious setups do not update all new hardware just because some people do not want to add one or two compilation parameters.
Maybe this is a sign that we need to move away from Proxmox before it is too late.

alexskysilk · Aug 1, 2023

Minimons said:
This is REALLY bad! Many of us have million dollars hardware.

Harsh lesson for anyone. IT teams that have "millions of dollars" of responsibility typically don't upgrade without testing EXTENSIVELY, and even then kicking and screaming. I usually wait as long as I can just to LAB an upgrade, much less perform on production.

Minimons · Aug 2, 2023

The point is just that I want to decide what hardware my organisation uses. It should not be Proxmox or (in this case) Ceph (or the person who actually compiled the Ceph binaries) who does.
I can understand if the new software actually requires the new hardware. But I do not believe that is. It is just a matter of either compiling the binaries with backward compatibility or produce two different binaries.
EXTENSIVE testing will never change the binary to be able to run on the AMD64 architecture again.
This is a slippery slope and the consequences are huge. Now instead of being productive we need to baby-sit our server hardware too. What is why we abstracted the hardware away in the first place!
Who will buy a Lenovo ThinkSystem SR950 V3 Server for a million dollars if Proxmox 'do not like the color of it' 5 years later?

fiona · Aug 3, 2023

Hi,

Minimons said:
The point is just that I want to decide what hardware my organisation uses. It should not be Proxmox or (in this case) Ceph (or the person who actually compiled the Ceph binaries) who does.
I can understand if the new software actually requires the new hardware. But I do not believe that is. It is just a matter of either compiling the binaries with backward compatibility or produce two different binaries.
EXTENSIVE testing will never change the binary to be able to run on the AMD64 architecture again.
This is a slippery slope and the consequences are huge. Now instead of being productive we need to baby-sit our server hardware too. What is why we abstracted the hardware away in the first place!
Who will buy a Lenovo ThinkSystem SR950 V3 Server for a million dollars if Proxmox 'do not like the color of it' 5 years later?

It's not like we chose to break support for this older hardware and if it can be fixed without downsides we will. For more information, see: https://forum.proxmox.com/threads/p...orking-on-amd-opteron-2427.129613/post-568822

Diego Aguirre · Aug 23, 2023

Same issue with Intel(R) Xeon(R) CPU E5320 @ 1.86GHz

kifeo · Nov 8, 2023

The issue on the HP N54L is resolved with 17.2.7

Search

Search

Ceph issue after 7.4 to 8.0 migration

kifeo

Well-Known Member

Attachments

fiona

Proxmox Staff Member

kifeo

Well-Known Member

fiona

Proxmox Staff Member

Minimons

New Member

alexskysilk

Distinguished Member

Minimons

New Member

fiona

Proxmox Staff Member

Diego Aguirre

Renowned Member

kifeo

Well-Known Member