Ceph issue after 7.4 to 8.0 migration

kifeo

Hi

I've recently upgraded from 7.4 to 8.0.
I have a cluster of 5 nodes, and 2 of them show the failure below. I should mention that these two are HP N54L machines.

All Ceph-related processes are failing like this:

Code:
0> 2023-06-28T12:19:08.096+0200 7f5efbaf3a00 -1 *** Caught signal (Illegal instruction) **
 in thread 7f5efbaf3a00 thread_name:ceph-mon

 ceph version 17.2.6 (810db68029296377607028a6c6da1ec06f5a2b27) quincy (stable)
 1: /lib/x86_64-linux-gnu/libc.so.6(+0x3bf90) [0x7f5efc193f90]
 2: gf_init_hard()
 3: gf_init_easy()
 4: galois_init_default_field()
 5: jerasure_init()
 6: __erasure_code_init()
 7: (ceph::ErasureCodePluginRegistry::load(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::ErasureCodePlugin**, std::ostream*)+0x2b5) [0x55a42d3fb605]
 8: (ceph::ErasureCodePluginRegistry::preload(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::ostream*)+0x9f) [0x55a42d3fbbaf]
 9: (global_init_preload_erasure_code(ceph::common::CephContext const*)+0x7c2) [0x55a42cea8f92]
 10: main()
 11: /lib/x86_64-linux-gnu/libc.so.6(+0x2718a) [0x7f5efc17f18a]
 12: __libc_start_main()
 13: _start()
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

The same happens for a monitor or an OSD, and also after a reboot.
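
In case it helps with the diagnosis: since the trace points at gf_init_hard() in the erasure-code plugin, my guess is that the new Ceph build uses a SIMD fast path that this CPU does not have. Below is a small probe (just a sketch based on that assumption; it only shows what the CPU advertises, not which instruction actually faulted) that can be compiled on an affected node with gcc -O2 -o cpu_probe cpu_probe.c:

Code:
/* cpu_probe.c - print which x86 SIMD extensions this CPU advertises.
 * Assumption: the illegal instruction comes from a SIMD fast path in
 * gf-complete/jerasure (the trace points at gf_init_hard()); this only
 * shows what the CPU supports, not which instruction actually faulted. */
#include <stdio.h>

#define SHOW(name) printf("%-8s %s\n", name, \
        __builtin_cpu_supports(name) ? "yes" : "NO")

int main(void)
{
    __builtin_cpu_init();  /* populate GCC's CPU feature cache */
    SHOW("sse2");
    SHOW("sse3");
    SHOW("ssse3");
    SHOW("sse4.1");
    SHOW("sse4.2");
    SHOW("pclmul");
    SHOW("avx");
    SHOW("avx2");
    return 0;
}

On the two N54L nodes I would expect some of the newer extensions to come back as NO, which would at least confirm the direction, but that is only a guess until the exact faulting instruction is known.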

Any hint on how I could resolve this?


Thanks
 

Proxmox VE is based on Debian and unfortunately, Debian (package management) is not designed to allow downgrades across major releases. Running 7.4 nodes in the same cluster as 8.0 nodes for a long time is not recommended either, so if you really plan to go back, it might be easiest to re-install the nodes with 7.4 one by one. I really can't tell whether there will be a build of Ceph that supports these old CPUs; that would depend on what the actual issue is.
 
This is REALLY bad! Many of us have millions of dollars in hardware. We cannot just change our whole infrastructure just because someone does not want to compile two different binaries.
This happened with Microsoft, then VMware, and now Proxmox.
This is REALLY a hit to the trust in Proxmox. I understand that some people are installing this on $500 laptops, but serious setups do not replace all their hardware just because some people do not want to add one or two compilation parameters.
Maybe this is a sign that we need to move away from Proxmox before it is too late.
 
This is REALLY bad! Many of us have millions of dollars in hardware.
Harsh lesson for anyone. IT teams with "millions of dollars" of responsibility typically don't upgrade without testing EXTENSIVELY, and even then kicking and screaming. I usually wait as long as I can just to LAB an upgrade, much less perform one on production.
 
The point is just that I want to decide what hardware my organisation uses. It should not be Proxmox or (in this case) Ceph (or the person who actually compiled the Ceph binaries) who decides.
I could understand it if the new software actually required the new hardware, but I do not believe that is the case. It is just a matter of either compiling the binaries with backward compatibility or producing two different binaries.
EXTENSIVE testing will never change the binary to be able to run on this AMD64 hardware again.
This is a slippery slope and the consequences are huge. Now instead of being productive we need to baby-sit our server hardware too. That is why we abstracted the hardware away in the first place!
Who will buy a Lenovo ThinkSystem SR950 V3 server for a million dollars if Proxmox 'does not like the color of it' 5 years later?
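To illustrate what I mean: a single binary can contain both an optimised and a portable code path and pick one at run time, depending on what the CPU reports. This is only a sketch of the general technique (it is not taken from the Ceph or gf-complete sources, and the fast variant is just a placeholder):

Code:
/* dispatch_sketch.c - one binary, two code paths, chosen at run time.
 * Build: gcc -O2 -o dispatch_sketch dispatch_sketch.c
 * Illustration only; NOT taken from Ceph or gf-complete. */
#include <stdio.h>
#include <stddef.h>

/* Portable fallback: works on any x86-64 CPU. */
static void xor_region_generic(unsigned char *dst, const unsigned char *src,
                               size_t len)
{
    for (size_t i = 0; i < len; i++)
        dst[i] ^= src[i];
}

/* "Fast" variant: real code would use SSSE3/AVX intrinsics here;
 * this placeholder just stands in for the optimised path. */
static void xor_region_fast(unsigned char *dst, const unsigned char *src,
                            size_t len)
{
    xor_region_generic(dst, src, len);
}

/* Pick an implementation once, based on what the CPU actually supports. */
static void (*select_xor(void))(unsigned char *, const unsigned char *, size_t)
{
    __builtin_cpu_init();
    if (__builtin_cpu_supports("ssse3"))
        return xor_region_fast;
    return xor_region_generic;
}

int main(void)
{
    unsigned char a[16] = { 1, 2, 3 }, b[16] = { 4, 5, 6 };
    void (*xor_region)(unsigned char *, const unsigned char *, size_t);

    xor_region = select_xor();
    xor_region(a, b, sizeof(a));
    printf("a[0] = %u (no illegal instruction on an older CPU)\n", a[0]);
    return 0;
}

The other option, shipping two packages built with different -march targets, is just the build-time version of the same idea.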
 
Hi,
The point is just that I want to decide what hardware my organisation uses. It should not be Proxmox or (in this case) Ceph (or the person who actually compiled the Ceph binaries) who does.
It's not like we chose to break support for this older hardware, and if it can be fixed without downsides, we will. For more information, see: https://forum.proxmox.com/threads/p...orking-on-amd-opteron-2427.129613/post-568822