Ceph issue after 7.4 to 8.0 migration

kifeo

Hi

I've recently upgraded from 7.4 to 8.0.
I have a cluster of 5 nodes, and 2 of them show the failure below. I should mention that these two are HP N54L machines.

All Ceph-related processes are failing like this:

Code:
0> 2023-06-28T12:19:08.096+0200 7f5efbaf3a00 -1 *** Caught signal (Illegal instruction) **
 in thread 7f5efbaf3a00 thread_name:ceph-mon

 ceph version 17.2.6 (810db68029296377607028a6c6da1ec06f5a2b27) quincy (stable)
 1: /lib/x86_64-linux-gnu/libc.so.6(+0x3bf90) [0x7f5efc193f90]
 2: gf_init_hard()
 3: gf_init_easy()
 4: galois_init_default_field()
 5: jerasure_init()
 6: __erasure_code_init()
 7: (ceph::ErasureCodePluginRegistry::load(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::ErasureCodePlugin**, std::ostream*)+0x2b5) [0x55a42d3fb605]
 8: (ceph::ErasureCodePluginRegistry::preload(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::ostream*)+0x9f) [0x55a42d3fbbaf]
 9: (global_init_preload_erasure_code(ceph::common::CephContext const*)+0x7c2) [0x55a42cea8f92]
 10: main()
 11: /lib/x86_64-linux-gnu/libc.so.6(+0x2718a) [0x7f5efc17f18a]
 12: __libc_start_main()
 13: _start()
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

The same happens for a monitor or an OSD, and also after a reboot.
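
In case it helps narrow this down: the backtrace points at the gf-complete/jerasure erasure-code plugin, so I suspect the new build uses a CPU instruction the N54L does not have. A quick check one can run (just a sketch; which exact flag the new build needs is my assumption) is to compare the SIMD flags of an affected node against a working one:

Code:
# Run on an affected N54L node and on a working node, then diff the output.
# The N54L's Turion II CPU has (as far as I know) no SSSE3 or SSE4.x, so any
# of those flags present only on the working node is a likely culprit.
grep -m1 '^flags' /proc/cpuinfo | tr ' ' '\n' | grep -E '^(sse|ssse|avx|pclmul|popcnt)' | sort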

Any hint on how I could resolve this?


Thanks
 

Attachments

  • error-mon.noname-a-osd.txt
  • error-mon.noname-a-mon.txt
Proxmox VE is based on Debian and, unfortunately, Debian package management is not designed to allow downgrades across major releases. Running 7.4 nodes in the same cluster as 8.0 nodes for a long time is not recommended either, so if you really plan to go back, it might be easiest to re-install the nodes with 7.4 one by one. I really can't tell whether there will be a build of Ceph that supports these old CPUs; that would depend on what the actual issue is.
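
To help pin down what the actual issue is, the exact instruction that raised the SIGILL can be inspected. A minimal sketch, assuming systemd-coredump is available to capture the crash; the plugin path below is my assumption of where Debian installs the jerasure plugin and may differ on your system:

Code:
# If systemd-coredump is not yet installed: apt install systemd-coredump, then
# reproduce the crash. Afterwards, open the most recent ceph-mon dump in gdb
# and print the faulting instruction:
coredumpctl gdb ceph-mon
(gdb) x/i $pc

# Alternatively, disassemble the erasure-code plugin the backtrace points at
# (gf_init_hard/gf_init_easy come from gf-complete, loaded via this plugin):
objdump -d /usr/lib/x86_64-linux-gnu/ceph/erasure-code/libec_jerasure.so | grep -A3 '<gf_init_hard>:'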
 
This is REALLY bad! Many of us run millions of dollars' worth of hardware. We cannot just replace our whole infrastructure just because someone does not want to compile two different binaries.
This happened with Microsoft, then VMware, and now also Proxmox.
This is REALLY a hit to the trust in Proxmox. I understand that some people install this on $500 laptops, but serious setups do not replace all their hardware just because some people do not want to add one or two compilation parameters.
Maybe this is a sign that we need to move away from Proxmox before it is too late.
 
This is REALLY bad! Many of us run millions of dollars' worth of hardware.
Harsh lesson for anyone. IT teams with "millions of dollars" of responsibility typically don't upgrade without testing EXTENSIVELY, and even then only kicking and screaming. I usually wait as long as I can just to lab an upgrade, much less roll it out to production.
 
The point is just that I want to decide what hardware my organisation uses. It should not be Proxmox or (in this case) Ceph (or the person who actually compiled the Ceph binaries) who decides.
I could understand it if the new software actually required the new hardware, but I do not believe that is the case. It is just a matter of either compiling the binaries with backward compatibility or producing two different binaries (see the rough sketch at the end of this post).
EXTENSIVE testing will never make the binary able to run on baseline AMD64 again.
This is a slippery slope and the consequences are huge. Now, instead of being productive, we need to baby-sit our server hardware too. That is why we abstracted the hardware away in the first place!
Who will buy a Lenovo ThinkSystem SR950 V3 server for a million dollars if Proxmox 'does not like the color of it' 5 years later?
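
For what it is worth, here is roughly what I mean by "one or two compilation parameters". This is only a sketch: plugin.c is a stand-in file, and which flags the Ceph/gf-complete build actually uses is an assumption on my part.

Code:
# Baseline AMD64 build: uses only the original x86-64 instruction set, so it
# also runs on older Opteron/Turion-class CPUs.
gcc -O2 -march=x86-64 -shared -fPIC -o libec_baseline.so plugin.c

# Raised baseline: the compiler is free to emit SSSE3/SSE4/POPCNT instructions,
# which means SIGILL (Illegal instruction) on CPUs that lack them.
gcc -O2 -march=x86-64-v2 -shared -fPIC -o libec_v2.so plugin.c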
 
Hi,
It's not like we chose to break support for this older hardware, and if it can be fixed without downsides, we will. For more information, see: https://forum.proxmox.com/threads/p...orking-on-amd-opteron-2427.129613/post-568822
 
