Please, can you help me

chinaxin · Tuesday at 09:04

We run a hyper-converged PVE 8.2 + Ceph architecture in production on 8 physical nodes with identical hardware. The server model is Dell R750.

Over the past six months or so, one physical node has been crashing at irregular intervals: sometimes after a little more than ten hours, sometimes after a day, sometimes after several days or even a month. iDRAC shows no hardware alarms. We asked Dell engineers to check the hardware, and they told us there were no hardware issues.

So we turned to the software. We removed the faulty host from the cluster, reinstalled PVE on it, and added it back to the cluster. A few days later it started freezing again, so this did not solve the problem.

The cluster is also connected to Huawei FC SAN storage, and only the faulty node failed to connect to the FC SAN automatically after powering on. We replaced the FC HBA card, which resolved the problem of not connecting to the storage automatically, but the node crashed again on the second and third days after the replacement. It then ran for one month without crashing.

Recently we replaced the machine with a new R750 server. Because the CPU of the new machine had failed, we installed the old CPU in the new server, and the next day it crashed again. How can I troubleshoot this freeze issue?

Description of the crash: the network goes down, pressing Enter on the attached monitor gets no response, and the server must be restarted to restore the system and the network.
I'm sorry, I'm from China and I'm using Microsoft Translator to communicate with you; please bear with me. I'll attach some screenshots of the fault.

Since the fault screenshots may not be complete, I will also attach a screenshot from when the new machine crashed.

I have also attached the system log from the most recent crash.

cave · Tuesday at 09:29

To be honest, this sounds like a hardware issue. I'm unsure how the forum could help in your case.

Do you have a Proxmox VE subscription for your production cluster?

chinaxin · Tuesday at 09:37

cave said:
To be honest, this sounds like a hardware issue. I'm unsure how the forum could help in your case.

Do you have a Proxmox VE subscription for your production cluster?

We do not have a subscription license. I personally think it may be related to the CPU, and I am currently preparing to try replacing it with a new one.

cave · Tuesday at 10:03

chinaxin said:
I personally think it may be related to the CPU, and I am currently preparing to try replacing it with a new one.

Proxmox Software or this forum won't be able to solve your faulty CPU. Kick your Dell Engineers and local Hardware Support.

If you run Proxmox VE in production, I recommend getting a valid subscription so you have proper support from them.

Disclaimer: I'm just a User. I'm not affiliated with Proxmox in any way!

bbgeek17 · Tuesday at 15:44

Hi @chinaxin , based on error messages you reported, you may be running into a Kernel software bug.
Specifically: Dec 14 19:35:35 PVE05 kernel: BUG: kernel NULL pointer dereference, address: 0000000000000000
The system is telling us something and there is no reason to ignore it.

Here is a report of similar symptoms: https://forum.proxmox.com/threads/kernel-panics-after-6-8-8-2-pve-upgrade.150084/

A forum is not the best medium for troubleshooting Kernel crash issues. As suggested in the above thread you may want to report this to PVE via proper bug reporting method.

Or, given the Ubuntu pedigree of the Kernel, follow this guide: https://wiki.ubuntu.com/Kernel/Bugs

A short term solution/experiment may be to downgrade the Kernel.
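
To check how often the fault appears in a saved log, a small script can help. The following is only a minimal sketch: it flags lines containing the message quoted above, plus two other strings (`Oops`, `Kernel panic`) that are my assumption about what else may be worth catching. The function name `find_kernel_faults` is hypothetical.

```python
import re

# Signatures to flag. The first is the message quoted in this thread;
# the other two are assumptions about related kernel fault markers.
PATTERNS = [
    re.compile(r"BUG: kernel NULL pointer dereference"),
    re.compile(r"\bOops\b"),
    re.compile(r"Kernel panic"),
]


def find_kernel_faults(lines):
    """Return (line_number, text) pairs for lines matching any pattern."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if any(pattern.search(line) for pattern in PATTERNS):
            hits.append((number, line.rstrip("\n")))
    return hits
```

It accepts any iterable of lines, for example an open file holding the kernel log of the previous boot saved with `journalctl -k -b -1` (this assumes the journal is persistent on that host). Comparing the hits across several crashes may show whether the same fault precedes each freeze.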

Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox

chinaxin · Wednesday at 16:07

cave said:
Proxmox Software or this forum won't be able to solve your faulty CPU. Kick your Dell Engineers and local Hardware Support.

If you run Proxmox VE in production, I recommend getting a valid subscription so you have proper support from them.

Disclaimer: I'm just a User. I'm not affiliated with Proxmox in any way!

Thank you for your suggestion; we will seriously consider it.

chinaxin · Wednesday at 16:29

bbgeek17 said:
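Hi @chinaxin , based on error messages you reported, you may be running into a Kernel software bug.
Specifically: Dec 14 19:35:35 PVE05 kernel: BUG: kernel NULL pointer dereference, address: 0000000000000000
The system is telling us something and there is no reason to ignore it.

Here is a report of similar symptoms: https://forum.proxmox.com/threads/kernel-panics-after-6-8-8-2-pve-upgrade.150084/

A forum is not the best medium for troubleshooting Kernel crash issues. As suggested in the above thread you may want to report this to PVE via proper bug reporting method.

Or, given the Ubuntu pedigree of the Kernel, follow this guide: https://wiki.ubuntu.com/Kernel/Bugs

A short term solution/experiment may be to downgrade the Kernel.

Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox

Thank you for your help. I will continue troubleshooting based on the information you provided, but I have some information to share. The first crash occurred on PVE 8.1.4. I then removed the faulty host from the cluster and reinstalled it with PVE 8.1.2 (because I could not find an ISO for 8.1.4). After reinstalling, I added it back to the cluster, and when installing Ceph I updated the repositories, which caused my entire 8.1.4 cluster to be upgraded to 8.2.7. So the faulty host has crashed on both 8.1.4 and 8.2.7.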

chinaxin · Wednesday at 16:32

The first crash occurred on PVE 8.1.4. After removing the failed host from the cluster, I reinstalled it with PVE 8.1.2 (because I couldn't find an ISO for 8.1.4). After reinstalling, I added it back to the cluster, and when installing Ceph I updated the repositories, which caused my entire 8.1.4 cluster to be upgraded to 8.2.7. So the failed host has crashed on both 8.1.4 and 8.2.7, but the other 7 physical nodes in my cluster have never crashed.

chinaxin · Wednesday at 16:41

The crashes also have nothing to do with the workload: since the crashes were discovered, this host has carried no production workload. There is only one container on it, and the host crashes occasionally whether the container is running or shut down.

chinaxin · Wednesday at 16:42

bbgeek17 said:
Hi @chinaxin , based on error messages you reported, you may be running into a Kernel software bug.
Specifically: Dec 14 19:35:35 PVE05 kernel: BUG: kernel NULL pointer dereference, address: 0000000000000000
The system is telling us something and there is no reason to ignore it.

Here is a report of similar symptoms: https://forum.proxmox.com/threads/kernel-panics-after-6-8-8-2-pve-upgrade.150084/

A forum is not the best medium for troubleshooting Kernel crash issues. As suggested in the above thread you may want to report this to PVE via proper bug reporting method.

Or, given the Ubuntu pedigree of the Kernel, follow this guide: https://wiki.ubuntu.com/Kernel/Bugs

A short term solution/experiment may be to downgrade the Kernel.

Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox

Thank you for your help. I will continue troubleshooting based on the information you provided, but I have some information to share. The first crash occurred on PVE 8.1.4. I then removed the faulty host from the cluster and reinstalled it with PVE 8.1.2 (because I couldn't find an ISO for 8.1.4). After reinstalling, I added it back to the cluster, and when installing Ceph I updated the repositories, which caused my entire 8.1.4 cluster to be upgraded to 8.2.7. So the faulty host has crashed on both 8.1.4 and 8.2.7.
