Please, can you help me

chinaxin

New Member
Dec 17, 2024
We run a hyper-converged PVE 8.2 + Ceph architecture in production on 8 physical nodes with identical hardware. The server model is Dell R750.
For the past six months or so, one physical node has been crashing at irregular intervals: sometimes after a little more than ten hours, sometimes after a day, sometimes after several days or even a month. iDRAC reports no hardware alarms. We asked Dell engineers to troubleshoot the hardware, and they told us there were no hardware issues, so we turned to the software side. We removed the faulty host from the cluster, reinstalled PVE on it, and added it back to the cluster. A few days later it started freezing again, so this did not solve the problem.

The cluster is also connected to Huawei FC SAN storage, and only the faulty node failed to connect to the FC SAN automatically after power-on. We replaced the FC HBA card, which appears to have fixed the automatic connection to the storage, but the node still crashed on the second and third days after the replacement. After that there were no crashes for a month.

Recently we replaced the server with a new R750. Because the CPU of the new machine was faulty, we installed the old CPU in the new server, and the next day it crashed again. How can I troubleshoot this freeze?

Description of the crash: the network goes down, the console does not respond when Enter is pressed on the attached monitor, and the server must be restarted to restore the system and the network.
I'm sorry, I'm from China and I'm using Microsoft Translator to communicate with you; please bear with me. I'll attach some screenshots of the fault.

Since the fault screenshots may be incomplete, I will also attach a screenshot taken when the new machine crashed.

I have also attached the system log from the most recent crash.
 

Attachments

  • error_log.txt (355.4 KB)
  • 微信图片_20241217154930.png (165.5 KB)
  • 微信图片_20241217154916.png (69.4 KB)
  • 微信图片_20241217154911.png (83.3 KB)
  • 微信图片_20241217154902.png (71.9 KB)
To be honest, this sounds like a hardware issue. I'm unsure how the forum could help in your case.

Do you have a Proxmox VE subscription for your production cluster?
 
We do not have a subscription license. I personally think it may be related to the CPU, and I am currently preparing to try replacing it with a new one.
 
Neither the Proxmox software nor this forum will be able to fix a faulty CPU. Push your Dell engineers and local hardware support.

If you run Proxmox VE in production, I recommend having a valid subscription so you can get proper support from Proxmox.

Disclaimer: I'm just a user. I'm not affiliated with Proxmox in any way!
 
Hi @chinaxin , based on error messages you reported, you may be running into a Kernel software bug.
Specifically: Dec 14 19:35:35 PVE05 kernel: BUG: kernel NULL pointer dereference, address: 0000000000000000
The system is telling us something and there is no reason to ignore it.
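
To find every such message in a saved log rather than a single line, a small script can help. The following is a minimal sketch in Python, not an official Proxmox tool: the pattern list is an illustrative assumption and far from exhaustive, and the function names `find_crash_lines` and `scan_file` are hypothetical.

```python
import re

# Strings that commonly mark a kernel fault in syslog/journal output.
# This list is an assumption for illustration; extend it as needed.
CRASH_PATTERNS = [
    r"BUG: kernel NULL pointer dereference",
    r"Kernel panic",
    r"Oops:",
    r"general protection fault",
    r"Hardware Error",
]
CRASH_RE = re.compile("|".join(CRASH_PATTERNS))


def find_crash_lines(lines):
    """Return (line_number, text) pairs for lines matching a crash pattern."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if CRASH_RE.search(line):
            hits.append((number, line.rstrip("\n")))
    return hits


def scan_file(path):
    """Print every matching line of the log file at `path`."""
    # errors="replace" keeps the scan going if the log has odd bytes.
    with open(path, errors="replace") as handle:
        for number, text in find_crash_lines(handle):
            print(f"{number}: {text}")
```

Calling `scan_file("error_log.txt")` on the attached log would list each matching line with its line number, which makes it easier to see whether the same fault precedes every freeze.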

Here is a report of similar symptoms: https://forum.proxmox.com/threads/kernel-panics-after-6-8-8-2-pve-upgrade.150084/

A forum is not the best medium for troubleshooting kernel crash issues. As suggested in the above thread, you may want to report this to PVE through the proper bug reporting channel.

Or, given the Ubuntu pedigree of the Kernel, follow this guide: https://wiki.ubuntu.com/Kernel/Bugs

A short term solution/experiment may be to downgrade the Kernel.


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
Thank you for your suggestion; we will seriously consider it.
 
 
The first crash happened on PVE 8.1.4. After removing the failed host from the cluster, I reinstalled it with PVE 8.1.2 (because I couldn't find the ISO for 8.1.4). After reinstalling, I added it back to the cluster, and when installing Ceph I updated the repository, which caused my entire 8.1.4 cluster to be upgraded to 8.2.7. So the failed host has crashed on both 8.1.4 and 8.2.7, while the other 7 physical nodes in the cluster have never crashed.
 
The crashes also appear to be unrelated to the workload: since the crashes were discovered, no production workload has run on this host. There is only one container on it, and whether the container is running or shut down, the physical host still crashes occasionally.
 
Thank you for your help. I'll continue troubleshooting based on the information you provided, but I have some information to share. The first crash happened on PVE 8.1.4. I then removed the failed host from the cluster and reinstalled it with PVE 8.1.2 (because I couldn't find the ISO for 8.1.4). After reinstalling, I added it back to the cluster, and when installing Ceph I updated the repository, which caused my entire 8.1.4 cluster to be upgraded to 8.2.7. So the failed host has crashed on both 8.1.4 and 8.2.7.
 
