Every status on the PVE management page shows as 'unknown', and this happens frequently

zhousp666

New Member
Jan 6, 2024
I have tried installing every major version (6.x, 7.x, and 8.x) on my server.

After using it for a while (for example, while installing a virtual machine, switching to another browser tab and then switching back), every status on the page shows as 'unknown'.

Whenever the 'unknown' status appears, running the command "systemctl status pvestatd" restores it.

However, the problem comes back after a while, and then the only option is to unplug the server's power and restart it.

Could it be an issue with my server's hardware?

My motherboard is an A520SD4-ITX paired with an AMD 5600G CPU.

Virtualization is already enabled in the motherboard's BIOS. Are there any other settings that need to be configured?

What other information do I need to provide?
Snipaste_2024-01-06_17-19-14.png
Snipaste_2024-01-06_17-25-21.png
Snipaste_2024-01-06_17-30-34.png
Snipaste_2024-01-06_17-30-58.png
 
Hi zhousp666,

Welcome to the forums!

I'll answer in English and let you do the translation yourself; I'm not sure Google Translate would produce anything legible. On the other hand, the translations in your post make sense.
Could it be an issue with my server's hardware?
Hardware would be my first guess (RAM or power supply).
My motherboard is A520SD4-ITX
This one? http://www.onda.cn/MotherBoard_Specifications.aspx?id=508


I have tried installing all versions, including 6.X, 7.X, and 8.X, on my server.
With new or recent hardware, it should not be necessary to try so many versions. It does show that the problem is not tied to one specific installation of Proxmox on your machine, which again points to a hardware problem.

What other information do I need to provide?
How familiar are you with PC troubleshooting in general, and Linux troubleshooting in particular?

Proxmox runs on Debian under the hood. You could log in via SSH in one window and run journalctl -f (or tail -f /var/log/syslog if rsyslog is installed) to show system events while they are happening, and then use the web interface in your browser until a problem occurs. Then have a look in the SSH session to see what messages are shown in the log.

Another thing you could provide is the output of dmesg ( dmesg > dmesg.log ); attach the file to your next post.
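
A minimal sketch of the two steps above, assuming a host that logs to the systemd journal. The function names follow_system_log and save_dmesg are made up for this example; journalctl -f and dmesg are the standard tools.

```shell
#!/bin/sh
# Sketch only: the helper names are hypothetical.

# Follow system events live; stop with Ctrl-C.
# Assumes the systemd journal is in use. With rsyslog installed,
# `tail -f /var/log/syslog` shows much the same stream.
follow_system_log() {
    journalctl -f
}

# Save the kernel ring buffer to a file that can be attached to a post.
# Reading the ring buffer may require root.
save_dmesg() {
    dmesg > "${1:-dmesg.log}"
}
```

Run follow_system_log in one SSH session while you work in the web interface, and call save_dmesg (optionally with a file name) once the problem has shown up.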

Good luck debugging!
 
This one:
http://www.onda.cn/MotherBoard_Specifications.aspx?id=604

Previously, I used a small multi-WAN soft-router box with an N5105 CPU, also running PVE, and it ran without problems for several years.

Now I need to run multiple virtual machines, so I assembled this higher-spec hardware specifically for PVE.

My hardware information is attached. As for logs from when the problem occurs, I have just installed syslog and will provide them the next time it happens.
 

Attachments

  • lshw.log
    35.2 KB · Views: 0
  • dmesg.log
    121 KB · Views: 1
My hardware information is attached. As for logs from when the problem occurs, I have just installed syslog and will provide them the next time it happens.

I think your post was flagged as spam. Unfortunately, it seems to happen with some frequency to new users, especially after editing a post.

Don't worry: it is inconvenient, but your post should reappear after a while (I don't know how often moderators check the queue over the weekend).

As a result, I have not yet been able to view the attachments.
 
The display of CPU clock and NVMe temperature etc. is not standard. You may have made a mistake or the adjustment is not compatible with the new PVE version. I therefore recommend that you remove the changes and see if it works.

Since I personally don't know what changes are necessary for this, I can't judge whether it has anything to do with it. Basically, I would recommend not changing anything in PVE and leaving it as it is. If you are interested in the values displayed there, you should resort to monitoring.
 
Uploaded again.
 

Attachments

  • hardware.zip
    32 KB · Views: 3
My problem has reproduced, and this time it is more serious. The syslog recorded the errors, but I still hope the Proxmox team can help me find a solution.

At 2024-01-07T10:36:14 I started installing a virtual machine through vncproxy:106. The image attached to the virtual machine was linuxmint-21.2-xfce-64bit.iso, about 2.8 GB.

Once the installation was under way, I switched browser tabs to view other pages. When I switched back to the PVE page, the installation had hit an error.

After I waited a little longer, the virtual machine status on the page turned into a question mark (unknown), the PVE page stopped responding, and I could no longer ping the PVE host's IP address.

All I could do was unplug the power and restart the server. After the reboot, however, none of the virtual machines would start. The log shows the status "activating LV 'pve/data' failed".

So I hope someone can help me work out what exactly (which hardware component) is causing the problem.

The syslog is attached.
 

Attachments

  • syslog.log
    733.8 KB · Views: 2
Hi zhousp666,

At 2024-01-07T10:36:14 I started installing a virtual machine through vncproxy:106. The image attached to the virtual machine was linuxmint-21.2-xfce-64bit.iso, about 2.8 GB.

In the syslog I scrolled to 2024-01-07T10:36:04, just before your installation. I found VM 106 and expected it to be a new VM, but it had already been created the day before, correct? VM 106 is then updated, a task is run (the installation, I guess), and VM 106 starts successfully.

Ten minutes later there is a segmentation fault:

2024-01-07T10:45:04.412626+08:00 pve kernel: [51110.472280] vgs[132245]: segfault at 560edf891 ip 0000560eddc8e9e1 sp 00007ffc6b0a2170 error 6 in lvm[560eddac0000+1d3000] likely on CPU 0 (core 0, socket 0)
2024-01-07T10:45:04.412629+08:00 pve kernel: [51110.472292] Code: f0 48 89 50 08 48 8b 45 f8 48 8b 55 f0 48 89 10 90 c9 c3 55 48 89 e5 48 89 7d f8 48 8b 45 f8 48 8b 00 48 8b 55 f8 48 8b 52 08 <48> 89 50 08 48 8b 45 f8 48 8b 40 08 48 8b 55 f8 48 8b 12 48 89 10
2024-01-07T10:45:04.415878+08:00 pve systemd[1]: pvestatd.service: Main process exited, code=killed, status=6/ABRT
2024-01-07T10:45:04.416051+08:00 pve systemd[1]: pvestatd.service: Failed with result 'signal'.

Did you notice it in the log?

From there on, there are quite a few faults when reading swap; search for "_swap_info_get: Bad swap file entry".
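
One way to get an overview instead of scrolling: a small sketch that counts the lines matching the messages quoted above. The function name summarize_faults and the path in the usage comment are assumptions for illustration.

```shell
#!/bin/sh
# Count log lines that point at memory or swap corruption.
# Usage: summarize_faults /path/to/syslog.log
summarize_faults() {
    log="$1"
    for pattern in 'segfault at' 'Bad swap file entry' 'pvestatd.service: Failed'; do
        # grep -c prints the number of matching lines (0 if none);
        # "|| true" keeps the zero-match exit status from aborting the script.
        count=$(grep -c -F -- "$pattern" "$log" || true)
        printf '%s: %s\n' "$pattern" "$count"
    done
}
```

A handful of swap errors right after the first segfault and then silence would look different from errors that keep piling up, so the counts are worth comparing between runs.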

My first guess would be faulty RAM. If you have both sockets occupied, you could try running first with one DIMM, then with the other, for a while. I am not sure how memory is allocated, but imagine there is a faulty chip at '40%' of memory: you won't see an error as long as only '35%' of memory is used.

Memtest86+ can check your RAM. You have to start it from boot; you cannot run it while Linux is running. Memtester is a less powerful alternative that can run in the background.
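
If you want to try memtester before rebooting into Memtest86+, here is a rough sketch. It assumes the memtester package is installed; the helper name suggest_memtester_mib and the choice of half the available memory are my own assumptions, not a recommendation from the memtester documentation.

```shell
#!/bin/sh
# Suggest a memtester size in MiB: half of MemAvailable, read from a
# /proc/meminfo-style file (defaults to the real /proc/meminfo).
suggest_memtester_mib() {
    awk '/^MemAvailable:/ { print int($2 / 1024 / 2) }' "${1:-/proc/meminfo}"
}

# Example (run as root so memtester can lock the memory it tests):
#   memtester "$(suggest_memtester_mib)M" 1
```

Because memtester can only test memory the kernel is willing to hand out, a clean run does not prove the modules are fine; it is a quick first check, nothing more.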

In either case, it may detect errors, but sometimes a RAM chip is only 'somewhat' broken and an error for example only occurs when the system is under load and the temperature is high.

Second guess, since the faults are in swap file entries: maybe a problem with the storage device. After repairing LVM and checking filesystems, you could try running without swap for a while.
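
A sketch of how running without swap might look, assuming swap is configured through /etc/fstab as on a default Debian-style install. The helper name comment_out_swap is hypothetical; it only prints a modified copy so you can review it before touching the real file.

```shell
#!/bin/sh
# Print an fstab with every active swap entry commented out.
# Reads the file given as $1 and writes the result to stdout, so the
# original is never modified in place.
comment_out_swap() {
    awk '
        /^[[:space:]]*#/ { print; next }         # keep existing comments
        $3 == "swap"     { print "#" $0; next }  # field 3 is the fs type
        { print }
    ' "$1"
}

# Example, as root:
#   swapoff -a                                        # stop using swap now
#   comment_out_swap /etc/fstab > /tmp/fstab.noswap   # review, then copy back
```

swapoff -a only lasts until the next reboot, which is why the fstab entry matters if you want to test over several days.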

Good luck!
 
Damn it! I scanned with Memtest86+ and it found a memory fault. I bought these two memory modules in May last year and have only now discovered the problem. I want to kill the seller who sold me these memory modules.
 

Attachments

  • Screenshot 2024-01-07 17-24-32.png
    704.9 KB · Views: 6
kill the seller
Your RAM would still be bad and your karma worse.

A less ... drastic ... remedy would be to configure BadRAM in GRUB.

*edit* I just saw... 781278 errors; that's quite a lot to copy into GRUB. How about the warranty on your modules?
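
For reference, a sketch of what the BadRAM setting looks like, assuming the host boots via GRUB. The address,mask pairs are placeholders and not values from this thread; the real ones would have to come from the memory test results. It only makes sense for a few small faulty regions, not for hundreds of thousands of errors.

```shell
# /etc/default/grub (fragment). The address,mask pairs below are
# placeholders, not values taken from this thread.
GRUB_BADRAM="0x01234567,0xfefefefe,0x89abcdef,0xefefefef"

# Regenerate the GRUB configuration afterwards, then reboot:
#   update-grub
```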
 
The RAM I bought is second-hand, so for now I will have to make do with 16 GB of memory. I will buy new modules when RAM gets cheaper.
 
