Host crash or freeze

Dariu

New Member
Nov 9, 2022
Hello.

Proxmox 7.2-11

My host freezes or crashes even though about 50% of the RAM is free. Sometimes it happens after 2 days, other times after 26 days, and in about 90% of cases during the night.

In the first case, only the web UI and SSH freeze, but the host still responds to ping.
https://postimg.cc/K4M2CHfd

In the second case, the host crashes completely, probably IRQ related.
https://postimg.cc/Dms9xbyh

What other useful info (e.g. log files) should I post?
Please advise.

Thanks.
 
The second photo cannot be accessed.

The problems are probably also visible in the syslog files (/var/log/syslog*), which are plain text and therefore easier to investigate.

AFAICS the first one is a protection error, which can be caused by buggy software (is anything additionally installed on the host?) or by a hardware defect, probably memory. I recommend running a memtest.
 
Hello Richard

I attached the pictures here; please confirm whether you can open them.

I also attached syslog.zip; what should I search for?

NFS is installed on the host, since it doesn't work in LXC (kernel stuff). The hardware worked fine for months with Debian 10 (no virtualization). Anyway, I will start a memtest overnight and come back with the result. How can I identify the faulty hardware?

Thanks.
 

Attachments

  • photo_2022-11-09_09-06-50.jpg (208.9 KB)
  • photo_2022-11-09_09-07-24.jpg (208.9 KB)
  • syslog.zip (291.9 KB)
I can access the photos. At first glance, both appear identical to each other and to the first photo from the first post.

The syslog shows more or less the same (which is quite normal), but as said, it is easier to investigate, and I found:
Code:
Nov  7 04:06:52 zeus kernel: [1930674.728628] Fixing recursive fault but reboot is needed!
This confirms the conclusion from my first answer: either a hardware (memory) error or buggy third-party software.
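As a rough sketch of what to search for in the syslog files, the following Python script scans /var/log/syslog* for kernel lines like the one quoted above. Only the "Fixing recursive fault" string comes from this log; the other patterns are common kernel fault markers and are an assumption about what may show up here. The function names are made up for illustration.

```python
import glob
import gzip
import re

# The first pattern is the line found in this syslog. The remaining ones are
# common kernel fault markers and are assumptions, not taken from the log.
FAULT_PATTERNS = re.compile(
    r"Fixing recursive fault|general protection fault|Kernel panic|Oops|BUG:|Call Trace"
)


def open_log(path):
    """Open a plain or gzip-compressed (rotated) log file as text."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt", errors="replace")
    return open(path, "r", errors="replace")


def find_kernel_faults(lines):
    """Return the kernel log lines that match one of the fault patterns."""
    return [
        line.rstrip("\n")
        for line in lines
        if " kernel: " in line and FAULT_PATTERNS.search(line)
    ]


def scan_logs(pattern="/var/log/syslog*"):
    """Scan every file matching the glob and return (path, line) pairs."""
    hits = []
    for path in sorted(glob.glob(pattern)):
        with open_log(path) as handle:
            hits.extend((path, line) for line in find_kernel_faults(handle))
    return hits


if __name__ == "__main__":
    for path, line in scan_logs():
        print(f"{path}: {line}")
```

Run it as root on the host (or point `scan_logs` at the unpacked syslog.zip) and compare the timestamps of the hits with the times of the freezes.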
 
So the same hardware ran for months with Debian 10 + NFS + a good deal of other software, without virtualization and without problems, but after I install Proxmox 7.2 it is the fault of the hardware or third-party software?
I'm sorry, but I don't see the logic in this.
 
As mentioned, it can also be caused by a software bug; since there is no known issue like this in Proxmox, I suspect installed third-party software. Moreover, with a memory error it is a matter of chance whether the faulty area is actually in use. I recommend running a memtest. If you post a pvereport of your system, I'll check whether anything unusual is configured that might cause problems.
 
I left memtest running in a loop for more than 24 hours without any reported problem. Should I test longer? If so, for how long?
The pvereport file is attached; I replaced private info with xxxxxx.
 

What looks remarkable to me is that a LUKS-encrypted device is mounted and currently uses ~200 GB of disk space, but I cannot see any usage in the VMs. Is it possibly only used via a mount point in container 109? Of course this should work, but it is a rather rarely used combination. I would therefore try not using this mount point any more, or stop it temporarily, and see whether the phenomenon disappears.
 
The LUKS partition is mounted only on the host and shared over NFS.
The same LUKS partition and NFS share were mounted under Debian 10 for months without problems before I installed Proxmox.
Is there any known incompatibility between LUKS and Proxmox?
 
No, there is not. However, it is a rather rarely used setup, so a bug cannot be excluded. We'll have a look and try it out (and let you know about the results, but it will take some time). An indication that this is indeed the cause would be if the phenomenon disappears as soon as the container no longer uses this mount point.
 
I wish I could provide more useful info; unfortunately this happens very randomly, sometimes after 2 days, other times after 26 days.
I can't keep the LUKS partition unmounted for test purposes.
Additional info:
Reading topics here on the forum, I found that some people use the ",aio=native" setting for VMs (not containers). Since I have one such VM, I have now enabled this setting for the first time, also for testing purposes.
 
