Proxmox randomly stopping without a trace in logs

kobuki

Hi,

I've recently been encountering strange and very serious behaviour on one of my Proxmox servers: it sporadically stops. I've set the kernel panic restart timer to a nonzero value, so the server restarts after the failure (no hardware problem seems to be preventing it from restarting). There is absolutely no trace of the cause in the logs: no kernel panic message, no trace, no register dump. Logging suddenly stops at the point of the halt and resumes with kernel boot messages after the automatic restart, so I can't even begin to diagnose the problem. It's a production server in a datacenter, so I can't look at the console when the problem occurs. I'm running kernel 2.6.24-12-pve and can't upgrade, because in the 2.6.32 branch the Promise EX4650 hardware RAID card in the machine is unstable. This has already happened twice within only a few days. After the first time I upgraded to the latest stable Proxmox release, but it had no effect. The problem is unrelated to the load on the server, and I can't spot any specific activity in the logs before the halt.
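For anyone wanting to check the same setting, here is a minimal sketch of reading the panic restart timer. It assumes the standard procfs path `/proc/sys/kernel/panic`; the function name `panic_timer_status` and the optional file argument are only illustrative and not part of any Proxmox tooling.

```shell
# Report whether the kernel is set to reboot automatically after a panic.
# The optional first argument overrides the file to read (hypothetical helper,
# handy for testing); by default the standard procfs entry is used.
panic_timer_status() {
    panic_file="${1:-/proc/sys/kernel/panic}"
    if [ ! -r "$panic_file" ]; then
        echo "cannot read $panic_file" >&2
        return 1
    fi
    # The file holds a single number: seconds to wait before rebooting.
    seconds=$(cat "$panic_file")
    if [ "$seconds" = "0" ]; then
        echo "no automatic reboot after panic (kernel.panic=0)"
    else
        echo "automatic reboot configured (kernel.panic=$seconds)"
    fi
}
```

Calling `panic_timer_status` with no argument reads the live value on the running host. Note that this timer only helps if the kernel actually panics; it cannot explain a halt that leaves no panic message at all.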

Please help me diagnose this problem. There's important customer data on this server, and I don't like the idea of being forced to move this installation to another node. Should I try downgrading to 2.6.18 or upgrading to a newer 2.6.3x kernel? I fear the instability of the RAID card will be a problem again, and I'm not sure 2.6.18 supports this card at all. Should I replace the card with one from a different brand? (I was planning to do that anyway, since I've had problems with this specific model before.)

Proxmox info:

# pveversion -v
pve-manager: 1.7-10 (pve-manager/1.7/5323)
running kernel: 2.6.24-12-pve
proxmox-ve-2.6.24: 1.6-26
pve-kernel-2.6.32-4-pve: 2.6.32-28
pve-kernel-2.6.24-12-pve: 2.6.24-25
qemu-server: 1.1-25
pve-firmware: 1.0-9
libpve-storage-perl: 1.0-16
vncterm: 0.9-2
vzctl: 3.0.24-1pve4
vzdump: 1.2-9
vzprocps: 2.0.11-1dso2
vzquota: 3.0.11-1
pve-qemu-kvm: 0.13.0-2
 
I assume the machine ran stably for some time? If so, a hardware failure is the more likely cause.
 
Yes, it had been running stably for nearly a year. There are no indications of any hardware failure. After the restart I backed up all VMs (several hundred gigabytes), and no failure of any kind was reported anywhere. A backup like this usually uses all RAM temporarily and creates heavy I/O on all disks. I suspect either a driver problem or a kernel security problem, but without logs it's all just guesswork. The RAID card reports no problems either.
 
Hi,

I'm seeing similar issues with my Proxmox VE host at work... system worked fine for 2+ years, then in the last several weeks has halted about once a week, randomly, with zero log info.

Unfortunately the system halted Dec. 24 and I don't have physical access to the machine until Jan. 4. I will probably try changing the kernel unless this thread suggests something else.

Cheers!
 
Unfortunately the situation hasn't gotten better for me either. At intervals of 2-3 weeks I experience sudden restarts (the kernel panic watchdog restarts the box). I'm going to hook up a serial console to a nearby machine so I can capture the kernel messages. I'll get back once I've gathered some usable info.
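Once such a capture exists, the interesting part is whatever the kernel printed just before each restart. The following is only a sketch under assumptions: the server's kernel is booted with a serial console parameter along the lines of `console=ttyS0,115200n8`, the nearby machine saves the serial stream to a plain text file, and each boot starts with the usual `Linux version` banner. The function name `last_lines_before_boot` is made up for illustration.

```shell
# Print the lines that immediately precede each kernel boot banner in a
# captured console log, i.e. the last messages seen before every restart.
# $1 = path to the captured log (assumed to be plain text)
# $2 = number of preceding lines to show (defaults to 20)
last_lines_before_boot() {
    log_file="$1"
    context="${2:-20}"
    # grep -B prints N lines of leading context before each matching line.
    grep -B "$context" 'Linux version' "$log_file"
}
```

For example, `last_lines_before_boot /path/to/serial-capture.log 50` would show the 50 lines before every boot banner in that file (the path is a placeholder). If those lines still contain no oops or panic text, that points away from a kernel-level crash and toward a hard reset or power problem.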
 
