Proxmox server unreachable!

emanuelebruno

Renowned Member
May 1, 2012
Catania
emanuelebruno.it
Hi all,

Today I panicked because the server became unreachable while I was reading e-mail.
I couldn't ping it, so I restarted it using the remote reboot function (I don't know why it takes 40 minutes to reboot every time; the last reboot was more than 6 months ago...).

I am very worried that there is a hardware problem (such as a disk that is about to fail), so I am asking the community for advice on diagnosing the server: which logs might help me figure out whether the problem is with Proxmox (software) or with the hardware?
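
As a starting point for the log question, here is a minimal Python sketch that scans kernel/syslog text for lines that often accompany hardware trouble. The pattern list and the default paths (/var/log/kern.log and /var/log/syslog, the usual Debian locations) are assumptions for illustration, not an exhaustive or authoritative set. A match is only a hint about where to look, and a hard lockup or panic may leave nothing in the logs at all.

```python
import re
from pathlib import Path

# Illustrative patterns only (an assumption, not a complete list).
SUSPICIOUS = {
    "disk": re.compile(r"I/O error|medium error|failed command|exception Emask", re.I),
    "memory/cpu": re.compile(r"machine check|\bmce:|\bEDAC\b", re.I),
    "kernel": re.compile(r"kernel panic|\bOops\b|\bBUG:|blocked for more than", re.I),
}


def scan_lines(lines):
    """Return (category, line) for every line that matches a pattern."""
    hits = []
    for line in lines:
        for category, pattern in SUSPICIOUS.items():
            if pattern.search(line):
                hits.append((category, line.rstrip("\n")))
                break  # report each line once, under the first matching category
    return hits


def scan_files(paths=("/var/log/kern.log", "/var/log/syslog")):
    """Scan the given log files, skipping any that are missing or unreadable."""
    hits = []
    for path in map(Path, paths):
        try:
            with path.open(errors="replace") as fh:
                hits.extend(scan_lines(fh))
        except OSError:
            continue
    return hits


if __name__ == "__main__":
    for category, line in scan_files():
        print(f"[{category}] {line}")
```

Lines tagged "disk" point towards the drive or controller, "memory/cpu" towards RAM or the CPU, and "kernel" could be either software or hardware, so those need to be read in context.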

pve-manager: 2.1-1 (pve-manager/2.1/f9b0f63a)
running kernel: 2.6.32-11-pve
proxmox-ve-2.6.32: 2.0-66
pve-kernel-2.6.32-11-pve: 2.6.32-66
lvm2: 2.02.95-1pve2
clvm: 2.02.95-1pve2
corosync-pve: 1.4.3-1
openais-pve: 1.1.4-2
libqb: 0.10.1-2
redhat-cluster-pve: 3.1.8-3
resource-agents-pve: 3.9.2-3
fence-agents-pve: 3.1.7-2
pve-cluster: 1.0-26
qemu-server: 2.0-39
pve-firmware: 1.0-15
libpve-common-perl: 1.0-27
libpve-access-control: 1.0-21
libpve-storage-perl: 2.0-18
vncterm: 1.0-2
vzctl: 3.0.30-2pve5
vzprocps: 2.0.11-2
vzquota: 3.0.12-3
pve-qemu-kvm: 1.0-9
ksm-control-daemon: 1.1-1

Below are the hard drive test results: do you think the drive is failing?

View attachment smartctl.zip
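
For reading the attached output, a small Python sketch along these lines could pull the warning signs out of the attribute table that `smartctl -A` prints. The watch list (reallocated, pending and uncorrectable sectors with a non-zero raw value) is a commonly used rule of thumb and an assumption here, not a definitive failure test; the function names are made up for this example.

```python
import re

# Attributes often watched for non-zero raw values (an assumed heuristic).
WATCHED = {"Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable"}


def parse_smart_attributes(text):
    """Parse the attribute table from `smartctl -A` output into a list of dicts."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows start with a numeric ID and have at least 10 columns.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        if not (fields[3].isdigit() and fields[5].isdigit()):
            continue
        raw = re.match(r"\d+", fields[9])  # raw value may carry extra text
        rows.append({
            "id": int(fields[0]),
            "name": fields[1],
            "value": int(fields[3]),
            "thresh": int(fields[5]),
            "raw": int(raw.group()) if raw else None,
        })
    return rows


def smart_warnings(rows):
    """Return human-readable warnings for rows that look suspicious."""
    out = []
    for r in rows:
        if r["thresh"] > 0 and r["value"] <= r["thresh"]:
            out.append(f'{r["name"]}: normalized value {r["value"]} '
                       f'at or below threshold {r["thresh"]}')
        elif r["name"] in WATCHED and r["raw"]:
            out.append(f'{r["name"]}: raw value {r["raw"]}')
    return out
```

To use it, save the attribute table to a text file, read it into a string, and print `smart_warnings(parse_smart_attributes(text))`. An empty result does not prove the drive is healthy.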

Thanks for your help.
Sincerely,
E.Bruno
 
Well, my guess is as good as your guess.

How old is the hardware? What kind of hardware? Single disk or RAID?

To be honest, I didn't even look at your SMART test. Just last week I had a drive on its way out (the RAID was slowing down at odd times and doing funny things), but SMART showed everything was fine, and a bad block scan showed all fine as well. So I watched the drive lights during my lunch break while making the array read from all sorts of places on the disks ... AND THERE it was ... the light on drive two suddenly stuck on.

So yes, it is great to have a RAID with a hot spare plugged in, but it is not a fail-safe.

If you have a bad feeling about the machine, make sure your backups are up to date and OK.
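
The "send the disks to all sorts of places and watch for a stall" trick can be roughly approximated in software. This is only a sketch, assuming a Linux host with Python 3: it reads small blocks at random offsets from a large file or block device and reports any read that takes suspiciously long. The function name, the 0.5 second cut-off and the sample count are arbitrary assumptions. Reads may be served from the page cache rather than the disk, and on a RAID it cannot tell you which member drive stalled, so it does not replace watching the drive lights.

```python
import os
import random
import time


def probe_read_latency(path, samples=200, block_size=4096,
                       slow_seconds=0.5, seed=None):
    """Read blocks at random offsets; return (offset, seconds) for slow reads."""
    rng = random.Random(seed)
    slow = []
    fd = os.open(path, os.O_RDONLY)  # read-only, so the probe never writes
    try:
        # Seeking to the end gives the size for regular files and block devices.
        size = os.lseek(fd, 0, os.SEEK_END)
        if size < block_size:
            raise ValueError("target is smaller than one block")
        last_block = size // block_size - 1
        for _ in range(samples):
            offset = rng.randint(0, last_block) * block_size
            start = time.monotonic()
            os.pread(fd, block_size, offset)
            elapsed = time.monotonic() - start
            if elapsed >= slow_seconds:
                slow.append((offset, elapsed))
    finally:
        os.close(fd)
    return slow
```

Call it with the path of a large file on the suspect storage (or the device itself, which needs root) and look at whether the slow offsets cluster in one region or appear at random.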
 
How old is the hardware? What kind of hardware? Single disk or RAID?

Yes, a bit more detail would help.
Disks are the most at risk; if they are in a RAID, keep a spare unit. Memory can also fail, and either can make a system go crazy. The same goes for power supplies (better if redundant and hot-swappable), and a UPS will help stabilize the power too.
The best thing would be to have an identical "mirror" node, so that you can find out exactly what is going wrong.
However, if you have a cluster of even 2 nodes with shared storage and good (tested) backups, you should be fine. When disaster strikes, you will only have a short downtime for the VMs on the failed node.

Marco
 
I discovered that the UPS software on my PVE host sometimes causes kernel panics:
1. if I disconnect the USB keyboard, or
2. if the USB voltage drops for some reason.