I have updated about 50% of my nodes to the latest version, running kernel 2.6.32-19-pve.
Since doing so, one machine has locked up three times and another has locked up twice; the first lockup happened right after booting the new kernel for the first time.
All of the machines having issues are the ones with the very latest updates; the others are humming along perfectly as they have been for months.
Nothing shows up in the logs; they just abruptly stop on the locked-up machine.
In the logs on the other servers I can see that they detect the failure and fence the locked-up node.
The hardware in the two servers is very different; one is AMD, the other Intel.
Both do have an Areca 1880 and the same model of Mellanox IB PCIe card, and these are the only components common to the two machines that have had issues.
I am going to set up a serial port logger and configure the machines to send kernel messages to the serial port, in hopes of capturing some messages that indicate what the issue is.
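For reference, here is roughly how I plan to configure it (a minimal sketch, assuming GRUB2 and that the logger is attached to the first serial port, ttyS0):

Code:
# /etc/default/grub -- send kernel messages to ttyS0 at 115200 baud
# in addition to the local console
GRUB_CMDLINE_LINUX_DEFAULT="quiet console=tty0 console=ttyS0,115200n8"

# then regenerate the grub config and reboot for it to take effect
update-grub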
Having repeated downtime is not an option for some of our VMs.
Would it be ok to revert to 2.6.32-18-pve temporarily on those machines?
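If reverting is reasonable, this is roughly what I had in mind (assuming the pve-kernel-2.6.32-18-pve package is still available in the repository):

Code:
# install the previous kernel if it is not already on the machine
apt-get install pve-kernel-2.6.32-18-pve
# point GRUB_DEFAULT in /etc/default/grub at the -18 menu entry,
# then rebuild the grub config and reboot into the old kernel
update-grub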
Code:
# pveversion -v
pve-manager: 2.3-13 (pve-manager/2.3/7946f1f1)
running kernel: 2.6.32-19-pve
proxmox-ve-2.6.32: 2.3-93
pve-kernel-2.6.32-13-pve: 2.6.32-72
pve-kernel-2.6.32-14-pve: 2.6.32-74
pve-kernel-2.6.32-11-pve: 2.6.32-66
pve-kernel-2.6.32-12-pve: 2.6.32-68
pve-kernel-2.6.32-19-pve: 2.6.32-93
pve-kernel-2.6.32-7-pve: 2.6.32-60
lvm2: 2.02.95-1pve2
clvm: 2.02.95-1pve2
corosync-pve: 1.4.4-4
openais-pve: 1.1.4-2
libqb: 0.10.1-2
redhat-cluster-pve: 3.1.93-2
resource-agents-pve: 3.9.2-3
fence-agents-pve: 3.1.9-1
pve-cluster: 1.0-36
qemu-server: 2.3-18
pve-firmware: 1.0-21
libpve-common-perl: 1.0-49
libpve-access-control: 1.0-26
libpve-storage-perl: 2.3-6
vncterm: 1.0-3
vzctl: 4.0-1pve2
vzprocps: 2.0.11-2
vzquota: 3.1-1
pve-qemu-kvm: 1.4-8
ksm-control-daemon: 1.1-1