[SOLVED] Host machine hangs: Rejecting I/O to offline device

fireon

Hello,

We have a cluster with 5 nodes and about 45 VMs. With kernels newer than pve-kernel-2.6.32-34-pve we had problems with cluster communication, so we stayed on that kernel, which works fine for us. However, it had another problem: when we took snapshots of VMs, the filesystem inside the VM sporadically crashed, mostly under heavy I/O load. This only happened on Linux guests. At the same time the filesystem on the host machine crashed as well. So we decided to upgrade to a 3.x kernel. A few days ago we installed pve-kernel-3.10.0-10-pve on all nodes. Everything looked fine: the filesystem no longer crashes during updates with snapshots, and the cluster can communicate (far fewer Totem messages in syslog). But yesterday one of the host machines crashed completely. On the HP iLO console we saw the whole screen full of the message shown in the attached screenshot.

Here is the info about the machines:
Code:
proxmox-ve-2.6.32: 3.4-157 (running kernel: 3.10.0-10-pve)
pve-manager: 3.4-6 (running version: 3.4-6/102d4547)
pve-kernel-2.6.32-39-pve: 2.6.32-157
pve-kernel-3.10.0-10-pve: 3.10.0-34
pve-kernel-2.6.32-34-pve: 2.6.32-140
lvm2: 2.02.98-pve4
clvm: 2.02.98-pve4
corosync-pve: 1.4.7-1
openais-pve: 1.1.4-3
libqb0: 0.11.1-2
redhat-cluster-pve: 3.2.0-2
resource-agents-pve: 3.9.2-4
fence-agents-pve: 4.0.10-2
pve-cluster: 3.0-18
qemu-server: 3.4-6
pve-firmware: 1.1-4
libpve-common-perl: 3.0-24
libpve-access-control: 3.0-16
libpve-storage-perl: 3.0-33
pve-libspice-server1: 0.12.4-3
vncterm: 1.1-8
vzctl: 4.0-1pve6
vzprocps: 2.0.11-2
vzquota: 3.1-2
pve-qemu-kvm: 2.2-10
ksm-control-daemon: 1.1-1
glusterfs-client: 3.5.2-1

Three of the machines are HP ML350 G6 and two are HP DL XXX G6.

It's a little bit strange: before we upgraded to this kernel we did have these problems, but a whole machine never crashed, and we never had these error messages.

Thanks a lot for your help
Best Regards
 

Attachments

  • error-device.jpg
Hmm, yes, interesting, it could be the same issue... But we need this kernel for the UPS agent, and of the 2.6 kernels we can only run 2.6.32-34, nothing newer, because the cluster crashes after some hours. With 3.10.0-10 this problem is gone. OK, we will wait; hopefully it was only a one-off crash, but it could also be a hardware error. Let's see!


 
This turned out to be a problem with a hardware RAID controller. A similar issue was caused by a damaged switch port on the cluster VLAN.
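
For anyone who runs into the same message: below is a minimal sketch of how one could count "rejecting I/O to offline device" lines per SCSI device in a saved kernel log, to see which device the controller dropped. It assumes the usual kernel line format with an `sd H:C:T:L:` prefix; the function name `count_offline_rejections` and the sample lines are only for illustration and do not come from our machines.

```python
import re
from collections import Counter

# Assumed log format: "... kernel: sd 2:0:0:0: rejecting I/O to offline device".
# Matching is case-insensitive because some logs capitalise the message.
OFFLINE_RE = re.compile(
    r"sd (\d+:\d+:\d+:\d+): rejecting I/O to offline device",
    re.IGNORECASE,
)


def count_offline_rejections(lines):
    """Return a Counter mapping SCSI address (H:C:T:L) to number of rejections."""
    counts = Counter()
    for line in lines:
        match = OFFLINE_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Illustrative sample lines (made up, not taken from the affected hosts).
sample = [
    "Jun  1 10:00:00 node1 kernel: sd 2:0:0:0: rejecting I/O to offline device",
    "Jun  1 10:00:01 node1 kernel: sd 2:0:0:0: Rejecting I/O to offline device",
    "Jun  1 10:00:02 node1 kernel: EXT4-fs error (device dm-3): some other message",
]
sample_counts = count_offline_rejections(sample)
```

To use it on a real log, pass an open file handle instead of the sample list, for example the syslog file of the affected node. If every hit points at the same address, the disk or the controller behind it is the first thing to check.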
 
