Problems with pve-kernel-2.6.32

Tallaril

Hi there,

I have a strange problem with the above-mentioned kernel. I'm running a cluster with 3 machines. Two of them work fine, but the third one is giving me continuous errors in the syslog:

Feb 12 16:22:29 bazinga kernel: [drm:drm_edid_block_valid] *ERROR* EDID checksum is invalid, remainder is 128
Feb 12 16:22:29 bazinga kernel: [drm:drm_edid_block_valid] *ERROR* Raw EDID:
Feb 12 16:22:29 bazinga kernel: <3>ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................
Feb 12 16:22:29 bazinga kernel: <3>ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................
Feb 12 16:22:29 bazinga kernel: <3>ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................
Feb 12 16:22:29 bazinga kernel: <3>ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................
Feb 12 16:22:29 bazinga kernel: <3>ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................
Feb 12 16:22:29 bazinga kernel: <3>ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................
Feb 12 16:22:29 bazinga kernel: <3>ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................
Feb 12 16:22:29 bazinga kernel: <3>ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................
Feb 12 16:22:29 bazinga kernel:
Feb 12 16:22:29 bazinga kernel: [drm:drm_edid_block_valid] *ERROR* EDID checksum is invalid, remainder is 128
Feb 12 16:22:29 bazinga kernel: [drm:drm_edid_block_valid] *ERROR* Raw EDID:
Feb 12 16:22:29 bazinga kernel: <3>ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................
Feb 12 16:22:29 bazinga kernel: <3>ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................
Feb 12 16:22:29 bazinga kernel: <3>ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................
Feb 12 16:22:29 bazinga kernel: <3>ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................
Feb 12 16:22:29 bazinga kernel: <3>ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................
Feb 12 16:22:29 bazinga kernel: <3>ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................
Feb 12 16:22:29 bazinga kernel: <3>ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................
Feb 12 16:22:29 bazinga kernel: <3>ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................


This error started the moment I installed this kernel. The Proxmox installation works, but the cluster configuration does not.

It seems that the FUSE-mounted /etc/pve stops working after "pvecm add".

Has anybody seen this error before?
 
Hi,
Which kernel did you install? Is it a PVE installation or on top of squeeze?

Please post the output of "pveversion -v".

Udo

This is a squeeze system with a pve kernel.

# pveversion -v
pve-manager: 2.0-18 (pve-manager/2.0/16283a5a)
running kernel: 2.6.32-6-pve
proxmox-ve-2.6.32: 2.0-55
pve-kernel-2.6.32-6-pve: 2.6.32-55
lvm2: 2.02.88-2pve1
clvm: 2.02.88-2pve1
corosync-pve: 1.4.1-1
openais-pve: 1.1.4-1
libqb: 0.6.0-1
redhat-cluster-pve: 3.1.8-3
pve-cluster: 1.0-17
qemu-server: 2.0-13
pve-firmware: 1.0-14
libpve-common-perl: 1.0-11
libpve-access-control: 1.0-5
libpve-storage-perl: 2.0-9
vncterm: 1.0-2
vzctl: 3.0.29-3pve8
vzprocps: 2.0.11-2
vzquota: 3.0.12-3
pve-qemu-kvm: 1.0-1
ksm-control-daemon: 1.1-1


The error points to a problem with radeon or DVI, but this is a server in a datacenter: there is no monitor attached, just an onboard graphics chip.
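
For what it's worth, "remainder is 128" is exactly what an all-0xff block produces: a 128-byte EDID block is valid when its bytes sum to 0 modulo 256, and 128 bytes of 0xff sum to 32640, which leaves 128. That pattern would fit a read where nothing answered on the monitor connector, which matches a server with no monitor attached, though that interpretation is an assumption. A minimal Python sketch of the arithmetic (the function name is made up for illustration):

```python
# A base EDID block is 128 bytes long and is considered valid
# when the sum of all its bytes is 0 modulo 256.
EDID_BLOCK_SIZE = 128


def edid_checksum_remainder(block: bytes) -> int:
    """Return the byte sum modulo 256; 0 means the checksum is valid."""
    if len(block) != EDID_BLOCK_SIZE:
        raise ValueError(f"expected {EDID_BLOCK_SIZE} bytes, got {len(block)}")
    return sum(block) % 256


# The syslog dump above shows 8 rows of 16 bytes, all 0xff.
all_ff_block = bytes([0xFF]) * EDID_BLOCK_SIZE
print(edid_checksum_remainder(all_ff_block))  # 128
```

So the message itself looks like a side effect of the missing monitor rather than something related to the cluster problem.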




# lspci -k
00:00.0 Host bridge: Intel Corporation 3200/3210 Chipset DRAM Controller (rev 01)
Subsystem: Super Micro Computer Inc Device d280
Kernel driver in use: i3200_edac
00:01.0 PCI bridge: Intel Corporation 3200/3210 Chipset Host-Primary PCI Express Bridge (rev 01)
00:06.0 PCI bridge: Intel Corporation 3210 Chipset Host-Secondary PCI Express Bridge (rev 01)
00:1a.0 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #4 (rev 02)
Subsystem: Super Micro Computer Inc Device d280
Kernel driver in use: uhci_hcd
00:1a.1 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #5 (rev 02)
Subsystem: Super Micro Computer Inc Device d280
Kernel driver in use: uhci_hcd
00:1a.2 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #6 (rev 02)
Subsystem: Super Micro Computer Inc Device d280
Kernel driver in use: uhci_hcd
00:1a.7 USB Controller: Intel Corporation 82801I (ICH9 Family) USB2 EHCI Controller #2 (rev 02)
Subsystem: Super Micro Computer Inc Device d280
Kernel driver in use: ehci_hcd
00:1c.0 PCI bridge: Intel Corporation 82801I (ICH9 Family) PCI Express Port 1 (rev 02)
00:1c.4 PCI bridge: Intel Corporation 82801I (ICH9 Family) PCI Express Port 5 (rev 02)
00:1c.5 PCI bridge: Intel Corporation 82801I (ICH9 Family) PCI Express Port 6 (rev 02)
00:1d.0 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #1 (rev 02)
Subsystem: Super Micro Computer Inc Device d280
Kernel driver in use: uhci_hcd
00:1d.1 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #2 (rev 02)
Subsystem: Super Micro Computer Inc Device d280
Kernel driver in use: uhci_hcd
00:1d.2 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #3 (rev 02)
Subsystem: Super Micro Computer Inc Device d280
Kernel driver in use: uhci_hcd
00:1d.7 USB Controller: Intel Corporation 82801I (ICH9 Family) USB2 EHCI Controller #1 (rev 02)
Subsystem: Super Micro Computer Inc Device d280
Kernel driver in use: ehci_hcd
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 92)
00:1f.0 ISA bridge: Intel Corporation 82801IR (ICH9R) LPC Interface Controller (rev 02)
Subsystem: Super Micro Computer Inc Device d280
00:1f.2 IDE interface: Intel Corporation 82801IR/IO/IH (ICH9R/DO/DH) 4 port SATA IDE Controller (rev 02)
Subsystem: Super Micro Computer Inc Device d280
Kernel driver in use: ata_piix
00:1f.3 SMBus: Intel Corporation 82801I (ICH9 Family) SMBus Controller (rev 02)
Subsystem: Super Micro Computer Inc Device d280
Kernel driver in use: i801_smbus
00:1f.5 IDE interface: Intel Corporation 82801I (ICH9 Family) 2 port SATA IDE Controller (rev 02)
Subsystem: Super Micro Computer Inc Device d280
Kernel driver in use: ata_piix
00:1f.6 Signal processing controller: Intel Corporation 82801I (ICH9 Family) Thermal Subsystem (rev 02)
Subsystem: Super Micro Computer Inc Device 0000
01:00.0 PCI bridge: Intel Corporation 6700PXH PCI Express-to-PCI Bridge A (rev 09)
01:00.1 PIC: Intel Corporation 6700/6702PXH I/OxAPIC Interrupt Controller A (rev 09)
Subsystem: Super Micro Computer Inc Device d280
01:00.2 PCI bridge: Intel Corporation 6700PXH PCI Express-to-PCI Bridge B (rev 09)
01:00.3 PIC: Intel Corporation 6700PXH I/OxAPIC Interrupt Controller B (rev 09)
Subsystem: Super Micro Computer Inc Device d280
03:01.0 RAID bus controller: 3ware Inc 9550SX SATA-II RAID PCI-X
Subsystem: 3ware Inc 9550SX SATA-II RAID PCI-X
Kernel driver in use: 3w-9xxx
04:00.0 RAID bus controller: 3ware Inc 9650SE SATA-II RAID PCIe (rev 01)
Subsystem: 3ware Inc 9650SE SATA-II RAID PCIe
Kernel driver in use: 3w-9xxx
0d:00.0 Ethernet controller: Intel Corporation 82573E Gigabit Ethernet Controller (Copper) (rev 03)
Subsystem: Super Micro Computer Inc Device 108c
Kernel driver in use: e1000e
0f:00.0 Ethernet controller: Intel Corporation 82573L Gigabit Ethernet Controller
Subsystem: Super Micro Computer Inc Device 109a
Kernel driver in use: e1000e
11:04.0 VGA compatible controller: ATI Technologies Inc ES1000 (rev 02)
Subsystem: Super Micro Computer Inc Device d280
Kernel driver in use: radeon


Anyway, the Proxmox installation is no problem, and as a single node it should work fine (even with this error). But as soon as I try to connect this server to the cluster, all data in /etc/pve disappears and the cluster configuration fails.

The only thing I've found in the log is this (the German message means "Connection refused"):


vestatd[6259]: WARNING: ipcc_send_rec failed: Verbindungsaufbau abgelehnt


This server is connected to the other nodes via an internal address. (IP multicast works fine with the other nodes.)


This is the only machine with this trouble....
 
Hmm,
can you ssh to the other cluster nodes without a password?
Does the output of "pvecm status" show any hints (compare it with the output of the running cluster)?

Udo

This is the output on the master:

# pvecm status
Version: 6.2.0
Config Version: 13
Cluster Name: t4hosting
Cluster Id: 50071
Cluster Member: Yes
Cluster Generation: 60
Membership state: Cluster-Member
Nodes: 2
Expected votes: 2
Total votes: 2
Node votes: 1
Quorum: 2
Active subsystems: 5
Flags:
Ports Bound: 0
Node name: bazinga-2
Node ID: 1
Multicast addresses: 239.192.195.91
Node addresses: 109.234.106.16


And this is the output on the non-working node:

# pvecm status
Version: 6.2.0
Config Version: 12
Cluster Name: t4hosting
Cluster Id: 50071
Cluster Member: Yes
Cluster Generation: 12
Membership state: Cluster-Member
Nodes: 1
Expected votes: 3
Total votes: 1
Node votes: 1
Quorum: 2 Activity blocked
Active subsystems: 2
Flags:
Ports Bound: 0
Node name: bazinga
Node ID: 3
Multicast addresses: 239.192.195.91
Node addresses: 109.234.106.10

Yes, I can connect via ssh without a password.
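
To make the comparison between the two outputs easier, here is a small Python sketch that parses the "Key: value" lines and lists the fields that differ. The helper names are hypothetical, and the sample text is only an excerpt of the two outputs posted above.

```python
def parse_status(text):
    """Turn 'Key: value' lines into a dict; lines without a colon are skipped."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def diff_status(left, right, keys):
    """Return {key: (left_value, right_value)} for every key that differs."""
    return {k: (left.get(k), right.get(k)) for k in keys if left.get(k) != right.get(k)}


# Excerpts copied from the two outputs above.
MASTER = """\
Config Version: 13
Nodes: 2
Expected votes: 2
Total votes: 2
Quorum: 2
"""

FAILING_NODE = """\
Config Version: 12
Nodes: 1
Expected votes: 3
Total votes: 1
Quorum: 2 Activity blocked
"""

keys = ["Config Version", "Nodes", "Expected votes", "Total votes", "Quorum"]
differences = diff_status(parse_status(MASTER), parse_status(FAILING_NODE), keys)
for key, (master_value, node_value) in differences.items():
    print(f"{key}: master={master_value!r} node={node_value!r}")
```

The differing config version and node count suggest the two machines do not see each other as members of the same cluster.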
 
Hi,
Hmm, without quorum it's a no-go.
What happens if you change the expected votes to 1 on the non-working node?
(pvecm expected 1)
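
As a rough illustration of why the node is blocked, here is a Python sketch assuming the usual majority rule (quorum = expected votes // 2 + 1). That rule is an assumption, but it is consistent with the "Expected votes" and "Quorum" values posted in this thread. The function names are made up for the example.

```python
def quorum_threshold(expected_votes):
    """Majority rule: more than half of the expected votes are needed."""
    return expected_votes // 2 + 1


def is_quorate(total_votes, expected_votes):
    """A node may act only if the votes it can see reach the threshold."""
    return total_votes >= quorum_threshold(expected_votes)


# (expected votes, total votes) pairs taken from this thread,
# plus the effect of setting the expected votes to 1.
cases = [(2, 2), (3, 1), (1, 1)]
for expected, total in cases:
    state = "quorate" if is_quorate(total, expected) else "activity blocked"
    print(f"expected={expected} total={total} "
          f"quorum={quorum_threshold(expected)} -> {state}")
```

With 3 expected votes and only 1 visible vote the threshold of 2 is not reached. Lowering the expected votes to 1 drops the threshold to 1, so the single node becomes quorate on its own.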

Udo

The main problem is that the folder /etc/pve on the non-working node looks like this:

/etc/pve# ll
insgesamt 1
-r--r----- 1 root www-data 465 12. Feb 17:43 cluster.conf
lr-xr-x--- 1 root www-data 0 1. Jan 1970 local -> nodes/bazinga
lr-xr-x--- 1 root www-data 0 1. Jan 1970 openvz -> nodes/bazinga/openvz
lr-xr-x--- 1 root www-data 0 1. Jan 1970 qemu-server -> nodes/bazinga/qemu-server


and on the working cluster nodes like this:

/etc/pve# ll
insgesamt 4
-rw-r----- 1 root www-data 451 6. Feb 20:11 authkey.pub
-rw-r----- 1 root www-data 415 12. Feb 18:11 cluster.conf
-rw-r----- 1 root www-data 465 12. Feb 18:11 cluster.conf.old
lrwxr-x--- 1 root www-data 0 1. Jan 1970 local -> nodes/bazinga-2
drwxr-x--- 2 root www-data 0 6. Feb 20:11 nodes
lrwxr-x--- 1 root www-data 0 1. Jan 1970 openvz -> nodes/bazinga-2/openvz
drwx------ 2 root www-data 0 6. Feb 20:11 priv
-rw-r----- 1 root www-data 1533 6. Feb 20:11 pve-root-ca.pem
-rw-r----- 1 root www-data 1679 6. Feb 20:11 pve-www.key
lrwxr-x--- 1 root www-data 0 1. Jan 1970 qemu-server -> nodes/bazinga-2/qemu-server
-rw-r----- 1 root www-data 526 9. Feb 12:17 storage.cfg
-rw-r----- 1 root www-data 89 8. Feb 21:49 user.cfg
-rw-r----- 1 root www-data 1473 11. Feb 14:55 vzdump.cron


The non-working node's folder looks OK until I run pvecm add; after this command it looks as shown above.
 

Hi,
that's normal without quorum! A cluster member without quorum has the /etc/pve filesystem mounted read-only. Once your new node gets quorum (and has joined successfully), the pve-fs will sync with the right data.

Udo
 
All right, thanks, that's clear now.

I've changed it to expected 1, but the error is unchanged. I tried to restart cman:

# /etc/init.d/cman restart
Stopping cluster:
Stopping dlm_controld... [ OK ]
Stopping fenced... [ OK ]
Stopping cman... [ OK ]
Waiting for corosync to shutdown:[ OK ]
Unloading kernel modules... [ OK ]
Unmounting configfs... [ OK ]
Starting cluster:
Checking if cluster has been disabled at boot... [ OK ]
Checking Network Manager... [ OK ]
Global setup... [ OK ]
Loading kernel modules... [ OK ]
Mounting configfs... [ OK ]
Starting cman... /usr/share/cluster/cluster.rng:992: element ref: Relax-NG parser error : Reference PVEVM has no matching definition
/usr/share/cluster/cluster.rng:992: element ref: Relax-NG parser error : Internal found no define for ref PVEVM
Relax-NG schema /usr/share/cluster/cluster.rng failed to compile
[ OK ]
Waiting for quorum... Timed-out waiting for cluster
[FAILED]

That's on the non-working node, of course.
 
I guess I know what the problem is.

The master and the first node are in the same rack and connected to the same switch, but this node is in a different rack and connected to a different switch.

All three nodes are connected via eth1 to their own internal switch.

Is it possible that quorum tries to connect only through eth0 instead of the internal address?

My /etc/hosts looks like this:

127.0.0.1 localhost
109.234.106.16 bazinga-2.t4hosting.de bazinga-2


192.168.10.4 leonidas
192.168.10.2 bazinga-2
192.168.10.3 bazinga

But pvecm status shows:
# pvecm status
Version: 6.2.0
Config Version: 17
Cluster Name: t4hosting
Cluster Id: 50071
Cluster Member: Yes
Cluster Generation: 68
Membership state: Cluster-Member
Nodes: 2
Expected votes: 4
Total votes: 2
Node votes: 1
Quorum: 3 Activity blocked
Active subsystems: 1
Flags:
Ports Bound: 0
Node name: bazinga-2
Node ID: 1
Multicast addresses: 239.192.195.91
Node addresses: 109.234.106.16
 
OK, I fixed it. It was indeed the IP multicast.

The problem was that the external IP had the same short name (bazinga) as the internal one. I've removed the short name from the external entry, and since then it works.
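
For anyone hitting the same thing, a short name that resolves to two addresses can be spotted with a small script. This is only a Python sketch with a hypothetical helper name. The sample is the /etc/hosts content posted earlier in this thread, with its first two entries on separate lines. On a real node you would read /etc/hosts instead.

```python
from collections import defaultdict


def names_with_multiple_addresses(hosts_text):
    """Map each host name to its addresses and keep names with more than one."""
    addresses_by_name = defaultdict(set)
    for raw_line in hosts_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments and blanks
        if not line:
            continue
        address, *names = line.split()
        for name in names:
            addresses_by_name[name].add(address)
    return {
        name: sorted(addresses)
        for name, addresses in addresses_by_name.items()
        if len(addresses) > 1
    }


SAMPLE_HOSTS = """\
127.0.0.1 localhost
109.234.106.16 bazinga-2.t4hosting.de bazinga-2
192.168.10.4 leonidas
192.168.10.2 bazinga-2
192.168.10.3 bazinga
"""

for name, addresses in names_with_multiple_addresses(SAMPLE_HOSTS).items():
    print(f"{name} is listed for: {', '.join(addresses)}")
```

In the sample, bazinga-2 is listed for both the external and the internal address, which is the same kind of ambiguity described above.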

Thanks for your help, Udo :)
 
