Cluster service unexplained shutdown

bert64 · Apr 2, 2012

I have a cluster of 2 nodes running PVE 2.0, which were upgraded from 1.9...

The old primary node works fine, no problems...
The second node however will run fine for a while, and then the cluster services seem to shut down for no apparent reason...

There is nothing obvious in the logs on the secondary node:
Apr 2 09:39:18 jay corosync[299830]: [TOTEM ] Retransmit List: 2cb49 2cb4a 2cb4b
Apr 2 09:39:25 jay corosync[299830]: [TOTEM ] Retransmit List: 2cb49 2cb4a 2cb4b
<then nothing>

The primary node shows logs from corosync indicating the second node has gone away...

On the second box, it looks like the cluster service has simply shut down without explanation:
jay:~# pvecm nodes
cman_tool: Cannot open connection to cman, is it running ?

Is there anything i can do to increase logging verbosity and try to find out whats going on?

dietmar · Apr 2, 2012

bert64 said:
Is there anything i can do to increase logging verbosity and try to find out whats going on?

see 'man cluster.conf' (try to set <logging debug="on"/>

bert64 · Apr 3, 2012

With debug logging on, i see on node 1:
Apr 2 22:23:22 silentbob corosync[1896]: [TOTEM ] Retransmit List: 235e8 235d8 235d9 235da 235db 235dc 235dd 235de 235df 235e0 235e1 235e2 235e3 235e4 235e5 235e6 235e7 235e9
Apr 2 22:23:22 silentbob corosync[1896]: [TOTEM ] Retransmit List: 235e9 235d8 235d9 235da 235db 235dc 235dd 235de 235df 235e0 235e1 235e2 235e3 235e4 235e5 235e6 235e7 235e8
Apr 2 22:23:22 silentbob corosync[1896]: [CLM ] CLM CONFIGURATION CHANGE
Apr 2 22:23:22 silentbob corosync[1896]: [CLM ] New Configuration:
Apr 2 22:23:22 silentbob corosync[1896]: [CLM ] #011r(0) ip(192.168.75.1)
Apr 2 22:23:22 silentbob corosync[1896]: [CLM ] Members Left:
Apr 2 22:23:22 silentbob corosync[1896]: [CLM ] #011r(0) ip(192.168.75.2)
Apr 2 22:23:22 silentbob corosync[1896]: [CLM ] Members Joined:
Apr 2 22:23:22 silentbob corosync[1896]: [CMAN ] quorum lost, blocking activity
...

At the same time on node 2:

endless floods of:
Apr 1 22:23:10 jay pmxcfs[169924]: [status] crit: cpg_send_message failed: 9
Apr 1 22:23:10 jay pmxcfs[169924]: [status] crit: cpg_send_message failed: 9
Apr 1 22:23:10 jay pmxcfs[169924]: [status] crit: cpg_send_message failed: 9
Apr 1 22:23:10 jay pmxcfs[169924]: [status] crit: cpg_send_message failed: 9

and some very odd messages:
Apr 1 22:17:50 jay kernel: kvm: 5633: cpu0 unhandled rdmsr: 0x345
Apr 1 22:17:50 jay kernel: kvm: 5633: cpu0 unhandled rdmsr: 0x38d
Apr 1 22:18:22 jay qm[445972]: shutdown VM 122: UPID:jay:0006CE14:00F4D8B1:4F78C61E:qmshutdown:122:root@pam:
Apr 1 22:18:22 jay qm[445953]: <root@pam> starting task UPID:jay:0006CE14:00F4D8B1:4F78C61E:qmshutdown:122:root@pam:
Apr 1 22:18:33 jay kernel: vmbr1: port 5(tap122i0) entering disabled state
Apr 1 22:18:33 jay kernel: vmbr1: port 5(tap122i0) entering disabled state
Apr 1 22:18:34 jay qm[445953]: <root@pam> end task UPID:jay:0006CE14:00F4D8B1:4F78C61E:qmshutdown:122:root@pam: OK
Apr 1 22:18:35 jay pvedaemon[169795]: worker 441008 finished
Apr 1 22:18:35 jay pvedaemon[169795]: starting 1 worker(s)
Apr 1 22:18:35 jay pvedaemon[169795]: worker 446026 started
Apr 1 22:18:38 jay vnstatd[144901]: Interface "tap122i0" disabled.
Apr 1 22:18:46 jay qm[446066]: <root@pam> starting task UPID:jay:0006CE74:00F4E1F8:4F78C636:qmstart:122:root@pam:
Apr 1 22:18:46 jay qm[446068]: start VM 122: UPID:jay:0006CE74:00F4E1F8:4F78C636:qmstart:122:root@pam:
Apr 1 22:18:47 jay kernel: device tap122i0 entered promiscuous mode
Apr 1 22:18:47 jay kernel: vmbr1: port 5(tap122i0) entering forwarding state
Apr 1 22:18:47 jay qm[446066]: <root@pam> end task UPID:jay:0006CE74:00F4E1F8:4F78C636:qmstart:122:root@pam: OK
Apr 1 22:18:48 jay vnstatd[144901]: Interface "tap122i0" enabled.
Apr 1 22:18:57 jay kernel: tap122i0: no IPv6 routers present
Apr 1 22:19:09 jay kernel: kvm: 446103: cpu0 unhandled rdmsr: 0x345
Apr 1 22:19:09 jay kernel: kvm: 446103: cpu0 unhandled rdmsr: 0x38d
Apr 1 22:19:22 jay pvedaemon[169795]: worker 441393 finished
Apr 1 22:19:22 jay pvedaemon[169795]: starting 1 worker(s)
Apr 1 22:19:22 jay pvedaemon[169795]: worker 446269 started

it seems to imply that vmid 122 was shutdown and restarted, which certainly didn't happen...

dietmar · Apr 3, 2012

Whatis the output of

# pveversion -v

bert64 · Apr 3, 2012

pve-manager: 2.0-54 (pve-manager/2.0/4b59ea39)
running kernel: 2.6.32-6-pve
proxmox-ve-2.6.32: 2.0-63
pve-kernel-2.6.32-10-pve: 2.6.32-63
lvm2: 2.02.88-2pve2
clvm: 2.02.88-2pve2
corosync-pve: 1.4.1-1
openais-pve: 1.1.4-2
libqb: 0.10.1-2
redhat-cluster-pve: 3.1.8-3
resource-agents-pve: 3.9.2-3
fence-agents-pve: 3.1.7-2
pve-cluster: 1.0-26
qemu-server: 2.0-33
pve-firmware: 1.0-15
libpve-common-perl: 1.0-23
libpve-access-control: 1.0-17
libpve-storage-perl: 2.0-16
vncterm: 1.0-2
vzctl: 3.0.30-2pve2
vzprocps: 2.0.11-2
vzquota: 3.0.12-3
pve-qemu-kvm: 1.0-8
ksm-control-daemon: 1.1-1

both nodes return identical output from this command...

dietmar · Apr 3, 2012

Please can you boot with the correct kernel (2.6.32-10-pve) and test again?

bert64 · Apr 3, 2012

If i boot that kernel on these boxes, then none of my images are able to boot (these exact same images run fine on this hardware under 2.6.32-6-pve)

see http://forum.proxmox.com/threads/8940-Kernel-crash-on-boot-with-2-6-32-10-pve-host-kernel

gnumdk · Apr 11, 2012

Same here, all is working but cman seems to crash with no reason after some time...

pve-manager: 2.0-57 (pve-manager/2.0/ff6cd700)
running kernel: 2.6.32-11-pve
proxmox-ve-2.6.32: 2.0-65
pve-kernel-2.6.32-11-pve: 2.6.32-65
lvm2: 2.02.88-2pve2
clvm: 2.02.88-2pve2
corosync-pve: 1.4.1-1
openais-pve: 1.1.4-2
libqb: 0.10.1-2
redhat-cluster-pve: 3.1.8-3
resource-agents-pve: 3.9.2-3
fence-agents-pve: 3.1.7-2
pve-cluster: 1.0-26
qemu-server: 2.0-37
pve-firmware: 1.0-15
libpve-common-perl: 1.0-25
libpve-access-control: 1.0-17
libpve-storage-perl: 2.0-17
vncterm: 1.0-2
vzctl: 3.0.30-2pve2
vzprocps: 2.0.11-2
vzquota: 3.0.12-3
pve-qemu-kvm: 1.0-9
ksm-control-daemon: 1.1-1

dietmar · Apr 11, 2012

gnumdk said:
Same here, all is working but cman seems to crash with no reason after some time...

What is 'some time' - minutes, hours or days?

gnumdk · Apr 11, 2012

dietmar said:
What is 'some time' - minutes, hours or days?

My fault, iptables was not accepting multicast connections...

Search

Search

Cluster service unexplained shutdown

bert64

Member

dietmar

Proxmox Staff Member

bert64

Member

dietmar

Proxmox Staff Member

bert64

Member

dietmar

Proxmox Staff Member

bert64

Member

gnumdk

New Member

dietmar

Proxmox Staff Member

gnumdk

New Member

We value your privacy