Cluster service unexplained shutdown

bert64

I have a cluster of two nodes running PVE 2.0, both upgraded from 1.9...

The old primary node works fine, no problems...
The second node, however, runs fine for a while and then the cluster services seem to shut down for no apparent reason...

There is nothing obvious in the logs on the secondary node:
Apr 2 09:39:18 jay corosync[299830]: [TOTEM ] Retransmit List: 2cb49 2cb4a 2cb4b
Apr 2 09:39:25 jay corosync[299830]: [TOTEM ] Retransmit List: 2cb49 2cb4a 2cb4b
<then nothing>

The primary node shows logs from corosync indicating the second node has gone away...

On the second box, it looks like the cluster service has simply shut down without explanation:
jay:~# pvecm nodes
cman_tool: Cannot open connection to cman, is it running ?
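
When it gets into this state, corosync itself seems to have died as well. In the meantime I can bring the node back by hand; a rough sketch, assuming the stock PVE 2.0 init scripts:

jay:~# pgrep -l corosync                # confirms corosync is really gone
jay:~# /etc/init.d/pve-cluster restart  # pmxcfs first, so /etc/pve is mounted
jay:~# /etc/init.d/cman start           # then cman/corosync
jay:~# pvecm nodes                      # check the node rejoins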

Is there anything I can do to increase logging verbosity and try to find out what's going on?
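
(In case it helps anyone else: on PVE 2.0 the cman configuration lives in /etc/pve/cluster.conf, and a logging element there turns on corosync debug output. A minimal sketch, with placeholder cluster name and version number; bump config_version and restart cman so it takes effect:

<cluster name="mycluster" config_version="3">
  <logging debug="on"/>
  ...
</cluster>

The extra output goes to syslog.)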
 
With debug logging on, I see this on node 1:
Apr 2 22:23:22 silentbob corosync[1896]: [TOTEM ] Retransmit List: 235e8 235d8 235d9 235da 235db 235dc 235dd 235de 235df 235e0 235e1 235e2 235e3 235e4 235e5 235e6 235e7 235e9
Apr 2 22:23:22 silentbob corosync[1896]: [TOTEM ] Retransmit List: 235e9 235d8 235d9 235da 235db 235dc 235dd 235de 235df 235e0 235e1 235e2 235e3 235e4 235e5 235e6 235e7 235e8
Apr 2 22:23:22 silentbob corosync[1896]: [CLM ] CLM CONFIGURATION CHANGE
Apr 2 22:23:22 silentbob corosync[1896]: [CLM ] New Configuration:
Apr 2 22:23:22 silentbob corosync[1896]: [CLM ]    r(0) ip(192.168.75.1)
Apr 2 22:23:22 silentbob corosync[1896]: [CLM ] Members Left:
Apr 2 22:23:22 silentbob corosync[1896]: [CLM ]    r(0) ip(192.168.75.2)
Apr 2 22:23:22 silentbob corosync[1896]: [CLM ] Members Joined:
Apr 2 22:23:22 silentbob corosync[1896]: [CMAN ] quorum lost, blocking activity
...
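
Those long Retransmit List runs right before the membership change usually point at dropped totem traffic (multicast by default) on the cluster network. A quick check, assuming omping is installed on both nodes: run it on both at the same time,

silentbob:~# omping -c 60 -i 1 silentbob jay
jay:~# omping -c 60 -i 1 silentbob jay

Loss on the multicast lines would point at the switch, with IGMP snooping being a common culprit.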

At the same time on node 2:

an endless flood of:
Apr 1 22:23:10 jay pmxcfs[169924]: [status] crit: cpg_send_message failed: 9
Apr 1 22:23:10 jay pmxcfs[169924]: [status] crit: cpg_send_message failed: 9
Apr 1 22:23:10 jay pmxcfs[169924]: [status] crit: cpg_send_message failed: 9
Apr 1 22:23:10 jay pmxcfs[169924]: [status] crit: cpg_send_message failed: 9
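
(Error 9 here should be CS_ERR_BAD_HANDLE from corosync's cs_error_t, i.e. pmxcfs has lost its CPG connection because corosync itself exited. If the corosync dev headers are installed, the numbering can be confirmed with:

jay:~# grep CS_ERR_BAD_HANDLE /usr/include/corosync/corotypes.h    # should show "= 9")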

and some very odd messages:
Apr 1 22:17:50 jay kernel: kvm: 5633: cpu0 unhandled rdmsr: 0x345
Apr 1 22:17:50 jay kernel: kvm: 5633: cpu0 unhandled rdmsr: 0x38d
Apr 1 22:18:22 jay qm[445972]: shutdown VM 122: UPID:jay:0006CE14:00F4D8B1:4F78C61E:qmshutdown:122:root@pam:
Apr 1 22:18:22 jay qm[445953]: <root@pam> starting task UPID:jay:0006CE14:00F4D8B1:4F78C61E:qmshutdown:122:root@pam:
Apr 1 22:18:33 jay kernel: vmbr1: port 5(tap122i0) entering disabled state
Apr 1 22:18:33 jay kernel: vmbr1: port 5(tap122i0) entering disabled state
Apr 1 22:18:34 jay qm[445953]: <root@pam> end task UPID:jay:0006CE14:00F4D8B1:4F78C61E:qmshutdown:122:root@pam: OK
Apr 1 22:18:35 jay pvedaemon[169795]: worker 441008 finished
Apr 1 22:18:35 jay pvedaemon[169795]: starting 1 worker(s)
Apr 1 22:18:35 jay pvedaemon[169795]: worker 446026 started
Apr 1 22:18:38 jay vnstatd[144901]: Interface "tap122i0" disabled.
Apr 1 22:18:46 jay qm[446066]: <root@pam> starting task UPID:jay:0006CE74:00F4E1F8:4F78C636:qmstart:122:root@pam:
Apr 1 22:18:46 jay qm[446068]: start VM 122: UPID:jay:0006CE74:00F4E1F8:4F78C636:qmstart:122:root@pam:
Apr 1 22:18:47 jay kernel: device tap122i0 entered promiscuous mode
Apr 1 22:18:47 jay kernel: vmbr1: port 5(tap122i0) entering forwarding state
Apr 1 22:18:47 jay qm[446066]: <root@pam> end task UPID:jay:0006CE74:00F4E1F8:4F78C636:qmstart:122:root@pam: OK
Apr 1 22:18:48 jay vnstatd[144901]: Interface "tap122i0" enabled.
Apr 1 22:18:57 jay kernel: tap122i0: no IPv6 routers present
Apr 1 22:19:09 jay kernel: kvm: 446103: cpu0 unhandled rdmsr: 0x345
Apr 1 22:19:09 jay kernel: kvm: 446103: cpu0 unhandled rdmsr: 0x38d
Apr 1 22:19:22 jay pvedaemon[169795]: worker 441393 finished
Apr 1 22:19:22 jay pvedaemon[169795]: starting 1 worker(s)
Apr 1 22:19:22 jay pvedaemon[169795]: worker 446269 started

It seems to imply that VM 122 was shut down and restarted, which certainly didn't happen...
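
For what it's worth, the hex fields in those UPIDs decode consistently with the syslog lines (the pid and start-time fields are hex; a sketch):

jay:~# printf '%d\n' 0x0006CE14      # pid field of the qmshutdown UPID
445972                               # matches qm[445972] above
jay:~# date -u -d @$((16#4F78C61E))  # start-time field, a hex epoch
Sun Apr  1 21:18:22 UTC 2012         # 22:18:22 in this box's local time, matching the log

So the qm task really did run at that time; the question is what triggered it.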
 
pve-manager: 2.0-54 (pve-manager/2.0/4b59ea39)
running kernel: 2.6.32-6-pve
proxmox-ve-2.6.32: 2.0-63
pve-kernel-2.6.32-10-pve: 2.6.32-63
lvm2: 2.02.88-2pve2
clvm: 2.02.88-2pve2
corosync-pve: 1.4.1-1
openais-pve: 1.1.4-2
libqb: 0.10.1-2
redhat-cluster-pve: 3.1.8-3
resource-agents-pve: 3.9.2-3
fence-agents-pve: 3.1.7-2
pve-cluster: 1.0-26
qemu-server: 2.0-33
pve-firmware: 1.0-15
libpve-common-perl: 1.0-23
libpve-access-control: 1.0-17
libpve-storage-perl: 2.0-16
vncterm: 1.0-2
vzctl: 3.0.30-2pve2
vzprocps: 2.0.11-2
vzquota: 3.0.12-3
pve-qemu-kvm: 1.0-8
ksm-control-daemon: 1.1-1

Both nodes return identical output from this command (pveversion -v)...
 
Same here: everything is working, but cman seems to crash for no reason after some time...

pve-manager: 2.0-57 (pve-manager/2.0/ff6cd700)
running kernel: 2.6.32-11-pve
proxmox-ve-2.6.32: 2.0-65
pve-kernel-2.6.32-11-pve: 2.6.32-65
lvm2: 2.02.88-2pve2
clvm: 2.02.88-2pve2
corosync-pve: 1.4.1-1
openais-pve: 1.1.4-2
libqb: 0.10.1-2
redhat-cluster-pve: 3.1.8-3
resource-agents-pve: 3.9.2-3
fence-agents-pve: 3.1.7-2
pve-cluster: 1.0-26
qemu-server: 2.0-37
pve-firmware: 1.0-15
libpve-common-perl: 1.0-25
libpve-access-control: 1.0-17
libpve-storage-perl: 2.0-17
vncterm: 1.0-2
vzctl: 3.0.30-2pve2
vzprocps: 2.0.11-2
vzquota: 3.0.12-3
pve-qemu-kvm: 1.0-9
ksm-control-daemon: 1.1-1