Cluster service unexplained shutdown

bert64

I have a cluster of 2 nodes running PVE 2.0, which were upgraded from 1.9...

The old primary node works fine, no problems...
The second node, however, will run fine for a while, and then the cluster services seem to shut down for no apparent reason...

There is nothing obvious in the logs on the secondary node:
Apr 2 09:39:18 jay corosync[299830]: [TOTEM ] Retransmit List: 2cb49 2cb4a 2cb4b
Apr 2 09:39:25 jay corosync[299830]: [TOTEM ] Retransmit List: 2cb49 2cb4a 2cb4b
<then nothing>

The primary node shows logs from corosync indicating the second node has gone away...

On the second box, it looks like the cluster service has simply shut down without explanation:
jay:~# pvecm nodes
cman_tool: Cannot open connection to cman, is it running ?

Is there anything I can do to increase logging verbosity and try to find out what's going on?
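One way to get more detail (a sketch, assuming the standard red hat cluster `cluster.conf` schema that PVE 2.0's cman stack uses — the cluster name and `config_version` below are placeholders, use your own and remember to increment the version) is to add a `<logging>` element to `/etc/pve/cluster.conf` and restart the cluster services:

```xml
<!-- /etc/pve/cluster.conf (fragment) - increment config_version after editing -->
<cluster name="mycluster" config_version="3">
  <!-- debug="on" raises cman/corosync verbosity; to_syslog="yes" routes it to syslog -->
  <logging debug="on" to_syslog="yes"/>
  <!-- rest of existing config unchanged -->
</cluster>
```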
 
With debug logging on, I see on node 1:
Apr 2 22:23:22 silentbob corosync[1896]: [TOTEM ] Retransmit List: 235e8 235d8 235d9 235da 235db 235dc 235dd 235de 235df 235e0 235e1 235e2 235e3 235e4 235e5 235e6 235e7 235e9
Apr 2 22:23:22 silentbob corosync[1896]: [TOTEM ] Retransmit List: 235e9 235d8 235d9 235da 235db 235dc 235dd 235de 235df 235e0 235e1 235e2 235e3 235e4 235e5 235e6 235e7 235e8
Apr 2 22:23:22 silentbob corosync[1896]: [CLM ] CLM CONFIGURATION CHANGE
Apr 2 22:23:22 silentbob corosync[1896]: [CLM ] New Configuration:
Apr 2 22:23:22 silentbob corosync[1896]: [CLM ] #011r(0) ip(192.168.75.1)
Apr 2 22:23:22 silentbob corosync[1896]: [CLM ] Members Left:
Apr 2 22:23:22 silentbob corosync[1896]: [CLM ] #011r(0) ip(192.168.75.2)
Apr 2 22:23:22 silentbob corosync[1896]: [CLM ] Members Joined:
Apr 2 22:23:22 silentbob corosync[1896]: [CMAN ] quorum lost, blocking activity
...

At the same time on node 2:

endless floods of:
Apr 1 22:23:10 jay pmxcfs[169924]: [status] crit: cpg_send_message failed: 9
Apr 1 22:23:10 jay pmxcfs[169924]: [status] crit: cpg_send_message failed: 9
Apr 1 22:23:10 jay pmxcfs[169924]: [status] crit: cpg_send_message failed: 9
Apr 1 22:23:10 jay pmxcfs[169924]: [status] crit: cpg_send_message failed: 9

and some very odd messages:
Apr 1 22:17:50 jay kernel: kvm: 5633: cpu0 unhandled rdmsr: 0x345
Apr 1 22:17:50 jay kernel: kvm: 5633: cpu0 unhandled rdmsr: 0x38d
Apr 1 22:18:22 jay qm[445972]: shutdown VM 122: UPID:jay:0006CE14:00F4D8B1:4F78C61E:qmshutdown:122:root@pam:
Apr 1 22:18:22 jay qm[445953]: <root@pam> starting task UPID:jay:0006CE14:00F4D8B1:4F78C61E:qmshutdown:122:root@pam:
Apr 1 22:18:33 jay kernel: vmbr1: port 5(tap122i0) entering disabled state
Apr 1 22:18:33 jay kernel: vmbr1: port 5(tap122i0) entering disabled state
Apr 1 22:18:34 jay qm[445953]: <root@pam> end task UPID:jay:0006CE14:00F4D8B1:4F78C61E:qmshutdown:122:root@pam: OK
Apr 1 22:18:35 jay pvedaemon[169795]: worker 441008 finished
Apr 1 22:18:35 jay pvedaemon[169795]: starting 1 worker(s)
Apr 1 22:18:35 jay pvedaemon[169795]: worker 446026 started
Apr 1 22:18:38 jay vnstatd[144901]: Interface "tap122i0" disabled.
Apr 1 22:18:46 jay qm[446066]: <root@pam> starting task UPID:jay:0006CE74:00F4E1F8:4F78C636:qmstart:122:root@pam:
Apr 1 22:18:46 jay qm[446068]: start VM 122: UPID:jay:0006CE74:00F4E1F8:4F78C636:qmstart:122:root@pam:
Apr 1 22:18:47 jay kernel: device tap122i0 entered promiscuous mode
Apr 1 22:18:47 jay kernel: vmbr1: port 5(tap122i0) entering forwarding state
Apr 1 22:18:47 jay qm[446066]: <root@pam> end task UPID:jay:0006CE74:00F4E1F8:4F78C636:qmstart:122:root@pam: OK
Apr 1 22:18:48 jay vnstatd[144901]: Interface "tap122i0" enabled.
Apr 1 22:18:57 jay kernel: tap122i0: no IPv6 routers present
Apr 1 22:19:09 jay kernel: kvm: 446103: cpu0 unhandled rdmsr: 0x345
Apr 1 22:19:09 jay kernel: kvm: 446103: cpu0 unhandled rdmsr: 0x38d
Apr 1 22:19:22 jay pvedaemon[169795]: worker 441393 finished
Apr 1 22:19:22 jay pvedaemon[169795]: starting 1 worker(s)
Apr 1 22:19:22 jay pvedaemon[169795]: worker 446269 started

It seems to imply that VM 122 was shut down and restarted, which certainly didn't happen...
 
pve-manager: 2.0-54 (pve-manager/2.0/4b59ea39)
running kernel: 2.6.32-6-pve
proxmox-ve-2.6.32: 2.0-63
pve-kernel-2.6.32-10-pve: 2.6.32-63
lvm2: 2.02.88-2pve2
clvm: 2.02.88-2pve2
corosync-pve: 1.4.1-1
openais-pve: 1.1.4-2
libqb: 0.10.1-2
redhat-cluster-pve: 3.1.8-3
resource-agents-pve: 3.9.2-3
fence-agents-pve: 3.1.7-2
pve-cluster: 1.0-26
qemu-server: 2.0-33
pve-firmware: 1.0-15
libpve-common-perl: 1.0-23
libpve-access-control: 1.0-17
libpve-storage-perl: 2.0-16
vncterm: 1.0-2
vzctl: 3.0.30-2pve2
vzprocps: 2.0.11-2
vzquota: 3.0.12-3
pve-qemu-kvm: 1.0-8
ksm-control-daemon: 1.1-1

Both nodes return identical output from `pveversion -v`...
 
Same here: everything is working, but cman seems to crash for no reason after some time...

pve-manager: 2.0-57 (pve-manager/2.0/ff6cd700)
running kernel: 2.6.32-11-pve
proxmox-ve-2.6.32: 2.0-65
pve-kernel-2.6.32-11-pve: 2.6.32-65
lvm2: 2.02.88-2pve2
clvm: 2.02.88-2pve2
corosync-pve: 1.4.1-1
openais-pve: 1.1.4-2
libqb: 0.10.1-2
redhat-cluster-pve: 3.1.8-3
resource-agents-pve: 3.9.2-3
fence-agents-pve: 3.1.7-2
pve-cluster: 1.0-26
qemu-server: 2.0-37
pve-firmware: 1.0-15
libpve-common-perl: 1.0-25
libpve-access-control: 1.0-17
libpve-storage-perl: 2.0-17
vncterm: 1.0-2
vzctl: 3.0.30-2pve2
vzprocps: 2.0.11-2
vzquota: 3.0.12-3
pve-qemu-kvm: 1.0-9
ksm-control-daemon: 1.1-1
 
