Unable to create a VM on the second node

Paulo Maligaya

Hi,

I have a fresh Proxmox VE 4.2-15 cluster with two nodes running, Proxmox01 and Proxmox02.

Using Proxmox01's UI as my management UI, I'm able to deploy and set up a VM on Proxmox01 with no sweat. However, I'm having an issue deploying a VM on Proxmox02 (using Proxmox01's UI). I was able to create a VM under the Proxmox02 node, but as soon as I click Start, Proxmox02 suddenly goes offline (a red X icon appears next to the node's name in the left pane).

Looking at the corosync and pve-cluster logs on Proxmox01, I noticed this bunch of messages/errors:

---------
Jul 28 03:07:50 proxmox01.ewr1.getcadre.com pmxcfs[109069]: [status] notice: received log

Jul 28 03:07:56 proxmox01.ewr1.getcadre.com corosync[5254]: [TOTEM ] Invalid packet data
Jul 28 03:07:57 proxmox01.ewr1.getcadre.com corosync[5254]: [TOTEM ] FAILED TO RECEIVE
Jul 28 03:07:57 proxmox01.ewr1.getcadre.com corosync[5254]: [TOTEM ] A new membership (10.2.4.1:25220) was formed. Members left: 2
Jul 28 03:07:57 proxmox01.ewr1.getcadre.com corosync[5254]: [TOTEM ] Failed to receive the leave message. failed: 2

Jul 28 03:07:57 proxmox01.ewr1.getcadre.com pmxcfs[109069]: [dcdb] notice: members: 1/109069
Jul 28 03:07:57 proxmox01.ewr1.getcadre.com pmxcfs[109069]: [status] notice: members: 1/109069
Jul 28 03:07:57 proxmox01.ewr1.getcadre.com corosync[5254]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
Jul 28 03:07:57 proxmox01.ewr1.getcadre.com corosync[5254]: [QUORUM] Members[1]: 1
Jul 28 03:07:57 proxmox01.ewr1.getcadre.com pmxcfs[109069]: [status] notice: node lost quorum
Jul 28 03:07:57 proxmox01.ewr1.getcadre.com corosync[5254]: [MAIN ] Completed service synchronization, ready to provide service.
Jul 28 03:07:57 proxmox01.ewr1.getcadre.com corosync[5254]: [TOTEM ] A new membership (10.2.4.1:25224) was formed. Members joined: 2
Jul 28 03:07:57 proxmox01.ewr1.getcadre.com corosync[5254]: [TOTEM ] Invalid packet data

Jul 28 03:07:57 proxmox01.ewr1.getcadre.com corosync[5254]: [TOTEM ] Digest does not match
Jul 28 03:07:57 proxmox01.ewr1.getcadre.com corosync[5254]: [TOTEM ] Received message has invalid digest... ignoring.
Jul 28 03:07:57 proxmox01.ewr1.getcadre.com corosync[5254]: [TOTEM ] Invalid packet data
Jul 28 03:07:58 proxmox01.ewr1.getcadre.com corosync[5254]: [TOTEM ] FAILED TO RECEIVE
Jul 28 03:07:58 proxmox01.ewr1.getcadre.com corosync[5254]: [TOTEM ] A new membership (10.2.4.1:25228) was formed. Members left: 2

Jul 28 03:07:58 proxmox01.ewr1.getcadre.com corosync[5254]: [TOTEM ] Failed to receive the leave message. failed: 2
Jul 28 03:07:58 proxmox01.ewr1.getcadre.com corosync[5254]: [QUORUM] Members[1]: 1
Jul 28 03:07:58 proxmox01.ewr1.getcadre.com corosync[5254]: [MAIN ] Completed service synchronization, ready to provide service.
Jul 28 03:07:58 proxmox01.ewr1.getcadre.com corosync[5254]: [TOTEM ] A new membership (10.2.4.1:25232) was formed. Members joined: 2

Jul 28 03:07:58 proxmox01.ewr1.getcadre.com corosync[5254]: [TOTEM ] FAILED TO RECEIVE
Jul 28 03:07:58 proxmox01.ewr1.getcadre.com corosync[5254]: [TOTEM ] A new membership (10.2.4.1:25236) was formed. Members left: 2
Jul 28 03:07:58 proxmox01.ewr1.getcadre.com corosync[5254]: [TOTEM ] Failed to receive the leave message. failed: 2
Jul 28 03:07:58 proxmox01.ewr1.getcadre.com corosync[5254]: [QUORUM] Members[1]: 1
Jul 28 03:07:58 proxmox01.ewr1.getcadre.com corosync[5254]: [MAIN ] Completed service synchronization, ready to provide service.
Jul 28 03:07:58 proxmox01.ewr1.getcadre.com corosync[5254]: [TOTEM ] A new membership (10.2.4.1:25240) was formed. Members joined: 2
---------


I've also lost the second node when looking from Proxmox01:

root@proxmox01:~# pvecm status
Quorum information
------------------
Date: Thu Jul 28 02:54:29 2016
Quorum provider: corosync_votequorum
Nodes: 1
Node ID: 0x00000001
Ring ID: 25036
Quorate: No

Votequorum information
----------------------
Expected votes: 2
Highest expected: 2
Total votes: 1
Quorum: 2 Activity blocked
Flags:

Membership information
----------------------
Nodeid Votes Name
0x00000001 1 10.2.4.1 (local)



It continues to fill up the logs until I initiate a Stop on the VM via Proxmox02's management UI. After that, from Proxmox01's management UI the second node (Proxmox02) comes back online and the cluster appears to be healthy:

root@proxmox01:~# pvecm status
Quorum information
------------------
Date: Thu Jul 28 03:11:34 2016
Quorum provider: corosync_votequorum
Nodes: 2
Node ID: 0x00000001
Ring ID: 26136
Quorate: Yes

Votequorum information
----------------------
Expected votes: 2
Highest expected: 2
Total votes: 2
Quorum: 2
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000001 1 10.2.4.1 (local)
0x00000002 1 10.2.4.2



Right now, I'm quite puzzled about what caused this, and how I can keep it from happening, since I prefer to control/create the VMs from a single management UI (Proxmox01) without breaking the other node.

TIA!

Cheers,
Paulo
 
Hi,

normally you get this behavior if the network is not reliable or the latency is too high.

please send the output of this command:

omping -c 10000 -i 0.001 -F -q 10.2.4.1 10.2.4.2
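
If that short burst looks clean, it can also help to run a longer, slower test (around 10 minutes) to catch issues such as an IGMP snooping querier timing out on the switch; something like:

omping -c 600 -i 1 -q 10.2.4.1 10.2.4.2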
 
Hey @wolfgang, thanks for the prompt response!

Here's the multicast test between my two nodes:

root@proxmox01:~# omping -c 10000 -i 0.001 -F -q 10.2.4.1 10.2.4.2
10.2.4.2 : waiting for response msg
10.2.4.2 : waiting for response msg
10.2.4.2 : joined (S,G) = (*, 232.43.211.234), pinging
10.2.4.2 : waiting for response msg
10.2.4.2 : server told us to stop

10.2.4.2 : unicast, xmt/rcv/%loss = 9289/9289/0%, min/avg/max/std-dev = 0.049/0.114/0.900/0.037
10.2.4.2 : multicast, xmt/rcv/%loss = 9289/9289/0%, min/avg/max/std-dev = 0.056/0.144/1.101/0.042

root@proxmox02:~# omping -c 10000 -i 0.001 -F -q 10.2.4.1 10.2.4.2
10.2.4.1 : waiting for response msg
10.2.4.1 : joined (S,G) = (*, 232.43.211.234), pinging
10.2.4.1 : given amount of query messages was sent

10.2.4.1 : unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.060/0.137/1.006/0.046
10.2.4.1 : multicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.067/0.152/1.008/0.048


I don't see any packet loss or high latency in the results. Or did I miss something here?
 
does it happen with every vm? also if you start it from node 02 directly?
does it also happen when you start a vm on node 01 from 02?

it would be very helpful if you could post the config of the vm
 
@dcsapak Thanks for your response! See my answers inline.

does it happen with every vm?
Yes. It happens with every VM I created (not started yet).

also if you start it from node 02 directly?

Starting it from node02 directly has no issue. But again, Proxmox01 sees node02 as offline. When you go to node02's (Proxmox02) management UI, both nodes are online.

does it also happen when you start a vm to node 01 from 02?

No, all is OK from node02 (Proxmox02): creating/starting a VM from node02 using node01 as the hypervisor works fine.

it would be very helpful if you could post the config of the vm
I no longer have the config, as I removed the VM config I used when I experienced the issue last week. I'll try to deploy a VM again from node01's (Proxmox01) management UI using the template I created, and I'll keep this thread updated.

Thanks!
 
mhmm, ... does the vm send anything multicast-related on the network (the same network as the cluster)?
if not, maybe the network settings (mtu, etc.) are wrong or the hardware is bad?

the errors in the syslog mean that corosync no longer gets valid packets from node2, so it should be something network related.
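
Two generic checks can help narrow this down (a sketch only; the bridge name vmbr0 and the corosync ports 5404/5405 are assumptions, adjust them to your setup). The first verifies that full-size 1500-byte packets survive the path between the nodes without fragmentation, the second shows whether anything besides corosync is putting multicast on the cluster network once the VM is running:

# 1472 bytes of payload + 28 bytes of ICMP/IP headers = a full 1500-byte packet; this fails if the path MTU is smaller
root@proxmox01:~# ping -M do -s 1472 -c 5 10.2.4.2

# show multicast traffic on the cluster bridge that is not corosync (UDP 5404/5405)
root@proxmox01:~# tcpdump -ni vmbr0 multicast and not udp port 5404 and not udp port 5405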
 
@Dominik, I don't think the VM is sending any multicast on the network, as it's just a fresh install.

One thing I noticed earlier: I tried to create a VM, this time directly on node 02, to keep node 01 from disconnecting. But as soon as the VM was installed and started, I saw from the node 01 UI that the same thing happened and the cluster broke.

(see node1.png)


at Node01:
root@proxmox01:~# pvecm status
Quorum information
------------------
Date: Tue Aug 2 05:54:30 2016
Quorum provider: corosync_votequorum
Nodes: 1
Node ID: 0x00000001
Ring ID: 51692
Quorate: No

Votequorum information
----------------------
Expected votes: 2
Highest expected: 2
Total votes: 1
Quorum: 2 Activity blocked
Flags:

Membership information
----------------------
Nodeid Votes Name
0x00000001 1 10.2.4.1 (local)

Then, quite interestingly, after a couple of minutes of checking around, the same thing happened to node 02 (it broke off from the cluster):

(see node2.png)

root@proxmox02:~# pvecm status
Quorum information
------------------
Date: Tue Aug 2 05:55:29 2016
Quorum provider: corosync_votequorum
Nodes: 1
Node ID: 0x00000002
Ring ID: 52380
Quorate: No

Votequorum information
----------------------
Expected votes: 2
Highest expected: 2
Total votes: 1
Quorum: 2 Activity blocked
Flags:

Membership information
----------------------
Nodeid Votes Name
0x00000002 1 10.2.4.2 (local)


To restore the connection between the two nodes, I just need to stop the VM from node 02. :(

(see node2_vmstop.png)

You'll see the VM (100) is stopped, and the two nodes are healthy again. Same goes on node 01; both nodes are back in sync.

(see node1_back.png)


By the way, when you mentioned earlier that the "network settings (mtu, etc.) are wrong or the hardware is bad", are you referring to the switch fabric the two hosts are connected to, or to the network settings on the two hosts themselves?

Thanks in advance!

Best,
Paulo
 

Attachments

  • node1.png
  • node2.png
  • node2_vmstop.png
  • node1_back.png

Anything, really.

the error messages simply mean that the multicast data coming into the nodes is not valid, so if anything in the
network between the two nodes does anything funny with the packets, that could be the problem.

also, what OS do you run in the guest?
is the network of the guest the same as the cluster network? if so, does it help to have it on a different network?
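
For example (just a sketch; eth1 and vmbr1 are placeholder names for a spare NIC and a new bridge, and I'm assuming the VMs currently use the default vmbr0), you could add a dedicated bridge in /etc/network/interfaces on both nodes and attach the VM's network device to it, so guest traffic stays off the corosync network:

# second bridge for guest traffic only, on a separate physical NIC
auto vmbr1
iface vmbr1 inet manual
        bridge_ports eth1
        bridge_stp off
        bridge_fd 0

after activating it (ifup vmbr1 or a reboot), switch the VM's network device from vmbr0 to vmbr1 in its hardware settings.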
 
@dcsapak thanks for the input! I think we have a clue about what's going on. I consulted Adam, and it seems this has something to do with the MTU settings on the weave interface. I'll find a workaround and keep you posted.
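
For reference, a quick way to compare MTUs (a sketch only; the interface names weave and vmbr0 are assumptions for a default Weave Net / Proxmox setup):

root@proxmox02:~# ip link show vmbr0 | grep mtu
root@proxmox02:~# ip link show weave | grep mtu

Since the overlay's encapsulation adds per-packet overhead, an MTU on the weave side that is larger than what actually fits through the underlying network can lead to oversized or mangled frames on the wire, which would fit the "Invalid packet data" messages corosync logged above.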
 
