Quorum problems with PVE 2.3 and 3.3

cesarpk

Hi to all.

Please, can anybody help me?

I have a cluster of PVE nodes running two different PVE versions.
Some PVE nodes have the newer packages:
proxmox-ve-2.6.32: 3.3-139 (running kernel: 3.10.0-5-pve)
pve-manager: 3.3-5 (running version: 3.3-5/bfebec03)

Other PVE nodes have the older packages:
pve-manager: 2.3-13 (pve-manager/2.3/7946f1f1)
running kernel: 2.6.32-19-pve
proxmox-ve-2.6.32: 2.3-96

I am having several quorum problems on the nodes with proxmox-ve-2.6.32: 3.3-139 (running kernel: 3.10.0-5-pve); the problem appears sometimes at start up and sometimes after start up.

This is the log of a node that has problems at start up:
Code:
Dec  9 18:08:50 pve1 rrdcached[2532]: starting up
Dec  9 18:08:50 pve1 rrdcached[2532]: checking for journal files
Dec  9 18:08:50 pve1 rrdcached[2532]: started new journal /var/lib/rrdcached/journal/rrd.journal.1418159330.030667
Dec  9 18:08:50 pve1 rrdcached[2532]: journal processing complete
Dec  9 18:08:50 pve1 rrdcached[2532]: listening for connections
Dec  9 18:08:50 pve1 pmxcfs[2546]: [quorum] crit: quorum_initialize failed: 6
Dec  9 18:08:50 pve1 pmxcfs[2546]: [quorum] crit: can't initialize service
Dec  9 18:08:50 pve1 pmxcfs[2546]: [confdb] crit: confdb_initialize failed: 6
Dec  9 18:08:50 pve1 pmxcfs[2546]: [quorum] crit: can't initialize service
Dec  9 18:08:50 pve1 pmxcfs[2546]: [dcdb] crit: cpg_initialize failed: 6
Dec  9 18:08:50 pve1 pmxcfs[2546]: [quorum] crit: can't initialize service
Dec  9 18:08:50 pve1 pmxcfs[2546]: [dcdb] crit: cpg_initialize failed: 6
Dec  9 18:08:50 pve1 pmxcfs[2546]: [quorum] crit: can't initialize service
Dec  9 18:08:50 pve1 postfix/master[2579]: daemon started -- version 2.9.6, configuration /etc/postfix
Dec  9 18:08:50 pve1 iscsid: iSCSI daemon with pid=2300 started!
Dec  9 18:08:51 pve1 /usr/sbin/cron[2616]: (CRON) INFO (pidfile fd = 3)
Dec  9 18:08:51 pve1 /usr/sbin/cron[2617]: (CRON) STARTUP (fork ok)
Dec  9 18:08:51 pve1 /usr/sbin/cron[2617]: (CRON) INFO (Running @reboot jobs)
Dec  9 18:08:51 pve1 kernel: [   10.639499] sctp: Hash tables configured (established 65536 bind 65536)
Dec  9 18:08:51 pve1 kernel: [   10.653149] DLM installed
Dec  9 18:08:51 pve1 kernel: [   10.755368] bnx2 0000:01:00.1 eth1: NIC Copper Link is Up, 1000 Mbps full duplex
Dec  9 18:08:51 pve1 kernel: [   10.755421] , receive & transmit flow control ON
Dec  9 18:08:51 pve1 kernel: [   10.825572] bonding: bond10: link status definitely up for interface eth1, 1000 Mbps full duplex.
Dec  9 18:08:51 pve1 kernel: [   10.825585] bonding: bond10: first active interface up!
Dec  9 18:08:51 pve1 kernel: [   10.825636] IPv6: ADDRCONF(NETDEV_CHANGE): bond10: link becomes ready
Dec  9 18:08:51 pve1 kernel: [   11.205214] bnx2 0000:02:00.1 eth3: NIC Copper Link is Up, 1000 Mbps full duplex
Dec  9 18:08:51 pve1 kernel: [   11.205267] , receive & transmit flow control ON
Dec  9 18:08:51 pve1 kernel: [   11.225639] bonding: bond10: link status definitely up for interface eth3, 1000 Mbps full duplex.
Dec  9 18:08:52 pve1 corosync[2697]:   [MAIN  ] Corosync Cluster Engine ('1.4.7'): started and ready to provide service.
Dec  9 18:08:52 pve1 corosync[2697]:   [MAIN  ] Corosync built-in features: nss
Dec  9 18:08:52 pve1 corosync[2697]:   [MAIN  ] Successfully read config from /etc/cluster/cluster.conf
Dec  9 18:08:52 pve1 corosync[2697]:   [MAIN  ] Successfully parsed cman config
Dec  9 18:08:52 pve1 corosync[2697]:   [MAIN  ] Successfully configured openais services to load
Dec  9 18:08:52 pve1 corosync[2697]:   [TOTEM ] Initializing transport (UDP/IP Multicast).
Dec  9 18:08:52 pve1 corosync[2697]:   [TOTEM ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
Dec  9 18:08:52 pve1 corosync[2697]:   [TOTEM ] The network interface [10.25.25.20] is now up.
Dec  9 18:08:52 pve1 corosync[2697]:   [QUORUM] Using quorum provider quorum_cman
Dec  9 18:08:52 pve1 corosync[2697]:   [SERV  ] Service engine loaded: corosync cluster quorum service v0.1
Dec  9 18:08:52 pve1 corosync[2697]:   [CMAN  ] CMAN 1364188437 (built Mar 25 2013 06:14:01) started
Dec  9 18:08:52 pve1 corosync[2697]:   [SERV  ] Service engine loaded: corosync CMAN membership service 2.90
Dec  9 18:08:52 pve1 corosync[2697]:   [SERV  ] Service engine loaded: openais cluster membership service B.01.01
Dec  9 18:08:52 pve1 corosync[2697]:   [SERV  ] Service engine loaded: openais event service B.01.01
Dec  9 18:08:52 pve1 corosync[2697]:   [SERV  ] Service engine loaded: openais checkpoint service B.01.01
Dec  9 18:08:52 pve1 corosync[2697]:   [SERV  ] Service engine loaded: openais message service B.03.01
Dec  9 18:08:52 pve1 corosync[2697]:   [SERV  ] Service engine loaded: openais distributed locking service B.03.01
Dec  9 18:08:52 pve1 corosync[2697]:   [SERV  ] Service engine loaded: openais timer service A.01.01
Dec  9 18:08:52 pve1 corosync[2697]:   [SERV  ] Service engine loaded: corosync extended virtual synchrony service
Dec  9 18:08:52 pve1 corosync[2697]:   [SERV  ] Service engine loaded: corosync configuration service
Dec  9 18:08:52 pve1 corosync[2697]:   [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01
Dec  9 18:08:52 pve1 corosync[2697]:   [SERV  ] Service engine loaded: corosync cluster config database access v1.01
Dec  9 18:08:52 pve1 corosync[2697]:   [SERV  ] Service engine loaded: corosync profile loading service
Dec  9 18:08:52 pve1 corosync[2697]:   [QUORUM] Using quorum provider quorum_cman
Dec  9 18:08:52 pve1 corosync[2697]:   [SERV  ] Service engine loaded: corosync cluster quorum service v0.1
Dec  9 18:08:52 pve1 corosync[2697]:   [MAIN  ] Compatibility mode set to whitetank.  Using V1 and V2 of the synchronization engine.
Dec  9 18:08:53 pve1 ntpd[2476]: Listen normally on 6 bond10 fe80::a6ba:dbff:fe35:1457 UDP 123
Dec  9 18:08:53 pve1 ntpd[2476]: Listen normally on 7 vmbr0 fe80::a6ba:dbff:fe35:1455 UDP 123
Dec  9 18:08:53 pve1 ntpd[2476]: peers refreshed
Dec  9 18:08:56 pve1 pmxcfs[2546]: [quorum] crit: quorum_initialize failed: 6
Dec  9 18:08:56 pve1 pmxcfs[2546]: [confdb] crit: confdb_initialize failed: 6
Dec  9 18:08:56 pve1 pmxcfs[2546]: [dcdb] crit: cpg_initialize failed: 6
Dec  9 18:08:56 pve1 pmxcfs[2546]: [dcdb] crit: cpg_initialize failed: 6
Dec  9 18:09:02 pve1 pmxcfs[2546]: [quorum] crit: quorum_initialize failed: 6
Dec  9 18:09:02 pve1 pmxcfs[2546]: [confdb] crit: confdb_initialize failed: 6
Dec  9 18:09:02 pve1 pmxcfs[2546]: [dcdb] crit: cpg_initialize failed: 6
Dec  9 18:09:02 pve1 pmxcfs[2546]: [dcdb] crit: cpg_initialize failed: 6
Dec  9 18:09:08 pve1 pmxcfs[2546]: [quorum] crit: quorum_initialize failed: 6
Dec  9 18:09:08 pve1 pmxcfs[2546]: [confdb] crit: confdb_initialize failed: 6
Dec  9 18:09:08 pve1 pmxcfs[2546]: [dcdb] crit: cpg_initialize failed: 6
Dec  9 18:09:08 pve1 pmxcfs[2546]: [dcdb] crit: cpg_initialize failed: 6
Dec  9 18:09:14 pve1 pmxcfs[2546]: [quorum] crit: quorum_initialize failed: 6
Dec  9 18:09:14 pve1 pmxcfs[2546]: [confdb] crit: confdb_initialize failed: 6
Dec  9 18:09:14 pve1 pmxcfs[2546]: [dcdb] crit: cpg_initialize failed: 6
Dec  9 18:09:14 pve1 pmxcfs[2546]: [dcdb] crit: cpg_initialize failed: 6
Dec  9 18:09:20 pve1 pmxcfs[2546]: [quorum] crit: quorum_initialize failed: 6
Dec  9 18:09:20 pve1 pmxcfs[2546]: [confdb] crit: confdb_initialize failed: 6
Dec  9 18:09:20 pve1 pmxcfs[2546]: [dcdb] crit: cpg_initialize failed: 6
Dec  9 18:09:20 pve1 pmxcfs[2546]: [dcdb] crit: cpg_initialize failed: 6
Dec  9 18:09:26 pve1 pmxcfs[2546]: [quorum] crit: quorum_initialize failed: 6
Dec  9 18:09:26 pve1 pmxcfs[2546]: [confdb] crit: confdb_initialize failed: 6
Dec  9 18:09:26 pve1 pmxcfs[2546]: [dcdb] crit: cpg_initialize failed: 6
Dec  9 18:09:26 pve1 pmxcfs[2546]: [dcdb] crit: cpg_initialize failed: 6
Dec  9 18:09:32 pve1 pmxcfs[2546]: [quorum] crit: quorum_initialize failed: 6
Dec  9 18:09:32 pve1 pmxcfs[2546]: [confdb] crit: confdb_initialize failed: 6
Dec  9 18:09:32 pve1 pmxcfs[2546]: [dcdb] crit: cpg_initialize failed: 6
Dec  9 18:09:32 pve1 pmxcfs[2546]: [dcdb] crit: cpg_initialize failed: 6


On another node with PVE 3.3 and kernel 3.10.0-5-pve, I get these messages:
Code:
Dec  9 18:52:44 pve5 rrdcached[4276]: listening for connections
Dec  9 18:52:44 pve5 iscsid: iSCSI daemon with pid=4032 started!
Dec  9 18:52:44 pve5 pmxcfs[4307]: [quorum] crit: quorum_initialize failed: 6
Dec  9 18:52:44 pve5 pmxcfs[4307]: [quorum] crit: can't initialize service
Dec  9 18:52:44 pve5 pmxcfs[4307]: [confdb] crit: confdb_initialize failed: 6
Dec  9 18:52:44 pve5 pmxcfs[4307]: [quorum] crit: can't initialize service
Dec  9 18:52:44 pve5 pmxcfs[4307]: [dcdb] crit: cpg_initialize failed: 6
Dec  9 18:52:44 pve5 pmxcfs[4307]: [quorum] crit: can't initialize service
Dec  9 18:52:44 pve5 pmxcfs[4307]: [dcdb] crit: cpg_initialize failed: 6
Dec  9 18:52:44 pve5 pmxcfs[4307]: [quorum] crit: can't initialize service
...
Dec  9 18:52:47 pve5 kernel: [   18.957024] DLM installed
Dec  9 18:52:47 pve5 corosync[4556]:   [MAIN  ] Corosync Cluster Engine ('1.4.7'): started and ready to provide service.
Dec  9 18:52:47 pve5 corosync[4556]:   [MAIN  ] Corosync built-in features: nss
Dec  9 18:52:47 pve5 corosync[4556]:   [MAIN  ] Successfully read config from /etc/cluster/cluster.conf
Dec  9 18:52:47 pve5 corosync[4556]:   [MAIN  ] Successfully parsed cman config
Dec  9 18:52:47 pve5 corosync[4556]:   [MAIN  ] Successfully configured openais services to load
Dec  9 18:52:47 pve5 corosync[4556]:   [TOTEM ] Initializing transport (UDP/IP Multicast).
Dec  9 18:52:47 pve5 corosync[4556]:   [TOTEM ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
Dec  9 18:52:47 pve5 corosync[4556]:   [TOTEM ] The network interface [10.25.25.50] is now up.
Dec  9 18:52:47 pve5 corosync[4556]:   [QUORUM] Using quorum provider quorum_cman
Dec  9 18:52:47 pve5 corosync[4556]:   [SERV  ] Service engine loaded: corosync cluster quorum service v0.1
Dec  9 18:52:47 pve5 corosync[4556]:   [CMAN  ] CMAN 1364188437 (built Mar 25 2013 06:14:01) started
Dec  9 18:52:47 pve5 corosync[4556]:   [SERV  ] Service engine loaded: corosync CMAN membership service 2.90
Dec  9 18:52:47 pve5 corosync[4556]:   [SERV  ] Service engine loaded: openais cluster membership service B.01.01
Dec  9 18:52:47 pve5 corosync[4556]:   [SERV  ] Service engine loaded: openais event service B.01.01
Dec  9 18:52:47 pve5 corosync[4556]:   [SERV  ] Service engine loaded: openais checkpoint service B.01.01
Dec  9 18:52:48 pve5 corosync[4556]:   [SERV  ] Service engine loaded: openais message service B.03.01
Dec  9 18:52:48 pve5 corosync[4556]:   [SERV  ] Service engine loaded: openais distributed locking service B.03.01
Dec  9 18:52:48 pve5 corosync[4556]:   [SERV  ] Service engine loaded: openais timer service A.01.01
Dec  9 18:52:48 pve5 corosync[4556]:   [SERV  ] Service engine loaded: corosync extended virtual synchrony service
Dec  9 18:52:48 pve5 corosync[4556]:   [SERV  ] Service engine loaded: corosync configuration service
Dec  9 18:52:48 pve5 corosync[4556]:   [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01
Dec  9 18:52:48 pve5 corosync[4556]:   [SERV  ] Service engine loaded: corosync cluster config database access v1.01
Dec  9 18:52:48 pve5 corosync[4556]:   [SERV  ] Service engine loaded: corosync profile loading service
Dec  9 18:52:48 pve5 corosync[4556]:   [QUORUM] Using quorum provider quorum_cman
Dec  9 18:52:48 pve5 corosync[4556]:   [SERV  ] Service engine loaded: corosync cluster quorum service v0.1
Dec  9 18:52:48 pve5 corosync[4556]:   [MAIN  ] Compatibility mode set to whitetank.  Using V1 and V2 of the synchronization engine.
Dec  9 18:52:49 pve5 ntpd[4223]: Listen normally on 13 bond10 fe80::eef4:bbff:fec8:52dc UDP 123
Dec  9 18:52:49 pve5 ntpd[4223]: peers refreshed
Dec  9 18:52:50 pve5 pmxcfs[4307]: [quorum] crit: quorum_initialize failed: 6
Dec  9 18:52:50 pve5 pmxcfs[4307]: [confdb] crit: confdb_initialize failed: 6
Dec  9 18:52:50 pve5 pmxcfs[4307]: [dcdb] crit: cpg_initialize failed: 6
Dec  9 18:52:50 pve5 pmxcfs[4307]: [dcdb] crit: cpg_initialize failed: 6
Dec  9 18:52:56 pve5 pmxcfs[4307]: [quorum] crit: quorum_initialize failed: 6
Dec  9 18:52:56 pve5 pmxcfs[4307]: [confdb] crit: confdb_initialize failed: 6
Dec  9 18:52:56 pve5 pmxcfs[4307]: [dcdb] crit: cpg_initialize failed: 6
Dec  9 18:52:56 pve5 pmxcfs[4307]: [dcdb] crit: cpg_initialize failed: 6
Dec  9 18:53:02 pve5 pmxcfs[4307]: [quorum] crit: quorum_initialize failed: 6
Dec  9 18:53:02 pve5 pmxcfs[4307]: [confdb] crit: confdb_initialize failed: 6
Dec  9 18:53:02 pve5 pmxcfs[4307]: [dcdb] crit: cpg_initialize failed: 6
Dec  9 18:53:02 pve5 pmxcfs[4307]: [dcdb] crit: cpg_initialize failed: 6
Dec  9 18:53:08 pve5 pmxcfs[4307]: [quorum] crit: quorum_initialize failed: 6
Dec  9 18:53:08 pve5 pmxcfs[4307]: [confdb] crit: confdb_initialize failed: 6
Dec  9 18:53:08 pve5 pmxcfs[4307]: [dcdb] crit: cpg_initialize failed: 6
Dec  9 18:53:08 pve5 pmxcfs[4307]: [dcdb] crit: cpg_initialize failed: 6
Dec  9 18:53:14 pve5 pmxcfs[4307]: [quorum] crit: quorum_initialize failed: 6
Dec  9 18:53:14 pve5 pmxcfs[4307]: [confdb] crit: confdb_initialize failed: 6
Dec  9 18:53:14 pve5 pmxcfs[4307]: [dcdb] crit: cpg_initialize failed: 6
Dec  9 18:53:14 pve5 pmxcfs[4307]: [dcdb] crit: cpg_initialize failed: 6
Dec  9 18:53:20 pve5 pmxcfs[4307]: [quorum] crit: quorum_initialize failed: 6
Dec  9 18:53:20 pve5 pmxcfs[4307]: [confdb] crit: confdb_initialize failed: 6
Dec  9 18:53:20 pve5 pmxcfs[4307]: [dcdb] crit: cpg_initialize failed: 6
Dec  9 18:53:20 pve5 pmxcfs[4307]: [dcdb] crit: cpg_initialize failed: 6
... etc.

My question is:

How can I fix it?

Best regards
Cesar
 
Hi again

I'll answer myself.

After editing cluster.conf on the PVE nodes that had lost quorum and restarting them (removing the "<totem window_size" configuration), those nodes got quorum again.
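For reference, the kind of totem line I removed looked roughly like this (a sketch only; the attribute value is an example, not necessarily what my file contained):
Code:
<!-- inside cluster.conf, directly under the <cluster> element -->
<totem window_size="50"/>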

Some notes:
- All my servers are DELL, and most of them have 1 Gb/s NICs (soon two NICs in LACP layer2+3, shared by the VMs and the PVE cluster).
- Two nodes will have two 10 Gb/s NICs for a VM and its LAN traffic, shared with the PVE cluster (also LACP layer2+3).
- My PVE nodes do not have a firewall enabled.
- Multicast is working (on the managed switches and on the PVE nodes; see the omping sketch below).
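To double-check multicast between nodes, a test along these lines can be used (a sketch with omping; the addresses are examples taken from the cluster network in the logs above):
Code:
# run on both nodes at the same time; each side should report ~0% multicast packet loss
omping -c 600 -i 1 -q 10.25.25.20 10.25.25.50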

So now I have four questions:
1) What values can I set in cluster.conf so that the PVE cluster has less chance of losing quorum?
2) When a live migration of a VM is in progress, is the PVE cluster network used, or the VM's network? (I guess the PVE cluster network.)
(I need to know because I have to choose how to best use my NICs, 10 Gb/s versus 1 Gb/s, between the cluster communication and the network traffic of a VM that will have 256 GB of RAM; both can be in LACP layer2+3.)
3) Is the PVE kernel 3.10.0-5-pve stable with LACP layer2+3, or with which LACP options is it stable?
4) Is the PVE kernel 3.10.0-5-pve stable with the "I/OAT DMA Engine" enabled in the hardware BIOS? (This can give much better performance, especially for networking.)
See these links about the "I/OAT DMA Engine":
http://www.linuxfoundation.org/collaborate/workgroups/networking/i/oat
http://www.intel.com/content/dam/do...technology-software-guide-for-linux-paper.pdf
http://www.intel.com/content/www/us/en/wireless-network/accel-technology.html
http://www.linux-kvm.org/page/NetworkingTodo


Best regards
Cesar
 
Hi Cesar, great that you solved your problem.

1) What values can I set in cluster.conf so that the PVE cluster has less chance of losing quorum?
I have a 14-node cluster without any tuning in cluster.conf and 2x gigabit links per node.

2) When a live migration of a VM is in progress, is the PVE cluster network used, or the VM's network? (I guess the PVE cluster network.)

The PVE cluster network. (Nodes use their hostname to connect to each other.)


3) Is the PVE kernel 3.10.0-5-pve stable with LACP layer2+3, or with which LACP options is it stable?

Yes, I'm using LACP in production without problems (layer3+4).
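For reference, a layer3+4 bond in /etc/network/interfaces looks roughly like this (a sketch; the interface names are examples, and the switch side needs a matching 802.3ad port-channel):
Code:
auto bond10
iface bond10 inet manual
    slaves eth1 eth3
    bond_miimon 100
    bond_mode 802.3ad
    bond_xmit_hash_policy layer3+4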

4) Is the PVE kernel 3.10.0-5-pve stable with the "I/OAT DMA Engine" enabled in the hardware BIOS? (This can give much better performance, especially for networking.)
See these links about the "I/OAT DMA Engine":

I don't know about this one.
 
2) The host network is used.
3) and 4) are pretty old technologies, so I don't see any reason for them to be unstable.
BTW, I/OAT highly depends on application socket buffer sizes. You can try to use it for smaller buffers by tuning /proc/sys/net/ipv4/tcp_dma_copybreak (sysctl net.ipv4.tcp_dma_copybreak), and by tuning the system default buffer sizes (net.ipv4.tcp_*mem), but don't expect much improvement.
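For example, something along these lines (a sketch; the values are only illustrative and should be tuned for your workload):
Code:
# threshold in bytes: TCP copies smaller than this are done by the CPU instead of the I/OAT engine
sysctl net.ipv4.tcp_dma_copybreak
# lower it so that smaller socket buffers are also offloaded (example value)
sysctl -w net.ipv4.tcp_dma_copybreak=2048
# min/default/max TCP buffer sizes in bytes (example values)
sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"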
 
@Mitya:
@spirit:

Many thanks for your answers, this is an excellent community!

Moreover, I am testing a pair of newly purchased Dell servers, and soon I will be testing the hardware BIOS options to find out whether PVE can run reliably with the hardware configured for best performance (I am working together with a Microsoft SQL Server DBA, and we will test each BIOS option to confirm that PVE works well), so if you want, I can tell you about the results.

@spirit:
Today I was talking with the MS-SQL DBA, and he told me that it would be better for the server if the NUMA nodes were managed by the VM itself and not by QEMU, because MS-SQL knows more about its own processes running on the NUMA nodes, and can obviously manage them better than QEMU.

And maybe other applications for Linux or Windows can also do this job better than QEMU.

So I would like to request a feature for PVE (if possible):
- That the PVE GUI has an option to enable or disable the NUMA nodes function.
- And, if possible, please also send me the new patches again for the tests. That way, the DBA and I can run the tests and compare the efficiency, which for me is a unique opportunity.

Moreover, regarding the "I/OAT DMA Engine": as it is an Intel technology that exists in all high-end servers and is configurable in the hardware BIOS, I think you should read these links:
http://www.linuxfoundation.org/collaborate/workgroups/networking/i/oat
http://www.intel.com/content/dam/do...technology-software-guide-for-linux-paper.pdf
http://www.intel.com/content/www/us/en/wireless-network/accel-technology.html
http://www.linux-kvm.org/page/NetworkingTodo

Best regards
Cesar
 
so if you want, I can tell you about the results.
Sure, please share! :)

@spirit:
Today I was talking with the MS-SQL DBA, and he told me that it would be better for the server if the NUMA nodes were managed by the VM itself and not by QEMU, because MS-SQL knows more about its own processes running on the NUMA nodes, and can obviously manage them better than QEMU.

And maybe other applications for Linux or Windows can also do this job better than QEMU.

So I would like to request a feature for PVE (if possible):
- That the PVE GUI has an option to enable or disable the NUMA nodes function.
- And, if possible, please also send me the new patches again for the tests. That way, the DBA and I can run the tests and compare the efficiency, which for me is a unique opportunity.

Yes, I'll send a patch for the GUI soon.

I have sent 2 patches for NUMA.
One basic patch, numa:0|1, which exposes NUMA to the guest; with host kernel 3.10 it will also do automatic NUMA balancing on the host.
This one is already applied in the Proxmox git repository.

One advanced patch, where you can say "I want to use this host processor for this NUMA node", something like CPU pinning, but better ;)
Dietmar hasn't applied this patch yet.

I'll try to build a package for you today.
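Once the basic patch is installed, enabling it for a VM should be a single option in its config file, something like this (a sketch; only the numa line comes from the patch, the other values are just an example):
Code:
# /etc/pve/qemu-server/<vmid>.conf
sockets: 2
cores: 4
memory: 16384
numa: 1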

Moreover, regarding the "I/OAT DMA Engine": as it is an Intel technology that exists in all high-end servers and is configurable in the hardware BIOS, I think you should read these links:
http://www.linuxfoundation.org/collaborate/workgroups/networking/i/oat
http://www.intel.com/content/dam/do...technology-software-guide-for-linux-paper.pdf
http://www.intel.com/content/www/us/en/wireless-network/accel-technology.html
http://www.linux-kvm.org/page/NetworkingTodo

I have begun to read, but I don't think it will help QEMU; maybe the host.
The main bottleneck is currently in the Windows virtio drivers. (I can easily reach 20 Gbit/s from a Linux guest, and now 6 Gbit/s with Windows 2012R2.)
 
Sure, please share! :)

With pleasure, I will do that when I have the results.

Yes, I'll send a patch for the GUI soon.

Many thanks, you are the best!!!
Note: Please also explain how I can do these tasks:
1) Expose or not expose the NUMA nodes to the VM.
2) Enable QEMU to manage the processes on the NUMA nodes.
3) Disable QEMU from managing the processes on the NUMA nodes.

I have sent 2 patches for NUMA.
One basic patch, numa:0|1, which exposes NUMA to the guest; with host kernel 3.10 it will also do automatic NUMA balancing on the host.
This one is already applied in the Proxmox git repository.

One advanced patch, where you can say "I want to use this host processor for this NUMA node", something like CPU pinning, but better ;)
Dietmar hasn't applied this patch yet.

I'll try to build a package for you today.
Many thanks, excellent !!!

I have begun to read, but I don't think it will help QEMU; maybe the host.
The main bottleneck is currently in the Windows virtio drivers. (I can easily reach 20 Gbit/s from a Linux guest, and now 6 Gbit/s with Windows 2012R2.)

Many thanks for the info, but do you have 8 network queues enabled for the Win 2012R2 VM?


Best regards
Cesar
 
Many thanks for the info, but do you have 8 network queues enabled for the Win 2012R2 VM?

2 queues; more don't improve performance. (Queues only help inbound traffic.)

For outbound traffic, as far as I remember, the difference between 2008R2 and 2012R2 is huge (something like 1.5 vs 6 Gbit/s).
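For reference, the queue count is set on the VM's network device line, roughly like this (a sketch; the MAC address and bridge are examples):
Code:
# in /etc/pve/qemu-server/<vmid>.conf
net0: virtio=DE:AD:BE:EF:00:01,bridge=vmbr0,queues=2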
 
Hi to all.

Please, can anyone help me?

Again, I have a quorum problem, but this time only with two hosts that have a 10 Gb/s NIC used for both the cluster communication and the VM network.
Worst of all, when the cluster communication is lost on these two nodes (the node gets a red light in the PVE GUI), the VM turns off.

Notes:
- The goal is to use the 10 Gb/s NICs, not the 1 Gb/s ones, for several reasons.
- The VM that is in HA also turns off, and I think this virtual machine should not be turned off by this problem.
- The 10 Gb/s NICs are Intel X520-DA2.
- With these 10 Gb/s NICs, DRBD 8.4.5 is solid as a rock (I use independent NICs for DRBD).
- The resource group manager is running on all nodes that have HA enabled.
- The loss of cluster communication occurs while the VM is using its network; if the VM is powered on but nobody is connected to it, I do not lose quorum. The VM is in a development environment with few network connections (the workstations connected to this VM have 1 Gb/s links, not 10 Gb/s).

These are the error messages (they appear after some hours, when the nodes lose cluster communication):

1)
Dec 17 08:41:35 pvehost5 pmxcfs[4211]: [status] crit: cpg_send_message failed: 9

2)
Executing HA start for VM 109
Member kvm5 trying to enable pvevm:109...Could not connect to resource group manager
TASK ERROR: command 'clusvcadm -e pvevm:109 -m pvehost5' failed: exit code 1

About the second error message, these are my actions:

1) I rebooted both nodes (pvehost5 and pvehost6), because these two nodes lost cluster communication and the VM was brutally powered off.
2) I got the error message I wrote above (message number two), and I could not get VM 109 to start again, not even manually.
3) I executed "reboot" again on the other node (pvehost6), which also has HA enabled for VM 109; then VM 109 started up automatically on the other node (pvehost5), which is the expected behavior given the cluster configuration.
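For reference, when this happens, the quorum and HA state on each node can be checked with commands like these (from the cman/rgmanager stack that PVE 3.x uses):
Code:
pvecm status                 # membership and quorum as PVE sees it
clustat                      # rgmanager view of cluster members and HA services
service rgmanager status     # whether the resource group manager itself is running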

About my software:

- I have five nodes with PVE 3.3 and three nodes with PVE 2.3.
- These are my installed packages:
Code:
proxmox-ve-2.6.32: 3.3-139 (running kernel: 3.10.0-5-pve)
pve-manager: 3.3-5 (running version: 3.3-5/bfebec03)
pve-kernel-3.10.0-5-pve: 3.10.0-19
pve-kernel-2.6.32-34-pve: 2.6.32-139
lvm2: 2.02.98-pve4
clvm: 2.02.98-pve4
corosync-pve: 1.4.7-1
openais-pve: 1.1.4-3
libqb0: 0.11.1-2
redhat-cluster-pve: 3.2.0-2
resource-agents-pve: 3.9.2-4
fence-agents-pve: 4.0.10-1
pve-cluster: 3.0-15
qemu-server: 3.3-5    <--- test patch provided by spirit, installed on one node only
pve-firmware: 1.1-3
libpve-common-perl: 3.0-19
libpve-access-control: 3.0-15
libpve-storage-perl: 3.0-25
pve-libspice-server1: 0.12.4-3
vncterm: 1.1-8
vzctl: 4.0-1pve6
vzprocps: 2.0.11-2
vzquota: 3.1-2
pve-qemu-kvm: 2.2-2    <--- test patch provided by spirit, installed on one node only
ksm-control-daemon: 1.1-1
glusterfs-client: 3.5.2-1

- I am using the classic Linux network stack, because DRBD does not work with OVS, and DRBD has its own dedicated network links.
This is my network configuration:
Code:
auto bond0
iface bond0 inet manual
    slaves eth0 eth2
    bond_miimon 100
    bond_mode 802.3ad
    bond_xmit_hash_policy layer2
    #post-up echo 0 > /sys/devices/virtual/net/vmbr0/bridge/multicast_snooping    # (I guess this line isn't necessary)

My questions:
A) How can I fix my problem?
B) Is it correct that the multicast_snooping line isn't necessary with kernel 3.10? (With that line enabled I also had quorum problems, so maybe I don't need it either.)
C) Is it expected behaviour that the VM turns off because the PVE node loses quorum?

Best regards
Cesar
 
