Cluster offline / cman works / clustat shows all online / all nodes red


Chris Rivera

Guest
The cluster is online and accessible via ssh.

clustat shows all nodes as online:

root@proxmox9a:~# clustat
Cluster Status for FL-Cluster @ Wed Apr 3 09:11:26 2013
Member Status: Quorate


Member Name ID Status
------ ---- ---- ------
proxmox11 1 Online
proxmox2 2 Online
proxmox3a 3 Online
proxmox4 4 Online
poxmox5 5 Online
proxmox6 6 Online
proxmox7 7 Online
proxmox8 8 Online
proxmox9a 9 Online, Local
Proxmox10 10 Online


######

pvecm status

root@proxmox9a:~# pvecm status
Version: 6.2.0
Config Version: 53
Cluster Name: FL-Cluster
Cluster Id: 6836
Cluster Member: Yes
Cluster Generation: 5163132
Membership state: Cluster-Member
Nodes: 10
Expected votes: 10
Total votes: 10
Node votes: 1
Quorum: 6
Active subsystems: 5
Flags:
Ports Bound: 0
Node name: proxmox9a
Node ID: 9


######

pveversion -v

root@proxmox9a:~# pveversion -v
pve-manager: 2.3-13 (pve-manager/2.3/7946f1f1)
running kernel: 2.6.32-19-pve
proxmox-ve-2.6.32: 2.3-93
pve-kernel-2.6.32-19-pve: 2.6.32-93
pve-kernel-2.6.32-7-pve: 2.6.32-60
lvm2: 2.02.95-1pve2
clvm: 2.02.95-1pve2
corosync-pve: 1.4.4-4
openais-pve: 1.1.4-2
libqb: 0.10.1-2
redhat-cluster-pve: 3.1.93-2
resource-agents-pve: 3.9.2-3
fence-agents-pve: 3.1.9-1
pve-cluster: 1.0-36
qemu-server: 2.3-18
pve-firmware: 1.0-21
libpve-common-perl: 1.0-49
libpve-access-control: 1.0-26
libpve-storage-perl: 2.3-6
vncterm: 1.0-3
vzctl: 4.0-1pve2
vzprocps: 2.0.11-2
vzquota: 3.1-1
pve-qemu-kvm: 1.4-8
ksm-control-daemon: 1.1-1

######

We haven't been able to find any network disturbances, so we're not sure what happened.

What is the difference between clustat and the service that shows the node as red / green?
 
Unicast / multicast was tested using omping and it worked.


How does the web interface know to show red/green?
Is this value saved to a database?
What service updates this status?
 

The pvestatd daemon updates that status.

/etc/init.d/pvestatd restart

(Also check that the pve-cluster filesystem is OK: /etc/init.d/pve-cluster restart)
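
A minimal health check along those lines, assuming the PVE 2.x init scripts and the pmxcfs mount at /etc/pve (adjust to your setup):

Code:
# is pvestatd actually running?
ps aux | grep [p]vestatd

# quorum / membership as seen by cman
pvecm status

# the GUI colours come from status data that pvestatd publishes via the cluster
# filesystem (pmxcfs); if this FUSE mount is missing or stale, nodes turn red
mount | grep /etc/pve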
 
Restarting all of the services below has proved useless....

cman
pve-cluster
pvestatd
pvedaemon


It seems everything is running right, but the web interface is showing them as offline.
 
- If pvecm status is OK, then the cman service is OK.
- If you can edit/write a VM config file in /etc/pve/qemu-server, then pve-cluster is OK.

If both are OK, then it must be a problem with pvestatd hanging (maybe on storage?); see the quick check below.
Can you check your logs, /var/log/daemon.log and /var/log/messages, and search them for pvestatd entries?
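
A minimal sketch of those two checks plus the log search, assuming the default log locations (the scratch-file name is just an example):

Code:
# write test against the cluster filesystem; if this fails, pve-cluster/pmxcfs is the problem
touch /etc/pve/.writetest && rm /etc/pve/.writetest && echo "pve-cluster OK"

# pull any pvestatd messages out of the logs
grep pvestatd /var/log/daemon.log /var/log/messages | tail -n 50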
 
Not sure why or how, but I was able to get them all online as of now. It took 3 days. I got to a point where I had 2-3 nodes online, then after a day we had them all online.

My issue:

Node 4 is not online; cman will not start. VMs are online and running, but when I try to restart cman I get the following error.

root@proxmox4:~# service cman restart
Stopping cluster:
Stopping dlm_controld... [ OK ]
Stopping fenced... [ OK ]
Stopping cman... [ OK ]
Unloading kernel modules... [ OK ]
Unmounting configfs... [ OK ]
Starting cluster:
Checking if cluster has been disabled at boot... [ OK ]
Checking Network Manager... [ OK ]
Global setup... [ OK ]
Loading kernel modules... FATAL: Could not load /lib/modules/2.6.32-16-pve/modules.dep: No such file or directory
[FAILED]


Normally a reboot will fix things, but I don't want to reboot the box if the file is missing and it will not boot, or cause more issues than cman not running. I haven't found anything on this error and would like some feedback.
 

You need to reboot, because you have updated the kernel and I think your current kernel (-16-pve?) no longer has its modules.dep.
You can try to reinstall the older kernel as a workaround if you don't want to reboot; see the sketch below.

But the missing modules.dep shouldn't be a problem, because when you reboot you will be running the newer kernel.
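
A sketch of how to compare the running kernel with what is still installed, plus the reinstall workaround (the package name below is an example; match it to your uname -r output and repository):

Code:
# which kernel is running, and which pve-kernel packages are still installed?
uname -r
dpkg -l | grep pve-kernel

# workaround if a reboot is not possible yet: reinstall the package for the running kernel
# (replace the version with whatever uname -r reports)
apt-get install --reinstall pve-kernel-2.6.32-16-pve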
 
The node is completely offline. It will not boot up; it says no kernel file was found. It also says the fallback does not work...


I'm burning an Ubuntu live CD right now to get access and look around.


Is it possible to copy the boot dir from one server to another?
What are your suggestions for getting this back online?

** This is extremely high importance
 
I was able to use the Ubuntu live CD to navigate through the boot partition. I can see that the box was upgraded but never rebooted. The new .18 kernel exists but was never booted.


My problem:

- Grub does not show the .18-pve kernel option
- None of the previous kernels will boot


How I got it to boot:

- In grub I edited the latest .16-pve entry, changed .16 => .18, and it booted up.



1. How can I fix this so I don't need to do this manually every time the box is rebooted?
2. Why did this happen?
3. Why do all previous kernels fail?
 

1. # update-grub ? (see the sketch below)

It will regenerate /boot/grub/grub.cfg.

2. Maybe the kernel was updated, but grub.cfg was not?

3. Because they have been removed? (dpkg -l | grep pve-kernel)
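
A minimal sketch of regenerating and verifying the grub menu (assuming grub2 with its usual config path):

Code:
# rebuild the menu so the installed .18 kernel gets its own entry
update-grub

# confirm the new entry is present and see which kernel packages remain installed
grep menuentry /boot/grub/grub.cfg
dpkg -l | grep pve-kernel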
 


1. I'll update grub manually.
2. How was this not completed when the update was done with apt-get dist-upgrade? This not completing seems like an issue that should be addressed. I doubt I will make any more updates unless a node is failing; we can't afford any downtime or issues.
3. Upgrading distros removes previous kernels? We do not remove kernels manually; I thought the update installs the new kernel with the option to revert to older ones in case of any issues. If the kernels were deleted by the upgrade process, why was grub never updated to remove the entries as selectable kernels at boot?

FYI: We don't delete or remove anything from the cluster / nodes unless it's a VM. The fact that the previous kernels are missing leads me to believe something went horribly wrong here.
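
One thing we can check (assuming the default Debian log locations) is whether a package operation actually removed the old kernels:

Code:
# look for pve-kernel removals in the dpkg and apt histories
zgrep -i "remove.*pve-kernel" /var/log/dpkg.log* 2>/dev/null
zgrep -i remove /var/log/apt/history.log* 2>/dev/null | grep -i pve-kernel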
 
Original issue where all nodes are red... but clustat shows all nodes online

root@proxmox8:~# clustat
Cluster Status for FL-Cluster @ Wed Apr 10 09:05:31 2013
Member Status: Quorate


Member Name ID Status
------ ---- ---- ------
proxmox11 1 Online
proxmox2 2 Online
proxmox3a 3 Online
proxmox4 4 Online
poxmox5 5 Online
proxmox6 6 Online
proxmox7 7 Online
proxmox8 8 Online, Local
proxmox9a 9 Online
Proxmox10 10 Online
proxmox12 11 Online


root@proxmox8:~# pvecm status
Version: 6.2.0
Config Version: 58
Cluster Name: FL-Cluster
Cluster Id: 6836
Cluster Member: Yes
Cluster Generation: 5217712
Membership state: Cluster-Member
Nodes: 11
Expected votes: 11
Total votes: 11
Node votes: 1
Quorum: 6
Active subsystems: 5
Flags:
Ports Bound: 0
Node name: proxmox8
Node ID: 8




Not sure what's wrong. Multicast testing using omping (http://pve.proxmox.com/wiki/Multicast_notes) shows multicast / unicast to be fine.

We checked multicast / unicast storm control on the switch to make sure the switch is not filtering packets and that no other port is causing a multicast / unicast / broadcast storm.


#show storm-control multicast
Interface Filter State Upper Lower Current
--------- ------------- ----------- ----------- ----------
Gi1/0/1 Forwarding 2.00% 2.00% 0.00%
Gi1/0/2 Forwarding 2.00% 2.00% 0.00%
Gi1/0/3 Forwarding 2.00% 2.00% 0.00%
Gi1/0/4 Forwarding 2.00% 2.00% 0.00%
Gi1/0/5 Link Down 2.00% 2.00% 0.00%
Gi1/0/6 Forwarding 2.00% 2.00% 0.00%
Gi1/0/7 Forwarding 2.00% 2.00% 0.00%
Gi1/0/8 Link Down 2.00% 2.00% 0.00%
Gi1/0/9 Forwarding 2.00% 2.00% 0.00%
Gi1/0/10 Forwarding 2.00% 2.00% 0.00%
Gi1/0/11 Forwarding 2.00% 2.00% 0.00%
Gi1/0/12 Forwarding 2.00% 2.00% 0.00%
Gi1/0/13 Forwarding 2.00% 2.00% 0.00%
Gi1/0/14 Forwarding 2.00% 2.00% 0.00%
Gi1/0/15 Forwarding 2.00% 2.00% 0.00%
Gi1/0/16 Forwarding 2.00% 2.00% 0.00%
Gi1/0/17 Forwarding 2.00% 2.00% 0.00%
Gi1/0/18 Forwarding 2.00% 2.00% 0.00%
Gi1/0/19 Forwarding 2.00% 2.00% 0.00%
Gi1/0/20 Forwarding 2.00% 2.00% 0.00%
Gi1/0/21 Link Down 2.00% 2.00% 0.00%
Gi1/0/22 Forwarding 2.00% 2.00% 0.00%
Gi1/0/23 Link Down 2.00% 2.00% 0.00%
Gi1/0/24 Forwarding 2.00% 2.00% 0.00%
Gi1/0/25 Link Down 2.00% 2.00% 0.00%
Gi1/0/26 Forwarding 2.00% 2.00% 0.00%
Gi1/0/27 Forwarding 2.00% 2.00% 0.00%
Gi1/0/28 Forwarding 2.00% 2.00% 0.00%
Gi1/0/29 Forwarding 2.00% 2.00% 0.00%
Gi1/0/30 Forwarding 2.00% 2.00% 0.00%
Gi1/0/31 Forwarding 2.00% 2.00% 0.00%
Gi1/0/32 Forwarding 2.00% 2.00% 0.00%
Gi1/0/33 Forwarding 2.00% 2.00% 0.00%
Gi1/0/34 Link Down 2.00% 2.00% 0.00%
Gi1/0/35 Forwarding 2.00% 2.00% 0.00%
Gi1/0/36 Link Down 2.00% 2.00% 0.00%
Gi1/0/37 Forwarding 2.00% 2.00% 0.00%
Gi1/0/38 Link Down 2.00% 2.00% 0.00%
Gi1/0/39 Link Down 2.00% 2.00% 0.00%
Gi1/0/40 Link Down 2.00% 2.00% 0.00%
Gi1/0/48 Forwarding 2.00% 2.00% 0.00%
Gi2/0/7 Link Down 5.00% 5.00% 0.00%





#show storm-control unicast
Interface Filter State Upper Lower Current
--------- ------------- ----------- ----------- ----------
Gi1/0/1 Forwarding 87.00% 65.00% 13.68%
Gi1/0/2 Forwarding 87.00% 65.00% 1.97%
Gi1/0/3 Forwarding 87.00% 65.00% 0.05%
Gi1/0/4 Forwarding 87.00% 65.00% 0.00%
Gi1/0/5 Link Down 87.00% 65.00% 0.00%
Gi1/0/6 Forwarding 87.00% 65.00% 0.01%
Gi1/0/7 Forwarding 87.00% 65.00% 0.46%
Gi1/0/8 Link Down 87.00% 65.00% 0.00%
Gi1/0/9 Forwarding 87.00% 65.00% 0.00%
Gi1/0/10 Forwarding 87.00% 65.00% 4.85%
Gi1/0/11 Forwarding 87.00% 65.00% 0.26%
Gi1/0/12 Forwarding 87.00% 65.00% 0.21%
Gi1/0/13 Forwarding 87.00% 65.00% 1.26%
Gi1/0/14 Forwarding 87.00% 65.00% 0.00%
Gi1/0/15 Forwarding 87.00% 65.00% 0.99%
Gi1/0/16 Forwarding 87.00% 65.00% 9.56%
Gi1/0/17 Forwarding 87.00% 65.00% 0.47%
Gi1/0/18 Forwarding 87.00% 65.00% 0.00%
Gi1/0/19 Forwarding 87.00% 65.00% 0.00%
Gi1/0/20 Forwarding 87.00% 65.00% 0.00%
Gi1/0/21 Link Down 87.00% 65.00% 0.00%
Gi1/0/22 Forwarding 87.00% 65.00% 10.08%
Gi1/0/23 Link Down 87.00% 65.00% 0.00%
Gi1/0/24 Forwarding 87.00% 65.00% 0.00%
Gi1/0/25 Link Down 87.00% 65.00% 0.00%
Gi1/0/26 Forwarding 87.00% 65.00% 0.00%
Gi1/0/27 Forwarding 87.00% 65.00% 0.00%
Gi1/0/28 Forwarding 87.00% 65.00% 0.00%
Gi1/0/29 Forwarding 87.00% 65.00% 0.00%
Gi1/0/30 Forwarding 87.00% 65.00% 0.00%
Gi1/0/31 Forwarding 87.00% 65.00% 0.00%
Gi1/0/32 Forwarding 87.00% 65.00% 0.00%
Gi1/0/33 Forwarding 87.00% 65.00% 0.28%
Gi1/0/34 Link Down 87.00% 65.00% 0.00%
Gi1/0/35 Forwarding 87.00% 65.00% 0.00%
Gi1/0/48 Forwarding 87.00% 65.00% 0.02%
Gi2/0/7 Link Down 87.00% 65.00% 0.00%



#show storm-control broadcast
Interface Filter State Upper Lower Current
--------- ------------- ----------- ----------- ----------
Gi1/0/1 Forwarding 2.00% 2.00% 0.00%
Gi1/0/2 Forwarding 2.00% 2.00% 0.00%
Gi1/0/3 Forwarding 2.00% 2.00% 0.00%
Gi1/0/4 Forwarding 2.00% 2.00% 0.00%
Gi1/0/5 Link Down 2.00% 2.00% 0.00%
Gi1/0/6 Forwarding 2.00% 2.00% 0.00%
Gi1/0/7 Forwarding 2.00% 2.00% 0.00%
Gi1/0/8 Link Down 2.00% 2.00% 0.00%
Gi1/0/9 Forwarding 2.00% 2.00% 0.00%
Gi1/0/10 Forwarding 2.00% 2.00% 0.00%
Gi1/0/11 Forwarding 2.00% 2.00% 0.00%
Gi1/0/12 Forwarding 2.00% 2.00% 0.00%
Gi1/0/13 Forwarding 2.00% 2.00% 0.00%
Gi1/0/14 Forwarding 2.00% 2.00% 0.00%
Gi1/0/15 Forwarding 2.00% 2.00% 0.00%
Gi1/0/16 Forwarding 2.00% 2.00% 0.00%
Gi1/0/17 Forwarding 2.00% 2.00% 0.00%
Gi1/0/18 Forwarding 2.00% 2.00% 0.00%
Gi1/0/19 Forwarding 2.00% 2.00% 0.00%
Gi1/0/20 Forwarding 2.00% 2.00% 0.00%
Gi1/0/21 Link Down 2.00% 2.00% 0.00%
Gi1/0/22 Forwarding 2.00% 2.00% 0.00%
Gi1/0/23 Link Down 2.00% 2.00% 0.00%
Gi1/0/24 Forwarding 2.00% 2.00% 0.00%
Gi1/0/25 Link Down 2.00% 2.00% 0.00%
Gi1/0/26 Forwarding 2.00% 2.00% 0.00%
Gi1/0/27 Forwarding 2.00% 2.00% 0.00%
Gi1/0/28 Forwarding 2.00% 2.00% 0.00%
Gi1/0/29 Forwarding 2.00% 2.00% 0.00%
Gi1/0/30 Forwarding 2.00% 2.00% 0.00%
Gi1/0/31 Forwarding 2.00% 2.00% 0.00%
Gi1/0/32 Forwarding 2.00% 2.00% 0.00%
Gi1/0/33 Forwarding 2.00% 2.00% 0.00%
Gi1/0/34 Link Down 2.00% 2.00% 0.00%
Gi1/0/35 Forwarding 2.00% 2.00% 0.00%
Gi1/0/36 Link Down 2.00% 2.00% 0.00%
Gi1/0/37 Forwarding 2.00% 2.00% 0.00%
Gi1/0/38 Link Down 2.00% 2.00% 0.00%
Gi1/0/39 Link Down 2.00% 2.00% 0.00%
Gi1/0/40 Link Down 2.00% 2.00% 0.00%
Gi1/0/48 Forwarding 2.00% 2.00% 0.00%
Gi2/0/7 Link Down 5.00% 5.00% 0.00%




Can anyone provide me with other steps or tests to run to see what could be wrong?
 
Oh, I think I spoke too fast; your multicast does indeed seem OK and corosync is working.

So from my previous post:
>> - If you can edit/write a VM config file in /etc/pve/qemu-server, then pve-cluster is OK?

>> If both are OK, then it must be a problem with pvestatd hanging (maybe on storage?).
>> Can you check your logs, /var/log/daemon.log and /var/log/messages, and search them for pvestatd entries?
 
Yes, we are using Cisco 3750G series switches.

This same problem keeps recurring: it takes everything down for a few days... then magically, one by one, the nodes come back online... they stay online for 3-4 days, then the cycle begins again.

We have never been able to track down or figure out the cause. We just spend countless hours restarting services, tracing the network, and checking the switches, cacti, and VMs....

Any help resolving this issue, and helping anyone who hits the same problem in the future, would be greatly appreciated.
 
Oh OK, so multicast works, but it sometimes seems to break, then the nodes come back again.
I think the nodes remain red because pve-cluster and pvestatd don't come back online automatically after a multicast break.
As a workaround, restart
/etc/init.d/pve-cluster restart
/etc/init.d/pvestatd restart

and the nodes should come back green.


Now, for your Cisco switches I have different solutions (with the latest Proxmox kernel):

A) Keep the multicast filtering (snooping) enabled; this avoids forwarding multicast out of all your switch ports / Linux bridges.

1) If you can, don't put the Proxmox admin IP on the bridge (vmbr0); put it directly on the ethX or bondX interface only.

2) If you really need the Proxmox admin IP on vmbr0, you can try this:

Code:
auto vmbr0
iface vmbr0 inet static
        address X.X.X.X
        netmask X.X.X.X
        broadcast X.X.X.X
        bridge_ports eth0
        bridge_stp off
        bridge_fd 0
        post-up echo 0 > /sys/devices/virtual/net/vmbr0/bridge/multicast_querier


Then, for 1) or 2), you need to enable the IGMP querier on your Cisco switches:
http://www.cisco.com/en/US/docs/swi...se/12.2_52_se/configuration/guide/swigmp.html
Code:
#conf t
# ip igmp snooping querier
# ip igmp snooping querier query-interval 10

You can check the multicast groups on the Cisco switches with:
Code:
# show ip igmp snooping groups


B) Disable all multicast filtering.

On the Proxmox host:

Code:
auto vmbr0
iface vmbr0 inet static
        address X.X.X.X
        netmask X.X.X.X
        broadcast X.X.X.X
        bridge_ports eth0
        bridge_stp off
        bridge_fd 0
        post-up echo 0 > /sys/devices/virtual/net/vmbr0/bridge/multicast_snooping

On the Cisco switches:

Code:
#conf t
# no ip igmp snooping


Maybe try B) first, so all traffic is forwarded, then A) to recreate clean multicast snooping groups.


But basically, the problem is the Linux bridge sending IGMP queries onto the network, and the Cisco switches don't like them.
I have had the same problems as you with Cisco 6500 and Cisco 2960G switches.
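
After applying either option, a quick way to confirm the bridge settings took effect on the Proxmox host (these sysfs paths are the same ones referenced in the snippets above; they may differ on other kernels):

Code:
# 1 = enabled, 0 = disabled
cat /sys/devices/virtual/net/vmbr0/bridge/multicast_snooping
cat /sys/devices/virtual/net/vmbr0/bridge/multicast_querier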
 
>> - If you can edit/write a VM config file in /etc/pve/qemu-server, then pve-cluster is OK?

>> If both are OK, then it must be a problem with pvestatd hanging (maybe on storage?).
>> Can you check your logs, /var/log/daemon.log and /var/log/messages, and search them for pvestatd entries?

/etc/pve/qemu-server - cannot create/update a file: permission problem


/var/log/daemon.log
Apr 10 10:46:29 proxmox8 pmxcfs[454901]: [status] crit: cpg_send_message failed: 9
Apr 10 10:46:29 proxmox8 pmxcfs[454901]: [status] crit: cpg_send_message failed: 9
Apr 10 10:46:29 proxmox8 pmxcfs[454901]: [status] crit: cpg_send_message failed: 9
Apr 10 10:46:29 proxmox8 pmxcfs[454901]: [status] crit: cpg_send_message failed: 9
Apr 10 10:46:29 proxmox8 pmxcfs[454901]: [status] crit: cpg_send_message failed: 9
Apr 10 10:46:29 proxmox8 pmxcfs[454901]: [status] crit: cpg_send_message failed: 9
Apr 10 10:46:29 proxmox8 pmxcfs[454901]: [status] crit: cpg_send_message failed: 9
Apr 10 10:46:29 proxmox8 pmxcfs[454901]: [status] crit: cpg_send_message failed: 9
Apr 10 10:46:29 proxmox8 pmxcfs[454901]: [status] crit: cpg_send_message failed: 9
Apr 10 10:46:29 proxmox8 pmxcfs[454901]: [status] crit: cpg_send_message failed: 9

/var/log/messages
Apr 10 06:03:38 proxmox8 corosync[875588]: [TOTEM ] Retransmit List: 1 2 3 4 5 6 7 8 b c d e f 10 11
Apr 10 06:03:38 proxmox8 corosync[875588]: [TOTEM ] Retransmit List: 1e 1f 20 21 22 23 25 26 24 27 28 29 2a 2b 2c 2d 2e 2f 30 31 32
Apr 10 06:03:38 proxmox8 corosync[875588]: [TOTEM ] Retransmit List: 1a 1b 1c 1d 1e 20 21 22 37 38 39 3a 3b 3c 3d 3e 3f 40 41 42 43 44 45 46 47 48 49 4a 4b 4c
Apr 10 06:03:41 proxmox8 corosync[875588]: [TOTEM ] Retransmit List: 15 17 18 19 1a 1b 1c 1d 4d 4e 4f 50 51 52 53 54 55 56 57 58 59 5a 5b 5c 5d
Apr 10 06:03:48 proxmox8 corosync[875588]: [TOTEM ] Retransmit List: 18 19 1a 1c 1d 1e 21 22 62 63 64 65 66 67 68 69 6a 6b 6c 6d 6e 71 72 74 75 76 78 79 7a 7b
Apr 10 06:03:50 proxmox8 corosync[875588]: [TOTEM ] Retransmit List: 17 18 19 1a 1c 1d 1e 21 7c 7d 7e 7f 80 81 82 83 84 85 86 87
Apr 10 06:03:58 proxmox8 corosync[875588]: [TOTEM ] Retransmit List: 21 25 26 27 28 29 2a 2b 8d 8e 8f 90 91 92 93 94 95 97 98 99 9e a1 a2 a3 a4 a5 a6 a7 a8 a9
Apr 10 06:04:00 proxmox8 corosync[875588]: [TOTEM ] Retransmit List: 17 18 19 1a 1c 1d 1e 25 aa ab ac ad ae af b2 b3
Apr 10 06:04:05 proxmox8 corosync[875588]: [TOTEM ] Retransmit List: 2b 2c 2d 2e 2f 30 31 32
Apr 10 06:04:05 proxmox8 corosync[875588]: [TOTEM ] Retransmit List: 2b 2c 2e 2f 30 31 32
Apr 10 06:04:07 proxmox8 corosync[875588]: [TOTEM ] Retransmit List: 2e 2f 30 31 32 34 35 36
Apr 10 06:04:07 proxmox8 corosync[875588]: [TOTEM ] Retransmit List: 66 67 68 69 6b 6c 6d 6e
Apr 10 06:04:07 proxmox8 corosync[875588]: [TOTEM ] Retransmit List: 92 93 94 95 96 97 98 99
Apr 10 06:04:10 proxmox8 corosync[875588]: [TOTEM ] Retransmit List: cb d0 d1 d2 d3 d4 d5 d6
Apr 10 06:04:25 proxmox8 corosync[875588]: [TOTEM ] Retransmit List: 1a 1b 1c 1d 1e 1f 20 21
Apr 10 06:04:25 proxmox8 corosync[875588]: [TOTEM ] Retransmit List: 1d 1e 1f 20 21 22 23 24
Apr 10 06:25:04 proxmox8 rsyslogd: [origin software="rsyslogd" swVersion="4.6.4" x-pid="1154" x-info="http://www.rsyslog.com"] rsyslogd was HUPed, type 'lightweight'.



grep pvestatd /var/log/messages

no results
 
>> Oh OK, so multicast works, but it sometimes seems to break, then the nodes come back again.
>> I think the nodes remain red because pve-cluster and pvestatd don't come back online automatically after a multicast break.
>> As a workaround, restart
>> /etc/init.d/pve-cluster restart
>> /etc/init.d/pvestatd restart
>>
>> and the nodes should come back green.


This did not do anything. All nodes still show red, even the local node, which normally shows green.


Once we can get the nodes green, we can work on tuning the networking side of the cluster so it runs optimally.
 
>> You can check the multicast groups on the Cisco switches with:
>> # show ip igmp snooping groups

On our switch it's "sho" instead of "show".




Vlan 801:
--------
IGMP snooping : Enabled
IGMPv2 immediate leave : Disabled
Multicast router learning mode : pim-dvmrp
CGMP interoperability mode : IGMP_ONLY
Robustness variable : 2
Last member query count : 2
Last member query interval : 1000


GigabitEthernet1/0/29 is up, line protocol is up (connected)
Hardware is Gigabit Ethernet, address is ****.****.****(bia ****.****.****)
Description: ProxmoxNode_9
MTU 1500 bytes, BW 1000000 Kbit, DLY 10 usec,
reliability 255/255, txload 1/255, rxload 1/255
Encapsulation ARPA, loopback not set
Keepalive set (10 sec)
Full-duplex, 1000Mb/s, media type is 10/100/1000BaseTX
Media-type configured as connector
input flow-control is off, output flow-control is unsupported
ARP type: ARPA, ARP Timeout 04:00:00
Last input never, output 00:00:05, output hang never
Last clearing of "show interface" counters never
Input queue: 0/75/0/0 (size/max/drops/flushes); Total output drops: 860268144
Queueing strategy: fifo
Output queue: 0/40 (size/max)
5 minute input rate 91000 bits/sec, 42 packets/sec
5 minute output rate 59000 bits/sec, 55 packets/sec
22242896065 packets input, 4362177900019 bytes, 0 no buffer
Received 1806220939 broadcasts (1806147129 multicasts)
0 runts, 0 giants, 0 throttles
0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored
0 watchdog, 1806147129 multicast, 0 pause input
0 input packets with dribble condition detected
30487156638 packets output, 11077953726901 bytes, 0 underruns
0 output errors, 0 collisions, 10 interface resets
0 babbles, 0 late collision, 0 deferred
0 lost carrier, 0 no carrier, 0 PAUSE output
0 output buffer failures, 0 output buffers swapped out




This is the interface for node 9. I do not see any reference to IGMP.
 
omping results:

It doesn't seem like there is a communication issue. All nodes are communicating.

proxmox4 : multicast, seq=60, size=69 bytes, dist=0, time=0.205ms
proxmox7 : unicast, seq=60, size=69 bytes, dist=0, time=0.194ms
proxmox7 : multicast, seq=60, size=69 bytes, dist=0, time=0.202ms
proxmox8 : unicast, seq=60, size=69 bytes, dist=0, time=0.186ms
proxmox11 : multicast, seq=60, size=69 bytes, dist=0, time=0.230ms
proxmox11 : unicast, seq=60, size=69 bytes, dist=0, time=0.218ms
proxmox8 : multicast, seq=60, size=69 bytes, dist=0, time=0.258ms
proxmox12 : unicast, seq=60, size=69 bytes, dist=0, time=0.251ms
proxmox12 : multicast, seq=60, size=69 bytes, dist=0, time=0.279ms
proxmox4 : unicast, seq=61, size=69 bytes, dist=0, time=0.181ms
proxmox4 : multicast, seq=61, size=69 bytes, dist=0, time=0.210ms
proxmox9a : unicast, seq=61, size=69 bytes, dist=0, time=0.192ms
proxmox9a : multicast, seq=61, size=69 bytes, dist=0, time=0.196ms
Proxmox10 : unicast, seq=36, size=69 bytes, dist=0, time=0.194ms
Proxmox10 : multicast, seq=36, size=69 bytes, dist=0, time=0.202ms
proxmox3a : unicast, seq=61, size=69 bytes, dist=1, time=0.259ms
proxmox3a : multicast, seq=61, size=69 bytes, dist=0, time=0.273ms
proxmox7 : unicast, seq=61, size=69 bytes, dist=0, time=0.237ms
proxmox7 : multicast, seq=61, size=69 bytes, dist=0, time=0.250ms
proxmox8 : unicast, seq=61, size=69 bytes, dist=0, time=0.235ms
proxmox8 : multicast, seq=61, size=69 bytes, dist=0, time=0.256ms
poxmox5 : unicast, seq=61, size=69 bytes, dist=0, time=0.261ms
poxmox5 : multicast, seq=61, size=69 bytes, dist=0, time=0.276ms
proxmox11 : unicast, seq=61, size=69 bytes, dist=0, time=0.241ms
proxmox11 : multicast, seq=61, size=69 bytes, dist=0, time=0.246ms
proxmox12 : unicast, seq=61, size=69 bytes, dist=0, time=0.251ms
proxmox12 : multicast, seq=61, size=69 bytes, dist=0, time=0.257ms
proxmox3a : unicast, seq=62, size=69 bytes, dist=1, time=0.137ms
proxmox3a : multicast, seq=62, size=69 bytes, dist=0, time=0.151ms
Proxmox10 : unicast, seq=37, size=69 bytes, dist=0, time=0.099ms
proxmox9a : multicast, seq=62, size=69 bytes, dist=0, time=0.115ms
proxmox9a : unicast, seq=62, size=69 bytes, dist=0, time=0.111ms
Proxmox10 : multicast, seq=37, size=69 bytes, dist=0, time=0.112ms
proxmox4 : unicast, seq=62, size=69 bytes, dist=0, time=0.162ms
proxmox4 : multicast, seq=62, size=69 bytes, dist=0, time=0.172ms
poxmox5 : unicast, seq=62, size=69 bytes, dist=0, time=0.258ms
proxmox7 : multicast, seq=62, size=69 bytes, dist=0, time=0.261ms
proxmox7 : unicast, seq=62, size=69 bytes, dist=0, time=0.258ms
poxmox5 : multicast, seq=62, size=69 bytes, dist=0, time=0.272ms
proxmox11 : unicast, seq=62, size=69 bytes, dist=0, time=0.236ms
proxmox11 : multicast, seq=62, size=69 bytes, dist=0, time=0.240ms
proxmox12 : unicast, seq=62, size=69 bytes, dist=0, time=0.232ms
proxmox12 : multicast, seq=62, size=69 bytes, dist=0, time=0.236ms
proxmox8 : unicast, seq=62, size=69 bytes, dist=0, time=0.346ms
proxmox8 : multicast, seq=62, size=69 bytes, dist=0, time=0.358ms
proxmox3a : unicast, seq=63, size=69 bytes, dist=1, time=0.152ms
proxmox3a : multicast, seq=63, size=69 bytes, dist=0, time=0.165ms
proxmox4 : unicast, seq=63, size=69 bytes, dist=0, time=0.201ms
proxmox4 : multicast, seq=63, size=69 bytes, dist=0, time=0.217ms
proxmox9a : unicast, seq=63, size=69 bytes, dist=0, time=0.187ms
proxmox9a : multicast, seq=63, size=69 bytes, dist=0, time=0.191ms
poxmox5 : unicast, seq=63, size=69 bytes, dist=0, time=0.224ms
poxmox5 : multicast, seq=63, size=69 bytes, dist=0, time=0.240ms
proxmox11 : unicast, seq=63, size=69 bytes, dist=0, time=0.165ms
proxmox11 : multicast, seq=63, size=69 bytes, dist=0, time=0.198ms
Proxmox10 : unicast, seq=38, size=69 bytes, dist=0, time=0.221ms
Proxmox10 : multicast, seq=38, size=69 bytes, dist=0, time=0.230ms
proxmox8 : unicast, seq=63, size=69 bytes, dist=0, time=0.247ms
proxmox8 : multicast, seq=63, size=69 bytes, dist=0, time=0.252ms
proxmox7 : unicast, seq=63, size=69 bytes, dist=0, time=0.272ms
proxmox7 : multicast, seq=63, size=69 bytes, dist=0, time=0.277ms
proxmox12 : unicast, seq=63, size=69 bytes, dist=0, time=0.221ms
proxmox12 : multicast, seq=63, size=69 bytes, dist=0, time=0.225ms
 
