Corosync is not working and I can't log in to the web console.

efg

Member
Nov 24, 2020
Please help. The cluster has stopped working. After rebooting the server, the VMs do not start, and I can't launch them from the console either.

Code:
pvecm status
Cluster information
-------------------
Name:             asodesk
Config Version:   52
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Fri Apr 29 05:03:51 2022
Quorum provider:  corosync_votequorum
Nodes:            1
Node ID:          0x00000001
Ring ID:          1.19
Quorate:          No

Votequorum information
----------------------
Expected votes:   14
Highest expected: 14
Total votes:      5
Quorum:           8 Activity blocked
Flags:

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          5 IP1 (local)


Code:
systemctl status corosync
● corosync.service - Corosync Cluster Engine
     Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
     Active: active (running) since Fri 2022-04-29 04:16:39 CEST; 46min ago
       Docs: man:corosync
             man:corosync.conf
             man:corosync_overview
   Main PID: 24977 (corosync)
      Tasks: 9 (limit: 308854)
     Memory: 197.8M
        CPU: 2h 30min 348ms
     CGroup: /system.slice/corosync.service
             └─24977 /usr/sbin/corosync -f

Apr 29 05:02:28 x8.asodesk.com corosync[24977]:   [TOTEM ] entering GATHER state from 11(merge during join).
Apr 29 05:02:28 x8.asodesk.com corosync[24977]:   [TOTEM ] entering GATHER state from 11(merge during join).
Apr 29 05:02:28 x8.asodesk.com corosync[24977]:   [TOTEM ] entering GATHER state from 11(merge during join).
Apr 29 05:02:28 x8.asodesk.com corosync[24977]:   [TOTEM ] entering GATHER state from 11(merge during join).
Apr 29 05:02:28 x8.asodesk.com corosync[24977]:   [TOTEM ] entering GATHER state from 11(merge during join).
Apr 29 05:02:28 x8.asodesk.com corosync[24977]:   [TOTEM ] entering GATHER state from 11(merge during join).
Apr 29 05:02:28 x8.asodesk.com corosync[24977]:   [TOTEM ] entering GATHER state from 11(merge during join).
Apr 29 05:02:28 x8.asodesk.com corosync[24977]:   [TOTEM ] entering GATHER state from 11(merge during join).
Apr 29 05:02:28 x8.asodesk.com corosync[24977]:   [TOTEM ] entering GATHER state from 11(merge during join).
Apr 29 05:02:28 x8.asodesk.com corosync[24977]:   [TOTEM ] entering GATHER state from 11(merge during join).

Code:
pvecm e 1
Unable to set expected votes: CS_ERR_INVALID_PARAM

Code:
qm start 100
cluster not ready - no quorum?
 
Code:
pvecm expected 1
 
It did not help. I am getting the error:
Code:
Unable to set expected votes: CS_ERR_INVALID_PARAM
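A likely explanation for the error: votequorum appears to refuse an expected-votes value lower than the total votes it can currently see, and node x8 alone carries 5 votes, so 1 is below that floor. The live values corosync is working with can be double-checked with corosync-quorumtool (standard corosync tooling; a hedged sketch):

Code:
# show quorum state as corosync currently sees it (votes, expected votes, quorate flag)
corosync-quorumtool -s
# list the visible members with their per-node vote counts
corosync-quorumtool -l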
 
I only have 10 nodes


Code:
cat /etc/corosync/corosync.conf
logging {
  debug: on
  to_syslog: yes
}

nodelist {
  node {
    name: staging
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 136.243.39.81
  }
  node {
    name: x3
    nodeid: 10
    quorum_votes: 1
    ring0_addr: 95.216.17.92
  }
  node {
    name: x4
    nodeid: 8
    quorum_votes: 1
    ring0_addr: 65.108.103.217
  }
  node {
    name: x5
    nodeid: 9
    quorum_votes: 1
    ring0_addr: 142.132.128.83
  }
  node {
    name: x8
    nodeid: 1
    quorum_votes: 5
    ring0_addr: 168.119.78.190
  }
  node {
    name: x9
    nodeid: 5
    quorum_votes: 1
    ring0_addr: 162.55.90.37
  }
  node {
    name: z1
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 116.202.222.171
  }
  node {
    name: z2
    nodeid: 6
    quorum_votes: 1
    ring0_addr: 136.243.133.76
  }
  node {
    name: z3
    nodeid: 7
    quorum_votes: 1
    ring0_addr: 136.243.132.216
  }
  node {
    name: z4
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 176.9.11.85
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: asodesk
  config_version: 52
  interface {
    linknumber: 0
  }
  ip_version: ipv4
  link_mode: passive
  secauth: on
  version: 2
}
 
So I read that in order to be able to log in to the main node and set expected votes to 1, you need to shut down all the other nodes first and then set it on the main one.
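For reference, that procedure would look roughly like this (a sketch only, not a recommendation; node names are taken from the corosync.conf above, and note that with x8 carrying 5 votes, expected votes likely cannot be pushed below 5, which is still enough to make the lone node quorate):

Code:
# on each node except the one you keep:
systemctl poweroff

# then, on the remaining node (x8) only:
pvecm expected 5   # lowest value votequorum will likely accept here; makes this node quorate
qm start 100       # guests should start again once the node is quorate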
 
I'm afraid to turn off all the nodes, because VMs are currently running on the remaining nodes, and if quorum is not reached after the nodes restart, those VMs will not start.
Maybe there is a safer way?
 
Let's wait for a Proxmox support team member then...
to see what they suggest for you.
@efg, maybe we will get @Fabian_E to give us a solution. He is a very excellent support member and has helped me before.
PS: the E in his name is for Excellence ;)
 
Spirog, thank you very much. It's great to have someone to help :).
 
@efg can you post # pveversion -v
Code:
pveversion -v
proxmox-ve: 7.1-1 (running kernel: 5.13.19-3-pve)
pve-manager: 7.1-10 (running version: 7.1-10/6ddebafe)
pve-kernel-helper: 7.1-8
pve-kernel-5.13: 7.1-6
pve-kernel-5.4: 6.4-12
pve-kernel-5.13.19-3-pve: 5.13.19-7
pve-kernel-5.4.162-1-pve: 5.4.162-2
ceph-fuse: 14.2.21-1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: residual config
ifupdown2: 3.1.0-1+pmx3
libjs-extjs: 7.0.0-1
libknet1: 1.22-pve2
libproxmox-acme-perl: 1.4.1
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.1-6
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.1-2
libpve-guest-common-perl: 4.0-3
libpve-http-server-perl: 4.1-1
libpve-storage-perl: 7.0-15
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.11-1
lxcfs: 4.0.11-pve1
novnc-pve: 1.3.0-1
proxmox-backup-client: 2.1.4-1
proxmox-backup-file-restore: 2.1.4-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.4-5
pve-cluster: 7.1-3
pve-container: 4.1-3
pve-docs: 7.1-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.3-4
pve-ha-manager: 3.3-3
pve-i18n: 2.6-2
pve-qemu-kvm: 6.1.0-3
pve-xtermjs: 4.16.0-1
qemu-server: 7.1-4
smartmontools: 7.2-pve2
spiceterm: 3.2-2
swtpm: 0.7.0~rc1+2
vncterm: 1.7-1
 
Hi,
please don't use commands like pvecm expected 1 if you still have multiple working nodes, because that would allow the single node to change cluster status.

Please make sure that the nodes can ping each other. Cluster communication needs low network latency, so using a dedicated network for it is highly recommended. You should try to get enough nodes talking to each other to reach quorum again. Please also see here for more information.
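A quick way to check that might look like this (a sketch; the IPs are the ring0 addresses from the corosync.conf posted above):

Code:
# from each node, check reachability and latency to the other ring0 addresses
for ip in 136.243.39.81 95.216.17.92 65.108.103.217 142.132.128.83 \
          162.55.90.37 116.202.222.171 136.243.133.76 136.243.132.216 176.9.11.85; do
    ping -c 3 -W 1 "$ip" | tail -n 2
done

# show corosync's own view of the knet links per node
corosync-cfgtool -s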

One of your nodes is configured to have 5 votes; you might want to change that once the cluster is healthy again, if there is no good reason for it.
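For the record, the vote math from the pvecm output above: expected votes are 14 (nine nodes with 1 vote each plus x8 with 5), so quorum is floor(14/2) + 1 = 8, and x8's 5 votes alone fall short of it, hence "Activity blocked". Once the cluster is healthy, the vote change would be made roughly like this (a sketch; on Proxmox VE the file to edit is the cluster-wide /etc/pve/corosync.conf, not the local copy):

Code:
# edit the cluster-wide config, which pmxcfs replicates to all nodes
nano /etc/pve/corosync.conf
# in the node block for x8, change:
#   quorum_votes: 5   ->   quorum_votes: 1
# and bump the version in the totem block so the change propagates:
#   config_version: 52   ->   config_version: 53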
 
@Spirog After Hetzner removed the traffic limit, my problem was solved by turning one of the servers completely off and back on :)!

I would like to report the probable cause of the cluster breakdown.
When the cluster fell apart, all cluster nodes were up and reachable (ping). But I got the following message from my ISP:

Code:
Unfortunately, Falkenstein servers are currently experiencing very large inbound attacks. Our technicians are already working on a solution.
We apologize for the inconvenience caused.
Thank you for your understanding.

Due to always different directions (IP addresses, ports, packet size) we unfortunately had to restrict this traffic. This affects UDP traffic on port 9000-65535.
We apologize for the inconvenience caused.


@Fabian_E, could the UDP 9000-65535 restriction be the reason the cluster broke? As far as I know, corosync works on ports 5404-5405.
 
Awesome... I am glad it's working. That was a huge issue with Hetzner then... hopefully they have it under control :)
 
I don't think so, but likely the network wasn't stable enough for cluster communication. The most important thing is having low latency.
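If in doubt, the UDP ports corosync actually listens on can be verified directly (a sketch using standard tools; knet defaults to port 5405 for link 0):

Code:
# list the UDP sockets held by the corosync process
ss -ulpn | grep corosync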
 