Corosync won't start

CelticWebs

I'm having issues getting corosync to start up, which is leaving this node unable to connect to the other one.

Diagnostics so far

I've tested the basics like pinging one node from the other, and it works fine.
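One thing ping alone won't catch: corosync/knet traffic runs over UDP (port 5405 by default, as the logs further down confirm), so a blocked or filtered UDP port can break the cluster even when ICMP works. A rough check, assuming the default port, might look like this on each node:

Code:
# is corosync actually listening on the knet port on this node?
ss -ulpn | grep ':5405'

# optional probe of the other node's port (requires netcat-openbsd);
# "succeeded" here only means no ICMP port-unreachable came back
nc -vzu <other-node-ip> 5405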

Results from journalctl -xeu pve-cluster.service
Code:
Jan 15 18:15:49 pve847 pmxcfs[122430]: [quorum] crit: quorum_initialize failed: 2
Jan 15 18:15:49 pve847 pmxcfs[122430]: [confdb] crit: cmap_initialize failed: 2
Jan 15 18:15:49 pve847 pmxcfs[122430]: [dcdb] crit: cpg_initialize failed: 2
Jan 15 18:15:49 pve847 pmxcfs[122430]: [status] crit: cpg_initialize failed: 2
Jan 15 18:15:55 pve847 pmxcfs[122430]: [quorum] crit: quorum_initialize failed: 2
Jan 15 18:15:55 pve847 pmxcfs[122430]: [confdb] crit: cmap_initialize failed: 2
Jan 15 18:15:55 pve847 pmxcfs[122430]: [dcdb] crit: cpg_initialize failed: 2
Jan 15 18:15:55 pve847 pmxcfs[122430]: [status] crit: cpg_initialize failed: 2
Jan 15 18:16:01 pve847 pmxcfs[122430]: [quorum] crit: quorum_initialize failed: 2
Jan 15 18:16:01 pve847 pmxcfs[122430]: [confdb] crit: cmap_initialize failed: 2
Jan 15 18:16:01 pve847 pmxcfs[122430]: [dcdb] crit: cpg_initialize failed: 2
Jan 15 18:16:01 pve847 pmxcfs[122430]: [status] crit: cpg_initialize failed: 2
Jan 15 18:16:07 pve847 pmxcfs[122430]: [quorum] crit: quorum_initialize failed: 2
Jan 15 18:16:07 pve847 pmxcfs[122430]: [confdb] crit: cmap_initialize failed: 2
Jan 15 18:16:07 pve847 pmxcfs[122430]: [dcdb] crit: cpg_initialize failed: 2
Jan 15 18:16:07 pve847 pmxcfs[122430]: [status] crit: cpg_initialize failed: 2
Jan 15 18:16:13 pve847 pmxcfs[122430]: [quorum] crit: quorum_initialize failed: 2
Jan 15 18:16:13 pve847 pmxcfs[122430]: [confdb] crit: cmap_initialize failed: 2
Jan 15 18:16:13 pve847 pmxcfs[122430]: [dcdb] crit: cpg_initialize failed: 2
Jan 15 18:16:13 pve847 pmxcfs[122430]: [status] crit: cpg_initialize failed: 2
Jan 15 18:16:19 pve847 pmxcfs[122430]: [quorum] crit: quorum_initialize failed: 2
Jan 15 18:16:19 pve847 pmxcfs[122430]: [confdb] crit: cmap_initialize failed: 2
Jan 15 18:16:19 pve847 pmxcfs[122430]: [dcdb] crit: cpg_initialize failed: 2
Jan 15 18:16:19 pve847 pmxcfs[122430]: [status] crit: cpg_initialize failed: 2
Jan 15 18:16:25 pve847 pmxcfs[122430]: [quorum] crit: quorum_initialize failed: 2
Jan 15 18:16:25 pve847 pmxcfs[122430]: [confdb] crit: cmap_initialize failed: 2
Jan 15 18:16:25 pve847 pmxcfs[122430]: [dcdb] crit: cpg_initialize failed: 2
Jan 15 18:16:25 pve847 pmxcfs[122430]: [status] crit: cpg_initialize failed: 2
Results of cat /etc/pve/corosync.conf

Code:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: prox380
    nodeid: 1
    quorum_votes: 1
    ring0_addr: xxx.xxx.xxx.xxx
  }
  node {
    name: pve847
    nodeid: 2
    quorum_votes: 1
    ring0_addr: xxx.xxx.xxx.xxx
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: CelticWebs
  config_version: 2
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}
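For context (and relevant later in this thread): on Proxmox VE the cluster-wide copy lives in /etc/pve/corosync.conf inside pmxcfs, and it is written out to /etc/corosync/corosync.conf, which is the file the corosync daemon actually reads. On a node that has lost quorum the two copies can drift apart, so a quick sanity check on each node could be something like:

Code:
# any output means the local daemon config differs from the cluster-wide copy
diff /etc/pve/corosync.conf /etc/corosync/corosync.conf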

Output from pveversion -v

Code:
proxmox-ve: 8.1.0 (running kernel: 6.5.11-7-pve)
pve-manager: 8.1.3 (running version: 8.1.3/b46aac3b42da5d15)
proxmox-kernel-helper: 8.1.0
proxmox-kernel-6.5: 6.5.11-7
proxmox-kernel-6.5.11-7-pve-signed: 6.5.11-7
proxmox-kernel-6.5.11-4-pve-signed: 6.5.11-4
ceph-fuse: 17.2.7-pve1
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx8
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-4
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.0
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.3
libpve-access-control: 8.0.7
libpve-apiclient-perl: 3.3.1
libpve-common-perl: 8.1.0
libpve-guest-common-perl: 5.0.6
libpve-http-server-perl: 5.0.5
libpve-network-perl: 0.9.5
libpve-rs-perl: 0.8.7
libpve-storage-perl: 8.0.5
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 5.0.2-4
lxcfs: 5.0.3-pve4
novnc-pve: 1.4.0-3
proxmox-backup-client: 3.1.2-1
proxmox-backup-file-restore: 3.1.2-1
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.2.2
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.3
proxmox-widget-toolkit: 4.1.3
pve-cluster: 8.0.5
pve-container: 5.0.8
pve-docs: 8.1.3
pve-edk2-firmware: 4.2023.08-2
pve-firewall: 5.0.3
pve-firmware: 3.9-1
pve-ha-manager: 4.0.3
pve-i18n: 3.1.5
pve-qemu-kvm: 8.1.2-6
pve-xtermjs: 5.3.0-3
qemu-server: 8.0.10
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.2-pve1

Output from pvecm status

Code:
Cluster information
-------------------
Name:             CelticWebs
Config Version:   2
Transport:        knet
Secure auth:      on

Cannot initialize CMAP service

I don't see anything obvious explaining why it's unable to start. I'm at a total loss.
 
Can you post:
Code:
systemctl status corosync
journalctl -u corosync

Sure, results below.


Code:
Jan 13 22:32:53 pve847 systemd[1]: Starting corosync.service - Corosync Cluster Engine...
Jan 13 22:32:53 pve847 systemd[1]: corosync.service: Main process exited, code=exited, status=8/>
Jan 13 22:32:53 pve847 systemd[1]: corosync.service: Failed with result 'exit-code'.
Jan 13 22:32:53 pve847 systemd[1]: Failed to start corosync.service - Corosync Cluster Engine.
Jan 13 22:33:03 pve847 systemd[1]: Starting corosync.service - Corosync Cluster Engine...
Jan 13 22:33:03 pve847 systemd[1]: corosync.service: Main process exited, code=exited, status=8/>
Jan 13 22:33:03 pve847 systemd[1]: corosync.service: Failed with result 'exit-code'.
Jan 13 22:33:03 pve847 systemd[1]: Failed to start corosync.service - Corosync Cluster Engine.


Code:
corosync.service - Corosync Cluster Engine
     Loaded: loaded (/lib/systemd/system/corosync.service; enabled; preset: enabled)
     Active: failed (Result: exit-code) since Tue 2024-01-16 02:40:22 GMT; 6s ago
       Docs: man:corosync
             man:corosync.conf
             man:corosync_overview
    Process: 320601 ExecStart=/usr/sbin/corosync -f $COROSYNC_OPTIONS (code=exited, status=8)
   Main PID: 320601 (code=exited, status=8)
        CPU: 6ms

Jan 16 02:40:22 pve847 systemd[1]: Starting corosync.service - Corosync Cluster Engine...
Jan 16 02:40:22 pve847 systemd[1]: corosync.service: Main process exited, code=exited, status=8/>
Jan 16 02:40:22 pve847 systemd[1]: corosync.service: Failed with result 'exit-code'.
Jan 16 02:40:22 pve847 systemd[1]: Failed to start corosync.service - Corosync Cluster Engine.
 
Could you try running "corosync -t" and, if that looks okay, "corosync -f"? Please post the output of both commands.
 
Here are the results for both:

Code:
root@pve847:~# corosync -t
Jan 16 09:26:44.841 notice  [MAIN  ] Corosync Cluster Engine exiting normally
root@pve847:~# corosync -f
Jan 16 09:26:54.517 notice  [MAIN  ] Corosync Cluster Engine  starting up
Jan 16 09:26:54.517 info    [MAIN  ] Corosync built-in features: dbus monitoring watchdog systemd xmlconf vqsim nozzle snmp pie relro bindnow
Jan 16 09:26:54.557 notice  [TOTEM ] Initializing transport (Kronosnet).
Jan 16 09:26:54.865 info    [TOTEM ] totemknet initialized
Jan 16 09:26:54.865 info    [KNET  ] pmtud: MTU manually set to: 0
Jan 16 09:26:54.865 info    [KNET  ] common: crypto_nss.so has been loaded from /usr/lib/x86_64-linux-gnu/kronosnet/crypto_nss.so
Jan 16 09:26:54.965 notice  [SERV  ] Service engine loaded: corosync configuration map access [0]
Jan 16 09:26:54.965 info    [QB    ] server name: cmap
Jan 16 09:26:54.965 notice  [SERV  ] Service engine loaded: corosync configuration service [1]
Jan 16 09:26:54.965 info    [QB    ] server name: cfg
Jan 16 09:26:54.965 notice  [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
Jan 16 09:26:54.965 info    [QB    ] server name: cpg
Jan 16 09:26:54.965 notice  [SERV  ] Service engine loaded: corosync profile loading service [4]
Jan 16 09:26:54.965 notice  [SERV  ] Service engine loaded: corosync resource monitoring service [6]
Jan 16 09:26:54.965 warning [WD    ] Watchdog not enabled by configuration
Jan 16 09:26:54.965 warning [WD    ] resource load_15min missing a recovery key.
Jan 16 09:26:54.965 warning [WD    ] resource memory_used missing a recovery key.
Jan 16 09:26:54.965 info    [WD    ] no resources configured.
Jan 16 09:26:54.965 notice  [SERV  ] Service engine loaded: corosync watchdog service [7]
Jan 16 09:26:54.965 notice  [QUORUM] Using quorum provider corosync_votequorum
Jan 16 09:26:54.965 notice  [SERV  ] Service engine loaded: corosync vote quorum service v1.0 [5]
Jan 16 09:26:54.965 info    [QB    ] server name: votequorum
Jan 16 09:26:54.965 notice  [SERV  ] Service engine loaded: corosync cluster quorum service v0.1 [3]
Jan 16 09:26:54.965 info    [QB    ] server name: quorum
Jan 16 09:26:54.965 info    [TOTEM ] Configuring link 0
Jan 16 09:26:54.965 info    [TOTEM ] Configured link number 0: local addr: xxx.xxx.xxx.xxx, port=5405
Jan 16 09:26:54.969 info    [KNET  ] host: host: 1 (passive) best link: 0 (pri: 0)
Jan 16 09:26:54.969 warning [KNET  ] host: host: 1 has no active links
Jan 16 09:26:54.969 info    [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Jan 16 09:26:54.969 warning [KNET  ] host: host: 1 has no active links
Jan 16 09:26:54.969 info    [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Jan 16 09:26:54.969 warning [KNET  ] host: host: 1 has no active links
Jan 16 09:26:54.969 info    [KNET  ] link: Resetting MTU for link 0 because host 2 joined
Jan 16 09:26:54.969 notice  [QUORUM] Sync members[1]: 2
Jan 16 09:26:54.969 notice  [QUORUM] Sync joined[1]: 2
Jan 16 09:26:54.969 notice  [TOTEM ] A new membership (2.3c44d) was formed. Members joined: 2
Jan 16 09:26:54.969 notice  [QUORUM] Members[1]: 2
Jan 16 09:26:54.969 notice  [MAIN  ] Completed service synchronization, ready to provide service.
Jan 16 09:26:56.637 info    [KNET  ] rx: host: 1 link: 0 is up
Jan 16 09:26:56.637 info    [KNET  ] link: Resetting MTU for link 0 because host 1 joined
Jan 16 09:26:56.637 info    [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Jan 16 09:26:56.833 info    [KNET  ] pmtud: PMTUD link change for host: 1 link: 0 from 469 to 1397
Jan 16 09:26:56.833 info    [KNET  ] pmtud: Global data MTU changed to: 1397
Jan 16 09:26:56.853 notice  [QUORUM] Sync members[2]: 1 2
Jan 16 09:26:56.853 notice  [QUORUM] Sync joined[1]: 1
Jan 16 09:26:56.853 notice  [TOTEM ] A new membership (1.3c451) was formed. Members joined: 1
Jan 16 09:26:56.857 error   [CMAP  ] Received config version (3) is different than my config version (2)! Exiting
Jan 16 09:26:56.857 notice  [SERV  ] Unloading all Corosync service engines.
Jan 16 09:26:56.857 info    [QB    ] withdrawing server sockets
Jan 16 09:26:56.857 notice  [SERV  ] Service engine unloaded: corosync vote quorum service v1.0
Jan 16 09:26:56.857 info    [QB    ] withdrawing server sockets
Jan 16 09:26:56.857 notice  [SERV  ] Service engine unloaded: corosync configuration map access
Jan 16 09:26:56.857 info    [QB    ] withdrawing server sockets
Jan 16 09:26:56.857 notice  [SERV  ] Service engine unloaded: corosync configuration service
Jan 16 09:26:56.857 info    [QB    ] withdrawing server sockets
Jan 16 09:26:56.857 notice  [SERV  ] Service engine unloaded: corosync cluster closed process group service v1.01
Jan 16 09:26:56.857 info    [QB    ] withdrawing server sockets
Jan 16 09:26:56.857 notice  [SERV  ] Service engine unloaded: corosync cluster quorum service v0.1
Jan 16 09:26:56.857 notice  [SERV  ] Service engine unloaded: corosync profile loading service
Jan 16 09:26:56.857 notice  [SERV  ] Service engine unloaded: corosync resource monitoring service
Jan 16 09:26:56.857 notice  [SERV  ] Service engine unloaded: corosync watchdog service
Jan 16 09:26:57.661 info    [KNET  ] link: Resetting MTU for link 0 because host 2 joined
Jan 16 09:26:57.661 info    [KNET  ] link: Resetting MTU for link 0 because host 1 joined
Jan 16 09:26:57.661 notice  [MAIN  ] Corosync Cluster Engine exiting normally
 
I noticed it saying the corosync.conf versions were different; that was because I changed the votes on the other node to get it to run. I've updated them so they're the same now.
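(For reference, a quick way to confirm that both nodes really carry the same config_version in both copies might be something like this, run on each node:)

Code:
grep config_version /etc/corosync/corosync.conf /etc/pve/corosync.conf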

Code:
root@pve847:/etc/corosync# corosync -f
Jan 16 09:54:25.009 notice  [MAIN  ] Corosync Cluster Engine  starting up
Jan 16 09:54:25.009 info    [MAIN  ] Corosync built-in features: dbus monitoring watchdog systemd xmlconf vqsim nozzle snmp pie relro bindnow
Jan 16 09:54:25.041 notice  [TOTEM ] Initializing transport (Kronosnet).
Jan 16 09:54:25.341 info    [TOTEM ] totemknet initialized
Jan 16 09:54:25.341 info    [KNET  ] pmtud: MTU manually set to: 0
Jan 16 09:54:25.341 info    [KNET  ] common: crypto_nss.so has been loaded from /usr/lib/x86_64-linux-gnu/kronosnet/crypto_nss.so
Jan 16 09:54:25.441 notice  [SERV  ] Service engine loaded: corosync configuration map access [0]
Jan 16 09:54:25.441 info    [QB    ] server name: cmap
Jan 16 09:54:25.441 notice  [SERV  ] Service engine loaded: corosync configuration service [1]
Jan 16 09:54:25.441 info    [QB    ] server name: cfg
Jan 16 09:54:25.441 notice  [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
Jan 16 09:54:25.441 info    [QB    ] server name: cpg
Jan 16 09:54:25.441 notice  [SERV  ] Service engine loaded: corosync profile loading service [4]
Jan 16 09:54:25.441 notice  [SERV  ] Service engine loaded: corosync resource monitoring service [6]
Jan 16 09:54:25.441 warning [WD    ] Watchdog not enabled by configuration
Jan 16 09:54:25.441 warning [WD    ] resource load_15min missing a recovery key.
Jan 16 09:54:25.441 warning [WD    ] resource memory_used missing a recovery key.
Jan 16 09:54:25.441 info    [WD    ] no resources configured.
Jan 16 09:54:25.441 notice  [SERV  ] Service engine loaded: corosync watchdog service [7]
Jan 16 09:54:25.441 notice  [QUORUM] Using quorum provider corosync_votequorum
Jan 16 09:54:25.441 notice  [SERV  ] Service engine loaded: corosync vote quorum service v1.0 [5]
Jan 16 09:54:25.441 info    [QB    ] server name: votequorum
Jan 16 09:54:25.441 notice  [SERV  ] Service engine loaded: corosync cluster quorum service v0.1 [3]
Jan 16 09:54:25.441 info    [QB    ] server name: quorum
Jan 16 09:54:25.441 info    [TOTEM ] Configuring link 0
Jan 16 09:54:25.441 info    [TOTEM ] Configured link number 0: local addr: 185.70.132.126, port=5405
Jan 16 09:54:25.445 info    [KNET  ] host: host: 1 (passive) best link: 0 (pri: 0)
Jan 16 09:54:25.445 warning [KNET  ] host: host: 1 has no active links
Jan 16 09:54:25.445 info    [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Jan 16 09:54:25.445 warning [KNET  ] host: host: 1 has no active links
Jan 16 09:54:25.445 info    [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Jan 16 09:54:25.445 warning [KNET  ] host: host: 1 has no active links
Jan 16 09:54:25.445 info    [KNET  ] link: Resetting MTU for link 0 because host 2 joined
Jan 16 09:54:25.445 notice  [QUORUM] Sync members[1]: 2
Jan 16 09:54:25.445 notice  [QUORUM] Sync joined[1]: 2
Jan 16 09:54:25.445 notice  [TOTEM ] A new membership (2.3c456) was formed. Members joined: 2
Jan 16 09:54:25.445 notice  [QUORUM] Members[1]: 2
Jan 16 09:54:25.445 notice  [MAIN  ] Completed service synchronization, ready to provide service.
Jan 16 09:54:27.113 info    [KNET  ] rx: host: 1 link: 0 is up
Jan 16 09:54:27.113 info    [KNET  ] link: Resetting MTU for link 0 because host 1 joined
Jan 16 09:54:27.113 info    [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Jan 16 09:54:27.313 info    [KNET  ] pmtud: PMTUD link change for host: 1 link: 0 from 469 to 1397
Jan 16 09:54:27.313 info    [KNET  ] pmtud: Global data MTU changed to: 1397
Jan 16 09:54:27.333 notice  [QUORUM] Sync members[2]: 1 2
Jan 16 09:54:27.333 notice  [QUORUM] Sync joined[1]: 1
Jan 16 09:54:27.333 notice  [TOTEM ] A new membership (1.3c45a) was formed. Members joined: 1
Jan 16 09:54:27.337 notice  [QUORUM] This node is within the primary component and will provide service.
Jan 16 09:54:27.337 notice  [QUORUM] Members[2]: 1 2
Jan 16 09:54:27.337 notice  [MAIN  ] Completed service synchronization, ready to provide service.


Now it says "invalid PVE ticket" when I log in to the faulty node and select the other node.
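(Side note: "invalid PVE ticket" errors between nodes are commonly caused by clock skew or stale node certificates, so it can be worth confirming both nodes agree on the time, for example:)

Code:
# run on both nodes and compare; "System clock synchronized" should say yes
timedatectl
date -u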
 
What is the status of corosync on the other node? What about pmxcfs? The log output above for pve847 looks okay.
 
I tried updating the certs, but it can't:
Code:
root@pve847:~# pvecm updatecerts
waiting for pmxcfs mount to appear and get quorate...
waiting for pmxcfs mount to appear and get quorate...
waiting for pmxcfs mount to appear and get quorate...
waiting for pmxcfs mount to appear and get quorate...
waiting for pmxcfs mount to appear and get quorate...
waiting for pmxcfs mount to appear and get quorate...
got timeout when trying to ensure cluster certificates and base file hierarchy is set up - no quorum (yet) or hung pmxcfs?
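(That timeout is expected while corosync is down on this node: pmxcfs can never become quorate without it. The quorum state can be inspected directly with, for example:)

Code:
pvecm status
corosync-quorumtool -s    # only meaningful once corosync itself is running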
 
I checked the status on the other node (prox380) and this is what it spat out:

Code:
root@prox380:~# systemctl status corosync
● corosync.service - Corosync Cluster Engine
     Loaded: loaded (/lib/systemd/system/corosync.service; enabled; preset: enabled)
     Active: active (running) since Fri 2024-01-05 20:06:19 GMT; 1 week 3 days ago
       Docs: man:corosync
             man:corosync.conf
             man:corosync_overview
   Main PID: 2174 (corosync)
      Tasks: 9 (limit: 154442)
     Memory: 3.5G
        CPU: 1h 48min 1.031s
     CGroup: /system.slice/corosync.service
             └─2174 /usr/sbin/corosync -f

Jan 16 09:54:27 prox380 corosync[2174]:   [QUORUM] Members[2]: 1 2
Jan 16 09:54:27 prox380 corosync[2174]:   [MAIN  ] Completed service synchronization, ready to p>
Jan 16 09:58:38 prox380 corosync[2174]:   [QUORUM] Sync members[1]: 1
Jan 16 09:58:38 prox380 corosync[2174]:   [QUORUM] Sync left[1]: 2
Jan 16 09:58:38 prox380 corosync[2174]:   [TOTEM ] A new membership (1.3c45e) was formed. Member>
Jan 16 09:58:38 prox380 corosync[2174]:   [QUORUM] Members[1]: 1
Jan 16 09:58:38 prox380 corosync[2174]:   [MAIN  ] Completed service synchronization, ready to p>
Jan 16 09:58:39 prox380 corosync[2174]:   [KNET  ] link: host: 2 link: 0 is down
Jan 16 09:58:39 prox380 corosync[2174]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Jan 16 09:58:39 prox380 corosync[2174]:   [KNET  ] host: host: 2 has no active links
 
Did you kill the corosync process on the first node again? Could you try running "systemctl start corosync" again on pve847, and then post the output of

- systemctl status corosync pve-cluster
- journalctl --since "-10min" -u corosync -u pve-cluster

from both nodes?
 
Results from prox380 are:

Code:
root@prox380:~# systemctl status corosync pve-cluster
● corosync.service - Corosync Cluster Engine
     Loaded: loaded (/lib/systemd/system/corosync.service; enabled; preset: enabled)
     Active: active (running) since Fri 2024-01-05 20:06:19 GMT; 1 week 3 days ago
       Docs: man:corosync
             man:corosync.conf
             man:corosync_overview
   Main PID: 2174 (corosync)
      Tasks: 9 (limit: 154442)
     Memory: 3.5G
        CPU: 1h 48min 12.175s
     CGroup: /system.slice/corosync.service
             └─2174 /usr/sbin/corosync -f

Jan 16 10:36:05 prox380 corosync[2174]:   [MAIN  ] Completed service synchronization, ready to provide service.
Jan 16 10:36:05 prox380 corosync[2174]:   [KNET  ] pmtud: Global data MTU changed to: 1397
Jan 16 10:36:20 prox380 corosync[2174]:   [QUORUM] Sync members[1]: 1
Jan 16 10:36:20 prox380 corosync[2174]:   [QUORUM] Sync left[1]: 2
Jan 16 10:36:20 prox380 corosync[2174]:   [TOTEM ] A new membership (1.3c467) was formed. Members left: 2
Jan 16 10:36:20 prox380 corosync[2174]:   [QUORUM] Members[1]: 1
Jan 16 10:36:20 prox380 corosync[2174]:   [MAIN  ] Completed service synchronization, ready to provide service.
Jan 16 10:36:21 prox380 corosync[2174]:   [KNET  ] link: host: 2 link: 0 is down
Jan 16 10:36:21 prox380 corosync[2174]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Jan 16 10:36:21 prox380 corosync[2174]:   [KNET  ] host: host: 2 has no active links

● pve-cluster.service - The Proxmox VE cluster filesystem
     Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled; preset: enabled)
     Active: active (running) since Fri 2024-01-05 20:06:18 GMT; 1 week 3 days ago
    Process: 2073 ExecStart=/usr/bin/pmxcfs (code=exited, status=0/SUCCESS)
   Main PID: 2098 (pmxcfs)
      Tasks: 7 (limit: 154442)
     Memory: 67.2M
        CPU: 14min 57.545s
     CGroup: /system.slice/pve-cluster.service
             └─2098 /usr/bin/pmxcfs

Jan 16 10:36:09 prox380 pmxcfs[2098]: [dcdb] notice: leader is 1/2098
Jan 16 10:36:09 prox380 pmxcfs[2098]: [dcdb] notice: synced members: 1/2098
Jan 16 10:36:09 prox380 pmxcfs[2098]: [dcdb] notice: start sending inode updates
Jan 16 10:36:09 prox380 pmxcfs[2098]: [dcdb] notice: sent all (17) updates
Jan 16 10:36:09 prox380 pmxcfs[2098]: [dcdb] notice: all data is up to date
Jan 16 10:36:09 prox380 pmxcfs[2098]: [status] notice: received all states
Jan 16 10:36:09 prox380 pmxcfs[2098]: [status] notice: all data is up to date
Jan 16 10:36:10 prox380 pmxcfs[2098]: [status] notice: received log
Jan 16 10:36:20 prox380 pmxcfs[2098]: [dcdb] notice: members: 1/2098
Jan 16 10:36:20 prox380 pmxcfs[2098]: [status] notice: members: 1/2098

Code:
root@prox380:~# journalctl --since "-10min" -u corosync -u pve-cluster
Jan 16 10:36:05 prox380 corosync[2174]:   [KNET  ] link: Resetting MTU for link 0 because host 2 joined
Jan 16 10:36:05 prox380 corosync[2174]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Jan 16 10:36:05 prox380 corosync[2174]:   [QUORUM] Sync members[2]: 1 2
Jan 16 10:36:05 prox380 corosync[2174]:   [QUORUM] Sync joined[1]: 2
Jan 16 10:36:05 prox380 corosync[2174]:   [TOTEM ] A new membership (1.3c463) was formed. Members joined: 2
Jan 16 10:36:05 prox380 corosync[2174]:   [QUORUM] Members[2]: 1 2
Jan 16 10:36:05 prox380 corosync[2174]:   [MAIN  ] Completed service synchronization, ready to provide service.
Jan 16 10:36:05 prox380 corosync[2174]:   [KNET  ] pmtud: Global data MTU changed to: 1397
Jan 16 10:36:09 prox380 pmxcfs[2098]: [dcdb] notice: members: 1/2098, 2/5958
Jan 16 10:36:09 prox380 pmxcfs[2098]: [dcdb] notice: starting data syncronisation
Jan 16 10:36:09 prox380 pmxcfs[2098]: [status] notice: members: 1/2098, 2/5958
Jan 16 10:36:09 prox380 pmxcfs[2098]: [status] notice: starting data syncronisation
Jan 16 10:36:09 prox380 pmxcfs[2098]: [dcdb] notice: received sync request (epoch 1/2098/0000000A)
Jan 16 10:36:09 prox380 pmxcfs[2098]: [status] notice: received sync request (epoch 1/2098/0000000A)
Jan 16 10:36:09 prox380 pmxcfs[2098]: [dcdb] notice: received all states
Jan 16 10:36:09 prox380 pmxcfs[2098]: [dcdb] notice: leader is 1/2098
Jan 16 10:36:09 prox380 pmxcfs[2098]: [dcdb] notice: synced members: 1/2098
Jan 16 10:36:09 prox380 pmxcfs[2098]: [dcdb] notice: start sending inode updates
Jan 16 10:36:09 prox380 pmxcfs[2098]: [dcdb] notice: sent all (17) updates
Jan 16 10:36:09 prox380 pmxcfs[2098]: [dcdb] notice: all data is up to date
Jan 16 10:36:09 prox380 pmxcfs[2098]: [status] notice: received all states
Jan 16 10:36:09 prox380 pmxcfs[2098]: [status] notice: all data is up to date
Jan 16 10:36:10 prox380 pmxcfs[2098]: [status] notice: received log
Jan 16 10:36:20 prox380 pmxcfs[2098]: [dcdb] notice: members: 1/2098
Jan 16 10:36:20 prox380 pmxcfs[2098]: [status] notice: members: 1/2098
Jan 16 10:36:20 prox380 corosync[2174]:   [QUORUM] Sync members[1]: 1
Jan 16 10:36:20 prox380 corosync[2174]:   [QUORUM] Sync left[1]: 2
Jan 16 10:36:20 prox380 corosync[2174]:   [TOTEM ] A new membership (1.3c467) was formed. Members left: 2
Jan 16 10:36:20 prox380 corosync[2174]:   [QUORUM] Members[1]: 1
Jan 16 10:36:20 prox380 corosync[2174]:   [MAIN  ] Completed service synchronization, ready to provide service.
Jan 16 10:36:21 prox380 corosync[2174]:   [KNET  ] link: host: 2 link: 0 is down
Jan 16 10:36:21 prox380 corosync[2174]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Jan 16 10:36:21 prox380 corosync[2174]:   [KNET  ] host: host: 2 has no active links
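(On prox380 this shows host 2, i.e. pve847, joining at 10:36:05 and dropping out again at 10:36:20, which is consistent with corosync on pve847 only being up while it was run manually in the foreground. Membership and link state can be watched from the healthy node with, for example:)

Code:
pvecm nodes
corosync-cfgtool -s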


Corosync on pve847 won't start. corosync -f says it's working fine, but if I break out of that and run systemctl start corosync it gives the following:

Code:
root@pve847:~# systemctl start corosync
Job for corosync.service failed because the control process exited with error code.
See "systemctl status corosync.service" and "journalctl -xeu corosync.service" for details.

systemctl status corosync pve-cluster gives:
Code:
× corosync.service - Corosync Cluster Engine
     Loaded: loaded (/lib/systemd/system/corosync.service; enabled; preset: enabled)
     Active: failed (Result: exit-code) since Tue 2024-01-16 10:40:40 GMT; 58s ago
       Docs: man:corosync
             man:corosync.conf
             man:corosync_overview
    Process: 13345 ExecStart=/usr/sbin/corosync -f $COROSYNC_OPTIONS (code=exited, status=8)
   Main PID: 13345 (code=exited, status=8)
        CPU: 5ms

Jan 16 10:40:40 pve847 systemd[1]: Starting corosync.service - Corosync Cluster Engine...
Jan 16 10:40:40 pve847 systemd[1]: corosync.service: Main process exited, code=exited, status=8/n/a
Jan 16 10:40:40 pve847 systemd[1]: corosync.service: Failed with result 'exit-code'.
Jan 16 10:40:40 pve847 systemd[1]: Failed to start corosync.service - Corosync Cluster Engine.

● pve-cluster.service - The Proxmox VE cluster filesystem
     Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled; preset: enabled)
     Active: active (running) since Tue 2024-01-16 10:02:16 GMT; 39min ago
    Process: 5956 ExecStart=/usr/bin/pmxcfs (code=exited, status=0/SUCCESS)
   Main PID: 5958 (pmxcfs)
      Tasks: 6 (limit: 154435)
     Memory: 48.3M
        CPU: 1.353s
     CGroup: /system.slice/pve-cluster.service
             └─5958 /usr/bin/pmxcfs

Jan 16 10:41:21 pve847 pmxcfs[5958]: [dcdb] crit: cpg_initialize failed: 2
Jan 16 10:41:21 pve847 pmxcfs[5958]: [status] crit: cpg_initialize failed: 2
Jan 16 10:41:27 pve847 pmxcfs[5958]: [quorum] crit: quorum_initialize failed: 2
Jan 16 10:41:27 pve847 pmxcfs[5958]: [confdb] crit: cmap_initialize failed: 2
Jan 16 10:41:27 pve847 pmxcfs[5958]: [dcdb] crit: cpg_initialize failed: 2
Jan 16 10:41:27 pve847 pmxcfs[5958]: [status] crit: cpg_initialize failed: 2
Jan 16 10:41:33 pve847 pmxcfs[5958]: [quorum] crit: quorum_initialize failed: 2
Jan 16 10:41:33 pve847 pmxcfs[5958]: [confdb] crit: cmap_initialize failed: 2
Jan 16 10:41:33 pve847 pmxcfs[5958]: [dcdb] crit: cpg_initialize failed: 2
Jan 16 10:41:33 pve847 pmxcfs[5958]: [status] crit: cpg_initialize failed: 2

and journalctl --since "-10min" -u corosync -u pve-cluster shows a repeating pattern of:

Code:
Jan 16 10:33:27 pve847 pmxcfs[5958]: [dcdb] crit: cpg_initialize failed: 2
Jan 16 10:33:27 pve847 pmxcfs[5958]: [status] crit: cpg_initialize failed: 2
Jan 16 10:33:33 pve847 pmxcfs[5958]: [quorum] crit: quorum_initialize failed: 2
Jan 16 10:33:33 pve847 pmxcfs[5958]: [confdb] crit: cmap_initialize failed: 2
Jan 16 10:33:33 pve847 pmxcfs[5958]: [dcdb] crit: cpg_initialize failed: 2
Jan 16 10:33:33 pve847 pmxcfs[5958]: [status] crit: cpg_initialize failed: 2
Jan 16 10:33:39 pve847 pmxcfs[5958]: [quorum] crit: quorum_initialize failed: 2
Jan 16 10:33:39 pve847 pmxcfs[5958]: [confdb] crit: cmap_initialize failed: 2
Jan 16 10:33:39 pve847 pmxcfs[5958]: [dcdb] crit: cpg_initialize failed: 2
Jan 16 10:33:39 pve847 pmxcfs[5958]: [status] crit: cpg_initialize failed: 2
Jan 16 10:33:45 pve847 pmxcfs[5958]: [quorum] crit: quorum_initialize failed: 2
Jan 16 10:33:45 pve847 pmxcfs[5958]: [confdb] crit: cmap_initialize failed: 2
Jan 16 10:33:45 pve847 pmxcfs[5958]: [dcdb] crit: cpg_initialize failed: 2
Jan 16 10:33:45 pve847 pmxcfs[5958]: [status] crit: cpg_initialize failed: 2
Jan 16 10:33:51 pve847 pmxcfs[5958]: [quorum] crit: quorum_initialize failed: 2
Jan 16 10:33:51 pve847 pmxcfs[5958]: [confdb] crit: cmap_initialize failed: 2
Jan 16 10:33:51 pve847 pmxcfs[5958]: [dcdb] crit: cpg_initialize failed: 2
Jan 16 10:33:51 pve847 pmxcfs[5958]: [status] crit: cpg_initialize failed: 2
Jan 16 10:33:57 pve847 pmxcfs[5958]: [quorum] crit: quorum_initialize failed: 2
Jan 16 10:33:57 pve847 pmxcfs[5958]: [confdb] crit: cmap_initialize failed: 2
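(The journal lines above are cut off at the terminal width; the full messages, which usually spell out the actual reason behind the status=8 exit, can be shown with something like:)

Code:
journalctl -u corosync -b --no-pager -n 50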
 
Is the corosync.conf identical now on both nodes, in both locations (/etc/pve/corosync.conf and /etc/corosync/corosync.conf)?
 
Even more confusing: I can now see prox380 from pve847, though its node icon shows a question mark. However, I can see the drives and even connect to the VM console!

It has also reverted the config of the VM on pve847 to what it was when the two nodes were connected. Very strange!
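(A possible explanation, as a side note: when pmxcfs has no quorum it refuses writes and keeps serving its last synced copy of the cluster database, so the GUI can still show data, and old VM configs, for the other node while marking it with a question mark. A couple of quick checks, as a sketch:)

Code:
# confirm the cluster filesystem (pmxcfs) is actually mounted on /etc/pve
mount | grep /etc/pve
# pvestatd is what feeds status data (the node/guest icons) to the GUI
systemctl status pvestatd --no-pager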

[Screenshot attachment: Image 16-01-2024 at 10.53.jpeg]
 
[Screenshot attachment: Image 16-01-2024 at 10.56.jpeg]
This is now really weird: as you can see, it has question marks on it, but it can actually pull data from the node that it thinks it can't see!
 
[Screenshot attachment: Image 16-01-2024 at 10.59.jpeg]
Logging in to prox380 and trying to select pve847, like I can the other way around, gives me this. After accepting the error, I can see the details from the other server!
 
please run "corosync -t" on the problematic node, the systemctl output indicates corosync still chokes on the config file..
 
