cluster not working

cybernard · Jun 17, 2024

Even though I can independently login to each node, when I try to do any thing it can't find it.

1. time in sync
2. pvecm updatecerts -force

(re)generate node files
generate new node certificate
got timeout when trying to ensure cluster certificates and base file hierarchy is set up - no quorum (yet) or hung pmxcfs?

I'm trying to migrate a VM to a different proxmox server.

fiona · Jun 17, 2024

Hi,
what is the output of journalctl -b -u pve-cluster.service -u corosync.service and pvecm status as well as pveversion -v?

cybernard · Jun 17, 2024

Jun 17 06:07:15 pm pmxcfs[23540]: [dcdb] crit: cpg_send_message failed: 9
Jun 17 06:07:15 pm pmxcfs[23540]: [dcdb] crit: cpg_send_message failed: 9
Jun 17 06:07:15 pm pmxcfs[23540]: [dcdb] crit: cpg_send_message failed: 9
Jun 17 06:07:15 pm pmxcfs[23540]: [dcdb] crit: cpg_send_message failed: 9
Jun 17 06:07:15 pm pmxcfs[23540]: [dcdb] crit: cpg_send_message failed: 9
Jun 17 06:07:15 pm pmxcfs[23540]: [dcdb] crit: cpg_send_message failed: 9
Jun 17 06:07:15 pm pmxcfs[23540]: [dcdb] crit: cpg_send_message failed: 9
Jun 17 06:25:07 pm corosync[23548]: [TOTEM ] FAILED TO RECEIVE
Jun 17 06:25:07 pm pmxcfs[23540]: [dcdb] notice: members: 1/23540
Jun 17 06:25:07 pm pmxcfs[23540]: [dcdb] notice: all data is up to date
Jun 17 06:25:07 pm corosync[23548]: [QUORUM] Sync members[1]: 1
Jun 17 06:25:07 pm corosync[23548]: [QUORUM] Sync left[1]: 2
Jun 17 06:25:07 pm corosync[23548]: [TOTEM ] A new membership (1.7f5) was formed. Members left: 2
Jun 17 06:25:07 pm corosync[23548]: [TOTEM ] Failed to receive the leave message. failed: 2
Jun 17 06:25:07 pm corosync[23548]: [QUORUM] This node is within the non-primary component and will NOT provide any>
Jun 17 06:25:07 pm corosync[23548]: [QUORUM] Members[1]: 1
Jun 17 06:25:07 pm corosync[23548]: [MAIN ] Completed service synchronization, ready to provide service.
Jun 17 06:25:07 pm corosync[23548]: [QUORUM] Sync members[2]: 1 2
Jun 17 06:25:07 pm corosync[23548]: [QUORUM] Sync joined[1]: 2
Jun 17 06:25:07 pm corosync[23548]: [TOTEM ] A new membership (1.7f9) was formed. Members joined: 2
Jun 17 06:25:07 pm corosync[23548]: [QUORUM] This node is within the primary component and will provide service.
Jun 17 06:25:07 pm corosync[23548]: [QUORUM] Members[2]: 1 2
Jun 17 06:25:07 pm corosync[23548]: [MAIN ] Completed service synchronization, ready to provide service.
Jun 17 06:25:07 pm pmxcfs[23540]: [dcdb] notice: cpg_send_message retried 1 times
Jun 17 06:25:07 pm pmxcfs[23540]: [status] notice: node lost quorum
Jun 17 06:25:07 pm pmxcfs[23540]: [status] notice: node has quorum
Jun 17 06:25:07 pm pmxcfs[23540]: [status] notice: members: 1/23540
Jun 17 06:25:07 pm pmxcfs[23540]: [status] notice: all data is up to date
Jun 17 06:25:07 pm pmxcfs[23540]: [status] notice: dfsm_deliver_queue: queue length 864
Jun 17 06:25:07 pm pmxcfs[23540]: [status] notice: members: 1/23540, 2/1756
Jun 17 06:25:07 pm pmxcfs[23540]: [status] notice: starting data syncronisation
Jun 17 06:25:07 pm pmxcfs[23540]: [status] notice: received sync request (epoch 1/23540/00000069)
Jun 17 06:32:47 pm corosync[23548]: [KNET ] link: host: 2 link: 0 is down
Jun 17 06:32:47 pm corosync[23548]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Jun 17 06:32:47 pm corosync[23548]: [KNET ] host: host: 2 has no active links
Jun 17 06:32:47 pm corosync[23548]: [KNET ] link: Resetting MTU for link 0 because host 2 joined
Jun 17 06:32:47 pm corosync[23548]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Jun 17 06:32:48 pm corosync[23548]: [KNET ] pmtud: Global data MTU changed to: 1397
Jun 17 06:33:24 pm corosync[23548]: [KNET ] link: host: 2 link: 0 is down
Jun 17 06:33:24 pm corosync[23548]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Jun 17 06:33:24 pm corosync[23548]: [KNET ] host: host: 2 has no active links
Jun 17 06:33:24 pm corosync[23548]: [KNET ] link: Resetting MTU for link 0 because host 2 joined
Jun 17 06:33:24 pm corosync[23548]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Jun 17 06:33:24 pm corosync[23548]: [KNET ] pmtud: Global data MTU changed to: 1397

------------------------------------------------------

Cluster information
-------------------
Name: MysteryInc
Config Version: 2
Transport: knet
Secure auth: on

Quorum information
------------------
Date: Mon Jun 17 06:39:38 2024
Quorum provider: corosync_votequorum
Nodes: 2
Node ID: 0x00000001
Ring ID: 1.7f9
Quorate: Yes

Votequorum information
----------------------
Expected votes: 2
Highest expected: 2
Total votes: 2
Quorum: 2
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000001 1 192.168.3.102 (local)
0x00000002 1 192.168.3.227
root@pm:~#
--------------------------------------------------------------

proxmox-ve: 8.2.0 (running kernel: 6.8.4-2-pve)
pve-manager: 8.2.2 (running version: 8.2.2/9355359cd7afbae4)
proxmox-kernel-helper: 8.1.0
proxmox-kernel-6.8: 6.8.4-2
proxmox-kernel-6.8.4-2-pve-signed: 6.8.4-2
ceph-fuse: 17.2.7-pve3
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx8
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-4
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.0
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.3
libpve-access-control: 8.1.4
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.0.6
libpve-cluster-perl: 8.0.6
libpve-common-perl: 8.2.1
libpve-guest-common-perl: 5.1.1
libpve-http-server-perl: 5.1.0
libpve-network-perl: 0.9.8
libpve-rs-perl: 0.8.8
libpve-storage-perl: 8.2.1
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.4.0-3
proxmox-backup-client: 3.2.0-1
proxmox-backup-file-restore: 3.2.0-1
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.2.3
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.6
proxmox-widget-toolkit: 4.2.1
pve-cluster: 8.0.6
pve-container: 5.0.10
pve-docs: 8.2.1
pve-edk2-firmware: 4.2023.08-4
pve-esxi-import-tools: 0.7.0
pve-firewall: 5.0.5
pve-firmware: 3.11-1
pve-ha-manager: 4.0.4
pve-i18n: 3.2.2
pve-qemu-kvm: 8.1.5-5
pve-xtermjs: 5.3.0-3
qemu-server: 8.2.1
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.3-pve2

fiona · Jun 17, 2024

Okay, so the quorum seems to be reached, but the link seems a bit flaky:

Code:

Jun 17 06:32:47 pm corosync[23548]:   [KNET  ] link: host: 2 link: 0 is down
Jun 17 06:32:47 pm corosync[23548]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Jun 17 06:32:47 pm corosync[23548]:   [KNET  ] host: host: 2 has no active links
Jun 17 06:32:47 pm corosync[23548]:   [KNET  ] link: Resetting MTU for link 0 because host 2 joined
Jun 17 06:32:47 pm corosync[23548]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Jun 17 06:32:48 pm corosync[23548]:   [KNET  ] pmtud: Global data MTU changed to: 1397
Jun 17 06:33:24 pm corosync[23548]:   [KNET  ] link: host: 2 link: 0 is down
Jun 17 06:33:24 pm corosync[23548]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Jun 17 06:33:24 pm corosync[23548]:   [KNET  ] host: host: 2 has no active links
Jun 17 06:33:24 pm corosync[23548]:   [KNET  ] link: Resetting MTU for link 0 because host 2 joined
Jun 17 06:33:24 pm corosync[23548]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Jun 17 06:33:24 pm corosync[23548]:   [KNET  ] pmtud: Global data MTU changed to: 1397

Please ensure that your network connection between the hosts is stable.

What do you get with journalctl -b -u pvestatd.service? You can also check in your browser's developer tools to see which API request hangs (usually Ctrl+Shift+C and then open the Network tab).

cybernard · Jun 17, 2024

root@pm:~# journalctl -b -u pvestatd.service
Jun 16 15:31:32 pm systemd[1]: Starting pvestatd.service - PVE Status Daemon...
Jun 16 15:31:33 pm pvestatd[1053]: starting server
Jun 16 15:31:33 pm systemd[1]: Started pvestatd.service - PVE Status Daemon.
Jun 17 06:46:37 pm systemd[1]: Stopping pvestatd.service - PVE Status Daemon...
Jun 17 06:46:38 pm pvestatd[1053]: received signal TERM
Jun 17 06:46:38 pm pvestatd[1053]: server closing
Jun 17 06:46:38 pm pvestatd[1053]: server stopped
Jun 17 06:46:39 pm systemd[1]: pvestatd.service: Deactivated successfully.
Jun 17 06:46:39 pm systemd[1]: Stopped pvestatd.service - PVE Status Daemon.
Jun 17 06:46:39 pm systemd[1]: pvestatd.service: Consumed 3min 23.547s CPU time.

I have no known issues with my network, I have 2 switches each one directly connected to source of internet.

cybernard · Jun 17, 2024

Before I added them together in a cluster everything was fine, now it is not.
This AM I couldn't even access the web GUI on the original/1st node of the cluster.

The new node has 4 nic's in 1 PCIe card plus an integrated NIC and only 1 port in use.
The 1st node also has at least 1 unused NIC port.

fiona · Jun 18, 2024

cybernard said:
root@pm:~# journalctl -b -u pvestatd.service
Jun 16 15:31:32 pm systemd[1]: Starting pvestatd.service - PVE Status Daemon...
Jun 16 15:31:33 pm pvestatd[1053]: starting server
Jun 16 15:31:33 pm systemd[1]: Started pvestatd.service - PVE Status Daemon.
Jun 17 06:46:37 pm systemd[1]: Stopping pvestatd.service - PVE Status Daemon...
Jun 17 06:46:38 pm pvestatd[1053]: received signal TERM
Jun 17 06:46:38 pm pvestatd[1053]: server closing
Jun 17 06:46:38 pm pvestatd[1053]: server stopped
Jun 17 06:46:39 pm systemd[1]: pvestatd.service: Deactivated successfully.
Jun 17 06:46:39 pm systemd[1]: Stopped pvestatd.service - PVE Status Daemon.
Jun 17 06:46:39 pm systemd[1]: pvestatd.service: Consumed 3min 23.547s CPU time.

I have no known issues with my network, I have 2 switches each one directly connected to source of internet.

So the pvestatd got stopped for some reason? Please try starting it with systemctl reload-or-restart pvestatd.service and then check with systemctl status pvestatd.service if it is Active: active (running) on both nodes.

cybernard · Jun 18, 2024

Also try to dump the virtual machine and it claims its locked due to migration even though the migration failed to start in the first place.
Everything is taking much longer.
qm unlock 110
Takes at least 5 min to process, and then doesn't seem to have done anything.

systemctl restart pvestatd.service
Runs for 5 mins and quits and gui doesn't work

How do I return both nodes to single entities so I can start over.

fiona · Jun 19, 2024

cybernard said:
Also try to dump the virtual machine and it claims its locked due to migration even though the migration failed to start in the first place.
Everything is taking much longer.
qm unlock 110
Takes at least 5 min to process, and then doesn't seem to have done anything.

systemctl restart pvestatd.service
Runs for 5 mins and quits and gui doesn't work

Please share the system journal from the current boot for both nodes. Did you verify that the nodes can ping each other via the addressees defined for corosync?

cybernard said:
How do I return both nodes to single entities so I can start over.

You would need to remove one of the nodes from the cluster and then reinstall: https://pve.proxmox.com/pve-docs/chapter-pvecm.html#_remove_a_cluster_node However, please note that you cannot join a node that already has guests to a cluster, so you would need to first create backups or migrate.

cybernard · Jun 19, 2024

I formatted and reloaded the empty proxmox server
I used CLI to move the VM to the fresh proxmox server
Reformated and reloaded the 1st node

Cluster seem way to painful to use.
If I can use 3 commands to migrate via CLI with no cluster why can't I do it via the GUI with no cluster?

Part of me would like a cluster, and the other half says if clusters can totally wreck your proxmox server why bother?

fiona · Jun 25, 2024

Well, there are a lot of people successfully running clusters out there. The remote migration commands are actually preparatory work for exactly such a GUI: https://forum.proxmox.com/threads/multi-datacenter-management.66830/post-515033

cybernard · Jun 26, 2024

How are they using cluster? What prep work are they doing so that the 2nd and more unit join the cluster successfully?
I have tried 10 times and did reformats, and I have yet to have a successful cluster.

Sure I can create a cluster and its fine so long as I don't add any nodes.

I did
ssh node 1 to node 2 save cert
node 2 to node 1 save cert
updatecerts

Nothing so far has helped.

So far I am brand new to proxmox, I have not enabled any other features that might conflict with each other.

When every I tried the 2nd node always gets "connection failed" or "Loading...." or other similar error indicating there is no connection even though I could communicate with both of them just fine before the cluster join attempt.

Search

Search

cluster not working

cybernard

New Member

fiona

Proxmox Staff Member

cybernard

New Member

fiona

Proxmox Staff Member

cybernard

New Member

cybernard

New Member

fiona

Proxmox Staff Member

cybernard

New Member

fiona

Proxmox Staff Member

cybernard

New Member

fiona

Proxmox Staff Member

cybernard

New Member