Mixed Environment Help

mabdallah · Dec 12, 2023

Greetings to all community members.
I have an environment where 3 Proxmox hosts share an EQL SAN storage, and another 3 hosts are standalone, each with its own local storage.
To be able to manage all hosts and move VMs around (live migration), we've put them all in a single cluster.
For the 3 VMs with the SAN storage, I wish to use HA. The problem is that HA tries to involve the standalone hosts, even after I created an HA group for the 3 hosts with SAN.
Aside from that, some hosts are crashing impacting the whole system.
Is it healthy to create a cluster having both standalone and SAN-connected hosts?
The solution I was thinking was to create 2 separate clusters, 1 for the standalone hosts and another for the SAN-connected hosts for HA, but how would I be able to move VMs around while they're still running?

bbgeek17 · Dec 12, 2023

mabdallah said:
For the 3 VMs with the SAN storage, I wish to use HA. The problem is that HA tries to involve the standalone hosts, even after I created an HA group for the 3 hosts with SAN.

This sounds like incorrect behavior; however, you have not provided any data to support your conclusion (configuration, log output, etc.). So the first and simplest explanation is that something is not configured properly.

mabdallah said:
Is it healthy to create a cluster having both standalone and SAN-connected hosts?

It should be OK if everything is properly configured and isolated. It's certainly not optimal.

mabdallah said:
The solution I was thinking was to create 2 separate clusters, 1 for the standalone hosts and another for the SAN-connected hosts for HA, but how would I be able to move VMs around while they're still running?

You can't.

Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox

mabdallah · Dec 13, 2023

Please find attached the journalctl output from before and after the host crashed due to HA misbehaving.

sb-jw · Dec 13, 2023

Please also include the output of pvecm status and ha-manager status.

bbgeek17 · Dec 13, 2023

mabdallah said:
I have an environment where 3 Proxmox hosts share an EQL SAN storage, and another 3 hosts are standalone, each with its own local storage.
To be able to manage all hosts and move VMs around (live migration), we've put them all in a single cluster.

So you have a cluster with an even number of hosts:
https://pve.proxmox.com/wiki/Cluster_Manager#:~:text=requirements%20of%20corosync.-,Supported%20Setups,-We%20support%20QDevices
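
To illustrate why an even node count matters, here is a minimal sketch of the majority arithmetic. It assumes every node carries exactly one vote and that quorum is a strict majority of all configured votes; the function names are hypothetical and not part of any Proxmox tool.

```python
def votes_needed(total_votes: int) -> int:
    """Smallest number of votes that forms a strict majority."""
    return total_votes // 2 + 1


def is_quorate(partition_votes: int, total_votes: int) -> bool:
    """True if a partition holding partition_votes keeps quorum."""
    return partition_votes >= votes_needed(total_votes)


if __name__ == "__main__":
    total = 6  # six hosts, one vote each (assumption)
    print("votes needed:", votes_needed(total))
    # A 3/3 split: neither half reaches the majority.
    print("3 of 6 quorate:", is_quorate(3, total))
    # With one extra tie-breaking vote (for example a QDevice), 3 nodes + 1 vote suffice.
    print("4 of 7 quorate:", is_quorate(4, total + 1))
```

Under these assumptions a 6-node cluster needs 4 votes, so if the cluster splits 3/3 (for example SAN hosts on one side, standalone hosts on the other), neither side is quorate.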

mabdallah said:
For the 3 VMs with the SAN storage, I wish to use HA. The problem is that HA tries to involve the standalone hosts, even after I created an HA group for the 3 hosts with SAN.

You will need to show your HA group and cluster configuration, including the output of the commands mentioned by @sb-jw.
You also need to show your VM configuration and its HA status. In short, provide a comprehensive set of information.

mabdallah said:
Aside from that, some hosts are crashing impacting the whole system.

This could be related to non-optimal cluster configuration, or could be something completely different.

mabdallah said:
Please find attached the data extracted from the journalctl command from before and after the host crashed due to HA misbehaving.

If you would prefer to fully offload troubleshooting of your business environment, I'd recommend buying a subscription from Proxmox GmbH, or reaching out to a Proxmox partner.

Having said that, a quick glance at the log shows that the reset is clearly cluster-related:

Code:

Dec 11 16:08:13 pxclstran1 corosync[1712]:   [QUORUM] Members[6]: 1 2 3 4 5 6
Dec 11 16:08:13 pxclstran1 corosync[1712]:   [MAIN  ] Completed service synchronization, ready to provide service.
Dec 11 16:08:13 pxclstran1 pmxcfs[1604]: [dcdb] notice: members: 1/1030, 2/1604, 3/1934, 4/8106, 5/1695, 6/7986
Dec 11 16:08:13 pxclstran1 pmxcfs[1604]: [dcdb] notice: queue not emtpy - resening 10 messages
Dec 11 16:08:13 pxclstran1 corosync[1712]:   [KNET  ] link: host: 6 link: 0 is down
Dec 11 16:08:13 pxclstran1 corosync[1712]:   [KNET  ] link: host: 4 link: 0 is down
Dec 11 16:08:13 pxclstran1 corosync[1712]:   [KNET  ] host: host: 6 (passive) best link: 0 (pri: 1)
Dec 11 16:08:13 pxclstran1 corosync[1712]:   [KNET  ] host: host: 6 has no active links
Dec 11 16:08:13 pxclstran1 corosync[1712]:   [KNET  ] host: host: 4 (passive) best link: 0 (pri: 1)
Dec 11 16:08:13 pxclstran1 corosync[1712]:   [KNET  ] host: host: 4 has no active links
Dec 11 16:08:13 pxclstran1 pmxcfs[1604]: [dcdb] notice: cpg_send_message retried 85 times
Dec 11 16:08:17 pxclstran1 pvestatd[1759]: status update time (24.169 seconds)
Dec 11 16:08:18 pxclstran1 corosync[1712]:   [KNET  ] rx: host: 6 link: 0 is up
Dec 11 16:08:18 pxclstran1 corosync[1712]:   [KNET  ] link: Resetting MTU for link 0 because host 6 joined
Dec 11 16:08:18 pxclstran1 corosync[1712]:   [KNET  ] host: host: 6 (passive) best link: 0 (pri: 1)
Dec 11 16:08:18 pxclstran1 corosync[1712]:   [KNET  ] pmtud: Global data MTU changed to: 1397
Dec 11 16:08:19 pxclstran1 corosync[1712]:   [TOTEM ] Token has not been received in 4200 ms
Dec 11 16:08:19 pxclstran1 corosync[1712]:   [KNET  ] rx: host: 4 link: 0 is up
Dec 11 16:08:19 pxclstran1 corosync[1712]:   [KNET  ] link: Resetting MTU for link 0 because host 4 joined
Dec 11 16:08:19 pxclstran1 corosync[1712]:   [KNET  ] host: host: 4 (passive) best link: 0 (pri: 1)
Dec 11 16:08:19 pxclstran1 corosync[1712]:   [KNET  ] pmtud: Global data MTU changed to: 1397
Dec 11 16:08:23 pxclstran1 watchdog-mux[1052]: client watchdog expired - disable watchdog updates
-- Boot d42384a7b5324c7bb118dbb80aea7228 --

Plug the message into Google: "watchdog-mux client watchdog expired - disable watchdog updates"
The first results returned are extremely relevant.
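
For longer journals, a small script can pull out the same signals that stand out in the log excerpt above: corosync KNET link-down events and the watchdog expiry. This is only a rough sketch; the regular expression is modeled on the message formats shown in that excerpt, and the `summarize` helper is a hypothetical name, not an existing tool.

```python
import re

# Pattern modeled on the corosync lines in the excerpt, e.g.
# "[KNET  ] link: host: 6 link: 0 is down"
LINK_DOWN = re.compile(r"\[KNET\s*\] link: host: (\d+) link: (\d+) is down")
WATCHDOG_EXPIRED = "client watchdog expired"


def summarize(lines):
    """Return (hosts whose link went down, whether the watchdog expired)."""
    down_hosts = []
    watchdog_expired = False
    for line in lines:
        match = LINK_DOWN.search(line)
        if match:
            host = int(match.group(1))
            if host not in down_hosts:
                down_hosts.append(host)
        if "watchdog-mux" in line and WATCHDOG_EXPIRED in line:
            watchdog_expired = True
    return down_hosts, watchdog_expired


if __name__ == "__main__":
    # Lines copied from the log excerpt in this post.
    sample = [
        "Dec 11 16:08:13 pxclstran1 corosync[1712]:   [KNET  ] link: host: 6 link: 0 is down",
        "Dec 11 16:08:13 pxclstran1 corosync[1712]:   [KNET  ] link: host: 4 link: 0 is down",
        "Dec 11 16:08:19 pxclstran1 corosync[1712]:   [TOTEM ] Token has not been received in 4200 ms",
        "Dec 11 16:08:23 pxclstran1 watchdog-mux[1052]: client watchdog expired - disable watchdog updates",
    ]
    hosts, expired = summarize(sample)
    print("links down to hosts:", hosts)
    print("watchdog expired:", expired)
```

To run it against a saved journal, read the file's lines and pass them to `summarize` in place of `sample`.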

Good luck

mabdallah · Dec 14, 2023

sb-jw said:
Please also include the output of pvecm status and ha-manager status.

Hello,

The HA group has currently been removed to avoid crashes. I can re-create and enable it if needed.

Appreciate the help.
