Mixed Environment Help

mabdallah

New Member
Dec 12, 2023
Greetings to all community members.
I have an environment where 3 Proxmox hosts share an EQL SAN storage, and another 3 hosts are standalone, each with its own local storage.
To be able to manage all hosts and move VMs around (live migration), we've put them all in a single cluster.
For the 3 VMs with the SAN storage, I wish to use HA. The problem is that HA tries to involve the standalone hosts, even after I created an HA group for the 3 hosts with SAN.
Aside from that, some hosts are crashing, impacting the whole system.
Is it healthy to create a cluster having both standalone and SAN-connected hosts?
The solution I was thinking of was to create 2 separate clusters, 1 for the standalone hosts and another for the SAN-connected hosts for HA, but how would I be able to move VMs around while they're still running?
 
For the 3 VMs with the SAN storage, I wish to use HA, problem is that HA tries to involve the hosts that are standalone, even after creating an HA group for the 3 hosts with SAN.
This sounds like incorrect behavior; however, you have not provided any data supporting your conclusion (configuration, log output, etc.). So the first and simplest explanation is that something is not configured properly.
Is it healthy to create a cluster having both standalone and SAN-connected hosts?
It should be OK if everything is properly configured and isolated. It's certainly not optimal.
The solution I was thinking was to create 2 separate clusters, 1 for the standalone hosts and another for the SAN-connected hosts for HA, but how would I be able to move VMs around while they're still running?
You can't.


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
Please find attached the data extracted from the journalctl command from before and after the host crashed due to HA misbehaving.

Attachment: logfile.txt (221.7 KB)
Please also include the output of pvecm status and ha-manager status.
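If it helps, both outputs can be gathered into one paste-ready report. The following is only a rough Python sketch: `pvecm status` and `ha-manager status` are the commands named above, while the helper names (`run_command`, `collect_report`) and the report layout are my own assumptions.

```python
import subprocess
from typing import Callable, List

# The two commands requested in the thread.
COMMANDS: List[List[str]] = [
    ["pvecm", "status"],
    ["ha-manager", "status"],
]


def run_command(argv: List[str]) -> str:
    """Run one command and return stdout plus stderr, or a note on failure."""
    try:
        result = subprocess.run(argv, capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired) as exc:
        return f"<failed to run: {exc}>"
    return result.stdout + result.stderr


def collect_report(runner: Callable[[List[str]], str] = run_command) -> str:
    """Build a single text report with one header line per command."""
    sections = []
    for argv in COMMANDS:
        title = " ".join(argv)
        sections.append(f"### {title}\n{runner(argv).rstrip()}\n")
    return "\n".join(sections)
```

Running `print(collect_report())` as root on one of the cluster nodes should produce text that can be pasted into a reply. The `runner` parameter exists only so the function can be exercised without a Proxmox host.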
 
I have an environment where 3 proxmox hosts share an EQL SAN storage, and another 3 hosts are standalone each with its own local storage.
To be able to manage all hosts and move VMs around (live migration) we've put them all in a single cluster
So you have a cluster with an even number of hosts:
https://pve.proxmox.com/wiki/Cluster_Manager#:~:text=requirements of corosync.-,Supported Setups,-We support QDevices
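To illustrate why an even node count matters, here is a small sketch of the majority arithmetic. It assumes one vote per node and a strict-majority quorum rule; the function names are made up for this example, and the single extra QDevice vote is an assumption about a typical setup.

```python
def quorum_votes(total_votes: int) -> int:
    """Votes needed for a strict majority of the expected votes."""
    return total_votes // 2 + 1


def partition_is_quorate(total_votes: int, partition_votes: int) -> bool:
    """True if one side of a network split still holds a majority."""
    return partition_votes >= quorum_votes(total_votes)


# Six nodes, one vote each: a 3/3 split leaves neither half quorate.
six_node_quorum = quorum_votes(6)
even_split_ok = partition_is_quorate(6, 3)

# Assumption: a QDevice contributes one extra vote, so three nodes plus
# the QDevice hold 4 of 7 votes and keep quorum.
with_qdevice_ok = partition_is_quorate(7, 4)
```

With six votes the quorum is four, so a clean 3/3 split would leave both halves without quorum, which is presumably why the linked section recommends a QDevice for even-sized clusters.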

For the 3 VMs with the SAN storage, I wish to use HA, problem is that HA tries to involve the hosts that are standalone, even after creating an HA group for the 3 hosts with SAN.
You will need to show your HA group and cluster configuration, including the output of the commands mentioned by @sb-jw.
You also need to show your VM configuration and its HA status. In short, provide a comprehensive set of information.
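One thing worth checking in the group definition is whether the group is marked restricted. As far as I know, an unrestricted group only expresses a preference, so HA may still place a resource on a non-member node when no group member is available. The sketch below parses text in the layout I believe `/etc/pve/ha/groups.cfg` uses; the sample content, the group names, and the node names other than pxclstran1 are hypothetical.

```python
from typing import Dict, List


def parse_ha_groups(text: str) -> Dict[str, Dict[str, str]]:
    """Parse 'group: NAME' sections followed by indented 'key value' lines."""
    groups: Dict[str, Dict[str, str]] = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("group:"):
            current = line.split(":", 1)[1].strip()
            groups[current] = {}
        elif current is not None:
            parts = line.split(None, 1)
            groups[current][parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return groups


def unrestricted_groups(groups: Dict[str, Dict[str, str]]) -> List[str]:
    """Names of groups that do not set 'restricted 1'."""
    return [name for name, opts in groups.items()
            if opts.get("restricted", "0") != "1"]


# Hypothetical sample; a real file would be read from /etc/pve/ha/groups.cfg.
SAMPLE = """\
group: san-hosts
    nodes pxclstran1,pxclstran2,pxclstran3
    restricted 1

group: loose
    nodes pxclstran1
"""

groups = parse_ha_groups(SAMPLE)
needs_review = unrestricted_groups(groups)
```

Any group listed in `needs_review` could, under the assumption above, explain HA touching the standalone hosts, but the actual configuration is needed to confirm that.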

Aside from that, some hosts are crashing impacting the whole system.
This could be related to non-optimal cluster configuration, or could be something completely different.

Please find attached the data extracted from the journalctl command from before and after the host crashed due to HA misbehaving.
If you would prefer to fully offload troubleshooting of your business environment, I'd recommend buying a subscription from Proxmox GmbH, or reaching out to a Proxmox partner.

Having said that, taking a quick glance at the log shows that the reset is clearly cluster related:
Code:
Dec 11 16:08:13 pxclstran1 corosync[1712]:   [QUORUM] Members[6]: 1 2 3 4 5 6
Dec 11 16:08:13 pxclstran1 corosync[1712]:   [MAIN  ] Completed service synchronization, ready to provide service.
Dec 11 16:08:13 pxclstran1 pmxcfs[1604]: [dcdb] notice: members: 1/1030, 2/1604, 3/1934, 4/8106, 5/1695, 6/7986
Dec 11 16:08:13 pxclstran1 pmxcfs[1604]: [dcdb] notice: queue not emtpy - resening 10 messages
Dec 11 16:08:13 pxclstran1 corosync[1712]:   [KNET  ] link: host: 6 link: 0 is down
Dec 11 16:08:13 pxclstran1 corosync[1712]:   [KNET  ] link: host: 4 link: 0 is down
Dec 11 16:08:13 pxclstran1 corosync[1712]:   [KNET  ] host: host: 6 (passive) best link: 0 (pri: 1)
Dec 11 16:08:13 pxclstran1 corosync[1712]:   [KNET  ] host: host: 6 has no active links
Dec 11 16:08:13 pxclstran1 corosync[1712]:   [KNET  ] host: host: 4 (passive) best link: 0 (pri: 1)
Dec 11 16:08:13 pxclstran1 corosync[1712]:   [KNET  ] host: host: 4 has no active links
Dec 11 16:08:13 pxclstran1 pmxcfs[1604]: [dcdb] notice: cpg_send_message retried 85 times
Dec 11 16:08:17 pxclstran1 pvestatd[1759]: status update time (24.169 seconds)
Dec 11 16:08:18 pxclstran1 corosync[1712]:   [KNET  ] rx: host: 6 link: 0 is up
Dec 11 16:08:18 pxclstran1 corosync[1712]:   [KNET  ] link: Resetting MTU for link 0 because host 6 joined
Dec 11 16:08:18 pxclstran1 corosync[1712]:   [KNET  ] host: host: 6 (passive) best link: 0 (pri: 1)
Dec 11 16:08:18 pxclstran1 corosync[1712]:   [KNET  ] pmtud: Global data MTU changed to: 1397
Dec 11 16:08:19 pxclstran1 corosync[1712]:   [TOTEM ] Token has not been received in 4200 ms
Dec 11 16:08:19 pxclstran1 corosync[1712]:   [KNET  ] rx: host: 4 link: 0 is up
Dec 11 16:08:19 pxclstran1 corosync[1712]:   [KNET  ] link: Resetting MTU for link 0 because host 4 joined
Dec 11 16:08:19 pxclstran1 corosync[1712]:   [KNET  ] host: host: 4 (passive) best link: 0 (pri: 1)
Dec 11 16:08:19 pxclstran1 corosync[1712]:   [KNET  ] pmtud: Global data MTU changed to: 1397
Dec 11 16:08:23 pxclstran1 watchdog-mux[1052]: client watchdog expired - disable watchdog updates
-- Boot d42384a7b5324c7bb118dbb80aea7228 --

Plug the message into Google: "watchdog-mux client watchdog expired - disable watchdog updates"
The first results returned are extremely relevant.
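For a larger log, the same sequence (links dropping, a token timeout, then the watchdog expiring) can be pulled out mechanically. This is a minimal Python sketch, assuming the syslog-style line format shown in the excerpt above; the pattern list and the event names are my own choices, not anything defined by corosync or Proxmox.

```python
import re
from typing import Iterable, List, Tuple

# Patterns modelled on the lines in the excerpt above.
PATTERNS = [
    ("link_down", re.compile(r"\[KNET\s*\] link: host: \d+ link: \d+ is down")),
    ("token_timeout", re.compile(r"\[TOTEM\s*\] Token has not been received in \d+ ms")),
    ("watchdog_expired", re.compile(r"watchdog-mux\[\d+\]: client watchdog expired")),
]


def find_events(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (timestamp, event kind) for every line matching a pattern."""
    events = []
    for line in lines:
        for kind, pattern in PATTERNS:
            if pattern.search(line):
                # The first 15 characters hold the 'Dec 11 16:08:13' timestamp.
                events.append((line[:15], kind))
                break
    return events


SAMPLE = [
    "Dec 11 16:08:13 pxclstran1 corosync[1712]:   [KNET  ] link: host: 6 link: 0 is down",
    "Dec 11 16:08:18 pxclstran1 corosync[1712]:   [KNET  ] rx: host: 6 link: 0 is up",
    "Dec 11 16:08:19 pxclstran1 corosync[1712]:   [TOTEM ] Token has not been received in 4200 ms",
    "Dec 11 16:08:23 pxclstran1 watchdog-mux[1052]: client watchdog expired - disable watchdog updates",
]

events = find_events(SAMPLE)
```

Feeding it the full logfile.txt, for example via `find_events(open("logfile.txt"))`, would show whether every reset is preceded by corosync link trouble.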

Good luck

