That is planned but can be done only after we add network cards to the nodes.
I checked the logs, the output is in the first post.
Now I restarted the service again and this is the output since starting it (the "[TOTEM ] FAILED TO RECEIVE" sections just repeat after this):
eb 04...
Hi,
We have a cluster of 4 PVE servers, all were working fine.
Corosync works on the main management interface (vmbr0).
We stopped one node and after booting up again, corosync failed to work on it (no updates were performed before shutdown).
The other nodes work. Tcpdump shows UDP packets...
Stop pve-ha-lrm on all nodes.
Stop pve-ha-crm on all nodes.
Stop corosync on all nodes.
Then:
Start corosync on all nodes.
Start pve-ha-crm on all nodes.
Start pve-ha-lrm on all nodes.
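For reference, the sequence above sketched with systemctl. The service names are the standard PVE ones; the node list and the use of ssh loops are assumptions, adjust to how you normally reach your nodes:

```shell
# Hypothetical node list - replace with your cluster's node names.
NODES="pve1 pve2 pve3 pve4"

# Stop the HA stack before corosync so no node fences itself
# while cluster communication is down.
for n in $NODES; do ssh "root@$n" systemctl stop pve-ha-lrm; done
for n in $NODES; do ssh "root@$n" systemctl stop pve-ha-crm; done
for n in $NODES; do ssh "root@$n" systemctl stop corosync; done

# Bring everything back up in the reverse order.
for n in $NODES; do ssh "root@$n" systemctl start corosync; done
for n in $NODES; do ssh "root@$n" systemctl start pve-ha-crm; done
for n in $NODES; do ssh "root@$n" systemctl start pve-ha-lrm; done
```

The order matters: the local resource manager (lrm) goes down first and comes up last, so HA never acts on a cluster that is mid-restart.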
Thanks.
This worked.
Edit - some details:
Basically I identified the stuck device-mapper entry with
dmsetup table | grep 135
then removed it with
dmsetup remove san_lvm_volgroup-vm--135--disk--0
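As a sketch, the whole cleanup looks like this. VMID 135 and the volume-group name come from this thread; substitute whatever disk is stuck on your node:

```shell
# Find leftover device-mapper entries belonging to the VM's disk (VMID 135 here).
dmsetup table | grep -- '--135--'

# Double-check the mapping's open count before touching it;
# only remove it if nothing is actually using the device.
dmsetup info san_lvm_volgroup-vm--135--disk--0

# Remove the stale mapping so the LVM-over-iSCSI disk can be activated again.
dmsetup remove san_lvm_volgroup-vm--135--disk--0
```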
I managed to move the VM to another node and move its disk to the SAN; then I could move the VM to a third node.
But I still get this error when I try to move it back to the node I originally encountered the error on.
So that server is the only one with these issues.
Hi,
I still get the same error. I moved the drive to another storage for now. The issue seems to be related to iSCSI having something stuck somewhere.
Pveversion (we use the enterprise repos):
# pveversion -v
proxmox-ve: 6.1-2 (running kernel: 5.3.13-3-pve)
pve-manager: 6.1-7 (running version...
We have a 3-node cluster and a SAN connected via iSCSI (using LVM over iSCSI).
I had to do some maintenance and move machines around, and I noticed that one specific machine could not be migrated between nodes because of errors related to its storage.
Initially it gave me some errors...
Well in this case it most likely will make no difference.
Just shut down the VM and start it up again when possible.
Edit:
I see the issue is lock-related, you could try "qm unlock VMID"
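A minimal sketch of that check, using VMID 135 from this thread purely as an example:

```shell
# Show whether the VM config still carries a stale lock (e.g. "lock: migrate").
qm config 135 | grep '^lock:'

# Clear the lock so the VM can be managed again.
qm unlock 135
```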
Well, I just checked and we are still on qemu 4.0.0-5, so I cannot tell you anything for sure.
In these cases we just shut the machines down and started them up again.
But the issue pretty much went away after we moved the cluster traffic to a separate dedicated network, without any package updates.
1. You would have to reinstall them, ideally. Strictly speaking it is possible to remove and re-add them, but you would have to make really sure everything is:
- cleaned thoroughly from the distributed cluster config
- not stuck in the cluster config
- not cached on the node.
I don't know for sure but...
It seems to me that the new corosync using unicast is more finicky than the old one that used multicast. We have a 3-node cluster and we had corosync-related issues after upgrading to PVE 6. What we did is split the management network (4x 1Gbit links) into two 2x1Gbit bonds, one for management and...
You can use an LE certificate internally too. The browser doesn't care how the DNS name was resolved or what IP it points to (so you can use the hosts file for this too; you don't even need an internal DNS to handle it).
But for generating the certificate you need internet connection and DNS...
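As a sketch of the hosts-file trick: the hostname and IP below are made up, and the snippet works on a scratch copy to illustrate; on a real client you would append the line to /etc/hosts itself:

```shell
# Hypothetical internal host and IP - adjust to your setup.
ENTRY="10.0.0.11 pve1.example.com"

# Work on a scratch copy for illustration; a real client edits /etc/hosts.
cp /etc/hosts /tmp/hosts.demo
echo "$ENTRY" >> /tmp/hosts.demo

# The client now resolves the certificate's DNS name to the internal IP.
grep pve1.example.com /tmp/hosts.demo
```

The certificate itself stays a normal publicly-issued LE cert; only the name resolution is kept internal.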
Ok, good. I marked this as solved.
Interestingly, after we separated the cluster network from the management network we did not have any issues like this anymore.
This happened the day I submitted this thread, or a day after; since then I have seen no problematic VMs and we did not update...
I know what the issue was, I started this thread...
I tried to ask the Proxmox staff for details about what caused this issue since they said they fixed it.