Just sharing an issue some people may run into in the future: broken pipe errors when migrating a VM, running a replication job, and so on.
BACKGROUND
I recently upgraded the CPU and motherboard in my alternate node. There were some issues during the upgrade, and since I had already moved all my VMs to the primary node, I decided to remove the alternate from the cluster and reinstall it from scratch. I gave it the same name as the original alternate node (pve-alt, specifically) and re-added it to the cluster. Everything worked fine until I tried migrating VMs back to pve-alt. For very small VMs it would SOMETIMES work: I got one small VM moved over, set up replication and HA for it, and could even shift it back and forth by changing the HA group. For any VM with a disk larger than 16 GB, though, migration/replication/etc. would fail with a broken pipe error and a failed task.
ISSUE
The root of the issue was that re-adding the node added its new SSH keys to /etc/ssh/ssh_known_hosts, but the original node's keys were never removed. I resolved it by deleting the keys from /etc/ssh/ssh_known_hosts that belonged to the old installation and rebooting both nodes. After that I hit a second problem: the installation SSD in the alternate node was failing (I kept getting broken pipes and other errors, and even a fresh install under a different node name failed). With a new drive, a different node name, and a re-add to the cluster, everything is working like clockwork again.
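For anyone who wants the concrete steps, this is roughly what the cleanup looks like. Treat it as a sketch, not gospel: it assumes the re-installed node is named pve-alt (as mine was), that 192.168.1.50 stands in for whatever IP your node actually uses, and that you run it on each remaining cluster node.

    # List any entries recorded for the old node, by name and by IP (hypothetical IP shown)
    ssh-keygen -F pve-alt -f /etc/ssh/ssh_known_hosts
    ssh-keygen -F 192.168.1.50 -f /etc/ssh/ssh_known_hosts

    # Remove the stale entries
    ssh-keygen -R pve-alt -f /etc/ssh/ssh_known_hosts
    ssh-keygen -R 192.168.1.50 -f /etc/ssh/ssh_known_hosts

    # Caveat: on a cluster node /etc/ssh/ssh_known_hosts is usually a symlink to
    # /etc/pve/priv/known_hosts, and ssh-keygen -R rewrites the file (which can
    # replace the symlink with a plain file). If that worries you, just open
    # /etc/pve/priv/known_hosts in an editor and delete the lines for the old node.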
TLDR
When you remove a node from the cluster, clean up its keys in /etc/ssh/ssh_known_hosts (comment if you know other places such keys need to be cleaned up that I didn't mention), and/or when adding a new node to the cluster, consider giving it a unique name that doesn't overlap with a previously removed node. Also... don't use SSDs that are 8 years old.
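If you want a quick sanity check after the cleanup (again, just a sketch, with pve-alt standing in for your target node):

    # From the node you'll migrate FROM, confirm SSH to the target completes without
    # a host-key mismatch; a clean login means migrations should stop hitting broken pipes
    ssh root@pve-alt true

    # Proxmox also ships a helper that regenerates/distributes cluster certs and SSH key
    # files; I believe running it on each node after the cleanup doesn't hurt
    pvecm updatecerts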
Hope this helps someone in the future!