I was upgrading the whole cluster, which consists of two nodes that run 24/7 and two nodes that only get powered on for labbing and then shut down again (they're too power-hungry and loud to leave running). All nodes were previously on 4.4 with the latest updates. I upgraded the lab nodes to 5.0 (release) first.
When I upgraded lab node 1, I couldn't migrate VMs on or off it, whether to lab node 2 or to the other nodes. I can't remember the exact error, but it was extremely vague and, I'd say, useless.
I then upgraded lab node 2, then "prod" node 2 (after moving its VMs off). At that point I could move VMs between lab node 1, lab node 2 and prod node 2. Keep in mind prod node 1 was still on 4.4. Lab nodes 1 and 2 have no LACP, but both prod nodes run a 2x1GbE LACP bond and have been doing so for a good few weeks now.
Prod node 2 had no LACP issues after the upgrade; I didn't have to change anything in its network config.
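For reference, the bond setup on both prod nodes is roughly the shape below. This is a sketch, not my exact config: the addresses are placeholders, and vmbr0 is just the usual Proxmox bridge name.

    # /etc/network/interfaces -- approximate shape of the 2x1GbE LACP setup
    auto lo
    iface lo inet loopback

    iface eth0 inet manual
    iface eth1 inet manual

    auto bond0
    iface bond0 inet manual
        bond-slaves eth0 eth1
        bond-mode 802.3ad        # LACP; the switch side has a matching LAG
        bond-miimon 100

    auto vmbr0
    iface vmbr0 inet static
        address 192.0.2.10
        netmask 255.255.255.0
        gateway 192.0.2.1
        bridge_ports bond0
        bridge_stp off
        bridge_fd 0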
And then I upgraded prod node 1... The upgrade itself showed no errors and the node appeared to come up just fine; I didn't know yet that the logs were already throwing errors. I migrated one VM onto it and it seemed good, so I migrated a whole bunch of VMs onto it, and that's when things got... concerning.
In that second round I batch-migrated a whole bunch of VMs, 3 in parallel. It went so fast that I figured I'd try the reverse direction with 6 in parallel... and that's when I got the error about the public key, on EVERY VM that tried to migrate.
I then dropped to the CLI and could SSH from prod 1 to prod 2 with zero problems, so the keys were still trusted and the error was bunk. I tried a whole bunch of things: checking whether root login was allowed, verifying the key files were accessible, and so on. Everything looked correct, yet the error kept coming. About this time I started looking at the logs, and oh shit, now I saw the bond0 error too.
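I didn't keep exact notes, but the checks amounted to roughly the following (hostnames are placeholders; /etc/pve/priv is where Proxmox keeps the cluster-wide SSH material):

    # Key-based root SSH between the nodes, with no password prompt allowed
    ssh -o BatchMode=yes root@prod2 hostname

    # Cluster-wide authorized_keys / known_hosts on the pmxcfs mount
    ls -l /etc/pve/priv/authorized_keys /etc/pve/priv/known_hosts

    # Is root login permitted in sshd?
    grep -i PermitRootLogin /etc/ssh/sshd_config

If I hit it again, running 'pvecm updatecerts' on the affected node would be my next move; my understanding is it regenerates the node certificates and refreshes the cluster's known_hosts entries.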
That's where my memory of the exact steps gets hazy, but my BOFH gene started kicking in too. Anger grew, because none of this made sense.
I tried removing the node and re-adding it to the cluster a few times without reinstalling it. Removing worked; re-adding was... not so successful. In the end I wiped it and fed it the same configs. Except even after that, bond0 was again throwing errors.
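For the record, the remove / re-add was the standard pvecm dance, something like this (node name and member IP are placeholders):

    # On one of the remaining cluster members:
    pvecm delnode prod1

    # On prod1, joining back via an existing member's IP:
    pvecm add 192.0.2.11

My understanding is that re-joining a node that was previously a member tends to choke on the stale corosync/cluster state still sitting on it, which would explain why only the full wipe finally went through.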
Another thing that's weird, and I'm not sure why: the interface names changed. On prod 2 they stayed the same, but when I fresh-reinstalled prod 1 the names changed from eth0 / eth1 to something that looks derived from the driver or slot. I haven't yet found where to change this, and I'd prefer eth0 / eth1, but I'm tolerating it for now because my care factor is dropping.
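For anyone wondering: I assume this is the "predictable" interface naming (enp-something) that a fresh Debian Stretch / Proxmox 5 install uses, while in-place upgrades keep the old ethX names. I haven't tried it on this box yet, but the usual revert is supposed to be the kernel cmdline flags below, followed by update-grub, a matching edit to /etc/network/interfaces, and a reboot.

    # /etc/default/grub -- go back to the old ethX naming scheme
    GRUB_CMDLINE_LINUX="net.ifnames=0 biosdevname=0"

The bond/bridge config has to be switched back to the eth* names in the same step, or the box comes back up with no network.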
All in all, I spent way more time on this than I should have. The upgrade for 3 of the 4 nodes went really smoothly, and I seriously cannot fathom why prod 1 has given me so much trouble. Nothing adds up!
So, to ACTUALLY answer your question: as far as I can tell, all the nodes have the same package versions.
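(If anyone wants to double-check me, this is the sort of comparison I mean; hostnames are placeholders.)

    # Full package version list from each node, diffed
    ssh root@prod1 'pveversion -v' > prod1.versions
    ssh root@prod2 'pveversion -v' > prod2.versions
    diff prod1.versions prod2.versions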
Looks like you covered everything I could think of as well. Very strange. Does the working node have different package versions than the one that doesn't work? Like were they both upgraded to 5.0 or just the problematic one?