I think that may have solved it. The machines had their times synced after boot (e.g. after running for a minute), but there was a very small window in the first few seconds, presumably while services were starting, where the clock was out of sync. I guess the...
I think I found the likely culprit ;)
May 28 07:35:16 nh5 nh-ntpdate[3084]: CLOCK: time stepped by -934.341954
could you try to get that sorted and see if the problem goes away then?
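For anyone checking their own nodes for the same symptom, here is a minimal sketch that scans journal text for clock-step messages like the one quoted above (a backwards step of about 15 and a half minutes). The function name `find_clock_steps` and the 1-second `threshold` are illustrative assumptions, and the pattern is tailored to the exact message format in that log line, so other time-sync daemons will likely need a different pattern.

```python
import re

# Matches lines shaped like the one quoted in this thread:
#   May 28 07:35:16 nh5 nh-ntpdate[3084]: CLOCK: time stepped by -934.341954
STEP_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+[\d:]+)\s+(?P<host>\S+)\s+\S+:\s+"
    r"CLOCK: time stepped by (?P<step>-?\d+(?:\.\d+)?)"
)

def find_clock_steps(lines, threshold=1.0):
    """Return (timestamp, host, step_seconds) for steps of at least
    `threshold` seconds in either direction. The threshold is an
    arbitrary illustrative default, not a Proxmox or corosync limit."""
    hits = []
    for line in lines:
        m = STEP_RE.search(line)
        if m and abs(float(m["step"])) >= threshold:
            hits.append((m["stamp"], m["host"], float(m["step"])))
    return hits

if __name__ == "__main__":
    import sys
    # Read journal text from stdin and print any large clock steps.
    for stamp, host, step in find_clock_steps(sys.stdin):
        print(f"{host} {stamp}: clock stepped by {step:+.3f}s")
```

One way to use it is to pipe the current boot's journal into the script, for example `journalctl -b -o short | python3 find_steps.py` (the script filename is hypothetical).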
Ok.
I rebooted nh5, then created the nh5-journal-boot-fresh.txt file. Note: the PVE services start on boot, so they try to run.
Then I stopped the services; this is the nh5-journal-boot-stopped.txt file.
Then I started the services again, in...
@fabian
So everything was fine for a couple days, then I thought I would reboot nh5 to make sure things came back up ok (and checking before doing upgrades).
But when it rebooted, it had the same issue. I am able to recover it by doing this...
Awesome, that order did the trick. All looks good. I did the same for nh5 too, so they are all up. Thanks 100x for all your help!
Do you know why this may have happened? I've rebooted Proxmox nodes many times, including entire clusters that were...
that looks good so far, the question mark is probably because pvestatd is not running yet on that node. could you try starting it and see if it goes "green" then? if it does, you can start the other services as well on that node.
please then try...
The network configs were all last written in January 2025, so none of them have been touched in over a year. They are all exactly the same size. So if it is a network issue, it could be a flaky switch or something like that (?). Maybe MTU...
could you double check and post your network configuration/setup, including the switch config? in particular of the two "problematic" nodes? this looks like a network misconfiguration problem, though the logs don't give a clear indication *what*...
please provide the journal of all 5 nodes covering the bootup, and the full journal for the corosync and pve-cluster units on all 5 nodes for the same boot.
See attached.
The node 2 and node 5 boot logs were "too large for the server to process" on upload. I truncated those as they just repeat the same things over and over, so they are small enough to upload.
Thanks for your consideration. :)
I have a 5 node Proxmox cluster co-located in a data center with ~100 KVMs that has been running happily the last year+.
The ISP needed to move the servers to another building (sigh).
Everything came back online, but two of the nodes, node2 and...