Hi there!
I ran into some strange trouble with both of my Proxmox clusters, so let me show you how it got solved.
Problems:
- GUI showed all nodes and all VMs as offline, even though everything was actually running
- GUI didn't show any storage
Diagnostic steps:
- SSH into every Proxmox node and run ping tests across their networks - everything looks fine
- SSH in and run pvecm status - everything looks fine
- Check the storage: SSH in and run df -Th to list the mounted storage - this freezes?! (rough versions of these commands are sketched after this list)
- OK, after a long wait the syslog finally loads, and there I found lines saying the NFS server was unavailable but the client was still trying
- Look at the NFS server - it runs just fine, I can mount the exports from other machines... so what is the problem?!
- Cannot add or remove storage in the GUI
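For reference, this is roughly what those checks look like on one node. The hostnames, IP addresses and export path below are placeholders, so adjust them to your own setup:
# ping another node over the cluster/storage networks (placeholder IP)
ping -c 3 10.10.10.2
# check corosync quorum and cluster membership
pvecm status
# list mounted filesystems - this is the command that hung on the stale NFS mount;
# reading /proc/mounts instead does not touch the mounts, so it won't hang
df -Th
grep nfs /proc/mounts
# look for kernel messages like "nfs: server ... not responding, still trying"
dmesg -T | grep -i nfs
# test-mount the export from an unrelated machine to confirm the NFS server itself is fine
mount -t nfs nfs-server.example.com:/export/backup /mnt/test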
Turns out there was a network outage during the night, and the kernel's NFS client on the nodes kinda went bonkers - the mounts went stale and never recovered.
Resolution:
- SSH in and run this on every node, but wait until it finishes on one node before running it on the next:
systemctl restart corosync.service && systemctl restart pve-cluster.service && systemctl restart pvedaemon.service && systemctl restart pveproxy.service && systemctl restart pvestatd.service
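Before moving on to the next node, it's worth checking that the services actually came back and the node has rejoined the quorum; something along these lines:
# confirm the restarted services are running on this node
systemctl is-active corosync pve-cluster pvedaemon pveproxy pvestatd
# confirm the node sees the rest of the cluster again
pvecm status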
After running these commands on every node with some wait time in between, I could disable the NFS storage in the GUI (or from the shell, see below). Still, after a while the error message that the NFS server is unavailable appeared again, and I had to run the commands above on that node once more.
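If the GUI is too unresponsive for that, the same thing can be done from the shell with pvesm; the storage ID backup-nfs below is just a placeholder for whatever yours is called:
# list the storages defined on the cluster and whether they are active
pvesm status
# disable the NFS storage so pvestatd stops polling it
pvesm set backup-nfs --disable 1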
df -Th still froze. To get rid of the problem completely, I had to reboot every node: live-migrate all the VMs off a node, restart it, and move on to the next one (roughly as sketched below).
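Per node, that was roughly the following; VM ID 100 and target node pve2 are placeholders:
# live-migrate a running VM off the node that is about to be rebooted
qm migrate 100 pve2 --online
# once the node is empty of VMs, reboot it
reboot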
Then came another problem: the nodes wouldn't restart! I could ping them, but there was no SSH and no other communication. Turns out the watchdog timeout had reared its head. So I logged into the iKVM and restarted them from there. This bug hasn't appeared before or since, so I think the NFS freeze caused it too...
If you have Ceph, remember to wait for it to heal before rebooting the next node! (See the sketch below.)
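A quick way to check that, assuming the ceph CLI is available on the nodes (it is on a standard Proxmox Ceph setup):
# show overall cluster state; wait for HEALTH_OK / all PGs active+clean before the next reboot
ceph -s
# optionally tell Ceph not to rebalance while a node is down for a planned reboot
ceph osd set noout
# ... and remember to unset it once the reboots are finished
ceph osd unset noout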
After all that was done, I remounted the NFS storage (roughly as shown below), ran some diagnostics, and everything was fine again.
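Remounting was basically the reverse of the disable step from earlier; backup-nfs is again a placeholder storage ID:
# re-enable the NFS storage definition
pvesm set backup-nfs --disable 0
# check that every storage reports as active again
pvesm status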
The moral of my story:
- Kernel-space storage sucks. Imagine if I had more containers, which share the node's kernel - who knows what bugs I would have run into. Luckily I use NFS only for backups.
- Use Ceph as your main storage engine. While all of this was going on, every one of my VMs kept running and never noticed a thing.
- If you can, phase NFS out of your storage stack. Since Proxmox VE 5 you can even use SMB/CIFS storage (a rough example is below)!
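For what it's worth, an SMB/CIFS storage can also be added from the CLI (the CIFS storage type arrived with PVE 5.2); the server, share and username below are made-up placeholders, and I've left the credentials handling out:
# define a CIFS/SMB storage for backups (placeholder values; password/domain options exist as well)
pvesm add cifs backup-smb --server smb.example.com --share backups --username backupuser --content backup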
I hope this helps if someone runs into something similar.