Solved - Proxmox 4.4 GUI showed nodes and VMs offline

Discussion in 'Proxmox VE: Installation and configuration' started by CadilLACi, Feb 14, 2019.

  1. CadilLACi

    CadilLACi New Member

    Hi there!

    I ran into some strange trouble with both of my Proxmox clusters, so let me show you how it got solved.

    - GUI showed all nodes and all VMs offline, but everything was actually running
    - GUI didn't show any storage

    Diagnostic steps:
    - ssh into every Proxmox node and run ping tests across their networks - everything looks fine
    - ssh and run pvecm status - everything looks fine
    - check the storage: ssh and run df -Th to show the mounted storages - THIS freezes?!?
    - ok, after a long wait the syslog loads, and there I found some lines saying the NFS server storage is unavailable but the node is still trying
    - look at the NFS storages - they run just dandy, I can mount them from other servers... what is the problem?!?
    - cannot add or remove storages in the GUI
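
The freeze in df -Th can be sidestepped while diagnosing by probing each NFS mount with a time limit. The following is only a minimal sketch: the function name check_nfs_mounts and its arguments are made up for illustration, and it assumes GNU coreutils (timeout, stat) on the node.

```shell
#!/bin/sh
# Probe every NFS mount with a time limit, so one hung mount cannot freeze the shell.
# Hypothetical helper. Usage: check_nfs_mounts [mounts_file] [timeout_seconds]
check_nfs_mounts() {
    mounts_file="${1:-/proc/mounts}"
    limit="${2:-5}"
    # /proc/mounts fields: device mountpoint fstype options dump pass
    awk '$3 == "nfs" || $3 == "nfs4" { print $2 }' "$mounts_file" |
    while read -r mnt; do
        # SIGKILL is used because a process blocked on NFS may ignore milder signals.
        if timeout -s KILL "$limit" stat -t "$mnt" >/dev/null 2>&1; then
            echo "OK    $mnt"
        else
            echo "FAIL  $mnt"   # timed out or could not be accessed
        fi
    done
}
```

Calling check_nfs_mounts without arguments reads /proc/mounts and prints one line per NFS mount; a FAIL line points at the mount that is likely making df hang.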

    Turns out there was a network outage during the night, and the NFS client part of the kernel on the nodes kinda went bonkers...

    - ssh and run this on every node, but wait until it finishes on one node before running it on the next:

    systemctl restart corosync.service && systemctl restart pve-cluster.service && systemctl restart pvedaemon.service && systemctl restart pveproxy.service && systemctl restart pvestatd.service

    After I ran this command on every node with some wait time between them, I could disconnect the NFS storages in the GUI. Still, after some time the error message that the NFS server is unavailable appears again, and I have to run the above command on that node again.
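
The "one node at a time, with some wait in between" procedure could be scripted roughly as follows. This is a sketch, not a tested tool: rolling_restart, the SSH and PAUSE variables, the 30 second default and root ssh access are all assumptions for illustration; only the restart chain itself comes from the post.

```shell
#!/bin/sh
# Restart the cluster services on one node after another.
# Hypothetical helper. Usage: rolling_restart node1 node2 ...   (hostnames are placeholders)
rolling_restart() {
    # Same chain as in the post: each restart runs only if the previous one succeeded.
    cmd='systemctl restart corosync.service && systemctl restart pve-cluster.service && systemctl restart pvedaemon.service && systemctl restart pveproxy.service && systemctl restart pvestatd.service'
    for node in "$@"; do
        echo "== $node =="
        # ssh returns only after the remote command has finished; stop at the first failure.
        ${SSH:-ssh} "root@$node" "$cmd" || return 1
        # Give the cluster some time to settle before touching the next node.
        sleep "${PAUSE:-30}"
    done
}
```

Setting SSH=echo and PAUSE=0 before calling it gives a dry run that only prints what would be executed on each node.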

    df -Th still freezes. To get rid of the problem completely, I had to reboot every node: live migrate every VM away, and restart the nodes one by one.
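
Emptying a node before its reboot might look like the sketch below. It assumes the qm CLI on the node, that qm list prints VMID, NAME and STATUS as its first three columns, and that the target node can take all guests; evacuate_node and the QM override variable are hypothetical names. Containers are not handled.

```shell
#!/bin/sh
# Live-migrate every running VM on this node to another node.
# Hypothetical helper. Usage: evacuate_node <target-node>
evacuate_node() {
    target="$1"
    # Keep only guests whose STATUS column says "running" (this also skips the header line).
    for vmid in $(${QM:-qm} list | awk '$3 == "running" { print $1 }'); do
        echo "migrating VM $vmid to $target"
        ${QM:-qm} migrate "$vmid" "$target" --online || return 1
    done
}
```

The function stops at the first failed migration, so the node should only be rebooted once it has finished without an error.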

    Now came another problem: the nodes won't restart! I can ping them, but there is no ssh and no other communication. Turns out the watchdog timeout reared its head. So I logged into the iKVM and restarted them from there. This bug didn't appear before or after the incident, so I think the NFS freeze caused this too...

    If you have Ceph, remember to wait for it to heal before rebooting the next node!
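
Waiting for Ceph to heal can be automated with a small polling loop around ceph health. A minimal sketch, where wait_for_ceph, the CEPH and INTERVAL variables and the 30 second default are assumptions for illustration:

```shell
#!/bin/sh
# Block until "ceph health" reports HEALTH_OK, polling every INTERVAL seconds.
# Hypothetical helper; run it between two node reboots.
wait_for_ceph() {
    until ${CEPH:-ceph} health 2>/dev/null | grep -q '^HEALTH_OK'; do
        echo "ceph not healthy yet, waiting..."
        sleep "${INTERVAL:-30}"
    done
    echo "ceph is HEALTH_OK"
}
```

Note that the loop never ends while anything keeps the cluster in a warning state, for example a flag that was set for the maintenance, so it is worth looking at the full ceph health output if it waits for a long time.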

    After all that was done, I remounted the NFS storages, ran some diagnostics, and everything is fine again.

    The moral of my story:
    - kernelspace storage sucks. Imagine if I had more containers, which use the node's kernel. Who knows what bugs I would have encountered. Luckily I use NFS only for backups.
    - use Ceph as the main storage engine. While all of this was happening, all of my VMs kept running and didn't notice any of it.
    - if you can, phase out NFS from your storage tech. From Proxmox version 5, you can even have SMB storage!

    I hope this helps if someone encounters something similar.
  2. t.lamprecht

    t.lamprecht Proxmox Staff Member
    Just a few comments, as you already solved this.

    FYI, you can pass multiple units to systemctl restart in a row.
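
For illustration, the long chain from the first post collapses into a single call like the sketch below. The unit list is copied from that post; the restart_pve_units wrapper and the SYSTEMCTL override are hypothetical and only there to allow a dry run.

```shell
#!/bin/sh
# Units taken from the restart chain in the first post.
PVE_UNITS="corosync.service pve-cluster.service pvedaemon.service pveproxy.service pvestatd.service"

# Hypothetical wrapper: restart all units with one systemctl invocation.
restart_pve_units() {
    # $PVE_UNITS is deliberately unquoted so that it splits into separate arguments.
    ${SYSTEMCTL:-systemctl} restart $PVE_UNITS
}
```

One difference to keep in mind: as far as I know, the order is then left to systemd's unit dependencies rather than the left-to-right order of the && chain, and a failure of one unit does not skip the remaining ones.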

    This should not be such a hard issue with PVE 5.X anymore, as pvestatd checks storages in a forked process, so the whole daemon no longer hangs there. Also remember that PVE 4.X is EOL and upgrading is highly recommended.

    Not every kernel storage acts the same as NFS. E.g., Ceph can also be provided by the kernel (krbd), and there you won't have those issues, as you noticed too. SMB/CIFS is also provided by the kernel, for example.
  3. CadilLACi

    CadilLACi New Member

    Ok, I stand corrected!

    It is reassuring that this kind of thing would only happen with NFS storage.

    Phasing it out and upgrading to 5.3 are this year's new year's resolutions.

    Thanks for your reply!
