Proxmox server randomly freezes and won't accept any input

Discussion in 'Proxmox VE: Installation and configuration' started by Dannyzee, Mar 7, 2018.

  1. Dannyzee

    Dannyzee New Member

    Joined:
    Mar 7, 2018
    Messages:
    11
    Likes Received:
    0
    Hi,

    We are new users of Proxmox with Ceph, which works fine so far, but we have a strange problem.
    Every few days our Proxmox management server freezes completely and stops responding.

    The servers are Supermicro servers and have IPMI. When we check IPMI, it won't accept input anymore either. I have to reboot the server completely to get it working again, until it freezes the next time. I haven't been able to find a pattern yet, so the freezes seem completely random.

    We also have 3 monitor servers with exactly the same hardware and configuration, and these don't freeze at all.

    atop also logs right up to the time of the freeze, but we don't see any significant process running other than what was already running.

    Any ideas or tricks we can try to fix this? The Ceph/Proxmox setup is still in a test stage, so it is not urgent, but we would like it to work so we can use the full functionality of the Proxmox/Ceph environment.
     
  2. wolfgang

    wolfgang Proxmox Staff Member
    Staff Member

    Joined:
    Oct 1, 2014
    Messages:
    4,710
    Likes Received:
    313
    Hi,
    If the IPMI stops responding, you have a HW problem.
    No OS can break the IPMI.
     
  3. Dannyzee

    Dannyzee New Member

    Joined:
    Mar 7, 2018
    Messages:
    11
    Likes Received:
    0
    Hello Wolfgang,

    Thank you for your reply. That is what we are thinking as well, but do you have an idea how we can check this?
    I have run multiple Memtests, which all passed without any faults.

    Is there a way to check what exactly is broken?
     
  4. wolfgang

    wolfgang Proxmox Staff Member
    Staff Member

    Joined:
    Oct 1, 2014
    Messages:
    4,710
    Likes Received:
    313
    Working memory does not prove that the IPMI is not corrupt.

    Contact your HW vendor. Some have special tools for this task.
     
  5. Dannyzee

    Dannyzee New Member

    Joined:
    Mar 7, 2018
    Messages:
    11
    Likes Received:
    0
    Hi,

    Thank you. I will contact our HW supplier.
     
  6. Vasu Sreekumar

    Vasu Sreekumar Active Member

    Joined:
    Mar 3, 2018
    Messages:
    123
    Likes Received:
    34
    "If the IPMI stops responding, you have a HW problem.
    No OS can break the IPMI."

    In my experience, yes, it is possible.

    In the BIOS there is an option to enable/disable IPMI for the OS. Keep it disabled.
     
  7. Dannyzee

    Dannyzee New Member

    Joined:
    Mar 7, 2018
    Messages:
    11
    Likes Received:
    0
    The IPMI was still responding; only the OS did not respond, so pressing any key did literally nothing.
     
  8. Dannyzee

    Dannyzee New Member

    Joined:
    Mar 7, 2018
    Messages:
    11
    Likes Received:
    0
    Re-reading this made me think: IPMI works, it is just the console that no longer responds. Is there another way we can check this? I don't think it is HW related, as the IPMI itself works fine. When we go to the DC and attach a keyboard, the OS doesn't respond either until we reboot.
     
  9. Dannyzee

    Dannyzee New Member

    Joined:
    Mar 7, 2018
    Messages:
    11
    Likes Received:
    0
    Any more ideas to try?
     
  10. Vasu Sreekumar

    Vasu Sreekumar Active Member

    Joined:
    Mar 3, 2018
    Messages:
    123
    Likes Received:
    34
    Is SSH still working?

    Or do you have the freezing issue only in the IP KVM console?
     
  11. wolfgang

    wolfgang Proxmox Staff Member
    Staff Member

    Joined:
    Oct 1, 2014
    Messages:
    4,710
    Likes Received:
    313
    Do you see any errors in the syslog?
    Turn on core dump and make journald persistent.

    How to make journald persistent
    Code:
    mkdir -p /var/log/journal
    systemctl restart systemd-journald
    
    See
    https://pve.proxmox.com/wiki/Enable_Core_Dump_systemd

    Maybe you will get a hint this way.
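
Once the journal is persistent, the messages written just before a freeze should survive the forced reboot and can be read from the previous boot's log. The following is only a minimal sketch under that assumption; the function names `journal_is_persistent` and `show_previous_boot_tail` are hypothetical helpers, not part of Proxmox or systemd.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the persistent journal directory exists.
# The directory argument defaults to /var/log/journal, as created above.
journal_is_persistent() {
    [ -d "${1:-/var/log/journal}" ]
}

# Hypothetical helper: print the tail of the previous boot's journal,
# i.e. the last messages written before the freeze and forced reboot.
show_previous_boot_tail() {
    if journal_is_persistent; then
        journalctl --list-boots              # boot -1 is the one that froze
        journalctl -b -1 -n 200 --no-pager   # last 200 lines of that boot
    else
        echo "journal is not persistent yet" >&2
        return 1
    fi
}
```

After the next forced reboot, running `show_previous_boot_tail` as root would show what, if anything, was logged right before the hang. If the last entries look completely normal, that in itself hints that the machine stopped without being able to log anything.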
     
  12. Dannyzee

    Dannyzee New Member

    Joined:
    Mar 7, 2018
    Messages:
    11
    Likes Received:
    0
    Hello,

    SSH is not working anymore. We can't log in.
    When we log in to IPMI and open the iKVM HTML console, it doesn't accept input anymore either, until we reboot the server. Then it works fine.

    The whole system freezes, but IPMI works fine. We also went to the datacentre once when it happened, and it was the same problem: the machine was on and running, but SSH was frozen and the machine wouldn't accept any input. Our other Ceph machines work fine; it is only this manager machine. Our monitors and storage servers run fine.
     
  13. Dannyzee

    Dannyzee New Member

    Joined:
    Mar 7, 2018
    Messages:
    11
    Likes Received:
    0

    Hi,

    Thanks for the hint. I will activate the core dumps and let you know if I find anything useful after it crashes again.
     
  14. Dannyzee

    Dannyzee New Member

    Joined:
    Mar 7, 2018
    Messages:
    11
    Likes Received:
    0
    Hi,

    The server froze again. How can I "read" these core dumps? I have installed gdb, but that is as far as my knowledge of core dumps goes.
     
  15. Dannyzee

    Dannyzee New Member

    Joined:
    Mar 7, 2018
    Messages:
    11
    Likes Received:
    0
    Hi,

    I have checked, but there are no core dumps:

    /var/lib/coredumps# ls -la

    total 8
    drwxr-xr-x 2 root root 4096 Mar 9 09:39 .
    drwxr-xr-x 44 root root 4096 Mar 9 09:39 ..


    Shutting down PVE doesn't produce any core dumps either, and I followed every step of the guide exactly.
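
An empty core dump directory mainly shows that no process crashed with a core-dumping signal; a clean shutdown or a full system hang would not be expected to write one. To rule out a misconfiguration, the kernel's core pattern and the core size limit can be inspected. This is only a rough sketch: it assumes the guide's setup relies on the standard `kernel.core_pattern` setting and the core size limit, and the function names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the given core size limit allows dumps.
# A limit of 0 means the kernel writes no core files for that process.
core_limit_allows_dumps() {
    [ "$1" != "0" ]
}

# Hypothetical helper: report where core files go and whether the
# current shell is allowed to write them.
report_core_settings() {
    if [ -r /proc/sys/kernel/core_pattern ]; then
        # Either a file path pattern or a "|program" pipe handler.
        echo "core pattern: $(cat /proc/sys/kernel/core_pattern)"
    fi
    limit=$(ulimit -c)
    echo "core size limit: $limit"
    if core_limit_allows_dumps "$limit"; then
        echo "core dumps are allowed for this shell"
    else
        echo "core dumps are disabled for this shell"
    fi
}
```

Calling `report_core_settings` in a root shell prints the current values. Note that the limit shown applies to that shell; services started by systemd can have a different limit.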
     
  16. sda9s8d2sd

    sda9s8d2sd New Member

    Joined:
    Mar 18, 2018
    Messages:
    1
    Likes Received:
    0
    I am having random freezes as well, with the exact same symptoms as you.
    I am also running a Supermicro board with 2 Xeon E5 v3 processors.

    I cleaned up all error messages first, like NTP, CPU microcode update, GPU power states etc.

    It still freezes randomly. It lasted for 24 hours, which is a new record for me.

    I followed the core dump write-up as well, and it wasn't working for me.
    Perhaps that write-up needs an update.
     
  17. Bigkuhuna24

    Bigkuhuna24 New Member

    Joined:
    Mar 18, 2018
    Messages:
    1
    Likes Received:
    0
    My system freezes and the web UI stops working for a minute or two. Xeon E5 too.
     
  18. Dannyzee

    Dannyzee New Member

    Joined:
    Mar 7, 2018
    Messages:
    11
    Likes Received:
    0
    It looks like multiple people have the same issue. Any ideas?
     
  19. Stewge

    Stewge Member

    Joined:
    Feb 11, 2010
    Messages:
    38
    Likes Received:
    2
    I'm seeing this as well on one of our Supermicro boards which is running 2x Xeon E5-2640v4 on an X10DRi motherboard. I wrote it off as an anomaly but it now happens about once every 1-3 months. I've got recorded outages as 7/10/2017, 9/1/2018, 1/2/2018 and 28/3/2018.

    PVE and VMs lock up. IPMI is still functional and reads 30fps through the java console (so capture is still functioning) but it's just a black screen unlike a regular kernel panic condition where I'd expect to see the panic output. The only thing I can do is force reboot via IPMI.

    Fortunately it's running on very fast SSDs so I can often have the whole thing rebooted and VMs running in less than 5 minutes.

    Unfortunately, nothing of substance makes it into any of the logs.
     
  20. OH24

    OH24 New Member

    Joined:
    Jun 1, 2017
    Messages:
    15
    Likes Received:
    6
    We are experiencing the same problems with a Supermicro Board X9DRi. Are there any new findings on this topic?
     