Proxmox Node freezes

Discussion in 'Proxmox VE: Installation and configuration' started by nitaish, Jun 16, 2018.

  1. nitaish

    nitaish Member

    Joined:
    Feb 1, 2014
    Messages:
    44
    Likes Received:
    2
    One of my Proxmox nodes freezes every few hours, and we have to do a hard reboot to bring it back online. I can't find anything in the syslog. Can anyone suggest what I should check to find the cause and a solution?
     
  2. fireon

    fireon Well-Known Member
    Proxmox VE Subscriber

    Joined:
    Oct 25, 2010
    Messages:
    2,690
    Likes Received:
    148
    What type and model of server do you have?
    Do you have iLO/iDRAC or IPMI access to see what hardware problems there might be?
    What does the load graph show: is there heavy load (CPU, I/O, ...) on the server?
    Tell us more about the hardware: RAID, ZFS, how many disks, how fast?
    Which PVE version? Is the system up to date and clean ("apt install -f")?
    If you connect a monitor and keyboard directly to the node, do you see an error when the server crashes?
    Also attach your syslog for the relevant time.
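    For a first pass, something like the following usually surfaces hardware-level errors on a Debian/PVE node. This is just a generic sketch: journalctl -b -1 only works with a persistent journal, and both commands may need root.

```shell
# Running kernel version (compare against the last known-good kernel).
uname -r
# Recent kernel warnings/errors from the ring buffer (may need root).
dmesg --level=err,warn 2>/dev/null | tail -n 20 || true
# Errors from the previous boot, if the journal is persistent.
journalctl -b -1 -p err --no-pager 2>/dev/null | tail -n 20 || true
```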
     
  3. ITNiels

    ITNiels New Member

    Joined:
    Jun 17, 2018
    Messages:
    1
    Likes Received:
    0
    Hi :)

    We currently have the same problem!
    For the last few days the server has simply disappeared several times, with no logs, nothing!
    We can't ping, SSH or reach the web UI, and only a remote hard reboot brings it back.

    We had Hetzner replace the server but keep the disks, as our initial thought was a network card issue.
    We got this error: "e1000e 000:00:1F.6 enp0s31f6: Detected Hardware Unit Hang"

    But replacing all the hardware did not fix the issue!
    It happens at completely random times!

    Hope this helps in investigating the issue.

    Kind regards
    Niels

    Stats:
    Load: < 1
    CPU: <5%
    IO: <1%
    Last full updates and reboot: June 9th

    Server:
    Model: Hetzner EX41
    Software:
    OS: Debian 9
    Proxmox: pve-manager/5.2-1/0fcd7879
    Kernel: Linux 4.15.17-2-pve #1 SMP PVE 4.15.17-10 (Tue, 22 May 2018 11:15:44 +0200)

    Hardware
    Intel® Core i7-6700 Quad-Core processor
    2 x 500 GB SATA 6 Gb/s SSD (Micron 1100) (2 separate disks without LVM)
    32 GB DDR4
    1 GBit/s-Port

    Syslog just before/after it freezes:
    Code:
    Jun 17 01:50:00 vm5 systemd[1]: Starting Proxmox VE replication runner...
    Jun 17 01:50:01 vm5 systemd[1]: Started Proxmox VE replication runner.
    Jun 17 01:50:05 vm5 pvedaemon[1877]: <*********@pam> successful auth for user '*********'
    Jun 17 01:50:27 vm5 pvedaemon[1597]: <*********@pam> successful auth for user '*********'
    Jun 17 01:51:00 vm5 systemd[1]: Starting Proxmox VE replication runner...
    Jun 17 01:51:01 vm5 systemd[1]: Started Proxmox VE replication runner.
    Jun 17 01:52:00 vm5 systemd[1]: Starting Proxmox VE replication runner...
    Jun 17 01:52:01 vm5 systemd[1]: Started Proxmox VE replication runner.
    Jun 17 01:52:29 vm5 postfix/smtpd[10415]: connect from unknown[*********]
    Jun 17 01:52:29 vm5 postfix/smtpd[10415]: lost connection after AUTH from unknown[*********]
    Jun 17 01:52:29 vm5 postfix/smtpd[10415]: disconnect from unknown[*********] ehlo=1 auth=0/1 commands=1/2
    Jun 17 01:53:00 vm5 systemd[1]: Starting Proxmox VE replication runner...
    Jun 17 01:53:01 vm5 systemd[1]: Started Proxmox VE replication runner.
    Jun 17 01:53:15 vm5 pveproxy[8181]: worker exit
    Jun 17 01:53:15 vm5 pveproxy[1819]: worker 8181 finished
    Jun 17 01:53:15 vm5 pveproxy[1819]: starting 1 worker(s)
    Jun 17 01:53:15 vm5 pveproxy[1819]: worker 10491 started
    Jun 17 01:53:43 vm5 pveproxy[1819]: worker 8051 finished
    Jun 17 01:53:43 vm5 pveproxy[1819]: starting 1 worker(s)
    Jun 17 01:53:43 vm5 pveproxy[1819]: worker 10521 started
    Jun 17 01:53:44 vm5 pveproxy[10520]: worker exit
    Jun 17 01:54:00 vm5 systemd[1]: Starting Proxmox VE replication runner...
    Jun 17 01:54:01 vm5 systemd[1]: Started Proxmox VE replication runner.
    Jun 17 01:55:00 vm5 systemd[1]: Starting Proxmox VE replication runner...
    Jun 17 01:55:00 vm5 pvedaemon[3977]: <*********@pam> successful auth for user '*********'
    Jun 17 01:55:01 vm5 systemd[1]: Started Proxmox VE replication runner.
    Jun 17 01:55:50 vm5 postfix/anvil[10417]: statistics: max connection rate 1/60s for (smtp:185.234.217.38) at Jun 17 01:52:29
    Jun 17 01:55:50 vm5 postfix/anvil[10417]: statistics: max connection count 1 for (smtp:185.234.217.38) at Jun 17 01:52:29
    Jun 17 01:55:50 vm5 postfix/anvil[10417]: statistics: max cache size 1 at Jun 17 01:52:29
    Jun 17 01:56:00 vm5 systemd[1]: Starting Proxmox VE replication runner...
    Jun 17 01:56:01 vm5 systemd[1]: Started Proxmox VE replication runner.
    Jun 17 01:57:00 vm5 systemd[1]: Starting Proxmox VE replication runner...
    Jun 17 01:57:01 vm5 systemd[1]: Started Proxmox VE replication runner.
    Jun 17 01:58:00 vm5 systemd[1]: Starting Proxmox VE replication runner...
    Jun 17 01:58:01 vm5 systemd[1]: Started Proxmox VE replication runner.
    Jun 17 01:59:00 vm5 systemd[1]: Starting Proxmox VE replication runner...
    Jun 17 01:59:01 vm5 systemd[1]: Started Proxmox VE replication runner.
    Jun 17 02:00:00 vm5 systemd[1]: Starting Proxmox VE replication runner...
    Jun 17 02:00:01 vm5 systemd[1]: Started Proxmox VE replication runner.
    Jun 17 02:01:00 vm5 systemd[1]: Starting Proxmox VE replication runner...
    Jun 17 02:01:01 vm5 systemd[1]: Started Proxmox VE replication runner.
    
    <========= DEAD HERE AND THEN WE HARD RESET THE SERVER =========>
    
    Jun 17 02:08:25 vm5 systemd-modules-load[330]: Inserted module 'iscsi_tcp'
    Jun 17 02:08:25 vm5 kernel: [    0.000000] Linux version 4.15.17-2-pve (tlamprecht@evita) (gcc version 6.3.0 20170516 (Debian 6.3.0-18+deb9u1)) #1 SMP PVE 4.15.17-10 (Tue, 22 May 2018 11:15:44 +0200) ()
    Jun 17 02:08:25 vm5 kernel: [    0.000000] Command line: BOOT_IMAGE=/vmlinuz-4.15.17-2-pve root=UUID=dc2c6eeb-e09e-4e1b-a1b2-658c64d9dd62 ro nomodeset consoleblank=0
    Jun 17 02:08:25 vm5 kernel: [    0.000000] KERNEL supported cpus:
    Jun 17 02:08:25 vm5 kernel: [    0.000000]   Intel GenuineIntel
     
  4. Amonal

    Amonal New Member

    Joined:
    Oct 9, 2013
    Messages:
    5
    Likes Received:
    0
  5. michaelvv

    michaelvv Member

    Joined:
    Oct 9, 2008
    Messages:
    94
    Likes Received:
    1
    Same issue on my private home server. The worst bug I have ever seen in my 8-year Proxmox journey.

    I started seeing this about 1 to 1.5 months ago, after an update. I tried adding these lines to my network config,
    but I still have the issue.

    offload-tx off
    offload-sg off
    offload-tso off

    proxmox-ve: 5.2-2 (running kernel: 4.15.18-1-pve)
    pve-manager: 5.2-5 (running version: 5.2-5/eb24855a)

    ethtool -i eth0
    driver: e1000e
    version: 3.4.1.1-NAPI
    firmware-version: 0.13-4
    expansion-rom-version:
    bus-info: 0000:00:19.0
    supports-statistics: yes
    supports-test: yes
    supports-eeprom-access: yes
    supports-register-dump: yes
    supports-priv-flags: no
     
    #5 michaelvv, Jul 14, 2018
    Last edited: Jul 14, 2018
  6. gallew

    gallew New Member

    Joined:
    Oct 9, 2015
    Messages:
    24
    Likes Received:
    6
    I had the same problem with Hetzner machines (two clusters).
    I don't know if it helps, but it's been running for 2 weeks now without problems (knocking on wood), so I think it is worth sharing:
    Code:
    # e1000e module hang problem
    /sbin/ethtool -K eth0 tx off rx off
    
    When executing this, expect an outage of a second or two.
    I use traditional NIC names; your mileage may vary.

    As for lagging in KVM machines, I'd check the MTUs on all systems (physical interfaces, bridges, the NIC inside the KVM guest, etc.).
    I had one case where one NIC had a different MTU than the others, and the result was the same: lagging because of packet fragmentation.
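    A quick way to eyeball the MTUs on the host is a plain iproute2 one-liner (generic, not PVE-specific); all members of the same bridge should normally show the same value:

```shell
# Print "interface mtu N" for every link on the host.
# In "ip -o link show" output, field 2 is the name, field 5 the MTU.
ip -o link show | awk '{sub(/:$/, "", $2); print $2, $4, $5}'
```

    The MTU inside each KVM guest still has to be checked from within the guest itself.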
     
  7. Jarek Hartman

    Jarek Hartman New Member

    Joined:
    Aug 3, 2018
    Messages:
    2
    Likes Received:
    1
    I can only confirm the same issue here.

    As I suspected a hardware issue, I ordered a mainboard replacement, but as I can see now there is no improvement at all.

    I will try the ethtool trick, but I think somebody should start thinking about a proper fix. What do you think: which element of the stack (kernel, NIC driver, ...) might be responsible? I'd like to raise a formal ticket, as this issue is really annoying.



    Best regards,
    Jarek

    ------

    Notes to self (to remember what I've done)

    Output when running from the CLI:

    Code:
    root@wieloryb-pve:/etc/rc.d/init.d# /sbin/ethtool -K enp0s31f6 tx off rx off
    Cannot get device udp-fragmentation-offload settings: Operation not supported
    Cannot get device udp-fragmentation-offload settings: Operation not supported
    Actual changes:
    rx-checksumming: off
    tx-checksumming: off
        tx-checksum-ip-generic: off
    tcp-segmentation-offload: off
        tx-tcp-segmentation: off [requested on]
        tx-tcp6-segmentation: off [requested on]
    Preserving the changes across reboots:

    Code:
    root@wieloryb-pve:~# cat /etc/network/if-up.d/ethtool2
    #!/bin/sh
    
    /sbin/ethtool -K enp0s31f6 tx off rx off
    
    root@wieloryb-pve:~# chmod 755 /etc/network/if-up.d/ethtool2
    
    Reboot and verify:

    Code:
    root@wieloryb-pve:/etc#  shutdown -r now
    
    root@wieloryb-pve:/etc/rc.d/init.d# /sbin/ethtool -k enp0s31f6
    Features for enp0s31f6:
    Cannot get device udp-fragmentation-offload settings: Operation not supported
    rx-checksumming: off                   <--------- SHOULD BE OFF, HERE AND A FEW OTHER PLACES
    tx-checksumming: off
    
    
     
  8. tobimuc

    tobimuc New Member

    Joined:
    Jan 18, 2014
    Messages:
    1
    Likes Received:
    0
    Hi!

    I still have the same problem. The server is a Hetzner EX51 ...

    Now I will try Jarek's solution and wait :)

    Tobi
     
  9. Jarek Hartman

    Jarek Hartman New Member

    Joined:
    Aug 3, 2018
    Messages:
    2
    Likes Received:
    1
    Since applying the configuration posted above (it's been 10 days already), there have been no more issues. I hope it works for others as well.
     
    tobimuc likes this.
  10. DZenker

    DZenker New Member
    Proxmox VE Subscriber

    Joined:
    Apr 10, 2018
    Messages:
    1
    Likes Received:
    0
    Hi!

    I also had the same problem with a Hetzner PX61 server. I've now applied Gallew's and Jarek's solution and it seems to be fixed.
    But, as this is just a workaround, is there any chance that this will be fixed permanently by a Proxmox kernel update?

    Greetings,
    Dietmar
     
  11. gallew

    gallew New Member

    Joined:
    Oct 9, 2015
    Messages:
    24
    Likes Received:
    6
    A big ditto here about fixing the kernel module.

    Unfortunately, after running for a little over two months, one of the machines suffered an e1000e hang again.
    Investigation showed that the e1000e module still hangs repeatedly, regardless of checksum offloading.
    There were many hangs in the log, but the latest one took the machine offline.
    That got me thinking that rc.local may not be the best place to put the ethtool command.
    This is because, although checksum offloading is disabled by rc.local after a reboot, after the first module hang ifupdown will bring the interface back up with checksum offloading re-enabled.
    Therefore it would be better to put this into the /etc/network/interfaces file, under the main NIC:
    Code:
      offload-tx  off
      offload-sg  off
      offload-tso off
      post-up /sbin/ethtool -K eth0 tx off rx off
    
    Or, as Jarek did, use an executable script that is run every time the NIC comes up.
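    For context, these lines would sit inside the physical NIC's stanza in /etc/network/interfaces. A minimal sketch of a typical PVE bridge setup is below; the interface name, bridge name and addresses are examples only, so substitute your own. The offload-* options are handled by the ethtool package's if-up.d hook, so ethtool must be installed.

```
auto eth0
iface eth0 inet manual
    offload-tx  off
    offload-sg  off
    offload-tso off
    post-up /sbin/ethtool -K eth0 tx off rx off

auto vmbr0
iface vmbr0 inet static
    address 192.0.2.10/24
    gateway 192.0.2.1
    bridge-ports eth0
    bridge-stp off
    bridge-fd 0
```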
     
  12. celtar

    celtar New Member

    Joined:
    Feb 10, 2016
    Messages:
    3
    Likes Received:
    0
    We have the same problems here with Broadcom and ASUS cards. I am not sure, but might you lose manual changes in /etc/network/interfaces if you change something in the web GUI network section?
     
  13. gallew

    gallew New Member

    Joined:
    Oct 9, 2015
    Messages:
    24
    Likes Received:
    6
    I'm not sure (maybe the developers can confirm it), but in my case I have changed the network config via the web UI only once, and that was a long time ago; I did not notice that anything was missing afterwards.
    I guess the best way to find out would be to test whether the parameters are still there after a network reconfiguration via the web UI.
    Also, my current parameters for eth0 are:
    Remember: ethtool still needs to be installed!

    Code:
      offload-rx  off
      offload-tx  off
      offload-sg  off
      offload-tso off
    
     