Proxmox Node freezes

nitaish

Member
Feb 1, 2014
50
2
8
Mumbai
www.techknowlogy.in
One of my Proxmox nodes freezes every few hours. We have to do a hard reboot to bring it back online. I am unable to find anything in the syslog. Can anyone suggest what I should check to find the cause and a solution?
 

fireon

Well-Known Member
Oct 25, 2010
3,067
194
63
Austria/Graz
iteas.at
What type and model of server do you have?
Do you have iLO/iDRAC or IPMI access to see what problem the hardware has?
What does the load graph show? Is there heavy load (CPU, I/O, ...) on the server?
Tell us more about the hardware: RAID, ZFS, how many disks, how fast?
What PVE version? Is the system up to date and clean? "apt install -f"
If you connect a monitor and keyboard directly to the node, do you see an error when the server crashes?
Also attach your syslog for the relevant time frame.
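For a quick start, something like this (a minimal sketch; ipmitool is an assumption and only works if the box has a BMC, and the previous-boot journal only exists if journald is configured persistently):
Code:
# Proxmox and kernel package versions
pveversion -v

# kernel messages from the boot before the freeze (needs a persistent journal)
journalctl -k -b -1

# hardware event log from the BMC, if present
ipmitool sel list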
 

ITNiels

New Member
Jun 17, 2018
1
0
1
36
Hi :)

We currently have the same problem!
For the last few days the server has just disappeared several times: no more logs, nothing!
We can't ping it, SSH in, or reach the web UI, and only a remote hard reboot brings it back.

We had Hetzner replace the server but keep the disks, as our initial thought was a network card issue.
We got an error: "e1000e 000:00:1F.6 enp0s31f6: Detected Hardware Unit Hang"

But replacing all the hardware did not fix the issue!
It happens completely at random!

Hope this can help in investigating the issue.

Kind regards
Niels

Stats:
Load: < 1
CPU: <5%
IO: <1%
Last full update and reboot: June 9th

Server:
Model: Hetzner EX41
Software:
OS: Debian 9
Proxmox: pve-manager/5.2-1/0fcd7879
Kernel: Linux 4.15.17-2-pve #1 SMP PVE 4.15.17-10 (Tue, 22 May 2018 11:15:44 +0200)

Hardware:
Intel® Core i7-6700 Quad-Core processor
2 x 500 GB SATA 6 Gb/s SSD (Micron 1100) (2 separate disks without LVM)
32 GB DDR4
1 GBit/s-Port

Syslog just before/after it freezes:
Code:
Jun 17 01:50:00 vm5 systemd[1]: Starting Proxmox VE replication runner...
Jun 17 01:50:01 vm5 systemd[1]: Started Proxmox VE replication runner.
Jun 17 01:50:05 vm5 pvedaemon[1877]: <*********@pam> successful auth for user '*********'
Jun 17 01:50:27 vm5 pvedaemon[1597]: <*********@pam> successful auth for user '*********'
Jun 17 01:51:00 vm5 systemd[1]: Starting Proxmox VE replication runner...
Jun 17 01:51:01 vm5 systemd[1]: Started Proxmox VE replication runner.
Jun 17 01:52:00 vm5 systemd[1]: Starting Proxmox VE replication runner...
Jun 17 01:52:01 vm5 systemd[1]: Started Proxmox VE replication runner.
Jun 17 01:52:29 vm5 postfix/smtpd[10415]: connect from unknown[*********]
Jun 17 01:52:29 vm5 postfix/smtpd[10415]: lost connection after AUTH from unknown[*********]
Jun 17 01:52:29 vm5 postfix/smtpd[10415]: disconnect from unknown[*********] ehlo=1 auth=0/1 commands=1/2
Jun 17 01:53:00 vm5 systemd[1]: Starting Proxmox VE replication runner...
Jun 17 01:53:01 vm5 systemd[1]: Started Proxmox VE replication runner.
Jun 17 01:53:15 vm5 pveproxy[8181]: worker exit
Jun 17 01:53:15 vm5 pveproxy[1819]: worker 8181 finished
Jun 17 01:53:15 vm5 pveproxy[1819]: starting 1 worker(s)
Jun 17 01:53:15 vm5 pveproxy[1819]: worker 10491 started
Jun 17 01:53:43 vm5 pveproxy[1819]: worker 8051 finished
Jun 17 01:53:43 vm5 pveproxy[1819]: starting 1 worker(s)
Jun 17 01:53:43 vm5 pveproxy[1819]: worker 10521 started
Jun 17 01:53:44 vm5 pveproxy[10520]: worker exit
Jun 17 01:54:00 vm5 systemd[1]: Starting Proxmox VE replication runner...
Jun 17 01:54:01 vm5 systemd[1]: Started Proxmox VE replication runner.
Jun 17 01:55:00 vm5 systemd[1]: Starting Proxmox VE replication runner...
Jun 17 01:55:00 vm5 pvedaemon[3977]: <*********@pam> successful auth for user '*********'
Jun 17 01:55:01 vm5 systemd[1]: Started Proxmox VE replication runner.
Jun 17 01:55:50 vm5 postfix/anvil[10417]: statistics: max connection rate 1/60s for (smtp:185.234.217.38) at Jun 17 01:52:29
Jun 17 01:55:50 vm5 postfix/anvil[10417]: statistics: max connection count 1 for (smtp:185.234.217.38) at Jun 17 01:52:29
Jun 17 01:55:50 vm5 postfix/anvil[10417]: statistics: max cache size 1 at Jun 17 01:52:29
Jun 17 01:56:00 vm5 systemd[1]: Starting Proxmox VE replication runner...
Jun 17 01:56:01 vm5 systemd[1]: Started Proxmox VE replication runner.
Jun 17 01:57:00 vm5 systemd[1]: Starting Proxmox VE replication runner...
Jun 17 01:57:01 vm5 systemd[1]: Started Proxmox VE replication runner.
Jun 17 01:58:00 vm5 systemd[1]: Starting Proxmox VE replication runner...
Jun 17 01:58:01 vm5 systemd[1]: Started Proxmox VE replication runner.
Jun 17 01:59:00 vm5 systemd[1]: Starting Proxmox VE replication runner...
Jun 17 01:59:01 vm5 systemd[1]: Started Proxmox VE replication runner.
Jun 17 02:00:00 vm5 systemd[1]: Starting Proxmox VE replication runner...
Jun 17 02:00:01 vm5 systemd[1]: Started Proxmox VE replication runner.
Jun 17 02:01:00 vm5 systemd[1]: Starting Proxmox VE replication runner...
Jun 17 02:01:01 vm5 systemd[1]: Started Proxmox VE replication runner.

<========= DEAD HERE AND THEN WE HARD RESET THE SERVER =========>

Jun 17 02:08:25 vm5 systemd-modules-load[330]: Inserted module 'iscsi_tcp'
Jun 17 02:08:25 vm5 kernel: [    0.000000] Linux version 4.15.17-2-pve (tlamprecht@evita) (gcc version 6.3.0 20170516 (Debian 6.3.0-18+deb9u1)) #1 SMP PVE 4.15.17-10 (Tue, 22 May 2018 11:15:44 +0200) ()
Jun 17 02:08:25 vm5 kernel: [    0.000000] Command line: BOOT_IMAGE=/vmlinuz-4.15.17-2-pve root=UUID=dc2c6eeb-e09e-4e1b-a1b2-658c64d9dd62 ro nomodeset consoleblank=0
Jun 17 02:08:25 vm5 kernel: [    0.000000] KERNEL supported cpus:
Jun 17 02:08:25 vm5 kernel: [    0.000000]   Intel GenuineIntel
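Since nothing lands in syslog after the hang, one way to catch the last messages is a persistent journal, plus optionally netconsole to mirror kernel output to a second host (a sketch; the IPs, ports, interface name and MAC below are placeholders, and netconsole through the very NIC that hangs may of course deliver nothing):
Code:
# keep the journal across reboots so pre-freeze messages survive
mkdir -p /var/log/journal
systemctl restart systemd-journald

# mirror kernel messages to another machine over UDP
modprobe netconsole netconsole=6665@192.0.2.10/enp0s31f6,6666@192.0.2.20/00:11:22:33:44:55
# on the receiving machine: nc -u -l 6666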
 

michaelvv

Member
Oct 9, 2008
94
1
6
Same issue on my private home server. Worst bug I have ever seen in my 8-year Proxmox journey.

I started seeing this about 1 to 1.5 months ago, after an update. I tried adding these lines to my network config,
but I still have the issue.

offload-tx off
offload-sg off
offload-tso off

proxmox-ve: 5.2-2 (running kernel: 4.15.18-1-pve)
pve-manager: 5.2-5 (running version: 5.2-5/eb24855a)

ethtool -i eth0
driver: e1000e
version: 3.4.1.1-NAPI
firmware-version: 0.13-4
expansion-rom-version:
bus-info: 0000:00:19.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: no
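To check whether a box is hitting the same e1000e hang ITNiels reported, the kernel log can be searched for the message (just a check, nothing more):
Code:
dmesg -T | grep -i "hardware unit hang"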
 

gallew

New Member
Oct 9, 2015
26
6
3
I had the same problem with Hetzner machines (two clusters).
I don't know if it helps, but it has been running for 2 weeks now without problems (knocking on wood), so I think it is worth sharing:
Code:
# e1000e module hang problem
/sbin/ethtool -K eth0 tx off rx off
When executing this, expect a second or two of outage.
I use traditional NIC names; your mileage may vary.
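On boxes with predictable interface names instead of eth0, look the name up first and substitute it (enp0s31f6 below is only an example):
Code:
ip -br link                                # list interface names
/sbin/ethtool -K enp0s31f6 tx off rx off   # substitute your NIC's name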

As for lagging in KVM machines, I'd check the MTUs on all systems (physical interfaces, bridges, the NIC inside the KVM guest, etc.).
I had one case where one NIC had a different MTU than the others, and the result was the same: lagging because of packet fragmentation.
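A quick way to list the MTUs on the host (repeat inside each guest; plain shell, nothing assumed beyond sysfs):
Code:
for dev in /sys/class/net/*; do
    printf '%s: %s\n' "$(basename "$dev")" "$(cat "$dev/mtu")"
done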
 

Jarek Hartman

New Member
Aug 3, 2018
2
1
3
42
I can only confirm the same issue here.

As I was suspecting a HW issue, I ordered a mainboard replacement, but as I can see now: no improvement at all.

I will try the ethtool trick, but I think somebody should start thinking about a proper fix. Which element of the stack (kernel, NIC drivers, ...) do you think might be responsible? I'd like to raise a formal ticket, as this issue is really annoying.



Best regards,
Jarek

------

Notes to self (to remember what I've done)

Output when running from the CLI:

Code:
root@wieloryb-pve:/etc/rc.d/init.d# /sbin/ethtool -K enp0s31f6 tx off rx off
Cannot get device udp-fragmentation-offload settings: Operation not supported
Cannot get device udp-fragmentation-offload settings: Operation not supported
Actual changes:
rx-checksumming: off
tx-checksumming: off
    tx-checksum-ip-generic: off
tcp-segmentation-offload: off
    tx-tcp-segmentation: off [requested on]
    tx-tcp6-segmentation: off [requested on]
Preserving the changes across reboots:

Code:
root@wieloryb-pve:~# cat /etc/network/if-up.d/ethtool2
#!/bin/sh

/sbin/ethtool -K enp0s31f6 tx off rx off

root@wieloryb-pve:~# chmod 755 /etc/network/if-up.d/ethtool2
Reboot and verify:

Code:
root@wieloryb-pve:/etc#  shutdown -r now

root@wieloryb-pve:/etc/rc.d/init.d# /sbin/ethtool -k enp0s31f6
Features for enp0s31f6:
Cannot get device udp-fragmentation-offload settings: Operation not supported
rx-checksumming: off                   <--------- SHOULD BE OFF, HERE AND IN A FEW OTHER PLACES
tx-checksumming: off
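One possible refinement of the if-up.d script: ifupdown exports $IFACE to its hook scripts, so the script can skip every interface except the physical NIC (the interface name here is a placeholder):
Code:
#!/bin/sh
# /etc/network/if-up.d/ethtool2 - act only on the physical NIC
[ "$IFACE" = "enp0s31f6" ] || exit 0
/sbin/ethtool -K "$IFACE" tx off rx off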
 

tobimuc

New Member
Jan 18, 2014
1
0
1
Hi!

I still have the same problem. The server is a Hetzner EX51 ...

Now I will try the solution from Jarek and wait :)

Tobi
 
Apr 10, 2018
1
0
1
52
Hi!

I also had the same problem with a Hetzner PX61 server. I have now applied Gallew's and Jarek's solution and it seems to be fixed.
But, as this is just a workaround, is there any chance that this will be fixed persistently by a Proxmox kernel update?

Greetings,
Dietmar
 

gallew

New Member
Oct 9, 2015
26
6
3
...
But, as this is just a workaround, is there any chance that this will be fixed persistently by a Proxmox kernel update?
...
Big ditto here about fixing the kernel module.

Unfortunately, after running for a little over two months, one of the machines suffered an e1000e hang again.
Investigation showed that the e1000e module still hangs repeatedly, regardless of checksum offloading.
There were many hangs in the log, but the latest one took the machine offline.
That got me thinking that rc.local may not be the best place for the ethtool command.
This is because, although rc.local disables checksum offloading after a reboot, after the first module hang ifupdown brings the interface back up without those settings.
Therefore it would be better to put this into the /etc/network/interfaces file, under the main NIC:
Code:
  offload-tx  off
  offload-sg  off
  offload-tso off
  post-up /sbin/ethtool -K eth0 tx off rx off
Or, as Jarek did, as an executable script that runs every time the NIC comes up.
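For context, a sketch of where that sits in a typical PVE 5 /etc/network/interfaces (the interface and bridge names and the addresses are only examples, and the offload-* options need Debian's ethtool package installed):
Code:
auto enp0s31f6
iface enp0s31f6 inet manual
    offload-tx  off
    offload-sg  off
    offload-tso off
    post-up /sbin/ethtool -K enp0s31f6 tx off rx off

auto vmbr0
iface vmbr0 inet static
    address 203.0.113.10
    netmask 255.255.255.0
    gateway 203.0.113.1
    bridge_ports enp0s31f6
    bridge_stp off
    bridge_fd 0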
 

celtar

New Member
Feb 10, 2016
3
0
1
51
...
Therefore it would be better to put this into the /etc/network/interfaces file, under the main NIC, or, as Jarek did, as an executable script that runs every time the NIC comes up.
...
We have the same problems here with Broadcom and ASUS cards. I am not sure, but might you lose manual changes in /etc/network/interfaces if you change something in the web GUI network section?
 

gallew

New Member
Oct 9, 2015
26
6
3
We have the same problems here with Broadcom and ASUS cards. I am not sure, but might you lose manual changes in /etc/network/interfaces if you change something in the web GUI network section?
I'm not sure, maybe the developers can confirm it, but in my case I have changed the network config via the web UI only once, and that was a long time ago; I did not notice anything missing.
I guess the best way to find out would be to test whether the parameters are still there after a network reconfiguration via the web UI.
Also, my current parameters for eth0 are:
Remember! ethtool still needs to be installed!

Code:
  offload-rx  off
  offload-tx  off
  offload-sg  off
  offload-tso off
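On Debian/PVE that is simply:
Code:
apt install ethtool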
 
Jan 21, 2017
280
26
28
30
Berlin
@proxmox

Is there something you can do about this? Since you provide the kernel image, you could patch it, right?

Do we know which network hardware causes this issue so we can configure new hardware accordingly?

The first post was in June and no update since?
 

Stoiko Ivanov

Proxmox Staff Member
Staff member
May 2, 2018
2,033
205
63
As far as I can see, the problem is that it's not one particular kernel module that always causes it; various modules show problems sometimes, with certain specific NICs. E.g. most e1000(e) NICs work fine, but some do exhibit those problems.

Sometimes patching the BIOS and all firmware resolves the issue for the affected users, sometimes disabling offloading fixes the problem.

But it is nothing deterministic, and I don't see what we could do to improve the situation for everyone.
 
Jan 21, 2017
280
26
28
30
Berlin
Thanks for replying so quickly.

I personally didn't know that this is not related to kernel bugs.
If this is mostly about firmware upgrades, then it can be solved by checking those.
But driver support should be handled by the OS itself, especially for standard NICs.

Can you recommend drivers where those bugs are not present?

Is there a list of defective hardware NICs for reference?
 

Stoiko Ivanov

Proxmox Staff Member
Staff member
May 2, 2018
2,033
205
63
Can you recommend drivers where those bugs are not present?
I have a hard time giving definite recommendations (7 months ago I would have told everyone: use Intel, they always work without any problem; yet, as this thread indicates, it depends on other factors as well).
I personally never had any problems with Intel NICs (e1000, e1000e, igb, ixgbe, i40e) or with newer cards using `tg3`, but this is statistically not significant.
 
