Proxmox Node freezes

nitaish

Member
Feb 1, 2014
50
2
8
Mumbai
www.techknowlogy.in
One of my Proxmox nodes freezes every few hours. We have to do a hard reboot to bring it back online. I am unable to find anything in the syslog. Can anyone suggest what I should check to find the cause and a solution?
 

fireon

Well-Known Member
Oct 25, 2010
3,067
194
63
Austria/Graz
iteas.at
What type and model of server do you have?
Do you have iLO/iDRAC or IPMI access to see what problem the hardware has?
What does the load graph show? Is there heavy load (CPU, I/O, ...) on the server?
Tell us more about the hardware: RAID, ZFS, how many disks, how fast?
What PVE version? Is the system up to date and clean? "apt install -f"
If you connect a monitor and keyboard directly to the node, do you see an error when the server crashes?
Also attach your syslog for the relevant time frame.
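For a quick start, something like this (a minimal sketch; ipmitool is an assumption and only works if the box has a BMC, and the previous-boot journal only exists if journald is configured persistently):
Code:
# Proxmox and kernel package versions
pveversion -v

# kernel messages from the boot before the freeze (needs a persistent journal)
journalctl -k -b -1

# hardware event log from the BMC, if present
ipmitool sel list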
 

ITNiels

New Member
Jun 17, 2018
1
0
1
36
Hi :)

We currently have the same problem!
For the last few days the server has just disappeared several times: no more logs, nothing!
We can't ping it, SSH in, or reach the web UI, and only a remote hard reboot brings it back.

We had Hetzner replace the server but keep the disks, as our initial thought was a network card issue.
We got an error: "e1000e 000:00:1F.6 enp0s31f6: Detected Hardware Unit Hang"

But replacing all the hardware did not fix the issue!
It happens completely at random!

Hope this can help in investigating the issue.

Kind regards
Niels

Stats:
Load: < 1
CPU: <5%
IO: <1%
Last full update and reboot: June 9th

Server:
Model: Hetzner EX41
Software:
OS: Debian 9
Proxmox: pve-manager/5.2-1/0fcd7879
Kernel: Linux 4.15.17-2-pve #1 SMP PVE 4.15.17-10 (Tue, 22 May 2018 11:15:44 +0200)

Hardware:
Intel® Core i7-6700 Quad-Core processor
2 x 500 GB SATA 6 Gb/s SSD (Micron 1100) (2 separate disks without LVM)
32 GB DDR4
1 GBit/s-Port

Syslog just before/after it freezes:
Code:
Jun 17 01:50:00 vm5 systemd[1]: Starting Proxmox VE replication runner...
Jun 17 01:50:01 vm5 systemd[1]: Started Proxmox VE replication runner.
Jun 17 01:50:05 vm5 pvedaemon[1877]: <*********@pam> successful auth for user '*********'
Jun 17 01:50:27 vm5 pvedaemon[1597]: <*********@pam> successful auth for user '*********'
Jun 17 01:51:00 vm5 systemd[1]: Starting Proxmox VE replication runner...
Jun 17 01:51:01 vm5 systemd[1]: Started Proxmox VE replication runner.
Jun 17 01:52:00 vm5 systemd[1]: Starting Proxmox VE replication runner...
Jun 17 01:52:01 vm5 systemd[1]: Started Proxmox VE replication runner.
Jun 17 01:52:29 vm5 postfix/smtpd[10415]: connect from unknown[*********]
Jun 17 01:52:29 vm5 postfix/smtpd[10415]: lost connection after AUTH from unknown[*********]
Jun 17 01:52:29 vm5 postfix/smtpd[10415]: disconnect from unknown[*********] ehlo=1 auth=0/1 commands=1/2
Jun 17 01:53:00 vm5 systemd[1]: Starting Proxmox VE replication runner...
Jun 17 01:53:01 vm5 systemd[1]: Started Proxmox VE replication runner.
Jun 17 01:53:15 vm5 pveproxy[8181]: worker exit
Jun 17 01:53:15 vm5 pveproxy[1819]: worker 8181 finished
Jun 17 01:53:15 vm5 pveproxy[1819]: starting 1 worker(s)
Jun 17 01:53:15 vm5 pveproxy[1819]: worker 10491 started
Jun 17 01:53:43 vm5 pveproxy[1819]: worker 8051 finished
Jun 17 01:53:43 vm5 pveproxy[1819]: starting 1 worker(s)
Jun 17 01:53:43 vm5 pveproxy[1819]: worker 10521 started
Jun 17 01:53:44 vm5 pveproxy[10520]: worker exit
Jun 17 01:54:00 vm5 systemd[1]: Starting Proxmox VE replication runner...
Jun 17 01:54:01 vm5 systemd[1]: Started Proxmox VE replication runner.
Jun 17 01:55:00 vm5 systemd[1]: Starting Proxmox VE replication runner...
Jun 17 01:55:00 vm5 pvedaemon[3977]: <*********@pam> successful auth for user '*********'
Jun 17 01:55:01 vm5 systemd[1]: Started Proxmox VE replication runner.
Jun 17 01:55:50 vm5 postfix/anvil[10417]: statistics: max connection rate 1/60s for (smtp:185.234.217.38) at Jun 17 01:52:29
Jun 17 01:55:50 vm5 postfix/anvil[10417]: statistics: max connection count 1 for (smtp:185.234.217.38) at Jun 17 01:52:29
Jun 17 01:55:50 vm5 postfix/anvil[10417]: statistics: max cache size 1 at Jun 17 01:52:29
Jun 17 01:56:00 vm5 systemd[1]: Starting Proxmox VE replication runner...
Jun 17 01:56:01 vm5 systemd[1]: Started Proxmox VE replication runner.
Jun 17 01:57:00 vm5 systemd[1]: Starting Proxmox VE replication runner...
Jun 17 01:57:01 vm5 systemd[1]: Started Proxmox VE replication runner.
Jun 17 01:58:00 vm5 systemd[1]: Starting Proxmox VE replication runner...
Jun 17 01:58:01 vm5 systemd[1]: Started Proxmox VE replication runner.
Jun 17 01:59:00 vm5 systemd[1]: Starting Proxmox VE replication runner...
Jun 17 01:59:01 vm5 systemd[1]: Started Proxmox VE replication runner.
Jun 17 02:00:00 vm5 systemd[1]: Starting Proxmox VE replication runner...
Jun 17 02:00:01 vm5 systemd[1]: Started Proxmox VE replication runner.
Jun 17 02:01:00 vm5 systemd[1]: Starting Proxmox VE replication runner...
Jun 17 02:01:01 vm5 systemd[1]: Started Proxmox VE replication runner.

<========= DEAD HERE AND THEN WE HARD RESET THE SERVER =========>

Jun 17 02:08:25 vm5 systemd-modules-load[330]: Inserted module 'iscsi_tcp'
Jun 17 02:08:25 vm5 kernel: [    0.000000] Linux version 4.15.17-2-pve (tlamprecht@evita) (gcc version 6.3.0 20170516 (Debian 6.3.0-18+deb9u1)) #1 SMP PVE 4.15.17-10 (Tue, 22 May 2018 11:15:44 +0200) ()
Jun 17 02:08:25 vm5 kernel: [    0.000000] Command line: BOOT_IMAGE=/vmlinuz-4.15.17-2-pve root=UUID=dc2c6eeb-e09e-4e1b-a1b2-658c64d9dd62 ro nomodeset consoleblank=0
Jun 17 02:08:25 vm5 kernel: [    0.000000] KERNEL supported cpus:
Jun 17 02:08:25 vm5 kernel: [    0.000000]   Intel GenuineIntel
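Since nothing lands in syslog after the hang, one way to catch the last messages is a persistent journal, plus optionally netconsole to mirror kernel output to a second host (a sketch; the IPs, ports, interface name and MAC below are placeholders, and netconsole through the very NIC that hangs may of course deliver nothing):
Code:
# keep the journal across reboots so pre-freeze messages survive
mkdir -p /var/log/journal
systemctl restart systemd-journald

# mirror kernel messages to another machine over UDP
modprobe netconsole netconsole=6665@192.0.2.10/enp0s31f6,6666@192.0.2.20/00:11:22:33:44:55
# on the receiving machine: nc -u -l 6666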
 

michaelvv

Member
Oct 9, 2008
94
1
6
Same issue on my private home server. Worst bug I have ever seen in my 8-year Proxmox journey.

I started seeing this about 1 to 1.5 months ago, after an update. I tried adding these lines to my network config,
but I still have the issue.

offload-tx off
offload-sg off
offload-tso off

proxmox-ve: 5.2-2 (running kernel: 4.15.18-1-pve)
pve-manager: 5.2-5 (running version: 5.2-5/eb24855a)

ethtool -i eth0
driver: e1000e
version: 3.4.1.1-NAPI
firmware-version: 0.13-4
expansion-rom-version:
bus-info: 0000:00:19.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: no
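To check whether a box is hitting the same e1000e hang ITNiels reported, the kernel log can be searched for the message (just a check, nothing more):
Code:
dmesg -T | grep -i "hardware unit hang"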
 

gallew

New Member
Oct 9, 2015
26
6
3
I had the same problem with Hetzner machines (two clusters).
I don't know if it helps, but it has been running for 2 weeks now without problems (knocking on wood), so I think it is worth sharing:
Code:
# e1000e module hang problem
/sbin/ethtool -K eth0 tx off rx off
When executing this, expect a second or two of outage.
I use traditional NIC names; your mileage may vary.
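On boxes with predictable interface names instead of eth0, look the name up first and substitute it (enp0s31f6 below is only an example):
Code:
ip -br link                                # list interface names
/sbin/ethtool -K enp0s31f6 tx off rx off   # substitute your NIC's name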

As for lagging in KVM machines, I'd check the MTUs on all systems (physical interfaces, bridges, the NIC inside the KVM guest, etc.).
I had one case where one NIC had a different MTU than the others, and the result was the same: lagging because of packet fragmentation.
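A quick way to list the MTUs on the host (repeat inside each guest; plain shell, nothing assumed beyond sysfs):
Code:
for dev in /sys/class/net/*; do
    printf '%s: %s\n' "$(basename "$dev")" "$(cat "$dev/mtu")"
done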
 

Jarek Hartman

New Member
Aug 3, 2018
2
1
3
42
I can only confirm the same issue here.

As I was suspecting a HW issue, I ordered a mainboard replacement, but as I can see now: no improvement at all.

I will try the ethtool trick, but I think somebody should start thinking about a proper fix. Which element of the stack (kernel, NIC drivers, ...) do you think might be responsible? I'd like to raise a formal ticket, as this issue is really annoying.



Best regards,
Jarek

------

Notes to self (to remember what I've done)

Output when running from the CLI:

Code:
root@wieloryb-pve:/etc/rc.d/init.d# /sbin/ethtool -K enp0s31f6 tx off rx off
Cannot get device udp-fragmentation-offload settings: Operation not supported
Cannot get device udp-fragmentation-offload settings: Operation not supported
Actual changes:
rx-checksumming: off
tx-checksumming: off
    tx-checksum-ip-generic: off
tcp-segmentation-offload: off
    tx-tcp-segmentation: off [requested on]
    tx-tcp6-segmentation: off [requested on]
Preserving the changes across reboots:

Code:
root@wieloryb-pve:~# cat /etc/network/if-up.d/ethtool2
#!/bin/sh

/sbin/ethtool -K enp0s31f6 tx off rx off

root@wieloryb-pve:~# chmod 755 /etc/network/if-up.d/ethtool2
Reboot and verify:

Code:
root@wieloryb-pve:/etc#  shutdown -r now

root@wieloryb-pve:/etc/rc.d/init.d# /sbin/ethtool -k enp0s31f6
Features for enp0s31f6:
Cannot get device udp-fragmentation-offload settings: Operation not supported
rx-checksumming: off                   <--------- SHOULD BE OFF, HERE AND IN A FEW OTHER PLACES
tx-checksumming: off
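One possible refinement of the if-up.d script: ifupdown exports $IFACE to its hook scripts, so the script can skip every interface except the physical NIC (the interface name here is a placeholder):
Code:
#!/bin/sh
# /etc/network/if-up.d/ethtool2 - act only on the physical NIC
[ "$IFACE" = "enp0s31f6" ] || exit 0
/sbin/ethtool -K "$IFACE" tx off rx off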
 

tobimuc

New Member
Jan 18, 2014
1
0
1
Hi!

I still have the same problem. The server is a Hetzner EX51 ...

Now I will try the solution from Jarek and wait :)

Tobi
 
Apr 10, 2018
1
0
1
52
Hi!

I also had the same problem with a Hetzner PX61 server. I have now applied Gallew's and Jarek's solution and it seems to be fixed.
But, as this is just a workaround, is there any chance that this will be fixed persistently by a Proxmox kernel update?

Greetings,
Dietmar
 

gallew

New Member
Oct 9, 2015
26
6
3
...
But, as this is just a workaround, is there any chance that this will be fixed persistently by a Proxmox kernel update?
...
Big ditto here about fixing the kernel module.

Unfortunately, after running for a little over two months, one of the machines suffered an e1000e hang again.
Investigation showed that the e1000e module still hangs repeatedly, regardless of checksum offloading.
There were many hangs in the log, but the latest one took the machine offline.
That got me thinking that rc.local may not be the best place for the ethtool command.
This is because, although rc.local disables checksum offloading after a reboot, after the first module hang ifupdown brings the interface back up without those settings.
Therefore it would be better to put this into the /etc/network/interfaces file, under the main NIC:
Code:
  offload-tx  off
  offload-sg  off
  offload-tso off
  post-up /sbin/ethtool -K eth0 tx off rx off
Or, as Jarek did, as an executable script that runs every time the NIC comes up.
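For context, a sketch of where that sits in a typical PVE 5 /etc/network/interfaces (the interface and bridge names and the addresses are only examples, and the offload-* options need Debian's ethtool package installed):
Code:
auto enp0s31f6
iface enp0s31f6 inet manual
    offload-tx  off
    offload-sg  off
    offload-tso off
    post-up /sbin/ethtool -K enp0s31f6 tx off rx off

auto vmbr0
iface vmbr0 inet static
    address 203.0.113.10
    netmask 255.255.255.0
    gateway 203.0.113.1
    bridge_ports enp0s31f6
    bridge_stp off
    bridge_fd 0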
 

celtar

New Member
Feb 10, 2016
3
0
1
51
...
Therefore it would be better to put this into the /etc/network/interfaces file, under the main NIC, or, as Jarek did, as an executable script that runs every time the NIC comes up.
...
We have the same problems here with Broadcom and ASUS cards. I am not sure, but might you lose manual changes in /etc/network/interfaces if you change something in the web GUI network section?
 

gallew

New Member
Oct 9, 2015
26
6
3
We have the same problems here with Broadcom and ASUS cards. I am not sure, but might you lose manual changes in /etc/network/interfaces if you change something in the web GUI network section?
I'm not sure, maybe the developers can confirm it, but in my case I have changed the network config via the web UI only once, and that was a long time ago; I did not notice anything missing.
I guess the best way to find out would be to test whether the parameters are still there after a network reconfiguration via the web UI.
Also, my current parameters for eth0 are:
Remember! ethtool still needs to be installed!

Code:
  offload-rx  off
  offload-tx  off
  offload-sg  off
  offload-tso off
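On Debian/PVE that is simply:
Code:
apt install ethtool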
 
Jan 21, 2017
280
26
28
30
Berlin
@proxmox

Is there something you can do about this? Since you provide the kernel image, you could patch it, right?

Do we know which network hardware causes this issue so we can configure new hardware accordingly?

The first post was in June and no update since?
 

Stoiko Ivanov

Proxmox Staff Member
Staff member
May 2, 2018
2,033
205
63
As far as I can see, the problem is that it's not one particular kernel module that always causes it; various modules show problems sometimes, with certain specific NICs. E.g. most e1000(e) NICs work fine, but some do exhibit those problems.

Sometimes patching the BIOS and all firmware resolves the issue for the affected users, sometimes disabling offloading fixes the problem.

But it is nothing deterministic, and I don't see what we could do to improve the situation for everyone.
 
Jan 21, 2017
280
26
28
30
Berlin
Thanks for replying so quickly.

I personally didn't know that this is not related to kernel bugs.
If this is mostly about firmware upgrades, then it can be solved by checking those.
But driver support should be handled by the OS itself, especially for standard NICs.

Can you recommend drivers where those bugs are not present?

Is there a list of defective hardware NICs for reference?
 

Stoiko Ivanov

Proxmox Staff Member
Staff member
May 2, 2018
2,033
205
63
Can you recommend drivers where those bugs are not present?
I have a hard time giving definite recommendations (7 months ago I would have told everyone: use Intel, they always work without any problem; yet, as this thread indicates, it depends on other factors as well).
I personally never had any problems with Intel NICs (e1000, e1000e, igb, ixgbe, i40e) or with newer cards using `tg3`, but this is statistically not significant.
 
