One of my Proxmox nodes freezes every few hours, and we have to do a hard reboot to bring it back online. I can't find anything in the syslog. Can anyone help with what I should check to find the cause and a solution?
Jun 17 01:50:00 vm5 systemd[1]: Starting Proxmox VE replication runner...
Jun 17 01:50:01 vm5 systemd[1]: Started Proxmox VE replication runner.
Jun 17 01:50:05 vm5 pvedaemon[1877]: <*********@pam> successful auth for user '*********'
Jun 17 01:50:27 vm5 pvedaemon[1597]: <*********@pam> successful auth for user '*********'
Jun 17 01:51:00 vm5 systemd[1]: Starting Proxmox VE replication runner...
Jun 17 01:51:01 vm5 systemd[1]: Started Proxmox VE replication runner.
Jun 17 01:52:00 vm5 systemd[1]: Starting Proxmox VE replication runner...
Jun 17 01:52:01 vm5 systemd[1]: Started Proxmox VE replication runner.
Jun 17 01:52:29 vm5 postfix/smtpd[10415]: connect from unknown[*********]
Jun 17 01:52:29 vm5 postfix/smtpd[10415]: lost connection after AUTH from unknown[*********]
Jun 17 01:52:29 vm5 postfix/smtpd[10415]: disconnect from unknown[*********] ehlo=1 auth=0/1 commands=1/2
Jun 17 01:53:00 vm5 systemd[1]: Starting Proxmox VE replication runner...
Jun 17 01:53:01 vm5 systemd[1]: Started Proxmox VE replication runner.
Jun 17 01:53:15 vm5 pveproxy[8181]: worker exit
Jun 17 01:53:15 vm5 pveproxy[1819]: worker 8181 finished
Jun 17 01:53:15 vm5 pveproxy[1819]: starting 1 worker(s)
Jun 17 01:53:15 vm5 pveproxy[1819]: worker 10491 started
Jun 17 01:53:43 vm5 pveproxy[1819]: worker 8051 finished
Jun 17 01:53:43 vm5 pveproxy[1819]: starting 1 worker(s)
Jun 17 01:53:43 vm5 pveproxy[1819]: worker 10521 started
Jun 17 01:53:44 vm5 pveproxy[10520]: worker exit
Jun 17 01:54:00 vm5 systemd[1]: Starting Proxmox VE replication runner...
Jun 17 01:54:01 vm5 systemd[1]: Started Proxmox VE replication runner.
Jun 17 01:55:00 vm5 systemd[1]: Starting Proxmox VE replication runner...
Jun 17 01:55:00 vm5 pvedaemon[3977]: <*********@pam> successful auth for user '*********'
Jun 17 01:55:01 vm5 systemd[1]: Started Proxmox VE replication runner.
Jun 17 01:55:50 vm5 postfix/anvil[10417]: statistics: max connection rate 1/60s for (smtp:185.234.217.38) at Jun 17 01:52:29
Jun 17 01:55:50 vm5 postfix/anvil[10417]: statistics: max connection count 1 for (smtp:185.234.217.38) at Jun 17 01:52:29
Jun 17 01:55:50 vm5 postfix/anvil[10417]: statistics: max cache size 1 at Jun 17 01:52:29
Jun 17 01:56:00 vm5 systemd[1]: Starting Proxmox VE replication runner...
Jun 17 01:56:01 vm5 systemd[1]: Started Proxmox VE replication runner.
Jun 17 01:57:00 vm5 systemd[1]: Starting Proxmox VE replication runner...
Jun 17 01:57:01 vm5 systemd[1]: Started Proxmox VE replication runner.
Jun 17 01:58:00 vm5 systemd[1]: Starting Proxmox VE replication runner...
Jun 17 01:58:01 vm5 systemd[1]: Started Proxmox VE replication runner.
Jun 17 01:59:00 vm5 systemd[1]: Starting Proxmox VE replication runner...
Jun 17 01:59:01 vm5 systemd[1]: Started Proxmox VE replication runner.
Jun 17 02:00:00 vm5 systemd[1]: Starting Proxmox VE replication runner...
Jun 17 02:00:01 vm5 systemd[1]: Started Proxmox VE replication runner.
Jun 17 02:01:00 vm5 systemd[1]: Starting Proxmox VE replication runner...
Jun 17 02:01:01 vm5 systemd[1]: Started Proxmox VE replication runner.
<========= DEAD HERE, AND THEN WE HARD RESET THE SERVER =========>
Jun 17 02:08:25 vm5 systemd-modules-load[330]: Inserted module 'iscsi_tcp'
Jun 17 02:08:25 vm5 kernel: [ 0.000000] Linux version 4.15.17-2-pve (tlamprecht@evita) (gcc version 6.3.0 20170516 (Debian 6.3.0-18+deb9u1)) #1 SMP PVE 4.15.17-10 (Tue, 22 May 2018 11:15:44 +0200) ()
Jun 17 02:08:25 vm5 kernel: [ 0.000000] Command line: BOOT_IMAGE=/vmlinuz-4.15.17-2-pve root=UUID=dc2c6eeb-e09e-4e1b-a1b2-658c64d9dd62 ro nomodeset consoleblank=0
Jun 17 02:08:25 vm5 kernel: [ 0.000000] KERNEL supported cpus:
Jun 17 02:08:25 vm5 kernel: [ 0.000000] Intel GenuineIntel
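Since nothing shows up in syslog right before the freeze, I would first make sure kernel messages can actually survive the crash. A minimal sketch, assuming the NIC is enp0s31f6 and 192.0.2.1/192.0.2.10 stand in for this host and another box on the LAN (all placeholders, adjust for your setup):

# keep the journal across reboots, then after the next freeze inspect the end of the previous boot
mkdir -p /var/log/journal
systemctl restart systemd-journald
journalctl -b -1 -e

# stream kernel messages to another machine, so an oops/panic is captured even if
# the local disk never gets to see it (format: local-port@local-ip/iface,remote-port@remote-ip/[remote-mac])
modprobe netconsole netconsole=6666@192.0.2.1/enp0s31f6,514@192.0.2.10/
# on the receiving box, listen on UDP 514 (netcat or a syslog daemon will do)

If it is a genuine hardware hang, even netconsole may stay silent; in that case the IPMI/BMC event log (if the board has one) is worth a look.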
# e1000e module hang problem
/sbin/ethtool -K eth0 tx off rx off
root@wieloryb-pve:/etc/rc.d/init.d# /sbin/ethtool -K enp0s31f6 tx off rx off
Cannot get device udp-fragmentation-offload settings: Operation not supported
Cannot get device udp-fragmentation-offload settings: Operation not supported
Actual changes:
rx-checksumming: off
tx-checksumming: off
tx-checksum-ip-generic: off
tcp-segmentation-offload: off
tx-tcp-segmentation: off [requested on]
tx-tcp6-segmentation: off [requested on]
root@wieloryb-pve:~# cat /etc/network/if-up.d/ethtool2
#!/bin/sh
/sbin/ethtool -K enp0s31f6 tx off rx off
root@wieloryb-pve:~# chmod 755 /etc/network/if-up.d/ethtool2
root@wieloryb-pve:/etc# shutdown -r now
root@wieloryb-pve:/etc/rc.d/init.d# /sbin/ethtool -k enp0s31f6
Features for enp0s31f6:
Cannot get device udp-fragmentation-offload settings: Operation not supported
rx-checksumming: off <--------- SHOULD BE OFF, HERE AND A FEW OTHER PLACES
tx-checksumming: off
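In case it helps others confirm they are hitting the same thing: the e1000e driver usually complains in the kernel log when its transmit unit hangs, so around the freezes you should be able to find messages along the lines of "Detected Hardware Unit Hang" (wording from the driver, as far as I remember). Something like:

dmesg | grep -B2 -A8 "Hardware Unit Hang"
journalctl -k -b -1 | grep -i e1000e

If those show up shortly before the lockups, it is this NIC issue and not a random hang.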
I can only confirm the same issue here.
Big ditto here about fixing the kernel module...
But since this is just a workaround, is there any chance this will be fixed permanently by a Proxmox kernel update?
...
offload-tx off
offload-sg off
offload-tso off
post-up /sbin/ethtool -K eth0 tx off rx off
Big ditto here about fixing the kernel module.
Unfortunately, after running for a little over two months, one of the machines suffered from the e1000e hang again.
Investigation showed that the e1000e module still hangs repeatedly, regardless of checksum offloading.
There were many hangs in the log, but the latest one took the machine offline.
That got me thinking that rc.local may not be the best place for the ethtool command.
This is because, although checksum offloading is disabled by rc.local after a reboot, after the first module hang ifupdown brings the interface back up with offloading enabled again.
Therefore it would be better to put this into the /etc/network/interfaces file, under the main NIC:
offload-tx off
offload-sg off
offload-tso off
post-up /sbin/ethtool -K eth0 tx off rx off
Or, as Jarek did, as an executable script that is run every time the NIC comes up.
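For reference, a minimal sketch of such a hook script (the file name is just an example; ifupdown exports the interface name in $IFACE to everything in /etc/network/if-up.d/, so the check keeps it from touching other NICs):

#!/bin/sh
# /etc/network/if-up.d/disable-offload (example name), must be chmod 755
# ifupdown runs this on every ifup and sets $IFACE to the interface that came up
[ "$IFACE" = "enp0s31f6" ] || exit 0
/sbin/ethtool -K "$IFACE" tx off rx off

That way, per the reasoning above, the offloads get switched off again whenever the interface is brought back up.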
I'm not sure, maybe the developers can confirm it, but in my case I have changed the network config via the web UI only once, and that was a long time ago; I did not notice that anything was missing.

We got the same problems here with Broadcom and ASUS cards. I am not sure, but you may lose manual changes in /etc/network/interfaces if you change something in the web GUI network section?
offload-rx off
offload-tx off
offload-sg off
offload-tso off
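Regarding losing manual edits through the web GUI: I can't confirm whether it really overwrites them, but an easy way to check is to keep a copy before touching the network page and diff it afterwards (paths are just an example):

cp -a /etc/network/interfaces /root/interfaces.before-gui
# ...make the change in the GUI, then:
diff -u /root/interfaces.before-gui /etc/network/interfaces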
I do have a hard time giving definite recommendations (7 months ago I would have told everyone: use Intel, they always work without any problem; yet as this thread indicates, this depends on some other factors as well).

Can you recommend drivers where those bugs are not present?
Hi @DerDanilo, @gallew, @celtar, how did you guys end up with this issue? I have hardware that had been fine for weeks and then started suffering from this issue. It's a critical node, so I'm very interested to know if and how you solved it.
iface "yourcardname" inet manual
mtu 9000 # careful: all switches and cards must be set like this
offload-tx off
offload-rx off
offload-gso off
post-up ethtool -s "yourcardname" speed 10000 duplex full autoneg off # only if you have problems with autoneg
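After editing the stanza you would bring the interface down and up again (or reboot) and double-check that the offloads really stayed off; roughly like this, with the interface name as a placeholder (mind that ifdown on your main NIC will drop an SSH session):

ifdown enp0s31f6 && ifup enp0s31f6
ethtool -k enp0s31f6 | grep -E '(rx|tx)-checksumming|segmentation-offload'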