Proxmox Crashing on Hetzner Server with Intel I219-LM NIC

techk1d

New Member
Aug 25, 2019
Hi Guys,

I am having intermittent crashing issues on a new server I leased from Hetzner. It is stable for only 24-36 hours, then the entire host goes offline and requires a power cycle.

I have searched the forums and assembled the following information for you, to save time:

Possible solution I am testing (I will report back if it stabilizes my system):
https://serverfault.com/questions/6...pter-unexpectedly-detected-hardware-unit-hang
and/or https://jhartman.pl/2018/08/06/proxmox-enp0s31f6-detected-hardware-unit-hang/

If that doesn't work, I'll try this solution: disabling Enhanced C1 (C1E) in the BIOS.
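As I understand them, both links above describe the same workaround: disabling segmentation offloads on the e1000e NIC with ethtool. A minimal sketch of what I am testing, assuming eno1 as the interface name from my logs below (the exact set of offloads to disable varies between reports):

ethtool -K eno1 tso off gso off

# To persist across reboots, add a post-up hook to the interface
# stanza in /etc/network/interfaces, for example:
#
#   iface eno1 inet manual
#           post-up /sbin/ethtool -K eno1 tso off gso off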

I want to provide the logs of what I am seeing so that you can perhaps update Proxmox to fix this issue with this model of NIC; I've seen other people having similar issues with it under Proxmox. Please let me know if you know of any other solutions.

Thank you for your time.

root@prox01 ~ # lspci | egrep -i --color 'network|ethernet'
00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection (2) I219-LM

I really need help because I just migrated to this node and it's not financially possible for me to migrate to another server.

Logs:

I put the long form logs here: https://pastebin.com/fR4tJ9Ji

Short version:

root@prox01 ~ # qm config 100
agent: 1
bootdisk: scsi0
cores: 4
ide2: local:iso/CentOS-7-x86_64-Minimal-1810.iso,media=cdrom
memory: 16384
name: Proxmox-VM01
net0: virtio=FA:3E:C8:D5:83:2E,bridge=vmbr0,firewall=1
numa: 0
onboot: 1
ostype: l26
scsi0: prox01vmstorage:vm-100-disk-0,size=480G
scsihw: virtio-scsi-pci
smbios1: uuid=bee0785b-9f65-49bf-a6fb-08187ccb33c8
sockets: 1
vmgenid: 1c4565b5-0961-486b-adfe-b3d769206d90

root@prox01 ~ # pveversion -v
proxmox-ve: 5.4-2 (running kernel: 4.15.18-20-pve)
pve-manager: 5.4-13 (running version: 5.4-13/aee6f0ec)
pve-kernel-4.15: 5.4-8
pve-kernel-4.15.18-20-pve: 4.15.18-46
corosync: 2.4.4-pve1
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: not correctly installed
libjs-extjs: 6.0.1-2
libpve-access-control: 5.1-12
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-54
libpve-guest-common-perl: 2.0-20
libpve-http-server-perl: 2.0-14
libpve-storage-perl: 5.0-44
libqb0: 1.0.3-1~bpo9
lvm2: 2.02.168-pve6
lxc-pve: 3.1.0-6
lxcfs: 3.0.3-pve1
novnc-pve: 1.0.0-3
proxmox-widget-toolkit: 1.0-28
pve-cluster: 5.0-38
pve-container: 2.0-40
pve-docs: 5.4-2
pve-edk2-firmware: 1.20190312-1
pve-firewall: 3.0-22
pve-firmware: 2.0-7
pve-ha-manager: 2.0-9
pve-i18n: 1.1-4
pve-libspice-server1: 0.14.1-2
pve-qemu-kvm: 3.0.1-4
pve-xtermjs: 3.12.0-1
qemu-server: 5.0-54
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3


Aug 25 10:19:14 prox01 kernel: e1000e 0000:00:1f.6 eno1: Detected Hardware Unit Hang:
TDH <9e>
TDT <a9>
next_to_use <a9>
next_to_clean <9d>
buffer_info[next_to_clean]:
time_stamp <10001883b>
next_to_watch <9e>
jiffies <100018958>
next_to_watch.status <0>
MAC Status <40080083>
PHY Status <796d>
PHY 1000BASE-T Status <3800>
PHY Extended Status <3000>
PCI Status <10>
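For anyone comparing against their own box, the driver and the current offload state can be confirmed read-only before changing anything (eno1 is this host's interface name):

root@prox01 ~ # ethtool -i eno1     # should report driver: e1000e for the I219-LM
root@prox01 ~ # ethtool -k eno1 | grep -E 'tcp-segmentation-offload|generic-segmentation-offload'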

# dmidecode 3.0
Getting SMBIOS data from sysfs.
SMBIOS 2.8 present.
99 structures occupying 4763 bytes.
Table at 0x8AD2B000.

Handle 0x0000, DMI type 0, 26 bytes
BIOS Information
Vendor: American Megatrends Inc.
Version: 1.EC
Release Date: 05/21/2019 - Up to date
Address: 0xF0000
Runtime Size: 64 kB
ROM Size: 16384 kB
Characteristics:
PCI is supported
BIOS is upgradeable
BIOS shadowing is allowed
Boot from CD is supported
Selectable boot is supported
BIOS ROM is socketed
EDD is supported
5.25"/1.2 MB floppy services are supported (int 13h)
3.5"/720 kB floppy services are supported (int 13h)
3.5"/2.88 MB floppy services are supported (int 13h)
Print screen service is supported (int 5h)
8042 keyboard services are supported (int 9h)
Serial services are supported (int 14h)
Printer services are supported (int 17h)
ACPI is supported
USB legacy is supported
BIOS boot specification is supported
Targeted content distribution is supported
UEFI is supported
BIOS Revision: 5.12
 
You need to install the Proxmox ISO. IIRC Hetzner's instructions are to add the apt sources and do an in-place upgrade of their preinstalled distro to Proxmox.

What they don't tell you is that during the initial OS installation process their rescue tool modifies a lot of core files, such as blacklisting drivers, etc.

So if you're having issues with Proxmox, it's likely due to that.

Edit: Next time the host goes offline, request a KVM console to see if it's network only or the entire node going offline.
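If persistent journaling is enabled, the previous boot's kernel log after the forced power cycle will also tell you whether the box actually crashed. A rough sketch (on a stock install the journal directory may need to be created first):

mkdir -p /var/log/journal && systemctl restart systemd-journald   # enable a persistent journal
journalctl --list-boots    # after the next incident, list recorded boots
journalctl -b -1 -k -e     # kernel messages from the previous boot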
 

Hey, thanks for your reply, but I did install from the ISO via their rescue system. I believe it was the actual Proxmox release ISO and not one modified by Hetzner.
 

It's been around a year since I used Hetzner, but IIRC every included ISO in the rescue system had those modifications. I found out about it the hard way when I ran into some compatibility issues and then compared a file check of a clean OS installed at home against what they installed.

The only way to bypass it is to install manually using chroot or dd or something similar. Or you can pester their support to mount the ISO and give you a rescue KVM to install from. They really aren't happy about that, though, and insist that you use their automatic installer.

Edit: It might also be possible to bypass it by supplying a custom ISO during install, but IIRC the installer differs from what they publicly post on GitHub and may just modify those files automatically anyway.
 
Which server model and which motherboard?
Each server usually has two or more different motherboards for the same model.

If it's NIC related, it may also help to disable certain functions which are not really required; I could solve some issues that way.

Is the server just offline, or does it really crash? Just offline would indicate issues with the NIC driver.

I can confirm that a dd setup via the rescue system works fine. Did it a few months ago, with an SX62 if I remember correctly, since Debian didn't have drivers for its NICs back then.
 

root@prox01 ~ # dmidecode -t 2
# dmidecode 3.0
Getting SMBIOS data from sysfs.
SMBIOS 2.8 present.

Handle 0x0002, DMI type 2, 15 bytes
Base Board Information
Manufacturer: Micro-Star International Co., Ltd.
Product Name: Z370 GAMING PLUS (MS-7B61)
Version: 1.0
Serial Number: I516910569

Asset Tag: Default string
Features:
Board is a hosting board
Board is replaceable
Location In Chassis: Default string
Chassis Handle: 0x0003
Type: Motherboard
Contained Object Handles: 0

I just came to the realization this doesn't have ECC RAM. Oh Lord. Good thing it's just a few web server VMs with good backups.

No wonder it was so cheap, but it would still be nice to solve this. I need to stay where I am for now.

I start running MTR reports when the server drops, and it's always both the host and the VMs that are offline. I cannot access the host by web UI or SSH after the crash incidents.
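For the record, this is roughly the MTR run I use when it drops (203.0.113.10 stands in for the node's public IP here):

mtr -rw -c 100 203.0.113.10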

Thank you.
 
I had the same problem. This happened with the EX series, but when I switched to the PX series, the error went away.
 
You can easily ask for 3-hour KVM access and send them a link to an ISO to be written to a flash drive and attached. They'll happily do that every time for free; that's how I always install my Proxmox servers.
 

Thanks for the information. I wish Hetzner posted better technical specifications on what you are actually buying. I would have preferred ECC RAM, but this is what I am stuck with now. Next time I'll load test with VMs for a few days before going live. I didn't expect this, since Proxmox has been solid for me before and I assumed I was purchasing server-grade hardware. I don't think this is Proxmox's fault; I am just reporting it so perhaps they could implement some sort of patch in future versions.

Right now I am unable to migrate away from this server, so I am testing this fix, which I have also archived HERE, and will report back.

If anybody can think of any other patches I should try, please feel free to post them below.

Thank you everybody!
 

The guide you linked to isn't for Proxmox 6.0. 6.0 has a different kernel and different drivers, and IIRC the Debian base was upgraded as well.

My Supermicro 1Us use the Intel I210-LM with no issue. I don't think there is much of a difference between the two, but I could be wrong.

IIRC Hetzner offers a dedicated Intel NIC (PCI) for around $5 extra a month.
 

Thanks for the information about the dedicated Intel NIC at Hetzner; I didn't know that. I will try it if the setting change I am testing doesn't pan out. It's too early to tell for sure, but the Proxmox node now has 2+ days of uptime without crashing. With that being said, I am not holding my breath, but I do hope this fixes it: 11:23:07 up 2 days, 1:15, 1 user, load average: 0.05, 0.07, 0.08

I am currently on Proxmox version 5.4-13 and have the node fully updated. Somebody asked which Hetzner server I am on: it's the Hetzner EX52-NVME in their German datacenter.
 
Hello everybody,

I wanted to provide another update so you don't think this is an abandoned thread. In short, my Proxmox node has not crashed since implementing the fix I referenced above. With that being said, I had a shower thought: every time Proxmox crashed, I was logged into a remote desktop session on a Microsoft Windows Server 2019 VM. I haven't booted that VM since the last crash, leaving it off for testing.

My uptime now stands at: 11:04:52 up 7 days, 57 min, 1 user, load average: 0.04, 0.08, 0.09

So that's good. I am starting to think the automatically selected NIC for the Windows Server 2019 VM (the emulated Intel E1000 virtual NIC) may have been causing the crash. In a few days I'll boot that VM back up and post if the node crashes again. If it does, then there is the problem.
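If the E1000 model does turn out to be the trigger, switching the guest NIC to VirtIO should be a one-liner. A sketch, where 101 and the MAC address are placeholders for my Windows VM's actual ID and address (the VirtIO network drivers must already be installed inside the Windows guest):

qm set 101 --net0 virtio=DE:AD:BE:EF:00:01,bridge=vmbr0,firewall=1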

Thanks for everything, and I'll keep you posted.
 
