Proxmox crash

sylsyl

New Member
Jun 14, 2024
I have a couple of PCs running Proxmox - but I am far from a Linux guru.

I recently installed Immich in an Ubuntu VM. It is under heavy load importing a few hundred thousand photos, and I have seen my Proxmox machine “crash”.

By crash, I mean it becomes unresponsive across all VMs and the PVE web interface. It doesn’t respond to ping.

I would have blamed it on overheating, but the PC has Intel vPro (a poor man’s IPMI), so I can control it remotely. Using vPro, I can access the CLI and log in. This is the surprising bit! I can run commands, but pings from PVE don’t reach anything. After rebooting, netstat shows that CPU usage dropped to near zero when it “crashed”, but it was evidently still collecting some data.

Rebooting the machine from the CLI makes things work again, usually for a few hours until it happens again.

How can I find out what the problem is?
 
dmesg contains lots of copies of:

Code:
[14781.638259] e1000e 0000:00:1f.6 eno1: Detected Hardware Unit Hang:
                 TDH                  <fd>
                 TDT                  <24>
                 next_to_use          <24>
                 next_to_clean        <fc>
               buffer_info[next_to_clean]:
                 time_stamp           <100bf0538>
                 next_to_watch        <fd>
                 jiffies              <100dcf940>
                 next_to_watch.status <0>
               MAC Status             <80083>
               PHY Status             <796d>
               PHY 1000BASE-T Status  <3800>
               PHY Extended Status    <3000>
               PCI Status             <10>
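
The dump above carries two jiffies counters that hint at how long the transmit queue has been stuck: time_stamp (when the oldest pending buffer was queued) and jiffies (when the message was printed). A minimal sketch of the arithmetic follows. The function names are made up for illustration, and the tick rate of 250 per second is an assumption; the real value should be in the running kernel's config, e.g. grep 'CONFIG_HZ=' "/boot/config-$(uname -r)".

```shell
# Jiffies elapsed between the queued buffer's time_stamp ($1) and the
# jiffies value at the time of the dump ($2). Both are the hex values
# from the dump, without the angle brackets.
stalled_jiffies() {
    echo $(( 0x$2 - 0x$1 ))
}

# Same interval in whole seconds. $3 is the kernel tick rate (HZ);
# the default of 250 is an assumption, not something the log states.
stalled_seconds() {
    hz=${3:-250}
    echo $(( $(stalled_jiffies "$1" "$2") / $hz ))
}

stalled_jiffies 100bf0538 100dcf940       # prints 1963016
stalled_seconds 100bf0538 100dcf940 250   # prints 7852
```

If the tick rate really is 250, the oldest packet had been waiting a little over two hours when this line was logged at uptime 14781 s, which would put the start of the hang at roughly 6900 s after boot. With a tick rate of 1000 the same numbers give about 33 minutes instead.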

journalctl -xe contains lots of similar stuff, e.g.:

Code:
May 13 13:06:04 pve2 pvestatd[1073]: storage 'isos' is not online
May 13 13:06:04 pve2 corosync-qdevice[1062]: Can't connect to qnetd host. (-5986): Network address not available (in use?)
May 13 13:06:04 pve2 apcupsd[845]: Communications with UPS lost.
May 13 13:06:05 pve2 kernel: e1000e 0000:00:1f.6 eno1: Detected Hardware Unit Hang:
                               TDH                  <fd>
                               TDT                  <24>
                               next_to_use          <24>
                               next_to_clean        <fc>
                             buffer_info[next_to_clean]:
                               time_stamp           <100bf0538>
                               next_to_watch        <fd>
                               jiffies              <100dcb300>
                               next_to_watch.status <0>
                             MAC Status             <80083>
                             PHY Status             <796d>
                             PHY 1000BASE-T Status  <3800>
                             PHY Extended Status    <3000>
                             PCI Status             <10>
May 13 13:06:07 pve2 pvestatd[1073]: pbs: error fetching datastores - 500 Can't connect to 192.168.61.2:8007 (No route to host)
May 13 13:06:07 pve2 pvestatd[1073]: status update time (9.519 seconds)

The output doesn't go back as far as the start of the event.
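
Since journalctl -xe only shows the tail of the log, one way to find where the hangs begin is to filter the kernel messages of a whole boot. A rough sketch, assuming the journal is persistent so that earlier boots are still available; hang_summary is a made-up helper name, not an existing tool.

```shell
# Read kernel log lines on stdin and report how many
# "Detected Hardware Unit Hang" messages there are, plus the first
# and last matching lines (which carry the timestamps).
hang_summary() {
    awk '/Detected Hardware Unit Hang/ {
             n++
             if (n == 1) first = $0
             last = $0
         }
         END {
             printf "hangs: %d\n", n
             if (n) {
                 print "first: " first
                 print "last:  " last
             }
         }'
}

# Typical use, previous boot (the one that hung):
#   journalctl -k -b -1 --no-pager | hang_summary
```

journalctl --list-boots shows which boots the journal still holds; -b -1 selects the previous one and -k restricts the output to kernel messages. Whatever was logged just before the "first:" line is the most likely place to look for a trigger.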

I take it this means the network has given up the ghost for some reason? The machine is a Dell Optiplex 7070 with a built-in Ethernet port.
 
Thanks for the pointer. I'll try the fix(es) and see if things improve. I'll add a post to the thread to say yet another poor schmuck got pwned.