Proxmox is hanging after running fine for more than a year

johnromero

Nov 12, 2023
I haven't really changed the configuration of my Proxmox host in the last year, but in the last month I've had to physically reset the box a couple of times. It seems to hang, but I don't see a message anywhere indicating what is wrong. The log constantly shows "renamed from eth0" messages, but I can't correlate them with the times it goes down because they appear so often. I don't have experience troubleshooting this kind of problem. Can anyone tell me where I should start?
 
What was on the monitor when it "hanged"? What logfiles did you inspect?
I didn't have a monitor plugged into the system, and when I did plug one in, it didn't show an image until I rebooted. I don't have a spare monitor, but I will try to borrow one for next time. I am also running an hourly health check to get a better idea of when it hangs, or whether it is only the network that goes down.
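A health check like the one mentioned above could be a small script, run from another machine on the LAN, that tries to open a TCP connection to the Proxmox web UI and prints a timestamped UP/DOWN line. This is a minimal sketch, assuming the host sits at a hypothetical address 192.168.1.10 and the web UI listens on its default port 8006; polling every minute instead of every hour narrows down when the hang actually starts.

```python
import socket
import time
from datetime import datetime

# Hypothetical address of the Proxmox host; replace with your own.
HOST = "192.168.1.10"
# Default port of the Proxmox web UI.
PORT = 8006


def is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution errors alike.
        return False


def status_line(host: str, port: int) -> str:
    """Build one timestamped UP/DOWN line for the log."""
    state = "UP" if is_reachable(host, port) else "DOWN"
    return f"{datetime.now().isoformat(timespec='seconds')} {host}:{port} {state}"


def monitor(host: str, port: int, interval_s: float = 60.0) -> None:
    """Print a status line every `interval_s` seconds until interrupted."""
    while True:
        print(status_line(host, port), flush=True)
        time.sleep(interval_s)
```

Calling monitor(HOST, PORT) and redirecting its output to a file gives a minute-by-minute record; the first DOWN line marks roughly when the box stopped responding, which can then be lined up against the gap in the host's own logs.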

The log files I inspected were:
/var/log/pveam.log
/var/log/kern.log
/var/log/auth.log
/var/log/daemon.log
/var/log/syslog

Since the network interface is being renamed every minute, I'm trying to find out what is happening; at the very least I'll end up with cleaner logs. I also wondered whether it is only the network that "hangs" because of those messages, but I'm not sure why they happen or whether the network is the only thing that stops working.

Code:
2023-11-12 07:06:23.013  Nov 12 07:06:22 pve kernel: [79248.542579] device veth15ff3e5 left promiscuous mode
2023-11-12 07:06:23.013  Nov 12 07:06:22 pve kernel: [79248.478891] veth0f17c7b: renamed from eth0
2023-11-12 07:06:23.013  Nov 12 07:06:22 pve kernel: [79248.478792] br-f367acecaec0: port 76(veth15ff3e5) entered disabled state
2023-11-12 07:06:22.763  Nov 12 07:06:22 pve kernel: [79248.333449] br-f367acecaec0: port 76(veth15ff3e5) entered forwarding state
2023-11-12 07:06:22.763  Nov 12 07:06:22 pve kernel: [79248.333446] br-f367acecaec0: port 76(veth15ff3e5) entered blocking state
2023-11-12 07:06:22.763  Nov 12 07:06:22 pve kernel: [79248.333366] IPv6: ADDRCONF(NETDEV_CHANGE): veth15ff3e5: link becomes ready
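To get the cleaner logs mentioned above, one option is to filter the bridge/veth churn out before reading the rest. This is a minimal sketch; the regular expressions are assumptions modelled only on the message shapes in the excerpt above (port state changes, promiscuous mode, renames and IPv6 link-ready lines for veth devices), so every other line, including any unusual kernel message, is kept.

```python
import re
from typing import Iterable, List

# Patterns modelled on the veth/bridge lines in the excerpt above
# (assumption: every noisy line mentions a veth device in one of these forms).
NOISE_PATTERNS = [
    re.compile(r"port \d+\(veth\w+\) entered (blocking|disabled|forwarding) state"),
    re.compile(r"device veth\w+ (entered|left) promiscuous mode"),
    re.compile(r"veth\w+: renamed from eth\d+"),
    re.compile(r"eth\d+: renamed from veth\w+"),
    re.compile(r"ADDRCONF\(NETDEV_CHANGE\): veth\w+: link becomes ready"),
]


def is_noise(line: str) -> bool:
    """Return True if the line matches one of the veth/bridge churn patterns."""
    return any(p.search(line) for p in NOISE_PATTERNS)


def filter_noise(lines: Iterable[str]) -> List[str]:
    """Keep only the lines that are not veth/bridge churn."""
    return [line for line in lines if not is_noise(line)]
```

For example, opening /var/log/syslog and printing "".join(filter_noise(f)) leaves only the messages that are not part of the per-minute veth cycle.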
 
It happened again today. There is just a gap in the logs when it happens; it seems the system simply stops and only comes back after a reboot. Any advice on how to pinpoint what could be wrong?
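One way to make such a gap visible is to scan the syslog timestamps and report every stretch between consecutive entries that is longer than some threshold. This is a minimal sketch, assuming the classic syslog timestamp format seen in the excerpts (e.g. "Nov 13 03:50:29"); because that format omits the year, the caller supplies it, and year rollovers or timezone changes are not handled.

```python
from datetime import datetime, timedelta
from typing import Iterable, List, Optional, Tuple


def parse_ts(line: str, year: int) -> Optional[datetime]:
    """Parse the leading 'Mon DD HH:MM:SS' syslog timestamp of a line.

    Classic syslog lines carry no year, so the caller supplies it
    (assumption: the log does not span a year boundary).
    """
    try:
        return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
    except ValueError:
        return None  # not a syslog-style line


def find_gaps(lines: Iterable[str], year: int,
              threshold: timedelta = timedelta(minutes=5)) -> List[Tuple[datetime, datetime]]:
    """Return (last_seen, next_seen) pairs where consecutive entries are further apart than threshold."""
    gaps = []
    prev = None
    for line in lines:
        ts = parse_ts(line, year)
        if ts is None:
            continue  # skip lines without a parsable timestamp
        if prev is not None and ts - prev > threshold:
            gaps.append((prev, ts))
        prev = ts
    return gaps
```

Since the veth messages in this log arrive roughly once a minute, a threshold of a few minutes should already separate a hang from an ordinary quiet period; the five-minute default is a guess to tune, not a rule. The last timestamp before each reported gap is the best estimate of when the host stopped.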
 
What hardware is it? Do you run a cluster? Anything changed in the networking?
 
It runs on a MINISFORUM Elitemini HX90 mini PC (AMD Ryzen 9 5900HX) with 64 GB of RAM.
Not a cluster, standalone.
No changes in networking at all.
 
Alright, if it wasn't happening before, could it have started after a kernel update?

Anything interesting in journalctl -k, or rather journalctl -k -b -1 (kernel messages from the previous boot)? All sorts of issues are possible: C-states, TPM enabled in the firmware, or even not having a monitor plugged in (which can trigger a power-saving feature) ... and of course a hardware fault, overheating, etc.
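When reading the output of journalctl -k -b -1, it can help to search for the kinds of messages the failures listed above tend to leave behind. This is a rough sketch that works on the saved output (for example journalctl -k -b -1 > prev-boot.txt); the keyword list is an assumption based on common kernel wording for machine-check errors, lockups, RCU stalls, hung tasks, thermal events and oopses, not an exhaustive catalogue, and a hang that stops logging outright may leave none of them behind.

```python
import re
from typing import Iterable, List, Tuple

# Heuristic patterns (assumption): typical wording of kernel messages for
# these failure classes. Order matters: the first matching category wins.
SUSPICIOUS = {
    "hardware error": re.compile(r"mce: |Hardware Error"),
    "lockup": re.compile(r"soft lockup|hard lockup", re.IGNORECASE),
    "rcu stall": re.compile(r"detected stalls? on CPU"),
    "hung task": re.compile(r"blocked for more than \d+ seconds"),
    "thermal": re.compile(r"temperature above threshold|critical temperature", re.IGNORECASE),
    "panic/oops": re.compile(r"Kernel panic|\bOops\b|general protection fault|\bBUG:"),
}


def scan(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (category, line) for each line matching a suspicious pattern."""
    hits = []
    for line in lines:
        for category, pattern in SUSPICIOUS.items():
            if pattern.search(line):
                hits.append((category, line.rstrip("\n")))
                break  # report each line once, under its first matching category
    return hits
```

An empty result does not clear the hardware: as noted in the thread, a bad RAM module or a drive that stops accepting writes may leave nothing in the log at all.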
 
I don't see anything particularly interesting; the end of the output is just a bunch of network messages. Would it help if I attached the full journal output? Any idea how to check for a hardware fault? I was expecting some message in the logs. And yes, it could potentially be any update.

Code:
Nov 13 03:50:29 pve kernel: br-f367acecaec0: port 51(vethd1d37fb) entered blocking state
Nov 13 03:50:29 pve kernel: br-f367acecaec0: port 51(vethd1d37fb) entered disabled state
Nov 13 03:50:29 pve kernel: device vethd1d37fb entered promiscuous mode
Nov 13 03:50:29 pve kernel: eth0: renamed from veth8f5734c
Nov 13 03:50:29 pve kernel: IPv6: ADDRCONF(NETDEV_CHANGE): vethd1d37fb: link becomes ready
Nov 13 03:50:29 pve kernel: br-f367acecaec0: port 51(vethd1d37fb) entered blocking state
Nov 13 03:50:29 pve kernel: br-f367acecaec0: port 51(vethd1d37fb) entered forwarding state
Nov 13 03:50:29 pve kernel: br-f367acecaec0: port 51(vethd1d37fb) entered disabled state
Nov 13 03:50:29 pve kernel: veth8f5734c: renamed from eth0
Nov 13 03:50:29 pve kernel: br-f367acecaec0: port 51(vethd1d37fb) entered disabled state
Nov 13 03:50:29 pve kernel: device vethd1d37fb left promiscuous mode
Nov 13 03:50:29 pve kernel: br-f367acecaec0: port 51(vethd1d37fb) entered disabled state
Nov 13 03:51:29 pve kernel: br-f367acecaec0: port 51(vethccdc104) entered blocking state
Nov 13 03:51:29 pve kernel: br-f367acecaec0: port 51(vethccdc104) entered disabled state
Nov 13 03:51:29 pve kernel: device vethccdc104 entered promiscuous mode
Nov 13 03:51:29 pve kernel: eth0: renamed from veth41ed0d4
Nov 13 03:51:29 pve kernel: IPv6: ADDRCONF(NETDEV_CHANGE): vethccdc104: link becomes ready
Nov 13 03:51:29 pve kernel: br-f367acecaec0: port 51(vethccdc104) entered blocking state
Nov 13 03:51:29 pve kernel: br-f367acecaec0: port 51(vethccdc104) entered forwarding state
Nov 13 03:51:29 pve kernel: br-f367acecaec0: port 51(vethccdc104) entered disabled state
Nov 13 03:51:29 pve kernel: veth41ed0d4: renamed from eth0
Nov 13 03:51:29 pve kernel: br-f367acecaec0: port 51(vethccdc104) entered disabled state
Nov 13 03:51:29 pve kernel: device vethccdc104 left promiscuous mode
Nov 13 03:51:29 pve kernel: br-f367acecaec0: port 51(vethccdc104) entered disabled state
 
I would check whether the kernel was updated lately. Either way, the full log of the last boot (-b -1) would help. It may be anything, even something from bootup that was then waiting for its moment to wreak havoc.

You won't see e.g. a bad RAM module in the logs as a "bad RAM error" :D People have to run memtest86, and even then sometimes only experimenting helps. Things like C-state problems would be visible in the log. Something like a bad drive might not be, because, well, the log won't even get flushed to the drive once the crash has happened. That's why the next thing people normally do is look at the screen. If the screen is blank, that's harder ... one keeps guessing.

The bridge/veth messages you posted are perfectly normal networking going on with VMs.
 
It happened again today. It looks like Proxmox keeps running but the network goes down. Any idea how to troubleshoot the Ethernet adapter? The monitor doesn't show any useful information.
 

Attachments

  • 20231127_175737.jpg