I am having an issue where a new machine with a basically fresh installation of PVE 8.4 will almost always hang within a few seconds to minutes of boot, and I am wondering if anyone has any ideas about a likely cause, or at least where to look next. This is the first time that I have dealt with Proxmox, and my Linux use is sporadic, so I apologize in advance for anything obvious that I have missed.
This occurs with no VMs or containers configured or running; just the host left to sit idle. Once it hangs, it will still display on screen any keypresses for terminals that are already open, and switching terminals (e.g., Alt-F2) works, but unless there was already a terminal session running, switching just gives a blank screen with a flashing cursor that does not respond to keyboard input (besides allowing you to switch between terminals). In no case does it process anything besides basic typing and certain control sequences. It does not respond to Ctrl-Alt-Del or to a short press of the power button. The web interface also loses connection at that point.
When I first set up the machine, I installed Proxmox with only the motherboard's Ethernet controllers, then added a NIC (Chelsio T520-CR), and then started having hangs. I removed it and they stopped, so I assumed that the card had gone bad, although it had been working in another (FreeBSD) machine without any problems. I replaced it with a spare of the same model, reinstalled PVE just in case, and everything worked fine, with the system running for a few days without issue. I then let it sit powered off for a bit while waiting on other hardware to arrive. Once it did, I moved the NIC to another PCIe slot so that a SAS HBA could sit within reach of some shorter cables. After booting up, it hung within a few minutes. I removed the HBA, moved the NIC back into its original slot, and it again hung shortly after boot. I removed the NIC, put the HBA back in, and had no problem for 20+ minutes with just the HBA installed. At this point, I just assumed that there was something that Proxmox disliked about that model of NIC. Leaving it running with just the HBA installed, it eventually hung again. I removed every removable PCIe device besides the two mirrored (ZFS) U.2 drives on which PVE is installed. Still hanging. I switched the drives to a different MCIO connector on the motherboard. Still hanging. At this point, it is always within five minutes of boot.
I have also tried/checked:
- Memory test when the machine was initially assembled.
- Disabling C states.
- Changing NUMA configuration (NPS).
- Two separate PSUs; one known good, one new.
- Temperatures are fine on everything that reports them.
The logs leading up to a hang are basically just postfix errors (expected; most egress is currently blocked) and, about 1/3 of the time, a few PCIe hot plug errors:
Code:
...
Aug 28 23:51:23 pve systemd[1]: Startup finished in 7.865s (kernel) + 4.912s (userspace) = 12.777s.
Aug 28 23:51:23 pve kernel: fbcon: Taking over console
Aug 28 23:51:23 pve kernel: Console: switching to colour frame buffer device 240x67
Aug 28 23:51:23 pve chronyd[1507]: Selected source <redacted>
Aug 28 23:51:23 pve chronyd[1507]: System clock TAI offset set to 37 seconds
Aug 28 23:51:50 pve postfix/smtp[1679]: connect to gmail-smtp-in.l.google.com[172.253.132.26]:25: Connection timed out
Aug 28 23:51:50 pve postfix/smtp[1679]: connect to alt1.gmail-smtp-in.l.google.com[2607:f8b0:400d:c0e::1a]:25: Network is unreachable
Aug 28 23:51:50 pve postfix/smtp[1677]: connect to gmail-smtp-in.l.google.com[172.253.132.26]:25: Connection timed out
Aug 28 23:51:50 pve postfix/smtp[1680]: connect to gmail-smtp-in.l.google.com[172.253.132.26]:25: Connection timed out
Aug 28 23:51:50 pve postfix/smtp[1681]: connect to gmail-smtp-in.l.google.com[172.253.132.26]:25: Connection timed out
Aug 28 23:51:50 pve postfix/smtp[1675]: connect to gmail-smtp-in.l.google.com[172.253.132.26]:25: Connection timed out
Aug 28 23:51:50 pve postfix/smtp[1675]: connect to alt1.gmail-smtp-in.l.google.com[2607:f8b0:400d:c0e::1a]:25: Network is unreachable
Aug 28 23:52:20 pve postfix/smtp[1679]: connect to alt1.gmail-smtp-in.l.google.com[209.85.144.26]:25: Connection timed out
Aug 28 23:52:20 pve postfix/smtp[1679]: connect to alt2.gmail-smtp-in.l.google.com[2607:f8b0:400c:c01::1b]:25: Network is unreachable
Aug 28 23:52:20 pve postfix/smtp[1679]: A4FED18827: to=<redacted>, relay=none, delay=1651, delays=1590/0.01/60/0, dsn=4.4.1, status=deferred (connect to alt2.gmail-smtp-in.l.google.com[2607:f8b0:400c:c01::1b]:25: Network is unreachable)
Aug 28 23:52:20 pve postfix/smtp[1677]: connect to alt1.gmail-smtp-in.l.google.com[209.85.144.26]:25: Connection timed out
Aug 28 23:52:20 pve postfix/smtp[1677]: connect to alt1.gmail-smtp-in.l.google.com[2607:f8b0:400d:c0e::1a]:25: Network is unreachable
Aug 28 23:52:20 pve postfix/smtp[1677]: connect to alt2.gmail-smtp-in.l.google.com[2607:f8b0:400c:c01::1a]:25: Network is unreachable
Aug 28 23:52:20 pve postfix/smtp[1677]: 88B09180BD: to=<redacted>, relay=none, delay=763, delays=702/0.01/60/0, dsn=4.4.1, status=deferred (connect to alt2.gmail-smtp-in.l.google.com[2607:f8b0:400c:c01::1a]:25: Network is unreachable)
Aug 28 23:52:20 pve postfix/smtp[1680]: connect to alt1.gmail-smtp-in.l.google.com[209.85.144.26]:25: Connection timed out
Aug 28 23:52:20 pve postfix/smtp[1680]: connect to alt1.gmail-smtp-in.l.google.com[2607:f8b0:400d:c0e::1a]:25: Network is unreachable
Aug 28 23:52:20 pve postfix/smtp[1680]: connect to alt2.gmail-smtp-in.l.google.com[2607:f8b0:400c:c01::1a]:25: Network is unreachable
Aug 28 23:52:20 pve postfix/smtp[1680]: A4EE118826: to=<redacted>, relay=none, delay=1973, delays=1913/0.01/60/0, dsn=4.4.1, status=deferred (connect to alt2.gmail-smtp-in.l.google.com[2607:f8b0:400c:c01::1a]:25: Network is unreachable)
Aug 28 23:52:20 pve postfix/error[2006]: A6AB818828: to=<redacted>, relay=none, delay=1650, delays=1590/60/0/0, dsn=4.4.1, status=deferred (delivery temporarily suspended: connect to alt2.gmail-smtp-in.l.google.com[2607:f8b0:400c:c01::1a]:25: Network is unreachable)
Aug 28 23:52:20 pve postfix/error[2006]: 80F01180BC: to=<redacted>, relay=none, delay=769, delays=708/60/0/0, dsn=4.4.1, status=deferred (delivery temporarily suspended: connect to alt2.gmail-smtp-in.l.google.com[2607:f8b0:400c:c01::1a]:25: Network is unreachable)
Aug 28 23:52:20 pve postfix/error[2006]: 588FF1864E: to=<redacted>, relay=none, delay=2663, delays=2602/60/0/0, dsn=4.4.1, status=deferred (delivery temporarily suspended: connect to alt2.gmail-smtp-in.l.google.com[2607:f8b0:400c:c01::1a]:25: Network is unreachable)
Aug 28 23:52:20 pve postfix/error[2006]: 2F1C31864C: to=<redacted>, relay=none, delay=2763, delays=2703/60/0/0, dsn=4.4.1, status=deferred (delivery temporarily suspended: connect to alt2.gmail-smtp-in.l.google.com[2607:f8b0:400c:c01::1a]:25: Network is unreachable)
Aug 28 23:52:20 pve postfix/error[2006]: F37C318493: to=<redacted>, relay=none, delay=1973, delays=1913/60/0/0, dsn=4.4.1, status=deferred (delivery temporarily suspended: connect to alt2.gmail-smtp-in.l.google.com[2607:f8b0:400c:c01::1a]:25: Network is unreachable)
Aug 28 23:52:20 pve postfix/error[2006]: 2D0C11864B: to=<redacted>, relay=none, delay=2763, delays=2703/60/0/0, dsn=4.4.1, status=deferred (delivery temporarily suspended: connect to alt2.gmail-smtp-in.l.google.com[2607:f8b0:400c:c01::1a]:25: Network is unreachable)
Aug 28 23:52:20 pve postfix/error[2006]: AC79318005: to=<redacted>, relay=none, delay=763, delays=703/59/0/0, dsn=4.4.1, status=deferred (delivery temporarily suspended: connect to alt2.gmail-smtp-in.l.google.com[2607:f8b0:400c:c01::1a]:25: Network is unreachable)
Aug 28 23:52:20 pve postfix/error[2006]: AC8BF18006: to=<redacted>, relay=none, delay=60, delays=0.12/59/0/0, dsn=4.4.1, status=deferred (delivery temporarily suspended: connect to alt2.gmail-smtp-in.l.google.com[2607:f8b0:400c:c01::1a]:25: Network is unreachable)
Aug 28 23:52:20 pve postfix/error[2006]: AE47418706: to=<redacted>, relay=none, delay=59, delays=0/59/0/0, dsn=4.4.1, status=deferred (delivery temporarily suspended: connect to alt2.gmail-smtp-in.l.google.com[2607:f8b0:400c:c01::1a]:25: Network is unreachable)
Aug 28 23:52:20 pve postfix/smtp[1681]: connect to alt1.gmail-smtp-in.l.google.com[209.85.144.26]:25: Connection timed out
Aug 28 23:52:20 pve postfix/smtp[1675]: connect to alt1.gmail-smtp-in.l.google.com[209.85.144.26]:25: Connection timed out
Aug 28 23:52:20 pve postfix/smtp[1681]: connect to alt1.gmail-smtp-in.l.google.com[2607:f8b0:400d:c0e::1a]:25: Network is unreachable
Aug 28 23:52:20 pve postfix/smtp[1675]: connect to alt2.gmail-smtp-in.l.google.com[2607:f8b0:400c:c01::1b]:25: Network is unreachable
Aug 28 23:52:20 pve postfix/smtp[1681]: connect to alt2.gmail-smtp-in.l.google.com[2607:f8b0:400c:c01::1b]:25: Network is unreachable
Aug 28 23:52:20 pve postfix/smtp[1675]: DDBDF180B3: to=<redacted>, relay=none, delay=2058, delays=1998/0.01/61/0, dsn=4.4.1, status=deferred (connect to alt2.gmail-smtp-in.l.google.com[2607:f8b0:400c:c01::1b]:25: Network is unreachable)
Aug 28 23:52:20 pve postfix/smtp[1681]: 2580718384: to=<redacted>, relay=none, delay=1427, delays=1367/0.01/61/0, dsn=4.4.1, status=deferred (connect to alt2.gmail-smtp-in.l.google.com[2607:f8b0:400c:c01::1b]:25: Network is unreachable)
Aug 28 23:53:20 pve pvedaemon[1730]: <root@pam> successful auth for user 'root@pam'
Aug 28 23:55:24 pve kernel: pcieport 0000:c0:01.1: pciehp: Slot(1): Button press: will power off in 5 sec
Aug 28 23:55:24 pve kernel: pcieport 0000:c0:01.2: pciehp: Slot(2): Button press: will power off in 5 sec
Aug 28 23:55:24 pve kernel: pcieport 0000:c0:01.3: pciehp: Slot(3): Button press: will power off in 5 sec
Aug 28 23:55:24 pve kernel: pcieport 0000:c0:01.4: pciehp: Slot(4): Button press: will power off in 5 sec
-- Reboot --
...
I am not sure if these point to something, or are a red herring. There is definitely no corresponding button being pressed. Usually they are all essentially the same log as this, either with or without the PCIe messages, but one log had a minor variation for those:
Code:
...
Aug 28 23:19:21 pve kernel: pcieport 0000:c0:01.1: pciehp: Slot(1): Button press: will power off in 5 sec
Aug 28 23:19:21 pve kernel: pcieport 0000:c0:01.2: pciehp: Slot(2-1): Button press: will power off in 5 sec
Aug 28 23:19:21 pve kernel: pcieport 0000:c0:01.4: pciehp: Slot(4): Button press: will power off in 5 sec
Aug 28 23:19:21 pve kernel: pcieport 0000:c0:01.3: pciehp: Slot(3): Button press: will power off in 5 sec
-- Reboot --
...
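To compare these hot-plug messages across boots without scrolling past the postfix noise, a small filter helps. The following is only a minimal sketch: the function name `group_pciehp_events` and the regular expression are my own, written against the line format shown in the two excerpts above, and are not part of any Proxmox or kernel tool.

```python
import re
from collections import defaultdict

# Written against lines of this shape (taken from the excerpts above):
#   Aug 28 23:55:24 pve kernel: pcieport 0000:c0:01.1: pciehp: Slot(1): Button press: will power off in 5 sec
PCIEHP_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+kernel:\s+"
    r"pcieport\s+(?P<port>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]):\s+"
    r"pciehp:\s+Slot\((?P<slot>[^)]+)\):\s+(?P<event>.+)$"
)


def group_pciehp_events(lines):
    """Return {timestamp: [(port, slot, event), ...]} for pciehp kernel lines.

    Lines that do not look like pciehp messages (postfix, chronyd, ...) are ignored.
    """
    groups = defaultdict(list)
    for line in lines:
        match = PCIEHP_RE.match(line.strip())
        if match:
            groups[match["stamp"]].append(
                (match["port"], match["slot"], match["event"])
            )
    return dict(groups)
```

It takes any iterable of log lines, for example the lines of a file saved from `journalctl -k -b -1` (kernel messages from the previous boot). Grouping by timestamp makes it easy to see whether all four ports always fire in the same second, as they do in both excerpts.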
The one time that I had a hang while actively watching the log, it put out those lines in the web interface, then went immediately unresponsive. The slot numbers did not change when moving the drives to another connector, but I do not know if those are supposed to show physical slot information or logical from enumerating whatever is present. I do not know if four slots is a coincidence, the four lanes of an NVMe drive, or something else. These drives had recently tested good in other machines.
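One way to see how the kernel's slot names relate to PCI addresses might be to read the slot entries that Linux exposes in sysfs. This is a minimal sketch, assuming the hot-plug driver has registered the slots under `/sys/bus/pci/slots` on this board (not verified here); the helper name `read_pci_slots` is hypothetical. As far as I understand, the `address` file gives the domain:bus:device position behind the port rather than the address of the `pcieport` device itself, so it would not be expected to match `0000:c0:01.1` literally.

```python
from pathlib import Path


def read_pci_slots(slots_dir="/sys/bus/pci/slots"):
    """Return {slot_name: address} for each slot directory under slots_dir.

    Returns an empty dict if the directory is missing; slots without a
    readable 'address' file are skipped.
    """
    base = Path(slots_dir)
    if not base.is_dir():
        return {}
    slots = {}
    for entry in sorted(base.iterdir()):
        try:
            slots[entry.name] = (entry / "address").read_text().strip()
        except OSError:
            # No address file, or not readable: leave this slot out.
            continue
    return slots


if __name__ == "__main__":
    for name, address in read_pci_slots().items():
        print(f"slot {name}: {address}")
```

Running this before and after moving the drives to another MCIO connector, and comparing against `lspci -tv`, should show whether the slot names follow the physical connector or the enumeration order.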
I am a bit lost on where to look next. It seems like a hardware problem, but the system also stays up enough to handle at least very basic terminal functionality. Is it possible for postfix to mess things up that badly, given that its errors are the last entries in the log about 2/3 of the time? I am going to run a memory test again, although it is 20+ hours per pass. If that comes up clean, I guess I will also try to dig up some other drives and reinstall on those, but both mirrored drives going bad at the same time without producing any log messages (while still being good enough to fully boot the system) seems unlikely, right? Is there a simple way to have more resilient logging if that is somehow the case?
If this is 100% an obvious hardware issue, I apologize for taking your time.
Base Hardware:
- Supermicro H13SSL-NT
- AMD Epyc 9115
- Kingston KSM56R46BD4PMI-64HAI
- Intel SSDPE21D015TA