Proxmox host unreachable after a few minutes

christianhau · Oct 18, 2023

Hi!

I have a cluster of two nodes running proxmox 7.4-17 and have had it running for a few years. Node 2 suddenly dropped out of the network and kept on becoming unreachable after a few minutes. I thought this was due to a lack of storage space as I had used a thin volume that I had increased bit too much and thought that might provoke it. I have installed a new hard drive with sufficient space and reinstalled the node, added it to the cluster again and still I get the exact same problem with the node becoming unreachable both through GUI and SSH. Any tips on the next steps of troubleshooting this?

The syslog does not show any error messages that I can see, and simply stops outputting when the node becomes unreachable:

Code:

Oct 18 13:19:33 pve kernel: fwbr104i0: port 1(fwln104i0) entered blocking state
Oct 18 13:19:33 pve kernel: fwbr104i0: port 1(fwln104i0) entered forwarding state
Oct 18 13:19:33 pve kernel: fwbr104i0: port 2(tap104i0) entered blocking state
Oct 18 13:19:33 pve kernel: fwbr104i0: port 2(tap104i0) entered disabled state
Oct 18 13:19:33 pve kernel: fwbr104i0: port 2(tap104i0) entered blocking state
Oct 18 13:19:33 pve kernel: fwbr104i0: port 2(tap104i0) entered forwarding state
Oct 18 13:19:34 pve chronyd[663]: Selected source 185.35.202.197 (2.debian.pool.ntp.org)
Oct 18 13:19:34 pve chronyd[663]: System clock TAI offset set to 37 seconds
Oct 18 13:19:34 pve kernel: FS-Cache: Loaded
Oct 18 13:19:34 pve kernel: FS-Cache: Netfs 'cifs' registered for caching
Oct 18 13:19:34 pve kernel: Key type cifs.spnego registered
Oct 18 13:19:34 pve kernel: Key type cifs.idmap registered
Oct 18 13:19:34 pve kernel: CIFS: Attempting to mount \\192.168.1.100\Backup
Oct 18 13:19:36 pve pve-guests[919]: <root@pam> end task UPID:pve:00000398:00000653:652FBF43:startall::root@pam: OK
Oct 18 13:19:36 pve systemd[1]: Finished PVE guests.
Oct 18 13:19:36 pve systemd[1]: Starting Proxmox VE scheduler...
Oct 18 13:19:37 pve pvescheduler[1028]: starting server
Oct 18 13:19:37 pve systemd[1]: Started Proxmox VE scheduler.
Oct 18 13:19:37 pve systemd[1]: Reached target Multi-User System.
Oct 18 13:19:37 pve systemd[1]: Reached target Graphical Interface.
Oct 18 13:19:37 pve systemd[1]: Starting Update UTMP about System Runlevel Changes...
Oct 18 13:19:37 pve systemd[1]: systemd-update-utmp-runlevel.service: Succeeded.
Oct 18 13:19:37 pve systemd[1]: Finished Update UTMP about System Runlevel Changes.
Oct 18 13:19:37 pve systemd[1]: Startup finished in 13.572s (firmware) + 5.597s (loader) + 8.737s (kernel) + 13.279s (userspace) = 41.187s.
Oct 18 13:20:40 pve chronyd[663]: Selected source 62.101.228.30 (2.debian.pool.ntp.org)
Oct 18 13:21:17 pve pmxcfs[741]: [status] notice: received log
Oct 18 13:21:17 pve sshd[1629]: Accepted publickey for root from 192.168.1.250 port 58978 ssh2: RSA SHA256:RbAcAWP3yChAIw13FMCz5SQo0qp8eCcTZedb5kuTxP8
Oct 18 13:21:17 pve sshd[1629]: pam_unix(sshd:session): session opened for user root(uid=0) by (uid=0)
Oct 18 13:21:17 pve systemd[1]: Created slice User Slice of UID 0.
Oct 18 13:21:17 pve systemd[1]: Starting User Runtime Directory /run/user/0...
Oct 18 13:21:17 pve systemd-logind[561]: New session 1 of user root.
Oct 18 13:21:17 pve systemd[1]: Finished User Runtime Directory /run/user/0.
Oct 18 13:21:17 pve systemd[1]: Starting User Manager for UID 0...
Oct 18 13:21:17 pve systemd[1632]: pam_unix(systemd-user:session): session opened for user root(uid=0) by (uid=0)
Oct 18 13:21:17 pve systemd[1632]: Queued start job for default target Main User Target.
Oct 18 13:21:17 pve systemd[1632]: Created slice User Application Slice.
Oct 18 13:21:17 pve systemd[1632]: Reached target Paths.
Oct 18 13:21:17 pve systemd[1632]: Reached target Timers.
Oct 18 13:21:17 pve systemd[1632]: Listening on GnuPG network certificate management daemon.
Oct 18 13:21:17 pve systemd[1632]: Listening on GnuPG cryptographic agent and passphrase cache (access for web browsers).
Oct 18 13:21:17 pve systemd[1632]: Listening on GnuPG cryptographic agent and passphrase cache (restricted).
Oct 18 13:21:17 pve systemd[1632]: Listening on GnuPG cryptographic agent (ssh-agent emulation).
Oct 18 13:21:17 pve systemd[1632]: Listening on GnuPG cryptographic agent and passphrase cache.
Oct 18 13:21:17 pve systemd[1632]: Reached target Sockets.
Oct 18 13:21:17 pve systemd[1632]: Reached target Basic System.
Oct 18 13:21:17 pve systemd[1632]: Reached target Main User Target.
Oct 18 13:21:17 pve systemd[1632]: Startup finished in 184ms.
Oct 18 13:21:17 pve systemd[1]: Started User Manager for UID 0.
Oct 18 13:21:17 pve systemd[1]: Started Session 1 of user root.
-- Reboot --
Oct 18 13:28:42 pve kernel: Linux version 5.15.102-1-pve (build@proxmox) (gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2) #1 SMP PVE 5.15.102-1 (2023-03-14T13:48Z) ()
Oct 18 13:28:42 pve kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-5.15.102-1-pve root=/dev/mapper/pve-root ro quiet
Oct 18 13:28:42 pve kernel: KERNEL supported cpus:
Oct 18 13:28:42 pve kernel:   Intel GenuineIntel
Oct 18 13:28:42 pve kernel:   AMD AuthenticAMD
Oct 18 13:28:42 pve kernel:   Hygon HygonGenuine
Oct 18 13:28:42 pve kernel:   Centaur CentaurHauls
Oct 18 13:28:42 pve kernel:   zhaoxin   Shanghai 
Oct 18 13:28:42 pve kernel: x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
Oct 18 13:28:42 pve kernel: x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
Oct 18 13:28:42 pve kernel: x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
Oct 18 13:28:42 pve kernel: x86/fpu: Supporting XSAVE feature 0x008: 'MPX bounds registers'
Oct 18 13:28:42 pve kernel: x86/fpu: Supporting XSAVE feature 0x010: 'MPX CSR'
Oct 18 13:28:42 pve kernel: x86/fpu: xstate_offset[2]:  576, xstate_sizes[2]:  256
Oct 18 13:28:42 pve kernel: x86/fpu: xstate_offset[3]:  832, xstate_sizes[3]:   64
Oct 18 13:28:42 pve kernel: x86/fpu: xstate_offset[4]:  896, xstate_sizes[4]:   64
Oct 18 13:28:42 pve kernel: x86/fpu: Enabled xstate features 0x1f, context size is 960 bytes, using 'compacted' format.
Oct 18 13:28:42 pve kernel: signal: max sigframe size: 2032
Oct 18 13:28:42 pve kernel: BIOS-provided physical RAM map:
Oct 18 13:28:42 pve kernel: BIOS-e820: [mem 0x0000000000000000-0x0000000000057fff] usable
Oct 18 13:28:42 pve kernel: BIOS-e820: [mem 0x0000000000058000-0x0000000000058fff] reserved
Oct 18 13:28:42 pve kernel: BIOS-e820: [mem 0x0000000000059000-0x000000000009efff] usable
Oct 18 13:28:42 pve kernel: BIOS-e820: [mem 0x000000000009f000-0x00000000000fffff] reserved
Oct 18 13:28:42 pve kernel: BIOS-e820: [mem 0x0000000000100000-0x000000003fffffff] usable
Oct 18 13:28:42 pve kernel: BIOS-e820: [mem 0x0000000040000000-0x00000000403fffff] reserved
Oct 18 13:28:42 pve kernel: BIOS-e820: [mem 0x0000000040400000-0x000000006e287fff] usable
Oct 18 13:28:42 pve kernel: BIOS-e820: [mem 0x000000006e288000-0x000000006e288fff] ACPI NVS
Oct 18 13:28:42 pve kernel: BIOS-e820: [mem 0x000000006e289000-0x000000006e289fff] reserved
Oct 18 13:28:42 pve kernel: BIOS-e820: [mem 0x000000006e28a000-0x0000000079da9fff] usable
Oct 18 13:28:42 pve kernel: BIOS-e820: [mem 0x0000000079daa000-0x000000007a23efff] reserved
Oct 18 13:28:42 pve kernel: BIOS-e820: [mem 0x000000007a23f000-0x000000007a284fff] ACPI data
Oct 18 13:28:42 pve kernel: BIOS-e820: [mem 0x000000007a285000-0x000000007aa5cfff] ACPI NVS
Oct 18 13:28:42 pve kernel: BIOS-e820: [mem 0x000000007aa5d000-0x000000007af4dfff] reserved
Oct 18 13:28:42 pve kernel: BIOS-e820: [mem 0x000000007af4e000-0x000000007affdfff] type 20
Oct 18 13:28:42 pve kernel: BIOS-e820: [mem 0x000000007affe000-0x000000007affefff] usable
Oct 18 13:28:42 pve kernel: BIOS-e820: [mem 0x000000007afff000-0x000000007fffffff] reserved
Oct 18 13:28:42 pve kernel: BIOS-e820: [mem 0x00000000e0000000-0x00000000efffffff] reserved
Oct 18 13:28:42 pve kernel: BIOS-e820: [mem 0x00000000fe000000-0x00000000fe010fff] reserved
Oct 18 13:28:42 pve kernel: BIOS-e820: [mem 0x00000000fec00000-0x00000000fec00fff] reserved
Oct 18 13:28:42 pve kernel: BIOS-e820: [mem 0x00000000fed00000-0x00000000fed00fff] reserved
Oct 18 13:28:42 pve kernel: BIOS-e820: [mem 0x00000000fee00000-0x00000000fee00fff] reserved
Oct 18 13:28:42 pve kernel: BIOS-e820: [mem 0x00000000ff000000-0x00000000ffffffff] reserved
Oct 18 13:28:42 pve kernel: BIOS-e820: [mem 0x0000000100000000-0x000000087effffff] usable
Oct 18 13:28:42 pve kernel: NX (Execute Disable) protection: active
Oct 18 13:28:42 pve kernel: efi: EFI v2.70 by American Megatrends
Oct 18 13:28:42 pve kernel: efi: ACPI 2.0=0x7a24d000 ACPI=0x7a24d000 SMBIOS=0x7ae08000 SMBIOS 3.0=0x7ae07000 MEMATTR=0x7834f018 ESRT=0x7ae04418
Oct 18 13:28:42 pve kernel: secureboot: Secure boot disabled
Oct 18 13:28:42 pve kernel: SMBIOS 3.1.1 present.
Oct 18 13:28:42 pve kernel: DMI:  /NUC7i5BNB, BIOS BNKBL357.86A.0083.2020.0714.1344 07/14/2020
Oct 18 13:28:42 pve kernel: tsc: Detected 2200.000 MHz processor
Oct 18 13:28:42 pve kernel: tsc: Detected 2199.996 MHz TSC

Max Carrara · Oct 18, 2023

Can you physically access the node? As in, via screen and keyboard? If the node hangs there as well, you might have a hardware problem.

Regarding troubleshooting: Is there anything suspicious in journalctl?

If there really are no other messages, you could try to capture the kernel's messages like so:

Bash:

dmesg --follow &> "dmesg-$(date --iso-8601=minutes).log" &

This way every message gets redirected to a log file, which has the current timestamp in its name, in the background. This will not persist between reboots, though!

If you cannot find anything at all, you might have a hardware problem. You might want to run something like memtest86+.

christianhau · Oct 18, 2023

Thanks for the help! I have physical access to the node and it does freeze as well. Have run metest86+ and it passed. The only thing I did after reinstalling the node was to import the backup of the VM I had on the old host, it is a 1,7TB VM. Now when I deleted the VM then it does not shut down or freeze. So is it the VM that is causing this?

leesteken · Oct 19, 2023

christianhau said:
Now when I deleted the VM then it does not shut down or freeze. So is it the VM that is causing this?

Please share your hardware specs and the VM configuration (qm config VM_NR).If you give 100% of the resources to one VM, it can cause trouble, as can passthrough.

christianhau · Oct 20, 2023

Thanks! Attempted a few things:
1. Made the VM the single VM on the host. Restricted the cpu on the node to only use 3 out of 4 cores. The proxmox host still froze.
2. Thought it might be overheating, so put the machine itself outside and started it up. The proxmox host still froze.

Here is the output from the qm config:
boot: order=ide2;scsi0;net0
cores: 3
ide2: none,media=cdrom
memory: 25000
meta: creation-qemu=7.2.0,ctime=1681511628
name: nethermind
net0: virtio=EE:C4:9E:85:48:9B,bridge=vmbr0,firewall=1
numa: 0
onboot: 1
ostype: l26
scsi0: local-lvm:vm-104-disk-0,iothread=1,size=1700G
scsihw: virtio-scsi-single
smbios1: uuid=48b03530-3ab7-4f8c-a0c1-59ac45bba3a9
sockets: 1
vmgenid: 9247ce56-8459-4fe2-ab60-efe0d8e0c11a

The harddrive is completely new, and the ram checks out. But something is wrong...

leesteken · Oct 20, 2023

I can't explain the freeze, sorry. I do notice that your CPU has only 2 cores (hyper-threads are about 10-30% of a core), so I would not give more than 2 cores (or more than 50% of memory) to the VM and I would advise to use containers instead of VMs. What's the point of virtualization when you give the single VM 80% of everything? Since Proxmox is not designed for this use case: maybe just run your "Linux VM" on the real hardware without Proxmox?

christianhau · Oct 20, 2023

No worries. I have several hosts in a cluster and the ability to do snapshots and move VMs between machines in case of hardware failure and also ease of backup has been the reason why I have used Proxmox. Have worked flawlessly for the last two years.

leesteken · Oct 20, 2023

christianhau said:
No worries. I have several hosts in a cluster and the ability to do snapshots and move VMs between machines in case of hardware failure and also ease of backup has been the reason why I have used Proxmox. Have worked flawlessly for the last two years.

I also do like those features. Since it has worked fine before: what changed recently? Did you try booting with an earlier kernel version?
Are you sure it's not a generic two node cluster problem where any hiccup brings both nodes down? Maybe a network issue (which is known to ruin two node clusters)?

christianhau · Oct 20, 2023

The other node I have has been working fine throughout. Switched out network cable as well as it could be a network issue. And I have not tried with earlier kernels no, that might have changed. Other than updates to linux on both proxmox nodes and VMs I cant see what has changed. Any point in trying an update to 8.0?

leesteken · Oct 20, 2023

christianhau said:
Other than updates to linux on both proxmox nodes and VMs I cant see what has changed.

Could be one of those updates but there is no clue in the logs, so it's probably a hardware issue (but could be a driver issues because of the updates).

christianhau said:
Any point in trying an update to 8.0?

You'll have to eventually and it'll come with a new kernel (which you can install also on 7.4). More changes might make it harder to debug, but it might also fix the (unknown) problem. Since there is nothing to go on (except testing/replacing hardware piece by piece) and you'll update eventually, maybe just give it a go. Be sure to follow the instructions.

christianhau · Oct 20, 2023

leesteken said:
Could be one of those updates but there is no clue in the logs, so it's probably a hardware issue (but could be a driver issues because of the updates).

You'll have to eventually and it'll come with a new kernel (which you can install also on 7.4). More changes might make it harder to debug, but it might also fix the (unknown) problem. Since there is nothing to go on (except testing/replacing hardware piece by piece) and you'll update eventually, maybe just give it a go. Be sure to follow the instructions.

Thanks, will give that a shot!

christianhau · Oct 21, 2023

After updating its been running along nicely for 12 hours so far

Max Carrara · Oct 24, 2023

christianhau said:
After updating its been running along nicely for 12 hours so far

Fingers crossed it keeps working!

Search

Search

Proxmox host unreachable after a few minutes

christianhau

New Member

Max Carrara

Active Member

christianhau

New Member

leesteken

Distinguished Member

christianhau

New Member

leesteken

Distinguished Member

christianhau

New Member

leesteken

Distinguished Member

christianhau

New Member

leesteken

Distinguished Member

christianhau

New Member

christianhau

New Member

Max Carrara

Active Member