Fresh installation of VE 8 and seeing console and SSH freezes

rmontgomery1

New Member
Feb 15, 2023
Not sure if this is the right place to put it, but if not, we can move it.
Been running VE 8 for a little bit, and all was well. No recent changes to the network or to PVE. Suddenly, NFS stopped working (again, no changes were made), and I was seeing console and SSH sessions freeze up. I decided to wipe the drives and do a fresh installation of VE 8 (as the previous instance was a rather ugly upgrade from 7), but now I'm seeing the same thing with console sessions and SSH sessions to the node.
As I'm fairly new to Proxmox, I'm not 100% sure what all to check. I don't lose connection to the GUI, but I get ~60 seconds with an active console or SSH session before it just stops. I can restart the sessions, but again, I only get about 60 seconds. Reboot the server - same result. This is now a fresh installation and I'm still seeing it. For the record, I'm running a Dell R430 with 48GB of memory and 3 x 3TB HDDs in the bays. I had zero issues on VE 7, so I'm starting to wonder if this is some weirdness with 8 and/or compatibility with my hardware. I've also checked memory utilization, CPU utilization, and IO. There is only one VM on this thing currently, and it's off, so that ain't it.
I'm also still just confused as to why NFS suddenly stopped working.

Any help is greatly appreciated! Let me know what's needed to help with my issues, and hoping it might help someone else, too!
 
I would recommend monitoring the physical console, if necessary with "journalctl -f" in one of the terminals (Alt-F1/F2/F3).
Sounds like you have a network issue, with some symptoms pointing to a potential duplicate-IP conflict.
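If it helps, a couple of concrete invocations (the unit names are assumptions based on the standard Debian/PVE services; adjust as needed):

journalctl -f                                    # follow everything, live, on the physical console
journalctl -f -u ssh -u pveproxy -u networking   # narrow to SSH, the GUI proxy, and networking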


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
I wish it were that easy. The IP I have assigned to that node is set statically, outside of the DHCP pool of that subnet. I did run 'journalctl -f' via a console session through the iDRAC, and I see the root session login, and that's where the journal log stops. The console through the iDRAC still works; it's just the SSH sessions and the terminal session in the GUI that stop. This output is from the console via iDRAC (attachment: console.JPG).
 
Unfortunately there is no universal solution that always works when your "SSH" session hangs. So far you have provided two data points that point to the network: SSH (TCP/IP) hangs, and the GUI (TLS/SSL/TCP/IP) hangs. It has happened more than once here on the forum that a poster was sure there were no duplicate IPs on the network, and at the end of much back and forth, there were. Perhaps this is not your case; there is no way to be sure for anyone but you.

Your goal should be: isolate, divide, and conquer (a few example commands follow the list).
- Determine if you have ICMP loss at any point, i.e., continuous ping.
- Determine if you have TCP/IP errors by examining the interfaces.
- Utilize pure TCP/IP traffic to confirm drops, i.e., iperf.
- Change the network cable and/or NIC.
- If there are extra devices (switch/hub/etc.) in the path, try to remove them.
- Start both the SSH client and server with the debug option.
- Examine a network trace (tcpdump/wireshark).
- Boot from one of the many available live Linux ISOs, match the config, and see if it happens again.
- Humor me and change the IP.
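For reference, roughly what the first few checks could look like from a shell on the node; interface names and addresses are placeholders you would need to replace:

ping -D <node-ip>                        # continuous ping with per-reply timestamps
ip -s link show <nic>                    # interface error/drop counters
ethtool -S <nic> | egrep -i 'err|drop'   # NIC driver statistics
iperf3 -s                                # iperf3 server on the PVE node
iperf3 -c <node-ip> -t 120               # client from another host, while an SSH session is open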

Keep in mind that PVE is a set of packages on top of a Debian userland with an Ubuntu-based kernel. SSH specifically is the standard, unmodified Debian package. So in fact, you are simply troubleshooting Linux networking.

Good luck

P.S. just realized that there is a 3rd data point - NFS. Although your description of what happened to it is rather nebulous ...
NFS stopped working


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
First, please accept my sincere thanks for the help you've provided. I genuinely appreciate it. This is frustrating. I'm a network guy, not a server guy. This is a home lab, and I'm trying to expand my knowledge and learn a bit more.
That being said, this is a simple topology: FW/router -> switch -> server.

- Determine if you have ICMP loss at any point, i.e., continuous ping. -> None whatsoever (except, obviously, when the server or links are down).
- Determine if you have TCP/IP errors by examining the interfaces. -> None. Interfaces are clean.
- Utilize pure TCP/IP traffic to confirm drops, i.e., iperf. -> None seen.
- Change the network cable and/or NIC. -> This server has two 4-port NICs, one built-in. So far, same results.
- If there are extra devices (switch/hub/etc.) in the path, try to remove them. -> N/A.
- Start both the SSH client and server with the debug option.
- Examine a network trace (tcpdump/wireshark).
- Boot from one of the many available live Linux ISOs, match the config, and see if it happens again. -> Not a bad idea. I will try this today.
- Humor me and change the IP. -> I actually did this after your first reply. You made a valid point, and I needed to get it out of the DHCP pool. So, done.

Further troubleshooting - I have validated that this is not a duplicate-IP issue. The VLAN this is on is a /24, and there are literally only about 3-4 devices total on that subnet; it's not used for standard LAN traffic. Oddly, when I connect directly to that subnet, the issue goes away, and that led me to believe that this may be a router issue. The problem with that theory is that the other network devices on that subnet do not experience this issue, which points me back to this being a server/hypervisor issue. The router has been reloaded.
I DID remove the bonded link (LACP LAG) to narrow this to a single link between the switch and the server to see if this alleviated the issue. No go. Same problem.
Also to be clear, the GUI session itself doesn't hang. Only the terminal session in the GUI and SSH sessions outside of the GUI.

One thing I want to clarify - The way you mentioned SSH - Are you saying that even in the GUI, the terminal session is a built-in SSH session?

And yes, the NFS suddenly stopped working after being up and stable for quite some time. When I try to add it back, I'm getting an error stating that the server denied the connection. This is still happening, so that leads me to believe that something flaky has gone on with the NAS - however, SMB/CIFS is working just fine. I thought I read somewhere that Proxmox tightened up on NFS or something.

Again, thank you for your help thus far.
 
- Determine if you have ICMP loss at any point, i.e., continuous ping. -> None whatsoever (except, obviously, when the server or links are down).
- Determine if you have TCP/IP errors by examining the interfaces. -> None. Interfaces are clean.
- Utilize pure TCP/IP traffic to confirm drops, i.e., iperf. -> None seen.
- Change the network cable and/or NIC. -> This server has two 4-port NICs, one built-in. So far, same results.
Just to confirm: during these tests you actually also had an instance of a broken SSH session, but ICMP and iperf continued without drops?

Oddly, when I connect directly to that subnet, the issue goes away, and that led me to believe that this may be a router issue.
Could be a combination of both; we've seen some weird VLAN interactions with LACP/Linux, and we've also seen many misconfigured switches.

Also to be clear, the GUI session itself doesn't hang. Only the terminal session in the GUI and SSH sessions outside of the GUI.

One thing I want to clarify - The way you mentioned SSH - Are you saying that even in the GUI, the terminal session is a built-in SSH session?
Sessions in the browser are much more forgiving of network drops/reroutes and will reconnect seamlessly in 99.9% of cases; SSH is much more session-dependent. It can recover in many cases, but not always.
The browser opens a VNC terminal, which is again less tolerant of network interruptions than basic browser communications. You can check what the browser does with "ps -efw | egrep 'vnc|termproxy|login'".

NFS does sound like a NAS problem. However, NFS is also an unmodified part of the basic Debian OS, used by millions of servers around the globe. The fact that it suddenly stopped working with "server denied connection" is, to me, an indication that something has changed.
One possibility is that support for NFSv3 over UDP was removed in more modern kernels; you may need to specify TCP or v4.
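Not sure this matches your Synology setup, but a quick test from the node, forcing the transport and version (the export path below is made up; adjust to yours):

showmount -e <nas-ip>                                                # what the NAS says it exports
mount -t nfs -o vers=4,proto=tcp <nas-ip>:/volume1/share /mnt/test
# if that mount works, the equivalent for the PVE storage is an "options vers=4"
# line on the nfs entry in /etc/pve/storage.cfg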

You said that the issues started after the 7-to-8 upgrade. You can try booting an older kernel (5.x) and testing with it. There have been cases of NIC firmware compatibility issues with newer kernels, specifically Realtek-based ones.
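Something along these lines (package and version names are only examples; what is actually installable depends on your repositories):

proxmox-boot-tool kernel list                 # show the kernels currently installed
apt install pve-kernel-5.15                   # pull in an older kernel series, if still available
proxmox-boot-tool kernel pin 5.15.xxx-x-pve   # pin one of the versions shown by 'kernel list'
reboot
# for a one-off test you can also pick an older kernel under "Advanced options" in the GRUB boot menu;
# 'proxmox-boot-tool kernel unpin' goes back to the default selection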



Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
Just to confirm: during these tests you actually also had an instance of a broken SSH session, but ICMP and iperf continued without drops?

That's correct.


Could be a combination of both; we've seen some weird VLAN interactions with LACP/Linux, and we've also seen many misconfigured switches.
I agree, and that's why I removed the LACP/Bonded link for troubleshooting. Didn't fix it.


Sessions in the browser are much more forgiving of network drops/reroutes and will reconnect seamlessly in 99.9% of cases; SSH is much more session-dependent. It can recover in many cases, but not always.
The browser opens a VNC terminal, which is again less tolerant of network interruptions than basic browser communications. You can check what the browser does with "ps -efw | egrep 'vnc|termproxy|login'".

NFS does sound like a NAS problem. However, NFS is also an unmodified part of the basic Debian OS, used by millions of servers around the globe. The fact that it suddenly stopped working with "server denied connection" is, to me, an indication that something has changed.
One possibility is that support for NFSv3 over UDP was removed in more modern kernels; you may need to specify TCP or v4. -> I can check into that. I don't recall this being an option on the Synology, but I will look.

You said that the issues started after the 7-to-8 upgrade. You can try booting an older kernel (5.x) and testing with it. There have been cases of NIC firmware compatibility issues with newer kernels, specifically Realtek-based ones. -> Good idea. I think I saw the option in the GRUB menu. That would be my luck... However, the other NIC is another 4-port, and I know it's an Intel NIC.



 
OK... noob question: how do I load up an earlier version of the kernel? Currently GRUB is only showing 6.2.16-3 and 6.2.16-19.
 
Ok, so update - I've tried older kernels, and even blew the OS out again and rolled back to 7.4.3. Same thing. I've given up on NFS, so let's remove that from the equation. Tried both of the NICs in the machine - same thing. Even found that using an interface for SSH in the same VLAN and subnet my machine is in does the same thing.

My next steps are a packet capture. I'd also like to look at the SSH daemon logs to see if something pops up. Any advice on the best way to do that?
Barring this, I'm starting to wonder if it's something with the machine itself.
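My rough plan, assuming the usual Debian tooling (corrections welcome):

journalctl -u ssh -f                            # follow the SSH daemon's log while a session freezes
/usr/sbin/sshd -ddd -p 2222                     # one-off debug instance in the foreground on a spare port
ssh -vvv -p 2222 root@<node-ip>                 # verbose client against that debug instance
tcpdump -ni <nic> -w freeze.pcap port 22        # packet capture to open in Wireshark afterwards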
 
Did you ever figure this out? I'm having a similar problem where SSH just freezes and stops working randomly (but usually fairly quickly after connecting) if the interface is on a VLAN other than the vmbr0 default. This happens both with a virtual VLAN interface on the same NIC that works fine connecting to the default, and with a totally separate NIC set to the VLAN I am using for management. It's not an IP collision over here either.
 
I ended up blowing this out and going back to 7.4.x. I just recently (and cautiously) updated to 7.4.17. I also gave up on NFS, and my shares are through SMB/CIFS now. So far I'm stable, but I'm very hesitant about 8.x.
Make sure you have REALLLLLY good backups before you blow it all out. It took me a month to rebuild a couple of my VMs/LXCs.
 
I figured mine out; it ended up being an asymmetric routing issue, since I was SSHing from my main net to the management interface on the management VLAN, but Proxmox also has an interface on the regular net, so it was responding via that. I ended up having to basically do this: https://wbhegedus.me/avoiding-asymmetric-routing/

and made it permanent with a bunch of post-up stuff in /etc/network/interfaces.
 
Hello,

I might have the exact same issue. Can you elaborate more on how you fixed this on your Proxmox?
Thanks
 
I did what was in my link, but I modified it so my tables were named slightly differently. I manually edited /etc/iproute2/rt_tables on each node and added
1 mgmt
2 users
and then I did the rest of the howto for each of those tables, putting the proper entries into the tables. Once that worked, I added post-up and post-down commands to my bridge and to the VLAN on the bridge in /etc/network/interfaces, since Proxmox doesn't use NetworkManager. It looks like this for me, although you will have to modify it to your situation; only the post-up and post-down stuff directly applies to your question:


auto vmbr0
iface vmbr0 inet static
    address 192.168.3.174/24
    bridge-ports enp2s0f1np1
    bridge-stp off
    bridge-fd 0
    bridge-vlan-aware yes
    bridge-vids 1 20 30 40 60
    offload-rx-vlan-filter off
    post-up ip route add 192.168.3.0/24 dev vmbr0 table users
    post-up ip route add default via 192.168.3.1 dev vmbr0 table users
    post-up ip rule add from 192.168.3.174 lookup users
    post-up ip rule add from all iif vmbr0 lookup users
    post-down ip route del 192.168.3.0/24 dev vmbr0 table users
    post-down ip rule del from 192.168.3.174 lookup users
    post-down ip rule del from all iif vmbr0 lookup users

auto vmbr0.40
iface vmbr0.40 inet static
    address 192.168.40.107/24
    gateway 192.168.40.1
    post-up ip route add 192.168.40.0/24 dev vmbr0.40 table mgmt
    post-up ip route add default via 192.168.40.1 dev vmbr0.40 table mgmt
    post-up ip rule add from 192.168.40.107 lookup mgmt
    post-up ip rule add from all iif vmbr0.40 lookup mgmt
    post-down ip route del 192.168.40.0/24 dev vmbr0.40 table mgmt
    post-down ip rule del from 192.168.40.107 lookup mgmt
    post-down ip rule del from all iif vmbr0.40 lookup mgmt
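Once the interfaces have been reloaded (ifreload -a, or a reboot), you can sanity-check it with something like:

ip rule show                 # the 'from ... lookup mgmt/users' rules should be listed
ip route show table mgmt
ip route show table users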
 
