stanikz

Member
Nov 24, 2021
Hi, I apologize in advance for the length of this post. I will try to be as concise
as possible while still giving a complete explanation of the issue I am experiencing with my server.
Thank you for your patience while I go through the details.

Main issue: I can't say whether this is related to the latest Proxmox 7.4 update
or to the hardware (which I doubt), but the Proxmox
server randomly freezes, sometimes within 1-2 hours after a reboot. The IPMI interface works fine
and I can start the remote console, but the OS is completely frozen. Even with every single VM/container turned off
and only the Proxmox OS running, it freezes after some time. The only way to recover
is to reboot the server from the IPMI interface.

Secondary issue: Whenever the Proxmox server freezes, all clients connected to the D-Link router
(which I have turned into a switch by disabling its DHCP server and giving it a static IP)
get disconnected or can't access the internet. The only temporary fix is either
rebooting the server or unplugging the network cable from the Proxmox host port (not the IPMI interface),
after which all clients immediately work fine again. I have tried replacing the D-Link switch/access point
with two other switches, but I ran into the same issue with both of them.

I have created a script that records exactly when the Proxmox server gets disconnected
(whenever it can't ping 192.168.1.1, the main router). At the same time, it tries
to extract entries from both the syslog (''/var/log/syslog'') and messages (''/var/log/messages'') logs,
capturing everything from 5 minutes before the event until the event itself. However,
nothing is recorded in the output files, either because my script is broken or because the server is unable to fetch the logs properly.

You might wonder why I only want the logs from the 5 minutes leading up to the event.
The reason is to avoid sifting through a large number of log entries, which could amount to around 20,000 lines,
in the IPMI remote console (a hard task). By narrowing the time frame, I can focus on the entries most likely to explain the issue.

Here is my script, for the Bash people:

Bash:
#!/bin/bash

LOG_FILE="/var/log/server_disconnections.log"
TARGET_IP="192.168.1.1"
PING_INTERVAL=5  # in seconds
LOG_DURATION=$((5 * 60))  # 5 minutes in seconds

LOG_DIR="/root/log"  # Specify the desired directory path

# Function to create directory if it does not exist
create_directory() {
  if [ ! -d "$1" ]; then
    mkdir -p "$1"
  fi
}

# Function to store timestamped record
log_disconnection() {
  local timestamp=$(date +"%Y-%m-%d %H:%M:%S")

  echo "Server disconnected from $TARGET_IP at $timestamp" >> "$LOG_FILE"

  # Capture syslog logs before and during disconnection
  local syslog_start_time=$(date -d "-$LOG_DURATION seconds" +"%Y-%m-%d %H:%M:%S")
  local syslog_end_time=$(date -d "$timestamp" +"%Y-%m-%d %H:%M:%S")
  local syslog_logs="$LOG_DIR/syslog_logs_at_disconnection"
  sed -n "/$syslog_start_time/,/$syslog_end_time/p" /var/log/syslog > "$syslog_logs"

  # Capture messages logs before and during disconnection
  local messages_logs="$LOG_DIR/messages_logs_at_disconnection"
  sed -n "/$syslog_start_time/,/$syslog_end_time/p" /var/log/messages > "$messages_logs"
}

# Create directory if it does not exist
create_directory "$LOG_DIR"

# Main script logic
while true; do
  if ! ping -c 1 "$TARGET_IP" > /dev/null 2>&1; then
    log_disconnection
  fi
  sleep "$PING_INTERVAL"
done
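
A possible reason the capture files stay empty, offered as an assumption rather than a confirmed diagnosis: on a default Debian-based install, lines in /var/log/syslog usually start with a timestamp like "Jun 23 20:12:01", while the sed patterns above search for the form "2023-06-23 20:12:01". A sed range also only starts printing at a line that matches the start pattern exactly, to the second. The sketch below instead compares the first 15 characters of every line against window bounds written in the log's own format. The function names extract_window, window_start and window_end are hypothetical, and the string comparison only holds while the window stays inside one calendar month.

```shell
#!/bin/bash
# Sketch only. Assumes traditional syslog timestamps ("Mon DD HH:MM:SS",
# 15 characters at the start of each line) and a window that does not
# cross a month boundary, so plain string comparison orders lines correctly.

# Print every line of file $1 whose timestamp lies between $2 and $3 (inclusive).
extract_window() {
  local file="$1" start="$2" end="$3"
  awk -v s="$start" -v e="$end" '
    { ts = substr($0, 1, 15) }
    ts >= s && ts <= e
  ' "$file"
}

# Print the timestamp $1 seconds ago, in the format the log uses.
# LC_ALL=C keeps the month abbreviation in English; date -d needs GNU date.
window_start() {
  LC_ALL=C date -d "-$1 seconds" +"%b %e %H:%M:%S"
}

# Print the current timestamp in the same format.
window_end() {
  LC_ALL=C date +"%b %e %H:%M:%S"
}
```

Inside log_disconnection, this could be used as: extract_window /var/log/syslog "$(window_start "$LOG_DURATION")" "$(window_end)" > "$syslog_logs"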

Thank you for your patience in reading this far. I would appreciate it if someone could help out.

Here is what the network infrastructure looks like when everything is functioning, versus
when the secondary issue appears and the Proxmox server freezes:
1687552365639.png

My current setup is:
proxmox-ve: 7.4-1 (running kernel: 5.15.107-2-pve)
pve-manager: 7.4-3 (running version: 7.4-3/9002ab8a)

MB: Supermicro X9DRE-TF+/X9DR7-TF+
BIOS version: 3.2 [Build Date: 03/04/2015]
CPU: 2 x Xeon E5-2640 v2 @ 2.00GHz (16 cores and 32 threads in total)
RAM: 256GB DDR3
Filesystem: ZFS
SSD: 2 x 240GB Kingston A400
HDD: [2 x 14TB WDC WD140EDFZ] and [1 x 3TB WDC WD30EFRX-6]
 

You could also look at /var/log/kern.log. (Its contents are already in syslog, but this file contains only kernel messages, so it is easier to read. If it's a freeze, it should be kernel related.)
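
As a rough sketch of how that file could be narrowed down before reading it over a slow remote console, the function below filters a kernel log for a handful of message fragments that often accompany hangs and hardware faults. The function name scan_kernel_log and the pattern list are assumptions chosen for illustration, not an exhaustive or official list.

```shell
#!/bin/bash
# Sketch only: the patterns are illustrative, not an exhaustive list.
# Prints matching lines from the kernel log given as $1
# (defaults to /var/log/kern.log). Matching is case-insensitive.
scan_kernel_log() {
  local file="${1:-/var/log/kern.log}"
  grep -E -i \
    -e 'hardware error' \
    -e 'machine check' \
    -e 'soft lockup' \
    -e 'blocked for more than' \
    -e 'call trace' \
    -e 'edac' \
    -e 'out of memory' \
    "$file"
}
```

Piping the result through tail, for example scan_kernel_log | tail -n 50, keeps only the most recent matches on screen.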

For 2), if your router is hanging, maybe it is overloaded. Are you sure you don't have a network loop somewhere?

Also, can you send your /etc/network/interfaces file?
 
When I run cat /var/log/kern.log, I can see a lot of logs. However, it's not a fun experience to do this from the Java iKVM viewer (which I strongly dislike). Additionally, if I plug the network cable back in, the server and everything else connected to the same switch will likely get disconnected again after 1-2 hours.

I do not believe that I have a network loop, but I am unsure how to confirm this. The only thing that could be related to a network loop is a container running Ubuntu with WireGuard installed, which acts as a client for another network. The host itself and other clients in the network use this container to reach the other subnets. Towards the end of my configuration file, you can also see the static routes that the host and my other devices, VMs, containers, etc. use to reach those subnets. I have commented them out (#) in an attempt to fix the freezing and hanging of the OS. Unfortunately, disabling the static routes did not resolve the problem.
networkinterface.png

I will check the kernel logs to see if I can find something there. Otherwise, I will boot the host on an older kernel to see whether the kernel is causing all of this.
 
After an extensive period of troubleshooting, I found the solution to my problem. I had 16 memory modules installed in 16 of the 24 DIMM slots. To diagnose the issue thoroughly, I cross-tested the modules and ran an extensive memory test (memtest86). In the end, I identified one specific memory module as the cause of the mysterious freezes.

To resolve the problem, I eliminated that single problematic memory module. Following the removal, everything was back and running smoothly, and I'm pleased to say that the system has been running strong without any issues up until today.
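
For readers chasing a similar freeze, memory error counters can sometimes hint at failing RAM before a full offline memory test. The sketch below sums the ce_count (corrected) and ue_count (uncorrected) files that the Linux EDAC subsystem exposes under /sys/devices/system/edac/mc when an EDAC driver is loaded for the platform. Whether such a driver is active on this board is an assumption, and the function name edac_totals is hypothetical. A zero count does not prove the memory is healthy.

```shell
#!/bin/bash
# Sketch only. Sums EDAC error counters under the directory given as $1
# (defaults to the standard sysfs location). Prints "ce=<n> ue=<n>".
# If no memory controller directories exist, both totals stay at 0.
edac_totals() {
  local base="${1:-/sys/devices/system/edac/mc}"
  local ce=0 ue=0 f
  for f in "$base"/mc*/ce_count; do
    [ -r "$f" ] && ce=$((ce + $(cat "$f")))
  done
  for f in "$base"/mc*/ue_count; do
    [ -r "$f" ] && ue=$((ue + $(cat "$f")))
  done
  echo "ce=$ce ue=$ue"
}
```

Running edac_totals periodically and comparing the output over time would show whether the counters are growing.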
 