Hi, I would like to apologize in advance for the length of this post. I will make an effort to be as concise
as possible while still explaining in full the issue I am currently having with my server.
Please bear with me while I go through all the necessary details.
Main issue: I can't tell whether this is related to the latest Proxmox 7.4 update
or to the hardware (which I doubt), but the Proxmox server freezes randomly,
sometimes within 1-2 hours of a reboot. The IPMI interface keeps working fine and
I can start the remote console, but the OS is completely frozen. Even with every single VM/container turned off
and only the Proxmox OS running, it still freezes after some time. The only way to recover
is to reboot the server from the IPMI interface.
Secondary issue: Whenever the Proxmox server freezes, all the clients connected to the D-Link router
(which I have turned into a switch by disabling its DHCP server and giving it a static IP)
get disconnected or can't access the internet. The only temporary fixes are either
rebooting the server or unplugging the network cable going to the Proxmox host (not the IPMI interface),
after which all clients suddenly work fine again. I have tried replacing the D-Link switch/access point
with two other switches, but unfortunately I ran into the same issue with both of them.
I have written a script that records exactly when the Proxmox server gets disconnected
(whenever it can't ping the main router at 192.168.1.1). At the same moment it tries
to extract entries from both the syslog (''/var/log/syslog'') and messages (''/var/log/messages'') logs,
covering the 5 minutes before the event up to the event itself. However,
nothing is recorded in the capture files, either because my script is broken or because the server is not able to fetch the logs properly.
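A side note on recording the disconnection itself: a loop that pings every few seconds writes a new record on every failed ping, so one long outage produces many records and the capture files get overwritten each time. Below is a minimal sketch of recording only the moments when the state flips. The names record_transition and last_state are made up for this illustration and are not part of my script, and the ping wiring is left commented out so the snippet does not loop forever.

```shell
#!/bin/bash
# Sketch: print a record only when reachability changes.
# record_transition and last_state are hypothetical names.

last_state="up"   # assume the link is up when monitoring starts

# record_transition CHECK_CMD [ARGS...]
# Runs the check once; prints one line if the state flipped since last call.
record_transition() {
    local state="up"
    "$@" > /dev/null 2>&1 || state="down"
    if [ "$state" != "$last_state" ]; then
        echo "$(date +"%Y-%m-%d %H:%M:%S") link went $state"
        last_state="$state"
    fi
}

# Example wiring (commented out on purpose):
# while true; do
#     record_transition ping -c 1 192.168.1.1 >> /var/log/server_disconnections.log
#     sleep 5
# done
```

The function has to be called directly, as in the commented loop, and not inside $(...), because a command substitution runs in a subshell and the updated last_state would be lost.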
You might be wondering why I only want the logs from the 5 minutes before the event up to the event itself.
The reason is to avoid sifting through a large number of log entries, which could amount to around 20,000 lines,
through the IPMI interface (a hard task). By narrowing down the time frame, I can focus on the log entries that are most likely to explain the issue.
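To illustrate the 5-minute idea, here is a rough sketch of pulling only the recent lines out of a syslog-style file. It assumes that every line starts with the traditional ''Mon DD HH:MM:SS'' timestamp (the default rsyslog layout on Debian-based systems, as far as I know) and that GNU date and grep are available. capture_recent is a hypothetical helper name, not an existing tool.

```shell
#!/bin/bash
# Sketch: copy the lines from the last N minutes of a syslog-style file.
# Assumption: lines begin with "Mon DD HH:MM:SS" (traditional syslog layout).

# capture_recent SRC DEST [MINUTES]   (MINUTES defaults to 5)
capture_recent() {
    local src="$1" dest="$2" minutes="${3:-5}"
    local patterns i status
    patterns=$(mktemp) || return 1

    # Write one anchored pattern per minute in the window,
    # e.g. "^Jun  1 12:34:" (day of month is space padded by %e).
    i="$minutes"
    while [ "$i" -ge 0 ]; do
        LC_ALL=C date -d "$i minutes ago" +"^%b %e %H:%M:" >> "$patterns"
        i=$((i - 1))
    done

    # Keep only the lines whose timestamp falls in one of those minutes.
    grep -f "$patterns" "$src" > "$dest"
    status=$?
    rm -f "$patterns"
    return "$status"
}
```

It could be called as capture_recent /var/log/syslog /root/log/syslog_recent 5. Matching on whole minutes means the window is slightly wider than 5 minutes, which seems acceptable for this purpose.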
Here is my script, for the Bash people:
Bash:
#!/bin/bash

LOG_FILE="/var/log/server_disconnections.log"
TARGET_IP="192.168.1.1"
PING_INTERVAL=5 # in seconds
LOG_DURATION=$((5 * 60)) # 5 minutes in seconds
LOG_DIR="/root/log" # Specify the desired directory path

# Function to create directory if it does not exist
create_directory() {
    if [ ! -d "$1" ]; then
        mkdir -p "$1"
    fi
}

# Function to store timestamped record
log_disconnection() {
    local timestamp=$(date +"%Y-%m-%d %H:%M:%S")
    echo "Server disconnected from $TARGET_IP at $timestamp" >> "$LOG_FILE"

    # Capture syslog logs before and during disconnection
    local syslog_start_time=$(date -d "-$LOG_DURATION seconds" +"%Y-%m-%d %H:%M:%S")
    local syslog_end_time=$(date -d "$timestamp" +"%Y-%m-%d %H:%M:%S")
    local syslog_logs="$LOG_DIR/syslog_logs_at_disconnection"
    sed -n "/$syslog_start_time/,/$syslog_end_time/p" /var/log/syslog > "$syslog_logs"

    # Capture messages logs before and during disconnection
    local messages_logs="$LOG_DIR/messages_logs_at_disconnection"
    sed -n "/$syslog_start_time/,/$syslog_end_time/p" /var/log/messages > "$messages_logs"
}

# Create directory if it does not exist
create_directory "$LOG_DIR"

# Main script logic
while true; do
    if ! ping -c 1 "$TARGET_IP" > /dev/null 2>&1; then
        log_disconnection
    fi
    sleep "$PING_INTERVAL"
done
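One possible reason the capture files stay empty: a sed range of the form /start/,/end/p only starts printing at a line that contains the start timestamp, and the script builds that timestamp as ''YYYY-MM-DD HH:MM:SS''. With the default rsyslog layout, lines in /var/log/syslog would instead begin with a stamp like ''Jun  1 12:34:56'', so the start pattern may never match. The sketch below shows the mismatch on a made-up log line. to_syslog_stamp is a hypothetical helper, GNU date is assumed, and the layout assumption should be checked against the real log (for example with head /var/log/syslog).

```shell
#!/bin/bash
# Sketch: compare the timestamp layout built by the script with the
# traditional syslog layout (an assumption, check your own log).

# to_syslog_stamp "YYYY-MM-DD HH:MM:SS"
# Prints the same moment as "Mon DD HH:MM:SS" (day padded with a space).
to_syslog_stamp() {
    LC_ALL=C date -d "$1" +"%b %e %H:%M:%S"
}

# Made-up log line and timestamp, for illustration only.
sample_line="Jun  1 12:34:56 pve kernel: example entry"
script_stamp="2023-06-01 12:34:56"

# Count matching lines for each layout (grep -c prints the count).
printf '%s\n' "$sample_line" | grep -c "$script_stamp" || true                       # prints 0
printf '%s\n' "$sample_line" | grep -c "$(to_syslog_stamp "$script_stamp")" || true  # prints 1
```

Even with the layout corrected, an exact-second pattern only matches if some log line was written in that very second, so matching on the minute is probably more forgiving than matching on the full timestamp.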
Thank you for your patience in reading all the way here; I would appreciate it if someone could help out.
Here is what the network infrastructure looks like when everything is functioning versus
when the secondary issue appears and the Proxmox server freezes:
My current setup is:
proxmox-ve: 7.4-1 (running kernel: 5.15.107-2-pve)
pve-manager: 7.4-3 (running version: 7.4-3/9002ab8a)
MB: Supermicro X9DRE-TF+/X9DR7-TF+
BIOS version: 3.2 [Build Date: 03/04/2015]
CPU: 2 x Xeon E5-2640 v2 @ 2.00GHz (16 cores and 32 threads in total)
RAM: 256GB DDR3
Filesystem: ZFS
SSD: 2 x 240GB Kingston A400
HDD: [2 x 14TB WDC WD140EDFZ] and [1 x 3TB WDC WD30EFRX-6]