Hello Proxmox folks, I hope you are well.
I am once again asking kindly for your assistance with a very frustrating stability issue on my Proxmox VE server. Attached is the error in question from the syslog, though I'll happily provide any other information required.
Now, I would normally state here what I suspect to be causing the problem, and the reason(s) for my suspicions, but this time I really feel at a complete loss. I can't even say something as primitive as "when the system is under load", as it really does appear to be random. If someone could help me find what part of the system these crashes are tied to, I could certainly have a think about how my use of the server could tie in, and drill down from there.
Here is the list of things I've tried or checked so far:
- Return of the host machine and exchange for replacement. It is a refurbished unit so different CPU, motherboard, power supply - I provide my own RAM & storage.
- Installed brand new 2x16GB RAM kit which is compatibility certified for my machine
- BIOS updated to latest version
- Latest Intel microcode installed
- Latest firmware installed for boot SSD
- Opted-in to the Linux 6.2 kernel (a good idea? CPU is Intel 7700)
- Latest PVE updates installed
- fsck of host and all CTs & VMs comes back clean
- Temperatures remain reasonable at all times & loads
- No VMs or CTs are near to running out of disk space or memory
- Host is also nowhere near exhausting CPU, RAM, storage capacity etc.
Here is a quick list of things specific to my setup which I believe may be worth noting. I will gladly elaborate further, but am listing them in brief here in case anything sets off alarm bells!
- I have a script which runs at 5AM each day to fstrim -av the PVE host and then pct fstrim all running CTs. I do this as I wasn't confident that this was happening (or at least regularly enough), but I am certain a crash has never happened alongside this script running. I would very much welcome any advice around this topic, by the way!
- I have an OPNsense VM running on the machine which is the router for the LAN. I pass through the two ports of an Intel I350-T2 NIC to this VM to act as the LAN and WAN links. This VM has exclusive access to these ports. All other VMs/CTs use a bridge of the motherboard's ethernet ports. Occasionally I see lines such as these in the syslog. I'm not sure if this is a problem but it seems odd to me:
May 05 03:18:43 OptiPlex7050 kernel: vmbr1: port 2(tap501i0) entered disabled state
May 05 03:18:43 OptiPlex7050 kernel: vmbr1: port 2(tap501i0) entered blocking state
May 05 03:18:43 OptiPlex7050 kernel: vmbr1: port 2(tap501i0) entered forwarding state
- I have an 18TB SATA HDD in the system which acts as mass storage for my media centre. This is passed through directly to the media centre CT (and an SMB CT) as a mount point. (All VMs/CTs run on the M.2 SSD)
- I have had similar issues with general protection fault crashes on this machine before (another Proxmox Forum post), but am quite confident that these were in fact due to a mixed kit of RAM, half of which were not truly compatible. As above, I am now solely running a new, 2x16GB kit.
- Generally, I feel as though the system is more susceptible to crashing whilst under some amount of load, perhaps IO - however, it has crashed whilst seemingly at idle also. The system has crashed before whilst completing a backup job, and also whilst a CT running qBittorrent was doing some heavy lifting, for instance.
- Sometimes when doing large, high-speed downloads to the hard drive for my media centre, I see the IO delay statistic jump to 33%. I am assuming this represents the hard drive reaching 100% utilisation, which shows as 33% in the dashboard as I have two other drives making up the remaining 67% - the SSD which runs the host and all CTs/VMs and a 2TB HDD which is only for backups to reside on.
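For anyone wanting to compare, here is a minimal sketch of a daily trim job like the one described above. It is not the exact script in use; the function names and the crontab path are hypothetical, and it assumes that pct list prints a header row followed by one line per container with the VMID and status in the first two columns.

```shell
#!/bin/sh
# Sketch of a daily trim job, intended to be run from root's crontab, e.g.:
#   0 5 * * * /usr/local/sbin/daily-trim.sh --run    (path is hypothetical)

# Read `pct list` output on stdin and print the IDs of running containers.
# Assumption: first line is a header, then "VMID STATUS ..." on each line.
running_ct_ids() {
    awk 'NR > 1 && $2 == "running" { print $1 }'
}

# Trim all mounted host filesystems that support it, then each running CT.
trim_all() {
    fstrim -av
    pct list | running_ct_ids | while read -r ctid; do
        echo "Trimming CT ${ctid}"
        pct fstrim "${ctid}"
    done
}

# Only do real work when explicitly asked, so the file can be sourced safely.
if [ "${1:-}" = "--run" ]; then
    trim_all
fi
```

Logging the output of each run (for example by redirecting it to a file from cron) would make it easier to rule the trim job in or out when lining up crash timestamps.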
If anyone can lend any assistance, I would be hugely grateful as I'm really quite frustrated with this now. I'm just at a bit of a loose end now and feeling a tad hopeless.
That said, I'm highly motivated to find and resolve the issue so anything at all will be welcome to hear and try out. If there are any logs to check, tests to run or questions to answer, please do let me know. I've already spent numerous late nights and tens of hours on this, so what's a few more?
Thank you,
Chris