Can't get stable Proxmox server

altano

Hey All,

I can't seem to get a stable Proxmox server. This was my last post on the subject. I got sick of fiddling with one or two variables at a time and then waiting two weeks to see if the system would hang, so I went ahead and changed almost everything. I upgraded from Proxmox 5.4 to 6, swapped out all the RAM with known-good RAM from another system, and swapped the boot drive for another SSD. I also moved all the VMs from NFS storage to the local OS SSD, a 2 TB Intel 760p formatted with ZFS.

The only thing the new build has in common with the old one is the motherboard/SoC, a SuperMicro M11SDV-8C-LN4F with the embedded Epyc 3251. That, and all the VMs/containers I'm running, of course.

Today, after about two weeks of uptime, during a VM backup the hypervisor and VMs hung. I saw this on the console:

[screenshot of the console output]

"WARNING: Pool 'rpool' has encountered an uncorrectable I/O failure and has been suspended."

I can still type characters at the login prompt, but it hangs as soon as I hit Enter. I can't SSH into the Proxmox host or into the VMs. Everything is unresponsive. Recovering the server requires a full power-off/power-on (just rebooting quickly leads back to the same bad state, if it boots at all).

Here is what I can gather:
  • No crash logging in kern.log or syslog. The message above never appears in kern.log, so I assume that once it's hit, the system can no longer log failures. I can find no logging from the time of the crash and no indication of the problem. rpool is the ZFS pool holding the OS and the VMs, so its suspension obviously coincides with the server dying.
  • `zpool history rpool` reports "zpool import -N rpool" was the only command run close to the crash
  • `zpool status rpool` reports "No known data errors"
  • The system died while backing up a Windows VM (that has the GPU passed-through, in case that matters) to an NFS share.
I'm just as stuck as I was before. Is the motherboard itself bad (e.g. its I/O controller)? Does anyone have an idea how I can investigate further? For example, is there a way I can mirror dmesg to a network server to actually capture the error logging the next time this occurs?
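Something like netconsole is what I have in mind; here's a rough sketch (the interface name, IPs, and MAC are placeholders for my network):

Code:
# Stream kernel messages over UDP to another host with netconsole.
# Placeholders: vmbr0 = this host's bridge interface,
# 192.168.1.20 = this Proxmox host,
# 192.168.1.50 / aa:bb:cc:dd:ee:ff = the machine collecting the logs.
modprobe netconsole netconsole=6665@192.168.1.20/vmbr0,6666@192.168.1.50/aa:bb:cc:dd:ee:ff

# On the collecting machine (netcat flag spelling varies by variant):
nc -u -l 6666 | tee proxmox-kernel.log

Since netconsole never touches the disk, it should keep logging even while rpool is suspended.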
 
* sounds like some component is broken (could be the disks, controllers, cables, power supply, ...)
* try gathering the output of `dmesg` before the error is reached - or boot with a live-cd and check its dmesg output
 
Hey Stoiko,

I agree something sounds broken, but the only part that hasn't changed is the motherboard and I'm not sure how to diagnose that further.

> try gathering the output of `dmesg` before the error is reached

Any idea how I can do that if the drive itself is becoming unwritable? Can I automatically send dmesg output to another host somehow?
 
try booting from a live-cd and check the dmesg there
check the output of `nvme` (you might need to install it after booting into the live-cd)
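e.g. on a Debian-based live image (an assumption - package names differ on other distros):

Code:
# install the NVMe management tool, then query the drive:
apt update && apt install -y nvme-cli
nvme list                   # enumerate NVMe devices
nvme smart-log /dev/nvme0   # health and error counters from the controller
nvme error-log /dev/nvme0   # controller error-log entries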
 
Thank you so much for your help.

> try booting from a live-cd and check the dmesg there

I'm not sure what you mean. I don't think I'll be able to reproduce the problem from a live CD; the machine usually only hangs after ~2 weeks of real usage.

> check the output of `nvme`

Are you referring to nvme-cli? Which specific command would you like me to run? Here are `list` and `smart-log`:

Code:
root@red:~# nvme list
Node             SN                   Model                                    Namespace Usage                      Format           FW Rev
---------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1     BTHH815607TY2P0F     INTEL SSDPEKKW020T8                      1           2.05  TB /   2.05  TB    512   B +  0 B   002C

Code:
root@red:~# nvme smart-log /dev/nvme0
Smart Log for NVME device:nvme0 namespace-id:ffffffff
critical_warning                    : 0
temperature                         : 38 C
available_spare                     : 100%
available_spare_threshold           : 10%
percentage_used                     : 0%
data_units_read                     : 4,942,537
data_units_written                  : 5,889,070
host_read_commands                  : 125,301,919
host_write_commands                 : 77,352,943
controller_busy_time                : 1,317
power_cycles                        : 34
power_on_hours                      : 306
unsafe_shutdowns                    : 11
media_errors                        : 0
num_err_log_entries                 : 0
Warning Temperature Time            : 0
Critical Composite Temperature Time : 0
Thermal Management T1 Trans Count   : 0
Thermal Management T2 Trans Count   : 0
Thermal Management T1 Total Time    : 0
Thermal Management T2 Total Time    : 0
 
I thought the problem occurred quite soon after booting...
can you try to scrub your pool (see `man zpool`) and see if that completes successfully?
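i.e. something along the lines of:

Code:
zpool scrub rpool     # start a scrub of the root pool
zpool status rpool    # shows scrub progress and, once finished, any errors found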
 
Once the hang occurs, rebooting puts me back into a hung state very quickly; usually the machine won't even boot. If I do a clean power cycle (power off and then back on) I usually don't crash again for a few weeks. Are you suggesting that once I get into this bad state, I reboot into a live CD?

I just ran a scrub and it passed with 0 errors.
 
If it runs for a few weeks, and afterwards does not work unless you do a cold boot - that sounds to me like a hardware problem with some device that needs to get completely reinitialized ...
* check dmesg regularly while the system runs ok (a sketch for capturing it off-host follows this list)
* check the cables, power supply, and HBA (maybe also pull them out and reseat them)
* make sure all your devices have the latest firmware updates installed (BIOS for the mainboard, the NVMe firmware, ...)
* run a memtest after the crash occurs
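for the first point, a minimal sketch - a cron job that periodically dumps the kernel ring buffer to an NFS share, so the latest messages survive even when the local disk stops being writable (/mnt/pve/nfs-backup is a placeholder; use any mount that is not on the failing pool):

Code:
# /etc/cron.d/dmesg-snapshot (hypothetical file name)
# every 5 minutes, dump the kernel ring buffer to an NFS share;
# "red" is the hostname from the outputs above
*/5 * * * * root dmesg -T > /mnt/pve/nfs-backup/dmesg-red.log 2>/dev/null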
 
For anyone following along, I was finally able to get a stable system by using an M.2 drive that SuperMicro recommends as compatible: the Toshiba XG5-P (HDS-TMN0-KXG50PNV2T04). The other drives I had in the system (HP EX900 120GB and Intel 760p) didn't seem to have any detectable problems, but this drive has given me 100 days of uptime, so I'm pretty sure the problem is actually fixed and this system just wasn't compatible with those other drives. Which means this was a hardware issue all along. *shrug*
 
