pve-kernel hangs without info on AMD Opteron 6380

momus
Dec 3, 2020
Hi,

I want to describe a problem I have been fighting for several months. I was migrating my old hosts (from CentOS/openSUSE with libvirt) to Proxmox. Four servers work without problems, but the last (fifth) one is giving me big problems.


The problematic server is a Dell PowerEdge R815 (4 AMD sockets). After the initial reinstallation with Proxmox it worked very well, but it had very old and not very powerful CPUs (4x AMD Opteron 6134). We found and ordered four used Opteron 6380s, and that is when our problems started.

The server started to hang randomly (no console, no keyboard). iDRAC still works and reports higher than usual power consumption, but the console is empty and unresponsive. A reboot (from iDRAC or via the power button) was needed to bring it back to life. We started searching for the cause: BIOS updated to the newest available version, C1E disabled, DMA Virtualization (AMD's equivalent of VT-d) disabled, BIOS restored to defaults, BIOS cleared by changing the jumper (and also removing the battery), UEFI boot enabled, watchdog enabled, NUMA disabled. Nothing helped. (I also found that there was a bug in the kernel's IPMI driver, but preventing it from loading did not help.)

We noticed that the server hangs when it is idle. When nothing was running on it, it hung after 1 to 7 days, but when the CPUs were in use it could work without problems for a month (I rendered many things using Blender or ffmpeg; 64 cores do a good job).

We switched the CPUs back to the 6134s: no problems, idling for 3 weeks without hangs. OK, maybe we bought broken 6380s, so we tested them by removing them and swapping them between sockets. We checked all variants and had problems in every one (it would be strange to have bought at least 3 broken CPUs out of 4). I found and ordered 4 new ones, unused, without any scratches (from the USA, not China; they came in their original boxes).

With these new CPUs installed, the server again hung after a few days once it went idle. We repeated the procedure of testing each supported combination of CPUs in the sockets, and every test failed. It is unlikely that we bought 7 broken CPUs out of 8 (the R815 requires 2 or 4 CPUs to be installed).

I tried to debug this problem, but there is no output on the screen (neither in iDRAC nor on a physical monitor connected to the VGA port) and no reaction to the keyboard. I tried console redirection (COM/serial cable, configured in both the BIOS and GRUB/kernel): nothing. The logs are empty (redirecting rsyslog did not help either). SysRq: nothing (I enabled it and switched to maximum debug level, but nothing helped; I could not even trigger any key combination).
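For anyone who wants to double-check the same debug setup on their own machine, here is a minimal Python sketch. It only reads the standard Linux files /proc/cmdline and /proc/sys/kernel/sysrq and reports which console= parameters the running kernel was booted with and whether SysRq is enabled. The helper names (console_args, sysrq_state, read_or_none) are my own and not part of any Proxmox tool, and this is not necessarily how the checks above were done.

```python
from __future__ import annotations

from pathlib import Path


def console_args(cmdline: str) -> list[str]:
    """Return the values of all console= parameters on a kernel command line."""
    return [
        token.split("=", 1)[1]
        for token in cmdline.split()
        if token.startswith("console=")
    ]


def sysrq_state(raw: str) -> str:
    """Interpret the contents of /proc/sys/kernel/sysrq."""
    value = int(raw.strip())
    if value == 0:
        return "disabled"
    if value == 1:
        return "all functions enabled"
    # Any other value is a bitmask that enables only some SysRq functions.
    return f"bitmask {value} (only some functions enabled)"


def read_or_none(path: str) -> str | None:
    """Read a file, returning None if it is missing or unreadable."""
    try:
        return Path(path).read_text()
    except OSError:
        return None


if __name__ == "__main__":
    cmdline = read_or_none("/proc/cmdline")
    sysrq = read_or_none("/proc/sys/kernel/sysrq")
    if cmdline is not None:
        print("console parameters:", console_args(cmdline) or "none")
    if sysrq is not None:
        print("sysrq:", sysrq_state(sysrq))
```

If the serial console is configured correctly in GRUB, the first line should list something like a ttyS entry next to tty0; an empty list means the kernel never received the console= parameter.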

I double-checked the documentation: the R815 supports this CPU (official support from Dell; they sold it with 6380s installed). Dell's page says the newest supported OS is CentOS 7. I booted CentOS 7 from USB with its kernel (3.10) and the server worked for more than 2 weeks. A live distribution is not a good test, so I installed CentOS 7 on the HDD: it works (I tested these kernels: 3.10, 4.4, 5.7). I found that PVE uses a modified Ubuntu kernel, so I removed CentOS and installed Ubuntu (18.04 and 20.04, both with kernel 5.4): it works without problems.


I went back to Proxmox and installed the original Debian Buster kernel (4.19): it works, no hangs (of course I had problems with VMs, CTs and ZFS). I did some more tests and installed kernels from backports: 5.4 works (VMs work; CTs work when I disable AppArmor and add swapaccount=1 to the kernel command line in GRUB), 5.7 works. I switched back to 5.4-pve (the newest version from the pve-no-subscription repo): it hangs.


From my tests I know that there is a problem somewhere in the pve kernels:

Kernels that I am sure do not work: 5.0-pve, 5.3-pve, 5.4-pve (I do not remember whether I tested PVE 5.4 with kernel 4.15-pve).

Kernels that work: CentOS 7: 3.10 (stock), 4.4 (external repo), 5.7 (external repo); Ubuntu: 5.4.0-47-generic; Proxmox: 4.19 (Debian Buster), 5.4.19-1~bpo10+1, 5.7.10-1~bpo10+1.
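To make the pattern in the two lists easier to see, here is a small Python sketch that records the results exactly as listed above and looks for kernel series that appear with both outcomes. The names RESULTS, series and mixed_series are just illustrative, and the "source" labels are my own shorthand for where each kernel came from.

```python
from collections import defaultdict

# (source, kernel, hangs) as reported in the two lists above.
RESULTS = [
    ("CentOS 7 stock", "3.10", False),
    ("CentOS 7 external repo", "4.4", False),
    ("CentOS 7 external repo", "5.7", False),
    ("Ubuntu", "5.4.0-47-generic", False),
    ("Debian Buster on Proxmox", "4.19", False),
    ("Debian backports on Proxmox", "5.4.19-1~bpo10+1", False),
    ("Debian backports on Proxmox", "5.7.10-1~bpo10+1", False),
    ("Proxmox", "5.0-pve", True),
    ("Proxmox", "5.3-pve", True),
    ("Proxmox", "5.4-pve", True),
]


def series(kernel: str) -> str:
    """Reduce a version string to major.minor, e.g. '5.4.19-1~bpo10+1' -> '5.4'."""
    major, minor = kernel.split("-")[0].split(".")[:2]
    return f"{major}.{minor}"


def mixed_series(results):
    """Return kernel series that both hang and do not hang, depending on the build."""
    outcomes = defaultdict(set)
    for _source, kernel, hangs in results:
        outcomes[series(kernel)].add(hangs)
    return sorted(s for s, seen in outcomes.items() if seen == {True, False})


if __name__ == "__main__":
    for s in mixed_series(RESULTS):
        print(f"series {s}:")
        for source, kernel, hangs in RESULTS:
            if series(kernel) == s:
                print(f"  {source:30} {kernel:20} {'hangs' if hangs else 'works'}")
```

The only series with both outcomes is 5.4: the Ubuntu and Debian backports builds work while the pve build hangs, which is why I suspect something specific to the PVE kernels and not the upstream version itself.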

The older 6134 CPU has far fewer instructions and features than the 6380. I do not know whether the problem exists on other CPUs from the 63xx line. I found reports on this forum and elsewhere on the internet that Proxmox worked on this CPU, but all of them are for kernels older than 4.15.

Now my server (with 4x 6380 installed) runs in the cluster on kernel 5.4.19-1~bpo10+1, with C1E and DMA Virtualization enabled. I have not had a single hang in more than 1.5 months. The upgrade to PVE 6.3 went without problems, Ceph Octopus works without problems, and migrating VMs to this node works without problems. I am not using ZFS on it. If I want to run a CT on this node I must add lxc.apparmor.profile: unconfined to its configuration, and then the CT starts without problems (without this line there is a problem with names).

I cannot add more information because I was unable to get any: no logs, no output on the screen. The only thing I know is that when the server hangs, it starts to use more power.
The hangs occurred only at idle, at random intervals: sometimes after 1 day, sometimes after 7 days, sometimes a few minutes after the end of a stress test or after an SSH connection to this server, and sometimes a few hours after a reboot.
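Since nothing reaches the logs, one way to at least narrow down when a hang happened is a heartbeat file that is flushed to disk on every write. The following is only a sketch of that idea in Python, not something I used in the tests above; the function names, the file path and the interval are all assumptions.

```python
from __future__ import annotations

import os
import time
from datetime import datetime, timezone


def write_heartbeat(path: str) -> str:
    """Append a UTC timestamp to the file and force it to disk."""
    stamp = datetime.now(timezone.utc).isoformat()
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(stamp + "\n")
        fh.flush()
        os.fsync(fh.fileno())  # ask the kernel to write the data out before continuing
    return stamp


def last_heartbeat(path: str) -> str | None:
    """Return the last timestamp in the file, or None if there is none."""
    try:
        with open(path, encoding="utf-8") as fh:
            lines = [line.strip() for line in fh if line.strip()]
    except OSError:
        return None
    return lines[-1] if lines else None


def run(path: str, interval_s: float, beats: int) -> None:
    """Write `beats` heartbeats, sleeping `interval_s` seconds between them."""
    for i in range(beats):
        write_heartbeat(path)
        if i < beats - 1:
            time.sleep(interval_s)
```

For example, run("/var/log/heartbeat.log", 10, 8640) would write a timestamp every 10 seconds for about a day (the path is hypothetical). After a forced reboot, last_heartbeat on the same file gives the latest time the machine was still alive, to within one interval.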
 