Random Proxmox Server Hang | No VMs | No Web-Gui

MatthiasMT

New Member
Oct 8, 2019
Not sure how to post this, since it's my first time, but I'm wondering if anyone else has encountered this strange issue.

At random times my entire server hangs/crashes, meaning all VMs stop responding, and so does the web GUI for accessing Proxmox. Restarting seems to fix the issue.

Checking the logs, all I have to share is this from /var/log/syslog:

https://prnt.sc/pgjio2 (linking to preserve screenshot quality)

The only thing strange I see is the repeated ^@^@^@; no idea what that means.

Anyone ever encountered this?
 

Attachments

  • Capture.PNG (232.3 KB)
The ^@^@^@ characters you're seeing are stand-ins for invalid/null (NUL) characters. Entries like this usually occur when the system experiences a hard crash.
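As a quick check, you can count how many syslog lines contain such NUL bytes with standard grep (-a treats the file as text even if it looks binary):

Code:
# grep -caP '\x00' /var/log/syslog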

Could you give a bit more information about your setup and the issue? What PVE version are you running (run pveversion -v), what hardware, storage, etc.? Also, when the system hangs, is it a complete hang (e.g. does SSH work? local physical terminal access? can you run 'reboot', or do you need to press the power/reset button), or just the PVE services?
 
Thanks for the reply and for the info about the invalid/null characters and the hard crash. Regarding what you requested:

Regular Proxmox installation on an NVMe

pveversion -v
  • proxmox-ve: 6.0-2 (running kernel: 5.0.21-2-pve)
  • pve-manager: 6.0-7 (running version: 6.0-7/28984024)
  • pve-kernel-5.0: 6.0-8
  • pve-kernel-helper: 6.0-8
  • pve-kernel-4.15: 5.4-6
  • pve-kernel-5.0.21-2-pve: 5.0.21-6
  • pve-kernel-5.0.18-1-pve: 5.0.18-3
  • pve-kernel-4.15.18-18-pve: 4.15.18-44
  • pve-kernel-4.15.18-10-pve: 4.15.18-32
  • ceph: 12.2.12-pve1
  • ceph-fuse: 12.2.12-pve1
  • corosync: 3.0.2-pve2
  • criu: 3.11-3
  • glusterfs-client: 5.5-3
  • ksm-control-daemon: 1.3-1
  • libjs-extjs: 6.0.1-10
  • libknet1: 1.12-pve1
  • libpve-access-control: 6.0-2
  • libpve-apiclient-perl: 3.0-2
  • libpve-common-perl: 6.0-5
  • libpve-guest-common-perl: 3.0-1
  • libpve-http-server-perl: 3.0-2
  • libpve-storage-perl: 6.0-9
  • libqb0: 1.0.5-1
  • lvm2: 2.03.02-pve3
  • lxc-pve: 3.1.0-65
  • lxcfs: 3.0.3-pve60
  • novnc-pve: 1.1.0-1
  • proxmox-mini-journalreader: 1.1-1
  • proxmox-widget-toolkit: 2.0-7
  • pve-cluster: 6.0-7
  • pve-container: 3.0-7
  • pve-docs: 6.0-4
  • pve-edk2-firmware: 2.20190614-1
  • pve-firewall: 4.0-7
  • pve-firmware: 3.0-2
  • pve-ha-manager: 3.0-2
  • pve-i18n: 2.0-3
  • pve-qemu-kvm: 4.0.0-5
  • pve-xtermjs: 3.13.2-1
  • qemu-server: 6.0-7
  • smartmontools: 7.0-pve2
  • spiceterm: 3.1-1
  • vncterm: 1.6-1
  • zfsutils-linux: 0.8.1-pve2
System Specs:

[screenshot of system specs attached]

After the "hang" happens, there is no way to communicate with the server, "No Ssh", "No Web-Gui", "No Ping" & the only way is to do a hard restart,

P.S. The local terminal I don't think I've tried yet; next time it happens I'll be sure to check.

Hope this helps!
 
If the local terminal also doesn't show any useful information, a hardware error seems likely. Maybe try running a memory test (memtest86+) from the GRUB menu on boot.
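If the memtest entry is missing from your GRUB menu, installing the memtest86+ package should add one (a sketch, assuming a standard Debian-based install booting via GRUB):

Code:
# apt install memtest86+
# update-grub
# reboot

Then pick the memtest86+ entry from the GRUB menu.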

Alternatively, a more advanced method would be to install kdump-tools. This should provide you with a log even in case of a kernel panic. There's more to it than that (many tutorials online), but in general:

Code:
# apt install kdump-tools

Select no for kexec reboots
Select yes for enabling kdump-tools

# $EDITOR /etc/default/grub

Add 'nmi_watchdog=1' to the end of 'GRUB_CMDLINE_LINUX_DEFAULT'

# $EDITOR /etc/default/grub.d/kdump-tools.cfg

Change 128M to 256M at the end of the line

# update-grub
# reboot
# cat /sys/kernel/kexec_crash_loaded

should show 1 now
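
For reference, after those edits the relevant lines might look roughly like this (a sketch; the exact crashkernel default depends on the kdump-tools version):

Code:
# in /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet nmi_watchdog=1"

# in /etc/default/grub.d/kdump-tools.cfg
GRUB_CMDLINE_LINUX_DEFAULT="$GRUB_CMDLINE_LINUX_DEFAULT crashkernel=384M-:256M"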

Next time the system crashes, it should automatically reboot after a while. You can then find a crash log in /var/crash/<date>/dmesg.

Remove kdump-tools once you've figured out your error, since it takes some RAM away from your running system.
 
Greetings,

I'm having this same issue on two Dell Optiplex 390 DT machines (two identical Intel Sandy Bridge systems, except for RAM and hard drive).
The main node in the cluster, a Dell Optiplex 7010 MT, doesn't have this issue.
It happened to me twice on each of the two affected nodes after updating to PVE 6.
From the gaps in the logging data, it happens after roughly 12 hours, give or take; this might just be coincidence, however.

The nodes become completely unresponsive: no web GUI, no SSH, and if I plug in a monitor it goes into standby. A keyboard or mouse doesn't seem to initialise (Caps Lock doesn't toggle on and off) when plugged in after the freeze occurs. In this state there is no response to the power button either.
They boot back up like nothing happened after holding the power button to force them off.
And also for me, there is ^@ in /var/log/syslog at the time of the crash if I check it afterwards.

pveversion -v is identical on all 3 nodes:
proxmox-ve: 6.0-2 (running kernel: 5.0.21-2-pve)
pve-manager: 6.0-7 (running version: 6.0-7/28984024)
pve-kernel-5.0: 6.0-8
pve-kernel-helper: 6.0-8
pve-kernel-4.15: 5.4-9
pve-kernel-5.0.21-2-pve: 5.0.21-6
pve-kernel-4.15.18-21-pve: 4.15.18-48
pve-kernel-4.13.13-2-pve: 4.13.13-33
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.2-pve2
criu: 3.11-3
glusterfs-client: 5.5-3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.12-pve1
libpve-access-control: 6.0-2
libpve-apiclient-perl: 3.0-2
libpve-common-perl: 6.0-5
libpve-guest-common-perl: 3.0-1
libpve-http-server-perl: 3.0-2
libpve-storage-perl: 6.0-9
libqb0: 1.0.5-1
lvm2: 2.03.02-pve3
lxc-pve: 3.1.0-65
lxcfs: 3.0.3-pve60
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.0-7
pve-cluster: 6.0-7
pve-container: 3.0-7
pve-docs: 6.0-4
pve-edk2-firmware: 2.20190614-1
pve-firewall: 4.0-7
pve-firmware: 3.0-2
pve-ha-manager: 3.0-2
pve-i18n: 2.0-3
pve-qemu-kvm: 4.0.0-5
pve-xtermjs: 3.13.2-1
qemu-server: 6.0-9
smartmontools: 7.0-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.1-pve2

Some observations:

I previously discussed this problem with some people, and they recommended turning off all the power management options I could in the BIOS, which I did for one node. Said node has now been running for 24 hours without a problem.
I noticed that both the ASRock B450M Pro4 that Mathias and MrSoupman have and my Dell Optiplex 390 DT have a Realtek RTL8111 onboard network chipset, while my main node, the Dell Optiplex 7010 MT, which doesn't have the issue, has an Intel chipset.
This ties in to the first observation: on my Intel system, disabling power states also seemed to make the system more stable. Is this really a Ryzen-related problem? (A lot of posts I can find on the Ryzen C6 problem are also using boards with Realtek chipsets.)
 
Great info, glad to know it's not just my hardware config that it's happening to.

I could try changing some power management settings in the BIOS; I left it mostly at defaults, but I do remember trying to get Wake-on-LAN working, even though I don't use it much. Aside from that, I also tried messing around with PCIe passthrough, which I also don't use, but if I remember correctly that's mostly software configuration.

The only thing is that I can't remember if this used to happen before or after PVE 6; I think it happened to me before the upgrade as well. In my case, this issue happens on a weekly to monthly basis: I've had this setup for 6 months now and this server hang has happened to me around 6 times at most.
 
Hello,

I built a new infrastructure at home, and I have the same problem as you, even the ^@^@^@ in the syslog.

Code:
Oct 16 09:31:09 p3 systemd-udevd[280337]: link_config: autonegotiation is unset or enabled, the speed and duplex are not writable.
Oct 16 09:31:09 p3 systemd-udevd[280337]: Could not generate persistent MAC address for fwpr166p0: No such file or directory
Oct 16 09:31:09 p3 systemd-udevd[280336]: link_config: autonegotiation is unset or enabled, the speed and duplex are not writable.
Oct 16 09:31:09 p3 systemd-udevd[280336]: Could not generate persistent MAC address for fwln166i0: No such file or directory
Oct 16 09:31:09 p3 kernel: [39593.379081] vmbr1: port 3(fwpr166p0) entered blocking state
Oct 16 09:31:09 p3 kernel: [39593.379086] vmbr1: port 3(fwpr166p0) entered disabled state
Oct 16 09:31:09 p3 kernel: [39593.379375] device fwpr166p0 entered promiscuous mode
Oct 16 09:31:09 p3 kernel: [39593.392448] fwbr166i0: port 2(tap166i0) entered blocking state
Oct 16 09:31:09 p3 kernel: [39593.392452] fwbr166i0: port 2(tap166i0) entered disabled state
Oct 16 09:31:09 p3 kernel: [39593.392782] fwbr166i0: port 2(tap166i0) entered blocking state
Oct 16 09:31:09 p3 kernel: [39593.392786] fwbr166i0: port 2(tap166i0) entered forwarding state
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@Oct 16 09:48:54 p3 systemd-modules-load[369]: Inserted module 'coretemp'
Oct 16 09:48:54 p3 systemd-modules-load[369]: Inserted module 'iscsi_tcp'
Oct 16 09:48:54 p3 systemd-modules-load[369]: Inserted module 'ib_iser'
Oct 16 09:48:54 p3 systemd-modules-load[369]: Inserted module 'vhost_net'
Oct 16 09:48:54 p3 lvm[371]:   1 logical volume(s) in volume group "VGEMMCP3" monitored
Oct 16 09:48:54 p3 systemd[1]: Starting Flush Journal to Persistent Storage...
Oct 16 09:48:54 p3 systemd[1]: Started udev Kernel Device Manager.
Oct 16 09:48:54 p3 systemd[1]: Started Flush Journal to Persistent Storage.
Oct 16 09:48:54 p3 systemd[1]: Started udev Coldplug all Devices.
Oct 16 09:48:54 p3 systemd[1]: Starting Helper to synchronize boot up for ifupdown...
Oct 16 09:48:54 p3 systemd-udevd[409]: Using default interface naming schem

The node suddenly stops responding. I can still ping the host, but everything else stops working.


I had no problems before with my old infrastructure, but I wasn't using Ceph. Maybe that's the source of the problem?

Dark26
 
In my case, SSH, the local physical terminal, and reboot don't work. The power button is needed.

Dark26
 
I don't use Ceph, so I doubt it's the cause.
 
The node suddenly stops responding. I can still ping the host, but everything else stops working.

Ping is working? Even when the local terminal doesn't?

A hard crash like the one you describe is unlikely to be caused by anything in userspace; even in-kernel filesystem code like Ceph seldom leads to kernel panics like these. A kernel panic is usually related to a kernel bug or a hardware fault, since for the OS, a panic is the absolute last resort when no other resolution (like logging or displaying an error) is feasible. This also doesn't necessarily mean that your errors are related at all.

Little tip on the side: as a (pretty bad) workaround, setting sysctl kernel.panic=60 makes the system restart automatically 1 minute after a kernel panic (without kdump, at least). This does *not* fix the issue (e.g. potential disk corruption because of the crash), but at least means you don't need to physically press the power button if it happens when you're not home, for example.
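
To apply that immediately and keep it across reboots, something like this works (a sketch; the file name under /etc/sysctl.d is arbitrary):

Code:
# sysctl kernel.panic=60
# echo 'kernel.panic = 60' > /etc/sysctl.d/99-panic-reboot.conf
# sysctl -p /etc/sysctl.d/99-panic-reboot.conf

The last command re-reads the file, so you can verify the value was accepted.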
 

Thanks for all the information. I installed kdump-tools, as you wrote before, on two of the nodes. I'm waiting for the third to come online to install it there too, and then I'll reboot them all.

And thanks for sysctl kernel.panic=60; it will save my life until I can find the problem.

For the record:

Code:
root@p1:~# pvecm  status
Quorum information
------------------
Date:             Wed Oct 16 13:39:19 2019
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          0x00000001
Ring ID:          1/232
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      2
Quorum:           2
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 10.10.10.91 (local)
0x00000003          1 10.10.10.93
root@p1:~# ping p2
PING p2.mendes63.fr (10.10.10.92) 56(84) bytes of data.
64 bytes from p2.toto63.fr (10.10.10.92): icmp_seq=1 ttl=64 time=0.215 ms
64 bytes from p2.toto63.fr (10.10.10.92): icmp_seq=2 ttl=64 time=0.188 ms
64 bytes from p2.toto63.fr (10.10.10.92): icmp_seq=3 ttl=64 time=0.271 ms
64 bytes from p2.toto63.fr (10.10.10.92): icmp_seq=4 ttl=64 time=0.466 ms
64 bytes from p2.toto63.fr (10.10.10.92): icmp_seq=5 ttl=64 time=0.430 ms
^C
--- p2toto63.fr ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 89ms
rtt min/avg/max/mdev = 0.188/0.314/0.466/0.113 ms
root@p1:~# ssh p2
^C
root@p1:~# ssh 10.10.10.92
^C
root@p1:~# ping 10.10.10.92
PING 10.10.10.92 (10.10.10.92) 56(84) bytes of data.
64 bytes from 10.10.10.92: icmp_seq=1 ttl=64 time=0.233 ms
64 bytes from 10.10.10.92: icmp_seq=2 ttl=64 time=0.244 ms
64 bytes from 10.10.10.92: icmp_seq=3 ttl=64 time=0.200 ms
64 bytes from 10.10.10.92: icmp_seq=4 ttl=64 time=0.221 ms
64 bytes from 10.10.10.92: icmp_seq=5 ttl=64 time=0.276 ms
^C
--- 10.10.10.92 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 98ms
rtt min/avg/max/mdev = 0.200/0.234/0.276/0.031 ms

root@p1:~# ping 10.10.5.92
PING 10.10.5.92 (10.10.5.92) 56(84) bytes of data.
64 bytes from 10.10.5.92: icmp_seq=1 ttl=64 time=0.313 ms
64 bytes from 10.10.5.92: icmp_seq=2 ttl=64 time=0.376 ms
^C
--- 10.10.5.92 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 24ms
rtt min/avg/max/mdev = 0.313/0.344/0.376/0.036 ms
root@p1:~# ping 192.168.1.92
PING 192.168.1.92 (192.168.1.92) 56(84) bytes of data.
64 bytes from 192.168.1.92: icmp_seq=1 ttl=64 time=0.272 ms
64 bytes from 192.168.1.92: icmp_seq=2 ttl=64 time=0.498 ms
64 bytes from 192.168.1.92: icmp_seq=3 ttl=64 time=0.434 ms
^C
--- 192.168.1.92 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 48ms
rtt min/avg/max/mdev = 0.272/0.401/0.498/0.096 ms
root@p1:~#
 
Little update from my side:
I re-enabled and added non-free repositories to my sources.list and installed intel-microcode on both nodes (roughly as sketched below), which did seem to improve stability.
However, they still crash after about 2 days (which is still better than the previous 12 hours).
Really odd.
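
For reference, a sketch of that, assuming Debian Buster sources (adjust the release name to yours; intel-microcode lives in non-free and pulls iucode-tool from contrib):

Code:
# $EDITOR /etc/apt/sources.list
  (append "contrib non-free" to the buster deb lines)
# apt update && apt install intel-microcode
# reboot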

I just had another node crash, this time without any ^@ in the log.

Thanks for the tip about sysctl kernel.panic=60, so I at least don't have to keep manually rebooting them. I will put it on the node that runs the more important VMs and leave the other one without it, to double-check whether I can ping it like Dark26 can.

I also do not use Ceph, and I also ran memtest, which came up with no issues. I will also try to see what kdump tells me.
 
Another crash this morning:

Code:
[20357.427603] libceph: mon1 10.10.5.91:6789 session lost, hunting for new mon
[20357.427619] libceph: mon0 10.10.5.91:6789 socket closed (con state OPEN)
[20357.427635] libceph: mon0 10.10.5.91:6789 session lost, hunting for new mon
[20357.430578] libceph: mon2 10.10.5.93:6789 session established
[20357.430960] libceph: mon2 10.10.5.93:6789 session established
[20411.040187] libceph: osd0 up
[20488.596221] fwbr119i0: port 2(veth119i0) entered disabled state
[20489.494818] audit: type=1400 audit(1571260351.544:21): apparmor="STATUS" operation="profile_remove" profile="/usr/bin/lxc-start" name="lxc-119_</var/lib/lxc>" pid=86254 comm="apparmor_parser"
[20491.540746] fwbr119i0: port 2(veth119i0) entered disabled state
[20491.541208] device veth119i0 left promiscuous mode
[20491.541220] fwbr119i0: port 2(veth119i0) entered disabled state
[20491.591756] fwbr119i0: port 1(fwln119i0) entered disabled state
[20491.592045] vmbr1: port 4(fwpr119p0) entered disabled state
[20491.592432] device fwln119i0 left promiscuous mode
[20491.592444] fwbr119i0: port 1(fwln119i0) entered disabled state
[20491.611173] device fwpr119p0 left promiscuous mode
[20491.611187] vmbr1: port 4(fwpr119p0) entered disabled state
[27678.315480] perf: interrupt took too long (9653 > 9646), lowering kernel.perf_event_max_sample_rate to 20500
[44309.526089] hrtimer: interrupt took 66193 ns
[54600.212333] libceph: osd1 down
[54600.212352] libceph: osd1 up
[54600.212359] libceph: osd0 down
[54600.212366] libceph: osd0 up
[61666.627779] device tap254i0 entered promiscuous mode
[61666.652734] vmbr2: port 2(tap254i0) entered blocking state
[61666.652747] vmbr2: port 2(tap254i0) entered disabled state
[61666.653117] vmbr2: port 2(tap254i0) entered blocking state
[61666.653125] vmbr2: port 2(tap254i0) entered forwarding state
[61668.512333] ------------[ cut here ]------------
[61668.512348] kernel BUG at drivers/mmc/host/sdhci.c:734!
[61668.512360] invalid opcode: 0000 [#1] SMP NOPTI
[61668.512366] CPU: 0 PID: 196644 Comm: kworker/0:0H Kdump: loaded Not tainted 5.0.21-2-pve #1
[61668.512370] Hardware name: Acute angle AA-B4/AB4, BIOS 00.14 04/26/2018
[61668.512381] Workqueue: kblockd blk_mq_run_work_fn
[61668.512391] RIP: 0010:sdhci_send_command+0xa77/0xd10 [sdhci]
[61668.512395] Code: 48 c1 ea 20 89 50 08 8b 83 50 03 00 00 48 01 45 c8 e9 7f fd ff ff 48 8b 83 c0 02 00 00 48 8b 40 38 48 8b 40 10 e9 f9 fb ff ff <0f> 0b 48 8b 45 c8 bf 21 00 00 00 66 89 38 66 44 89 50 02 44 89 70
[61668.512401] RSP: 0018:ffffa16d48967b78 EFLAGS: 00010006
[61668.512405] RAX: 000000026c381000 RBX: ffff8c5376126580 RCX: ffff8c5376126000
[61668.512408] RDX: 0000000000000002 RSI: ffff8c536b4c8668 RDI: ffff8c5376126580
[61668.512411] RBP: ffffa16d48967be8 R08: ffff8c5376126000 R09: ffff8c536b4ed000
[61668.512415] R10: 0000000000080000 R11: ffff8c5377a22084 R12: 0000000000000000
[61668.512418] R13: ffff8c536b4ed000 R14: 0000000199400000 R15: 0000000000000000
[61668.512422] FS: 0000000000000000(0000) GS:ffff8c5377a00000(0000) knlGS:0000000000000000
[61668.512426] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[61668.512429] CR2: 000055e86476248c CR3: 000000025160e000 CR4: 00000000003426f0
[61668.512432] Call Trace:
[61668.512443] sdhci_request+0xb7/0x100 [sdhci]
[61668.512450] __mmc_start_request+0x86/0x180
[61668.512454] mmc_start_request+0xa2/0xc0
[61668.512462] mmc_blk_mq_issue_rq+0x365/0x900 [mmc_block]
[61668.512468] ? ktime_get+0x40/0xa0
[61668.512472] ? blk_add_timer+0x5d/0xa0
[61668.512477] mmc_mq_queue_rq+0x12d/0x260 [mmc_block]
[61668.512482] blk_mq_dispatch_rq_list+0x93/0x510
[61668.512487] ? deadline_remove_request+0x4e/0xb0
[61668.512492] blk_mq_do_dispatch_sched+0x67/0x100
[61668.512496] blk_mq_sched_dispatch_requests+0x120/0x170
[61668.512500] __blk_mq_run_hw_queue+0x57/0xf0
[61668.512504] blk_mq_run_work_fn+0x1b/0x20
[61668.512509] process_one_work+0x20f/0x410
[61668.512513] worker_thread+0x34/0x400
[61668.512518] kthread+0x120/0x140
[61668.512521] ? process_one_work+0x410/0x410
[61668.512525] ? __kthread_parkme+0x70/0x70
[61668.512530] ret_from_fork+0x1f/0x40
[61668.512534] Modules linked in: des_generic md4 nls_utf8 cifs ccm rbd veth tcp_diag inet_diag ceph libceph fscache ebtable_filter ebtables ip_set ip6table_filter ip6_tables sctp iptable_filter bpfilter 8021q garp mrp intel_rapl softdog intel_telemetry_pltdrv intel_punit_ipc intel_telemetry_core intel_pmc_ipc x86_pkg_temp_thermal intel_powerclamp nfnetlink_log nfnetlink kvm_intel nls_iso8859_1 crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 crypto_simd cryptd glue_helper intel_cstate intel_rapl_perf snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_soc_skl snd_soc_hdac_hda arc4 snd_hda_ext_core i915 snd_soc_skl_ipc snd_soc_sst_ipc snd_soc_sst_dsp snd_soc_acpi_intel_match snd_soc_acpi btusb kvmgt btrtl snd_soc_core btbcm vfio_mdev snd_compress ax88179_178a mdev btintel iwlmvm ac97_bus vfio_iommu_type1 snd_pcm_dmaengine vfio bluetooth mac80211 usbnet kvm mii input_leds pcspkr snd_hda_intel irqbypass ecdh_generic snd_hda_codec
[61668.512570] drm_kms_helper intel_xhci_usb_role_switch 8250_dw snd_hda_core iwlwifi roles drm snd_hwdep i2c_algo_bit fb_sys_fops snd_pcm syscopyarea sysfillrect sysimgblt snd_timer cfg80211 snd soundcore mei_me mac_hid idma64 mei processor_thermal_device int3400_thermal virt_dma int3403_thermal acpi_thermal_rel int340x_thermal_zone intel_soc_dts_iosf vhost_net vhost tap ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi sunrpc coretemp ip_tables x_tables autofs4 raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid0 multipath linear hid_generic usbkbd usbhid hid raid1 mmc_block spi_pxa2xx_platform i2c_i801 lpc_ich sdhci_pci cqhci sdhci intel_lpss_pci intel_lpss ahci r8169 realtek libahci video pinctrl_broxton pinctrl_intel

I think the problem is: kernel BUG at drivers/mmc/host/sdhci.c:734!
 
I think the problem is: kernel BUG at drivers/mmc/host/sdhci.c:734!

Yeah, looks like it. sdhci.c is the SD card/eMMC host controller driver. Do you maybe have a (micro) SD card connected? Could it be broken?
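
A quick way to check what SD/eMMC devices the kernel sees (standard commands, nothing Proxmox-specific):

Code:
# ls /dev/mmcblk*
# dmesg | grep -i mmc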
 
Yeah, looks like it. sdhci.c is the SD card/eMMC host controller driver. Do you maybe have a (micro) SD card connected? Could it be broken?

I have eMMC storage, which is where the system is installed (remember, homelab).

Back in the day, I had a similar problem for some years with a Bay Trail chipset and eMMC under high load.

I will try the module options sdhci debug_quirks=0x40 debug_quirks2=0x4 (applied as sketched below)

to see if that resolves the problem.
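
A minimal sketch of one way to apply those persistently (the .conf file name is arbitrary; the initramfs update matters if sdhci is loaded early from it):

Code:
# echo 'options sdhci debug_quirks=0x40 debug_quirks2=0x4' > /etc/modprobe.d/sdhci-quirks.conf
# update-initramfs -u -k all
# reboot
# cat /sys/module/sdhci/parameters/debug_quirks

The last command should print 64 (0x40) if the option took effect.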

Thanks again for the debugging info.
 
This morning the node that I enabled sysctl kernel.panic=60 on crashed, and it did not automatically reboot, nor can I ping it.
 
I followed the instructions from Stefan_R for setting up kdump; there's a dump in /var/crash after it crashed.
/sys/kernel/kexec_crash_loaded was reading 1.
 
Well, as I wrote before, the kernel.panic trick does not work if you enable kdump (since the system immediately kexec's itself into the crash kernel, where it's no longer in a panicked state), although it should have rebooted that way too.

In your '/var/crash' folder, there should be a file called dmesg.<something>, which should contain an error message for your panic somewhere near the end. If you post it, I can also take a look at it.
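
For example (the exact directory and file names depend on the crash timestamp):

Code:
# ls /var/crash/
# tail -n 60 /var/crash/*/dmesg.*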
 
Yeah, I was running kdump on one of the nodes, and the kernel.panic setting on the other.
 
