pve node crash

Oct 2, 2024
Hi,

I have a cluster with 4 nodes.

I have this event on one node:

Oct 29 09:48:31 pve3 kernel: rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
Oct 29 09:48:31 pve3 kernel: rcu: Tasks blocked on level-1 rcu_node (CPUs 0-9): P1365/2:b..l
Oct 29 09:48:31 pve3 kernel: rcu: (detected by 7, t=8520237 jiffies, g=27942925, q=150367 ncpus=20)
Oct 29 09:48:31 pve3 kernel: task:pve-firewall state:D stack:0 pid:1365 tgid:1365 ppid:1 flags:0x00004002
Oct 29 09:48:31 pve3 kernel: Call Trace:
Oct 29 09:48:31 pve3 kernel: <TASK>
Oct 29 09:48:31 pve3 kernel: __schedule+0x401/0x15e0
Oct 29 09:48:31 pve3 kernel: ? printk_get_next_message+0x96/0x320
Oct 29 09:48:31 pve3 kernel: schedule+0x33/0x110
Oct 29 09:48:31 pve3 kernel: schedule_preempt_disabled+0x15/0x30
Oct 29 09:48:31 pve3 kernel: rwsem_down_read_slowpath+0x284/0x4d0
Oct 29 09:48:31 pve3 kernel: down_read+0x48/0xc0
Oct 29 09:48:31 pve3 kernel: acct_collect+0x192/0x240
Oct 29 09:48:31 pve3 kernel: do_exit+0x249/0xae0
Oct 29 09:48:31 pve3 kernel: ? _printk+0x60/0x90
Oct 29 09:48:31 pve3 kernel: make_task_dead+0x83/0x170
Oct 29 09:48:31 pve3 kernel: rewind_stack_and_make_dead+0x17/0x20
Oct 29 09:48:31 pve3 kernel: RIP: 0033:0x701617443293
Oct 29 09:48:31 pve3 kernel: RSP: 002b:00007ffd7b92c818 EFLAGS: 00000246 ORIG_RAX: 0000000000000038
Oct 29 09:48:31 pve3 kernel: RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 0000701617443293
Oct 29 09:48:31 pve3 kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000001200011
Oct 29 09:48:31 pve3 kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
Oct 29 09:48:31 pve3 kernel: R10: 0000701617331e50 R11: 0000000000000246 R12: 0000000000000001
Oct 29 09:48:31 pve3 kernel: R13: 00007ffd7b92c930 R14: 00007ffd7b92c9b0 R15: 000070161766a020
Oct 29 09:48:31 pve3 kernel: </TASK>

What could be the cause?

Regards
 
Hi!

Could you provide further information on the running kernel version (uname -r) and package versions (pveversion -v)? What workload was running on the PVE node while this RCU stall was detected?
 
Hi,

Here is the info. Only 4 VMs are running, with no CPU usage. They are just for evaluation: Ubuntu VMs, Ubuntu LXC containers, Windows Server 2025...
Everything else seems OK; the only problem at the moment is the node crash.

I should mention that this is a lab evaluation for a future investment: the main goal is to prepare for replacing VMware with Proxmox and Ceph.
The final hardware will be 8 nodes with AMD EPYC 9000 series CPUs if all testing goes well.


1730304975921.png


root@pve1:~# uname -r
6.8.12-2-pve
root@pve1:~# pveversion -v
proxmox-ve: 8.2.0 (running kernel: 6.8.12-2-pve)
pve-manager: 8.2.7 (running version: 8.2.7/3e0176e6bb2ade3b)
proxmox-kernel-helper: 8.1.0
proxmox-kernel-6.8: 6.8.12-2
proxmox-kernel-6.8.12-2-pve-signed: 6.8.12-2
proxmox-kernel-6.8.4-2-pve-signed: 6.8.4-2
ceph: 18.2.4-pve3
ceph-fuse: 18.2.4-pve3
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx9
intel-microcode: 3.20240813.1~deb12u1
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-4
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.1
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.4
libpve-access-control: 8.1.4
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.0.7
libpve-cluster-perl: 8.0.7
libpve-common-perl: 8.2.3
libpve-guest-common-perl: 5.1.4
libpve-http-server-perl: 5.1.1
libpve-network-perl: 0.9.8
libpve-rs-perl: 0.8.10
libpve-storage-perl: 8.2.5
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.4.0-4
proxmox-backup-client: 3.2.7-1
proxmox-backup-file-restore: 3.2.7-1
proxmox-firewall: 0.5.0
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.2.3
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.7
proxmox-widget-toolkit: 4.2.3
pve-cluster: 8.0.7
pve-container: 5.2.0
pve-docs: 8.2.3
pve-edk2-firmware: 4.2023.08-4
pve-esxi-import-tools: 0.7.2
pve-firewall: 5.0.7
pve-firmware: 3.13-2
pve-ha-manager: 4.0.5
pve-i18n: 3.2.3
pve-qemu-kvm: 9.0.2-3
pve-xtermjs: 5.3.0-3
qemu-server: 8.2.4
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.6-pve1
 
Are the package versions the same on all nodes? So far I can only see the versions for pve1. It would be interesting to know which kernel version is running on pve3, the node that crashed, and whether its packages are as up to date as on pve1.
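
A rough way to check this, assuming root SSH access between the nodes (an assumption, not something stated in this thread): save the `pveversion -v` output of each node to a file and print only the lines that differ. The function name `compare_pveversions` and the file names are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the lines that differ between two saved
# `pveversion -v` outputs. Returns 0 if they match, 1 if they differ.
#
# Collecting the inputs (assumes SSH access to the nodes):
#   ssh root@pve1 pveversion -v > pve1.txt
#   ssh root@pve3 pveversion -v > pve3.txt
#   compare_pveversions pve1.txt pve3.txt
compare_pveversions() {
    a_sorted=$(mktemp) || return 2
    b_sorted=$(mktemp) || return 2
    sort "$1" > "$a_sorted"
    sort "$2" > "$b_sorted"
    # comm -3 suppresses lines common to both files; lines only in the
    # second file are printed with a leading tab.
    diff_lines=$(comm -3 "$a_sorted" "$b_sorted")
    rm -f "$a_sorted" "$b_sorted"
    if [ -z "$diff_lines" ]; then
        return 0
    fi
    printf '%s\n' "$diff_lines"
    return 1
}
```

Since the first line of `pveversion -v` includes the running kernel, a node that has not been rebooted into the newest kernel would show up here as well.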

The stack trace in the RCU stall warning seems rather generic to me. The pve-firewall service runs roughly every 10 seconds to update the firewall rules for the cluster/node/VMs, so it shouldn't consume much CPU time. Are all nodes running the latest BIOS firmware and CPU microcode? Could you also provide the syslog from when pve3 crashed and from the next boot (journalctl -b <boot-number>)?
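
The log request above could be scripted roughly like this. It is a minimal sketch, assuming the journal on pve3 is persistent (otherwise earlier boots are not available) and that the crashed boot is the one right before the current boot. `collect_boot_logs` and the output file names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: save the journal of the crashed boot and of the
# boot that followed it.
# Usage: collect_boot_logs OUTDIR [CRASHED_BOOT]
#   CRASHED_BOOT is a journalctl boot offset; it defaults to -1
#   (the boot before the current one).
collect_boot_logs() {
    outdir=$1
    crashed=${2:--1}
    mkdir -p "$outdir" || return 1
    # Overview of known boots, to double-check the chosen offset.
    journalctl --list-boots --no-pager > "$outdir/boots.txt"
    # Full journal of the crashed boot and of the boot after it.
    journalctl -b "$crashed" --no-pager > "$outdir/crashed-boot.log"
    journalctl -b "$((crashed + 1))" --no-pager > "$outdir/next-boot.log"
    # Kernel messages only, where the RCU stall warnings are logged.
    journalctl -k -b "$crashed" --no-pager > "$outdir/crashed-boot-kernel.log"
}
```

Running `collect_boot_logs /root/pve3-logs` on pve3 would then leave four files that can be attached to the thread. If `boots.txt` shows the crash happened several boots ago, pass that offset as the second argument.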
 
All nodes have the same installation: same PVE version, same hardware, same BIOS version, same microcode...
Regarding the BIOS, a new release came out this week and has not been applied yet. We will apply it next Saturday.

I will try to collect the requested info.
 
