Hello,
I am losing hope; my VMs won't start, and any help is really needed.
There are two nodes whose hardware did not restart and which are still running 4 VMs; I am afraid that if I reboot them, the VMs will never start again.
When I try to start a VM, the task is handed to HA, but it never succeeds.
If the VM does not have HA configured, I can try to start it, but it times out. I also can't open a console to it, as that times out too (using web noVNC from Proxmox VE).
Ceph is "apparently" showing everything clean, but I am not sure it really is, since I could not test the VMs.
Commands like qm list freeze, and I cannot kill the process, even with -9.
There are VMs on other nodes that I also cannot start.
Please help me get the VMs started, so I have more time to debug what is happening.
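For reference, this is roughly what I am running from the CLI (VMID 101 is just an example, not one of my real guests); both commands hang or time out instead of returning:
Code:
# one of the commands that freezes and cannot be killed
qm list

# starting a non-HA VM manually - this is where I get the timeout
qm start 101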
Scenario:
8 PVE nodes running
Linux 6.8.12-5-pve
pve-manager/8.3.2/
Ceph Squid (19)
Code:
# pveversion --verbose
proxmox-ve: 8.3.0 (running kernel: 6.8.12-5-pve)
pve-manager: 8.3.2 (running version: 8.3.2/3e76eec21c4a14a7)
proxmox-kernel-helper: 8.1.0
proxmox-kernel-6.8: 6.8.12-5
proxmox-kernel-6.8.12-5-pve-signed: 6.8.12-5
proxmox-kernel-6.8.12-1-pve-signed: 6.8.12-1
proxmox-kernel-6.8.4-2-pve-signed: 6.8.4-2
ceph: 19.2.0-pve2
ceph-fuse: 19.2.0-pve2
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx11
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-5
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.1
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.4
libpve-access-control: 8.2.0
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.0.10
libpve-cluster-perl: 8.0.10
libpve-common-perl: 8.2.9
libpve-guest-common-perl: 5.1.6
libpve-http-server-perl: 5.1.2
libpve-network-perl: 0.10.0
libpve-rs-perl: 0.9.1
libpve-storage-perl: 8.3.3
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.5.0-1
proxmox-backup-client: 3.3.2-1
proxmox-backup-file-restore: 3.3.2-2
proxmox-firewall: 0.6.0
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.3.1
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.7
proxmox-widget-toolkit: 4.3.3
pve-cluster: 8.0.10
pve-container: 5.2.3
pve-docs: 8.3.1
pve-edk2-firmware: 4.2023.08-4
pve-esxi-import-tools: 0.7.2
pve-firewall: 5.1.0
pve-firmware: 3.14-2
pve-ha-manager: 4.0.6
pve-i18n: 3.3.2
pve-qemu-kvm: 9.0.2-4
pve-xtermjs: 5.3.0-3
qemu-server: 8.3.3
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.6-pve1
Of those 8 nodes, 1 had been offline for some months due to a motherboard issue.
I was booting that node with its new motherboard; Proxmox was already installed on it (the old OS that had already joined the cluster), so it would come back online. I don't know if that would disturb the cluster synchronization; I think the other nodes would vote and fix this node's config.
At the same time I was restoring a VM from Proxmox Backup Server (the VM didn't exist in this cluster yet).
Suddenly the hardware of 4 nodes rebooted. There is no indication of a power issue.
The servers are HP ProLiant DL380 Gen9, and one is also a Gen9, but 1U.
Maybe it is just a coincidence with restoring the VM and the old node coming online?
Recently I updated the PVE kernel to be able to upgrade Ceph from Reef to Squid.
But the cluster is relatively new, only a few months old, and it was a direct/fresh install of Proxmox VE 8.
I tried selecting kernel Linux 6.8.12-1-pve instead of Linux 6.8.12-5-pve, but there was no change in the dmesg error messages.
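I selected the older kernel from the boot menu; the equivalent way to pin it from the CLI would be something like this (I have not applied a permanent pin):
Code:
# list the kernels proxmox-boot-tool knows about
proxmox-boot-tool kernel list

# pin the older kernel so it is used on the next boots
proxmox-boot-tool kernel pin 6.8.12-1-pve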
Here is the dmesg -T output:
Code:
[Fri Jan 3 13:15:32 2025] INFO: task pve-ha-crm:3025 blocked for more than 122 seconds.
[Fri Jan 3 13:15:32 2025] Tainted: P O 6.8.12-5-pve #1
[Fri Jan 3 13:15:32 2025] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Fri Jan 3 13:15:32 2025] task:pve-ha-crm state:D stack:0 pid:3025 tgid:3025 ppid:1 flags:0x00000002
[Fri Jan 3 13:15:32 2025] Call Trace:
[Fri Jan 3 13:15:32 2025] <TASK>
[Fri Jan 3 13:15:32 2025] __schedule+0x401/0x15e0
[Fri Jan 3 13:15:32 2025] ? try_to_unlazy+0x60/0xe0
[Fri Jan 3 13:15:32 2025] ? terminate_walk+0x65/0x100
[Fri Jan 3 13:15:32 2025] schedule+0x33/0x110
[Fri Jan 3 13:15:32 2025] schedule_preempt_disabled+0x15/0x30
[Fri Jan 3 13:15:32 2025] rwsem_down_write_slowpath+0x392/0x6a0
[Fri Jan 3 13:15:32 2025] down_write+0x5c/0x80
[Fri Jan 3 13:15:32 2025] filename_create+0xaf/0x1b0
[Fri Jan 3 13:15:32 2025] do_mkdirat+0x59/0x180
[Fri Jan 3 13:15:32 2025] __x64_sys_mkdir+0x4a/0x70
[Fri Jan 3 13:15:32 2025] x64_sys_call+0x2e3/0x2480
[Fri Jan 3 13:15:32 2025] do_syscall_64+0x81/0x170
[Fri Jan 3 13:15:32 2025] ? do_user_addr_fault+0x33e/0x660
[Fri Jan 3 13:15:32 2025] ? irqentry_exit_to_user_mode+0x7b/0x260
[Fri Jan 3 13:15:32 2025] ? irqentry_exit+0x43/0x50
[Fri Jan 3 13:15:32 2025] ? exc_page_fault+0x94/0x1b0
[Fri Jan 3 13:15:32 2025] entry_SYSCALL_64_after_hwframe+0x78/0x80
[Fri Jan 3 13:15:32 2025] RIP: 0033:0x731e04219ea7
[Fri Jan 3 13:15:32 2025] RSP: 002b:00007ffeef405e78 EFLAGS: 00000246 ORIG_RAX: 0000000000000053
[Fri Jan 3 13:15:32 2025] RAX: ffffffffffffffda RBX: 000064c124e6c2a0 RCX: 0000731e04219ea7
[Fri Jan 3 13:15:32 2025] RDX: 0000000000000021 RSI: 00000000000001ff RDI: 000064c12b8d1d90
[Fri Jan 3 13:15:32 2025] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000000
[Fri Jan 3 13:15:32 2025] R10: 0000000000000000 R11: 0000000000000246 R12: 000064c124e71c88
[Fri Jan 3 13:15:32 2025] R13: 000064c12b8d1d90 R14: 000064c12b69df88 R15: 00000000000001ff
[Fri Jan 3 13:15:32 2025] </TASK>
[Fri Jan 3 13:15:32 2025] INFO: task pvescheduler:16975 blocked for more than 122 seconds.
[Fri Jan 3 13:15:32 2025] Tainted: P O 6.8.12-5-pve #1
[Fri Jan 3 13:15:32 2025] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Fri Jan 3 13:15:32 2025] task:pvescheduler state:D stack:0 pid:16975 tgid:16975 ppid:16974 flags:0x00000006
[Fri Jan 3 13:15:32 2025] Call Trace:
[Fri Jan 3 13:15:32 2025] <TASK>
[Fri Jan 3 13:15:32 2025] __schedule+0x401/0x15e0
[Fri Jan 3 13:15:32 2025] ? try_to_unlazy+0x60/0xe0
[Fri Jan 3 13:15:32 2025] ? terminate_walk+0x65/0x100
[Fri Jan 3 13:15:32 2025] schedule+0x33/0x110
[Fri Jan 3 13:15:32 2025] schedule_preempt_disabled+0x15/0x30
[Fri Jan 3 13:15:32 2025] rwsem_down_write_slowpath+0x392/0x6a0
[Fri Jan 3 13:15:32 2025] down_write+0x5c/0x80
[Fri Jan 3 13:15:32 2025] filename_create+0xaf/0x1b0
[Fri Jan 3 13:15:32 2025] do_mkdirat+0x59/0x180
[Fri Jan 3 13:15:32 2025] __x64_sys_mkdir+0x4a/0x70
[Fri Jan 3 13:15:32 2025] x64_sys_call+0x2e3/0x2480
[Fri Jan 3 13:15:32 2025] do_syscall_64+0x81/0x170
[Fri Jan 3 13:15:32 2025] ? do_mkdirat+0xe1/0x180
[Fri Jan 3 13:15:32 2025] ? timerqueue_add+0x69/0xd0
[Fri Jan 3 13:15:32 2025] ? enqueue_hrtimer+0x4d/0xc0
[Fri Jan 3 13:15:32 2025] ? hrtimer_start_range_ns+0x12f/0x3b0
[Fri Jan 3 13:15:32 2025] ? do_setitimer+0x1a4/0x230
[Fri Jan 3 13:15:32 2025] ? __x64_sys_alarm+0x76/0xd0
[Fri Jan 3 13:15:32 2025] ? syscall_exit_to_user_mode+0x86/0x260
[Fri Jan 3 13:15:32 2025] ? do_syscall_64+0x8d/0x170
[Fri Jan 3 13:15:32 2025] ? irqentry_exit+0x43/0x50
[Fri Jan 3 13:15:32 2025] ? exc_page_fault+0x94/0x1b0
[Fri Jan 3 13:15:32 2025] entry_SYSCALL_64_after_hwframe+0x78/0x80
[Fri Jan 3 13:15:32 2025] RIP: 0033:0x73911a227ea7
[Fri Jan 3 13:15:32 2025] RSP: 002b:00007ffd9254d298 EFLAGS: 00000246 ORIG_RAX: 0000000000000053
[Fri Jan 3 13:15:32 2025] RAX: ffffffffffffffda RBX: 000062febef9e2a0 RCX: 000073911a227ea7
[Fri Jan 3 13:15:32 2025] RDX: 0000000000000026 RSI: 00000000000001ff RDI: 000062fec5cc43e0
[Fri Jan 3 13:15:32 2025] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000000
[Fri Jan 3 13:15:32 2025] R10: 0000000000000000 R11: 0000000000000246 R12: 000062febefa3c88
[Fri Jan 3 13:15:32 2025] R13: 000062fec5cc43e0 R14: 000062febf240368 R15: 00000000000001ff
[Fri Jan 3 13:15:32 2025] </TASK>
...
[Fri Jan 3 13:23:43 2025] Future hung task reports are suppressed, see sysctl kernel.hung_task_warnings
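The traces show pve-ha-crm and pvescheduler stuck in mkdir, which I assume (not confirmed) means they are blocking on the cluster filesystem mounted at /etc/pve. These are the checks I can still run without rebooting anything; I have not drawn any conclusions from them yet:
Code:
# state of the cluster filesystem (pmxcfs) and corosync services
systemctl status pve-cluster.service corosync.service

# recent pve-cluster log messages since boot
journalctl -b -u pve-cluster --no-pager | tail -n 50

# this may hang as well if /etc/pve is blocked
ls /etc/pve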
root@x3:~# ceph -s
Code:
  cluster:
    id:     b682a805-8831-480a-8a8a-c7d896bfecae
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum x1,x2,x3 (age 4h)
    mgr: x2(active, since 4h), standbys: x1, x3
    osd: 10 osds: 10 up (since 69m), 10 in (since 69m)

  data:
    pools:   4 pools, 385 pgs
    objects: 35.15k objects, 131 GiB
    usage:   367 GiB used, 35 TiB / 35 TiB avail
    pgs:     385 active+clean
pvecm status
Code:
root@x3:~# pvecm status
Cluster information
-------------------
Name: m[redacted]3
Config Version: 8
Transport: knet
Secure auth: on
Quorum information
------------------
Date: Fri Jan 3 16:56:31 2025
Quorum provider: corosync_votequorum
Nodes: 7
Node ID: 0x00000005
Ring ID: 2.862
Quorate: Yes
Votequorum information
----------------------
Expected votes: 8
Highest expected: 8
Total votes: 7
Quorum: 5
Flags: Quorate
Membership information
----------------------
Nodeid Votes Name
0x00000002 1 100.80.83.121
0x00000003 1 100.80.83.124
0x00000004 1 100.80.83.125
0x00000005 1 100.80.83.126 (local)
0x00000006 1 100.80.83.127
0x00000007 1 100.80.83.129
0x00000008 1 100.80.83.130
When the hardware rebooted, it took almost an hour and a half to sync, and I had to reboot some of the nodes; after that, pvecm status showed quorate. I think that happened because those nodes got "fenced".
I had SSH access the whole time, and could ping any node from any node.
I had a problem accessing the web UI, but later it was available again.
As a precaution, I disconnected the node that was booting after months offline from the switch.
So the kernel errors in dmesg are from after that node went offline again, and they come from the nodes that are still online.
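To check the fencing/HA side, this is what I am looking at (I have not changed any HA configuration yet):
Code:
# current HA manager view: master, per-node LRM status, and service states
ha-manager status

# HA resources as configured
ha-manager config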