Repeated crashes/reboots of host after update from 6.4 to 7.4

thex

Hi,
I updated two machines yesterday from 6.4 to 7.4; one is running fine, the other is not.

The machine that misbehaves has not been stable for more than 3 hours since it was updated.

The syslog does not show any problems. Most of the time the last entries before a crash are something about the hourly cron or a disk temperature changing or, just now, as it crashed while I was investigating, my login to the web UI a few minutes prior.

Code:
Apr  2 08:37:03 proxmox pvedaemon[3196]: <root@pam> successful auth for user 'root@pam'
Apr  2 08:43:29 proxmox systemd-modules-load[1910]: Inserted module 'iscsi_tcp'
Apr  2 08:43:29 proxmox systemd-modules-load[1910]: Inserted module 'ib_iser'
Apr  2 08:43:29 proxmox systemd-modules-load[1910]: Inserted module 'vhost_net'
Apr  2 08:43:29 proxmox systemd[1]: Starting Flush Journal to Persistent Storage...
Apr  2 08:43:29 proxmox systemd[1]: Finished Coldplug All udev Devices.
Apr  2 08:43:29 proxmox kernel: [    0.000000] Linux version 5.19.17-2-pve (build@proxmox) (gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2) #1 SMP PREEMPT_DYNAMIC PVE 5.19.17-2 (Sat, 28 Jan 2023 16:40:25  ()
Apr  2 08:43:29 proxmox systemd[1]: Starting Helper to synchronize boot up for ifupdown...
Apr  2 08:43:29 proxmox kernel: [    0.000000] Command line: initrd=\EFI\proxmox\5.19.17-2-pve\initrd.img-5.19.17-2-pve root=ZFS=rpool/ROOT/pve-1 boot=zfs

pve-manager/7.4-3/9002ab8a (running kernel: 5.19.17-2-pve)

I already switched from the 5.15 kernel to 5.19, but it did not resolve the issue.

The machine is running on an AMD Ryzen 4650G. Any known issues here?
One additional suspicion is forwarding of USB devices, as this is something the other machine does not do. Sadly it would be quite some work to move all the USB devices to the other server in order to migrate the VM for testing purposes.

Any ideas?
 
Some more hints: this shows all the reboots without a prior shutdown. There was only one proper shutdown, when I switched kernels.

Code:
root@proxmox:~# last -xF reboot shutdown | head
reboot   system boot  5.19.17-2-pve    Sun Apr  2 10:03:37 2023   still running
reboot   system boot  5.19.17-2-pve    Sun Apr  2 08:43:26 2023   still running
reboot   system boot  5.19.17-2-pve    Sun Apr  2 07:59:49 2023   still running
reboot   system boot  5.19.17-2-pve    Sun Apr  2 07:30:31 2023   still running
reboot   system boot  5.19.17-2-pve    Sun Apr  2 04:04:13 2023   still running
reboot   system boot  5.19.17-2-pve    Sun Apr  2 01:27:00 2023   still running
shutdown system down  5.15.102-1-pve   Sun Apr  2 01:25:08 2023 - Sun Apr  2 01:27:00 2023  (00:01)
reboot   system boot  5.15.102-1-pve   Sat Apr  1 22:25:09 2023 - Sun Apr  2 01:25:08 2023  (02:59)
reboot   system boot  5.15.102-1-pve   Sat Apr  1 20:35:25 2023 - Sun Apr  2 01:25:08 2023  (04:49)
reboot   system boot  5.15.102-1-pve   Sat Apr  1 18:23:34 2023 - Sun Apr  2 01:25:08 2023  (07:01)
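As a side note, the pattern in this listing can be counted mechanically. The following is only a sketch, assuming the `last -xF reboot shutdown` output format shown above (newest entry first); the function name `count_unclean_boots` is made up for illustration. It counts every boot whose previous record was another boot rather than a shutdown.

```shell
# Count boots that were not preceded by a clean shutdown.
# Expects `last -xF reboot shutdown` output (newest first) on stdin.
# Usage: last -xF reboot shutdown | count_unclean_boots
count_unclean_boots() {
    # tac puts the records in chronological order; a "reboot" record that
    # directly follows another "reboot" record means the shutdown was missing.
    tac | awk '
        $1 == "reboot"   { if (prev == "reboot") n++; prev = "reboot" }
        $1 == "shutdown" { prev = "shutdown" }
        END { print n + 0 }
    '
}
```

Fed the ten lines above, this reports 7: every boot except the oldest one (whose predecessor is cut off by `head`) and the one that followed the kernel switch.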
 
Saw a reboot again, and an SSH console I still had open showed this:

Code:
root@proxmox:~#
Message from syslogd@proxmox at Apr  2 11:10:20 ...
 kernel:[ 3937.332015] NMI watchdog: Watchdog detected hard LOCKUP on cpu 4

Message from syslogd@proxmox at Apr  2 11:10:20 ...
 kernel:[ 3943.640004] NMI watchdog: Watchdog detected hard LOCKUP on cpu 13

Message from syslogd@proxmox at Apr  2 11:10:20 ...
 kernel:[ 3954.229196] NMI watchdog: Watchdog detected hard LOCKUP on cpu 0

Message from syslogd@proxmox at Apr  2 11:10:20 ...
 kernel:[ 3970.428009] NMI watchdog: Watchdog detected hard LOCKUP on cpu 1

Message from syslogd@proxmox at Apr  2 11:10:20 ...
 kernel:[ 3975.288005] NMI watchdog: Watchdog detected hard LOCKUP on cpu 6

Message from syslogd@proxmox at Apr  2 11:10:20 ...
 kernel:[ 3995.348011] NMI watchdog: Watchdog detected hard LOCKUP on cpu 5

Message from syslogd@proxmox at Apr  2 11:10:20 ...
 kernel:[ 4000.308008] NMI watchdog: Watchdog detected hard LOCKUP on cpu 15

Trying my luck with the 6.2 kernel now.
 
Ok, it also crashed with 6.2 in less than 2h :( (I had a screen attached this time; there was no output at all at the time of the hang.) This time it did not recover, and I needed to power-cycle it.

Running out of ideas. Does anybody have a tip on which logs to check? The usual suspects don't point me to anything.
 
Ok 5h up and going. Sadly I did three things which might have remedied it.
- added parameter to corosync two_node:1
- updated the bios
- switched some usb devices around due to bios flashing

Let’s keep our fingers crossed.
I hope it’s stable now, but I would also really like to know what exactly fixed it.
 
Hey, I am the guy from the other thread.

I got my NUC and made a fresh install there, no issue. I did update the bios beforehand - I would consider that a necessary step to avoid debugging already solved stuff.

I do have my fingers crossed for you as well - good luck!

V
 
Yes, it’s still up and running, no problems or hiccups so far. Really weird: I did not update the BIOS initially because the machine had been running with months of uptime before the update without any problems.
Maybe the new kernel triggers some bug in/with the old bios. Maybe it is just a setting I set differently as I needed to re-configure it after the bios update. Guess I will never know.
 
I have a similar issue with Proxmox 7.4 - What changes did you make to the bios? thanks.
 
Sadly it still keeps crashing, but much less frequently.

Code:
Apr 22 09:17:01 proxmox CRON[667217]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Apr 22 10:08:10 proxmox pmxcfs[14246]: [dcdb] notice: data verification successful
Apr 22 10:17:01 proxmox CRON[783415]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Apr 22 11:08:10 proxmox pmxcfs[14246]: [dcdb] notice: data verification successful
Apr 22 11:17:01 proxmox CRON[909242]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Apr 22 12:08:10 proxmox pmxcfs[14246]: [dcdb] notice: data verification successful
Apr 22 12:17:01 proxmox CRON[1028286]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Apr 22 23:01:53 proxmox systemd-modules-load[1980]: Inserted module 'iscsi_tcp'
Apr 22 23:01:53 proxmox systemd[1]: Starting Flush Journal to Persistent Storage...
Apr 22 23:01:53 proxmox systemd-modules-load[1980]: Inserted module 'ib_iser'
Apr 22 23:01:53 proxmox kernel: [    0.000000] Linux version 6.2.6-1-pve (build@proxmox) (gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2) #1 SMP PREEMPT_DYNAMIC PVE 6.2.6-1 (2023-03-14T17:08Z) ()

So nothing in the logs, just the cron jobs and then my manual reboot.

@bkejji Interesting, however I hesitate to go back to such an "old" kernel.

Now I'm thinking about setting up the machine for kernel debugging, but that would be a hobby of its own.
 
And again... I even updated the BIOS once more last week, as there was a new version.

I have now set up external monitoring and the host froze at exactly 04:00 in the morning.
Again nothing can be found in the log.

However, the KVM listed some USB devices disconnecting, but that could also be a symptom.
snapshot.jpg

I'm also not sure yet if the time matches the crash.

The worst thing is that the system completely locks up; it doesn't even reboot automatically. Is there some kind of additional watchdog I could configure to make it at least reboot?

Any other ideas for the investigation?
 
So the time since the last reboot should have been 559984 seconds, so unless the first digit is cut off, the USB disconnect happened much earlier.
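For scale, 559984 seconds is about six and a half days of uptime. A quick sketch of the conversion (the helper name `fmt_uptime` is made up for illustration):

```shell
# Convert a duration in seconds into days, hours, minutes and seconds.
fmt_uptime() {
    printf '%dd %02dh %02dm %02ds\n' \
        $(($1 / 86400)) $(($1 % 86400 / 3600)) $(($1 % 3600 / 60)) $(($1 % 60))
}

fmt_uptime 559984   # prints: 6d 11h 33m 04s
```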
 
Sadly it still continues to crash, even after updating to PVE 8.

I could see some new info in the last crash. Any ideas?
Code:
2023-08-03T03:17:04.968617+02:00 proxmox kernel: [289193.998738] CPU: 0 PID: 118 Comm: kcompactd0 Tainted: P      D    O       6.2.16-3-pve #1
2023-08-03T03:17:04.968617+02:00 proxmox kernel: [289193.998741] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./X300M-STX, BIOS P1.80E 03/28/2023
2023-08-03T03:17:04.968618+02:00 proxmox kernel: [289193.998744] RIP: 0010:native_queued_spin_lock_slowpath+0x82/0x300
2023-08-03T03:17:04.968618+02:00 proxmox kernel: [289193.998750] Code: 00 00 00 f0 0f ba 2b 08 0f 92 c2 8b 03 0f b6 d2 c1 e2 08 30 e4 09 d0 3d ff 00 00 00 77 5f 85 c0 74 0e 8b 03 84 c0 74 08 f3 90 <8b> 03 84 c0 75 f8 b8 01 00 00 00 66 89 03 5b 41 5c 41 5d 41 5e 41
2023-08-03T03:17:04.968619+02:00 proxmox kernel: [289193.998754] RSP: 0018:ffffb460c05cf9e8 EFLAGS: 00000202
2023-08-03T03:17:04.968620+02:00 proxmox kernel: [289193.998757] RAX: 0000000000000101 RBX: ffffe8978d3852a8 RCX: 000ffffffffff000
2023-08-03T03:17:04.968620+02:00 proxmox kernel: [289193.998759] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffe8978d3852a8
2023-08-03T03:17:04.968621+02:00 proxmox kernel: [289193.998762] RBP: ffffb460c05cfa10 R08: 0000000000000000 R09: 0000000000000000
2023-08-03T03:17:04.968622+02:00 proxmox kernel: [289193.998764] R10: 0000008000000000 R11: ffffff8000000000 R12: 00007f328ce12000
2023-08-03T03:17:04.968630+02:00 proxmox kernel: [289193.998766] R13: ffff96e2136e17c0 R14: ffff96e24d7e1bd0 R15: 0000000000000001
2023-08-03T03:17:04.968630+02:00 proxmox kernel: [289193.998769] FS:  0000000000000000(0000) GS:ffff96e821a00000(0000) knlGS:0000000000000000
2023-08-03T03:17:04.968631+02:00 proxmox kernel: [289193.998771] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2023-08-03T03:17:04.968631+02:00 proxmox kernel: [289193.998773] CR2: 000012c70d07b000 CR3: 0000000403210000 CR4: 0000000000350ef0
2023-08-03T03:17:04.968632+02:00 proxmox kernel: [289193.998776] Call Trace:
2023-08-03T03:17:04.968632+02:00 proxmox kernel: [289193.998778]  <TASK>
2023-08-03T03:17:04.968633+02:00 proxmox kernel: [289193.998782]  _raw_spin_lock+0x3f/0x60
2023-08-03T03:17:04.968633+02:00 proxmox kernel: [289193.998785]  page_vma_mapped_walk+0x1b3/0x9e0
2023-08-03T03:17:04.968634+02:00 proxmox kernel: [289193.998789]  ? __mmu_notifier_invalidate_range_start+0x13a/0x200
2023-08-03T03:17:04.968634+02:00 proxmox kernel: [289193.998793]  try_to_migrate_one+0x151/0xd10
2023-08-03T03:17:04.968635+02:00 proxmox kernel: [289193.998796]  ? __mod_lruvec_page_state+0x9f/0x160
2023-08-03T03:17:04.968635+02:00 proxmox kernel: [289193.998801]  rmap_walk_ksm+0x141/0x1f0
2023-08-03T03:17:04.968636+02:00 proxmox kernel: [289193.998804]  try_to_migrate+0xee/0x120
2023-08-03T03:17:04.968636+02:00 proxmox kernel: [289193.998806]  ? __pfx_try_to_migrate_one+0x10/0x10
2023-08-03T03:17:04.968636+02:00 proxmox kernel: [289193.998808]  ? __pfx_folio_not_mapped+0x10/0x10
2023-08-03T03:17:04.968637+02:00 proxmox kernel: [289193.998811]  ? __pfx_folio_lock_anon_vma_read+0x10/0x10
2023-08-03T03:17:04.968637+02:00 proxmox kernel: [289193.998814]  migrate_pages+0xdb9/0x1210
2023-08-03T03:17:04.968638+02:00 proxmox kernel: [289193.998816]  ? isolate_migratepages_block+0xf4d/0x1a10
2023-08-03T03:17:04.968638+02:00 proxmox kernel: [289193.998820]  ? __pfx_compaction_alloc+0x10/0x10
2023-08-03T03:17:04.968638+02:00 proxmox kernel: [289193.998823]  ? __pfx_compaction_free+0x10/0x10
2023-08-03T03:17:04.968639+02:00 proxmox kernel: [289193.998825]  ? __pfx_remove_migration_pte+0x10/0x10
2023-08-03T03:17:04.968639+02:00 proxmox kernel: [289193.998829]  compact_zone+0xa7d/0xf00
2023-08-03T03:17:04.968640+02:00 proxmox kernel: [289193.998833]  proactive_compact_node+0x8c/0xe0
2023-08-03T03:17:04.968640+02:00 proxmox kernel: [289193.998837]  kcompactd+0x390/0x4e0
2023-08-03T03:17:04.968640+02:00 proxmox kernel: [289193.998839]  ? __pfx_autoremove_wake_function+0x10/0x10
2023-08-03T03:17:04.968641+02:00 proxmox kernel: [289193.998844]  ? __pfx_kcompactd+0x10/0x10
2023-08-03T03:17:04.968641+02:00 proxmox kernel: [289193.998846]  kthread+0xe9/0x110
2023-08-03T03:17:04.968642+02:00 proxmox kernel: [289193.998850]  ? __pfx_kthread+0x10/0x10
2023-08-03T03:17:04.968642+02:00 proxmox kernel: [289193.998852]  ret_from_fork+0x2c/0x50
2023-08-03T03:17:04.968643+02:00 proxmox kernel: [289193.998856]  </TASK>
2023-08-03T03:17:32.968585+02:00 proxmox kernel: [289221.997941] watchdog: BUG: soft lockup - CPU#0 stuck for 52s! [kcompactd0:118]
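To make the trace easier to read, here is a rough sketch that strips the timestamps and keeps only the definite frames; entries prefixed with `?` are addresses the unwinder found on the stack but could not confirm as part of the call chain. It assumes syslog lines in the format shown above, and the function name `trace_frames` is made up for illustration.

```shell
# Print the definite frames of a kernel call trace read from stdin.
# Assumes syslog lines like the ones above: "... kernel: [timestamp]  frame+0xOFF/0xLEN".
trace_frames() {
    sed -n '/<TASK>/,/<\/TASK>/p' |   # keep only the <TASK> ... </TASK> section
      sed -E 's/^.*\] +//' |          # drop everything up to the kernel timestamp
      grep -Ev '^(<|\?)' |            # skip the TASK markers and "?" frames
      sed -E 's/\+0x.*$//'            # drop the +offset/length suffix
}
```

Applied to the log above, the innermost frames come out as `_raw_spin_lock`, `page_vma_mapped_walk`, `try_to_migrate_one`, `rmap_walk_ksm`, and the outermost as `compact_zone`, `proactive_compact_node`, `kcompactd`, `kthread`, `ret_from_fork`. In other words, kcompactd was doing proactive compaction and was spinning on a lock while walking KSM page mappings when the soft lockup was reported. Whether that is the cause or just where the CPU happened to be stuck is not something the trace alone settles.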
 