PVE Crash since update to 7.1

Taledo

Member
Nov 20, 2020
Good day all,

I found that my PVE host crashed last night. I lost all monitoring at 01:50, and the host has been in limbo since. I had also run apt upgrade the day before, so this might be related.

SSH isn't working, nor is the GUI. I can log in on the console, but I cannot get a bash shell to start.


The journal service isn't starting anymore, so I doubt we'll get any logs. There is, however, a quite nasty kernel stack trace in the syslog:

Screenshot 2022-01-11 at 08-37-54 deucalion pelado lan Logs LibreNMS.png


Will try a soft reboot and report back.

Edit 1 :

Host has rebooted, all VMs are coming back to life.

Here are the last entries in /var/log/syslog prior to the crash:
Bash:
Jan 11 00:00:55 deucalion pveproxy[6543]: worker 2860888 started                                                      
Jan 11 00:00:55 deucalion pveproxy[6543]: worker 2860889 started                                                      
Jan 11 00:00:55 deucalion pveproxy[6543]: worker 2860890 started                                                      
Jan 11 00:00:59 deucalion spiceproxy[1514543]: worker exit                                                            
Jan 11 00:00:59 deucalion spiceproxy[6550]: worker 1514543 finished                                                    
Jan 11 00:01:00 deucalion pveproxy[2098330]: worker exit                                                              
Jan 11 00:01:00 deucalion pveproxy[2002386]: worker exit                                                              
Jan 11 00:01:00 deucalion pveproxy[1895981]: worker exit                                                              
Jan 11 00:01:00 deucalion pveproxy[6543]: worker 2002386 finished                                                      
Jan 11 00:01:00 deucalion pveproxy[6543]: worker 1895981 finished                                                      
Jan 11 00:01:00 deucalion pveproxy[6543]: worker 2098330 finished                                                      
Jan 11 00:01:01 deucalion CRON[2861601]: (root) CMD (/bin/bash /root/ventil.sh &> /dev/null)                          
Jan 11 00:17:01 deucalion CRON[3379041]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)                  
Jan 11 00:24:01 deucalion CRON[3409426]: (root) CMD (if [ $(date +%w) -eq 0 ] && [ -x /usr/lib/zfs-linux/scrub ]; then /usr/lib/zfs-linux/scrub; fi)                                                                                          
Jan 11 00:29:26 deucalion smartd[5767]: Device: /dev/sdc [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 66 to 65                                                                                                      
Jan 11 00:50:16 deucalion pvedaemon[2035974]: worker exit                                                              
Jan 11 00:50:16 deucalion pvedaemon[6532]: worker 2035974 finished                                                    
Jan 11 00:50:16 deucalion pvedaemon[6532]: starting 1 worker(s)                                                        
Jan 11 00:50:16 deucalion pvedaemon[6532]: worker 3529717 started                                                      
Jan 11 00:59:26 deucalion pmxcfs[6165]: [dcdb] notice: data verification successful                                    
Jan 11 01:00:17 deucalion kernel: [464490.164962] kauditd_printk_skb: 14 callbacks suppressed                          
Jan 11 01:00:17 deucalion kernel: [464490.164967] audit: type=1400 audit(1641859217.241:157): apparmor="DENIED" operation="mount" info="failed flags match" error=-13 profile="lxc-102_</var/lib/lxc>" name="/run/systemd/unit-root/" pid=3575739 comm="(ogrotate)" srcname="/" flags="rw, rbind"


kern.log doesn't contain anything relevant.


I can provide logs if needed.
 
PVE crashed again, with a very similar error message in the syslog.

[27309.327754] INFO: task jbd2/dm-1-8:577 blocked for more than 120 seconds.
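For anyone who wants to pull these hung-task warnings out of a saved syslog or dmesg dump, here is a minimal Python sketch. It assumes the message format shown in the line above; the function name `find_hung_tasks` is made up for illustration. As far as I know, `jbd2/...` is the ext4 journaling thread, which would fit with writes stalling.

```python
import re

# Matches kernel hung-task warnings of the form quoted above, e.g.
# [27309.327754] INFO: task jbd2/dm-1-8:577 blocked for more than 120 seconds.
HUNG_TASK = re.compile(
    r"\[\s*(?P<uptime>\d+\.\d+)\]\s+INFO: task (?P<task>\S+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)


def find_hung_tasks(lines):
    """Return (uptime_seconds, task_name, pid, blocked_seconds) for each match."""
    hits = []
    for line in lines:
        m = HUNG_TASK.search(line)
        if m:
            hits.append(
                (float(m["uptime"]), m["task"], int(m["pid"]), int(m["secs"]))
            )
    return hits


# The only sample here is the line quoted in this thread.
sample = [
    "[27309.327754] INFO: task jbd2/dm-1-8:577 blocked for more than 120 seconds."
]
print(find_hung_tasks(sample))
```

The earliest hit is usually the most interesting one, since later blocked tasks tend to be waiting on the first.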

This seems to be the first error. As far as I can tell, the host then goes into a weird state where write operations to the disks no longer work.

I'll try switching from 5.11 Kernel to 5.15, but this isn't ideal.

I do not know if that's linked, but I'm seeing a high load before the crash each time.
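To check whether the load really climbs before each crash, one option is to log the load average periodically and compare it to the CPU count. Below is a rough Python sketch that parses the text of /proc/loadavg. The sample values, the function names, and the "2x CPU count" threshold are all assumptions for illustration, not measurements from this host.

```python
def parse_loadavg(text):
    """Parse the contents of /proc/loadavg into (1min, 5min, 15min) floats."""
    fields = text.split()
    return tuple(float(f) for f in fields[:3])


def is_overloaded(load_1min, cpu_count, factor=2.0):
    """Hypothetical rule of thumb: flag a load above factor x CPU count."""
    return load_1min > factor * cpu_count


# Illustrative sample in /proc/loadavg format (made-up numbers).
sample = "18.42 9.07 4.11 3/812 40213"
one, five, fifteen = parse_loadavg(sample)
print(one, is_overloaded(one, cpu_count=4))
```

On a live host the text would come from reading /proc/loadavg, for example from a cron job that appends a timestamped line to a file on a disk that is still writable.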



1641916550313.png
 
Switching the kernel didn't change anything.

This time, though, no trace of the stacktrace. But I've instead found this :

1641933797579.png

Could this be related to a failing disk?
 
1641935437394.png

I'm seeing no smart error on the disks.


There appears to be a correlation between a Minecraft server running and the host crashing.

I've had servers running before with no issues.


I'm seeing a pattern here. This is the CT with the minecraft server.

1641935951593.png
 
I have exactly the same issue! Any updates on this? I still haven't been able to nail down the root cause, but the update to 7.1 could be the culprit.
My month so far..

crash.jpg

Additional information:

Kernel 5.15.19-1-pve, Proxmox 7.1-10, running 2 LXCs and 3 VMs

Seems to always crash around the same time, 2 AM

The TrueNAS Core VM keeps running when Proxmox becomes unresponsive; I can still access the SMB shares
 
Do you have any scheduled backups at 2am? That might be causing it.
 
I do have scheduled backups, but at 4am. I also suspected backups as the cause, because the monitor directly attached to the server sometimes shows this line after a crash: \\Backup-PC has not responded in 180 seconds. Reconnecting...

I run my server on CET (UTC+1), so even if Proxmox scheduled the backup in UTC, it would crash at 3am, not 2am.
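The time zone arithmetic can be double-checked with a few lines of Python. This is only a sketch: it uses a fixed UTC+01:00 offset for CET (no daylight saving), and the calendar date is arbitrary.

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC+01:00 offset standing in for CET; daylight saving is ignored.
CET = timezone(timedelta(hours=1), "CET")

# A 04:00 backup in local (CET) time; the date itself is illustrative.
backup_local = datetime(2022, 2, 12, 4, 0, tzinfo=CET)
backup_utc = backup_local.astimezone(timezone.utc)

print(backup_utc.strftime("%H:%M"))  # 03:00
```

So a 4am CET job corresponds to 3am UTC, which still does not line up with crashes at around 2am.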

My backup drive:
1644703745469.jpeg
 
Hey,

On my end, the problem went away. Here's what I did (not sure which change solved it, though):

The minecraft server CT was moved to another drive

AND

Since it was swapping out, I added RAM and removed all swap for that CT.

As I said in my original post, PVE mostly crashed when the Minecraft server was running with people on it.

Since then, no crashes have occurred. So it's either a bad drive on my end (no SMART errors, nothing, and it wasn't slow or anything) or the swapping.
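To see which containers still have swap allocated, the `memory` and `swap` lines in each container config (under /etc/pve/lxc/ on a PVE host, as far as I know) can be read with a small script. This is a sketch only: the sample config and its values are invented, and `read_ct_limits` is a hypothetical helper name.

```python
def read_ct_limits(config_text):
    """Return the memory and swap limits (in MiB) found in an LXC config."""
    limits = {}
    for line in config_text.splitlines():
        line = line.strip()
        if line.startswith("["):
            # Snapshot sections start with [name]; only read the live config.
            break
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition(":")
        if sep and key.strip() in ("memory", "swap"):
            limits[key.strip()] = int(value.strip())
    return limits


# Invented example config; real values will differ per container.
sample = """arch: amd64
hostname: minecraft
memory: 4096
swap: 512
"""
print(read_ct_limits(sample))
```

A container reporting `swap: 0` has no swap allocation; the value itself can be changed from the container's Resources tab in the GUI.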


I hope you can solve your issue!
 

I'll try to remove swap allocations too, thanks!
 
Quick update: so far it hasn't crashed again since I removed the swap allocations for the LXCs. Let's see if it stays this way.
 
