PVE Crash since update to 7.1

Taledo

Good day all,

My PVE host crashed last night. I lost all monitoring at 01:50, and the host has been in limbo since. I had also run apt upgrade the day before, so this might be related.

SSH isn't working, nor is the GUI. I can log in at the console, but I cannot get a bash shell to start.


The journal service isn't starting any more, so I doubt we'll get any logs. There is, however, a quite nasty kernel stack trace in the syslog:

Screenshot 2022-01-11 at 08-37-54 deucalion pelado lan Logs LibreNMS.png


Will try a soft reboot and report back.

Edit 1 :

Host has rebooted, all VMs are coming back to life.

Here are the last entries in /var/log/syslog prior to the crash:
Bash:
Jan 11 00:00:55 deucalion pveproxy[6543]: worker 2860888 started
Jan 11 00:00:55 deucalion pveproxy[6543]: worker 2860889 started
Jan 11 00:00:55 deucalion pveproxy[6543]: worker 2860890 started
Jan 11 00:00:59 deucalion spiceproxy[1514543]: worker exit
Jan 11 00:00:59 deucalion spiceproxy[6550]: worker 1514543 finished
Jan 11 00:01:00 deucalion pveproxy[2098330]: worker exit
Jan 11 00:01:00 deucalion pveproxy[2002386]: worker exit
Jan 11 00:01:00 deucalion pveproxy[1895981]: worker exit
Jan 11 00:01:00 deucalion pveproxy[6543]: worker 2002386 finished
Jan 11 00:01:00 deucalion pveproxy[6543]: worker 1895981 finished
Jan 11 00:01:00 deucalion pveproxy[6543]: worker 2098330 finished
Jan 11 00:01:01 deucalion CRON[2861601]: (root) CMD (/bin/bash /root/ventil.sh &> /dev/null)
Jan 11 00:17:01 deucalion CRON[3379041]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Jan 11 00:24:01 deucalion CRON[3409426]: (root) CMD (if [ $(date +%w) -eq 0 ] && [ -x /usr/lib/zfs-linux/scrub ]; then /usr/lib/zfs-linux/scrub; fi)
Jan 11 00:29:26 deucalion smartd[5767]: Device: /dev/sdc [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 66 to 65
Jan 11 00:50:16 deucalion pvedaemon[2035974]: worker exit
Jan 11 00:50:16 deucalion pvedaemon[6532]: worker 2035974 finished
Jan 11 00:50:16 deucalion pvedaemon[6532]: starting 1 worker(s)
Jan 11 00:50:16 deucalion pvedaemon[6532]: worker 3529717 started
Jan 11 00:59:26 deucalion pmxcfs[6165]: [dcdb] notice: data verification successful
Jan 11 01:00:17 deucalion kernel: [464490.164962] kauditd_printk_skb: 14 callbacks suppressed
Jan 11 01:00:17 deucalion kernel: [464490.164967] audit: type=1400 audit(1641859217.241:157): apparmor="DENIED" operation="mount" info="failed flags match" error=-13 profile="lxc-102_</var/lib/lxc>" name="/run/systemd/unit-root/" pid=3575739 comm="(ogrotate)" srcname="/" flags="rw, rbind"


kern.log doesn't contain anything relevant.


I can provide logs if needed.
 
PVE crashed again, with a very similar error message in the syslog.

[27309.327754] INFO: task jbd2/dm-1-8:577 blocked for more than 120 seconds.

This seems to be the first error. As far as I can tell, the host then goes into a weird state where write operations to the disks no longer work.
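For anyone who wants to pull these warnings out of a saved syslog or dmesg dump, here is a minimal sketch in Python. It assumes the message format shown above; the regex and the function name `find_hung_tasks` are illustrative, not part of any Proxmox tooling. For context, jbd2 is the ext4 journaling thread, so a blocked jbd2 task suggests that journal writes to the underlying device were stalled.

```python
import re

# Matches the kernel hung-task warning, for example:
# "[27309.327754] INFO: task jbd2/dm-1-8:577 blocked for more than 120 seconds."
HUNG_TASK = re.compile(
    r"\[\s*(?P<uptime>\d+\.\d+)\]\s+INFO: task (?P<task>.+):(?P<pid>\d+) "
    r"blocked for more than (?P<seconds>\d+) seconds"
)


def find_hung_tasks(lines):
    """Return (uptime_seconds, task_name, pid, blocked_seconds) per warning."""
    hits = []
    for line in lines:
        match = HUNG_TASK.search(line)
        if match:
            hits.append(
                (
                    float(match["uptime"]),
                    match["task"],
                    int(match["pid"]),
                    int(match["seconds"]),
                )
            )
    return hits


# The line quoted in this post, used as sample input.
sample = [
    "[27309.327754] INFO: task jbd2/dm-1-8:577 blocked for more than 120 seconds."
]
print(find_hung_tasks(sample))
```

Sorting the results by uptime shows which task stalled first, which is usually the most interesting one.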

I'll try switching from 5.11 Kernel to 5.15, but this isn't ideal.

I do not know if that's linked, but I'm seeing a high load before the crash each time.
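A rough way to watch for this is to sample /proc/loadavg and compare the 1-minute value with the number of CPUs. The sketch below is only an illustration: the threshold `factor` and both function names are assumptions, and the sample string is made up. Note that on Linux, tasks stuck in uninterruptible I/O wait also count toward the load average, so a high load may be a symptom of stalled disk I/O rather than its cause.

```python
def parse_loadavg(text):
    """Parse the 1, 5 and 15 minute load averages from /proc/loadavg text."""
    one, five, fifteen = (float(field) for field in text.split()[:3])
    return one, five, fifteen


def is_overloaded(load_1min, cpu_count, factor=2.0):
    """Hypothetical rule of thumb: flag a load above factor x CPU count."""
    return load_1min > factor * cpu_count


def read_loadavg(path="/proc/loadavg"):
    """Read and parse the live values on a Linux host."""
    with open(path) as handle:
        return parse_loadavg(handle.read())


# Made-up sample in the /proc/loadavg format, not taken from the affected host.
sample = "12.50 8.10 4.02 3/1234 56789"
one, five, fifteen = parse_loadavg(sample)
print(one, five, fifteen, is_overloaded(one, cpu_count=4))
```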



1641916550313.png
 
Switching the kernel didn't change anything.

This time, though, there was no stack trace. Instead, I found this:

1641933797579.png

Could this be related to a failing disk?
 
1641935437394.png

I'm seeing no SMART errors on the disks.


There appears to be a correlation between a Minecraft server running and the host crashing.

I've had servers running before with no issues.


I'm seeing a pattern here. This is the CT with the minecraft server.

1641935951593.png
 
I've got exactly the same issue! Any updates on this? I still haven't nailed down the root cause, but the update to 7.1 could be the culprit.
My month so far:

crash.jpg

Additional information:

Kernel 5.15.19-1-pve, Proxmox 7.1-10, running 2 LXCs and 3 VMs.

It always seems to crash around the same time, 2 AM.

The TrueNAS Core VM keeps running when Proxmox becomes unresponsive; I can still access the SMB shares.
 
Do you have any scheduled backups at 2 AM? That might be causing it.
 
I do have scheduled backups, but at 4 AM. I also suspected the backups, because the monitor attached directly to the server sometimes shows this line after a crash: \\Backup-PC has not responded in 180 seconds. Reconnecting...

My server runs on CET (UTC+1), so even if Proxmox worked in UTC, the crash would be at 3 AM, not 2 AM.
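The offset arithmetic can be checked with a short Python sketch. The date is arbitrary and only there to build a datetime, and CET is modelled as a fixed UTC+1 offset (so summer time is ignored); both are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC+1 offset standing in for CET; summer time is deliberately ignored.
CET = timezone(timedelta(hours=1), "CET")

# Arbitrary date; only the time of day matters here.
backup_local = datetime(2022, 2, 12, 4, 0, tzinfo=CET)
backup_utc = backup_local.astimezone(timezone.utc)

print(backup_local.strftime("%H:%M %Z"))  # 04:00 CET
print(backup_utc.strftime("%H:%M %Z"))    # 03:00 UTC
```

Either way, the 4 AM backup job does not land on the 2 AM crash window.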

My backup drive:
1644703745469.jpeg
 
Hey,

On my end the problem went away. Here's what I did (I'm not sure which change solved it):

I moved the Minecraft server CT to another drive

AND

since the CT was swapping, I added RAM and removed all swap for it.

As I said in my original post, PVE mostly crashed when the Minecraft server was running with people on it.

Since then, no crashes have occurred. So it's either a bad drive on my end (no SMART errors, nothing, and it wasn't slow or anything) or the swapping.
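To check whether a container is actually dipping into swap before changing its allocation, the SwapTotal and SwapFree fields of /proc/meminfo can be compared. This is a minimal sketch that assumes it is run inside the container and that /proc/meminfo there reflects the container's own limits; the function name and the sample values are made up for illustration.

```python
def swap_usage_kib(meminfo_text):
    """Return (swap_total, swap_used) in KiB from /proc/meminfo-style text."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            # The first token after the colon is the numeric value.
            fields[key.strip()] = int(parts[0])
    total = fields.get("SwapTotal", 0)
    free = fields.get("SwapFree", 0)
    return total, total - free


# Made-up sample values, not taken from the affected container.
sample = """MemTotal:        4194304 kB
MemFree:          102400 kB
SwapTotal:        524288 kB
SwapFree:         131072 kB
"""
print(swap_usage_kib(sample))
```

On a live system the text would come from reading /proc/meminfo; a used value that keeps growing while the game server is busy would point toward the swapping explanation.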


I hope you can solve your issue!
 

I'll try to remove swap allocations too, thanks!
 
Quick update: so far it hasn't crashed again since I removed the swap allocations for the LXCs. Let's see if it stays that way.