PVE Crash since update to 7.1

Taledo · Jan 11, 2022

Good day all,

I found that my PVE host crashed overnight. I lost all monitoring at 01:50, and the host has been in limbo since. I also ran apt upgrade yesterday, so this might be related.

SSH isn't working, nor is the GUI. I can log in at the console, but I cannot get a bash shell to start.

The journal service no longer starts, so I doubt we'll get any logs from it. There is, however, a rather nasty kernel stack trace in the syslog:

[Attached screenshot: Screenshot 2022-01-11 at 08-37-54 deucalion pelado lan Logs LibreNMS.png]

Will try a soft reboot and report back.

Edit 1 :

Host has rebooted, all VMs are coming back to life.

Here are the last entries in /var/log/syslog prior to the crash:

Bash:

Jan 11 00:00:55 deucalion pveproxy[6543]: worker 2860888 started
Jan 11 00:00:55 deucalion pveproxy[6543]: worker 2860889 started
Jan 11 00:00:55 deucalion pveproxy[6543]: worker 2860890 started
Jan 11 00:00:59 deucalion spiceproxy[1514543]: worker exit
Jan 11 00:00:59 deucalion spiceproxy[6550]: worker 1514543 finished
Jan 11 00:01:00 deucalion pveproxy[2098330]: worker exit
Jan 11 00:01:00 deucalion pveproxy[2002386]: worker exit
Jan 11 00:01:00 deucalion pveproxy[1895981]: worker exit
Jan 11 00:01:00 deucalion pveproxy[6543]: worker 2002386 finished
Jan 11 00:01:00 deucalion pveproxy[6543]: worker 1895981 finished
Jan 11 00:01:00 deucalion pveproxy[6543]: worker 2098330 finished
Jan 11 00:01:01 deucalion CRON[2861601]: (root) CMD (/bin/bash /root/ventil.sh &> /dev/null)
Jan 11 00:17:01 deucalion CRON[3379041]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Jan 11 00:24:01 deucalion CRON[3409426]: (root) CMD (if [ $(date +%w) -eq 0 ] && [ -x /usr/lib/zfs-linux/scrub ]; then /usr/lib/zfs-linux/scrub; fi)
Jan 11 00:29:26 deucalion smartd[5767]: Device: /dev/sdc [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 66 to 65
Jan 11 00:50:16 deucalion pvedaemon[2035974]: worker exit
Jan 11 00:50:16 deucalion pvedaemon[6532]: worker 2035974 finished
Jan 11 00:50:16 deucalion pvedaemon[6532]: starting 1 worker(s)
Jan 11 00:50:16 deucalion pvedaemon[6532]: worker 3529717 started
Jan 11 00:59:26 deucalion pmxcfs[6165]: [dcdb] notice: data verification successful
Jan 11 01:00:17 deucalion kernel: [464490.164962] kauditd_printk_skb: 14 callbacks suppressed
Jan 11 01:00:17 deucalion kernel: [464490.164967] audit: type=1400 audit(1641859217.241:157): apparmor="DENIED" operation="mount" info="failed flags match" error=-13 profile="lxc-102_</var/lib/lxc>" name="/run/systemd/unit-root/" pid=3575739 comm="(ogrotate)" srcname="/" flags="rw, rbind"

kern.log doesn't contain anything relevant.

I can provide logs if needed.

Taledo · Jan 11, 2022

PVE crashed again, with a very similar error message in the syslog.

[27309.327754] INFO: task jbd2/dm-1-8:577 blocked for more than 120 seconds.

This seems to be the first error. As far as I can tell, the host then goes into a weird state where write operations to the disks no longer work.
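For anyone comparing their own logs: the line above is the kernel's hung-task warning (120 seconds is the usual default timeout), and jbd2 is the ext4 journal thread, here for the filesystem on dm-1. Below is a rough sketch for pulling these warnings out of a saved dmesg or syslog dump so the first blocked task is easy to spot. The function name `find_hung_tasks` and the regex are my own, not part of any tool, and the pattern assumes the message format shown above.

```python
import re

# Matches the kernel's hung-task warning, e.g.
# "[27309.327754] INFO: task jbd2/dm-1-8:577 blocked for more than 120 seconds."
HUNG_TASK_RE = re.compile(
    r"\[\s*(?P<uptime>\d+\.\d+)\]\s+INFO: task (?P<task>.+):(?P<pid>\d+) "
    r"blocked for more than (?P<seconds>\d+) seconds"
)


def find_hung_tasks(lines):
    """Return (uptime_seconds, task_name, pid, blocked_seconds) per warning."""
    hits = []
    for line in lines:
        match = HUNG_TASK_RE.search(line)
        if match:
            hits.append((
                float(match["uptime"]),
                match["task"],
                int(match["pid"]),
                int(match["seconds"]),
            ))
    return hits


if __name__ == "__main__":
    # Sample line copied from the post above.
    sample = [
        "[27309.327754] INFO: task jbd2/dm-1-8:577 blocked for more than 120 seconds.",
    ]
    for uptime, task, pid, seconds in find_hung_tasks(sample):
        print(f"{task} (pid {pid}) blocked >{seconds}s at uptime {uptime:.0f}s")
```

The earliest entry is usually the interesting one, since later warnings tend to be other tasks piling up behind the same stalled I/O.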

I'll try switching from the 5.11 kernel to 5.15, but this isn't ideal.

I don't know whether it's linked, but I'm seeing a high load before each crash.
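One caveat on the load: on Linux the load average also counts tasks stuck in uninterruptible I/O wait, so a high load may be a symptom of the blocked writes rather than a separate cause. A minimal sketch for logging the load with timestamps, so it can be lined up against the crash time afterwards, could look like this. The 1.5x-per-CPU threshold and the function names are assumptions of mine, not a Proxmox recommendation.

```python
import os
import time


def load_is_high(load1, cpu_count, factor=1.5):
    """True when the 1-minute load exceeds factor * CPU count (assumed threshold)."""
    return load1 > factor * cpu_count


def sample_load():
    """Return a timestamped line with the 1/5/15-minute load averages."""
    load1, load5, load15 = os.getloadavg()
    cpus = os.cpu_count() or 1
    flag = " HIGH" if load_is_high(load1, cpus) else ""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    return f"{stamp} load={load1:.2f}/{load5:.2f}/{load15:.2f} cpus={cpus}{flag}"


if __name__ == "__main__":
    print(sample_load())
```

Run from cron every minute with the output appended to a file on a different disk than the one that stalls, otherwise the log itself may stop when the writes do.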

Taledo · Jan 11, 2022

Switching the kernel didn't change anything.

This time, though, there's no stack trace. Instead, I found this:

Could this be related to a failing disk?

Taledo · Jan 11, 2022

I'm seeing no SMART errors on the disks.

There appears to be a correlation between a Minecraft server running and the host crashing.

I've had servers running before with no issues.

I'm seeing a pattern here. This is the CT with the Minecraft server.

FaySmash · Feb 12, 2022

I have exactly the same issue! Any updates on this? I still haven't nailed down the root cause, but the update to 7.1 could be the culprit.
My month so far..

Additional information:

Kernel 5.15.19-1-pve, Proxmox 7.1-10, running 2 LXCs and 3 VMs

It always seems to crash around the same time, 2 AM

The TrueNAS Core VM keeps running when Proxmox becomes unresponsive; I can still access the SMB shares

Spirog · Feb 12, 2022

FaySmash said:
I have exactly the same issue! Any updates on this? I still haven't nailed down the root cause, but the update to 7.1 could be the culprit.
My month so far..

View attachment 34190

Additional information:

Kernel 5.15.19-1-pve, Proxmox 7.1-10, running 2 LXCs and 3 VMs

It always seems to crash around the same time, 2 AM

The TrueNAS Core VM keeps running when Proxmox becomes unresponsive; I can still access the SMB shares

Do you have any scheduled backups at 2 AM? That might be causing it.

FaySmash · Feb 12, 2022

Spirog said:
Do you have any scheduled backups at 2 AM? That might be causing it.

I do have scheduled backups, but at 4 AM. I also suspected backups, because the monitor attached directly to the server sometimes shows this line after a crash: \\Backup-PC has not responded in 180 seconds. Reconnecting...

My server runs on CET (UTC+01:00), so even if Proxmox scheduled things in UTC, it would crash at 3 AM, not 2 AM.

My backup drive:

Taledo · Feb 12, 2022

Hey,

On my end the problem went away. Here's what I did (not sure which change solved it):

The Minecraft server CT was moved to another drive

AND

Since it was swapping, I added RAM and removed all swap for that CT.

As I said earlier, PVE mostly crashed when the Minecraft server was running with people on it.

Since then, no crashes have occurred. So it's either a bad drive on my end (no SMART errors, it wasn't slow or anything) or the swapping.
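To check whether swapping is actually happening before changing anything, something like the following reads the SwapTotal and SwapFree fields from /proc/meminfo. This is only a sketch: the function names are mine, and whether the numbers seen inside a CT reflect the container's own swap (rather than the host's) depends on lxcfs, which I'm assuming here.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of field name -> value."""
    fields = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def swap_used_kib(text):
    """Swap in use = SwapTotal - SwapFree (both reported in kB by the kernel)."""
    info = parse_meminfo(text)
    return info.get("SwapTotal", 0) - info.get("SwapFree", 0)


if __name__ == "__main__":
    with open("/proc/meminfo") as handle:
        used = swap_used_kib(handle.read())
    print(f"swap in use: {used / 1024:.0f} MiB")
```

If the value climbs steadily while the game server has players on it, that would at least be consistent with the swapping theory, though it doesn't rule out the drive.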

I hope you can solve your issue!

FaySmash · Feb 12, 2022

Taledo said:
Hey,

On my end the problem went away. Here's what I did (not sure which change solved it):

The Minecraft server CT was moved to another drive

AND

Since it was swapping, I added RAM and removed all swap for that CT.

As I said earlier, PVE mostly crashed when the Minecraft server was running with people on it.

Since then, no crashes have occurred. So it's either a bad drive on my end (no SMART errors, it wasn't slow or anything) or the swapping.

I hope you can solve your issue!

I'll try to remove swap allocations too, thanks!

FaySmash · Feb 18, 2022

Quick update: so far it hasn't crashed again since I removed the swap allocations for the LXCs. Let's see if it stays this way.

FaySmash · Feb 25, 2022

2nd update: it still hasn't crashed since I removed the swap allocations. It seems that heavy swap usage causes this behavior in 7.1. Thanks again for the tip!
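For anyone wanting to find which containers still have swap allocated: the container configs are plain text with a `swap:` line giving the value in MB. Here is a rough sketch that reads that value from config text. It assumes the configs live under /etc/pve/lxc/ as `<vmid>.conf`, and the container IDs and contents in the demo are made up for illustration.

```python
def swap_mb_from_config(text):
    """Return the 'swap:' value (MB) from a container config, or None if unset.

    Only the current config is read: parsing stops at the first
    '[section]' header, which is where snapshot sections begin.
    """
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            break
        key, sep, value = line.partition(":")
        if sep and key.strip() == "swap":
            return int(value.strip())
    return None


def containers_with_swap(configs):
    """Given {vmid: config_text}, return {vmid: swap_mb} where swap_mb > 0."""
    result = {}
    for vmid, text in configs.items():
        swap = swap_mb_from_config(text)
        if swap:
            result[vmid] = swap
    return result


if __name__ == "__main__":
    # Hypothetical configs for illustration; real ones would be read from disk.
    demo = {
        201: "arch: amd64\nmemory: 4096\nswap: 512\n",
        202: "arch: amd64\nmemory: 8192\nswap: 0\n",
    }
    print(containers_with_swap(demo))
```

The value itself can be changed on the container's Resources tab in the GUI, or, as far as I know, with `pct set <vmid> --swap 0` on the host; check `man pct` on your version before relying on that.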

hhhhn · Nov 22, 2022

FaySmash said:
2nd update: it still hasn't crashed since I removed the swap allocations. It seems that heavy swap usage causes this behavior in 7.1. Thanks again for the tip!

Thanks @FaySmash , it works like a charm.
