Proxmox Root session down, OS is down but hardware is still running

francabicon

New Member
Jan 11, 2024
Hi there,
Ever since the PVE 8.0.3 update, the entire server has frozen EVERY DAY at a certain time, although the hardware is not down and keeps running.
Please check the error in this log file:
Code:
Jan 10 03:17:01 pve CRON[2182601]: pam_unix(cron:session): session closed for user root
Jan 10 04:01:15 pve systemd[1]: Starting apt-daily.service - Daily apt download activities...
Jan 10 04:01:15 pve systemd[1]: apt-daily.service: Deactivated successfully.
Jan 10 04:01:15 pve systemd[1]: Finished apt-daily.service - Daily apt download activities.
Jan 10 04:17:01 pve CRON[2253132]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Jan 10 04:17:01 pve CRON[2253133]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Jan 10 04:17:01 pve CRON[2253132]: pam_unix(cron:session): session closed for user root
Jan 10 05:17:01 pve CRON[2322604]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Jan 10 05:17:01 pve CRON[2322605]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Jan 10 05:17:01 pve CRON[2322604]: pam_unix(cron:session): session closed for user root
Jan 10 05:46:15 pve systemd[1]: Starting pve-daily-update.service - Daily PVE download activities...
Jan 10 05:46:15 pve systemd[1]: Starting man-db.service - Daily man-db regeneration...
Jan 10 05:46:15 pve systemd[1]: man-db.service: Deactivated successfully.
Jan 10 05:46:15 pve systemd[1]: Finished man-db.service - Daily man-db regeneration.
Jan 10 05:46:22 pve pveupdate[2356464]: <root@pam> starting task UPID:pve:0023F5F2:00C00D4D:659DBEAE:aptupdate::root@pam:
Jan 10 05:46:23 pve pveupdate[2356722]: update new package list: /var/lib/pve-manager/pkgupdates
Jan 10 05:46:26 pve pveupdate[2356464]: <root@pam> end task UPID:pve:0023F5F2:00C00D4D:659DBEAE:aptupdate::root@pam: OK
Jan 10 05:46:26 pve systemd[1]: pve-daily-update.service: Deactivated successfully.
Jan 10 05:46:26 pve systemd[1]: Finished pve-daily-update.service - Daily PVE download activities.
Jan 10 05:46:26 pve systemd[1]: pve-daily-update.service: Consumed 5.230s CPU time.
Jan 10 06:07:36 pve systemd[1]: Starting apt-daily-upgrade.service - Daily apt upgrade and clean activities...
Jan 10 06:07:36 pve systemd[1]: apt-daily-upgrade.service: Deactivated successfully.
Jan 10 06:07:36 pve systemd[1]: Finished apt-daily-upgrade.service - Daily apt upgrade and clean activities.
Jan 10 06:17:01 pve CRON[2392895]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Jan 10 06:17:01 pve CRON[2392916]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Jan 10 06:17:01 pve CRON[2392895]: pam_unix(cron:session): session closed for user root
Jan 10 06:25:01 pve CRON[2402557]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Jan 10 06:25:01 pve CRON[2402558]: (root) CMD (test -x /usr/sbin/anacron || { cd / && run-parts --report /etc/cron.daily; })
Jan 10 06:25:01 pve CRON[2402557]: pam_unix(cron:session): session closed for user root
Jan 10 07:17:01 pve CRON[2464015]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Jan 10 07:17:01 pve CRON[2464016]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Jan 10 07:17:01 pve CRON[2464015]: pam_unix(cron:session): session closed for user root
Jan 10 07:17:15 pve systemd[1]: Starting apt-daily.service - Daily apt download activities...
Jan 10 07:17:15 pve systemd[1]: apt-daily.service: Deactivated successfully.
Jan 10 07:17:15 pve systemd[1]: Finished apt-daily.service - Daily apt download activities.
Jan 10 07:49:03 pve smartd[3409]: Device: /dev/sda [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 62 to 63
Jan 10 07:49:03 pve smartd[3409]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 38 to 37
Jan 10 08:17:01 pve CRON[2535227]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Jan 10 08:17:01 pve CRON[2535228]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Jan 10 08:17:01 pve CRON[2535227]: pam_unix(cron:session): session closed for user root
Jan 10 08:49:02 pve smartd[3409]: Device: /dev/sda [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 63 to 62
Jan 10 08:49:02 pve smartd[3409]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 37 to 38
Jan 10 09:17:01 pve CRON[2606823]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Jan 10 09:17:01 pve CRON[2606824]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Jan 10 09:17:01 pve CRON[2606823]: pam_unix(cron:session): session closed for user root
Jan 10 10:17:01 pve CRON[2679605]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Jan 10 10:17:01 pve CRON[2679606]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Jan 10 10:17:01 pve CRON[2679605]: pam_unix(cron:session): session closed for user root
Jan 10 11:17:01 pve CRON[2752657]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Jan 10 11:17:01 pve CRON[2752658]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Jan 10 11:17:01 pve CRON[2752657]: pam_unix(cron:session): session closed for user root
-- Reboot --
Jan 11 17:48:41 pve kernel: Linux version 6.2.16-3-pve (tom@sbuild) (gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC PVE 6.2.16-3 (2023-06-17T05:58Z) ()
Jan 11 17:48:41 pve kernel: Command line: initrd=\EFI\proxmox\6.2.16-3-pve\initrd.img-6.2.16-3-pve root=ZFS=rpool/ROOT/pve-1 boot=zfs
Jan 11 17:48:41 pve kernel: KERNEL supported cpus:
Jan 11 17:48:41 pve kernel:   Intel GenuineIntel
Jan 11 17:48:41 pve kernel:   AMD AuthenticAMD
Jan 11 17:48:41 pve kernel:   Hygon HygonGenuine
Jan 11 17:48:41 pve kernel:   Centaur CentaurHauls
Jan 11 17:48:41 pve kernel:   zhaoxin   Shanghai

Any help is highly appreciated. Thanks in advance.
 
Bumping this. I am experiencing a nearly identical issue. It is interesting that the minute of our issue is the same. I am not sure whether the local timezone is used, but I think it is. I am in EST, and it matches my local time, so depending on @francabicon's TZ it may be the same time for both of us. I am going to keep looking into it, and if I find something I will respond again. If someone with better knowledge than I have has some insight, that would be great.

Thanks.
 

I've been having the exact same issue. It was happening weekly with no pattern for over a year, but for the past four days it has happened every single day. Times have been:
07:17:01
02:17:01
11:17:01
04:17:01

So always at the 17th minute of the hour, and always preceded immediately by:

Jan 08 04:17:01 proxmox CRON[1982468]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Jan 08 04:17:01 proxmox CRON[1982469]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Jan 08 04:17:01 proxmox CRON[1982468]: pam_unix(cron:session): session closed for user root
-- Reboot --
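
One way to check this pattern across many boots instead of by eye is to pull out the last journal line before every boot separator. The following is only a minimal sketch: the function names are made up for illustration, and it assumes the journal text uses the same "-- Reboot --" / "-- Boot <id> --" separators and syslog-style timestamps as the excerpts in this thread.

```python
import re
from collections import Counter

# Separators as they appear in the excerpts in this thread, e.g.
# "-- Reboot --" or "-- Boot 432d2a5b1e7f49038797e7766edac918 --".
BOOT_MARKER = re.compile(r"^-- (Reboot|Boot [0-9a-f]+) --$")
# Syslog-style prefix such as "Jan 08 04:17:01 host ...".
TIMESTAMP = re.compile(r"^\w{3} +\d+ (\d{2}):(\d{2}):(\d{2}) ")


def last_lines_before_reboot(lines):
    """Return the last non-empty log line seen before each boot separator."""
    result = []
    previous = None
    for raw in lines:
        line = raw.strip()
        if BOOT_MARKER.match(line):
            if previous is not None:
                result.append(previous)
            previous = None
        elif line:
            previous = line
    return result


def minute_histogram(last_lines):
    """Count how often each minute-of-hour is the last one logged."""
    counts = Counter()
    for line in last_lines:
        match = TIMESTAMP.match(line)
        if match:
            counts[int(match.group(2))] += 1
    return counts


# Two excerpts copied from posts in this thread.
sample = """\
Jan 08 04:17:01 proxmox CRON[1982468]: pam_unix(cron:session): session closed for user root
-- Reboot --
Apr 24 12:48:11 bilbo pmxcfs[2151]: [status] notice: received log
-- Boot 5cd1edb6667f409e97673ac67190e524 --
""".splitlines()

for line in last_lines_before_reboot(sample):
    print(line)
print(dict(minute_histogram(sample and last_lines_before_reboot(sample))))
```

To run it on a real host, the lines could come from the output of "journalctl --no-pager" saved to a file. If minute 17 dominates the histogram, the hourly cron entry really is the last thing logged; if other minutes show up too, the cron line may just be the most regular message on a quiet system.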

This is driving me mad! Has anyone made any progress here?
 
Have you reviewed your cron jobs? Perhaps you ran a script that installed one and you forgot about it. Places to look are /etc/cron.* and "crontab -l".
I have indeed. I just went through a deep dive on this using Claude; it had some great "ideas". I had some success a couple of months ago disabling the lower C-states on my CPU (AMD Ryzen 5700 series), which really reduced the frequency of the shutdowns, but after the latest kernel update it got much worse, so it seems to be some kind of scheduled issue and probably not hardware. The hourly cron job is indeed empty, but the scheduler that runs it contains:

Code:
17 *    * * *  root    cd / && run-parts --report /etc/cron.hourly
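
For anyone unsure how to read that line, here is a small sketch that splits it into its fields. It assumes the /etc/crontab layout, which has a user column between the five time fields and the command; the class and function names are hypothetical and only for illustration.

```python
from typing import NamedTuple, Optional


class SystemCronEntry(NamedTuple):
    minute: str
    hour: str
    day_of_month: str
    month: str
    day_of_week: str
    user: str
    command: str


def parse_system_crontab_line(line: str) -> SystemCronEntry:
    """Split one /etc/crontab-style line (with a user column) into fields."""
    # Split on any whitespace, but keep the command (7th field) intact.
    parts = line.split(None, 6)
    if len(parts) != 7:
        raise ValueError(f"expected 7 fields, got {len(parts)}: {line!r}")
    return SystemCronEntry(*parts)


def hourly_minute(entry: SystemCronEntry) -> Optional[int]:
    """Return the fixed minute if the entry fires once every hour, else None."""
    others = (entry.hour, entry.day_of_month, entry.month, entry.day_of_week)
    if entry.minute.isdigit() and all(field == "*" for field in others):
        return int(entry.minute)
    return None


entry = parse_system_crontab_line(
    "17 *    * * *  root    cd / && run-parts --report /etc/cron.hourly"
)
print(entry.user)            # root
print(entry.command)         # cd / && run-parts --report /etc/cron.hourly
print(hourly_minute(entry))  # 17
```

So the entry runs as root at minute 17 of every hour, which is why the CRON lines in the logs always carry a :17:01 timestamp.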

Here's what I / Claude think might be happening: when the hourly cron job runs "cd /", it triggers some filesystem operation that involves the LVM thin pool and leads to a race condition with blocks being allocated / TRIM operations happening within VMs. I also have a CIFS mount with some aggressive caching settings, so for now I have unmounted that CIFS mount, expanded a couple of VM drives that were low on space, and turned off TRIM/discard on my VMs.

I have also set up a cron job that starts 5 minutes before the :17 mark of each hour and collects a series of iostat logs, so I am hoping to capture exactly what is causing the system to hang (assuming the above steps didn't resolve the issue).
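
Roughly, the capture idea looks like the sketch below. Note that this is not the exact job described above: instead of calling iostat it reads /proc/loadavg and /proc/diskstats directly, and every sample is flushed with fsync so that the last lines have a chance of surviving a hard freeze. The function names and the log path in the note are placeholders.

```python
import os
import time
from pathlib import Path


def read_text_or_empty(path):
    """Read a file, returning an empty string if it cannot be read."""
    try:
        return Path(path).read_text()
    except OSError:
        return ""


def append_durably(log_path, text):
    """Append text and force it to disk before returning."""
    with open(log_path, "a", encoding="utf-8") as handle:
        handle.write(text)
        handle.flush()
        os.fsync(handle.fileno())


def take_sample(log_path):
    """Write one timestamped snapshot of load average and disk counters."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    loadavg = read_text_or_empty("/proc/loadavg").strip()
    diskstats = read_text_or_empty("/proc/diskstats")
    append_durably(log_path, f"=== {stamp} loadavg: {loadavg}\n{diskstats}\n")


def sample_for(log_path, duration_s, interval_s):
    """Take a sample every interval_s seconds for about duration_s seconds."""
    deadline = time.monotonic() + duration_s
    while True:
        take_sample(log_path)
        if time.monotonic() + interval_s >= deadline:
            break
        time.sleep(interval_s)
```

Started from a crontab entry at minute 12, a call such as sample_for("/root/freeze-samples.log", 600, 5) would cover the window from five minutes before to five minutes after :17. If the freeze also blocks disk writes, the last samples may still be lost, so this is a best-effort capture only.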

Will report back!
 
What VMs do you run? If you have a bunch of Debian or Ubuntu VMs or CTs, they will also kick off their cron.hourly at the same time as the host. Perhaps your system can't handle the load, or the power supply is too weak, or something like that. That seems more likely than Claude's "idea". You do realize that Claude literally has no idea what it is talking about, right?

Anyway, looking forward to what you find.
 
Having the exact same issue. If anyone figures this out I'll be watching here for help. Thank you
 
I am having the same issue: I can't connect to the web GUI of the PVE environment. But I believe one of the VMs is still running, since I am able to SSH into the Ubuntu VM. My current config is a 1 TB NVMe as the boot drive. The VM I recently installed is from a template I created, and it's running Ubuntu Server. I also have a NAS VM with 2 HDDs that are passed through directly to the VM.
 
Same issue. I can ping it when it goes into this state, and for up to a month afterwards (I have not tested leaving it "broken" longer than that), but I cannot SSH to it and I cannot get to any VMs. I was suspecting that maybe my primary drive was going read-only, but I have not found any evidence to support that theory. Would love to know if anyone figures anything out!
 
Did you succeed?
 
Hi, I have (probably) the same problem since some v8 release on my AMD node. I was suspecting hardware issues, but none of the logs (including rasdaemon) show any hardware problems. Proxmox just freezes randomly, sometimes after a few days, sometimes after a few weeks. My recent logs:

Code:
Apr 23 06:57:16 bilbo systemd[1]: Starting apt-daily.service - Daily apt download activities...
Apr 23 06:57:17 bilbo systemd[1]: apt-daily.service: Deactivated successfully.
Apr 23 06:57:17 bilbo systemd[1]: Finished apt-daily.service - Daily apt download activities.
Apr 23 06:57:19 bilbo pmxcfs[2271]: [dcdb] notice: data verification successful
Apr 23 07:02:12 bilbo pmxcfs[2271]: [status] notice: received log
Apr 23 07:17:01 bilbo CRON[1048932]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Apr 23 07:17:01 bilbo CRON[1048933]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Apr 23 07:17:01 bilbo CRON[1048932]: pam_unix(cron:session): session closed for user root
Apr 23 07:17:12 bilbo pmxcfs[2271]: [status] notice: received log
-- Boot 432d2a5b1e7f49038797e7766edac918 --
Apr 24 08:50:17 bilbo kernel: Linux version 6.8.12-9-pve (build@proxmox) (gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for >
Apr 24 08:50:17 bilbo kernel: Command line: initrd=\EFI\proxmox\6.8.12-9-pve\initrd.img-6.8.12-9-pve root=ZFS=rpool/ROOT/pve-1 boo>


Apr 24 12:17:01 bilbo CRON[20241]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Apr 24 12:17:01 bilbo CRON[20242]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Apr 24 12:17:01 bilbo CRON[20241]: pam_unix(cron:session): session closed for user root
Apr 24 12:18:10 bilbo pmxcfs[2151]: [status] notice: received log
Apr 24 12:19:08 bilbo pmxcfs[2151]: [status] notice: received log
Apr 24 12:33:10 bilbo pmxcfs[2151]: [status] notice: received log
Apr 24 12:34:09 bilbo pmxcfs[2151]: [status] notice: received log
Apr 24 12:48:11 bilbo pmxcfs[2151]: [status] notice: received log
-- Boot 5cd1edb6667f409e97673ac67190e524 --
Apr 24 12:58:16 bilbo kernel: Linux version 6.8.12-10-pve (build@proxmox) (gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1>
Apr 24 12:58:16 bilbo kernel: Command line: initrd=\EFI\proxmox\6.8.12-10-pve\initrd.img-6.8.12-10-pve root=ZFS=rpool/ROOT/pve-1 boot=zfs quiet iom>

No custom cron jobs on my system.

Any ideas how to investigate this further?

Much appreciated.