Proxmox rebooting daily - no errors

mustava · Feb 24, 2023

Hi All,
I am experiencing an incredibly frustrating issue, and after weeks of research I haven't found anything that has helped.
My Proxmox server (7.3-6, running Linux 5.19.17-2-pve #1) has been rebooting randomly every day for seemingly no reason, regardless of which VMs are running.
Everything was running fine for a few months before this started.

Every morning I wake up to find the server has rebooted, sometimes multiple times.

Things I have tried:
- running a few passes of memtest, finding no errors
- simplifying the device passthrough config: removed the iGPU and other PCIe devices. Currently only passing through the NVIDIA GPU and a few HDDs. The server will reboot even when VMs are not using these.
- downgrading kernel

I've really run out of ideas. I can't see anything super obvious in the logs, but I have attached the syslog output from between two reboots. Anyone have any ideas?

donhwyo · Feb 24, 2023

I have been having similar issues. First thing I would do is move up to the 6.1 kernel.
https://forum.proxmox.com/threads/o...r-proxmox-ve-7-x-available.119483/post-532379.

Here is my thread.
https://forum.proxmox.com/threads/opt-in-kernel-panics.122589/
Hope it helps.

mustava · Feb 24, 2023

donhwyo said:
I have been having similar issues. First thing I would do is move up to the 6.1 kernel.
https://forum.proxmox.com/threads/o...r-proxmox-ve-7-x-available.119483/post-532379.

Here is my thread.
https://forum.proxmox.com/threads/opt-in-kernel-panics.122589/
Hope it helps.

I actually did try this, and from what I can remember it still rebooted. I also couldn't get some of my VMs to boot when I tried it, but perhaps I will try again.

donhwyo · Feb 24, 2023

Try these.

Code:

/cat var/log/syslog |grep -iE "error|fail"
dmesg |grep -iE "error|failed"

mustava · Feb 24, 2023

donhwyo said:
Try these.

Code:

/cat var/log/syslog |grep -iE "error|fail"
dmesg |grep -iE "error|failed"

Getting permission denied on syslog, but it appears from the console logs that it's mainly the following:

Code:

Feb 24 11:13:26 pve kernel: nvidia-gpu 0000:01:00.3: i2c timeout error e0000000
Feb 24 11:13:26 pve kernel: ucsi_ccg: probe of 0-0008 failed with error -110
Feb 24 11:13:25 pve kernel: acpi PNP0A08:00: _OSC: platform retains control of PCIe features (AE_ERROR)
Feb 24 11:14:08 pve kernel: sd 8:0:0:0: [sdi] Synchronize Cache(10) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Feb 24 11:14:08 pve kernel: sd 8:0:0:0: [sdi] Start/Stop Unit failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Feb 24 11:13:26 pve kernel: ucsi_ccg 0-0008: i2c_transfer failed -110
Feb 24 11:13:26 pve kernel: ucsi_ccg 0-0008: ucsi_ccg_init failed - -110
Feb 24 11:13:26 pve kernel: ucsi_ccg: probe of 0-0008 failed with error -110

For dmesg (I will try to run it again close to or after a crash):

Code:

root@pve:~# dmesg |grep -iE "error|failed"
[    0.304616] acpi PNP0A08:00: _OSC: platform retains control of PCIe features (AE_ERROR)
[    0.973856] RAS: Correctable Errors collector initialized.
[    7.151660] nvidia-gpu 0000:01:00.3: i2c timeout error e0000000
[    7.151672] ucsi_ccg 0-0008: i2c_transfer failed -110
[    7.151680] ucsi_ccg 0-0008: ucsi_ccg_init failed - -110
[    7.151689] ucsi_ccg: probe of 0-0008 failed with error -110
[   48.938765] sd 8:0:0:0: [sdi] Synchronize Cache(10) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[   48.938772] sd 8:0:0:0: [sdi] Start/Stop Unit failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[   48.998762] sd 9:0:0:0: [sdj] Synchronize Cache(10) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[   48.998769] sd 9:0:0:0: [sdj] Start/Stop Unit failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[   49.046770] sd 10:0:0:0: [sdk] Synchronize Cache(10) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[   49.046777] sd 10:0:0:0: [sdk] Start/Stop Unit failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[   49.118745] sd 11:0:0:0: [sdl] Synchronize Cache(10) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[   49.118752] sd 11:0:0:0: [sdl] Start/Stop Unit failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[   49.570771] sd 16:0:0:0: [sdm] Synchronize Cache(10) failed: Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[   49.626754] sd 16:0:0:1: [sdn] Synchronize Cache(10) failed: Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[   49.730725] sd 16:0:0:2: [sdo] Synchronize Cache(10) failed: Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
root@pve:~#

mustava · Feb 25, 2023

OK, so a few more reboots overnight, even with the 6.1.10-1 kernel.

I have no idea what to do!!
Is there a way to reinstall Proxmox without losing all the settings?

Dunuin · Feb 25, 2023

See for example here: https://github.com/DerDanilo/proxmox-stuff
Or here: https://pve.proxmox.com/wiki/Proxmox_Cluster_File_System_(pmxcfs)#_recovery

donhwyo · Feb 25, 2023

Sorry, my fingers aren't always connected to my brain. It should have been:

Code:

cat /var/log/syslog
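
Combining the corrected path with the filter posted earlier in the thread, the check could be sketched as below. This is only a rough sketch: `scan_log` is a hypothetical helper name, and the readability test is an added assumption to make a "permission denied" or mistyped path obvious.

```shell
# Scan a log file for lines mentioning errors or failures (case-insensitive).
# scan_log is a hypothetical helper; the default path is the one from the thread.
scan_log() {
    file="${1:-/var/log/syslog}"
    if [ ! -r "$file" ]; then
        # Unreadable usually means a wrong path or a non-root shell.
        echo "cannot read $file (wrong path, or not running as root?)" >&2
        return 1
    fi
    grep -iE "error|fail" "$file"
}
```

Run as root, `scan_log` with no argument reads /var/log/syslog; `dmesg | grep -iE "error|fail"` covers the kernel ring buffer in the same way.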

mustava · Mar 2, 2023

Looks like I've narrowed it down to the passed-through GPU being the cause of the reboots (still with no errors that I can see).

I have a 1660S that is passed through to a VM for use by Docker containers. The actual passthrough works perfectly; however, after disabling it and changing the settings back to stock, the server has stayed up for two days (woah) without issue. Weird thing is that this worked perfectly for months previously?

The odd thing is that the server still crashes/reboots even when the GPU is not in use by the VM (VM not booted).

I'll start a new thread with the specific issue and config:
https://forum.proxmox.com/threads/gpu-pass-through-causing-nightly-reboots.123539/

mustava · Mar 4, 2023

Still experiencing issues. It looks like the crash coincides with the nightly/hourly cron jobs that run around 3-5 AM. The logs look very similar each time, with no messages in dmesg.

Code:

Mar  4 03:03:46 pve pvedaemon[468000]: <root@pam> successful auth for user 'root@pam'
Mar  4 03:05:54 pve pveproxy[481032]: worker exit
Mar  4 03:05:54 pve pveproxy[2988]: worker 481032 finished
Mar  4 03:05:54 pve pveproxy[2988]: starting 1 worker(s)
Mar  4 03:05:54 pve pveproxy[2988]: worker 489205 started
Mar  4 03:06:07 pve pveproxy[2988]: worker 480114 finished
Mar  4 03:06:07 pve pveproxy[2988]: starting 1 worker(s)
Mar  4 03:06:07 pve pveproxy[2988]: worker 489239 started
Mar  4 03:06:10 pve pveproxy[489238]: got inotify poll request in wrong process - disabling inotify
Mar  4 03:06:10 pve pveproxy[489238]: worker exit
Mar  4 03:10:01 pve CRON[489947]: (root) CMD (test -e /run/systemd/system || SERVICE_MODE=1 /sbin/e2scrub_all -A -r)
Mar  4 03:12:17 pve pvedaemon[462470]: <root@pam> successful auth for user 'root@pam'
Mar  4 03:14:34 pve pveproxy[482304]: worker exit
Mar  4 03:14:34 pve pveproxy[2988]: worker 482304 finished
Mar  4 03:14:34 pve pveproxy[2988]: starting 1 worker(s)
Mar  4 03:14:34 pve pveproxy[2988]: worker 490743 started
Mar  4 03:17:01 pve CRON[491174]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Mar  4 03:18:30 pve postfix/qmgr[2939]: 425A91F2E3: from=<root@pve.>, size=1223, nrcpt=1 (queue active)
Mar  4 03:18:30 pve postfix/qmgr[2939]: 440A3B0081: from=<root@pve.>, size=1223, nrcpt=1 (queue active)
Mar  4 03:18:31 pve postfix/smtp[491440]: connect to gmail-smtp-in.l.google.com[2404:6800:4003:c05::1a]:25: Network is unreachable
Mar  4 03:18:46 pve pvedaemon[468454]: <root@pam> successful auth for user 'root@pam'
Mar  4 03:19:01 pve postfix/smtp[491440]: connect to gmail-smtp-in.l.google.com[172.253.118.27]:25: Connection timed out
Mar  4 03:19:01 pve postfix/smtp[491439]: connect to gmail-smtp-in.l.google.com[172.253.118.27]:25: Connection timed out
Mar  4 03:19:01 pve postfix/smtp[491439]: connect to gmail-smtp-in.l.google.com[2404:6800:4003:c05::1a]:25: Network is unreachable
Mar  4 03:19:01 pve postfix/smtp[491439]: connect to alt1.gmail-smtp-in.l.google.com[2607:f8b0:400e:c00::1b]:25: Network is unreachable
Mar  4 03:19:31 pve postfix/smtp[491440]: connect to alt1.gmail-smtp-in.l.google.com[173.194.202.26]:25: Connection timed out
Mar  4 03:19:31 pve postfix/smtp[491440]: connect to alt1.gmail-smtp-in.l.google.com[2607:f8b0:400e:c00::1b]:25: Network is unreachable
Mar  4 03:19:31 pve postfix/smtp[491440]: connect to alt2.gmail-smtp-in.l.google.com[2607:f8b0:4023:c0b::1a]:25: Network is unreachable
Mar  4 03:19:31 pve postfix/smtp[491439]: connect to alt1.gmail-smtp-in.l.google.com[173.194.202.26]:25: Connection timed out
Mar  4 03:20:01 pve postfix/smtp[491439]: connect to alt2.gmail-smtp-in.l.google.com[142.250.141.27]:25: Connection timed out
<Crash/Reboot>
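
To compare what was logged right before each reboot, the tail of the log ahead of every boot could be pulled out automatically. The sketch below is one possible approach, not something posted in the thread: `lines_before_boot` is a hypothetical helper, and it assumes each boot shows up in the syslog as a kernel line containing "Linux version" (the kernel's boot banner).

```shell
# Print the last N lines (default 5) logged before each boot banner
# in a syslog-style file.
# Assumption: a fresh boot is marked by a kernel line containing "Linux version".
lines_before_boot() {
    file="$1"
    n="${2:-5}"
    awk -v n="$n" '
        # On a boot banner, dump the ring buffer holding the previous n lines.
        /kernel: .*Linux version/ && NR > 1 {
            print "--- last lines before boot banner at line " NR " ---"
            start = NR - n
            if (start < 1) start = 1
            for (i = start; i < NR; i++) print buf[i % n]
        }
        # Remember every line in a ring buffer of size n.
        { buf[NR % n] = $0 }
    ' "$file"
}
```

For example, `lines_before_boot /var/log/syslog 20` would show the 20 lines preceding each boot, which makes it easier to see whether the same cron job or service appears before every crash.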

Dataninja · Aug 8, 2023

mustava said:

Still experiencing issues. It looks like the crash coincides with the nightly/hourly cron jobs that run around 3-5 AM. The logs look very similar each time, with no messages in dmesg.

Did you ever find a solution? I've been experiencing the same thing for the last week and have tried everything you listed in the OP.

mustava · Aug 9, 2023

Dataninja said:
Did you ever find a solution? I've been experiencing the same thing for the last week and have tried everything you listed in the OP.

I did eventually find a fix. After troubleshooting with some suggestions from users here, I ended up down-clocking my RAM to run at base speed (2133 MHz) instead of the 3200 MHz the sticks are rated for, which stopped the nightly reboots. Based on some other threads I read, this seems to be a problem when PCI passthrough is enabled?

I tried 3-4 new sets of DDR4 at various supported speeds and had the same issue. I also ran memtest on all of them and confirmed that they all appeared to be in good working order. I even took some sticks out of my other PCs and tried them... Proxmox is the only system that had an issue?
I also booted the system into some other OSes and didn't experience any issues when running tests with the RAM set at 3200 MHz, so again it only seemed to be an issue with Proxmox.

Hope this helps?
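
To confirm what speed the RAM is actually running at after a BIOS change, the rated and configured speeds can be read from the host. The sketch below assumes `dmidecode` is installed and run as root; `mem_speeds` is a hypothetical helper, and the field names it matches ("Speed", "Configured Memory Speed", or "Configured Clock Speed" on older versions) are the ones dmidecode normally prints for memory devices.

```shell
# Filter `dmidecode --type memory` output down to the speed lines.
# "Speed" is the rated speed of the module; "Configured Memory Speed"
# (or "Configured Clock Speed" on older dmidecode) is what it runs at now.
mem_speeds() {
    grep -E '^[[:space:]]*(Speed|Configured (Memory|Clock) Speed):'
}

# Usage (as root):
#   dmidecode --type memory | mem_speeds
```

If the configured speed still matches the XMP rating, the BIOS change has not taken effect.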

Dataninja · Aug 11, 2023

mustava said:
I did eventually find a fix. After troubleshooting with some suggestions from users here, I ended up down-clocking my RAM to run at base speed (2133 MHz) instead of the 3200 MHz the sticks are rated for, which stopped the nightly reboots. Based on some other threads I read, this seems to be a problem when PCI passthrough is enabled?

I tried 3-4 new sets of DDR4 at various supported speeds and had the same issue. I also ran memtest on all of them and confirmed that they all appeared to be in good working order. I even took some sticks out of my other PCs and tried them... Proxmox is the only system that had an issue?
I also booted the system into some other OSes and didn't experience any issues when running tests with the RAM set at 3200 MHz, so again it only seemed to be an issue with Proxmox.

Hope this helps?

Thanks for the tips. Unfortunately, this didn't work for me because the BIOS on my mini PC doesn't allow XMP on my RAM, so it's already running at base speeds. I actually just tried a couple of new sticks of RAM and it still keeps rebooting.

I think Proxmox may just not play nice with my system.

donhwyo · Aug 11, 2023

You could also try adding the mitigations=off kernel boot command line parameter. It worked for me, but my hardware is nothing like yours.
Good luck
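
For anyone trying this, a rough sketch of how the parameter is usually set and verified follows. The file locations in the comments are the standard Proxmox VE ones as far as I know, but which one applies depends on how the host boots, so treat them as assumptions; `has_kernel_param` is a hypothetical helper.

```shell
# Where the parameter normally goes (assumption: default Proxmox VE setup):
#   GRUB:         append it to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub,
#                 then run: update-grub
#   systemd-boot: append it to the single line in /etc/kernel/cmdline,
#                 then run: proxmox-boot-tool refresh
# Reboot afterwards.

# Check whether a parameter is on the running kernel's command line.
# The second argument is optional and defaults to /proc/cmdline.
has_kernel_param() {
    param="$1"
    file="${2:-/proc/cmdline}"
    # One parameter per line, then an exact fixed-string match.
    tr ' ' '\n' < "$file" | grep -qxF -- "$param"
}
```

After rebooting, `has_kernel_param mitigations=off && echo active` shows whether the setting took effect.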
