Proxmox rebooting daily - no errors

mustava

New Member
Feb 24, 2023
Hi All,
I am experiencing an incredibly frustrating issue, and after weeks of research I haven't found anything that has helped.
My Proxmox server (7.3-6, running Linux 5.19.17-2-pve #1) has been rebooting randomly every day for seemingly no reason, regardless of which VMs are running.
Everything was running fine for a few months before this started.

Every morning I wake up to find the server has rebooted, sometimes multiple times.

Things I have tried:
- running a few passes of memtest, finding no errors
- simplifying the device passthrough config: removed the iGPU and other PCIe devices. Currently only passing through an NVIDIA GPU and a few HDDs. The server will reboot even when VMs are not using these.
- downgrading kernel

I've really run out of ideas. I can't see anything obvious in the logs, but I have attached the syslog output from between two reboots. Does anyone have any ideas?
 

Try these.
Code:
/cat var/log/syslog |grep -iE "error|fail"
dmesg |grep -iE "error|failed"
 
Getting permission denied on syslog, but from the console logs it appears to be mainly the following:

Code:
Feb 24 11:13:26 pve kernel: nvidia-gpu 0000:01:00.3: i2c timeout error e0000000
Feb 24 11:13:26 pve kernel: ucsi_ccg: probe of 0-0008 failed with error -110
Feb 24 11:13:25 pve kernel: acpi PNP0A08:00: _OSC: platform retains control of PCIe features (AE_ERROR)
Feb 24 11:14:08 pve kernel: sd 8:0:0:0: [sdi] Synchronize Cache(10) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Feb 24 11:14:08 pve kernel: sd 8:0:0:0: [sdi] Start/Stop Unit failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Feb 24 11:13:26 pve kernel: ucsi_ccg 0-0008: i2c_transfer failed -110
Feb 24 11:13:26 pve kernel: ucsi_ccg 0-0008: ucsi_ccg_init failed - -110
Feb 24 11:13:26 pve kernel: ucsi_ccg: probe of 0-0008 failed with error -110


For dmesg:
I will try to run it again close to, or just after, a crash.

Code:
root@pve:~# dmesg |grep -iE "error|failed"
[    0.304616] acpi PNP0A08:00: _OSC: platform retains control of PCIe features (AE_ERROR)
[    0.973856] RAS: Correctable Errors collector initialized.
[    7.151660] nvidia-gpu 0000:01:00.3: i2c timeout error e0000000
[    7.151672] ucsi_ccg 0-0008: i2c_transfer failed -110
[    7.151680] ucsi_ccg 0-0008: ucsi_ccg_init failed - -110
[    7.151689] ucsi_ccg: probe of 0-0008 failed with error -110
[   48.938765] sd 8:0:0:0: [sdi] Synchronize Cache(10) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[   48.938772] sd 8:0:0:0: [sdi] Start/Stop Unit failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[   48.998762] sd 9:0:0:0: [sdj] Synchronize Cache(10) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[   48.998769] sd 9:0:0:0: [sdj] Start/Stop Unit failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[   49.046770] sd 10:0:0:0: [sdk] Synchronize Cache(10) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[   49.046777] sd 10:0:0:0: [sdk] Start/Stop Unit failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[   49.118745] sd 11:0:0:0: [sdl] Synchronize Cache(10) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[   49.118752] sd 11:0:0:0: [sdl] Start/Stop Unit failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[   49.570771] sd 16:0:0:0: [sdm] Synchronize Cache(10) failed: Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[   49.626754] sd 16:0:0:1: [sdn] Synchronize Cache(10) failed: Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[   49.730725] sd 16:0:0:2: [sdo] Synchronize Cache(10) failed: Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
root@pve:~#
 
OK, so a few more reboots overnight, even with the 6.1.10-1 kernel :( I have no idea what to do!
Is there a way to reinstall Proxmox without losing all the settings?
 
Sorry, fingers aren't always connected to the brain. It should have been:

Code:
cat /var/log/syslog
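
The same error/fail filter can be wrapped in a small function so it can be pointed at any log file and gives a clear message when the file is not readable (the "permission denied" case). This is only a sketch: `scan_log` is a made-up helper name, and it assumes a traditional text log such as /var/log/syslog exists on the host.

```shell
# scan_log: hypothetical helper, not part of Proxmox.
# Prints every line containing "error" or "fail" (case-insensitive),
# prefixed with its line number, from the file given as $1.
scan_log() {
    log_file="${1:-/var/log/syslog}"   # default path is an assumption
    if [ ! -r "$log_file" ]; then
        echo "cannot read $log_file (try running as root)" >&2
        return 1
    fi
    grep -inE "error|fail" "$log_file"
}
```

Run as root, `scan_log /var/log/syslog` should list roughly the same lines as the grep suggested earlier, with line numbers added so they can be located in the full log.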
 
It looks like I've narrowed it down to the passed-through GPU being the cause of the reboots (still with no errors that I can see).

I have a 1660S that is passed through to a VM for use by Docker containers. The passthrough itself works perfectly; however, after disabling it and changing the settings back to stock, the server has stayed up for two days (woah) without issue. The weird thing is that this worked perfectly for months previously.

The odd thing is that the server still crashes/reboots even when the GPU is not in use by the VM (VM not booted).

I'll start a new thread with the specific issue and config:
https://forum.proxmox.com/threads/gpu-pass-through-causing-nightly-reboots.123539/
 
Still experiencing issues. It looks like the crash coincides with the nightly/hourly cron jobs that run around 3-5 AM. The logs look very similar each time, and there are no messages in dmesg.

Code:
Mar  4 03:03:46 pve pvedaemon[468000]: <root@pam> successful auth for user 'root@pam'
Mar  4 03:05:54 pve pveproxy[481032]: worker exit
Mar  4 03:05:54 pve pveproxy[2988]: worker 481032 finished
Mar  4 03:05:54 pve pveproxy[2988]: starting 1 worker(s)
Mar  4 03:05:54 pve pveproxy[2988]: worker 489205 started
Mar  4 03:06:07 pve pveproxy[2988]: worker 480114 finished
Mar  4 03:06:07 pve pveproxy[2988]: starting 1 worker(s)
Mar  4 03:06:07 pve pveproxy[2988]: worker 489239 started
Mar  4 03:06:10 pve pveproxy[489238]: got inotify poll request in wrong process - disabling inotify
Mar  4 03:06:10 pve pveproxy[489238]: worker exit
Mar  4 03:10:01 pve CRON[489947]: (root) CMD (test -e /run/systemd/system || SERVICE_MODE=1 /sbin/e2scrub_all -A -r)
Mar  4 03:12:17 pve pvedaemon[462470]: <root@pam> successful auth for user 'root@pam'
Mar  4 03:14:34 pve pveproxy[482304]: worker exit
Mar  4 03:14:34 pve pveproxy[2988]: worker 482304 finished
Mar  4 03:14:34 pve pveproxy[2988]: starting 1 worker(s)
Mar  4 03:14:34 pve pveproxy[2988]: worker 490743 started
Mar  4 03:17:01 pve CRON[491174]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Mar  4 03:18:30 pve postfix/qmgr[2939]: 425A91F2E3: from=<root@pve.>, size=1223, nrcpt=1 (queue active)
Mar  4 03:18:30 pve postfix/qmgr[2939]: 440A3B0081: from=<root@pve.>, size=1223, nrcpt=1 (queue active)
Mar  4 03:18:31 pve postfix/smtp[491440]: connect to gmail-smtp-in.l.google.com[2404:6800:4003:c05::1a]:25: Network is unreachable
Mar  4 03:18:46 pve pvedaemon[468454]: <root@pam> successful auth for user 'root@pam'
Mar  4 03:19:01 pve postfix/smtp[491440]: connect to gmail-smtp-in.l.google.com[172.253.118.27]:25: Connection timed out
Mar  4 03:19:01 pve postfix/smtp[491439]: connect to gmail-smtp-in.l.google.com[172.253.118.27]:25: Connection timed out
Mar  4 03:19:01 pve postfix/smtp[491439]: connect to gmail-smtp-in.l.google.com[2404:6800:4003:c05::1a]:25: Network is unreachable
Mar  4 03:19:01 pve postfix/smtp[491439]: connect to alt1.gmail-smtp-in.l.google.com[2607:f8b0:400e:c00::1b]:25: Network is unreachable
Mar  4 03:19:31 pve postfix/smtp[491440]: connect to alt1.gmail-smtp-in.l.google.com[173.194.202.26]:25: Connection timed out
Mar  4 03:19:31 pve postfix/smtp[491440]: connect to alt1.gmail-smtp-in.l.google.com[2607:f8b0:400e:c00::1b]:25: Network is unreachable
Mar  4 03:19:31 pve postfix/smtp[491440]: connect to alt2.gmail-smtp-in.l.google.com[2607:f8b0:4023:c0b::1a]:25: Network is unreachable
Mar  4 03:19:31 pve postfix/smtp[491439]: connect to alt1.gmail-smtp-in.l.google.com[173.194.202.26]:25: Connection timed out
Mar  4 03:20:01 pve postfix/smtp[491439]: connect to alt2.gmail-smtp-in.l.google.com[142.250.141.27]:25: Connection timed out
<Crash/Reboot>
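
To compare what was logged right before each reboot, one option is to pull the last few lines ahead of every kernel boot banner in the syslog. The sketch below assumes a syslog-style text file in which each boot begins with a kernel "Linux version ..." line; `pre_boot_tail` is a hypothetical helper name, and the default of 20 lines is arbitrary.

```shell
# pre_boot_tail: hypothetical helper, not part of Proxmox.
# For each kernel boot banner ("Linux version ...") in the file $1,
# print the last N lines (default 20) that were logged before it.
pre_boot_tail() {
    log_file="$1"
    count="${2:-20}"
    awk -v n="$count" '
        /kernel: .*Linux version / {
            if (NR > 1) {
                print "--- before boot at line " NR " ---"
                start = (NR - n > 1) ? NR - n : 1
                # buf is a ring buffer holding the previous n lines
                for (i = start; i < NR; i++) print buf[i % n]
            }
        }
        { buf[NR % n] = $0 }
    ' "$log_file"
}
```

For example, `pre_boot_tail /var/log/syslog 30` prints one block per boot found in the file. If those blocks never contain shutdown messages, that is consistent with a hard reset rather than a clean reboot.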
 
Did you ever find a solution? I've been experiencing the same thing for the last week and have tried everything you've listed in the OP.
 
I did eventually find a fix. While troubleshooting with some suggestions from users here, I ended up down-clocking my RAM to run at base speed (2133 MHz) instead of the 3200 MHz the sticks are rated for, which stopped the nightly reboots. Based on some other threads I read, this seems to be a problem when PCI passthrough is enabled?

I tried 3-4 new sets of DDR4 at various supported speeds and had the same issue. I also ran memtest on all of them and confirmed that they all appeared to be in good working order. I even took some sticks out of my other PCs and tried them. Proxmox is the only system that had an issue.
I also booted the system into some other OSes and didn't experience any issues when running tests with the RAM set at 3200 MHz, so again it only seemed to be an issue with Proxmox.

Hope this helps.
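
For anyone wanting to check what speed their RAM is actually running at from the Proxmox shell, the rated and configured speeds can be read from `dmidecode -t memory`. The sketch below only filters that output; `mem_speeds` is a made-up helper name, and the exact field label ("Configured Memory Speed" vs. "Configured Clock Speed") depends on the dmidecode version, so both are matched here as an assumption.

```shell
# mem_speeds: hypothetical helper, not part of Proxmox.
# Reads `dmidecode -t memory` output on stdin and prints the rated
# ("Speed") and configured speed lines, skipping empty slots ("Unknown").
mem_speeds() {
    grep -E '^[[:space:]]*(Speed|Configured (Memory|Clock) Speed):' |
        grep -v 'Unknown' |
        sed 's/^[[:space:]]*//'
}
```

Usage, as root: `dmidecode -t memory | mem_speeds`. With the down-clock described above applied, the rated speed would be expected to show 3200 and the configured speed 2133 for each populated slot.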
 
Thanks for the tips. Unfortunately, this didn't work for me because the BIOS on my mini PC doesn't allow XMP on my RAM, so it's already running at base speeds. I actually just tried a couple of new sticks of RAM and it still keeps rebooting.

I think Proxmox may just not play nice with my system.
 
