Proxmox reboots randomly when running docker containers inside Windows 11 VM with "host" CPU type

alex346

New Member
May 30, 2024
I am experiencing random reboots of my Proxmox server. This issue occurs specifically when running a Windows 11 VM with the CPU type set to "host" to support Docker containers inside the VM. Interestingly, a Fedora Linux VM with the same "host" CPU setting and Docker containers does not cause reboots. The system had run fine for months without the "host" CPU type on the Windows VM, but once I change that, it reboots after some time whenever I run Docker containers. I am looking for advice on how to debug this issue, capture useful logs, and hopefully fix these reboots. I could never find anything relevant in the system log. Any guidance or similar experiences would be greatly appreciated.

- proxmox-ve: 8.2.0
- Linux 6.8.4-3-pve (2024-05-02T11:55Z)
- pve-manager/8.2.2/9355359cd7afbae4

- 16 x AMD Ryzen 7 7700 8-Core Processor (1 Socket)
- Gigabyte X670 Aorus Elite AX
- Samsung 990 PRO 1 TB
- 4x16 GB Corsair Vengeance DDR5
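
For reference, kernel messages that never make it to the local disk (e.g. during a sudden panic) can be streamed to a second machine with netconsole. This is only a sketch; the interface name, IP addresses and MAC address below are placeholders for your own network:

Code:
# Syntax: netconsole=<src-port>@<src-ip>/<interface>,<dst-port>@<dst-ip>/<dst-mac>
modprobe netconsole netconsole=6665@192.168.1.50/enp5s0,6666@192.168.1.10/aa:bb:cc:dd:ee:ff

# On the receiving machine (OpenBSD netcat; traditional netcat needs "-p"):
nc -u -l 6666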
 
Same issue but with quite different specs.

Several Windows Server VMs - no Docker inside these machines
One Ubuntu VM with Docker
Always enough RAM, CPU and disk

- Kernel 6.5.11-7-pve (today I downgraded to the original kernel from the ISO, which is 6.2.16-19-pve)
- Proxmox 8.1.3
- Linux 6.2.16-19-pve (2023-10-24T12:07Z)

- AMD Ryzen 5 3600X 6-Core Processor
- 64 GB RAM

This started happening more and more often; at first it was only about twice a month, but now it's almost daily (or even twice a day).

Maybe it is a hardware issue, so I've asked for a scheduled hardware review.


The only thing I can see in the logs is "-- Reboot --", but the other log files that everyone mentions just do not exist.
 
The only thing I can see in the logs is "-- Reboot --", but the other log files that everyone mentions just do not exist.
Bookworm utilizes journald by default. Try
  • man journalctl
  • journalctl -b 0 -p err -n 25 # the last 25 lines of errors of this current boot
  • journalctl -b -1 -p err -n 25 # the last 25 lines of errors of the previous boot
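
Also worth checking that the journal is actually persistent, and looking at kernel-only messages at the end of the previous boot (nothing below is specific to your system):

Code:
# journald stores logs persistently when /var/log/journal exists (Storage=auto)
ls -d /var/log/journal && journalctl --disk-usage

# kernel messages only, jump to the end of the previous boot - an MCE or oops,
# if one was logged at all, may show up here rather than at priority "err"
journalctl -b -1 -k -e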
 
Samsung 990 PRO 1 TB
Assuming that is your only drive (OS & storage) - try replacing it with a different one & see if the problems go away. Alternatively, try updating the firmware. Samsung's NVMe drives have had their fair share of problems.

Keep in mind this consumer SSD isn't adequate for the PVE OS, which does massive amounts of constant writes. It's going to fail eventually - maybe that has already started on your setup.
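
If you want to check the drive before swapping anything, health and firmware can be read directly; /dev/nvme0n1 and /dev/nvme0 below are assumed device names, adjust for your system:

Code:
# SMART health, media/error log and current firmware revision (package: smartmontools)
smartctl -a /dev/nvme0n1

# firmware slot log, useful before/after a firmware update (package: nvme-cli)
nvme fw-log /dev/nvme0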
 
Assuming that is your only drive (OS & storage) - try replacing it with a different one & see if the problems go away. Alternatively, try updating the firmware. Samsung's NVMe drives have had their fair share of problems.

Keep in mind this consumer SSD isn't adequate for the PVE OS, which does massive amounts of constant writes. It's going to fail eventually - maybe that has already started on your setup.
I only have this one drive, allocated to the Proxmox OS. This was a brand-new system that I assembled, and the issue was present from the start. S.M.A.R.T. shows the wearout at 8%.

A Windows 11 VM with the "host" CPU is unstable and causes the reboots, but a Linux VM with the "host" CPU is stable. I have multiple CT containers as well as Linux and Windows VMs. Once I configure the Windows VM to use the "host" CPU and run Docker containers, I get reboots. If the SSD were to blame, I would expect the issue to also manifest when I don't use the "host" CPU configuration for the Windows VM, but that is not the case.
 
Same problem here, random crash & reboot when running Docker in a Windows 11 guest.

- Kernel 6.8.4-3-pve
- Proxmox 8.2.2

- AMD Ryzen 9 7950X 16-Core Processor
- 64 GB RAM
 
  • Like
Reactions: alex346
Bookworm utilizes journald by default. Try
  • man journalctl
  • journalctl -b 0 -p err -n 25 # the last 25 lines of errors of this current boot
  • journalctl -b -1 -p err -n 25 # the last 25 lines of errors of the previous boot
Sorry for the late reply; these logs show nothing useful.

I had a reboot at 9:50:

Code:
root@b1:~# journalctl -b  0  -p err  -n 25
Jun 27 09:20:35 b1 sshd[23950]: fatal: Timeout before authentication for 106.75.252.202 port 12648
Jun 27 09:22:51 b1 sshd[24552]: fatal: Timeout before authentication for 203.25.211.164 port 55596
Jun 27 09:24:27 b1 sshd[24978]: fatal: Timeout before authentication for 203.25.211.164 port 46662
Jun 27 09:26:00 b1 sshd[25348]: fatal: Timeout before authentication for 203.25.211.164 port 43154
Jun 27 09:26:56 b1 sshd[25573]: fatal: Timeout before authentication for 106.75.252.202 port 19224
Jun 27 09:29:01 b1 sshd[26096]: fatal: Timeout before authentication for 203.25.211.164 port 60342
Jun 27 09:34:38 b1 sshd[27536]: fatal: Timeout before authentication for 106.75.252.202 port 41178
Jun 27 09:39:17 b1 sshd[28693]: fatal: Timeout before authentication for 106.75.252.202 port 32344
Jun 27 09:39:40 b1 sshd[28780]: fatal: Timeout before authentication for 203.25.211.164 port 48284
Jun 27 09:42:26 b1 sshd[29487]: fatal: Timeout before authentication for 106.75.252.202 port 63128
Jun 27 09:42:46 b1 sshd[29548]: fatal: Timeout before authentication for 203.25.211.164 port 58014
Jun 27 09:45:33 b1 sshd[30239]: fatal: Timeout before authentication for 106.75.252.202 port 38908
Jun 27 09:47:25 b1 sshd[30652]: fatal: Timeout before authentication for 203.25.211.164 port 34738
Jun 27 09:48:58 b1 sshd[31019]: fatal: Timeout before authentication for 203.25.211.164 port 56550
Jun 27 09:50:17 b1 sshd[31374]: fatal: Timeout before authentication for 106.75.252.202 port 30086
Jun 27 09:52:07 b1 sshd[31828]: fatal: Timeout before authentication for 203.25.211.164 port 41684
Jun 27 09:53:39 b1 sshd[32235]: fatal: Timeout before authentication for 203.25.211.164 port 38250
Jun 27 09:54:28 b1 sshd[32805]: error: kex_exchange_identification: read: Connection reset by peer
Jun 27 10:42:31 b1 sshd[45859]: error: kex_exchange_identification: Connection closed by remote host


Code:
root@b1:~# journalctl -b  1  -p err  -n 25 
May 03 03:52:16 b1 sshd[2367632]: fatal: Timeout before authentication for 152.136.48.82 port 59858
May 03 03:52:50 b1 sshd[2368172]: error: kex_exchange_identification: Connection closed by remote host
May 03 03:55:56 b1 sshd[2368428]: fatal: Timeout before authentication for 152.136.48.82 port 45452
May 03 03:58:34 b1 sshd[2369384]: error: kex_exchange_identification: read: Connection reset by peer
May 03 03:59:41 b1 sshd[2369754]: error: kex_exchange_identification: Connection closed by remote host
May 03 04:43:22 b1 pveupdate[2379823]: validating challenge 'https://acme-v02.api.letsencrypt.org/acme/authz-v3/346127022387' failed - status: invalid
May 03 04:43:22 b1 pveupdate[2379274]: <root@pam> end task UPID:b1:0024502F:0BE433F1:6634B1B4:acmerenew::root@pam: validating challenge 'https://acme-v02.>
May 03 04:46:18 b1 sshd[2380377]: error: kex_exchange_identification: read: Connection reset by peer
May 03 04:49:09 b1 sshd[2380705]: fatal: Timeout before authentication for 180.101.88.240 port 34023
May 03 04:49:09 b1 sshd[2380706]: fatal: Timeout before authentication for 180.101.88.240 port 36525
May 03 04:49:20 b1 sshd[2380732]: fatal: Timeout before authentication for 180.101.88.240 port 13087
May 03 04:51:05 b1 sshd[2381098]: fatal: Timeout before authentication for 180.101.88.240 port 15420
May 03 04:52:33 b1 sshd[2381419]: fatal: Timeout before authentication for 180.101.88.240 port 46824
May 03 04:53:55 b1 sshd[2381675]: fatal: Timeout before authentication for 180.101.88.240 port 37322
May 03 04:53:59 b1 sshd[2381681]: fatal: Timeout before authentication for 180.101.88.240 port 19863
May 03 04:54:01 b1 sshd[2381682]: fatal: Timeout before authentication for 180.101.88.240 port 31557
May 03 04:54:09 b1 sshd[2381724]: fatal: Timeout before authentication for 180.101.88.240 port 28230
May 03 04:55:27 b1 sshd[2382019]: fatal: Timeout before authentication for 180.101.88.240 port 52563
May 03 04:56:33 b1 sshd[2382254]: fatal: Timeout before authentication for 180.101.88.240 port 13463
May 03 04:56:35 b1 sshd[2382268]: fatal: Timeout before authentication for 180.101.88.240 port 30835
May 03 04:56:41 b1 sshd[2382274]: fatal: Timeout before authentication for 180.101.88.240 port 57565
May 03 05:32:42 b1 sshd[2390586]: error: kex_exchange_identification: banner line contains invalid characters
May 03 05:33:00 b1 sshd[2390587]: error: kex_exchange_identification: Connection closed by remote host
May 03 05:33:01 b1 sshd[2390651]: error: Protocol major versions differ: 2 vs. 1



I've changed all virtual machines to use fewer cores in total and limited the CPU (I also changed some AMD options). It then rebooted after 16 days instead of every day or two, so I guess the issue is in how Proxmox manages (or fails to manage) the "host" CPU cores.
Could that be it?
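
For anyone who wants to try the same, cores and CPU limit can be changed per VM with qm; the VM ID and values below are placeholders, not my actual configuration:

Code:
# cap the VM at 4 vCPUs and throttle it to the equivalent of 2 cores of CPU time
qm set 101 --cores 4 --cpulimit 2
# verify the result
qm config 101 | grep -E '^(cores|cpulimit|cpu):'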

I don't know where else to look to solve this.
 
journalctl -b 1 -p err -n 25
The missing minus-sign ("-1") is relevant. You queried the first boot process in history - which was at May 03 ;-)

"-1" means the previous boot, before the currently running system has been booted. That is the one which initiated the Reboot. man journalctl

And no, I do not see any hint in your shown output...
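
For reference, the recorded boots and their offsets can be listed; the output below is only illustrative, not from your machine:

Code:
journalctl --list-boots
# IDX BOOT ID                                    FIRST ENTRY    LAST ENTRY
#  -1 <id of the boot that ended in the reboot>  ...            ...
#   0 <id of the currently running boot>         ...            ...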
 
The missing minus-sign ("-1") is relevant. You queried the first boot process in history - which was at May 03 ;-)

"-1" means the previous boot, before the currently running system has been booted. That is the one which initiated the Reboot. man journalctl

And no, I do not see any hint in your shown output...

Oh, I had copied the command with "0" and just changed the number. Here it is with "-1":

Code:
root@b1:~# journalctl -b -1 -p err  -n 25
Jun 27 07:35:01 b1 kernel: ERROR: Unable to locate IOAPIC for GSI -1
Jun 27 07:35:01 b1 kernel: amd_gpio AMDI0030:00: Invalid config param 0014
Jun 27 07:35:02 b1 kernel: snd_hda_intel 0000:2e:00.4: no codecs found!
Jun 27 07:35:25 b1 kernel: kvm [1514]: ignored rdmsr: 0x3a data 0x0
Jun 27 07:35:25 b1 kernel: kvm [1514]: ignored rdmsr: 0xd90 data 0x0
Jun 27 07:35:25 b1 kernel: kvm [1514]: ignored rdmsr: 0x570 data 0x0
Jun 27 07:35:25 b1 kernel: kvm [1514]: ignored rdmsr: 0x571 data 0x0
Jun 27 07:35:25 b1 kernel: kvm [1514]: ignored rdmsr: 0x572 data 0x0
Jun 27 07:35:25 b1 kernel: kvm [1514]: ignored rdmsr: 0x560 data 0x0
Jun 27 07:35:25 b1 kernel: kvm [1514]: ignored rdmsr: 0x561 data 0x0
Jun 27 07:35:25 b1 kernel: kvm [1514]: ignored rdmsr: 0x580 data 0x0
Jun 27 07:35:25 b1 kernel: kvm [1514]: ignored rdmsr: 0x581 data 0x0
Jun 27 07:35:25 b1 kernel: kvm [1514]: ignored rdmsr: 0x582 data 0x0
Jun 27 07:36:44 b1 kernel: kvm [2275]: ignored rdmsr: 0x3a data 0x0
Jun 27 07:36:44 b1 kernel: kvm [2275]: ignored rdmsr: 0xd90 data 0x0
Jun 27 07:36:44 b1 kernel: kvm [2275]: ignored rdmsr: 0x570 data 0x0
Jun 27 07:36:44 b1 kernel: kvm [2275]: ignored rdmsr: 0x571 data 0x0
Jun 27 07:36:44 b1 kernel: kvm [2275]: ignored rdmsr: 0x572 data 0x0
Jun 27 07:36:44 b1 kernel: kvm [2275]: ignored rdmsr: 0x560 data 0x0
Jun 27 07:36:44 b1 kernel: kvm [2275]: ignored rdmsr: 0x561 data 0x0
Jun 27 07:36:44 b1 kernel: kvm [2275]: ignored rdmsr: 0x580 data 0x0
Jun 27 07:36:44 b1 kernel: kvm [2275]: ignored rdmsr: 0x581 data 0x0
Jun 27 07:36:44 b1 kernel: kvm [2275]: ignored rdmsr: 0x582 data 0x0
root@b1:~#

Any clues from this? There are so many errors, but I'm not sure how to analyze them :(

Maybe it's because I'm using the "host" CPU and should use a virtual one? (But now it restarted after 16 days, whereas previously it was every day.)
 
Are you doing any passthrough to the Windows VM?

If you are; try removing it & see what happens. The errors you show above are usually linked to passthrough.
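
A quick way to confirm what the VM is actually configured with is to look at its config; the VM ID below is a placeholder:

Code:
# hostpci lines mean PCI(e) passthrough; cpu and net lines show the CPU model and NIC model
qm config 101 | grep -E '^(hostpci|cpu|net)'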
 
No, sorry :-(
Thanks anyways!

Are you doing any passthrough to the Windows VM?

If you are; try removing it & see what happens. The errors you show above are usually linked to passthrough.
As far as I know, the only passthrough I'm using is:
- CPU (as host)
- Network (Intel driver)

Could this be the issue? Should I change the network to paravirtualized and the CPU to a virtual model?
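
If that could be it, this is roughly what I would change for testing. The VM ID and bridge below are placeholders; VirtIO networking needs the virtio-win drivers installed inside the Windows guest first, and Docker in a Windows guest generally relies on nested virtualization (Hyper-V/WSL2), which a plain virtual CPU model may not expose:

Code:
# switch from "host" to a virtual CPU model (x86-64-v2-AES is what the PVE 8 wizard suggests)
qm set 101 --cpu x86-64-v2-AES

# switch the NIC to paravirtualized VirtIO; without an explicit macaddr a new MAC is generated
qm set 101 --net0 virtio,bridge=vmbr0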
 
Watching with anticipation; we've started seeing these random reboots showing only "-- Reboot --" in the syslog on a PVE host with Windows 11 VMs, using the following hardware...

- Proxmox VE 8.2.4
- CPU: AMD Ryzen 7900 12 Core 5.4Ghz
- MB: Gigabyte B650M DS3H
- Memory: Corsair Vengeance 96GB (2x48GB) C40 5600MHz
- Video: Geforce GT 730 (Passed through to 1 Windows 11 VM)
- Network: TP-Link 10Gbps PCIe Ethernet Network Card (TX401)
- VM's Drive: Samsung 990 Pro 4TB PCIe 4.0 M.2 2280 NVMe SSD
- Proxmox Install Drive: Samsung 980 Pro 1TB PCIe Gen4 M.2 2280 NVMe SSD

I have made sure to turn this off in the BIOS (see the attached screenshot), and the BIOS is up to date.

(screenshot of the BIOS setting attached)
 
Dear all,

I noticed and found a couple of things:

1st: I had an LXC container (just one); after stopping it, the reboots happen only every 15 days or so instead of daily.
2nd: There is a driver issue with Realtek NICs. I haven't applied the fix yet, but it is this one: https://www.youtube.com/watch?v=To_hXK10Do8
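
To check which driver the onboard NIC is actually bound to (no system-specific values assumed):

Code:
# lists each network controller together with the kernel driver in use
lspci -nnk | grep -iA3 ethernet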

@protocolnebula Thanks for the suggestion, I do have a Realtek onboard network card that I'm not using. I might try turning it off in the BIOS and see if that makes a difference.
 
@protocolnebula Thanks for the suggestion, I do have a Realtek onboard network card that I'm not using. I might try turning it off in the BIOS and see if that makes a difference.
Did you solve anything?

I never solved it after a lot of tries (I never tested the driver fix because I had real customers working on the server).

What was my solution? 11 days ago I migrated to a new Intel machine (from AMD); since then, no reboot has happened. I hope it reboots at most once a month (or never).
 
Did you solve anything?

I never solved it after a lot of tries (I never tested the driver fix because I had real customers working on the server).

What was my solution? 11 days ago I migrated to a new Intel machine (from AMD); since then, no reboot has happened. I hope it reboots at most once a month (or never).
Of the three identical servers I still have one that likes to reboot every now and then. I've seen uptimes of more than 20 days before it reboots. The other two have uptimes of over 44 days now. I had a WoL script (running as a daemon service) that I was suspicious of; I turned that script off and have had much longer uptimes since. I'm also running Linux 6.8.12-2-pve on all three servers; I've just noticed 6.8.12-3-pve is available, so I'll give that a try on the third server.
 
Did you solve anything?

I never solved it after a lot of tries (I never tested the driver fix because I had real customers working on the server).

What was my solution? 11 days ago I migrated to a new Intel machine (from AMD); since then, no reboot has happened. I hope it reboots at most once a month (or never).
I didn't solve it. Did not have the time for more testing. I moved the failing VM to another machine (Intel CPU now) and don't have problems anymore. I just don't use the AMD machine for such VMs anymore.
 
I didn't solve it. Did not have the time for more testing. I moved the failing VM to another machine (Intel CPU now) and don't have problems anymore. I just don't use the AMD machine for such VMs anymore.
So we both ended up with the same solution, haha
15 days without a reboot!
 
