Some vms on node caught in "restart" loop

whiggs

New Member
Dec 11, 2024
Hello all. I am running into a very annoying issue. Several (not all) of the Windows VMs running on a particular host are caught in a restart loop. By restart loop, I mean the VMs boot successfully, load the desktop, and then immediately restart (the restart screen shows and everything). See the video below for the VM behavior:

https://youtube.com/shorts/X-oGfo2Vp9o

Here is the interesting part. If I power off the virtual machines, remove their virtual NICs, and then power them back on, they don't reboot. I can then go into the Hardware tab, re-add the virtual NIC to the VM, and everything seems fine. That is, until I restart the VM again; then the VMs go right back into the boot loop. I don't understand what is going on. I did take a look at the log for the node, and there does appear to be something there, but I am not sure how to interpret it. Can anyone help me figure out what is going on?
 


Hello,

the attached log contains reports of corrected errors from your Broadcom NetXtreme BCM5720 Gigabit Ethernet network adapter. If it is a Dell or HP server, it might still be worth running the vendor's hardware diagnostics.

Code:
Jul 28 17:09:43 bigprox kernel: {89}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
Jul 28 17:09:43 bigprox kernel: {89}[Hardware Error]: It has been corrected by h/w and requires no further action
Jul 28 17:09:43 bigprox kernel: {89}[Hardware Error]: event severity: corrected
Jul 28 17:09:43 bigprox kernel: {89}[Hardware Error]:  Error 0, type: corrected
Jul 28 17:09:43 bigprox kernel: {89}[Hardware Error]:   section_type: PCIe error
Jul 28 17:09:43 bigprox kernel: {89}[Hardware Error]:   port_type: 0, PCIe end point
Jul 28 17:09:43 bigprox kernel: {89}[Hardware Error]:   version: 3.0
Jul 28 17:09:43 bigprox kernel: {89}[Hardware Error]:   command: 0x0546, status: 0x0010
Jul 28 17:09:43 bigprox kernel: {89}[Hardware Error]:   device_id: 0000:5e:00.0
Jul 28 17:09:43 bigprox kernel: {89}[Hardware Error]:   slot: 0
Jul 28 17:09:43 bigprox kernel: {89}[Hardware Error]:   secondary_bus: 0x00
Jul 28 17:09:43 bigprox kernel: {89}[Hardware Error]:   vendor_id: 0x14e4, device_id: 0x1657
Jul 28 17:09:43 bigprox kernel: {89}[Hardware Error]:   class_code: 020000
Jul 28 17:09:43 bigprox kernel: tg3 0000:5e:00.0: AER: aer_status: 0x00000080, aer_mask: 0x00003000
Jul 28 17:09:43 bigprox kernel: tg3 0000:5e:00.0:    [ 7] BadDLLP               
Jul 28 17:09:43 bigprox kernel: tg3 0000:5e:00.0: AER: aer_layer=Data Link Layer, aer_agent=Receiver ID
Jul 28 17:09:46 bigprox kernel: tg3 0000:5e:00.0: AER: aer_status: 0x00000080, aer_mask: 0x00003000
Jul 28 17:09:46 bigprox kernel: tg3 0000:5e:00.0:    [ 7] BadDLLP               
Jul 28 17:09:46 bigprox kernel: tg3 0000:5e:00.0: AER: aer_layer=Data Link Layer, aer_agent=Receiver ID
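
To get a feel for how often these messages occur and which error bits are involved, something like the following could help. It is only a minimal sketch: the helper names (`decode_bits`, `summarize`) are made up for illustration, and it assumes the `aer_status` and `aer_mask` values belong to the PCIe AER correctable error registers (the log above reports the events as corrected), using the short bit names the Linux kernel prints.

```python
import re

# Bit positions of the PCIe AER Correctable Error Status/Mask registers,
# labelled with the short names the Linux kernel prints (assumption: the
# logged values are correctable-error registers, as the log above suggests).
CORRECTABLE_BITS = {
    0: "RxErr",
    6: "BadTLP",
    7: "BadDLLP",
    8: "Rollover",
    12: "Timeout",
    13: "NonFatalErr",
    14: "CorrIntErr",
    15: "HeaderOF",
}

# Matches kernel lines such as:
#   tg3 0000:5e:00.0: AER: aer_status: 0x00000080, aer_mask: 0x00003000
AER_LINE = re.compile(
    r"(?P<dev>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): AER: "
    r"aer_status: 0x(?P<status>[0-9a-fA-F]+), "
    r"aer_mask: 0x(?P<mask>[0-9a-fA-F]+)"
)


def decode_bits(value):
    """Return the names of all known bits set in a register value."""
    return [name for bit, name in sorted(CORRECTABLE_BITS.items())
            if value & (1 << bit)]


def summarize(lines):
    """Count reported error bits per PCI device and list the masked bits."""
    summary = {}
    for line in lines:
        match = AER_LINE.search(line)
        if not match:
            continue  # not an AER status line
        entry = summary.setdefault(match["dev"], {"reported": {}, "masked": []})
        for name in decode_bits(int(match["status"], 16)):
            entry["reported"][name] = entry["reported"].get(name, 0) + 1
        entry["masked"] = decode_bits(int(match["mask"], 16))
    return summary


# Two of the lines from the log excerpt above.
sample = [
    "Jul 28 17:09:43 bigprox kernel: tg3 0000:5e:00.0: AER: aer_status: 0x00000080, aer_mask: 0x00003000",
    "Jul 28 17:09:46 bigprox kernel: tg3 0000:5e:00.0: AER: aer_status: 0x00000080, aer_mask: 0x00003000",
]
print(summarize(sample))
```

For the two sample lines this reports BadDLLP twice on 0000:5e:00.0. To run it over the whole kernel log, you could feed it the lines of `journalctl -k` output instead of `sample`.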

For the restarts: Is there any related event shown in the Windows logs? The reboot event might contain a reason or some event nearby might shed some light.
Event IDs to look for (source):
  • Event ID 41: This event indicates that Windows restarted without a complete shutdown.
  • Event ID 1074: This event is logged when an application is responsible for the system shutdown or restart. It also indicates when a user restarted or shut down the system by using the Start menu or by pressing Ctrl+Alt+Del.
  • Event ID 6006: This event indicates that Windows was shut down cleanly.
  • Event ID 6008: This event indicates an improper or dirty shutdown. It is logged when the most recent shutdown was unexpected.
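
If clicking through Event Viewer is tedious, the events listed above could also be pulled from the System log inside the guest. The following is a rough sketch rather than a tested procedure: it assumes Python is installed in the Windows VM and that the built-in `wevtutil` tool is on the PATH. The helper names `build_query` and `build_command` are made up for illustration.

```python
import subprocess
import sys

# The event IDs listed above.
EVENT_IDS = (41, 1074, 6006, 6008)


def build_query(event_ids):
    """Build an XPath filter that matches any of the given event IDs."""
    clause = " or ".join(f"EventID={event_id}" for event_id in event_ids)
    return f"*[System[({clause})]]"


def build_command(event_ids=EVENT_IDS, count=20):
    """Build a wevtutil call returning the newest matching System events."""
    return [
        "wevtutil", "qe", "System",          # qe = query-events on the System log
        f"/q:{build_query(event_ids)}",      # XPath filter on the event ID
        f"/c:{count}",                       # limit the number of events
        "/rd:true",                          # newest events first
        "/f:text",                           # human-readable output
    ]


# Only try to run the query when executed directly on a Windows guest.
if __name__ == "__main__" and sys.platform == "win32":
    result = subprocess.run(build_command(), capture_output=True, text=True)
    print(result.stdout)
```

The events closest to the time of a restart are the interesting ones, in particular whether 1074 names a process or user that requested the reboot.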

Could you share the contents of /etc/network/interfaces and the config of one of the affected VMs (qm config <vmid>), so we can better understand the setup?