[HELP] NODE KEEPS RESTARTING PVE 7.0

didi767 · Apr 11, 2022

Hello everyone,

I moved a passthrough disk to another VM, and the web GUI froze and then forced a reboot. The issue is that the system is now stuck in an endless loop:
1) the server boots, and I can see my ASRock logo and boot options
2) I can see the countdown to enter PVE
3) my NICs stop
4) the system reboots

I tried to use the rescue boot but it just fails with a message:
"error: compression algorithm inherit not supported" (this message appears 4 times)
"error: unable to find boot disk automatically"
"press any key to continue..."

So it doesn't even reach the point where I can investigate the conf files.
Another thing: if I wait a bit and then choose the "rescue boot", it runs and then abruptly restarts (I can see that the NIC lights go out at the same time).

So now I have no idea how to access the files, or how to start troubleshooting this issue.
I'm not an expert, and I'm not sure whether I did something wrong or this is just bad luck.

Any suggestions?

[UPDATE]
I've managed to log in briefly via SSH to the node.
I ran zpool status and this was the result (I unplugged one of my RAID drives, so I guess that's why it shows "degraded"?).
But I'm still not able to log in to the web UI, and it just rebooted again.

Linux proxmox 5.11.22-4-pve #1 SMP PVE 5.11.22-8 (Fri, 27 Aug 2021 11:51:34 +0200) x86_64

The programs included with the Debian GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.
Last login: Mon Apr 11 16:52:55 2022 from 192.168.0.41
root@proxmox:~# zpool status
  pool: local_vm_storage_1tb_sam
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error. An
        attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub repaired 0B in 00:10:57 with 0 errors on Sun Feb 13 00:34:58 2022
config:

        NAME                                           STATE     READ WRITE CKSUM
        local_vm_storage_1tb_sam                       DEGRADED     0     0     0
          ata-Samsung_SSD_860_EVO_1TB_S599NZFNA00416T  DEGRADED     0     0     0  too many errors

errors: No known data errors

  pool: rpool
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
        invalid. Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-4J
  scan: resilvered 141M in 00:00:01 with 0 errors on Mon Apr 11 14:28:09 2022
config:

        NAME                                                  STATE     READ WRITE CKSUM
        rpool                                                 DEGRADED     0     0     0
          mirror-0                                            DEGRADED     0     0     0
            ata-KINGSTON_SA400S37120G_50026B778404BFFD-part3  ONLINE       0     0     0
            7041556851563507203                               UNAVAIL      0     0     0  was /dev/disk/by-id/ata-INTEL_SSDSC2CW120A3_CVCV224405EK120BGN-part3

errors: No known data errors
root@proxmox:~#

I'm stuck.

_________________________________________________________________________________________________

Another update (some progress, I guess?):
I managed to get about 10 minutes of SSH time, and I can access the files via SCP.
It's not enough time for me to create backups of my VMs, as the connection drops unexpectedly and ruins the backup.
What should I copy from the node?
Maybe there's still a chance for me to recover my existing environment?
If I reinstall PVE, would the data on the other storage drives be accessible without formatting or initializing the drives (like in Windows)?
If that's the case, I wouldn't mind just reinstalling and attaching the drives, since all my actual data is on them. (I learned my lesson to keep data separate from the actual VM, so in cases like these I might have a silver lining.)

Please advise!

Regards,
Didi

mira · Apr 12, 2022

Just to be clear, you tried using the Rescue Boot from the ISO, but it couldn't find any disk?
For debugging, you might be able to use the Debug mode. Simply press Ctrl+D until you get to the actual installer.
Once there, press it again and you should be in a state where you have both network, and access to the ZFS tools to check your disks.

Does the network disconnect here as well?

This issue started right after passing a disk through to a VM?
Did you use PCI(e) passthrough?

didi767 · Apr 12, 2022

Hi @mira,

Thanks for your response!

So I made some progress (after ~8h of troubleshooting).
I updated and upgraded PVE, and it seemed to prolong the online time that I have (accessing via SSH and retrieving files), but the web GUI is still not available, no matter what I try. The funny thing is that some of the VMs were actually running in the background, as I could RDP to my Windows servers.

I tried to use the Rescue Boot from the ISO, but it still shows those error messages. But now it seems like it's working and I get the Proxmox login prompt (and it shows the web address as well, but still no access). Then after a while it reboots again (it could be seconds or minutes; at one point it ran for an hour without rebooting by itself).
I've also noticed that at one point I tried to reboot one VM and it caused another reboot. But I'm not sure if that was the cause or if it just coincided with the random reboots that I have.

Another thing that I tried was to turn off auto boot for all the VMs so I could determine if this is coming from a specific VM, but the issue persists.
I also disabled IOMMU just in case and stopped any passthrough that was originally on the VM prior to the reboots.
The issue started when I moved a passthrough disk from one VM to another (I turned off the origin VM before I switched the disk to the new one, if that makes any difference). Then, when I started the other Windows server, it just rebooted the PVE node.

The network disconnects every time on those reboots. (I've noticed that when I start the server after disconnecting the power plug, the network card is blinking, but after the first reboot it never comes back unless I do a complete shutdown. So I guess something is disabling the network card after the first reboot :/)
I thought maybe it would be better to rebuild the PVE node, but wasn't sure which configuration I could grab quickly via SCP.
Another question I had: can those storage drives that I pass through be used later on another node without formatting them or initializing the disks? Is it as simple as just assigning them to the VMs?

_______________________________________
UPDATE: So no matter what I try, I can't get back the SSH access I had. It seems like it's in the reboot loop again (constantly, and not with those intervals that I mentioned).
I tried to run the Rescue Boot but it almost immediately reboots. Same goes for the install with debug mode: it just restarts no matter what my choice is.
Is there an action during boot that triggers this reboot? It must be something like that; I can't find any other explanation.

UPDATE: I've noticed that this block of lines repeats on each reboot (from the daemon.log file). I tried to paste it here but it says I have too many words, so I'm attaching the log file.

mira · Apr 13, 2022

This sounds like a hardware issue.
Check your PSU, the RAM and also the NIC. If your PSU is starting to fail, it can lead to strange symptoms.

didi767 · Apr 13, 2022

The thing is that if I boot into the PVE installation window and just stay there for an hour, there is no reboot (I also suspected that maybe something is wrong with the hardware, but then it should happen regardless of PVE).
What files do I need in order to restore the node to the same configuration I had?
I managed to copy a lot of the folders and files, but I am not sure which ones would be useful for restoration.

mira · Apr 14, 2022

Copy /etc/hostname, /etc/hosts, /etc/network/interfaces and everything inside /etc/pve.

Where are your VM disks located? On the same disk(s) your root is located?
If so, you'd need backups of those VMs, or at least copy the images/disk data somewhere else.
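
A minimal sketch of grabbing the files listed above in one archive, so a short SSH window is enough to pull a single file off with SCP. It assumes a POSIX shell with tar on the node; collect_pve_config and the archive name are hypothetical, not part of Proxmox VE. Keep in mind that /etc/pve is only populated while the node's cluster filesystem service is running, so it will look empty from rescue media.

```shell
# collect_pve_config [ROOT] [OUT]
# Hypothetical helper: archives the node config files named in this thread.
# ROOT defaults to / ; pass another prefix to run against a copied tree.
collect_pve_config() {
    root="${1:-/}"
    out="${2:-pve-config.tar.gz}"
    set --   # reuse the positional parameters as the list of paths to archive
    for rel in etc/hostname etc/hosts etc/network/interfaces etc/pve; do
        if [ -e "${root%/}/$rel" ]; then
            set -- "$@" "$rel"
        else
            # A missing file should not abort the whole backup.
            echo "skipping missing path: $rel" >&2
        fi
    done
    if [ "$#" -eq 0 ]; then
        echo "nothing to archive" >&2
        return 1
    fi
    # Paths are stored relative to ROOT so the archive can be unpacked anywhere.
    tar -czf "$out" -C "${root%/}/" "$@"
}
```

Something like `collect_pve_config / /root/pve-config.tar.gz` on the node, followed by copying that one file away, would be the intended use.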

didi767 · Apr 14, 2022

mira said:
Copy /etc/hostname, /etc/hosts, /etc/network/interfaces and everything inside /etc/pve.

Where are your VM disks located? On the same disk(s) your root is located?
If so, you'd need backups of those VMs, or at least copy the images/disk data somewhere else.

All my VM disks are on another SSD drive that I dedicated just for that. Does that mean that I can just install the node and re-attach the VMs' disks?

mira · Apr 15, 2022

Yes, just make sure you have a copy of the VM configuration files in /etc/pve/qemu-server and to use the same storage name for the SSD in your new installation, so that the configs don't have to be updated.
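
To see which storage names the saved VM configs actually reference before reinstalling, something like the following may help. It is only a sketch: it assumes disk lines of the form `scsi0: <storage>:<volume>,options`, and list_vm_storages is a made-up helper name. Passthrough disks given as /dev/... paths are skipped on purpose, and ISO images attached as CD-ROMs will show up with their storage too.

```shell
# list_vm_storages [DIR]
# Hypothetical helper: prints the unique storage names used by disk lines
# in the VM config files under DIR (default: /etc/pve/qemu-server).
list_vm_storages() {
    dir="${1:-/etc/pve/qemu-server}"
    # Keep only disk-like keys (scsi0, virtio1, ...) and capture the text
    # before the first colon of the value. Values starting with "/" are raw
    # device paths and values without a colon (e.g. "none") are not storages.
    cat "$dir"/*.conf 2>/dev/null |
        sed -n -E 's/^(ide|sata|scsi|virtio|efidisk|tpmstate|unused)[0-9]+: ([^:,/]+):.*/\2/p' |
        sort -u
}
```

Running it against the copied qemu-server folder, for example `list_vm_storages ./qemu-server`, gives the names that the new installation's storages would need to match.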

didi767 · Apr 17, 2022

mira said:
Yes, just make sure you have a copy of the VM configuration files in /etc/pve/qemu-server and to use the same storage name for the SSD in your new installation, so that the configs don't have to be updated.

I'm not sure whether you are right about a hardware failure, but I couldn't even install Windows from scratch on a new drive. I ordered a new MBO and I'll see if just replacing it solves the problem. Regarding the PSU, I did check the voltages in the BIOS and they looked fine.
Hopefully it will be as simple as that, just replacing the MBO instead of trying to rebuild everything.
I'll update you soon with the results.
Thanks again for your quick responses!

didi767 · Apr 18, 2022

So I see the same issue even after I replaced my MBO :/
I guess it wasn't a faulty MBO after all. Not sure what else to think.

_______________________
UPDATE:
It seems like I'm able to boot into the debug mode (from the PVE ISO).
But now, whenever I try to display the NICs, it gets stuck, and after a few minutes it shows some lines, but not the actual network config (at least it doesn't reboot or lose the NIC connectivity LED).
I attached the screenshot.

"ip a" shows only the loopback int :/

mira · Apr 19, 2022

Onboard NIC or PCIe card?
Can you disconnect it and see if the machine is stable then?

NIC issues could be related to a faulty PSU. Had that case some time ago here in the forum.

didi767 · Apr 19, 2022

I have both, onboard and PCIe, I tried to remove the PCIe one, and still the same (about the onboard, now it's a new MBO so we can eliminate that as well).
Just did a memory test and it was flawless, so nothing is wrong there.
I ordered a new PSU to see if this is the culprit (and hopefully it would be because there's nothing else that I didn't test out).
I'll update soon.

didi767 · Apr 21, 2022

I replaced the PSU, still the same.

UPDATE:
Even with no drive connected, when I try to boot into a live CD (Ubuntu), it reboots as soon as it starts to load.
wth is going on? :/

new MBO
new PSU
memory passed 4 passes!
no drives.

what???? lol

didi767 · Apr 21, 2022

I even changed the CPU and still the same issue.

didi767 · Apr 21, 2022

I don't know what I did, but now I have a stable SSH connection and can access the drives. The web GUI is still not available, no matter what I try:
cleared the browser cache
apt install --reinstall pve-manager proxmox-widget-toolkit libjs-extjs
stopped the firewall
(btw, I had to update the interface to the new ID/name in order to re-establish the connection)
checked with the PowerShell tnc (Test-NetConnection) command: port 8006 is open and listening

I tried to go over logs but I keep seeing this message:

Apr 12 21:32:11 proxmox pveproxy[4206]: unable to open log file '/var/log/pveproxy/access.log' - Permission denied

At least we're making progress

I was about to color the server/PC blue and throw it into the ocean.

mira · Apr 22, 2022

That's really strange.

Could you provide the output of the following commands?
ls -l / | grep var
ls -l /var | grep log
ls -l /var/log | grep pveproxy
ls -l /var/log/pveproxy
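
The same check can be done in one pass over every directory on the way to the log file. This is a rough sketch that assumes GNU stat, as shipped with Debian; show_path_perms is a hypothetical helper name. pveproxy runs as the www-data user, so a path component that www-data cannot enter or write to would be one plausible explanation for the "Permission denied" message, though that is only a guess until the output is known.

```shell
# show_path_perms PATH
# Hypothetical helper: prints "owner:group mode path" for PATH and for each
# parent directory up to the root, so a wrong owner or mode stands out.
show_path_perms() {
    path="$1"
    while :; do
        stat -c '%U:%G %a %n' "$path"
        # Stop at the filesystem root, or at "." for relative paths.
        case "$path" in
            /|.) break ;;
        esac
        path=$(dirname "$path")
    done
}
```

For this case the call would be `show_path_perms /var/log/pveproxy/access.log`.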
