Proxmox crashed constantly pve/data corrupted

Vin

New Member
Mar 6, 2023
Hello everybody,

Proxmox keeps crashing, even though I have set everything up multiple times.

After a crash, the pve volume group is often corrupted. At first it can be repaired with lvconvert --repair pve/data, but eventually the corruption becomes persistent and I have to set everything up all over again.

Can anybody make sense of the logs, or tell me how to troubleshoot the issues below and the crashes?

Thank you in advance

dmesg
https://pastebin.com/HnP5SF3v

journal
https://pastebin.com/ghGvbLjU

root@Snake:~# vgchange -a y pve
Check of pool pve/data failed (status:1). Manual repair required!
2 logical volume(s) in volume group "pve" now active
root@Snake:~# lvchange -a y pve/data
Check of pool pve/data failed (status:1). Manual repair required!
root@Snake:~# lvconvert --repair pve/data
Child 4234 exited abnormally
Repair of thin metadata volume of thin pool pve/data failed (status:-1). Manual repair required!
 
Most probably a hardware error (disk? controller?). AFAIU the LVM volume group pve becomes corrupt after a while. It may help to look at LVM's metadata history, which is kept in the archive and backup files:
Code:
cat /etc/lvm/archive/*
cat /etc/lvm/backup/*
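To get an overview without reading every file in full, the description and creation_time fields that LVM writes at the top of each metadata file can be pulled out with grep. This is only a sketch: the function name lvm_history is made up for illustration, and it assumes the default locations used above.

```shell
# Hypothetical helper: print which command produced each LVM metadata
# archive/backup file and when, instead of dumping the files in full.
lvm_history() {
    # -H prefixes each match with the file name. Only the unindented,
    # top-level fields are matched, not the per-volume creation_time lines.
    grep -H -E '^(description|creation_time) = ' "$@"
}

# Usage on the host (needs root):
#   lvm_history /etc/lvm/archive/* /etc/lvm/backup/*
```

vgcfgrestore --list pve should print a similar overview for the volume group.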
 
I have reinstalled the entire setup multiple times by now.

I also just replaced the NVMe with a brand new one.
I installed nothing but Proxmox and it crashed again, with a corrupted file system.

So unfortunately I don't have log files anymore.

Is there a way to see the I/O errors in a running system?

Also, I apparently have some corrupted sections on my NAS HDDs. Can those lead to the crashes?
I tried to fix them via gparted, but I just ran into errors.
 

Attachments: 1.png, 2.png, 3.png, 5.png, 6.png
I have reinstalled the entire setup multiple times by now.

I also just replaced the NVMe with a brand new one.
I installed nothing but Proxmox and it crashed again, with a corrupted file system.

So unfortunately I don't have log files anymore.

Is there a way to see the I/O errors in a running system?

Then try booting from external live media and inspect the log files left over from the previous Proxmox installation.
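As a rough sketch of that: once the old root filesystem is mounted from the live system, journalctl can read the journal of the previous installation through its --directory option (provided a persistent journal was kept there), and a small filter narrows the kernel messages down to storage problems. The mount point /mnt/oldroot, the function name io_errors, and the pattern list are assumptions for illustration, not a complete list of relevant messages.

```shell
# Hypothetical filter: keep only log lines that typically point at
# storage trouble. The pattern list is illustrative, not exhaustive.
io_errors() {
    grep -i -E 'I/O error|nvme.*(timeout|reset)|soft lockup|hung_task|blocked for more than'
}

# Usage from a live system, assuming the old root is mounted at /mnt/oldroot:
#   journalctl --directory=/mnt/oldroot/var/log/journal _TRANSPORT=kernel | io_errors
# Usage on the running host:
#   dmesg -T | io_errors
```

The same filter also covers the question about a running system, since it works on any stream of kernel messages.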

Also, I apparently have some corrupted sections on my NAS HDDs. Can those lead to the crashes?
I tried to fix them via gparted, but I just ran into errors.
The NAS is used for data storage only (isn't it?), so it cannot be the cause of the Proxmox crashes you reported. To rule out any influence from the NAS, I suggest running Proxmox without it for the moment (and configuring it later, once Proxmox is stable).
 
I reinstalled everything from scratch, and it still crashes.

dmesg Proxmox
https://pastebin.com/AkPiDT6j

Regarding the NAS: I pass the disks through to a Debian VM with OMV installed on it.
For this particular Proxmox dmesg, I had passed only one SSD to the VM.

I suspect an I/O problem caused by I/O load, as described years ago here:
https://bugzilla.kernel.org/show_bug.cgi?id=199727#c0

I changed all my VM disks to VirtIO SCSI single, Cache = Write back, Discard = 1, IO Thread = 1, Async IO = threads, SSD emulation.
 
It's impossible to use Proxmox.

I really do think it's about the I/O handling of Proxmox itself.


As stated above, I already set my HDDs and NVMe drives to IO Thread and Async IO = threads.
I still get a high IO delay of around 40-50 when I copy things between HDDs.
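For cross-checking that figure outside the GUI, the iowait share can be computed from two samples of the aggregate cpu line in /proc/stat. This is a minimal sketch: the function name iowait_pct is made up, and it assumes that the IO delay shown by Proxmox corresponds roughly to this iowait percentage.

```shell
# Hypothetical helper: percentage of CPU time spent in iowait between two
# samples of the aggregate "cpu" line from /proc/stat.
# Fields after the label: user nice system idle iowait irq softirq steal ...
iowait_pct() {
    # $1 = earlier sample line, $2 = later sample line
    printf '%s\n%s\n' "$1" "$2" | awk '
        { total = 0; for (i = 2; i <= 9; i++) total += $i
          t[NR] = total; w[NR] = $6 }
        END { d = t[2] - t[1]
              if (d > 0) printf "%.1f\n", 100 * (w[2] - w[1]) / d
              else print "0.0" }'
}

# Usage on the host: sample twice, 5 seconds apart, while copying.
#   a=$(head -n 1 /proc/stat); sleep 5; b=$(head -n 1 /proc/stat)
#   iowait_pct "$a" "$b"
```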

I pass specific SATA HDDs and NVMe drives through to the VM without an HBA.


The Debian VM where OMV is located also crashes constantly:

Message from syslogd@debian at Apr 4 12:24:53 ...
kernel:[ 1071.553173] watchdog: BUG: soft lockup - CPU#6 stuck for 26s! [kworker/6:1:119]

Message from syslogd@debian at Apr 4 12:25:21 ...
kernel:[ 1099.552508] watchdog: BUG: soft lockup - CPU#6 stuck for 53s! [kworker/6:1:119]
 
Since Proxmox is a Debian-based environment, just try installing a standard Debian kernel on your system. You can then check whether your system has trouble running Debian in general or whether the Proxmox kernel causes the issue on your side.

I am encountering a similar issue on our side when running Proxmox with few CPUs on a VMware cluster. Under heavy disk activity the system stops and reboots.

https://forum.proxmox.com/threads/proxmox-7-3-host-always-reboots-on-snapshot-via-vmware.125269/
 
Aloha @Vin, were you ever able to fix your situation?
I am having the same issue. I've tried multiple installations of Proxmox, and it'll be fine for about a week or two, and then it begins showing extremely high I/O waits and locking up. The output is similar to your dmesg and journal. Once it has locked up, SSH and local commands stop working, and all LXC containers freeze.
It's a weird issue for me, as it seems it could be NVMe and network related. When I disconnect from the network and connect a monitor to access the CLI directly, it seems to stay accessible longer than over the network, in that I can run more commands before it freezes up.
I can't even back up containers, even when using bwlimit.
I tried different iterations of this command:
vzdump xxx --dumpdir /xxx/xxx --mode stop --bwlimit 1024 --compress lzo

It's happening on 2 machines with similar HW configs, so maybe it's just the NVMe drives o_O
 
