Yesterday I had a major data crash on one of my Proxmox nodes.
Almost all VMs were corrupted and no longer work. It was an awful day of recovering data and accepting some data loss.
I'm worried this could happen again, and I hope someone can tell me what went wrong and/or what I can do or test to find out whether there are still risks or problems.
In the syslog I can find the moment it went wrong. As you can see, at 9:24:56 the swap was added, and at 9:26:11 there is a segfault. I guess the two are related.
What happened? How can I find out?
Pastebin syslog extract
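To narrow this down, something like the following could pull only the relevant lines out of the log around those timestamps. It is a minimal sketch, not a definitive method: the path /var/log/syslog, the function name scan_syslog and the keyword list are all assumptions and may need adjusting.

```shell
#!/bin/sh
# Hypothetical helper: print syslog lines that mention swap activation,
# segfaults, the OOM killer, or I/O errors.
# The keyword list is an assumption and may need extending.
scan_syslog() {
    logfile="${1:-/var/log/syslog}"   # assumed default log location
    grep -i -E 'adding [0-9]+k swap|segfault|out of memory|oom-killer|I/O error' "$logfile"
}

# Usage: scan_syslog /var/log/syslog
```

Lines that show up between the swap activation and the segfault would be the first place to look.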
HP DL380 Gen10 with RAID
HP Smart Array P408i-a SR Gen10 (BBU / everything is ok according to controller, RAID10)
Linux vrt14 5.0.21-5-pve #1 SMP PVE 5.0.21-10 (Wed, 13 Nov 2019 08:27:10 +0100) x86_64 GNU/Linux
pve-manager/6.0-15/52b91481
Some data about the filesystems below. Disk /ssd (the one that crashed) had more than enough space at that moment (20%+ free).
I'm still checking what the /dev/sdc message means, because I can't remember what this device should be. As far as I remember, I only work with /dev/sda and with /ssd, which is on /dev/sdb.
An fsck run found many errors. After fixing them, the QEMU image files also contained many errors, and the machines no longer worked (they couldn't load root and/or the disk data was corrupted).
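To see which images are still affected, a loop over "qemu-img check" might help. This is only a sketch under assumptions: it assumes the images are qcow2 files somewhere under /ssd/images, the function name check_images is made up, and the QEMU_IMG override exists only so the loop can be tried on a machine without QEMU. The VMs should be stopped before their images are checked.

```shell
#!/bin/sh
# Hypothetical helper: run "qemu-img check" on every *.qcow2 file under a
# directory and report which ones do not come back clean.
# Assumptions: qcow2 images, default directory /ssd/images.
check_images() {
    dir="${1:-/ssd/images}"
    find "$dir" -type f -name '*.qcow2' | while IFS= read -r img; do
        # Exit status 0 means the check completed without finding errors.
        # Any other status (corruption, leaked clusters, check failed or
        # not supported for the format) is reported as not clean.
        if "${QEMU_IMG:-qemu-img}" check "$img" </dev/null >/dev/null 2>&1; then
            printf 'OK         %s\n' "$img"
        else
            printf 'NOT CLEAN  %s\n' "$img"
        fi
    done
}

# Usage: check_images /ssd/images
```

Running "qemu-img check" by hand on any image reported as not clean shows the detailed error counts.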
Linux vrt14 5.0.21-5-pve #1 SMP PVE 5.0.21-10 (Wed, 13 Nov 2019 08:27:10 +0100) x86_64
The programs included with the Debian GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.
Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.
Last login: Mon Sep 20 08:15:28 2021 from 145.131.206.197
username@vrt14:~$ htop
username@vrt14:~$ cd /ssd
username@vrt14:/ssd$ sudo su
[sudo] password for username:
root@vrt14:/ssd# df -h
Filesystem                       Size  Used Avail Use% Mounted on
udev                             126G     0  126G   0% /dev
tmpfs                             26G  2.6G   23G  11% /run
/dev/mapper/pve-root             7.1G  4.4G  2.4G  65% /
tmpfs                            126G   63M  126G   1% /dev/shm
tmpfs                            5.0M     0  5.0M   0% /run/lock
tmpfs                            126G     0  126G   0% /sys/fs/cgroup
/dev/mapper/vg--ssd-ssd          1.8T  1.5T  280G  85% /ssd
/dev/sda2                        253M  288K  252M   1% /boot/efi
192.168.30.13:/virtual-backups   6.0T  5.2T  770G  88% /mnt/pve/virtual-backups
192.168.30.13:/virtual-storage   6.9T  6.6T  361G  95% /mnt/pve/virtual-storage
192.168.30.13:/virtual-machines  4.0T  633G  3.4T  16% /mnt/pve/virtual-machines
/dev/fuse                         30M  108K   30M   1% /etc/pve
/dev/sda4                        190G   17G  164G   9% /root2
tmpfs                             26G     0   26G   0% /run/user/1000
root@vrt14:/ssd# cd /ssd
root@vrt14:/ssd# ls
images lost+found
root@vrt14:/ssd# fallocate -l 16G swapfile
root@vrt14:/ssd# chmod 600 swapfile
root@vrt14:/ssd# mkswap swapfile
Setting up swapspace version 1, size = 16 GiB (17179865088 bytes)
no label, UUID=d6351852-811f-44e4-9237-90d8809dd31e
root@vrt14:/ssd# swapon swapfile
root@vrt14:/ssd# nano /etc/fstab
root@vrt14:/ssd# swapon
NAME            TYPE      SIZE USED PRIO
/dev/dm-1       partition 3.6G 3.6G   -2
/root2/swapfile file       16G  16G   -3
/ssd/swapfile   file       16G 3.7G   -4
root@vrt14:/home/username# pvs
/dev/sdc: open failed: No medium found
WARNING: Not using device /dev/mapper/3600508b1001cdf9244b3582b1fae00fe for PV 45OdTg-2dg1-hk2H-wqws-zDH3-LuwY-ocjj5r.
WARNING: PV 45OdTg-2dg1-hk2H-wqws-zDH3-LuwY-ocjj5r prefers device /dev/sdb because device is used by LV.
PV        VG     Fmt  Attr PSize   PFree
/dev/sda3 pve    lvm2 a--  <29.75g <3.59g
/dev/sdb  vg-ssd lvm2 a--   <1.75t      0
root@vrt14:/home/username# lvs
/dev/sdc: open failed: No medium found
WARNING: Not using device /dev/mapper/3600508b1001cdf9244b3582b1fae00fe for PV 45OdTg-2dg1-hk2H-wqws-zDH3-LuwY-ocjj5r.
WARNING: PV 45OdTg-2dg1-hk2H-wqws-zDH3-LuwY-ocjj5r prefers device /dev/sdb because device is used by LV.
LV   VG     Attr       LSize  Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
data pve    twi-a-tz-- 15.25g             0.00   10.57
root pve    -wi-ao----  7.25g
swap pve    -wi-ao----  3.62g
ssd  vg-ssd -wi-ao---- <1.75t
root@vrt14:/home/username# df -Th | grep "^/dev"
/dev/mapper/pve-root    ext4 7.1G  4.4G  2.4G  65% /
/dev/sda4               ext4 190G   17G  164G   9% /root2
/dev/sda2               vfat 253M  288K  252M   1% /boot/efi
/dev/fuse               fuse  30M  112K   30M   1% /etc/pve
/dev/mapper/vg--ssd-ssd ext4 1.8T  1.4T  390G  78% /ssd