Yesterday I experienced a big data crash on one of my Proxmox nodes.
Almost all VMs were corrupted and no longer worked. It was an awful day of recovering data and accepting some data loss.
I'm worried that something like this will happen again. I hope someone can tell me what went wrong, and what I can do or test to find out whether there are still risks or problems.
Linux vrt14 5.0.21-5-pve #1 SMP PVE 5.0.21-10 (Wed, 13 Nov 2019 08:27:10 +0100) x86_64
The programs included with the Debian GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.
Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.
Last login: Mon Sep 20 08:15:28 2021 from 145.131.206.197
username@vrt14:~$ htop
username@vrt14:~$ cd /ssd
username@vrt14:/ssd$ sudo su
[sudo] password for username:
root@vrt14:/ssd# df -h
Filesystem                       Size  Used Avail Use% Mounted on
udev                             126G     0  126G   0% /dev
tmpfs                             26G  2.6G   23G  11% /run
/dev/mapper/pve-root             7.1G  4.4G  2.4G  65% /
tmpfs                            126G   63M  126G   1% /dev/shm
tmpfs                            5.0M     0  5.0M   0% /run/lock
tmpfs                            126G     0  126G   0% /sys/fs/cgroup
/dev/mapper/vg--ssd-ssd          1.8T  1.5T  280G  85% /ssd
/dev/sda2                        253M  288K  252M   1% /boot/efi
192.168.30.13:/virtual-backups   6.0T  5.2T  770G  88% /mnt/pve/virtual-backups
192.168.30.13:/virtual-storage   6.9T  6.6T  361G  95% /mnt/pve/virtual-storage
192.168.30.13:/virtual-machines  4.0T  633G  3.4T  16% /mnt/pve/virtual-machines
/dev/fuse                         30M  108K   30M   1% /etc/pve
/dev/sda4                        190G   17G  164G   9% /root2
tmpfs                             26G     0   26G   0% /run/user/1000
root@vrt14:/ssd# cd /ssd
root@vrt14:/ssd# ls
images  lost+found
root@vrt14:/ssd# fallocate -l 16G swapfile
root@vrt14:/ssd# chmod 600 swapfile
root@vrt14:/ssd# mkswap swapfile
Setting up swapspace version 1, size = 16 GiB (17179865088 bytes)
no label, UUID=d6351852-811f-44e4-9237-90d8809dd31e
root@vrt14:/ssd# swapon swapfile
root@vrt14:/ssd# nano /etc/fstab
root@vrt14:/ssd# swapon
NAME            TYPE      SIZE USED PRIO
/dev/dm-1       partition 3.6G 3.6G   -2
/root2/swapfile file       16G  16G   -3
/ssd/swapfile   file       16G 3.7G   -4
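To put a number on how full each swap area is, here is a minimal sketch. It assumes the usual /proc/swaps layout (a header line, then Filename, Type, Size, Used and Priority, with sizes in KiB). The helper name swap_usage is my own invention and not part of any tool.

```shell
# Print the fill level of each swap area as a percentage.
# Reads a /proc/swaps-style table: one header line, then
# "Filename Type Size Used Priority" with Size and Used in KiB.
# swap_usage is a hypothetical helper name, not an existing command.
swap_usage() {
    # Skip the header (NR > 1) and guard against a zero size.
    awk 'NR > 1 && $3 > 0 { printf "%s %d%%\n", $1, ($4 * 100) / $3 }' "${1:-/proc/swaps}"
}
```

Called without arguments it reads /proc/swaps; passing a file name lets you run it on a saved copy of the table instead.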
In the syslog, I can find the moment that this went wrong. As you can see, the swap was added at 9:24:56 and a segfault follows at 9:26:11. I guess the two are related.
What happened? How can I find out?
Pastebin syslog extract
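For anyone who wants to look at the same window in their own logs, this is roughly how such an extract can be cut out. It is only a sketch: it assumes the traditional "Mon DD HH:MM:SS host ..." syslog line format with zero-padded times (so 09:24:56, not 9:24:56), and it compares the time field only, so it does not tell different days apart. The function name syslog_window is made up for illustration.

```shell
# Print the lines of a syslog file whose time stamp lies in [start, end].
# Assumes the third whitespace-separated field is a zero-padded HH:MM:SS
# value, so a plain string comparison orders the times correctly.
# syslog_window is a hypothetical helper name.
syslog_window() {
    # $1 = log file, $2 = start time, $3 = end time (both HH:MM:SS)
    awk -v s="$2" -v e="$3" '$3 >= s && $3 <= e' "$1"
}
```

A call could look like `syslog_window /var/log/syslog 09:24:00 09:27:00`. The date is ignored, so on a log that spans several days the output has to be narrowed down further by hand.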
HP DL380 Gen10 with RAID
HP Smart Array P408i-a SR Gen10 (BBU / everything is ok according to controller, RAID10)
Linux vrt14 5.0.21-5-pve #1 SMP PVE 5.0.21-10 (Wed, 13 Nov 2019 08:27:10 +0100) x86_64 GNU/Linux
pve-manager/6.0-15/52b91481
Some data about the filesystems is below. The /ssd disk (the one that crashed) had more than enough space at the time (20%+ free).
I'm still checking what the /dev/sdc message means, because I can't remember what this device should be. As far as I remember, I only work with /dev/sda and with /ssd, which is on /dev/sdb.
root@vrt14:/home/username# pvs
  /dev/sdc: open failed: No medium found
  WARNING: Not using device /dev/mapper/3600508b1001cdf9244b3582b1fae00fe for PV 45OdTg-2dg1-hk2H-wqws-zDH3-LuwY-ocjj5r.
  WARNING: PV 45OdTg-2dg1-hk2H-wqws-zDH3-LuwY-ocjj5r prefers device /dev/sdb because device is used by LV.
  PV         VG     Fmt  Attr PSize   PFree
  /dev/sda3  pve    lvm2 a--  <29.75g <3.59g
  /dev/sdb   vg-ssd lvm2 a--   <1.75t     0
root@vrt14:/home/username# lvs
  /dev/sdc: open failed: No medium found
  WARNING: Not using device /dev/mapper/3600508b1001cdf9244b3582b1fae00fe for PV 45OdTg-2dg1-hk2H-wqws-zDH3-LuwY-ocjj5r.
  WARNING: PV 45OdTg-2dg1-hk2H-wqws-zDH3-LuwY-ocjj5r prefers device /dev/sdb because device is used by LV.
  LV   VG     Attr       LSize  Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  data pve    twi-a-tz-- 15.25g             0.00   10.57
  root pve    -wi-ao----  7.25g
  swap pve    -wi-ao----  3.62g
  ssd  vg-ssd -wi-ao---- <1.75t
root@vrt14:/home/username# df -Th | grep "^/dev"
/dev/mapper/pve-root            ext4      7.1G  4.4G  2.4G  65% /
/dev/sda4                       ext4      190G   17G  164G   9% /root2
/dev/sda2                       vfat      253M  288K  252M   1% /boot/efi
/dev/fuse                       fuse       30M  112K   30M   1% /etc/pve
/dev/mapper/vg--ssd-ssd         ext4      1.8T  1.4T  390G  78% /ssd
The fsck check found many errors. After fixing them, the qemu-img image files also contained many errors, and the machines no longer worked (they couldn't load root and/or the data on their disks was corrupted).
			