Adding a swap file leads to a corrupted file system, what is wrong?

vmswtje

Yesterday I experienced a big data crash on one of my Proxmox nodes.
Almost all VMs were corrupted and no longer working. It was an awful day of recovering data and accepting some data loss.

I'm worried that something like this will happen again, and I hope someone can tell me what went wrong and/or what to do or test to find out whether there are still risks or problems.

Linux vrt14 5.0.21-5-pve #1 SMP PVE 5.0.21-10 (Wed, 13 Nov 2019 08:27:10 +0100) x86_64

The programs included with the Debian GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.
Last login: Mon Sep 20 08:15:28 2021 from 145.131.206.197
username@vrt14:~$ htop
username@vrt14:~$ cd /ssd
username@vrt14:/ssd$ sudo su
[sudo] password for username:
root@vrt14:/ssd# df -h
Filesystem                       Size  Used Avail Use% Mounted on
udev                             126G     0  126G   0% /dev
tmpfs                             26G  2.6G   23G  11% /run
/dev/mapper/pve-root             7.1G  4.4G  2.4G  65% /
tmpfs                            126G   63M  126G   1% /dev/shm
tmpfs                            5.0M     0  5.0M   0% /run/lock
tmpfs                            126G     0  126G   0% /sys/fs/cgroup
/dev/mapper/vg--ssd-ssd          1.8T  1.5T  280G  85% /ssd
/dev/sda2                        253M  288K  252M   1% /boot/efi
192.168.30.13:/virtual-backups   6.0T  5.2T  770G  88% /mnt/pve/virtual-backups
192.168.30.13:/virtual-storage   6.9T  6.6T  361G  95% /mnt/pve/virtual-storage
192.168.30.13:/virtual-machines  4.0T  633G  3.4T  16% /mnt/pve/virtual-machines
/dev/fuse                         30M  108K   30M   1% /etc/pve
/dev/sda4                        190G   17G  164G   9% /root2
tmpfs                             26G     0   26G   0% /run/user/1000
root@vrt14:/ssd# cd /ssd
root@vrt14:/ssd# ls
images  lost+found
root@vrt14:/ssd# fallocate -l 16G swapfile
root@vrt14:/ssd# chmod 600 swapfile
root@vrt14:/ssd# mkswap swapfile
Setting up swapspace version 1, size = 16 GiB (17179865088 bytes)
no label, UUID=d6351852-811f-44e4-9237-90d8809dd31e
root@vrt14:/ssd# swapon swapfile
root@vrt14:/ssd# nano /etc/fstab
root@vrt14:/ssd# swapon
NAME            TYPE       SIZE USED PRIO
/dev/dm-1       partition  3.6G 3.6G   -2
/root2/swapfile file        16G  16G   -3
/ssd/swapfile   file        16G 3.7G   -4
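The /etc/fstab edit itself is not visible in the capture above; for context, a typical swap-file entry for it would look roughly like this (a sketch, not copied from the node, so the exact options may have differed slightly):

# swap file on the /ssd ext4 filesystem (sketch, not copied from the node)
/ssd/swapfile  none  swap  sw  0  0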

In the syslog I can find the moment that this went wrong. As you can see, at 9:24:56 the swap was added, and at 9:26:11 there is a segfault. I guess those events are related.
What happened? How can I find out?

Pastebin syslog extract
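To dig further, the log window around those two timestamps can be pulled out roughly like this (a sketch; <crash-date> is a placeholder for the actual day, and it assumes the usual syslog/journald setup on a PVE 6 node):

journalctl -k --since "<crash-date> 09:20" --until "<crash-date> 09:30"
grep -E '09:2[4-9]:' /var/log/syslog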

HP DL380 Gen10 with RAID
HP Smart Array P408i-a SR Gen10 (BBU / everything is ok according to controller, RAID10)
Linux vrt14 5.0.21-5-pve #1 SMP PVE 5.0.21-10 (Wed, 13 Nov 2019 08:27:10 +0100) x86_64 GNU/Linux
pve-manager/6.0-15/52b91481
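For reference, the controller and BBU status can be checked with HPE's Smart Array CLI, along these lines (assuming ssacli is installed and the controller sits in slot 0, which is an assumption on my side):

ssacli ctrl all show status
ssacli ctrl slot=0 ld all show status
ssacli ctrl slot=0 pd all show status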

Some data about the filesystems below. Disk /ssd (the one that crashed) had more than enough space at the moment (20%+ free).
I'm still checking what the /dev/sdc message is about, because I can't remember what that device should be. As far as I remember, I only work with /dev/sda and with /ssd, which is on /dev/sdb.

root@vrt14:/home/username# pvs
  /dev/sdc: open failed: No medium found
  WARNING: Not using device /dev/mapper/3600508b1001cdf9244b3582b1fae00fe for PV 45OdTg-2dg1-hk2H-wqws-zDH3-LuwY-ocjj5r.
  WARNING: PV 45OdTg-2dg1-hk2H-wqws-zDH3-LuwY-ocjj5r prefers device /dev/sdb because device is used by LV.
  PV         VG     Fmt  Attr PSize   PFree
  /dev/sda3  pve    lvm2 a--  <29.75g <3.59g
  /dev/sdb   vg-ssd lvm2 a--   <1.75t      0
root@vrt14:/home/username# lvs
  /dev/sdc: open failed: No medium found
  WARNING: Not using device /dev/mapper/3600508b1001cdf9244b3582b1fae00fe for PV 45OdTg-2dg1-hk2H-wqws-zDH3-LuwY-ocjj5r.
  WARNING: PV 45OdTg-2dg1-hk2H-wqws-zDH3-LuwY-ocjj5r prefers device /dev/sdb because device is used by LV.
  LV   VG     Attr       LSize  Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  data pve    twi-a-tz-- 15.25g             0.00   10.57
  root pve    -wi-ao----  7.25g
  swap pve    -wi-ao----  3.62g
  ssd  vg-ssd -wi-ao---- <1.75t
root@vrt14:/home/username# df -Th | grep "^/dev"
/dev/mapper/pve-root    ext4  7.1G  4.4G  2.4G  65% /
/dev/sda4               ext4  190G   17G  164G   9% /root2
/dev/sda2               vfat  253M  288K  252M   1% /boot/efi
/dev/fuse               fuse   30M  112K   30M   1% /etc/pve
/dev/mapper/vg--ssd-ssd ext4  1.8T  1.4T  390G  78% /ssd
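Those WARNING lines look like the same physical disk is visible both directly as /dev/sdb and through the device-mapper multipath name 3600508b1001cdf9244b3582b1fae00fe. A sketch of what I intend to run to check that (standard multipath-tools and LVM commands, nothing node-specific):

multipath -ll
lsblk -o NAME,TYPE,SIZE,MOUNTPOINT
lvs -o +devices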

fsck did find many errors, and after fixing them the qemu disk images also contained many errors; the machines didn't work anymore (they couldn't load their root filesystem and/or the data on the disks was corrupted).
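For reference, individual disk images can be inspected like this (the path/VMID below is a placeholder, not one of my actual disks; qemu-img check only applies to qcow2 images, raw images have no metadata to check):

qemu-img info /ssd/images/<vmid>/vm-<vmid>-disk-0.qcow2
qemu-img check /ssd/images/<vmid>/vm-<vmid>-disk-0.qcow2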
 