HowTo: Proxmox VE 7 With Software RAID-1

Introduction

The operating system of our servers is always running on a RAID-1 (either hardware or software RAID) for redundancy reasons. Since Proxmox VE 7 does not offer out-of-the-box support for mdraid (there is support for ZFS RAID-1, though), I had to come up with a solution to migrate the base installation to an mdraid RAID-1 and thought this might be useful for others as well. Therefore, I am sharing this as a howto here.

High-level explanation

Basically, I install PVE 7 on /dev/sda, then create a degraded(!) software RAID-1 using only /dev/sdb3, move the existing LVM PV from /dev/sda3 to /dev/md0, and finally add /dev/sda3 to the mdraid /dev/md0, which up to that point consists of /dev/sdb3 alone.

Step-by-step instructions

Code:
# Install mdadm
apt install mdadm
 
# Create identical partition layout to /dev/sda on /dev/sdb without copying labels/UUIDs
sfdisk -d /dev/sda > part_table
grep -v ^label-id part_table | sed -e 's/, *uuid=[0-9A-F-]*//' | sfdisk /dev/sdb
 
# Create a degraded RAID-1 on /dev/sdb3
mdadm --create /dev/md0 --level 1 --raid-devices 2 /dev/sdb3 missing
 
# Create LVM PV and add to existing VG
pvcreate /dev/md0
vgextend /dev/pve /dev/md0
 
# Move data from /dev/sda3 to /dev/md0
pvmove /dev/sda3 /dev/md0
 
# IMPORTANT! Wait for the process to finish, then remove /dev/sda3 from the VG
vgreduce /dev/pve /dev/sda3
 
# Add /dev/sda3 to RAID-1
mdadm --manage --add /dev/md0 /dev/sda3
 
# Byte copy EFI and BIOS boot partitions
dd if=/dev/sda1 of=/dev/sdb1
dd if=/dev/sda2 of=/dev/sdb2
 
# Install GRUB on all disks
grub-install /dev/sda
grub-install /dev/sdb
 
# Update initramfs
update-initramfs -u -k all
 
# Wait for RAID-1 sync to finish
watch cat /proc/mdstat
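
Once the sync has finished, /proc/mdstat should show both members active, roughly like this (the slot numbers and block count are only an example, not output from this exact machine):

Code:
Personalities : [raid1]
md0 : active raid1 sda3[2] sdb3[0]
      487200768 blocks super 1.2 [2/2] [UU]

unused devices: <none>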
 
Shouldn't you use UUIDs or something else instead of the disk names, because the names can change after a restart or when adding disks?
How are you doing scrubbing and error repairing?
 
By default mdraid scans all disks and partitions for MD superblocks (metadata) on boot. It then matches on the array name or UUID of the RAID (depending on your config in /etc/mdadm/mdadm.conf), not on the disk UUID. Therefore it does not matter which device name the disk receives after a reboot.
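
For reference, the array definition in /etc/mdadm/mdadm.conf can be (re)generated like this; the ARRAY line shown is only a placeholder example, not from a real machine:

Code:
# Append the current array definition to mdadm.conf and rebuild the initramfs
mdadm --detail --scan >> /etc/mdadm/mdadm.conf
update-initramfs -u -k all

# The resulting entry looks roughly like this:
# ARRAY /dev/md0 metadata=1.2 name=proxmox:0 UUID=f9a3e1b2:4c5d6e7f:8a9b0c1d:2e3f4a5b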

Regarding scrubbing and error repairing: when installing mdadm, a monthly cronjob is added as well, which checks the arrays for consistency.

Code:
root@proxmox:~# cat /etc/cron.d/mdadm
#
# cron.d/mdadm -- schedules periodic redundancy checks of MD devices
#
# Copyright © martin f. krafft <madduck@madduck.net>
# distributed under the terms of the Artistic Licence 2.0
#

# By default, run at 00:57 on every Sunday, but do nothing unless the day of
# the month is less than or equal to 7. Thus, only run on the first Sunday of
# each month. crontab(5) sucks, unfortunately, in this regard; therefore this
# hack (see #380425).
57 0 * * 0 root if [ -x /usr/share/mdadm/checkarray ] && [ $(date +\%d) -le 7 ]; then /usr/share/mdadm/checkarray --cron --all --idle --quiet; fi
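
A check can also be triggered by hand; a minimal sketch, assuming the array is /dev/md0:

Code:
# Start a consistency check of /dev/md0 manually
echo check > /sys/block/md0/md/sync_action

# Progress shows up in /proc/mdstat; detected mismatches are counted here
cat /proc/mdstat
cat /sys/block/md0/md/mismatch_cnt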
 
Hello, I am throwing a message in a bottle here.

It is not possible to apply the mdadm procedure on Proxmox VE 7.1-8; it fails after this step:

Code:
# Move data from /dev/sda3 to /dev/md0
pvmove /dev/sda3 /dev/md0

We have an error message:

Code:
  Insufficient free space: 127871 extents needed, but only 127839 available
  Unable to allocate mirror extents for pve/pvmove0.
  Failed to convert pvmove LV to mirrored.

We have tested on several PVE 7.1-8 servers; the result is identical.

We also tested with the older (updated) PVE 6.4-13 and there it works fine!

We don't understand; it is probably an LVM problem, but why does it not work anymore?

Code:
root@PVE7:~# pvdisplay
  --- Physical volume ---
  PV Name               /dev/sda3
  VG Name               pve
  PV Size               <499.50 GiB / not usable 2.98 MiB
  Allocatable           yes (but full)
  PE Size               4.00 MiB
  Total PE              127871
  Free PE               0
  Allocated PE          127871
  PV UUID               HLHwID-HmkC-JFdX-HkXd-ngXk-Sboc-p2fK42

  --- Physical volume ---
  PV Name               /dev/md0
  VG Name               pve
  PV Size               499.37 GiB / not usable <1.94 MiB
  Allocatable           yes
  PE Size               4.00 MiB
  Total PE              127839
  Free PE               127839
  Allocated PE          0
  PV UUID               XQUEDm-swYp-j102-H0x1-kDih-qYPq-6vsGTF

root@PVE7:~# pvmove /dev/sda3 /dev/md0
  Insufficient free space: 127871 extents needed, but only 127839 available
  Unable to allocate mirror extents for pve/pvmove0.
  Failed to convert pvmove LV to mirrored.
root@PVE7:~#

How should we proceed?

Thanks for your help.
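
For what it's worth, one possible workaround (just a sketch, assuming the pve/swap LV exists and can spare the missing 32 extents, i.e. 128 MiB) is to shrink swap before retrying the pvmove:

Code:
# /dev/md0 is 32 extents smaller than /dev/sda3 because of the md superblock,
# so free at least 32 extents (4 MiB each) in the VG first, e.g. from swap
swapoff /dev/pve/swap
lvreduce -f -l -32 /dev/pve/swap
mkswap /dev/pve/swap
swapon /dev/pve/swap

# pvmove only needs as many free extents on the target as are allocated on the source
pvmove /dev/sda3 /dev/md0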
 
I suggest also making a RAID-1 of the boot partition, i.e. sda2 with sdb2.

I also had a lot of trouble with the UEFI boot menu entry in the BIOS; in the end I overwrote the UEFI fallback bootloader with the Proxmox one:
cp /boot/efi/EFI/proxmox/grubx64.efi /boot/efi/EFI/BOOT/BOOTx64.EFI
This has to be done on both boot partitions, sda2 and sdb2.
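
For the second disk this means mounting its ESP somewhere temporarily; a rough sketch, assuming /dev/sdb2 is the second ESP and is currently not mounted:

Code:
# Copy the Proxmox GRUB image to the fallback path on the second ESP
mkdir -p /mnt/esp2
mount /dev/sdb2 /mnt/esp2
mkdir -p /mnt/esp2/EFI/BOOT
cp /boot/efi/EFI/proxmox/grubx64.efi /mnt/esp2/EFI/BOOT/BOOTx64.EFI
umount /mnt/esp2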

Best regards,

Emilien
 
In the end I started from a Debian 11 installation, without LVM, on ext4.

20 GB for the system on an mdadm RAID-1 (ext4) for /
A little swap...
and the rest on an mdadm RAID-1 (ext4) for /var/lib/vz

Yes, for the RAID you have to copy the *.EFI file to both VFAT partitions of the UEFI disks (sda1 and sdb1) and of course create the two boot entries in the BIOS (if the disk UUIDs are not cloned).
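
Creating those firmware boot entries can also be scripted with efibootmgr; a sketch where the partition number, labels and loader path are assumptions for this particular layout:

Code:
# Register one UEFI boot entry per disk (the ESP is partition 1 here)
efibootmgr --create --disk /dev/sda --part 1 --label "Debian sda" --loader '\EFI\BOOT\BOOTX64.EFI'
efibootmgr --create --disk /dev/sdb --part 1 --label "Debian sdb" --loader '\EFI\BOOT\BOOTX64.EFI'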
 
What's the purpose of mdadm RAID? ZFS is already implemented in the Proxmox installation media. What's the benefit of using md RAID?
 
Way less disk wear and better performance, at the cost of data integrity and features. Also, ZFS with full-system encryption wasn't possible before PVE 6.3. So my PVE is using LUKS+mdadm+LVM, as I wanted encryption and back then, when I set it up (I think it was PVE 6.1 or 6.2), ZFS wasn't an option.

If you only have crappy consumer SSDs or SMR HDDs, I would still prefer mdadm RAID. ZFS kills consumer SSDs way too fast (it killed 3 of my 4 consumer SSDs in the past 3 months) and SMR HDDs can cause the pool to degrade.

So quite often you don't really have the option to use ZFS. Then mdadm is a nice alternative to onboard RAID, especially when you don't have a free PCIe 8x slot for a HW RAID card (as many people don't when using thin clients or NUCs).
 
Nice!

Is it possible to share your disk layout/partitioning, just to understand what partition scheme you defined? Maybe with lsblk -f?
 
Sure:
Code:
root@Hypervisor:~# lsblk -f
NAME                              FSTYPE            FSVER    LABEL             UUID                                   FSAVAIL FSUSE% MOUNTPOINT
sda
├─sda1
├─sda2                            vfat              FAT32                      3E45-C742
├─sda3                            linux_raid_member 1.2      Hypervisor:0      b1f4608e-e986-17e9-eca2-dd21a148bbc0
│ └─md0                           ext4              1.0      boot              92bb126f-5732-4279-8a1c-36012f87ae18    149.8M    61% /boot
└─sda4                            linux_raid_member 1.2      Hypervisor:1      9ef382df-d617-bc50-d213-cf45f5401d4c
  └─md1                           crypto_LUKS       2                          3db015f7-6ca2-492e-9977-6a22e6f42e98
    └─md1_crypt                   LVM2_member       LVM2 001                   X0EWNC-LSfQ-G7PK-jpMX-pcOT-odc9-6356KT
      ├─vgpmx-lvroot              xfs                        root              b452ef99-8055-4b6a-88e0-b5c744500f72        8G    62% /
      └─vgpmx-lvswap              swap              1                          9e5e6780-7896-4124-b082-8ea5dbe8f24e                  [SWAP]
sdb
├─sdb1
├─sdb2                            vfat              FAT32                      3E45-16FA                               285.4M     0% /boot/efi
├─sdb3                            linux_raid_member 1.2      Hypervisor:0      b1f4608e-e986-17e9-eca2-dd21a148bbc0
│ └─md0                           ext4              1.0      boot              92bb126f-5732-4279-8a1c-36012f87ae18    149.8M    61% /boot
└─sdb4                            linux_raid_member 1.2      Hypervisor:1      9ef382df-d617-bc50-d213-cf45f5401d4c
  └─md1                           crypto_LUKS       2                          3db015f7-6ca2-492e-9977-6a22e6f42e98
    └─md1_crypt                   LVM2_member       LVM2 001                   X0EWNC-LSfQ-G7PK-jpMX-pcOT-odc9-6356KT
      ├─vgpmx-lvroot              xfs                        root              b452ef99-8055-4b6a-88e0-b5c744500f72        8G    62% /
      └─vgpmx-lvswap              swap              1                          9e5e6780-7896-4124-b082-8ea5dbe8f24e                  [SWAP]
sdc
├─sdc1                            zfs_member        5000     VMpool            15863678502274358828
└─sdc9
sdd
├─sdd1                            zfs_member        5000     VMpool            15863678502274358828
└─sdd9
sde
├─sde1                            zfs_member        5000     VMpool            15863678502274358828
└─sde9
sdf
├─sdf1                            zfs_member        5000     VMpool            15863678502274358828
└─sdf9
sdg
├─sdg1                            zfs_member        5000     VMpool            15863678502274358828
└─sdg9
sdh
└─sdh1                            crypto_LUKS       2                          6f304594-ff6f-4af4-b5ac-c0039f05ff0a
  └─lukslvm                       LVM2_member       LVM2 001                   wLR7wJ-BpXT-21JR-yG7Z-hpsZ-4yuv-TXWbNn
    ├─vgluks-lvthin_tmeta
    │ └─vgluks-lvthin-tpool
    │   ├─vgluks-lvthin
    │   ├─vgluks-vm--121--disk--0 ext4              1.0                        c6917469-bacc-4d79-9c67-b1560f7b2415
    │   ├─vgluks-vm--136--disk--1 ext4              1.0                        75269409-d4a1-4c44-bad6-55d2aafe3508
    │   ├─vgluks-vm--106--disk--0
    │   ├─vgluks-vm--140--disk--0
    │   ├─vgluks-vm--140--disk--1
    │   ├─vgluks-vm--140--disk--2
    │   ├─vgluks-vm--140--disk--3
    │   └─vgluks-vm--140--disk--4
    └─vgluks-lvthin_tdata
      └─vgluks-lvthin-tpool
        ├─vgluks-lvthin
        ├─vgluks-vm--121--disk--0 ext4              1.0                        c6917469-bacc-4d79-9c67-b1560f7b2415
        ├─vgluks-vm--136--disk--1 ext4              1.0                        75269409-d4a1-4c44-bad6-55d2aafe3508
        ├─vgluks-vm--106--disk--0
        ├─vgluks-vm--140--disk--0
        ├─vgluks-vm--140--disk--1
        ├─vgluks-vm--140--disk--2
        ├─vgluks-vm--140--disk--3
        └─vgluks-vm--140--disk--4
sdi
├─sdi1                            ext2              1.0                        41f0279b-3c3b-47de-a30e-fe87aa3ab22c
├─sdi2
└─sdi5                            crypto_LUKS       2                          17bc4279-1c49-4f41-b0e7-3e3a152ca389
sda+sdb are my system disks booting via GRUB (part1); part2 is reserved for an ESP but never used, and hard to switch to anyway because proxmox-boot-tool wants the ESP to be 512MB and I only partitioned it with 300MB or so. Part3 is my boot partition on an mdadm RAID-1 (could also be bigger). Part4 is mdadm RAID-1 -> LUKS encryption -> LVM -> LVs for swap and root.
sdc, sdd, sde, sdf and sdg form an encrypted raidz1 pool for my guests (not great for IOPS, but fast enough for my workloads, with more capacity and less SSD wear compared to a striped mirror).
sdh is LUKS encryption -> LVM -> LVM-Thin for guests that do heavy sync writes but hold only unimportant data (Zabbix with MySQL for monitoring, Graylog with MongoDB+Elasticsearch for log collection, blockchain DBs using SQLite) that would wear the SSDs too hard if I ran them on ZFS (running them on LVM-Thin instead of ZFS saves about 400GB of writes per day).
sdi is a LUKS-encrypted Debian USB stick that I boot to back up the whole of sda and sdb on block level to my PBS.


Not the best setup, but it's what I did back then. Today I would use an encrypted ZFS mirror with systemd-boot for the system disks, and either 4x 400GB SSDs as a ZFS striped mirror or 2x 800GB SSDs as a ZFS mirror to store my guests, instead of the 5x 200GB raidz1 I'm using now. The forced volblocksize of 32K is just too annoying with raidz1.
But I would still keep that LVM-Thin for the unimportant guests (maybe mirrored with mdadm RAID-1), because the write amplification of ZFS (an average factor of 20 here) hits the SSDs too hard.
 
Ok, seems great so far. How did you manage the volblocksize value, and what is optimal in your case? Did you run any benchmarks? Why are you using 32K instead of 16K or a bigger value?
 
How did you manage the volblocksize value?
You need to create different datasets and add them as different ZFS storages, one storage for each volblocksize you want to use. Otherwise PVE will use the wrong volblocksize when doing a backup restore or a migration between nodes.
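
As a rough sketch of that setup (the pool and storage names here are made up):

Code:
# One dataset per volblocksize, each added as its own PVE storage
zfs create rpool/vm-8k
zfs create rpool/vm-32k
pvesm add zfspool vm-8k --pool rpool/vm-8k --blocksize 8k --content images,rootdir
pvesm add zfspool vm-32k --pool rpool/vm-32k --blocksize 32k --content images,rootdir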
What is optimal in your case?
For my 5-disk raidz1 with an ashift of 12 it's 32K.
Did you run any benchmarks?
Yup, I did 155 benchmarks in this thread.

And SSD wear of ZFS is really bad. See for example here:
[attached screenshot: benchmark table comparing "Write Amplification Total" for ZFS and LVM-Thin]
Just compare the "Write Amplification Total" on the left between ZFS and LVM-Thin. ZFS shows a factor of 35 to 62, LVM-Thin just a factor of 8. That means writing 1TB of 4K sync writes to ZFS wears the SSD by 35 to 62 TB, while LVM-Thin wears it by only 8TB. So SSDs used with ZFS will die 3 to 6 times faster than with LVM-Thin, at least when doing small random sync writes. With big sequential async writes there is still a lot of write amplification, but way less, sometimes down to a factor of 3. And all of that was measured with enterprise SSDs made for write-intense workloads with MLC NAND; I don't want to know how badly consumer QLC SSDs without power-loss protection would wear in these tests.

That's why I moved my unimportant guests from ZFS to LVM-Thin. You can also see that raidz causes fewer writes compared to mirrors or striped mirrors, as the overhead is smaller because of a better data-to-parity ratio. But the read overhead of raidz is way worse, as you are forced to use a bigger volblocksize. At least read amplification won't wear the SSDs ;)

Why are you using 32K instead of 16K or a bigger value?
Smaller than 32K and you will lose a lot of capacity because of padding overhead. Bigger and the write and read amplification will be even worse when running workloads like DBs that do writes smaller than your volblocksize. You need to find the sweet spot between lost capacity on one side and lost performance plus SSD wear for small sync I/O on the other.
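
To make the padding overhead concrete, a rough calculation for a 5-disk raidz1 with ashift=12 (4K sectors), where raidz1 allocations are padded to a multiple of 2 sectors:

Code:
8K volblock : 2 data sectors + 1 parity = 3 sectors, padded to 4 -> 16K raw for 8K of data (100% overhead)
32K volblock: 8 data sectors + 2 parity = 10 sectors, no padding -> 40K raw for 32K of data (25% overhead, the nominal 4+1 ratio)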
 
Hey Alex,

Thanks for your help and your time. It seems that for PostgreSQL I'll need to run some benchmarks, but currently my PostgreSQL block size is 8K and my Proxmox uses 8K by default. Do you think I have to adjust anything here?
 
Then you can only use a 1-vdev mirror or a 2-vdev striped mirror when using ashift=12. All raidz1/2/3 layouts need at least a 16K volblocksize if you don't want to waste too much space, and with that, 8K performance will be really bad.
 
Thanks mate. Any good way to get the best performance on a single-vdev mirror with 2 disks? Currently I'm using an 8K volblocksize, but I am not sure if this is the proper value; maybe I have to use a different volblocksize for each service, like one dataset with 8K for PostgreSQL and another dataset for the OS or file storage.
 
8K should be fine. You could decrease it to 4K, but then block-level compression won't work anymore, as ZFS can't store anything in a block smaller than 4K when using ashift=12. With an 8K volblocksize it can at least save 50% of the space for blocks that are more than 50% compressible.
 
Thanks for the guide. But won't this need dmraid as well, i.e. "apt install dmraid" before update-initramfs?
Also, I put these in /etc/modules: dm_mirror raid1 dm_raid raid456 dm-region-hash dm-log
 
I face similar issues. I don't want to use Btrfs (it has now officially been abandoned by Red Hat and removed in RHEL 8).

I also do not want to use ZFS, since it kills my SSDs too fast. Instead I want to use mdadm software RAID under Linux. Luckily some guides exist: https://www.howtoforge.com/proxmox-2-with-software-raid or https://www.petercarrero.com/content/2012/04/22/adding-software-raid-proxmox-ve-20-install

Hopefully, Proxmox will natively support mdadm in their installer in the near future.
 
