4.4 install to ZRAID10

The work is done:
I built in one M.2 NVMe SSD on the small Fujitsu D3417-B Skylake mainboard with the newest BIOS,
installed Proxmox 4.3 with ext4 and 33 GB for root.
There were no problems with booting (yeah!)
Only now have I connected the 4 hard disks (HGST with 4Kn sector size).
Thanks to spudger for the tip: I have now deleted the old ZFS signatures:

Code:
zpool status
zpool export rpool
zpool destroy rpool
zpool labelclear -f /dev/disk/by-id/xxx-disk1-to-disk4

(labelclear for every disk and every partition)
And for safety (note: wipefs needs "-a" to actually erase the signatures, without it it only lists them):
Code:
wipefs -a /dev/sda ... sdd

With parted I created a new partition table (EDIT 2017: this is NOT necessary!):
Code:
parted /dev/disk/by-id/ata-HGST_HUS726020ALN610_xxx  mklabel gpt
for every disk (I'm not sure if that is necessary)

Then follows the big command to create the ZFS RAID 10:
no partitioning, one large pool called r10pool (not rpool - that name is reserved!)

Code:
zpool create -n -O compression=on -o ashift=12 -f r10pool mirror /dev/disk/by-id/ata-HGST_HUS726020ALN610_00 /dev/disk/by-id/ata-HGST_HUS726020ALN610_01 mirror /dev/disk/by-id/ata-HGST_HUS726020ALN610_02 /dev/disk/by-id/ata-HGST_HUS726020ALN610_03

In the wiki, by the way, a lowercase "-o" before compression is wrong - compression is a file system property and needs the uppercase "-O".
-n => displays the configuration that would be used without actually creating the pool.
When everything looks alright, run it again without the "-n" option:

Code:
zpool create -O compression=on -o ashift=12 -f r10pool mirror /dev/disk/by-id/ata-HGST_HUS726020ALN610_00 /dev/disk/by-id/ata-HGST_HUS726020ALN610_01 mirror /dev/disk/by-id/ata-HGST_HUS726020ALN610_02 /dev/disk/by-id/ata-HGST_HUS726020ALN610_03

And the last step I took via the GUI:
select - Datacenter - Storage => Add => ZFS => ID = Raid10, ZFS_Pool = r10pool => Add
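For reference, the same ZFS storage can also be added from the command line with pvesm - a sketch, where the storage ID "Raid10" matches the GUI example above and the content types are my own assumption of what you would typically enable:

Code:
pvesm add zfspool Raid10 --pool r10pool --content images,rootdir
pvesm status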

The small problem I see is that backups and ISO files cannot be selected as content here.

Is there a way to also save ISO files and VZDump backups on this ZFS pool?

regards, maxprox
 
I would call the pool something else ("rpool" is by convention the root pool ;)) - you can just export and import with a new name to do that.

You can store backups and ISO files on it by creating a separate dataset (e.g., "zfs create yourpool/backups") and then configuring a directory storage for the mount point ("/yourpool/datasetname" by default). I would recommend using a different storage for backups though (ideally on a different machine) - locally you can already do snapshots which are way more convenient IMHO..
 
I would call the pool something else ("rpool" is by convention the root pool ;)) - you can just export and import with a new name to do that.

Ahhh yes, thank you!

You can store backups and ISO files on it by creating a separate dataset (e.g., "zfs create yourpool/backups") and then configuring a directory storage for the mount point ("/yourpool/datasetname" by default). I would recommend using a different storage for backups though (ideally on a different machine) - locally you can already do snapshots which are way more convenient IMHO..

Thanks again. In any case, the backups will eventually go to a separate backup system.
best regards,
maxprox
 
If you want to rename a ZFS pool, you first have to export it with

Code:
zpool export rpool

but it gets automatically re-imported every 5 seconds,
and you get an error when you try to import the pool under a new name:
Code:
root@oprox:# zpool import rpool r10pool
cannot import 'rpool': no such pool available

The solution was the force import option "-f":
Code:
zpool import -f rpool r10pool

Code:
root@oprox:~# zpool status
  pool: r10pool
 state: ONLINE
  scan: none requested
config:

        NAME                             STATE     READ WRITE CKSUM
        r10pool                          ONLINE       0     0     0
          mirror-0                       ONLINE       0     0     0
            ata-HGST_HUS726020ALN610_00  ONLINE       0     0     0
            ata-HGST_HUS726020ALN610_01  ONLINE       0     0     0
          mirror-1                       ONLINE       0     0     0
            ata-HGST_HUS726020ALN610_02  ONLINE       0     0     0
            ata-HGST_HUS726020ALN610_03  ONLINE       0     0     0

Then I created a new dataset for the directory storage:
Code:
zfs create r10pool/dataset

I configured the directory /r10pool/dataset as a new Proxmox storage via the GUI.


Code:
root@oprox:~# zfs mount
r10pool  /r10pool
r10pool/dataset  /r10pool/dataset
root@oprox:~# ll /r10pool/dataset/
total 2.5K
drwxr-xr-x 5 root root 5 Dec 23 14:39 .
drwxr-xr-x 3 root root 3 Dec 23 14:37 ..
drwxr-xr-x 2 root root 2 Dec 23 14:39 dump
drwxr-xr-x 2 root root 2 Dec 23 14:39 images
drwxr-xr-x 4 root root 4 Dec 23 14:39 template
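Instead of the GUI, the same directory storage could also be created with pvesm - a sketch, where the storage ID "r10dir" is just an example name of my own:

Code:
pvesm add dir r10dir --path /r10pool/dataset --content backup,iso,vztmpl
pvesm status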

Thank you and best regards,
maxprox
 
I have installed to an SSD, then exported the rpool that was created by the 4.4 installer, then imported as 'spool.' It conveniently contains the following mounts:

spool/ROOT
spool/ROOT/pve-1 (the hostname)
spool/data

The installer created these storages:
local (Directory): Disk image, ISO image, Container, Container template
local-lvm (LVM-Thin): Disk image, Container

What is the best process for moving as much of this onto the pool as practical?




I do look forward to seeing 4kn UEFI boot one of these days.

thank you~
 
I noticed that the zpool was using /dev/sda2 and /dev/sdb2 for one mirror but /dev/sdc and /dev/sdd for the other mirror. Looking at the partition tables with fdisk I see two BIOS boot partitions (not unexpected) but also a "Solaris reserved 1" partition #9 on every drive (as shown here). The ones with the BIOS boot partition are 8 MB, the other two are 64 MB. Are these some kind of remnant from cylinder alignment? More to the point, do I need them at all if I use four identical disks with one partition each?

Also, the installer started partitions at sector 2048. When I use fdisk it defaults to 256. Is this a "just in case" thing or is there a good reason to skip that many?
 
I noticed that the zpool was using /dev/sda2 and /dev/sdb2 for one mirror but /dev/sdc and /dev/sdd for the other mirror. Looking at the partition tables with fdisk I see two BIOS boot partitions (not unexpected) but also a "Solaris reserved 1" partition #9 on every drive (as shown here). The ones with the BIOS boot partition are 8 MB, the other two are 64 MB. Are these some kind of remnant from cylinder alignment? More to the point, do I need them at all if I use four identical disks with one partition each?

For the non-boot disks we leave the disk setup to ZFS - it seems the reserved partition size is specified as a sector count, so it's bigger on 4Kn disks. For the boot disks we always reserve 8 MB (like ZFS does with 512e disks). Of course, if you don't boot from a disk you don't need a boot partition there - but then I am not sure why you installed with ZFS (and then again with LVM-thin on other disks?). You can just create a zpool with "zpool create" on any existing PVE installation.

Also, the installer started partitions at sector 2048. When I use fdisk it defaults to 256. Is this a "just in case" thing or is there a good reason to skip that many?

that's just the default alignment used by sgdisk I think - it does not really matter.
 
For the non-boot disks we leave the disk setup to ZFS - it seems the reserved partition size is specified as a sector count, so it's bigger on 4Kn disks. For the boot disks we always reserve 8 MB (like ZFS does with 512e disks).

Thanks, Fabian. That would probably explain the 64 MB one I got with these. Is that "Solaris reserved" partition something ZFS actually uses? I haven't found much about it in the docs, and many of those are Solaris-specific.

I am not sure why you installed with ZFS (and then again with LVM-thin on other disks?). You can just create a zpool with "zpool create" on any existing PVE installation.

I installed with ZFS in an attempt to get the system to boot from the rpool, which turned out to be problematic on 4Kn drives.

Is there a guide to moving as much of Proxmox off the root drive as possible? I'm a bit unclear how/where things are organized with those three mount points:

ROOT
ROOT/nodename
data
 
After reading this, I've decided to skip zraid10 and just put two mirrored vdevs into the pool. I also decided to skip partitioning completely and just let ZFS own the raw disks. Much simpler.

I could still use some guidance on moving bits of the Proxmox SSD install (especially the ones with a lot of I/O) to the zpool.

When I do a df, I see that /dev/dm-0 is mounted on /
Is it safe to move the whole rootfs to the zpool?

I'm a bit rusty on lvm admin, but lvdisplay shows:
/dev/pve/swap
/dev/pve/root
data

When I add the zpool in the Proxmox GUI, it chooses Disk image, Container and offers no other options. I called it local-zfs and then deleted local-lvm.
 
Moving the root installation like this is not really supported or tested, although it should be possible. It probably requires some manual intervention here and there (e.g., GRUB installation and configuration updates), so if you are not comfortable with that I would advise against attempting such a move. In any case you are probably faster doing a clean re-install, especially if there is no productive data on the system yet (not sure whether that is the case?).
 
The intent here was to get Proxmox installed on (and booting from) a small but reliable ZFS array. Given that the 4.X installer will install to that, perhaps there is a recipe for moving just the boot to another device?
 
The intent here was to get Proxmox installed on (and booting from) a small but reliable ZFS array. Given that the 4.X installer will install to that, perhaps there is a recipe for moving just the boot to another device?
Yes there is, have a look at the Ubuntu wiki: https://wiki.ubuntuusers.de/Ubuntu_umziehen/
I have done this myself, but both disks (source/target) had an ext4 file system; that is no problem.
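Since that article is in German, here is a very rough sketch of the procedure it describes (copy the root file system to the prepared target partition, then reinstall the bootloader). The device names /dev/sdX and /dev/sdX1 are placeholders, and this is not Proxmox-specific:

Code:
mkdir -p /mnt/target
mount /dev/sdX1 /mnt/target
# copy the running ext4 root, excluding virtual and temporary file systems
rsync -aAXH --exclude={"/dev/*","/proc/*","/sys/*","/run/*","/mnt/*","/media/*","/tmp/*"} / /mnt/target/
# adjust /mnt/target/etc/fstab to the new UUID, then reinstall GRUB from a chroot
for d in dev proc sys; do mount --rbind /$d /mnt/target/$d; done
chroot /mnt/target grub-install /dev/sdX
chroot /mnt/target update-grub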
 
Thank you, but unfortunately I don't speak German. This doesn't appear to be Proxmox specific, so perhaps there is another option someplace.
 
One more info:
with my setup I get a kernel panic, have a look at the attached screenshot.

My system is a Skylake Fujitsu D3417-B mainboard with 64 GB RAM and one E3-1245 v5 Xeon.
Proxmox is installed on the new 128 GB NVMe SSD with ext4.
Currently only one VM, a Windows 2008 R2 server, is running on this system, with 16 GB of virtual RAM.

Code:
...
nvme0n1  259:0  0 119.2G  0 disk
├─nvme0n1p1  259:1  0  1007K  0 part
├─nvme0n1p2  259:2  0  127M  0 part
└─nvme0n1p3  259:3  0 119.1G  0 part
  ├─pve-root  251:0  0  29.8G  0 lvm  /
  ├─pve-swap  251:1  0  4G  0 lvm
  ├─pve-data_tmeta 251:2  0  72M  0 lvm
  │ └─pve-data  251:4  0  70.5G  0 lvm
  └─pve-data_tdata 251:3  0  70.5G  0 lvm
  └─pve-data  251:4  0  70.5G  0 lvm

As you can see, I gave it only 4 GB for swap,
and there is a ZFS RAID 10 pool with 4x 2 TB 4Kn hard drives, as described above.

Also I get an ACPI Error:
Code:
root@oprox:~# dmesg | grep -i error
....
[  0.581966] ACPI Error: [\_SB_.PCI0.LPCB.H_EC.ECAV] Namespace lookup failure, AE_NOT_FOUND (20150930/psargs-359)
[  0.581970] ACPI Error: Method parse/execution failed [\_TZ.FNCL] (Node ffff880fed91f4d8), AE_NOT_FOUND (20150930/psparse-542)
[  0.581976] ACPI Error: Method parse/execution failed [\_TZ.FN02._ON] (Node ffff880fed91f168), AE_NOT_FOUND (20150930/psparse-542)
[  0.589970] ACPI Error: [\_SB_.PCI0.LPCB.H_EC.ECAV] Namespace lookup failure, AE_NOT_FOUND (20150930/psargs-359)
[  0.589975] ACPI Error: Method parse/execution failed [\_TZ.FNCL] (Node ffff880fed91f4d8), AE_NOT_FOUND (20150930/psparse-542)
[  0.589980] ACPI Error: Method parse/execution failed [\_TZ.FN02._ON] (Node ffff880fed91f168), AE_NOT_FOUND (20150930/psparse-542)
[  0.610004] ACPI Error: [\_SB_.PCI0.LPCB.H_EC.ECAV] Namespace lookup failure, AE_NOT_FOUND (20150930/psargs-359)

Well, with all these errors and the kernel panic I made some adjustments - on the one hand to document what is going on, on the other hand to hear your opinion.

1:
SSD check with
Code:
$ smartctl /dev/nvme0 -x

SMART/Health Information (NVMe Log 0x02, NSID 0xffffffff)
Critical Warning:  0x00
Temperature:  45 Celsius
Available Spare:  100%
Available Spare Threshold:  10%
Percentage Used:  0%
Data Units Read:  1,303,319 [667 GB]
Data Units Written:  799,662 [409 GB]
Host Read Commands:  12,615,280
Host Write Commands:  5,916,573
Controller Busy Time:  92
Power Cycles:  23
Power On Hours:  373
Unsafe Shutdowns:  2
Media and Data Integrity Errors:  0
Error Information Log Entries:  0
Warning  Comp. Temperature Time:  0
Critical Comp. Temperature Time:  0

Error Information (NVMe Log 0x01, max 64 entries)
No Errors Logged

As you can see, there are no problems with the NVMe SSD.

2:
I set up a weekly cron job for trim:

Code:
root@oprox:~# cat /etc/cron.weekly/fstrim
#!/bin/sh
## trim the root (/) file system, which lives on the NVMe SSD
## /sbin/fstrim --all || true
LOG=/var/log/batched_discard.log
echo "*** $(date -R) ***" >> $LOG
/sbin/fstrim -v / >> $LOG
##/sbin/fstrim -v /home >> $LOG
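Note that cron only runs scripts in /etc/cron.weekly that are marked executable; a quick sanity check (sketch):

Code:
chmod +x /etc/cron.weekly/fstrim
run-parts --test /etc/cron.weekly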

3:
Concerning the ACPI errors:
the only solution to this problem is either a BIOS update or a suitable kernel,
and both are already up to date.
Therefore, as a workaround, I first disabled all energy-saving modes:

Code:
$ systemctl mask sleep.target suspend.target hibernate.target hybrid-sleep.target

In the BIOS I
- disabled Intel AMT (Intel Active Management Technology)
- and left the "Package C State limit" in auto mode

4:
Now I have completely changed the swap setup.
4a: I moved the swap from the SSD to the ZFS RAID 10 -
partly to relieve the SSD (and with 64 GB I have enough RAM anyway), and it is easier to create a bigger swap volume on the ZFS pool:

Code:
$ swapoff -a

root@oprox:~$ zfs create -V 16G -b $(getconf PAGESIZE) \
  -o logbias=throughput -o sync=always \
  -o primarycache=metadata -o secondarycache=none \
  -o com.sun:auto-snapshot=false r10pool/swap

$ mkswap -f /dev/zvol/r10pool/swap

$ swapon /dev/zvol/r10pool/swap
And I replaced the "swap" line in /etc/fstab with:

Code:
/dev/zvol/r10pool/swap none swap defaults 0 0

4b: I configured the swappiness.
To see what is going on:
Code:
$ cat /proc/sys/vm/swappiness  => 60

I set it permanently in /etc/sysctl.conf with
Code:
vm.swappiness = 1

and with:
Code:
$ sysctl vm.swappiness=1
you can change it immediately

With 1 I changed it to the lowest value (0 = disabled);
the recommended value in the wiki is 10.

have a look at:
Code:
$ cat /proc/swaps
$ cat /proc/sys/vm/swappiness
$ free -hm

5:
Now I limited the ARC cache
Code:
root@oprox:~# cat /etc/modprobe.d/zfs.conf
## RAM to be used by the ZFS ARC (min and max), in bytes:
## 2 GB = 2147483648
## 4 GB = 4294967296
## 8 GB = 8589934592
## 12 GB = 12884901888
options zfs zfs_arc_min=2147483648
options zfs zfs_arc_max=12884901888
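(For reference, these byte values can be computed directly in the shell, for example:)

Code:
echo $((12 * 1024 * 1024 * 1024))   # 12884901888 bytes = 12 GB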

now you have to:

Code:
$ update-initramfs -u

and have a look at:
Code:
$ arcstat.py 3 4
$ cat /proc/spl/kstat/zfs/arcstats
$ cat /sys/module/zfs/parameters/zfs_arc_max

Now I have to test everything.
I would be pleased about feedback.

regards,
maxprox
 

Attachment: scre1.jpg (screenshot of the kernel panic)
For what it is worth, a brief comment: I played with a test install of the Proxmox appliance installer directly to a ZFS mirror (a pair of SATA disks direct-attached to the mainboard, no HW RAID controller). It did work, but some workarounds were needed for some VMs to behave as desired, and in the end I was not so keen to do it this way. I realize some people are keen on ZFS, and that is their choice of course.

What I have done for a few deployments since then, with very good success:
- basic system, no hardware RAID
- 2 x SSD drives, just a pair of 64 or 120 GB disks; nothing fancy is needed
- 2 x SATA drives, 2 TB each, or larger if you need more space (4, 6, 8 TB etc., just use a matched pair)
- do a base install of Jessie/Deb8 onto a manual custom SW RAID setup using about half of the SSD drives, mirrored
- reserve the remaining SSD space for the bcache cache volume (mirror, SW RAID)
- and use the mirrored pair of big spinning-rust SATA disks as the bcache backing volume (mirror, SW RAID)

Then do a standard Proxmox-on-top-of-Debian setup as per the wiki,
then add new storage on top of your bcache volume, calling it something like /bcache-data.
Now any Proxmox VMs you put in the /bcache-data storage pool are very much faster than if they were on vanilla SATA.

Your price point is only slightly higher than a vanilla SATA mirror,
and your end result is absurdly better in terms of I/O performance
(it seems that bcache is very solid and production ready).

Of course, this assumes you can build out a bare Jessie Debian install with SW RAID and then add in bcache etc., but such things are not so hard - a rough sketch follows below.
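For anyone curious, the bcache part boils down to something like the following sketch; /dev/md1 (the SSD mirror partition reserved for the cache) and /dev/md2 (the SATA mirror backing device) are placeholder names, not a tested recipe:

Code:
apt-get install bcache-tools
# register the SATA mirror as the backing device and the SSD mirror as the cache device
make-bcache -B /dev/md2
make-bcache -C /dev/md1
# attach the cache set to the backing device (cset UUID from 'bcache-super-show /dev/md1')
echo <cset-uuid> > /sys/block/bcache0/bcache/attach
# create a file system on the resulting bcache device and mount it for a directory storage
mkfs.ext4 /dev/bcache0
mkdir /bcache-data
mount /dev/bcache0 /bcache-data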

So. Just my 2 cents .. hope it helps maybe someone who is interested ?

Tim
 
;-) Thanks for the answer.
I am smiling because I have already worked with LVM caching, but with quite different hardware and with a hardware RAID controller:
https://forum.proxmox.com/threads/can-we-use-lvm-cache-in-proxmox-4-x.25636/#post-134061
And yes, it is known that ZFS is not so fast, but I like the safety features of ZFS: RAID, copy-on-write and so on. I would have preferred an implementation with btrfs, because that will be the future file system under Linux (and not ZFS), but that is a different topic ;-)
regards,
maxprox
 
