Move Proxmox ZFS install from one SSD to New SSD

AWP Technologies · Sep 21, 2023

My current boot drive for proxmox is on its way out the door. With 37% wearout. I would like to figure out how to do the process before its to late. I have a 250gb ssd and plan on keeping the same size for new one. how do i migrate current install to new disk without loosing anything? I use a single drive with zfs mainly to get the advantages of arc. I know its not ideal but it works for me. please if someone could explain the process of imaging the drive and allow me to start exactly where i left off.

_gabriel · Sep 21, 2023

the recommended and error free way is backup VMs to another disk, then install fresh pve on your fresh disk then restore.
if you know what you're doing, you can follow docs to add fresh disk as mirror disk then after background resilver, you can detach the drive from the pool then remove physically.

ubu · Sep 21, 2023

Of course you can always use dd

zodiac · Sep 22, 2023

You can always create a new ZFS pool on the new drive, snapshot your current state, and then use zfs send/receive to copy the snapshot into the new pool. Afterwards, you need to repair the boot process. That depends on exactly how you are booting your system. Different boot loaders are a little different there.

It's not super difficult, if you have some Linux experience. But you might want to have a rescue disk at hand, in case you confuse yourself.

As @_gabriel said, the more reliable way is to install a brand new version of Proxmox VE and restore backups. But depending on how many changes you made to the Proxmox host, that can be annoying. This is one of the reasons why I always recommend to minimize changes that you made to the host. The host is much more difficult to restore from backup than the containers and VMs

Dunuin · Sep 22, 2023

Most beginner friendly way should be to clone the disk using the backup ISO with UI of your choice. Do your VM/LXC/Config backups just in case. Put both disks into the server, boot into clonezilla and clone the whole disk from your old to the new one, shutdown the server, remove the old disk and test if booting from the new disk will work after changing the boot order via BIOS/UEFI. If you can't put in both disks at the same time backup to a USB disk or NAS and then restore from image later after switching the disks.

If you care about downtime I would go the ZFS mirror route _gabriel explained. Or even better keep that mirror until your old disks fails and then replace it with another one. So much less work, less downtime, less data loss for the 34€ you get a brand new 240GB enterprise SSD with PLP for that will also wear way less. Really no point to cheap out on storage when we are talking about 34€ for a proper disk like a Samsung PM883...

alanwparkerap said:
I use a single drive with zfs mainly to get the advantages of arc. I know its not ideal but it works for me.

I've heard some bikers tell the same for many years about driving without those annoying and expensive helmets...until finally hitting a car

AWP Technologies · Sep 23, 2023

thank you all. With zfs mirror route what boot config do i need to do. My system does not use grub im pretty sure its systemd. I know that I don't use update-grub i have to use proxmox-boot-tool refresh when doing things like enabling iommu. Also is the Samsung pm883 the recommended drive for regular zfs pools as well? Right now I'm using all consumer drives in my main data pool but they seem to be wearing out a lot slower then the boot drive. Honestly i just thinks its because its so old. Its a Samsung 840 pro

AWP Technologies · Sep 23, 2023

Also you guys seem very knowledgeable about proxmox. I have changed the default config for ballooning to encompass 90%% or ram instead of 80%. Is there any harm in this? System gets right to 90% and starts pulling ram from balloon enabled vms. As far as im concerned its working perfect. Just wanted to check though.

zodiac · Sep 23, 2023

As far as I can tell, the host system itself is relative easy on the disk. So, if you have moved all your containers and VMs to a pool on separate drives, it doesn't matter horribly much what you pick for your boot device. My only recommendation would be to switch to 4096 sector sizes before formatting the drive. That can avoid write amplification issues that can happen in some scenarios. And of course, whatever file system you then install, make sure it also is set to 4096 byte sectors. The various ZFS Wiki and forum pages that you can easily find with a web search should have all the gory details for how to do this.

Yes, in principle, proxmox-boot-tool is all you need to run. In practice, I do this so rarely that I don't remember the details. But I recall that it was surprisingly painless. Things definitely used to be much harder in the earlier days of Linux

Dunuin · Sep 23, 2023

zodiac said:
As far as I can tell, the host system itself is relative easy on the disk.

It's doing multiple GBs of writes per day while idleing without a guest running. The system itself isn't writing that much. The problem is that it is doing a lot of very small writes and these get amplified. So there are some noticable writes but primarily caused by overhead. At that rate the NAND of a consumer SSD will probably still last for many many years. The question is more if you are willing to pay 34€ for a proper enterprise grade SSD with PLP for addtional data integrity and higher quality parts or if you aren't willing to pay that and get a cheap 13€ consumer SSD instead. Especially if you keep in mind that when being to close-fisted to spend 34€ for a proper SSD you are probably also not willing to pay for a UPS or redundant PSU.
And you can fill entire books with threads about PVEs not working anymore after a power outage. So PLP isn't a bad thing to have...especially as the ESP won't be covered by ZFS.

zodiac · Sep 24, 2023

I don't dispute that upgrading to better more resilient hardware isn't a good idea. It obviously is. A good enterprise SSD can exceed 10 DWPD and even basic models are going to be at least 1 DWPD, whereas some cheap consumer SSD (e.g. bargain basement M.2 drives) can be much lower than 0.3 DWPD. That's quite a big range.

I also know that conventional wisdom says that you get these insanely large numbers of writes. I just honestly don't understand how PVE is any different from other Linux systems. If you move the containers and virtual machines to dedicated storage (or hypothetically, stop all guests), what can it possibly be doing? This is not a problem for other Debian systems, and those have been around for multiple decades.

You obviously know more about PVE than most of us. Can you fill in some of the details.

Off the top of my head, I can think of the following issues that can result in high amount of writes:

if there is disagreement between logical and physical sector sizes and the file system's block size, you can get some rather pathological edge cases. Worst case, performance goes way down, and write amplification sky rockets. If you make sure you set your drives to 4096 byte logical and physical sectors, and then create your filesystem with the same sector size, things should be much better. I wonder if this relatively common misconfiguration is at the root of the often-repeated warning against seemingly inevitable write amplification.
ZFS pays for some of its proverbial redundancy by duplicating some write operations. Nothing much you can do about that, if you want the benefits of ZFS. But proper tuning definitely helps here.
most of the disk activity for an idle system is a result of writing logs. As writes are buffered, this isn't as bad as it sounds, and I find it really only matters for single-board-computers running from an SD card. In that case, it's a good idea to configure journald for in-RAM logging, and to put /var/log/* onto a tmpfs. I am not convinced this is necessary for normal PC hardware unless I see numbers showing otherwise.
in fact, I just took a quick look at the I/O operations of my Proxmox system, and even without shutting down the containers, it's mostly quiet. I don't see this number going higher than around 1GB/day. For a hypothetical 1TB drive, that's around 0.001 DWPD under idle load. Even considering the proverbially bad write endurance of consumer-grade SSD, that's going to last for the rest of my life.
consumer-grade SSD do wear out much sooner than enterprise-grade. But even then, it's not that bad. Just to put things into perspective, I have consumer-grade SSDs that have ~10 high resolution video streams written to them 24/7. That's an awful lot of constant data. A lot more than a couple of log files generated by mostly idle management server. The last time I had to replace a broken SSD, it has taken this abuse for almost five years, I think.
stacking of different abstraction layers is a problem. If you have a virtual machine that puts its own filesystem inside of an image file that lives in ZFS, then expect a good amount of inefficiencies. All this indirection is going to fight each other. Containers tend to avoid this problem as they can directly access the file system. And so do virtual machines that can mount the host file system using something like virtiofs, instead of requiring a virtual block device. That's relatively easy to do for Linux VMs, but I don't know how to do so for Windows or MacOS.

In other words, I do agree that it is generally prudent to buy better hardware. The write endurance of an SSD is finite. And it's a pain having to deal with failed drives, if you don't have a plan for this inevitability. But I also don't believe that things are quite as dire as everyone says. I have no evidence that an idle Proxmox VE system would be any worse than any other Linux distribution. And while poorly configured virtual machines can wreak a good amount of havoc, containers are pretty easy to tame.

AWP Technologies · Sep 24, 2023

sorry if i was unclear my last post was in reference to ram. proxmox default cutoff is 80% and i changed it to 95%

Dunuin · Sep 24, 2023

zodiac said:
what can it possibly be doing?

For example writing to the SQLite DB each minute. The RRD metrics for the graphs showed in the webUI. Logs.

zodiac · Sep 25, 2023

I hear you in principle. And I agree that those things could be a problem, especially if writes were flushed synchronously. But I simply have a hard time seeing that on my system. The rate of writes is perfectly fine and so low that it won't significantly impact the life expectancy of my drives for years to come. I just checked all the stats both on the drives themselves, and also the number of bytes written as shown in iostat. Even if I round up very aggressively, I won't exceed 1 GB/day for background write activity. And that would be well within reason. My educated guess tells me, that the real number is much lower than 1 GB/day though, but that would require more effort to run proper tests than just estimating based on total lifetime vs total bytes written, and on a couple of minutes of snapshotting iostat.

At least on my system, this is a complete non-issue; and I don't think I did anything to go out of my way to disable logging. But if I was concerned about writes, that's what I would try. Things like logs and ephemeral databases can always be kept in RAM after all.

Dunuin · Sep 25, 2023

I definitely get more writes here. The ZFS mirror is purely used for the PVE system without any guests, ISOs, backups and whatever and according to my monitoring tool (which reads the "Host_Writes_32MiB" SMART attribute directly from the SSDs) both disks each get 20GB per day written to. Whats actually written to the NAND is even higher, according to "NAND_Writes_32MiB", because of the SSDs internal write amplification.

SSDs got PLP and are 4K formated, 8K volblocksize, 128K recordsize, simple ZFS mirror, no swap, sync=standard, relatime enabled. So there isn'T much that should add additional overhead except for the typical ZFS overhead that isn't avoidable.

zodiac · Sep 25, 2023

You know what, I was puzzled why our numbers are so different, and I think I fully believe you now. There are so many minute changes that you can make that can have a dramatic difference in how many writes the system generates. I was trying to drill down a little more, and -- oh my -- those numbers are extremely noisy. At times, I get only a few hundred kilobytes of writes and then it shoots up to megabytes and stays there for a while. Seemingly without rhyme or reason.

I start extra services, and the write load goes down; I stop them and it goes up again. It's infuriating to diagnose. But in the end it's just a normal Linux system. So, it shouldn't wear out an SSD this quickly.

I don't have a complete answer for you. But I have a few knobs that you can tweak and that might help. It turns out, that Proxmox likes to log all API calls to disk, and depending on what you are doing that could be almost nothing at all or a surprisingly constant stream of events. But honestly, I can't envision a scenario when I would need historic data. Storing the log of API accesses in RAM seems absolutely reasonable. They'll be there if I need them, and if a reboot wipes the list, then no big deal. I ended up mounting a "tmpfs" at /var/log/pveproxy I would think that this change should be relatively harmless for most users.

A more controversial change is also mounting a tmpfs at /var/lib/rrdcached That means you loose all your life metrics after a reboot and have to wait for a little bit for them to build up again. Personally, I couldn't care less. My hardware rarely gets rebooted, and while these stats are pretty to look at, I don't really use them for any important. I am fine with historic data disappearing when I need to restart the system.

If you run PBS, then there are similar tweaks. I believe the relevant paths are /var/log/proxmox-backup/api and /var/lib/proxmox-backup/rrdb

And, of course, if you really don't care about historic information, you can always edit /etc/systemd/journald and make the storage "volatile". But that loses you a lot more potentially valuable date. So tread carefully.

Now, for the million-dollar question, does all of this make a difference? And I honestly don't know. The readings are so noisy, I can't reliably tell. All I can say for certain is that I have historic S.M.A.R.T stats for the drives that show me total number of writes and total number of hours, and they don't look at all as bad as your drives. Sure, I have a bunch more disks to spread things out on. But even if I factor that into the math, the number is still much saner.

Move Proxmox ZFS install from one SSD to New SSD

AWP Technologies

Member

_gabriel

Distinguished Member

ubu

Famous Member

zodiac

Active Member

Dunuin

Distinguished Member

AWP Technologies

Member

AWP Technologies

Member

zodiac

Active Member

Dunuin

Distinguished Member

zodiac

Active Member

AWP Technologies

Member

Dunuin

Distinguished Member

zodiac

Active Member

Dunuin

Distinguished Member

zodiac

Active Member

We value your privacy