[SOLVED] Suggestions on mirroring boot drives, update! Booting directly from NVMe on older servers.

Good point. However, we are a small family-owned company, and as a 25-year employee and vice president, the family has been transferring stock to me, to the point that when our President (also their brother) retires I will own controlling interest in the company. In short, I have no reason to leave. Second, Proxmox wasn't my idea. I've hired a younger guy I'm training as my assistant and also as my main IT guy. After months of struggling with VMware, trying to build the two-node-plus-witness vSAN we were told we could build, only to find we had outdated hardware at every turn, this younger guy suggested we just give Proxmox a try. Since he's been busy doing other things for me, I started working on Proxmox myself. I had a cluster up and running on a single NIC in less than a week. Then I tore that down, put in all my drives and NICs, ran into a now-solved issue with Ceph IPs, and have this test cluster up and running.

I emphasize test at this point. We've put up a Linux VM and a Windows Server VM and have been doing failure testing: pulling power on a server, unplugging network cables, pulling drives, etc. No, rebuilding a missing OSD isn't as simple as old-school hardware RAID, but a three-node cluster, and eventually a five-node one, is far more fault tolerant.
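For anyone following along, the rough sequence I've been using when a pulled or dead disk takes an OSD with it looks something like this (OSD IDs and device names are placeholders, and exact options can differ between PVE/Ceph versions):

    ceph osd tree                   # identify the down OSD, e.g. osd.3
    ceph osd out osd.3              # mark it out so Ceph starts rebalancing
    pveceph osd destroy 3           # remove it once it is down and out
    pveceph osd create /dev/sdX     # create a new OSD on the replacement disk
    ceph -s                         # watch recovery/backfill until HEALTH_OK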

Why old equipment? Well, our fire happened in May 2021 at the height of pandemic shortages. Our insurance payout did not cover the increased costs of building materials and other inflated items. We're done rebuilding now, but we had our first losing year since the 2008 recession. I already had three RD440s running on single CPUs and only 32 GB of RAM each. Now I have four, soon to be five, with both CPUs and 128 GB of RAM each. The long-term plan is to replace nodes one by one as budget allows. That's actually one reason I knew about the issue of reusing the same name and IP for a replacement, and why I've reserved IPs in my network for future nodes. My idea, right or wrong, is to buy new servers, fail an old one, install the new one, and let Ceph populate it. I'd rather save money now and keep giving my people deserved raises than dump a fortune into new or newish hardware. Plus I'm thinking I can transfer much of what I have to the new servers, like the dual-port NICs, of which I have four per server. At 10G everywhere, my network is 10x faster than it's ever been.

I bought these Optanes because VMware needed them for vSAN cache. May as well use them if I can. Failing that, I'll replace the current consumer drives with enterprise drives, like I need to do on the Ceph cluster itself.

So in short, even with old hardware, this cluster is a rather big performance jump: from two single-CPU servers running individually on HDDs to five dual-CPU nodes with four times the RAM, 10G networking, and all-SSD storage. In baseline testing my VMs are clearly much faster. I'm not bottlenecked by a slow network, slow drives, or too few cores and too little RAM.

Seems to me the solution here is to use the Optanes as described, or install small enterprise drives for boot, and maybe not even worry about backups of the boot drives.

Having said that, please correct me if I'm wrong, but it seems like clusters are a lot like hardware RAID cards: replacement servers need to be "clean", just like RAID cards will only rebuild onto fresh, empty drives. Which means to me that if a server fails, I just replace what's broken, do a clean install, configure networking, and join it back to the cluster.
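If I understand the docs right, the rejoin for a rebuilt server would look roughly like this (hostnames and IPs are placeholders):

    # from a healthy node, remove the dead node's old identity first
    pvecm delnode pve-old
    # on the freshly reinstalled server (new name/IP, per the point above)
    pvecm add 192.0.2.11            # IP of any existing cluster member
    pvecm status                    # confirm it joined and quorum is OK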

I would argue that even commodity equipment, when run in a cluster, gives you better resiliency than beefed-up redundant server gear from 10 years ago. Of course, only if it can handle the load. This does not concern you, but when someone buys a 10-year-old server for a homelab, they would often have been better off buying, e.g., 5 NUCs and running all sorts of clusters on them.

Yes, a cluster node should be something without value of its own. You have the VMs/CTs and you have them backed up, so you can put them back in case of some catastrophic event (e.g. fire). But under normal circumstances (i.e. hardware parts failing) you just go on exchanging parts. Someone might argue a system drive without RAID has no way of self-detecting bit rot, but for the system drive of a node ... well, you will see it behaving oddly, logs becoming strange, and you simply shut it down, replace, rinse, repeat.
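Something as simple as the following (assuming smartmontools is installed) is usually enough to catch it early:

    journalctl -p 3 -b              # I/O and filesystem errors since boot
    smartctl -a /dev/sda            # or /dev/nvme0 for an NVMe system disk
    # climbing reallocated/pending sectors or media errors = time to swap it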

BTW, there are other solutions than PVE. I am not saying better ones, but somehow people treat it like there are just all those expensive commercial ones and then PVE as the alternative. There are not-so-comparable alternatives like OpenStack, but also the little forgotten ones like XCP-ng. If you have time to experiment, you may want to give the latter a try too, but it is very different from PVE; after all, it's Xen. As for containers, there's even LXD (they added VMs later on), but last I remember it was not as complete in terms of High Availability, etc.

One more thing ... if you plan to run HA anything, do a lot of tests with PVE, especially around networking, pulling cables, etc. There are enough horror stories of people trying to run HA and ending up with lower total uptime because of self-fencing and endless reboots by the watchdog. If you do not plan to use any HA, it's not so critical; even with corosync impaired, it certainly won't go on to reboot your nodes.
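While you pull cables, it helps to watch quorum and link state from another node, e.g.:

    pvecm status                    # quorum, votes, membership
    corosync-cfgtool -s             # per-link status of the corosync rings
    journalctl -f -u corosync -u pve-ha-lrm -u pve-ha-crm    # watch for fencing decisions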
 
I wanted to add one more thing. Just before our fire we paid about $7k for VMware 7 licenses and support. It was a total pain in the butt. The interface is not intuitive, hardware support sucks, and their compatibility guide is useless. Now they've been sold to Broadcom for $61 billion. Sure, the world relies on them, but if people like me don't support products like Proxmox, then VMware will never face any real competition. And with financial support, Proxmox will only get better. In my eyes Proxmox is the Ubuntu of the hypervisor world: a simple-to-set-up-and-use system with lots of hardware support and a good community behind it, making it less intimidating to newbies. One shouldn't need a degree in IT to get a small company up and running on a solid server solution. At least that's my opinion. Who knows, maybe in a few years I'll be the guy on here helping the new guys. I hope so at least. So I'm not going to knock them too much for maybe not being entirely enterprise-ready; everyone has to start somewhere. /end soapbox lol.

I understand this logic, but PVE will never be serious competition for ESXi. It's Debian, it's KVM, it's corosync, it has its own HA stack in pure Perl and a watchdog in C, it has now added SDN, it integrates with their own PBS product, etc. But when you look at the commercial side, e.g. support hours and all that, it's still a very non-commercial project at heart. And it's understaffed. Just my observations. But I wish there were many more choices as well.
 
I would argue that even commodity equipment, when run in a cluster, gives you better resiliency than beefed-up redundant server gear from 10 years ago. Of course, only if it can handle the load. This does not concern you, but when someone buys a 10-year-old server for a homelab, they would often have been better off buying, e.g., 5 NUCs and running all sorts of clusters on them.

Yes, a cluster node should be something without value of its own. You have the VMs/CTs and you have them backed up, so you can put them back in case of some catastrophic event (e.g. fire). But under normal circumstances (i.e. hardware parts failing) you just go on exchanging parts. Someone might argue a system drive without RAID has no way of self-detecting bit rot, but for the system drive of a node ... well, you will see it behaving oddly, logs becoming strange, and you simply shut it down, replace, rinse, repeat.
Oh I agree here. I wouldn't be in this mess of older servers if not for VMware. My now-retired IT guy was a VMware guy. I bought two of these servers new in 2015. I wanted a bare-metal server in each building running Windows Server file replication. They had SAS HDDs in them. For some reason he left two of them mirrored for ESXi and then put the VM files on mirrored Buffalo TeraStations. Imagine having your OS drive not only on an HDD but on an HDD across a 1G network, and not even on bonded ports. It was always slow, but at least I didn't have to administer it. With nearly everything stored on those cheap NAS boxes it was painfully slow. So back in 2019 I told him I wanted mirrored servers at a minimum and SSD storage. In late 2020 he bought VMware 7.0 HA and vSAN, but he had no initiative to install it. So I propped up our important server VMs on ESXi on a server I had been experimenting with Nextcloud on.

Anyway, that's when the nightmare began. First, a warning that our CPUs weren't supported by the next release, so no easy upgrade path. Then RAID cards not supported. Then network cards not supported. Finally we started setting up vSAN but never got it working. Found out my brand-new Intel SFP+ cards only work right with Intel SFP+ modules. Then I needed cache drives. Found out NVMes on riser cards aren't supported. Searched the VMware compatibility guide and found M.2 Optanes were supposedly supported. Did the install; only partially supported. If there had been a 12 gauge anywhere around that day, these servers would be full of holes! In the process we bought another server to be the witness and a fourth as a spare. Finally gave up on VMware. I've been at this maybe a month, and I have a running cluster built on an OS I am used to (Debian). And while, if I had it to do all over again, I likely would have just gotten some newer hardware, I now don't need to, at least not yet.

You see, my company has a grand total of 17 employees. All I need to run are a Windows DC and file server, a backup DC, and a MySQL back end for an Access front end I've designed to keep track of production records and quality control. We make generic OTC medicated patches, and the FDA is pushing for digital records over paper. Once you switch, it's not easy to go back to paper. This database has to have an audit trail and reasonable security. I didn't see it running well on a single machine, and downtime means lost production. We have a long history of making old equipment work for us, modifying, even hot-rodding, as need be. I'm not going to have hundreds of users with intense workloads, just 5-6 people at a time entering simple data into the database and office people working on spreadsheets and Word docs. So the only reason I really need a cluster is less downtime, but for few users, not many.

We beta tested the database on a single-CPU RD440 and it ran fine. I played around in it yesterday (still on an Access back end) and it flies in comparison. In fact, the only reason to move to MySQL is the 2 GB limit of Access. I calculated 2 GB as more than a year of data, almost two years, and I could archive by year, but why do that when things like multi-terabyte hard drives exist... If I didn't need this database I'd probably just set up TrueNAS or similar and would have stuck with a single Windows Server VM on each server. But when your entire building needs to be rewired, things like 10G just make sense. I mean, I ran everything in Cat5e in like 2002 or so...
 
I understand this logic, but PVE will never be serious competition for ESXi. It's Debian, it's KVM, it's corosync, it has its own HA stack in pure Perl and a watchdog in C, it has now added SDN, it integrates with their own PBS product, etc. But when you look at the commercial side, e.g. support hours and all that, it's still a very non-commercial project at heart. And it's understaffed. Just my observations. But I wish there were many more choices as well.
I have no doubt you're right. But I think it's a good option, especially cost-wise in both support and hardware, and a decent way for small companies to get in the door with hypervisors instead of bare metal. And who knows, if enough of those little companies support it, maybe in time the understaffing and the missing pieces will change. Maybe, maybe not. I hope it does. I also hope LibreOffice or OpenOffice can go somewhere, but the bigger business support just isn't there. Proxmox is somewhat insulated from that problem: it doesn't matter what hypervisor you're using, as long as it suits your needs. Not the same as trying to share office documents when the world runs on M$.
 

This is sort of a problem with solutions like ESXi: partly it's the "no one ever got fired for buying Cisco" mentality, and partly that the person who comes in is familiar with something and knows it's more valuable to them to keep building their skillset in the commercial product that offers the most professional opportunities. If you are familiar with Debian, you will probably like running PVE and won't care about who comes next.

I am not a fan of any particular solution; if I were, I certainly wouldn't be worried about saying so here. I would just urge anyone to test everything out well, compare, then choose. Unfortunately, people sometimes come to the forum with inquiries like: hey, I have no backup, no nothing, no spare, something failed and I do not even know where my log files are, but can you help me, and also this is production so I do not want to disrupt anything. That's hard then, with any solution whatsoever.

I was fed up when I found some Perl bugs in PVE immediately after I started my share of tests, and I am still not quite able to grasp how those go unfixed for such a long time. But then on the other hand, it's open source, it's Perl; I fixed it for myself, published the patch, and can still complain about it here. Also, fair enough, I was not asked to pay anything for any of that. So in a way, it's not bad at all, even though I wish it were better.

... alas, there's always something good in anything bad. You dumped the licenses and also got new cabling. ;)
 
This is sort of a problem with solutions like ESXi: partly it's the "no one ever got fired for buying Cisco" mentality, and partly that the person who comes in is familiar with something and knows it's more valuable to them to keep building their skillset in the commercial product that offers the most professional opportunities. If you are familiar with Debian, you will probably like running PVE and won't care about who comes next.
My plan is that by the time I pass the company on to the next person, I'll have grown it enough that we can just buy what we need when we need it and not have to worry about costs so much. Heck, maybe we'll have a full-time IT guy at some point. Until then I need things as simple and cheap as possible.
I am not a fan of any particular solution; if I were, I certainly wouldn't be worried about saying so here. I would just urge anyone to test everything out well, compare, then choose. Unfortunately, people sometimes come to the forum with inquiries like: hey, I have no backup, no nothing, no spare, something failed and I do not even know where my log files are, but can you help me, and also this is production so I do not want to disrupt anything. That's hard then, with any solution whatsoever.
Neither am I. We run Sage accounting because that's what our first office administrator knew. I hate it, but we keep using it.

As for any vital equipment, we always keep spares, and with PCs we always have lots of backups and redundancy where needed, which is why I started this thread: I was nervous about single boot drives.
I was fed up when I found some Perl bugs in PVE immediately after I started my share of tests, and I am still not quite able to grasp how those go unfixed for such a long time. But then on the other hand, it's open source, it's Perl; I fixed it for myself, published the patch, and can still complain about it here. Also, fair enough, I was not asked to pay anything for any of that. So in a way, it's not bad at all, even though I wish it were better.

... alas, there's always something good in anything bad. You dumped the licenses and also got new cabling.
That sure is an upside. I'm in love with 10G over copper. I wish prices would come down and make it more affordable for home users. I mean, we have NVMes nowadays that do 7 GB/s, and Gen 5 NVMes are even faster, yet most people are stuck at 110-ish MB/s at home. That wasn't so bad a few years back, when a regular HDD could barely break 120, and then SSDs came along at around 500-600. But we are an order of magnitude higher now. Proxmox backups over bonded 10G are a total joy: just minutes to back up an 80 GB VM.

I remember SCSI DAT tape autoloaders, lol. Heck, I remember our very first file server at work: an old office PC, Pentium II 266 MHz, 128 MB RAM, NT 4.0, two 2 GB mirrored SCSI boot drives, and six 9 GB SCSI drives in RAID 5. That gave way to a dual Pentium Xeon machine whose CPUs ran so cool they had passive cooling. But it was "fast". Not long after is when we went to 1G networking, and it was cheap for home use too. Of course I also remember building my boss a 1 GHz gaming rig with an Ultra2 SCSI boot drive, a "big" 20 GB IDE data drive, and a Voodoo2 card to run Unreal Tournament. Those were the days: LAN parties at work on gigabit networking with the game server in the same room. Latency? What's that? Nobody had an HDD that could saturate the network. Now office PCs can saturate 10G with just a Gen 3 NVMe. It's like networking is stuck in the past, at least for normal people.
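(For anyone wanting to sanity-check their own 10G links between nodes, iperf3 makes it trivial; the IP below is a placeholder:)

    iperf3 -s                       # on one node
    iperf3 -c 192.0.2.21 -P 4       # from another node, 4 parallel streams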
 
Problem solved! As I stated previously, I had 118 GB M.2 NVMe Optane drives I'd bought for ESXi, but my servers won't boot from NVMe. That was true until I hacked the UEFI BIOS. My fifth machine came in today, and I have a spare motherboard, so I figured why not try. The BIOS on the machine was really old, from 2014. I updated to the latest version. It's an AMI BIOS, so next I followed some ideas for consumer boards from the internet: extracted the BIOS image with UEFITool, injected an NVMe driver module, saved the file, and flashed it. After a power cycle it showed "PATA SS" as a boot option. In went my PVE install USB, and the installer offered the P1600X NVMe Optane as an install target. I installed, rebooted, and boom, it's running perfectly. Checking the BIOS, I now have an option for "UEFI OS" followed by the drive model number. I have that first and PATA SS as the second option. No other drives in the system but the NVMe.

It's booting noticeably faster than my other servers, even though the only slot I had open was the single PCIe 2.0 slot, as all the others are 3.0. A quick live boot to Ubuntu shows about what I expected from 2.0 x4: 1500 MB/s read and 1000 MB/s write. I could move it, but my other slots are filled with the one 12 Gb/s HBA and the dual-port 10G NICs. I'm thinking I'd prefer not to bottleneck the NICs when I'm already 3x faster than a SATA/SAS SSD. And these Optanes have great DWPD numbers. I just have them installed in low-cost M.2 to PCIe x4 adapters.

I'm going to tinker with it after the holiday. If it's stable, as I expect it to be since Debian natively supports NVMe, I'll flash the BIOS on the other four servers.
 
Update #2

I've been benchmarking this drive and doing other tests. I made a false assumption in my comment about how I got this to work. I am NOT limited in transfer speed by being in the server's only PCIe Gen 2.0 slot (the others are all Gen 3.0); a Gen 2.0 x4 link is good for roughly 2 GB/s. I'm "limited" by the Optane P1600X NVMe drive itself. The numbers I posted for sequential reads and writes are exactly in line with Intel's specs for these drives.
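(A quick way to reproduce that kind of sequential number is a read-only fio run along these lines; the device name is a placeholder, and reads won't touch the data:)

    fio --name=seqread --filename=/dev/nvme0n1 --rw=read --direct=1 \
        --bs=1M --iodepth=32 --ioengine=libaio --runtime=30 --time_based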

Where this solution is handy is that it's a relatively cheap way to get enterprise-grade endurance and something faster than SATA/SAS storage without too much trouble, and it's also an option for people who would otherwise boot from, say, a USB stick. If you have an AMI UEFI BIOS, this approach should work for you.

I have one more thing I want to try: a U.2 to PCIe adapter card that holds and powers a single U.2 drive, since U.2 NVMe drives are far more common and Optane has been discontinued. I see no reason it won't work. U.2, like M.2, is a connector specification, not a drive specification. My SAS backplane clearly doesn't support U.2.

Bottom line: I shouldn't have to worry about wearing out boot drives for a while, which is why I asked about mirroring in the first place. This simple solution of adding an enterprise-grade flash drive has solved that problem; I'll just keep an eye on the SMART data. I can't say I would recommend any sort of Optane as a boot drive in general: yes, they are high endurance, but they are really pricey per GB. However, I've now booted both the M.2 Optanes and consumer M.2 NVMes and see no reason this can't work for U.2 NVMe as well, making this a potential lower-cost way to build a home or small-business cluster that doesn't need the speed of newer servers. Like in my case, where the cluster will run two Windows Server 2019 VMs and, using ProxySQL (or similar), 3-4 copies of a MySQL server with very few transactions per hour. My use case needed high uptime and lots of data protection. We ran for years on these same servers with only one CPU installed and 32 GB of RAM, two servers each running a Windows Server VM, with data stored on HDDs in crappy Buffalo TeraStations. I've already seen major improvements with just a three-node Ceph cluster.
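Keeping an eye on wear is a one-liner, assuming nvme-cli (or smartmontools) is installed:

    nvme smart-log /dev/nvme0       # watch percentage_used, data_units_written, media_errors
    smartctl -a /dev/nvme0          # same idea via smartmontools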

The way I see things, I've built the infrastructure. When I need more speed or capacity I can add new servers two at a time to my five-node cluster and assign them priority until I've replaced all five. But right now, based on some real workloads, I've made a big improvement. I migrated my first DC to the cluster last week and told no one. We ran all week with no issues and not much CPU or RAM usage, and two of my users asked if I'd changed their desktops because they noticed things opening faster or the machine being more responsive. And bear in mind this is still on consumer SSDs for storage. I am now sourcing enterprise SAS SSDs for all five machines. I'll move all the consumer SSDs to an archival storage server for monthly and annual backup storage. My PBS machine has large SAS HDDs in hardware RAID for daily backups, with on-card SSD caching.

I suppose the only question now is whether to make the fifth machine, currently running PBS in a VM on PVE, a full member of the cluster or just use it as a monitor for quorum. I have a smaller server that could run PBS on bare metal, but it only holds four drives. That's a topic for another thread, I suppose.
 
Update #2

I'm "limited" by the Optane P1600x NVMe drive itself

It's not really any issue for the system drive.

Where this solution is handy is that it's a relatively cheap way to get enterprise-grade endurance and something faster than SATA/SAS storage without too much trouble, and it's also an option for people who would otherwise boot from, say, a USB stick. If you have an AMI UEFI BIOS, this approach should work for you.

I might be wrong, but the currently sold "consumer" SSDs (with DRAM and PLP), once at 2 TB capacities and up, happen to have TBW ratings around 2 PB. I do not think the Optane is/was cheaper, so it's mostly that you made good use of a drive you already had around.

I have one more thing I want to try: a U.2 to PCIe adapter card that holds and powers a single U.2 drive, since U.2 NVMe drives are far more common and Optane has been discontinued. I see no reason it won't work. U.2, like M.2, is a connector specification, not a drive specification. My SAS backplane clearly doesn't support U.2.

But why are you so focused on NVMe speeds for the OS drive?

I migrated my first DC to the cluster last week and told no one.

That's some team work going on there. :D

I suppose the only question now is whether to make the fifth machine, currently running PBS in a VM on PVE, a full member of the cluster or just use it as a monitor for quorum. I have a smaller server that could run PBS on bare metal, but it only holds four drives. That's a topic for another thread, I suppose.

Unless you need that machine's capacity to run VMs, I would just have it host a QDevice for the cluster and keep it separate.
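If you go that route, the setup is roughly the following (from memory, so check the docs; the IP is a placeholder):

    # on the standalone machine that will host the QDevice
    apt install corosync-qnetd
    # on every cluster node
    apt install corosync-qdevice
    # then, from any one cluster node
    pvecm qdevice setup 192.0.2.53  # IP of the qnetd host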
 
It's not really any issue for the system drive.
I might be wrong, but the currently sold "consumer" SSDs (with DRAM and PLP), once at 2 TB capacities and up, happen to have TBW ratings around 2 PB. I do not think the Optane is/was cheaper, so it's mostly that you made good use of a drive you already had around.



But why are you so focused on NVMe speeds for the OS drive?
I know, and I don't need NVMe speeds for the OS drive. I bought these originally as caching drives for VMware. Now they are repurposed as high endurance system drives.
That's some team work going on there. :D
Well, I did tell my apprentice. You can't do a blind test if you tell people. Actually this was sort of a controlled test: my other DC has some shares my lab and compliance teams use, and I left those on the old hardware. File transfer speeds are better on the cluster, as I would expect. But one check I did was opening Excel files of similar size from the cluster vs. the single server. I noticed more responsiveness from the all-SSD cluster vs. the HDD RAID single server. I expected that, but it's nice to confirm it.

Sorry, I'm more of a scientist and engineer than an IT pro. I'm a computer nerd to be sure, but not a pro. As such, blind testing with controls is something I'm used to doing.

I also ran Ceph benchmarks and saved all the results in a file. I'm going to run these tests again once all the servers are in their three respective homes and I have "4.5" nodes set up. I say 4.5 because one will not have Ceph storage. See below.
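(For reference, the Ceph benchmarks I mean are along the lines of the standard rados bench run against a scratch pool, roughly:)

    ceph osd pool create bench 32
    rados bench -p bench 60 write --no-cleanup
    rados bench -p bench 60 seq
    rados bench -p bench 60 rand
    rados -p bench cleanup
    ceph osd pool delete bench bench --yes-i-really-really-mean-it   # needs mon_allow_pool_delete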
Unless you need that machine's capacity to run VMs, I would just have it host a QDevice for the cluster and keep it separate.
It only needs to run PBS in a VM, mainly because I'd like a backup of PBS itself at least monthly; not the data, just the configuration.

I've repurposed my TeraStations as long-term archival storage. They already have replication between them, and that's where I plan to keep a copy of my monthly and annual backups: basically attach one as a share to PBS and push a backup copy off to them on a schedule.

I don't need it for VM space. I currently have two DCs with a grand total of less than 400 GB. MySQL won't add that much. We just don't have that many client machines or that much data. This whole thing has always been about availability and security. Honestly I probably could have done it with SQL Server on Windows, FRS, and just two servers. But I like the quorum and extra redundancy of Ceph.
 
