Best practices for Backup on SATA HDD

Dec 23, 2020
Hello,

After a period of testing, we are about to go into production with PVE. However, for budget and capacity reasons we have to store backups/snapshots on conventional SATA HDDs. We also see I/O issues that cause high CPU load, similar to those reported here, for example: https://bugzilla.kernel.org/show_bug.cgi?id=199727
Most critical is a SUSE Linux server with 2 virtual disks (50G + 100G, VirtIO SCSI); it tends to stall due to CPU load as soon as backups are running.

Are there any recommendations / best practices / proven settings for such scenarios?

TIA,
Sándor
 
4 x Intel(R) Xeon(R) CPU E5603 @ 1.60GHz (1 Socket)
and, maybe of interest:
HDD: Western Digital Ultrastar DC HC320
Controller: HP Smart Array B110i
 
Backup with vzdump or PBS?
If vzdump, try without compression.
Have you checked whether the disks' write cache is enabled in the BIOS?
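For example, a one-off test backup without compression could look like this (VM ID 100 and the storage name are placeholders); hdparm shows the drive-level write-cache state, which may not be visible behind every RAID controller:

vzdump 100 --compress 0 --mode snapshot --storage backup-hdd
hdparm -W /dev/sdX    # /dev/sdX = the backup HDD; reports "write-caching = 1 (on)" when enabled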
 
No wonder you see high CPU load with such a low-end, almost 12-year-old CPU [1]...

See if reducing max-workers [2] helps.

And use the new defaults since 7.3 [3]:
In the web interface, new VMs default to iothread enabled and VirtIO SCSI-Single selected as SCSI controller (if supported by the guest OS)

[1] https://ark.intel.com/content/www/u...603-4m-cache-1-60-ghz-4-80-gts-intel-qpi.html
[2] https://pve.proxmox.com/wiki/Backup_and_Restore#vzdump_configuration
[3] https://pve.proxmox.com/wiki/Roadmap#Proxmox_VE_7.3
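As a sketch of the max-workers part, the line below would go into /etc/vzdump.conf (the exact syntax is spelled out further down in the thread); 1 is just a conservative starting value:

# /etc/vzdump.conf
performance: max-workers=1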
 
Backup is with vzdump; compression is already disabled (it helped a bit), but I am open to alternatives. It is OK if the backup takes some time, as a separate application dump is done before backing up the VM. I will check the write-cache settings during the holidays.

I know the hardware is ancient, but I like the idea of using it as long as possible, for ecological reasons (among others). Performance is absolutely satisfying (Linux guests only); only those backup bottlenecks are a bit worrying.
If I knew whether SSDs would help significantly, I would consider budgeting for them, but I fear the (also ancient) controller would throttle them too much.

Will try the recommended settings.

But /etc/vzdump.conf is completely commented out (full content below), and there is no max-workers entry. I found a 'workers' setting only in the GUI (Datacenter > Options > Maximal Workers/bulk-action), which I regard as not the correct one. Should I simply add the entry to /etc/vzdump.conf?

And can I assume that changing from VirtIO SCSI to VirtIO SCSI single works with the SUSE VM that has two virtual disks?

# vzdump default settings

#tmpdir: DIR
#dumpdir: DIR
#storage: STORAGE_ID
#mode: snapshot|suspend|stop
#bwlimit: KBPS
#ionice: PRI
#lockwait: MINUTES
#stopwait: MINUTES
#stdexcludes: BOOLEAN
#mailto: ADDRESSLIST
#prune-backups: keep-INTERVAL=N[,...]
#script: FILENAME
#exclude-path: PATHLIST
#pigz: N
#notes-template: {{guestname}}
 
Sorry for my English, French here...
You can try PBS alongside PVE (not recommended, but it works nicely here). The first backup will be as slow as a full vzdump, but the following backups will be fast, and even faster if your guest has few shutdowns (hours can become minutes), because the QEMU process monitors which disk blocks changed and transfers only the changed blocks to the destination.
PBS stores backups in a datastore (a .chunks folder with 65k subfolders that hold the data); identical data within the source, or from a second backup, is only ever stored once.
Another example: a stopped VM can be backed up again and again indefinitely.
I use PBS with the datastore on a standard HDD with only a few hundred GB, and no performance penalty yet. I mainly have one or two Windows Server VMs per PVE host.
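If you want to try it, a rough sketch of wiring a PBS datastore into PVE from the CLI (host name, datastore path, storage name, user and fingerprint below are placeholders, not values from this thread):

# on the PBS host: create a datastore on the backup HDD (example path)
proxmox-backup-manager datastore create backup1 /mnt/backup-hdd
# on the PVE node: register it as a storage target
pvesm add pbs pbs-backup --server pbs.example.lan --datastore backup1 --username backup@pbs --fingerprint <PBS cert fingerprint> --password <password>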

Edit: VirtIO SCSI single requires more resources than VirtIO SCSI. By the way, you could try the IDE controller to compare CPU usage during backups. I don't know, perhaps IDE will be a bit slower but less CPU-intensive; that needs confirmation.

Edit 3: Your kernel Bugzilla link is an interesting read. I haven't had that problem yet. SCSI single with iothread=1 / aio=threads can be better; tests have to be done on each piece of hardware to find the best settings.

Edit 2: How many disks do you have? Are the backups stored on the same disk as the VM disks?!
 
>Most critical is a SUSE Linux server with 2 virtual disks (50G + 100G, VirtIO SCSI); it tends to stall due to
>CPU load as soon as backups are running.
>Are there any recommendations / best practices / proven settings for such scenarios?

The stall of the VM most likely does not happen because of CPU load on the host or in the VM, or because of the old CPU or the SATA backup target, but because, in its default configuration, the VM's I/O work in QEMU is handled in the same thread as the VM's CPU processing. So if I/O gets stuck, your CPU gets stuck. Details are mentioned in the bug report.

Use virtio-scsi-single, iothread=1 and aio=threads, and these types of problems will likely go away.

But mind that if your backup target is too slow, that will also impact VM I/O performance, as the VM gets throttled if there is more I/O inside the VM than your backup target can handle during the backup.
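As a sketch of how those settings could be applied, assuming VM ID 100 and disk volumes on a storage named local-lvm (placeholders; check qm config for the real names and do it while the VM is shut down):

qm config 100 | grep -E 'scsihw|scsi[0-9]'    # note the current disk option strings
qm set 100 --scsihw virtio-scsi-single
# re-specifying a disk replaces its whole option string, so carry over any existing options
qm set 100 --scsi0 local-lvm:vm-100-disk-0,iothread=1,aio=threads
qm set 100 --scsi1 local-lvm:vm-100-disk-1,iothread=1,aio=threads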

roland (from that bugreport above ;)
 
Should I simply add the entry to /etc/vzdump.conf?

Make sure your PVE is up to date; this option was added quite recently. Then:
Yes, add it on a new line: performance: max-workers=1
I would start with 1 to see the biggest impact. Later on you could test higher values and see how it performs.
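For example (a generic sketch of the update check and the config line, not specific to your setup):

pveversion                        # shows the installed pve-manager version
apt update && apt dist-upgrade    # bring the node up to date first
echo "performance: max-workers=1" >> /etc/vzdump.conf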

And can I assume that changing from VirtIO SCSI to VirtIO SCSI single works with the SUSE VM that has two virtual disks?

I would highly assume so, but I cannot give you any guarantee.
As ever, have a recent and known-good backup. ;)
 
First of all: thanks for all your efforts!

@_gabriel: your English seems immaculate to me (maybe I don't see the mistakes because I'm German ;) )

PBS? Hmmm. Maybe I'd better describe the environment in more detail:
I have three HP ProLiant DL320 G6 servers with 32 GB RAM each, old but in good condition. Two of them each have one Seagate Exos 7E8 4 TB (for the OS and VMs) and one Western Digital Ultrastar DC HC320 (for backup images), and those are the two PVE nodes. Hence backups are not on the same disk as the VMs.

Regularly, node1 hosts vm1, vm2, vm3, and node2 hosts vm4, vm5, vm6. If one of the servers fails for a non-HDD-related reason, I can simply put its disks into the third (spare) machine. If one of the disks holding the VMs breaks, I can restore from backup onto the intact node.
Maybe it would be better to have all VMs on node1 and use the hardware of node2 for a PBS?

Apart from that, Santa brought me an SSD :) : a Kingston DC450R 960 GB. Should I rather use it for VMs, for backups, or for both (two partitions)?

I will implement all your recommendations on Monday, as the office is closed then, and I know: no backup, no mercy ;)
 
I run PBS on a Dell R200 with 8 GB RAM, using SAS HDDs running ZFS. It is working fine.

It's also the POM (Proxmox Offline Mirror) server.
 
Notice that the DC450R does not have PLP (power-loss protection); the DC500R would have been better because it has PLP.

Edit: For me, PBS is better in any case; even PBS alongside PVE is better than vzdump.
 
Update:
Yesterday I cloned the SUSE VM, updated the node to 7.3, changed the clone from VirtIO SCSI to VirtIO SCSI single and applied the other settings >> it starts with filesystem errors. I reverted to VirtIO SCSI (keeping the new settings, assuming they only take effect with VirtIO SCSI single, which I now know is wrong) >> the errors remained. Fortunately, the original VM with unchanged settings started and works as ever.
I have also installed the SSD (Kingston DC450R) - should I rather use it for VMs or for backups?
 
LVM-thin storage on the SSD for the VMs, the HDD for backups, plus PBS if you do daily/frequent backups.
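As a rough sketch (the device, VG, pool and storage names below are just examples; check with lsblk which device really is the new SSD; the same can also be done via the node's Disks > LVM-Thin panel in the GUI):

lsblk                                       # confirm which device is the new SSD
pvcreate /dev/sdc
vgcreate ssd /dev/sdc
lvcreate -l 95%FREE --thinpool data ssd     # leave a little headroom for pool metadata
pvesm add lvmthin ssd-thin --vgname ssd --thinpool data --content images,rootdir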
 
changed the clone from VirtIO SCSI to VirtIO SCSI single and applied the other settings >> it starts with filesystem errors. I reverted to VirtIO SCSI (keeping the new settings, assuming they only take effect with VirtIO SCSI single, which I now know is wrong) >> the errors remained.

What exactly did those errors look like?

After a quick search I could not find a statement on this, but my assumption would be that the driver for VirtIO SCSI single is the same as for VirtIO SCSI, so the driver should not be the problem here?

If you aim at maximum performance, you can select a SCSI controller of type VirtIO SCSI single which will allow you to select the IO Thread option. When selecting VirtIO SCSI single Qemu will create a new controller for each disk, instead of adding all disks to the same controller.
https://pve.proxmox.com/pve-docs/chapter-qm.html#qm_hard_disk

In case you use non-unique drive/partition identifiers, maybe the identifiers changed after the switch to VirtIO SCSI single, and that is where your problems came from?
But of course, that is only wild guessing without knowing the exact errors...
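If that should turn out to be the cause, a generic example (not taken from your logs) of switching to stable identifiers inside the guest:

# inside the guest: list filesystem UUIDs
blkid
# /etc/fstab: replace device-name entries like /dev/sdb1 with their UUID, e.g.
# UUID=<uuid-from-blkid>   /data   ext4   defaults   0   2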
 
The errors were failures mounting volumes. I didn't dig deeper into this, but I also suspected problems with the clone itself rather than with the changes on the controller side.

The next step will be to move the VM to the new SSD storage. Then I have to make a final decision on whether I go for one PVE node plus one PBS, or stay with two PVE nodes. Unfortunately, my budget is exhausted with PVE, so only a community license for PBS is possible (OT: what makes it several times more expensive than PVE?).
 
Update + new findings:
I moved the storage of the critical VM to the new SSD (with the VM shut down, to be on the safe side). I changed nothing else, and the backup duration (dump) is the same as before. But the additional backup I do to an external NAS (also with a SATA disk, 1 Gb NIC) is much faster now (it takes half the time) - no surprise, as approx. 500,000 small database files are copied via rsync.

Then I moved the disk of another VM to the SSD (a 60 GB Linux VM) and it choked the above-mentioned VM completely (100% CPU) and nearly the node as well (95+% CPU). Looking for information, I found this:
https://forum.proxmox.com/threads/high-cpu-usage-by-kvm.66247/post-298866
I checked accordingly and saw that I had heavily overprovisioned (is that the correct expression here, too?) the CPUs:
- 4 sockets x 2 cores (= 8 vCPUs total) for the often-stalling VM
- 2 sockets x 2 cores (= 4 vCPUs total) for a second one
- 1 socket x 2 cores (= 2 vCPUs total) for another VM

So maybe I simply provided too many vCPUs, and the I/O waits are not the culprit but rather a symptom? I reduced the vCPUs to:

- 2 sockets x 1 core (= 2 vCPUs total) for the often-stalling VM
- 1 socket x 1 core (= 1 vCPU total) for the second one
- 1 socket x 1 core (= 1 vCPU total) for the third VM

We will see what happens, but if that really cures it, it would mean one should not use more vCPUs than there are physical cores available (according to the post linked above). That would astonish me, as I thought sharing resources (especially CPU!) is a core advantage of virtualization in general.
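For reference, a sketch of how such a reduction can be applied from the CLI (the VM IDs 101-103 are placeholders; the new values take effect after a full power-off and start of the VM):

qm set 101 --sockets 2 --cores 1
qm set 102 --sockets 1 --cores 1
qm set 103 --sockets 1 --cores 1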
 
Could you ping your VM from the host during high-I/O periods?

What does that look like? Do you see a permanently low ping (<1 ms), or does it fluctuate heavily up to hundreds/thousands of milliseconds?
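Something like the following, run from the host with the VM's IP as a placeholder, would show whether latency only spikes or the VM stops answering entirely:

ping -D -i 0.5 192.168.1.50    # -D prefixes each reply with a timestamp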
 
The VM is completely dead in those situations: no reaction to ping, no SSH possible, no VNC in the browser console, no reaction in the GUI to the shutdown/reboot/... dropdown menu. The only thing that worked for me was qm stop and then qm start from the node's console. Once the VM is up again, it runs without any visible problems.
 
It dies when the backup runs, correct?

Does it die immediately (i.e. right after the backup starts), or does it take a moment to die?

If it does not die instantly, what does ping show while the backup is starting/running?


>use virtio-scsi-single, iothread=1 and aio=threads, and these types of problems will likely go away

Have you applied these settings or not?
 
