Proxmox ZFS vs LVM

@noxy

Hello, we think our CT server is somehow corrupted, as it struggles to back up. Could this be because we use LVM on the disks it runs on? We are not sure. See the attached screenshots. Also, our LVM disks are Samsung SSD 870 QVO drives: the first one shows 55% wearout and the other shows 1%. Is this caused by something occupying a lot of space, and is it the reason we experience backup failures now and then?

Please kindly assist.
 

Attachments

  • Screenshot from 2023-04-18 13-15-38.png (114.9 KB)
  • Screenshot from 2023-04-18 13-16-07.png (19 KB)
Hi,

Can you please provide the output of `cat /etc/pve/storage.cfg` and the task log of the failed backup? You can find the failed task at the bottom of the Proxmox VE web UI, as in the second screenshot you provided; double-click it to open the log.
 
This is the output below and we have attached the backup failure errors:

root@ekhaya127:~# cat /etc/pve/storage.cfg
dir: local
        disable
        path /var/lib/vz
        content images,rootdir
        shared 0

lvmthin: local-lvm
        thinpool data
        vgname pve
        content images,rootdir

dir: BackupSiveStax
        path /mnt/sivestax/
        content backup
        prune-backups keep-last=2
        shared 1

lvm: driveB
        vgname driveB
        content images,rootdir
        nodes ekhaya127
        shared 0

lvm: driveD
        vgname driveD
        content images,rootdir
        nodes ekhaya127
        shared 0

dir: ekhaya127Backup1
        path /media/ekhaya127Backup1/
        content images,iso,snippets,rootdir,backup,vztmpl
        nodes ekhaya127
        prune-backups keep-all=1
        shared 1
 

Attachments

  • Screenshot from 2023-04-18 13-39-54.png (12.7 KB)
  • Screenshot from 2023-04-18 13-24-34.png (82.7 KB)
The actual error message in the task log is cut off on the right. It would be better if you could copy-paste the whole task log here in CODE tags.
 
Sorry about that, here's the full error at hand:

INFO: starting new backup job: vzdump 100 --quiet 1 --compress zstd --storage ekhaya127Backup1 --notes-template '{{guestname}}' --node ekhaya127 --mailnotification always --prune-backups 'keep-daily=1,keep-last=1,keep-monthly=1,keep-weekly=1' --mailto noxolo@sive.host --mode snapshot
INFO: Starting Backup of VM 100 (lxc)
INFO: Backup started at 2023-04-18 01:00:03
INFO: status = running
INFO: CT Name: egumeni.sive.host
INFO: including mount point rootfs ('/') in backup
INFO: mode failure - some volumes do not support snapshots
INFO: trying 'suspend' mode instead
INFO: backup mode: suspend
INFO: ionice priority: 7
INFO: CT Name: egumeni.sive.host
INFO: including mount point rootfs ('/') in backup
INFO: starting first sync /proc/2687194/root/ to /media/ekhaya127Backup1//dump/vzdump-lxc-100-2023_04_18-01_00_03.tmp
INFO: first sync finished - transferred 89.50G bytes in 20841s
INFO: suspending guest
INFO: starting final sync /proc/2687194/root/ to /media/ekhaya127Backup1//dump/vzdump-lxc-100-2023_04_18-01_00_03.tmp
INFO: resume vm
INFO: guest is online again after 80 seconds
ERROR: Backup of VM 100 failed - command 'rsync --stats -h -X -A --numeric-ids -aH --delete --no-whole-file --inplace --one-file-system --relative '--exclude=/tmp/?*' '--exclude=/var/tmp/?*' '--exclude=/var/run/?*.pid' /proc/2687194/root//./ /media/ekhaya127Backup1//dump/vzdump-lxc-100-2023_04_18-01_00_03.tmp' failed: exit code 23
INFO: Failed at 2023-04-18 06:49:26
INFO: Backup job finished with errors
TASK ERROR: job errors
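
A note on the line `mode failure - some volumes do not support snapshots` in the log above: in Proxmox VE, plain `lvm` storage does not support snapshots, while `lvmthin` and ZFS do. The fallback to `suspend` mode would therefore be consistent with the container's root volume living on `driveB` or `driveD`. The thread does not confirm which storage CT 100 uses, so treat that as an assumption. A minimal sketch that lists each storage from a `storage.cfg`-style file with its type and a rough snapshot note (the function name `list_storage_types` and the note wording are made up for illustration):

```sh
#!/bin/sh
# Hypothetical helper: print "<storage-id> <type> <snapshot note>" for every
# section header ("type: id") found in a storage.cfg-style file.
list_storage_types() {
    awk '
        # Section headers start in column 1 and look like "lvm: driveB".
        /^[a-z]+: / {
            type = $1
            sub(/:$/, "", type)
            id = $2
            if (type == "lvmthin" || type == "zfspool")
                note = "snapshots"
            else if (type == "lvm")
                note = "no-snapshots"
            else
                note = "depends-on-format"   # e.g. dir storage
            print id, type, note
        }
    ' "$1"
}

# Usage (path taken from the thread):
# list_storage_types /etc/pve/storage.cfg
```

On the configuration posted earlier, this would report `driveB` and `driveD` as `no-snapshots` and `local-lvm` as `snapshots`.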
 
Thank you for the output!

rsync's exit code 23 means "partial transfer due to error", i.e. some data could not be transferred during the backup. Is only CT `100` affected, or do other CTs/VMs have the same backup problem? Can you please back up the same CT to another storage? Last question: may I ask what is running on this CT, is it MariaDB?
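
For quick reference when reading such task logs, here is a small sketch that maps a few rsync exit codes to the descriptions in the rsync man page. The function name `rsync_exit_meaning` is made up for illustration, and only a handful of codes are covered:

```sh
#!/bin/sh
# Map an rsync exit code to its description from the rsync man page
# (EXIT VALUES section). Only a few common codes are listed here.
rsync_exit_meaning() {
    case "$1" in
        0)  echo "success" ;;
        11) echo "error in file I/O" ;;
        12) echo "error in rsync protocol data stream" ;;
        23) echo "partial transfer due to error" ;;
        24) echo "partial transfer due to vanished source files" ;;
        30) echo "timeout in data send/receive" ;;
        *)  echo "see the EXIT VALUES section of the rsync man page" ;;
    esac
}

rsync_exit_meaning 23
```

Code 23 is normally preceded by per-file error messages in rsync's own output, which is why the complete (or `--verbose`) output is more useful than the exit code alone.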
 
So far that is the only error we receive, but the backup sometimes works and sometimes does not.

Another error we see is:
INFO: trying to get global lock - waiting...
ERROR: can't acquire lock '/var/run/vzdump.lock' - got timeout

OK, we will try backing it up to another storage and see whether it works without errors.

Yes, it is MariaDB; it holds our own WHMCS system, which we use daily.

Also, is LVM not an issue, given that we use it on the same disk as our backup? And what should we do about the SSD wearout of 55% on one disk and 1% on the other?
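
One possible explanation for the lock timeout, offered as an assumption rather than a diagnosis: vzdump holds a node-wide lock while a job runs, and a second job waiting for it gives up after the `lockwait` period (180 minutes by default, according to the vzdump documentation). The failed job in the log above ran from 01:00:03 to 06:49:26, so another job started in that window could time out. The sketch below redoes that arithmetic from the log and also estimates the first-sync throughput, assuming `89.50G` means 89.50 * 10^9 bytes. The helper names are made up for illustration:

```sh
#!/bin/sh
# Convert HH:MM:SS to seconds since midnight.
hms_to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Whole minutes between two timestamps on the same day.
elapsed_minutes() {
    start=$(hms_to_seconds "$1")
    end=$(hms_to_seconds "$2")
    echo $(( ($end - $start) / 60 ))
}

# Average throughput in MB/s (1 MB = 10^6 bytes), given GB and seconds.
throughput_mb_s() {
    LC_ALL=C awk -v gb="$1" -v secs="$2" \
        'BEGIN { printf "%.2f\n", gb * 1000 / secs }'
}

elapsed_minutes 01:00:03 06:49:26   # job runtime in minutes
throughput_mb_s 89.50 20841         # first sync: 89.50G in 20841s
```

This prints 349 (minutes, well above 180) and 4.29 (MB/s). A rate that low for an SSD-to-disk copy may be worth investigating on its own.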
 
Hi,


Could you try running the same rsync command directly after the backup has failed, but with "--verbose" added to the rsync parameters?

As for whether LVM on the same disk is an issue, and the SSD wearout: it's difficult to say for sure without more information. Can you please provide the `smartctl` output for both disks?
 
Regarding the Samsung SSD 870 QVO disks you mentioned:
I highly recommend not using Samsung QVO SSDs for this kind of workload. They will inevitably fail, and they usually do it at the worst moment.
Using consumer SSDs is problematic in general, but the Samsung QVO is probably one of the worst possible options.
 
We forgot to move the backup to another storage yesterday as you instructed, but today the backup completed successfully with no errors.
 
Oh, so which SSDs would you suggest instead of the Samsung ones we have now, to avoid these issues?
 
Since we have two of these SSDs, the first output is for the one at 55% wearout and the second output is for the one at 1%.

root@ekhaya127:~# sudo smartctl -a /dev/sdb
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.35-2-pve] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family: Samsung based SSDs
Device Model: Samsung SSD 870 QVO 4TB
Serial Number: S5VYNG0N702977V
LU WWN Device Id: 5 002538 f70709703
Firmware Version: SVQ01B6Q
User Capacity: 4,000,787,030,016 bytes [4.00 TB]
Sector Size: 512 bytes logical/physical
Rotation Rate: Solid State Device
Form Factor: 2.5 inches
TRIM Command: Available, deterministic, zeroed
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-4 T13/BSR INCITS 529 revision 5
SATA Version is: SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Wed Apr 19 12:05:56 2023 SAST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 0) seconds.
Offline data collection
capabilities: (0x53) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
No Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 320) minutes.
SCT capabilities: (0x003d) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   096   096   000    Old_age   Always       -       18416
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       13
177 Wear_Leveling_Count     0x0013   045   045   000    Pre-fail  Always       -       553
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   100   100   010    Pre-fail  Always       -       0
181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0013   100   100   010    Pre-fail  Always       -       0
187 Uncorrectable_Error_Cnt 0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0032   067   060   000    Old_age   Always       -       33
195 ECC_Error_Rate          0x001a   200   200   000    Old_age   Always       -       0
199 CRC_Error_Count         0x003e   100   100   000    Old_age   Always       -       0
235 POR_Recovery_Count      0x0012   099   099   000    Old_age   Always       -       10
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       133060086703

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
256 0 65535 Read_scanning was never started
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


root@ekhaya127:~# sudo smartctl -a /dev/sdd
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.35-2-pve] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family: Samsung based SSDs
Device Model: Samsung SSD 870 QVO 4TB
Serial Number: S5VYNG0N702944E
LU WWN Device Id: 5 002538 f707096e2
Firmware Version: SVQ01B6Q
User Capacity: 4,000,787,030,016 bytes [4.00 TB]
Sector Size: 512 bytes logical/physical
Rotation Rate: Solid State Device
Form Factor: 2.5 inches
TRIM Command: Available, deterministic, zeroed
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-4 T13/BSR INCITS 529 revision 5
SATA Version is: SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Wed Apr 19 12:09:07 2023 SAST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 0) seconds.
Offline data collection
capabilities: (0x53) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
No Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 320) minutes.
SCT capabilities: (0x003d) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   096   096   000    Old_age   Always       -       18417
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       11
177 Wear_Leveling_Count     0x0013   099   099   000    Pre-fail  Always       -       7
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   100   100   010    Pre-fail  Always       -       0
181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0013   100   100   010    Pre-fail  Always       -       0
187 Uncorrectable_Error_Cnt 0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0032   078   063   000    Old_age   Always       -       22
195 ECC_Error_Rate          0x001a   200   200   000    Old_age   Always       -       0
199 CRC_Error_Count         0x003e   100   100   000    Old_age   Always       -       0
235 POR_Recovery_Count      0x0012   099   099   000    Old_age   Always       -       8
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       45381230427

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
256 0 65535 Read_scanning was never started
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
 
Samsung is basically good. The problem here is this specific model (QVO). It uses very cheap QLC chips that are known for a short lifespan when they get too much write load (e.g. fine as application data storage for games, bad for database servers with frequently changing data).

It very much depends on your specific use case. Maybe it is not that bad (low-write servers...).

If you buy something new:
Use at least TLC SSDs with their own DRAM cache (if it has to be cheap: the Crucial MX500 is a solid consumer SATA SSD and uses TLC instead of QLC chips).

If your server has a higher write load and whatever you run is actually important, get real server SSDs. They exist for a reason :)

edit: Btw, how old is the SSD with the 55% wearout?
 
Oh, now I understand.
The SSD will be 3 years old around December this year.
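
For reference, the wearout and write-volume figures discussed in this thread can be reproduced from the smartctl output above. This is a minimal sketch under two assumptions: that the wearout shown in the Proxmox VE UI is 100 minus the normalized `Wear_Leveling_Count` value (which matches the 55% and 1% reported here), and that `Total_LBAs_Written` counts 512-byte sectors, the sector size smartctl reports for these drives. The helper names are made up for illustration:

```sh
#!/bin/sh
# Wearout percentage, assumed to be 100 minus the normalized
# Wear_Leveling_Count value. Leading zeros ("045") are stripped first.
wearout_percent() {
    LC_ALL=C awk -v v="$1" 'BEGIN { sub(/^0+/, "", v); print 100 - v }'
}

# Terabytes written (1 TB = 10^12 bytes) from Total_LBAs_Written,
# assuming 512-byte sectors.
lbas_to_tb() {
    LC_ALL=C awk -v lbas="$1" 'BEGIN { printf "%.2f\n", lbas * 512 / 1e12 }'
}

# Power_On_Hours converted to whole days.
hours_to_days() {
    echo $(( $1 / 24 ))
}

wearout_percent 045        # /dev/sdb
wearout_percent 099        # /dev/sdd
lbas_to_tb 133060086703    # /dev/sdb
lbas_to_tb 45381230427     # /dev/sdd
hours_to_days 18416        # /dev/sdb
```

This prints 55, 1, 68.13, 23.24 and 767. In other words, `/dev/sdb` has taken roughly three times the host writes of `/dev/sdd` over about 767 powered-on days, yet reports far more wear. That gap would fit a workload of small, frequent writes such as a database, though the thread does not establish the cause.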
 
