Hi All,
After migrating to 6.2 and reworking the whole disk layout, I am experiencing regular email messages with this content:
Code:
command 'zfs snapshot data/vm-111-disk-0@__replicate_111-0_1593357001__' failed: got timeout
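For reference when correlating failures with other sync jobs: the trailing number in the snapshot name is a Unix timestamp of the replication run. A small shell sketch to decode it (snapshot name taken from the error above):

```shell
#!/bin/bash
# Decode the timestamp embedded in a Proxmox replication snapshot name.
snap='data/vm-111-disk-0@__replicate_111-0_1593357001__'
name=${snap%__}        # drop the trailing "__"
ts=${name##*_}         # the last "_"-separated field is a Unix timestamp
echo "job ran at: $(date -u -d "@$ts")"
```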
I use ZFS replication to keep VMs/containers safe from any single server failure, and the most critical VMs/LXCs are replicated to 2 other servers.
I have Zabbix in place and I see that iowait on these servers averages about 3-5% and sometimes jumps up to 30% (!!!)
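To cross-check the Zabbix numbers directly on the host, here is a rough one-second iowait sample from /proc/stat (generic Linux, nothing Proxmox-specific):

```shell
#!/bin/bash
# One-second iowait sample from /proc/stat: on the aggregate "cpu" line,
# field 6 is cumulative iowait jiffies and fields 2-9 sum to total CPU time.
sample() { awk '/^cpu /{t=0; for(i=2;i<=9;i++) t+=$i; print $6, t}' /proc/stat; }
read -r w1 t1 <<<"$(sample)"
sleep 1
read -r w2 t2 <<<"$(sample)"
awk -v w="$((w2 - w1))" -v t="$((t2 - t1))" \
    'BEGIN { printf "iowait over last second: %.1f%%\n", (t > 0 ? 100 * w / t : 0) }'
```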
System layout:
"rpool" is a small ZFS mirror, mostly for the system root plus a number of very light but important LXCs, built on two SAS 10K 300GB drives.
"data" is the pool where the relatively big VMs reside. It is a mirror of ST4000LM024 drives: 4TB, 5400RPM SATA, 130MB/s by spec. Yes, I know it is slow, but I have no others.
One SATA-attached 400GB Intel SSD is also used, GPT-partitioned into 2 parts: Linux swap, with all the rest used as a cache device for the pool "data", but it does not look very helpful.
All the pools have ashift=12, compression on and sync disabled.
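For completeness, the settings above and the cache hit rate can be double-checked on the host; this is only a diagnostic sketch (the property names are standard ZFS, the pool names are the ones above):

```shell
# Confirm the pool/dataset settings described above (run on the Proxmox host)
zpool get ashift data rpool
zfs get compression,sync data rpool

# L2ARC hit/miss counters, to judge whether the SSD cache is actually helping
grep -E '^l2_(hits|misses)' /proc/spl/kstat/zfs/arcstats
```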
Code:
root@hp1:~# zpool status
  pool: data
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        data        ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sdc     ONLINE       0     0     0
            sdd     ONLINE       0     0     0
        cache
          sde2      ONLINE       0     0     0

errors: No known data errors

  pool: rpool
 state: ONLINE
  scan: scrub repaired 0B in 0 days 00:37:35 with 0 errors on Sun Jun 14 01:01:37 2020
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sda3    ONLINE       0     0     0
            sdb3    ONLINE       0     0     0

errors: No known data errors
The issue is that, on one hand, I get an email notification about an error, but when I go to the web interface to check, all the replications are OK. I guess Proxmox retries it successfully before I get to the web interface.
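One way to confirm the retry theory without racing to the web interface is the replication CLI and the journal; this is only a diagnostic sketch to run on the affected node:

```shell
# Current state of all replication jobs on this node
# (last sync, next scheduled sync, failure count)
pvesr status

# Recent replication activity in the journal, to see whether
# a failed run was followed by a successful retry
journalctl --since "1 day ago" | grep -i replicat
```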
The appearance of the error is quite random and I see no obvious correlation. It looks like it may happen when another sync task is in progress (maybe an incoming sync from another host?).
Quite annoying behaviour.
Any ideas about how to avoid the snapshot timeout? Is the timeout configurable in Proxmox? Are there any other options besides a disk upgrade to 7200RPM or better?