Hi,
I've been running a 3 node hyper-converged CEPH/Proxmox 5.2 cluster for a few months now.
It seems I'm not alone in having consistent issues with automatic backups:
- Scheduled backups repeatedly hang at some point, often on multiple nodes.
- This in turn causes hangs in the kernel IO system (high IO waits with no IOPS) -> reduced performance
- Ultimately blocked IO on some RBDs will occur, bringing down VMs.
- Nodes are generally left in a state that they cannot be cleanly shut down, even with manual unlocking and process killing, so a hard reboot is needed to recover (not nice!)
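For anyone trying to confirm they're hitting the same thing, this is roughly what I check when a backup hangs - all standard tools, nothing Proxmox-specific:
Code:
# processes stuck in uninterruptible sleep (D state) - typically tar/rsync/vzdump
ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /D/'
# high %iowait with near-zero IOPS on the affected rbdX device
iostat -x 5
# any rbd/libceph/ext4 errors logged by the kernel
dmesg -T | grep -iE 'rbd|libceph|ext4'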
I've found several other threads on here with similar issues, but none seem to reach a complete conclusion, and the solutions that are reached by users do not seem to be fully acknowledged by Proxmox support.
By combining input from those threads, it appears my issues are solved (at least it's gone from never getting a successful backup of all CTs/VMs to 3 days of successful daily backups).
So the purpose of this thread is to try and put all the knowledge of other threads I've read in one place, in the hope it:
a) Helps others suffering the same
b) Helps Proxmox support acknowledge and apply permanent fixes for this
TL;DR
There are two issues that can result in similar symptoms. Unless both are addressed you will likely still have problems.
1) Container backups using the default 'snapshot' behaviour create corrupt snapshots, which crash the CEPH RBD driver
* SOLUTION:
- Either use suspend instead of snapshot, or see the proposed patch here:
https://bugzilla.proxmox.com/show_bug.cgi?id=1911
2) A kernel/ceph bug (there is some debate as to which) affects bluestore and can cause checksum errors (actually, zeros are returned rather than the data, which is caught by the checksum mismatch). Arguably it is a kernel bug that this happens at all, and a ceph bug that it cannot recover from it, resulting in IO hangs. Evidence suggests this happens more on systems under memory pressure (which can be pressure from caching rather than from running processes).
* SOLUTION (well, more a workaround):
- Reduce memory pressure:
In /etc/pve/ceph.conf [global] section set (eg.)
Code:
bluestore_cache_size_hdd = 176160768
bluestore_cache_size_ssd = 176160768
- Ensure enough memory is available for atomic operations (which seems to include some CEPH operations):
On all nodes with OSDs, create /etc/sysctl.d/99-minfree.conf with the line:
Code:
vm.min_free_kbytes=1048576
- Schedule scrubbing out of hours (and certainly outside of backup times).
In /etc/pve/ceph.conf [global] section set (eg.)
Code:
osd_scrub_begin_hour = 2
osd_scrub_end_hour = 6
(for me, backups run at 1:00 each day and take < 1h to complete on each node)
See the details below for a potential fix to Ceph that would avoid this.
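For what it's worth, this is roughly how I applied and checked the above without waiting for a reboot - treat it as a sketch, not gospel (the bluestore cache sizes are only read at OSD start, and osd.0 below is just an example ID):
Code:
# load the sysctl drop-in and confirm the new value
sysctl -p /etc/sysctl.d/99-minfree.conf
sysctl vm.min_free_kbytes
# bluestore cache sizes only apply after an OSD restart (do one node at a time!)
systemctl restart ceph-osd.target
# confirm what a running OSD is actually using (run on the node hosting osd.0)
ceph daemon osd.0 config get bluestore_cache_size_hdd
# the scrub window can also be injected at runtime in addition to ceph.conf
ceph tell osd.* injectargs '--osd_scrub_begin_hour 2 --osd_scrub_end_hour 6'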
More details:
1) The Snapshot bug:
Symptoms of this are seeing backups freeze like this:
Code:
INFO: creating archive '/ceph/templates//dump/vzdump-lxc-105-2018_11_22-01_00_02.tar.lzo'
And this is dmesg:
Code:
EXT4-fs error (device rbd5): ext4_lookup:1575: inode #2621882: comm tar: deleted inode referenced: 2643543
Assertion failure in rbd_queue_workfn() at line 4035:
rbd_assert(op_type == OBJ_OP_READ || rbd_dev->spec->snap_id == CEPH_NOSNAP);
------------[ cut here ]------------
kernel BUG at drivers/block/rbd.c:4035!
invalid opcode: 0000 [#1] SMP PTI
After that, the tar process trying to read from the EXT4 mount that gets removed is left hung, and the system shows high IO waits on the affected RBD even with no IOPS. The RBD and any affected VM/CT are left locked; the locks cannot be manually overridden, and a reboot (normally a hard one, as shutdown does not complete in this scenario) is the only way to recover.
The issue is covered in the bug report here:
https://bugzilla.proxmox.com/show_bug.cgi?id=1911
It includes a proposed patch, but the Proxmox team comment here:
https://forum.proxmox.com/threads/container-backup-issue-ceph-nfs.47835/#post-225349
explaining that they are not happy with the global sync, and are looking for other solutions (no news since).
Personally, without this patch backups fail and nodes need regular reboots to recover; with it, they work. So performance with the sync seems better than without - and if you're not happy with the sync, don't use snapshot mode for container backups.
The Proxmox team also comment that this doesn't make sense, as EXT4 should deal with an inconsistent snapshot as if it were a hard-poweroff scenario. However, the way I read the log above, it's actually the CEPH RBD driver that crashes on the bad snapshot, not EXT4. Thus it's important to follow the CEPH rules for taking a consistent snapshot, so that CEPH doesn't crash.
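For reference, the freeze-before-snapshot idea boils down to something like the following manual sequence (pool, image and mountpoint names here are made up for the example - this is not the actual patch, just the principle it follows):
Code:
# quiesce the filesystem on the RBD image so no writes are in flight
fsfreeze --freeze /mnt/vzsnap-example
# take the RBD snapshot while the filesystem is frozen
rbd snap create my-pool/vm-105-disk-0@backup-test
# unfreeze again as quickly as possible
fsfreeze --unfreeze /mnt/vzsnap-example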
Several more threads seem to have the same issue, including:
https://forum.proxmox.com/threads/lxc-backups-hang-via-nfs-and-cifs.46669/
- A user patch to freeze before the snapshot, even for a single drive, is proposed and works for multiple users.
https://bugzilla.proxmox.com/show_bug.cgi?id=1911
https://forum.proxmox.com/threads/backup-lxc-container-freeze.47634/
- User says they will try the above patch, then no more comments (presumably because it worked).
https://forum.proxmox.com/threads/backup-hangup-with-ceph-rbd.45820/
- Comment that using 'Suspend' helps - which of course avoids the freeze before snapshot issue altogether.
https://forum.proxmox.com/threads/container-backup-issue-ceph-nfs.47835/#post-225335
- User seems to misunderstand the patch, saying 'The containers only have a single volume, so this is unfortunately not the issue.' - when in fact the freeze part of the patch applies only to single-volume containers, so it should indeed help here.
https://forum.proxmox.com/threads/in-letzter-zeit-immer-wieder-hängende-backups-von-ceph-nfs.44343/
- Seems to have not reached a solution
https://forum.proxmox.com/threads/ceph-bad-checksum.48301/
- Fixed by the above patch, then goes on to suffer from issue 2 below.
2) The Zero Read / Checksum bug:
Symptoms of this are backups hanging during either tar or rsync (depending on whether snapshot or suspend mode is used).
This appears in dmesg:
Code:
libceph: get_reply osd2 tid 333933 data 933888 > preallocated 131072, skipping
And this is /var/log/ceph/ceph-osd.X.log:
Code:
Nov 20 03:00:18 pve-1 ceph-osd[3101]: 2018-11-20 03:00:18.375231 7fe449601700 -1 bluestore(/var/lib/ceph/osd/ceph-4) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x7000, got 0x6706be76, expected 0xa2e5cb01, device location [0x46ac1b000~1000], logical extent 0x397000~1000, object #14:e06471de:::rbd_data.898266b8b4567.0000000000001826:head#
Nov 20 03:00:18 pve-1 ceph-osd[3101]: 2018-11-20 03:00:18.375299 7fe449601700 -1 log_channel(cluster) log [ERR] : 14.7 missing primary copy of 14:e06471de:::rbd_data.898266b8b4567.0000000000001826:head, will try copies on 5
A big clue here is the 'got 0x6706be76' part of the checksum error - this is the checksum for zeros being read, which is not normal.
Again, a process reading an RBD mount when this happens will hang, and the system shows high IO waits on the affected RBD even with no IOPS. The RBD and any affected VM/CT are left locked; the locks cannot be manually overridden, and a reboot (normally a hard one, as shutdown does not complete in this scenario) is the only way to recover.
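If you want to check whether your own cluster has hit this, both the OSD-side and the client-side signatures are easy to grep for (OSD log paths as on a stock PVE/Ceph install):
Code:
# OSD side: the bad-checksum reads, with the tell-tale all-zeros crc
grep '_verify_csum bad' /var/log/ceph/ceph-osd.*.log | tail -n 5
grep -c 'got 0x6706be76' /var/log/ceph/ceph-osd.*.log
# client side: the matching symptom in the kernel log
dmesg -T | grep 'libceph: get_reply'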
Covered in Ceph bug report here:
https://tracker.ceph.com/issues/22464
The proposed retry patch was merged into Mimic to address this last month, but there is no Luminous backport so far that I can see. There is also no apparent news on identifying the change in kernel 4.9 that appears to have introduced this behaviour.
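As far as I can tell, the upstream fix adds a config option to retry reads that fail the checksum - I believe it's called bluestore_retry_disk_reads, but that name is from memory, so treat it as an assumption. If you want to check whether the build you're running knows about it:
Code:
# prints the option and its value if the running OSD supports it; no output on a build without the fix
ceph daemon osd.0 config show | grep retry_disk_reads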
Related Proxmox threads:
https://forum.proxmox.com/threads/ceph-bad-checksum.48301/
- Thread goes quiet after updating to ceph 12.2.8, but in my experience this still doesn't fix it (and there is no reason it should, given the evidence in the Ceph bug report).
https://forum.proxmox.com/threads/ceph-bug-patch-inclusion-request.46373/
- Clearly documents the same issue and requests that the Ceph retry patch be included by Proxmox. There was resistance from the Proxmox team at the time due to the lack of upstream support; however, the patch has since been accepted into upstream Ceph, so maybe it's reasonable to reconsider this now?
https://forum.proxmox.com/threads/regular-errors-on-ceph-pgs.44658/
- Lots of focus on a RAID controller being a potential cause, but may be related.
My config details (as I know someone will ask):
Proxmox 5.2, clean install on all 3 nodes.
2 nodes with OSDs (2x4TB HDD, 1x 0.5TB NVMe SSD) + 24GB RAM
- No raid controller
- Crush rules separate HDD and SSD, so RBDs are on one or the other (and the above-mentioned issues can happen with VMs/CTs on either storage type)
- Disks all < 6 months old
1 node as MON and CT/VM host only - 16GB RAM
Dedicated 10Gb network between nodes 1-2 for CEPH traffic
Dedicated 10Gb network for public traffic
1Gb network for admin/corosync
Proxmox 5.2 with all latest updates from pve-no-subscription on all nodes
ceph versions reports 12.2.8 for all components on all nodes
Admittedly my hardware is somewhat low on resources (but not highly loaded), though in general I'm pleased with the performance aside from these hangs.
As a closing thought, one thing I find somewhat concerning here is how apparently delicate Ceph can be. Two separate issues shown here can each crash Ceph in a way that necessitates a hard reboot to return to normal operation. Not ideal for a technology intended to support data integrity and HA systems?