Hi,
I've been running a 3 node hyper-converged CEPH/Proxmox 5.2 cluster for a few months now.
It seems I'm not alone in having consistent issues with automatic backups:
- Scheduled backups repeatedly hang at some point, often on multiple nodes.
- This in turn causes hangs in the kernel IO system (high IO waits with no IOPS) -> reduced performance
- Ultimately blocked IO on some RBDs will occur, bringing down VMs.
- Nodes are generally left in a state that they cannot be cleanly shut down, even with manual unlocking and process killing, so a hard reboot is needed to recover (not nice!)
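For anyone trying to confirm they're hitting the same thing, this is roughly what I check when a backup hangs - all standard tools, nothing Proxmox-specific:
Code:
# processes stuck in uninterruptible sleep (D state) - typically tar/rsync/vzdump
ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /D/'
# high %iowait with near-zero IOPS on the affected rbdX device
iostat -x 5
# any rbd/libceph/ext4 errors logged by the kernel
dmesg -T | grep -iE 'rbd|libceph|ext4'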
I've found several other threads on here with similar issues, but none seem to reach a complete conclusion, and the solutions that are reached by users do not seem to be fully acknowledged by Proxmox support.
By combining input from those threads, it appears my issues are solved (at least it's gone from never getting a successful backup of all CTs/VMs to 3 days of successful daily backups).
So the purpose of this thread is to try and put all the knowledge of other threads I've read in one place, in the hope it:
a) Helps others suffering the same
b) Helps Proxmox support acknowledge and apply permanent fixes for this
TL;DR
There are two issues that can result in similar symptoms. Unless both are addressed you will likely still have problems.
1) Container backups using the default 'snapshot' behaviour create corrupt snapshots, which crash the CEPH RBD driver
* SOLUTION:
- Either use suspend instead of snapshot, or see the proposed patch here:
https://bugzilla.proxmox.com/show_bug.cgi?id=1911
2) A kernel/ceph bug (there is some debate as to which) affects bluestore and can cause checksum errors (actually, zeros are returned rather than the data, which is caught by the checksum mismatch). Arguably it is a kernel bug that this happens at all, and a ceph bug that it cannot recover from it, resulting in IO hangs. Evidence suggests this happens more on systems under memory pressure (which can be pressure from caching rather than from running processes).
* SOLUTION (well, more a workaround):
- Reduce memory pressure:
In /etc/pve/ceph.conf [global] section set (eg.)
Code:
bluestore_cache_size_hdd = 176160768
bluestore_cache_size_ssd = 176160768
- Ensure enough memory is available for atomic operations (which seems to include some CEPH operations):
On all nodes with OSDs, create /etc/sysctl.d/99-minfree.conf with the line:
Code:
vm.min_free_kbytes=1048576
- Schedule scrubbing out of hours (and certainly outside of backup times).
In /etc/pve/ceph.conf [global] section set (eg.)
Code:
osd_scrub_begin_hour = 2
osd_scrub_end_hour = 6
(for me, backups run at 1:00 each day and take < 1h to complete on each node)
See the details below for a potential fix to Ceph that would avoid this.
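For what it's worth, this is roughly how I applied and checked the above without waiting for a reboot - treat it as a sketch, not gospel (the bluestore cache sizes are only read at OSD start, and osd.0 below is just an example ID):
Code:
# load the sysctl drop-in and confirm the new value
sysctl -p /etc/sysctl.d/99-minfree.conf
sysctl vm.min_free_kbytes
# bluestore cache sizes only apply after an OSD restart (do one node at a time!)
systemctl restart ceph-osd.target
# confirm what a running OSD is actually using (run on the node hosting osd.0)
ceph daemon osd.0 config get bluestore_cache_size_hdd
# the scrub window can also be injected at runtime in addition to ceph.conf
ceph tell osd.* injectargs '--osd_scrub_begin_hour 2 --osd_scrub_end_hour 6'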
More details:
1) The Snapshot bug:
Symptoms of this are seeing backups freeze like this:
Code:
INFO: creating archive '/ceph/templates//dump/vzdump-lxc-105-2018_11_22-01_00_02.tar.lzo'
And this is dmesg:
Code:
EXT4-fs error (device rbd5): ext4_lookup:1575: inode #2621882: comm tar: deleted inode referenced: 2643543
Assertion failure in rbd_queue_workfn() at line 4035:
rbd_assert(op_type == OBJ_OP_READ || rbd_dev->spec->snap_id == CEPH_NOSNAP);
------------[ cut here ]------------
kernel BUG at drivers/block/rbd.c:4035!
invalid opcode: 0000 [#1] SMP PTI
After that, the tar process trying to read from the EXT4 mount that gets removed is left hung, and the system shows high IO waits on the affected RBD even with no IOPS. The RBD and any affected VM/CT are left locked; the locks cannot be manually overridden, and a reboot (normally a hard one, as shutdown does not complete in this scenario) is the only way to recover.
The issue is covered in the bug report here:
https://bugzilla.proxmox.com/show_bug.cgi?id=1911
It includes a proposed patch, but the Proxmox team comment here:
https://forum.proxmox.com/threads/container-backup-issue-ceph-nfs.47835/#post-225349
explaining that they are not happy with the global sync, and are looking for other solutions (no news since).
Personally, without this patch backups fail and nodes need regular reboots to recover; with it, they work. So performance with the sync seems better than without - and if you're not happy with the sync, don't use snapshot mode for container backups.
The Proxmox team also comment that this doesn't make sense, as EXT4 should deal with an inconsistent snapshot as if it were a hard-poweroff scenario. However, the way I read the log above, it's actually the CEPH RBD driver that crashes on the bad snapshot, not EXT4. Thus it's important to follow the CEPH rules for taking a consistent snapshot, so that CEPH doesn't crash.
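For reference, the freeze-before-snapshot idea boils down to something like the following manual sequence (pool, image and mountpoint names here are made up for the example - this is not the actual patch, just the principle it follows):
Code:
# quiesce the filesystem on the RBD image so no writes are in flight
fsfreeze --freeze /mnt/vzsnap-example
# take the RBD snapshot while the filesystem is frozen
rbd snap create my-pool/vm-105-disk-0@backup-test
# unfreeze again as quickly as possible
fsfreeze --unfreeze /mnt/vzsnap-example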
Several more threads seem to have the same issue, including:
https://forum.proxmox.com/threads/lxc-backups-hang-via-nfs-and-cifs.46669/
- A user patch to freeze before the snapshot, even for a single drive, is proposed and works for multiple users.
https://bugzilla.proxmox.com/show_bug.cgi?id=1911
https://forum.proxmox.com/threads/backup-lxc-container-freeze.47634/
- User says they will try the above patch, then no more comments (presumably because it worked).
https://forum.proxmox.com/threads/backup-hangup-with-ceph-rbd.45820/
- Comment that using 'Suspend' helps - which of course avoids the freeze before snapshot issue altogether.
https://forum.proxmox.com/threads/container-backup-issue-ceph-nfs.47835/#post-225335
- User seems to misunderstand the patch, saying 'The containers only have a single volume, so this is unfortunately not the issue.' - when in fact the freeze part of the patch applies only to single-volume containers, so it should indeed help here.
https://forum.proxmox.com/threads/in-letzter-zeit-immer-wieder-hängende-backups-von-ceph-nfs.44343/
- Seems to have not reached a solution
https://forum.proxmox.com/threads/ceph-bad-checksum.48301/
- Fixed by the above patch, then goes on to suffer from issue 2 below.
2) The Zero Read / Checksum bug:
Symptoms of this are backups hanging during either tar or rsync (depending on whether snapshot or suspend mode is used).
This appears in dmesg:
Code:
libceph: get_reply osd2 tid 333933 data 933888 > preallocated 131072, skipping
And this is /var/log/ceph/ceph-osd.X.log:
Code:
Nov 20 03:00:18 pve-1 ceph-osd[3101]: 2018-11-20 03:00:18.375231 7fe449601700 -1 bluestore(/var/lib/ceph/osd/ceph-4) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x7000, got 0x6706be76, expected 0xa2e5cb01, device location [0x46ac1b000~1000], logical extent 0x397000~1000, object #14:e06471de:::rbd_data.898266b8b4567.0000000000001826:head#
Nov 20 03:00:18 pve-1 ceph-osd[3101]: 2018-11-20 03:00:18.375299 7fe449601700 -1 log_channel(cluster) log [ERR] : 14.7 missing primary copy of 14:e06471de:::rbd_data.898266b8b4567.0000000000001826:head, will try copies on 5
A big clue here is the 'got 0x6706be76' part of the checksum error - this is the checksum for zeros being read, which is not normal.
Again, a process reading an RBD mount when this happens will hang, and the system shows high IO waits on the affected RBD even with no IOPS. The RBD and any affected VM/CT are left locked; the locks cannot be manually overridden, and a reboot (normally a hard one, as shutdown does not complete in this scenario) is the only way to recover.
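If you want to check whether your own cluster has hit this, both the OSD-side and the client-side signatures are easy to grep for (OSD log paths as on a stock PVE/Ceph install):
Code:
# OSD side: the bad-checksum reads, with the tell-tale all-zeros crc
grep '_verify_csum bad' /var/log/ceph/ceph-osd.*.log | tail -n 5
grep -c 'got 0x6706be76' /var/log/ceph/ceph-osd.*.log
# client side: the matching symptom in the kernel log
dmesg -T | grep 'libceph: get_reply'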
Covered in Ceph bug report here:
https://tracker.ceph.com/issues/22464
The proposed retry patch was merged into Mimic to address this last month, but there is no Luminous backport so far that I can see. There is also no apparent news on identifying the change in kernel 4.9 that appears to have introduced this behaviour.
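As far as I can tell, the upstream fix adds a config option to retry reads that fail the checksum - I believe it's called bluestore_retry_disk_reads, but that name is from memory, so treat it as an assumption. If you want to check whether the build you're running knows about it:
Code:
# prints the option and its value if the running OSD supports it; no output on a build without the fix
ceph daemon osd.0 config show | grep retry_disk_reads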
Related Proxmox threads:
https://forum.proxmox.com/threads/ceph-bad-checksum.48301/
- Thread goes quiet after updating to ceph 12.2.8, but in my experience this still doesn't fix it (and there is no reason it should, given the evidence in the Ceph bug report).
https://forum.proxmox.com/threads/ceph-bug-patch-inclusion-request.46373/
- Clearly documents the same issue and requests that the Ceph retry patch be included by Proxmox. There was resistance from the Proxmox team at the time due to the lack of upstream support; however, the patch has since been accepted into upstream Ceph, so maybe it's reasonable to reconsider this now?
https://forum.proxmox.com/threads/regular-errors-on-ceph-pgs.44658/
- Lots of focus on a RAID controller being a potential cause, but may be related.
My config details (as I know someone will ask):
Proxmox 5.2, clean install on all 3 nodes.
2 nodes with OSDs (2x4TB HDD, 1x 0.5TB NVMe SSD) + 24GB RAM
- No raid controller
- Crush rules separate HDD and SSD, so RBDs are on one or the other (and the above-mentioned issues can happen with VMs/CTs on either storage type)
- Disks all < 6 months old
1 node as MON and CT/VM host only - 16GB RAM
Dedicated 10Gb network between nodes 1-2 for CEPH traffic
Dedicated 10Gb network for public traffic
1Gb network for admin/corosync
Proxmox 5.2 with all latest updates from pve-no-subscription on all nodes
ceph versions reports 12.2.8 for all components on all nodes
Admittedly my hardware is somewhat low on resources (but not highly loaded), though in general I'm pleased with the performance aside from these hangs.
As a closing thought, one thing I find somewhat concerning here is how apparently delicate Ceph can be. Two separate issues shown here can each crash Ceph in a way that necessitates a hard reboot to return to normal operation. Not ideal for a technology intended to support data integrity and HA systems?