[SOLVED] vzdump fails - sysfs write failed

vzdump backup all worked until July 20, 2020. Now all vzdump jobs fail with a similar error:
Code:
INFO: starting new backup job: vzdump 101 --compress zstd --node putsproxp04 --remove 0 --mode snapshot --storage nas06p2
INFO: Starting Backup of VM 101 (qemu)
INFO: Backup started at 2020-08-03 14:24:44
INFO: status = running
INFO: VM Name: putsfishpv01
INFO: include disk 'scsi0' 'vmpool01:vm-101-disk-0' 100G
In some cases useful info is found in syslog - try "dmesg | tail".
rbd: sysfs write failed
ERROR: Backup of VM 101 failed - can't map rbd volume vm-101-disk-0: rbd: sysfs write failed
INFO: Failed at 2020-08-03 14:25:45
INFO: Backup job finished with errors

TASK ERROR: job errors
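To reproduce this outside of vzdump (a sanity-check sketch only; the pool and image name are taken from the failing job above), the same mapping can be attempted by hand on the node and the kernel log checked:
Code:
# attempt the same kernel mapping vzdump needs, then check the kernel log
rbd map vmpool01/vm-101-disk-0
dmesg | tail
# clean up again if the map happens to succeed
rbd unmap vmpool01/vm-101-disk-0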
 
vzdump backup all worked until July 20, 2020. Now all vzdump jobs fail with a similar error:
<snip>
rbd: sysfs write failed
ERROR: Backup of VM 101 failed - can't map rbd volume vm-101-disk-0: rbd: sysfs write failed

From that, I'd say you have an even bigger problem with your Ceph storage. What is the output of ceph -s? (Please post the output in [CODE] [/CODE] tags.)
 
Seems fine
Code:
root@putsproxp01:~# ceph -s
  cluster:
    id:     fbfde291-2831-4005-802f-5a01e95c9615
    health: HEALTH_OK

  services:
    mon: 6 daemons, quorum putsproxp06,putsproxp05,putsproxp04,proxp03,proxp01,proxp02 (age 24m)
    mgr: putsproxp03(active, since 3w), standbys: putsproxp02, putsproxp01
    mds: cephfs01:1 {0=putsproxp06=up:active} 2 up:standby
    osd: 24 osds: 24 up (since 23m), 24 in (since 7w)

  data:
    pools:   3 pools, 672 pgs
    objects: 340.56k objects, 1.3 TiB
    usage:   11 TiB used, 171 TiB / 182 TiB avail
    pgs:     672 active+clean

  io:
    client:   495 KiB/s rd, 105 KiB/s wr, 12 op/s rd, 11 op/s wr
 
Code:
root@putsproxp01:~# ceph osd pool ls detail
pool 1 'cephfs01_data' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 128 pgp_num 128 autoscale_mode warn last_change 8816 flags hashpspool stripe_width 0 application cephfs
pool 2 'cephfs01_metadata' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode warn last_change 8816 flags hashpspool stripe_width 0 application cephfs
pool 4 'vmpool01' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 512 pgp_num 512 autoscale_mode warn last_change 14281 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
        removed_snaps [1~3]

and the storage configuration (/etc/pve/storage.cfg):
Code:
dir: local
        path /var/lib/vz
        content vztmpl,backup,iso

lvmthin: local-lvm
        thinpool data
        vgname pve
        content images,rootdir

rbd: vmpool01
        content rootdir,images
        krbd 1
        pool vmpool01

nfs: isoRepo
        export /data/isoRepo
        path /mnt/pve/isoRepo
        server putslinv12.bocc.putnam-fl.com
        content images,vztmpl,iso
        options vers=3

nfs: nas5p1
        export /mnt/nas5pool1
        path /mnt/pve/nas5p1
        server 10.10.23.39
        content images,rootdir
        options vers=3

nfs: nas5p2
        export /mnt/nas5pool2
        path /mnt/pve/nas5p2
        server 10.10.23.39
        content images
        options vers=3

nfs: nas06p1
        export /mnt/nas6pool1
        path /mnt/pve/nas06p1
        server 10.10.23.42
        content images
        
nfs: nas06p2
        export /mnt/nas6pool2
        path /mnt/pve/nas06p2
        server putsnasp06
        content images,backup
        maxfiles 0
 
That all seems fine. So, is the content of /etc/vzdump.conf the same on all the nodes? Is there something unique about putsproxp04? Does the backup fail the same way on all nodes, or just this one?
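One more hedged idea (a generic diagnostic, not specific to your setup): "rbd: sysfs write failed" is also the typical symptom of an image carrying features the node's kernel RBD client cannot map. The enabled features can be listed like this (image name taken from your backup log):
Code:
# show the image's enabled features; anything the kernel client does not
# support will prevent the krbd mapping that vzdump/migration needs
rbd info vmpool01/vm-101-disk-0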
 
All the same

Code:
root@putsproxp06:~# cat /etc/vzdump.conf
# vzdump default settings

#tmpdir: DIR
#dumpdir: DIR
#storage: STORAGE_ID
#mode: snapshot|suspend|stop
#bwlimit: KBPS
#ionice: PRI
#lockwait: MINUTES
#stopwait: MINUTES
#size: MB
#stdexcludes: BOOLEAN
#mailto: ADDRESSLIST
#maxfiles: N
#script: FILENAME
#exclude-path: PATHLIST
#pigz: N

I did notice that migration is failing with the same sysfs write error.
 
Same on all nodes.
pveversion -v:
Code:
root@putsproxp06:~# pveversion -v
proxmox-ve: 6.2-1 (running kernel: 5.4.44-2-pve)
pve-manager: 6.2-10 (running version: 6.2-10/a20769ed)
pve-kernel-5.4: 6.2-4
pve-kernel-helper: 6.2-4
pve-kernel-5.3: 6.1-6
pve-kernel-5.0: 6.0-11
pve-kernel-5.4.44-2-pve: 5.4.44-2
pve-kernel-5.4.44-1-pve: 5.4.44-1
pve-kernel-5.4.41-1-pve: 5.4.41-1
pve-kernel-5.4.34-1-pve: 5.4.34-2
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-5.3.18-2-pve: 5.3.18-2
pve-kernel-5.3.13-3-pve: 5.3.13-3
pve-kernel-5.3.13-1-pve: 5.3.13-1
pve-kernel-5.3.10-1-pve: 5.3.10-1
pve-kernel-5.0.21-5-pve: 5.0.21-10
pve-kernel-5.0.21-4-pve: 5.0.21-9
pve-kernel-5.0.21-3-pve: 5.0.21-7
pve-kernel-5.0.21-2-pve: 5.0.21-7
pve-kernel-5.0.21-1-pve: 5.0.21-2
pve-kernel-5.0.15-1-pve: 5.0.15-1
ceph: 14.2.9-pve1
ceph-fuse: 14.2.9-pve1
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: residual config
ifupdown2: 3.0.0-1+pve2
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.16-pve1
libproxmox-acme-perl: 1.0.4
libpve-access-control: 6.1-2
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.1-5
libpve-guest-common-perl: 3.1-1
libpve-http-server-perl: 3.0-6
libpve-storage-perl: 6.2-5
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.2-1
lxcfs: 4.0.3-pve3
novnc-pve: 1.1.0-1
openvswitch-switch: 2.12.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.2-9
pve-cluster: 6.1-8
pve-container: 3.1-11
pve-docs: 6.2-5
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-2
pve-firmware: 3.1-1
pve-ha-manager: 3.0-9
pve-i18n: 2.1-3
pve-qemu-kvm: 5.0.0-11
pve-xtermjs: 4.3.0-1
qemu-server: 6.2-10
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.4-pve1

and pvesm list vmpool01 (all nodes the same):
Code:
root@putsproxp01:~# pvesm list vmpool01
Volid                    Format  Type              Size VMID
vmpool01:base-104-disk-0 raw     images    134217728000 104
vmpool01:base-105-disk-0 raw     images    214748364800 105
vmpool01:vm-100-disk-0   raw     images     10737418240 100
vmpool01:vm-101-disk-0   raw     images    107374182400 101
vmpool01:vm-102-disk-0   raw     images    107374182400 102
vmpool01:vm-103-disk-0   raw     images    107374182400 103
vmpool01:vm-106-disk-0   raw     images    107374182400 106
vmpool01:vm-107-disk-0   raw     images     17179869184 107
vmpool01:vm-108-disk-0   raw     images    134217728000 108
vmpool01:vm-109-disk-0   raw     images    214748364800 109
vmpool01:vm-110-disk-0   raw     images    214748364800 110
vmpool01:vm-111-disk-0   raw     images    107374182400 111
vmpool01:vm-112-disk-0   raw     images    214748364800 112
vmpool01:vm-114-disk-0   raw     images     80530636800 114
vmpool01:vm-114-disk-1   raw     images    214748364800 114
vmpool01:vm-115-disk-0   raw     images     80530636800 115
vmpool01:vm-116-disk-0   raw     images    214748364800 116
vmpool01:vm-117-disk-0   raw     images     80530636800 117
vmpool01:vm-119-disk-0   raw     images     80530636800 119
vmpool01:vm-119-disk-1   raw     images         1048576 119
vmpool01:vm-120-disk-0   raw     images     80530636800 120
vmpool01:vm-122-disk-0   raw     images     80530636800 122
vmpool01:vm-123-disk-0   raw     images    107374182400 123
vmpool01:vm-125-disk-0   raw     images    161061273600 125
vmpool01:vm-125-disk-1   raw     images         1048576 125
vmpool01:vm-126-disk-0   raw     images    107374182400 126
vmpool01:vm-127-disk-0   raw     images    161061273600 127
 
Code:
pvesm status
Name             Type     Status           Total            Used       Available        %
isoRepo           nfs     active      2108488704       336919552      1664457728   15.98%
local             dir     active        25413876        14911972         9187908   58.68%
local-lvm     lvmthin     active        55177216        35539644        19637571   64.41%
nas06p1           nfs     active     45030415616      4065703168     40964712448    9.03%
nas06p2           nfs     active     45030501504     19849495936     25181005568   44.08%
nas5p1            nfs     active     45030018304       890875392     44139142912    1.98%
nas5p2            nfs     active     45029629312      6443948160     38585681152   14.31%
vmpool01          rbd     active     57332962178      1337070466     55995891712    2.33%
 
Check the syslog of your nodes. You should find some information there.
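For example (generic commands, adjust the filters as needed):
Code:
# kernel ring buffer usually shows why an rbd map attempt was rejected
dmesg | tail -n 50
# or filter the journal / syslog for rbd and libceph messages
journalctl -b | grep -iE 'rbd|libceph'
grep -iE 'rbd|libceph' /var/log/syslog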
 
At this point I have this RBD issue (can't map rbd volume for a disk image) when trying any of these operations:
Migrate a VM from any node (1-6) to any node (1-6)
Back up any VM on any node (1-6)
Start a new VM on any node (1-6)
 
I have searched all the logs and I do not see an indicator of what could be at fault here. I just updated Ceph to 14.2.10; same error messages:
Code:
2020-08-05 07:44:40 starting migration of VM 107 to node 'putsproxp01' (192.168.2.95)
2020-08-05 07:44:41 starting VM 107 on remote node 'putsproxp01'
2020-08-05 07:45:44 [putsproxp01] can't map rbd volume vm-107-disk-0: rbd: sysfs write failed
2020-08-05 07:45:44 ERROR: online migrate failure - remote command failed with exit code 255
2020-08-05 07:45:44 aborting phase 2 - cleanup resources
2020-08-05 07:45:44 migrate_cancel
2020-08-05 07:45:44 ERROR: migration finished with problems (duration 00:01:04)
TASK ERROR: migration problems
 
Unchecking the KRBD flag in the RBD storage config seems to have fixed the sysfs write failed issue.
If someone can explain what the krbd flag does in the RBD config, that would be great.

Thank you
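For anyone finding this thread later, a short (hedged) summary: with krbd 1 set on an RBD storage, Proxmox maps VM disk images through the kernel RBD driver (rbd map, exposing /dev/rbdX block devices), and that kernel mapping is exactly the step that was failing here with "sysfs write failed". With the flag unchecked, VM disks are accessed through librbd inside QEMU instead, so no kernel mapping is needed; containers on RBD still use the kernel client regardless of the flag. The option can also be toggled from the CLI, for example:
Code:
# equivalent to unchecking the KRBD checkbox for this storage in the GUI
pvesm set vmpool01 --krbd 0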