[SOLVED] vzdump fails - sysfs write failed

vzdump backup all functioned until July 20, 2020. Now all vzdump jobs fail with a similar error:
Code:
INFO: starting new backup job: vzdump 101 --compress zstd --node putsproxp04 --remove 0 --mode snapshot --storage nas06p2
INFO: Starting Backup of VM 101 (qemu)
INFO: Backup started at 2020-08-03 14:24:44
INFO: status = running
INFO: VM Name: putsfishpv01
INFO: include disk 'scsi0' 'vmpool01:vm-101-disk-0' 100G
In some cases useful info is found in syslog - try "dmesg | tail".
rbd: sysfs write failed
ERROR: Backup of VM 101 failed - can't map rbd volume vm-101-disk-0: rbd: sysfs write failed
INFO: Failed at 2020-08-03 14:25:45
INFO: Backup job finished with errors

TASK ERROR: job errors
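
As a sanity check outside of vzdump, the same kernel mapping can be tried by hand (pool and image names taken from the log above), followed by a look at the kernel log as the error message itself suggests:
Code:
# attempt the same kernel RBD mapping PVE does when krbd is enabled
rbd map vmpool01/vm-101-disk-0
# the underlying reason for "sysfs write failed" usually shows up here
dmesg | tail
# clean up if the mapping unexpectedly succeeded
rbd unmap vmpool01/vm-101-disk-0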
 
vzdump backup all functioned until July 20, 2020. Now all vzdump jobs fail with a similar error:
<snip>
rbd: sysfs write failed
ERROR: Backup of VM 101 failed - can't map rbd volume vm-101-disk-0: rbd: sysfs write failed

From that, I'd say you have an even bigger problem with your Ceph storage. What is the output of ceph -s? (Please post the output in [CODE] [/CODE] tags.)
 
Seems fine
 
Code:
root@putsproxp01:~# ceph -s
  cluster:
    id:     fbfde291-2831-4005-802f-5a01e95c9615
    health: HEALTH_OK

  services:
    mon: 6 daemons, quorum putsproxp06,putsproxp05,putsproxp04,proxp03,proxp01,proxp02 (age 24m)
    mgr: putsproxp03(active, since 3w), standbys: putsproxp02, putsproxp01
    mds: cephfs01:1 {0=putsproxp06=up:active} 2 up:standby
    osd: 24 osds: 24 up (since 23m), 24 in (since 7w)

  data:
    pools:   3 pools, 672 pgs
    objects: 340.56k objects, 1.3 TiB
    usage:   11 TiB used, 171 TiB / 182 TiB avail
    pgs:     672 active+clean

  io:
    client:   495 KiB/s rd, 105 KiB/s wr, 12 op/s rd, 11 op/s wr
 
Code:
root@putsproxp01:~# ceph osd pool ls detail
pool 1 'cephfs01_data' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 128 pgp_num 128 autoscale_mode warn last_change 8816 flags hashpspool stripe_width 0 application cephfs
pool 2 'cephfs01_metadata' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode warn last_change 8816 flags hashpspool stripe_width 0 application cephfs
pool 4 'vmpool01' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 512 pgp_num 512 autoscale_mode warn last_change 14281 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
        removed_snaps [1~3]

and the storage configuration (/etc/pve/storage.cfg):
Code:
dir: local
        path /var/lib/vz
        content vztmpl,backup,iso

lvmthin: local-lvm
        thinpool data
        vgname pve
        content images,rootdir

rbd: vmpool01
        content rootdir,images
        krbd 1
        pool vmpool01

nfs: isoRepo
        export /data/isoRepo
        path /mnt/pve/isoRepo
        server putslinv12.bocc.putnam-fl.com
        content images,vztmpl,iso
        options vers=3

nfs: nas5p1
        export /mnt/nas5pool1
        path /mnt/pve/nas5p1
        server 10.10.23.39
        content images,rootdir
        options vers=3

nfs: nas5p2
        export /mnt/nas5pool2
        path /mnt/pve/nas5p2
        server 10.10.23.39
        content images
        options vers=3

nfs: nas06p1
        export /mnt/nas6pool1
        path /mnt/pve/nas06p1
        server 10.10.23.42
        content images
        
nfs: nas06p2
        export /mnt/nas6pool2
        path /mnt/pve/nas06p2
        server putsnasp06
        content images,backup
        maxfiles 0
 
That all seems fine. So, is the content of /etc/vzdump.conf the same on all the nodes? Is there something unique about putsproxp04? Does the backup fail the same way on all nodes, or just this one?
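A quick way to compare it across the cluster would be something like this (just a sketch; node names taken from the ceph -s output above, adjust to your actual hostnames):
Code:
# checksum /etc/vzdump.conf on every node and compare
for n in putsproxp01 putsproxp02 putsproxp03 putsproxp04 putsproxp05 putsproxp06; do
    echo -n "$n: "; ssh "$n" md5sum /etc/vzdump.conf
done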
 
All the same

Code:
root@putsproxp06:~# cat /etc/vzdump.conf
# vzdump default settings

#tmpdir: DIR
#dumpdir: DIR
#storage: STORAGE_ID
#mode: snapshot|suspend|stop
#bwlimit: KBPS
#ionice: PRI
#lockwait: MINUTES
#stopwait: MINUTES
#size: MB
#stdexcludes: BOOLEAN
#mailto: ADDRESSLIST
#maxfiles: N
#script: FILENAME
#exclude-path: PATHLIST
#pigz: N

I did notice that migration is failing with the same sysfs write error.
 
Same on all nodes:
pveversion -v
Code:
root@putsproxp06:~# pveversion -v
proxmox-ve: 6.2-1 (running kernel: 5.4.44-2-pve)
pve-manager: 6.2-10 (running version: 6.2-10/a20769ed)
pve-kernel-5.4: 6.2-4
pve-kernel-helper: 6.2-4
pve-kernel-5.3: 6.1-6
pve-kernel-5.0: 6.0-11
pve-kernel-5.4.44-2-pve: 5.4.44-2
pve-kernel-5.4.44-1-pve: 5.4.44-1
pve-kernel-5.4.41-1-pve: 5.4.41-1
pve-kernel-5.4.34-1-pve: 5.4.34-2
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-5.3.18-2-pve: 5.3.18-2
pve-kernel-5.3.13-3-pve: 5.3.13-3
pve-kernel-5.3.13-1-pve: 5.3.13-1
pve-kernel-5.3.10-1-pve: 5.3.10-1
pve-kernel-5.0.21-5-pve: 5.0.21-10
pve-kernel-5.0.21-4-pve: 5.0.21-9
pve-kernel-5.0.21-3-pve: 5.0.21-7
pve-kernel-5.0.21-2-pve: 5.0.21-7
pve-kernel-5.0.21-1-pve: 5.0.21-2
pve-kernel-5.0.15-1-pve: 5.0.15-1
ceph: 14.2.9-pve1
ceph-fuse: 14.2.9-pve1
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: residual config
ifupdown2: 3.0.0-1+pve2
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.16-pve1
libproxmox-acme-perl: 1.0.4
libpve-access-control: 6.1-2
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.1-5
libpve-guest-common-perl: 3.1-1
libpve-http-server-perl: 3.0-6
libpve-storage-perl: 6.2-5
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.2-1
lxcfs: 4.0.3-pve3
novnc-pve: 1.1.0-1
openvswitch-switch: 2.12.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.2-9
pve-cluster: 6.1-8
pve-container: 3.1-11
pve-docs: 6.2-5
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-2
pve-firmware: 3.1-1
pve-ha-manager: 3.0-9
pve-i18n: 2.1-3
pve-qemu-kvm: 5.0.0-11
pve-xtermjs: 4.3.0-1
qemu-server: 6.2-10
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.4-pve1

and pvesm list vmpool01 (all nodes the same):
Code:
root@putsproxp01:~# pvesm list vmpool01
Volid                    Format  Type              Size VMID
vmpool01:base-104-disk-0 raw     images    134217728000 104
vmpool01:base-105-disk-0 raw     images    214748364800 105
vmpool01:vm-100-disk-0   raw     images     10737418240 100
vmpool01:vm-101-disk-0   raw     images    107374182400 101
vmpool01:vm-102-disk-0   raw     images    107374182400 102
vmpool01:vm-103-disk-0   raw     images    107374182400 103
vmpool01:vm-106-disk-0   raw     images    107374182400 106
vmpool01:vm-107-disk-0   raw     images     17179869184 107
vmpool01:vm-108-disk-0   raw     images    134217728000 108
vmpool01:vm-109-disk-0   raw     images    214748364800 109
vmpool01:vm-110-disk-0   raw     images    214748364800 110
vmpool01:vm-111-disk-0   raw     images    107374182400 111
vmpool01:vm-112-disk-0   raw     images    214748364800 112
vmpool01:vm-114-disk-0   raw     images     80530636800 114
vmpool01:vm-114-disk-1   raw     images    214748364800 114
vmpool01:vm-115-disk-0   raw     images     80530636800 115
vmpool01:vm-116-disk-0   raw     images    214748364800 116
vmpool01:vm-117-disk-0   raw     images     80530636800 117
vmpool01:vm-119-disk-0   raw     images     80530636800 119
vmpool01:vm-119-disk-1   raw     images         1048576 119
vmpool01:vm-120-disk-0   raw     images     80530636800 120
vmpool01:vm-122-disk-0   raw     images     80530636800 122
vmpool01:vm-123-disk-0   raw     images    107374182400 123
vmpool01:vm-125-disk-0   raw     images    161061273600 125
vmpool01:vm-125-disk-1   raw     images         1048576 125
vmpool01:vm-126-disk-0   raw     images    107374182400 126
vmpool01:vm-127-disk-0   raw     images    161061273600 127
 
Code:
pvesm status
Name             Type     Status           Total            Used       Available        %
isoRepo           nfs     active      2108488704       336919552      1664457728   15.98%
local             dir     active        25413876        14911972         9187908   58.68%
local-lvm     lvmthin     active        55177216        35539644        19637571   64.41%
nas06p1           nfs     active     45030415616      4065703168     40964712448    9.03%
nas06p2           nfs     active     45030501504     19849495936     25181005568   44.08%
nas5p1            nfs     active     45030018304       890875392     44139142912    1.98%
nas5p2            nfs     active     45029629312      6443948160     38585681152   14.31%
vmpool01          rbd     active     57332962178      1337070466     55995891712    2.33%
 
Check the syslog of your nodes. You should find some information there.
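For example, something along these lines on the node where the mapping fails should surface the relevant messages (a sketch, assuming the default log locations on PVE 6):
Code:
# kernel rbd/libceph messages are usually the interesting part
grep -iE 'rbd|libceph' /var/log/syslog | tail -n 50
# or via the journal, limited to the last hour
journalctl --since "1 hour ago" | grep -iE 'rbd|libceph'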
 
I started a migration from node 2 to node 1; the syslog output from node 2 and node 1 is attached.
 

Attachments

  • syslog_node1.txt (121.2 KB)
  • syslog_node2.txt (66.3 KB)
At this point I have this issue with RBD (can't map rbd volume for a disk image) when trying any of these operations:
Migrate a VM from any node (1-6) to any node (1-6)
Back up any VM on any node (1-6)
Start a new VM on any node (1-6)
 
I have searched all the logs and do not see an indicator of what could be at fault here. I just updated Ceph to 14.2.10; same error messages:
Code:
2020-08-05 07:44:40 starting migration of VM 107 to node 'putsproxp01' (192.168.2.95)
2020-08-05 07:44:41 starting VM 107 on remote node 'putsproxp01'
2020-08-05 07:45:44 [putsproxp01] can't map rbd volume vm-107-disk-0: rbd: sysfs write failed
2020-08-05 07:45:44 ERROR: online migrate failure - remote command failed with exit code 255
2020-08-05 07:45:44 aborting phase 2 - cleanup resources
2020-08-05 07:45:44 migrate_cancel
2020-08-05 07:45:44 ERROR: migration finished with problems (duration 00:01:04)
TASK ERROR: migration problems
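
Since the cluster was just updated, it may also be worth confirming that every daemon is actually on the same release (a quick check, assuming nothing beyond a healthy cluster):
Code:
# shows which Ceph version each mon/mgr/mds/osd is running
ceph versions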
 
Unchecking the KRBD flag in the RBD storage config seems to have fixed the "sysfs write failed" issue.
If someone can explain what the krbd flag does in the RBD config, that would be great.
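For what it is worth, my understanding (not an official statement) is: with krbd 1, Proxmox maps each image through the in-kernel rbd driver before handing it to QEMU, which is exactly the rbd map / sysfs write path that was failing here; with krbd 0, QEMU talks to the pool directly through userspace librbd, so no kernel mapping is needed for VM disks. In storage.cfg terms the only difference is the flag on the pool entry, e.g.:
Code:
# /etc/pve/storage.cfg - same entry as posted above, with the KRBD checkbox cleared
rbd: vmpool01
        content rootdir,images
        krbd 0
        pool vmpool01

Containers, if I remember correctly, always use the kernel client regardless of this flag, so the underlying mapping problem may still be worth tracking down.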

Thank you
 
