[SOLVED] vzdump fails - sysfs write failed

vzdump backup all functioned until July 20, 2020. Now all vzdump jobs fail with a similar error:
Code:
INFO: starting new backup job: vzdump 101 --compress zstd --node putsproxp04 --remove 0 --mode snapshot --storage nas06p2
INFO: Starting Backup of VM 101 (qemu)
INFO: Backup started at 2020-08-03 14:24:44
INFO: status = running
INFO: VM Name: putsfishpv01
INFO: include disk 'scsi0' 'vmpool01:vm-101-disk-0' 100G
In some cases useful info is found in syslog - try "dmesg | tail".
rbd: sysfs write failed
ERROR: Backup of VM 101 failed - can't map rbd volume vm-101-disk-0: rbd: sysfs write failed
INFO: Failed at 2020-08-03 14:25:45
INFO: Backup job finished with errors

TASK ERROR: job errors
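
As a sanity check outside of vzdump, the same kernel mapping can be tried by hand (pool and image names taken from the log above), followed by a look at the kernel log as the error message itself suggests:
Code:
# attempt the same kernel RBD mapping PVE does when krbd is enabled
rbd map vmpool01/vm-101-disk-0
# the underlying reason for "sysfs write failed" usually shows up here
dmesg | tail
# clean up if the mapping unexpectedly succeeded
rbd unmap vmpool01/vm-101-disk-0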
 
vzdump backup all functioned until July 20, 2020. Now all vzdump jobs fail with a similar error:
<snip>
rbd: sysfs write failed
ERROR: Backup of VM 101 failed - can't map rbd volume vm-101-disk-0: rbd: sysfs write failed

From that, I'd say you have an even bigger problem with your Ceph storage. What is the output of ceph -s? (Please post the output in [CODE] [/CODE] tags.)
 
Seems fine
 
Code:
root@putsproxp01:~# ceph -s
  cluster:
    id:     fbfde291-2831-4005-802f-5a01e95c9615
    health: HEALTH_OK

  services:
    mon: 6 daemons, quorum putsproxp06,putsproxp05,putsproxp04,proxp03,proxp01,proxp02 (age 24m)
    mgr: putsproxp03(active, since 3w), standbys: putsproxp02, putsproxp01
    mds: cephfs01:1 {0=putsproxp06=up:active} 2 up:standby
    osd: 24 osds: 24 up (since 23m), 24 in (since 7w)

  data:
    pools:   3 pools, 672 pgs
    objects: 340.56k objects, 1.3 TiB
    usage:   11 TiB used, 171 TiB / 182 TiB avail
    pgs:     672 active+clean

  io:
    client:   495 KiB/s rd, 105 KiB/s wr, 12 op/s rd, 11 op/s wr
 
Code:
root@putsproxp01:~# ceph osd pool ls detail
pool 1 'cephfs01_data' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 128 pgp_num 128 autoscale_mode warn last_change 8816 flags hashpspool stripe_width 0 application cephfs
pool 2 'cephfs01_metadata' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode warn last_change 8816 flags hashpspool stripe_width 0 application cephfs
pool 4 'vmpool01' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 512 pgp_num 512 autoscale_mode warn last_change 14281 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
        removed_snaps [1~3]

and the storage configuration (/etc/pve/storage.cfg):
Code:
dir: local
        path /var/lib/vz
        content vztmpl,backup,iso

lvmthin: local-lvm
        thinpool data
        vgname pve
        content images,rootdir

rbd: vmpool01
        content rootdir,images
        krbd 1
        pool vmpool01

nfs: isoRepo
        export /data/isoRepo
        path /mnt/pve/isoRepo
        server putslinv12.bocc.putnam-fl.com
        content images,vztmpl,iso
        options vers=3

nfs: nas5p1
        export /mnt/nas5pool1
        path /mnt/pve/nas5p1
        server 10.10.23.39
        content images,rootdir
        options vers=3

nfs: nas5p2
        export /mnt/nas5pool2
        path /mnt/pve/nas5p2
        server 10.10.23.39
        content images
        options vers=3

nfs: nas06p1
        export /mnt/nas6pool1
        path /mnt/pve/nas06p1
        server 10.10.23.42
        content images
        
nfs: nas06p2
        export /mnt/nas6pool2
        path /mnt/pve/nas06p2
        server putsnasp06
        content images,backup
        maxfiles 0
 
That all seems fine. So, is the content of /etc/vzdump.conf the same on all the nodes? Is there something unique about putsproxp04? Does the backup fail the same way on all nodes, or just this one?
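A quick way to compare it across the cluster would be something like this (just a sketch; node names taken from the ceph -s output above, adjust to your actual hostnames):
Code:
# checksum /etc/vzdump.conf on every node and compare
for n in putsproxp01 putsproxp02 putsproxp03 putsproxp04 putsproxp05 putsproxp06; do
    echo -n "$n: "; ssh "$n" md5sum /etc/vzdump.conf
done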
 
All the same

Code:
root@putsproxp06:~# cat /etc/vzdump.conf
# vzdump default settings

#tmpdir: DIR
#dumpdir: DIR
#storage: STORAGE_ID
#mode: snapshot|suspend|stop
#bwlimit: KBPS
#ionice: PRI
#lockwait: MINUTES
#stopwait: MINUTES
#size: MB
#stdexcludes: BOOLEAN
#mailto: ADDRESSLIST
#maxfiles: N
#script: FILENAME
#exclude-path: PATHLIST
#pigz: N

I did notice that migration is failing with the same sysfs write error.
 
Same on all nodes:
pveversion -v
Code:
root@putsproxp06:~# pveversion -v
proxmox-ve: 6.2-1 (running kernel: 5.4.44-2-pve)
pve-manager: 6.2-10 (running version: 6.2-10/a20769ed)
pve-kernel-5.4: 6.2-4
pve-kernel-helper: 6.2-4
pve-kernel-5.3: 6.1-6
pve-kernel-5.0: 6.0-11
pve-kernel-5.4.44-2-pve: 5.4.44-2
pve-kernel-5.4.44-1-pve: 5.4.44-1
pve-kernel-5.4.41-1-pve: 5.4.41-1
pve-kernel-5.4.34-1-pve: 5.4.34-2
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-5.3.18-2-pve: 5.3.18-2
pve-kernel-5.3.13-3-pve: 5.3.13-3
pve-kernel-5.3.13-1-pve: 5.3.13-1
pve-kernel-5.3.10-1-pve: 5.3.10-1
pve-kernel-5.0.21-5-pve: 5.0.21-10
pve-kernel-5.0.21-4-pve: 5.0.21-9
pve-kernel-5.0.21-3-pve: 5.0.21-7
pve-kernel-5.0.21-2-pve: 5.0.21-7
pve-kernel-5.0.21-1-pve: 5.0.21-2
pve-kernel-5.0.15-1-pve: 5.0.15-1
ceph: 14.2.9-pve1
ceph-fuse: 14.2.9-pve1
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: residual config
ifupdown2: 3.0.0-1+pve2
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.16-pve1
libproxmox-acme-perl: 1.0.4
libpve-access-control: 6.1-2
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.1-5
libpve-guest-common-perl: 3.1-1
libpve-http-server-perl: 3.0-6
libpve-storage-perl: 6.2-5
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.2-1
lxcfs: 4.0.3-pve3
novnc-pve: 1.1.0-1
openvswitch-switch: 2.12.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.2-9
pve-cluster: 6.1-8
pve-container: 3.1-11
pve-docs: 6.2-5
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-2
pve-firmware: 3.1-1
pve-ha-manager: 3.0-9
pve-i18n: 2.1-3
pve-qemu-kvm: 5.0.0-11
pve-xtermjs: 4.3.0-1
qemu-server: 6.2-10
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.4-pve1

and pvesm list vmpool01 (all nodes the same):
Code:
root@putsproxp01:~# pvesm list vmpool01
Volid                    Format  Type              Size VMID
vmpool01:base-104-disk-0 raw     images    134217728000 104
vmpool01:base-105-disk-0 raw     images    214748364800 105
vmpool01:vm-100-disk-0   raw     images     10737418240 100
vmpool01:vm-101-disk-0   raw     images    107374182400 101
vmpool01:vm-102-disk-0   raw     images    107374182400 102
vmpool01:vm-103-disk-0   raw     images    107374182400 103
vmpool01:vm-106-disk-0   raw     images    107374182400 106
vmpool01:vm-107-disk-0   raw     images     17179869184 107
vmpool01:vm-108-disk-0   raw     images    134217728000 108
vmpool01:vm-109-disk-0   raw     images    214748364800 109
vmpool01:vm-110-disk-0   raw     images    214748364800 110
vmpool01:vm-111-disk-0   raw     images    107374182400 111
vmpool01:vm-112-disk-0   raw     images    214748364800 112
vmpool01:vm-114-disk-0   raw     images     80530636800 114
vmpool01:vm-114-disk-1   raw     images    214748364800 114
vmpool01:vm-115-disk-0   raw     images     80530636800 115
vmpool01:vm-116-disk-0   raw     images    214748364800 116
vmpool01:vm-117-disk-0   raw     images     80530636800 117
vmpool01:vm-119-disk-0   raw     images     80530636800 119
vmpool01:vm-119-disk-1   raw     images         1048576 119
vmpool01:vm-120-disk-0   raw     images     80530636800 120
vmpool01:vm-122-disk-0   raw     images     80530636800 122
vmpool01:vm-123-disk-0   raw     images    107374182400 123
vmpool01:vm-125-disk-0   raw     images    161061273600 125
vmpool01:vm-125-disk-1   raw     images         1048576 125
vmpool01:vm-126-disk-0   raw     images    107374182400 126
vmpool01:vm-127-disk-0   raw     images    161061273600 127
 
Code:
pvesm status
Name             Type     Status           Total            Used       Available        %
isoRepo           nfs     active      2108488704       336919552      1664457728   15.98%
local             dir     active        25413876        14911972         9187908   58.68%
local-lvm     lvmthin     active        55177216        35539644        19637571   64.41%
nas06p1           nfs     active     45030415616      4065703168     40964712448    9.03%
nas06p2           nfs     active     45030501504     19849495936     25181005568   44.08%
nas5p1            nfs     active     45030018304       890875392     44139142912    1.98%
nas5p2            nfs     active     45029629312      6443948160     38585681152   14.31%
vmpool01          rbd     active     57332962178      1337070466     55995891712    2.33%
 
Check the syslog of your nodes. You should find some information there.
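For example, something along these lines on the node where the mapping fails should surface the relevant messages (a sketch, assuming the default log locations on PVE 6):
Code:
# kernel rbd/libceph messages are usually the interesting part
grep -iE 'rbd|libceph' /var/log/syslog | tail -n 50
# or via the journal, limited to the last hour
journalctl --since "1 hour ago" | grep -iE 'rbd|libceph'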
 
I started a migration from node 2 to node 1; the syslog output from node 2 and node 1 is attached.
 

Attachments

  • syslog_node1.txt (121.2 KB)
  • syslog_node2.txt (66.3 KB)
At this point I have this issue with RBD (can't map rbd volume for a disk image) when trying any of these operations:
Migrate a VM from any node (1-6) to any node (1-6)
Back up any VM on any node (1-6)
Start a new VM on any node (1-6)
 
I have searched all the logs and do not see an indicator of what could be at fault here. I just updated Ceph to 14.2.10; same error messages:
Code:
2020-08-05 07:44:40 starting migration of VM 107 to node 'putsproxp01' (192.168.2.95)
2020-08-05 07:44:41 starting VM 107 on remote node 'putsproxp01'
2020-08-05 07:45:44 [putsproxp01] can't map rbd volume vm-107-disk-0: rbd: sysfs write failed
2020-08-05 07:45:44 ERROR: online migrate failure - remote command failed with exit code 255
2020-08-05 07:45:44 aborting phase 2 - cleanup resources
2020-08-05 07:45:44 migrate_cancel
2020-08-05 07:45:44 ERROR: migration finished with problems (duration 00:01:04)
TASK ERROR: migration problems
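
Since the cluster was just updated, it may also be worth confirming that every daemon is actually on the same release (a quick check, assuming nothing beyond a healthy cluster):
Code:
# shows which Ceph version each mon/mgr/mds/osd is running
ceph versions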
 
Unchecking the KRBD flag in the RBD storage config seems to have fixed the "sysfs write failed" issue.
If someone can explain what the krbd flag does in the RBD config, that would be great.
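For what it is worth, my understanding (not an official statement) is: with krbd 1, Proxmox maps each image through the in-kernel rbd driver before handing it to QEMU, which is exactly the rbd map / sysfs write path that was failing here; with krbd 0, QEMU talks to the pool directly through userspace librbd, so no kernel mapping is needed for VM disks. In storage.cfg terms the only difference is the flag on the pool entry, e.g.:
Code:
# /etc/pve/storage.cfg - same entry as posted above, with the KRBD checkbox cleared
rbd: vmpool01
        content rootdir,images
        krbd 0
        pool vmpool01

Containers, if I remember correctly, always use the kernel client regardless of this flag, so the underlying mapping problem may still be worth tracking down.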

Thank you
 
