KVM crash during vzdump of virtio-scsi using CEPH

e100 · Jan 22, 2017

When this problem happens the KVM process dies.
Never had this problem untilI changed from virtio to virtio-scsi-single, also happened with virtio-scsi

vm.conf:

Code:

args: -D /var/log/pve/105.log
boot: cd
bootdisk: ide0
cores: 4
ide0: ceph_rbd:vm-105-disk-1,cache=writeback,size=512M
ide2: none,media=cdrom
memory: 6144
name: XXXXXXXXXXXXXXX
net0: virtio=XX:XX:XX:XX:XX:XX,bridge=vmbr219
numa: 0
onboot: 1
ostype: l26
scsi0: ceph_rbd:vm-105-disk-2,cache=writeback,discard=on,size=4T
scsihw: virtio-scsi-single
smbios1: uuid=c4d9009b-7ca8-4126-9c96-e85354fab637
sockets: 1

KVM Log:

Code:

main-loop: WARNING: I/O thread spun for 1000 iterations
osdc/ObjectCacher.cc: In function 'void ObjectCacher::Object::discard(loff_t, loff_t)' thread 7fd5a17fd700 time 2017-01-22 08:21:59.618962
osdc/ObjectCacher.cc: 533: FAILED assert(bh->waitfor_read.empty())
 ceph version 10.2.5 (c461ee19ecbc0c5c330aca20f7392c9a00730367)
 1: (()+0x2bc4e2) [0x7fd5cb5054e2]
 2: (()+0x514187) [0x7fd5cb75d187]
 3: (()+0x5144df) [0x7fd5cb75d4df]
 4: (()+0x7f607) [0x7fd5cb2c8607]
 5: (()+0x80888) [0x7fd5cb2c9888]
 6: (()+0x7e374) [0x7fd5cb2c7374]
 7: (()+0x83f70) [0x7fd5cb2ccf70]
 8: (()+0x2acbaf) [0x7fd5cb4f5baf]
 9: (()+0x2adad0) [0x7fd5cb4f6ad0]
 10: (()+0x80a4) [0x7fd5bd9770a4]
 11: (clone()+0x6d) [0x7fd5bd6ac62d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

vzdump log:

Code:

INFO: status: 73% (3210988486656/4398583382016), sparse 6% (300175785984), duration 132751, 16/15 MB/s
INFO: status: 74% (3254974545920/4398583382016), sparse 6% (304453844992), duration 135367, 16/15 MB/s
ERROR: VM 105 not running
INFO: aborting backup job
ERROR: VM 105 not running
ERROR: Backup of VM 105 failed - VM 105 not running

pveversion -v

Code:

proxmox-ve: 4.4-78 (running kernel: 4.4.35-2-pve)
pve-manager: 4.4-5 (running version: 4.4-5/c43015a5)
pve-kernel-4.4.13-1-pve: 4.4.13-56
pve-kernel-4.4.35-1-pve: 4.4.35-77
pve-kernel-4.2.6-1-pve: 4.2.6-36
pve-kernel-4.4.35-2-pve: 4.4.35-78
pve-kernel-4.2.8-1-pve: 4.2.8-41
pve-kernel-4.4.10-1-pve: 4.4.10-54
lvm2: 2.02.116-pve3
corosync-pve: 2.4.0-1
libqb0: 1.0-1
pve-cluster: 4.0-48
qemu-server: 4.0-102
pve-firmware: 1.1-10
libpve-common-perl: 4.0-85
libpve-access-control: 4.0-19
libpve-storage-perl: 4.0-71
pve-libspice-server1: 0.12.8-1
vncterm: 1.2-1
pve-docs: 4.4-1
pve-qemu-kvm: 2.7.0-10
pve-container: 1.0-90
pve-firewall: 2.0-33
pve-ha-manager: 1.0-38
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u2
lxc-pve: 2.0.6-5
lxcfs: 2.0.5-pve2
criu: 1.6.0-1
novnc-pve: 0.5-8
smartmontools: 6.5+svn4324-1~pve80
zfsutils: 0.6.5.8-pve13~bpo80
drbdmanage: 0.97.3-1

spirit · Jan 22, 2017

mmmm, maybe it's a bug with discard locking when pve-backup is running ?

do you use fstrim or do you have discard option enable in /etc/fstab in your guest vm ?

e100 · Jan 22, 2017

I use fstrim and it was not running at the time this occurred.

I changed to virtio-scsi so I could use fstrim to reclaim some space in CEPH. Ran fstrim once a couple of weeks ago.

spirit · Jan 23, 2017

e100 said:
I use fstrim and it was not running at the time this occurred.

I changed to virtio-scsi so I could use fstrim to reclaim some space in CEPH. Ran fstrim once a couple of weeks ago.

ok.

I'll try to do tests on my side to see if it's a qemu or ceph bug.

which version of ceph do you use ?

e100 · Jan 23, 2017

I'm running jewel 10.2.5-1~bpo80+1

e100 · Jan 30, 2017

After changing from virtio-scsi back to virtio vzdump no longer causes crash.
So whatever the problem is it seems limited to virtio-scsi

spirit · Jan 31, 2017

do you have the problem with virtio-scsi without discard ?

(discard is not effective on virtio)

e100 · Jan 31, 2017

I can switch back to virtio-scsi without discard and let you know what happens.

e100 · Feb 6, 2017

@spirit

When I setup the VM to use virtio-scsi without discard the backup completed successfully

Code:

scsi0: ceph_rbd:vm-105-disk-2,cache=writeback,size=4T
scsihw: virtio-scsi-single

lankaster · Dec 10, 2017

I can confirm this bug. (the LXC are not affected)

When I setup the VM (kvm) with virtio-scsi, virtio0 hdd on ceph storage the backup freeze at ca. 50% :

Code:

INFO: status: 52% (36293574656/69793218560), sparse 47% (33105027072), duration 1661, read/write 18/0 MB/s
INFO: status: 53% (36994154496/69793218560), sparse 48% (33770270720), duration 1698, read/write 18/0 MB/s
INFO: status: 54% (37692506112/69793218560), sparse 49% (34468622336), duration 3629, read/write 22/0 MB/s

Code:

virtio0: ceph-kvm:vm-100-disk-1,cache=writeback,size=65G

Code:

[1401752.950634] ------------[ cut here ]------------
[1401752.950639] WARNING: CPU: 6 PID: 0 at net/sched/sch_generic.c:316 dev_watchdog+0x21f/0x230
[1401752.950643] Modules linked in: 8021q garp mrp tcp_diag inet_diag cfg80211 veth rbd libceph nfsv3 nfs_acl nfs lockd grace fscache ip_set ip6table_filter ip6_tables xfs iptable_filter softdog openvswitch nf_conntrack_ipv6 nf_nat_ipv6 nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_defrag_ipv6 nf_nat nf_conntrack libcrc32c nfnetlink_log nfnetlink snd_hda_codec_hdmi zfs(PO) intel_rapl x86_pkg_temp_thermal zunicode(PO) intel_powerclamp zavl(PO) coretemp icp(PO) kvm_intel kvm cmdlinepart irqbypass crct10dif_pclmul crc32_pclmul intel_spi_platform ghash_clmulni_intel intel_spi spi_nor mtd ppdev pcbc i915 zcommon(PO) znvpair(PO) spl(O) aesni_intel snd_hda_codec_realtek aes_x86_64 crypto_simd drm_kms_helper snd_hda_codec_generic glue_helper cryptd drm intel_cstate i2c_algo_bit intel_rapl_perf pcspkr input_leds
[1401752.950689]  snd_hda_intel fb_sys_fops syscopyarea serio_raw snd_hda_codec sysfillrect snd_hda_core sysimgblt snd_hwdep intel_pch_thermal snd_pcm lpc_ich snd_timer mei_me snd mei soundcore shpchp ie31200_edac parport_pc parport fujitsu_laptop sparse_keymap video tpm_infineon mac_hid vhost_net vhost tap ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp sunrpc libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs xor raid6_pq psmouse ahci i2c_i801 libahci e1000e(O) ptp pps_core
[1401752.950720] CPU: 6 PID: 0 Comm: swapper/6 Tainted: P           O    4.13.4-1-pve #1
[1401752.950722] Hardware name: FUJITSU CELSIUS W530/D3227-A1, BIOS V4.6.5.4 R1.23.0 for D3227-A1x 05/14/2014
[1401752.950725] task: ffff98117b281740 task.stack: ffffa685031c4000
[1401752.950727] RIP: 0010:dev_watchdog+0x21f/0x230
[1401752.950731] RSP: 0018:ffff98119e383e58 EFLAGS: 00010282
[1401752.950733] RAX: 000000000000003c RBX: 0000000000000000 RCX: 000000000000083f
[1401752.950734] RDX: 0000000000000000 RSI: 00000000000000f6 RDI: 000000000000083f
[1401752.950735] RBP: ffff98119e383e88 R08: 0000000000000001 R09: 00000000000011df
[1401752.950736] R10: ffff98119e391f70 R11: 00000000000011df R12: 0000000000000001
[1401752.950737] R13: 0000000000000006 R14: ffff981172a98000 R15: ffff981172165a80
[1401752.950739] FS:  0000000000000000(0000) GS:ffff98119e380000(0000) knlGS:0000000000000000
[1401752.950740] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[1401752.950741] CR2: 00007fb9c9150790 CR3: 00000003d7409000 CR4: 00000000001426e0
[1401752.950743] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[1401752.950744] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[1401752.950745] Call Trace:
[1401752.950747]  <IRQ>
[1401752.950750]  ? qdisc_rcu_free+0x50/0x50
[1401752.950754]  call_timer_fn+0x35/0x130
[1401752.950760]  run_timer_softirq+0x1e1/0x450
[1401752.950767]  ? ktime_get+0x40/0xa0
[1401752.950772]  ? native_apic_msr_write+0x2b/0x30
[1401752.950774]  ? lapic_next_event+0x1d/0x30
[1401752.950777]  ? clockevents_program_event+0x7a/0xf0
[1401752.950780]  __do_softirq+0x104/0x28f
[1401752.950784]  irq_exit+0xb6/0xc0
[1401752.950788]  smp_apic_timer_interrupt+0x3d/0x50
[1401752.950793]  apic_timer_interrupt+0x89/0x90
[1401752.950798] RIP: 0010:cpuidle_enter_state+0x126/0x2c0
[1401752.950801] RSP: 0018:ffffa685031c7e60 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10
[1401752.950803] RAX: 0000000000000000 RBX: 0000000000000005 RCX: 000000000000001f
[1401752.950806] RDX: 0004fae30aaebcec RSI: fffffffa8aef8f5a RDI: 0000000000000000
[1401752.950809] RBP: ffffa685031c7e98 R08: 0000000000003c30 R09: 0000000000000018
[1401752.950812] R10: ffffa685031c7e30 R11: 000000000000394e R12: ffff98119e3a4e18
[1401752.950814] R13: ffffffffa5b78778 R14: 0004fae30aaebcec R15: ffffffffa5b78760
[1401752.950817]  </IRQ>
[1401752.950820]  cpuidle_enter+0x17/0x20
[1401752.950828]  call_cpuidle+0x23/0x40
[1401752.950833]  do_idle+0x199/0x200
[1401752.950838]  cpu_startup_entry+0x73/0x80
[1401752.950846]  start_secondary+0x156/0x190
[1401752.950850]  secondary_startup_64+0x9f/0x9f
[1401752.950855] Code: 63 8e 60 04 00 00 eb 95 4c 89 f7 c6 05 64 bf 80 00 01 e8 65 5a fd ff 89 d9 48 89 c2 4c 89 f6 48 c7 c7 78 ac 95 a5 e8 62 87 8d ff <0f> ff eb c3 0f 1f 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 
[1401752.950876] ---[ end trace 2cb5fa297b348f65 ]---

Code:

root@node1:~# ps aux | grep kvm
root     12983  0.4  0.0      0     0 ?        Zl   Nov30  68:35 [kvm] <defunct>
root     13023  0.0  0.0      0     0 ?        S    Nov30   0:07 [kvm-pit/12983]
root     24324  0.0  0.0  12788  1000 pts/2    S+   17:45   0:00 grep kvm

Code:

root@node1:~# pveversion -v
proxmox-ve: 5.1-26 (running kernel: 4.13.4-1-pve)
pve-manager: 5.1-36 (running version: 5.1-36/131401db)
pve-kernel-4.13.4-1-pve: 4.13.4-26
libpve-http-server-perl: 2.0-6
lvm2: 2.02.168-pve6
corosync: 2.4.2-pve3
libqb0: 1.0.1-1
pve-cluster: 5.0-15
qemu-server: 5.0-17
pve-firmware: 2.0-3
libpve-common-perl: 5.0-20
libpve-guest-common-perl: 2.0-13
libpve-access-control: 5.0-7
libpve-storage-perl: 5.0-16
pve-libspice-server1: 0.12.8-3
vncterm: 1.5-2
pve-docs: 5.1-12
pve-qemu-kvm: 2.9.1-2
pve-container: 2.0-17
pve-firewall: 3.0-3
pve-ha-manager: 2.0-3
ksm-control-daemon: 1.2-2
glusterfs-client: 3.8.8-1
lxc-pve: 2.1.0-2
lxcfs: 2.0.7-pve4
criu: 2.11.1-1~bpo90
novnc-pve: 0.6-4
root@node1:~# ceph version
ceph version 12.2.1 (1a629971a9bcaaae99e5539a3a43f800a297f267) luminous (stable)

After that a VM cannot start/stop/restart and because of "Z" kvm process i cannot restart node

Now I have two solutions:
1. Unlock VM and restart Ceph OSD

Code:

root@node1:~# qm unlock 100
root@node1:~# systemctl stop ceph-osd@8
root@node1:~# systemctl stop ceph-osd@7
root@node1:~# ps aux | grep kvm
root     24835  0.0  0.0  12788   940 pts/2    S+   17:47   0:00 grep kvm

2. Use virtio-scsi for KVM:

Code:

scsi0: ceph-kvm:vm-100-disk-1,cache=writeback,size=65G
scsihw: virtio-scsi-single

manu · Dec 13, 2017

@lankaster

Tried your setup here and backup works fine. This is my relevant vm.conf
scsi0: pvepool_vm:vm-610-disk-2,cache=writeback,discard=on,size=9G

Also using virtio-scsi, ceph luminous and PVE 5.1

lankaster · Dec 20, 2017

@manu Yesterday we have same backup problem again. 1 of 40 VMs crashed(frozen) because of backup.
We test it tomorrow again.

lankaster · Jan 15, 2018

Last kernel. The problem still exists, backup freeze again...

Code:

...
NFO: status: 98% (33701560320/34359738368), sparse 57% (19869216768), duration 1689, read/write 41/0 MB/s
INFO: status: 99% (34031337472/34359738368), sparse 58% (20198993920), duration 1698, read/write 36/0 MB/s
INFO: status: 100% (34359738368/34359738368), sparse 59% (20527394816), duration 1707, read/write 36/0 MB/s
INFO: transferred 34359 MB in 1707 seconds (20 MB/s)
INFO: archive file size: 5.41GB
INFO: Finished Backup of VM 137 (00:28:28)
INFO: Starting Backup of VM 138 (qemu)
INFO: status = running
INFO: update VM 138: -lock backup
INFO: VM Name: gitlab-docker
INFO: include disk 'scsi0' 'ceph-kvm:vm-138-disk-1' 128G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating archive '/mnt/pve/server2k/dump/vzdump-qemu-138-2018_01_14-16_17_28.vma.gz'
INFO: started backup task '2f6511a7-4814-49af-82f3-2ec04511605e'
INFO: status: 1% (1876033536/137438953472), sparse 1% (1875099648), duration 3, read/write 625/0 MB/s
INFO: status: 2% (3753246720/137438953472), sparse 2% (3752312832), duration 6, read/write 625/0 MB/s
INFO: status: 4% (5858525184/137438953472), sparse 4% (5857169408), duration 9, read/write 701/0 MB/s
INFO: status: 5% (7758413824/137438953472), sparse 5% (7757058048), duration 12, read/write 633/0 MB/s
INFO: status: 6% (8248360960/137438953472), sparse 5% (8132599808), duration 23, read/write 44/10 MB/s
INFO: status: 7% (9635758080/137438953472), sparse 6% (8258064384), duration 118, read/write 14/13 MB/s
INFO: status: 8% (11000348672/137438953472), sparse 6% (8420261888), duration 211, read/write 14/12 MB/s
INFO: status: 9% (12437159936/137438953472), sparse 6% (8753455104), duration 301, read/write 15/12 MB/s
INFO: status: 10% (13883736064/137438953472), sparse 7% (9978183680), duration 321, read/write 72/11 MB/s
INFO: status: 11% (15755968512/137438953472), sparse 8% (11718438912), duration 333, read/write 156/10 MB/s
INFO: status: 12% (16741761024/137438953472), sparse 9% (12681818112), duration 336, read/write 328/7 MB/s
INFO: status: 13% (18083282944/137438953472), sparse 10% (13972639744), duration 342, read/write 223/8 MB/s
INFO: status: 14% (19291111424/137438953472), sparse 10% (15105748992), duration 351, read/write 134/8 MB/s
INFO: status: 15% (20901462016/137438953472), sparse 12% (16714301440), duration 354, read/write 536/0 MB/s
INFO: status: 16% (22446145536/137438953472), sparse 13% (17924378624), duration 383, read/write 53/11 MB/s
INFO: status: 17% (23365287936/137438953472), sparse 13% (18639249408), duration 398, read/write 61/13 MB/s
INFO: status: 18% (25192890368/137438953472), sparse 14% (19902382080), duration 449, read/write 35/11 MB/s
ERROR: interrupted by signal
INFO: aborting backup job
ERROR: Backup of VM 138 failed - interrupted by signal
ERROR: Backup job failed - interrupted by signal
TASK ERROR: interrupted by signal

Code:

pveversion -v
proxmox-ve: 5.1-35 (running kernel: 4.13.13-4-pve)
pve-manager: 5.1-42 (running version: 5.1-42/724a6cb3)
pve-kernel-4.13.4-1-pve: 4.13.4-26
pve-kernel-4.13.13-4-pve: 4.13.13-35
pve-kernel-4.13.13-2-pve: 4.13.13-33
pve-kernel-4.13.13-3-pve: 4.13.13-34
libpve-http-server-perl: 2.0-8
lvm2: 2.02.168-pve6
corosync: 2.4.2-pve3
libqb0: 1.0.1-1
pve-cluster: 5.0-19
qemu-server: 5.0-19
pve-firmware: 2.0-3
libpve-common-perl: 5.0-25
libpve-guest-common-perl: 2.0-14
libpve-access-control: 5.0-7
libpve-storage-perl: 5.0-17
pve-libspice-server1: 0.12.8-3
vncterm: 1.5-3
pve-docs: 5.1-16
pve-qemu-kvm: 2.9.1-5
pve-container: 2.0-18
pve-firewall: 3.0-5
pve-ha-manager: 2.0-4
ksm-control-daemon: 1.2-2
glusterfs-client: 3.8.8-1
lxc-pve: 2.1.1-2
lxcfs: 2.0.8-1
criu: 2.11.1-1~bpo90
novnc-pve: 0.6-4
smartmontools: 6.5+svn4324-1
zfsutils-linux: 0.7.3-pve1~bpo9
openvswitch-switch: 2.7.0-2
ceph: 12.2.2-pve1

Code:

[206976.586550] INFO: task kvm:4805 blocked for more than 120 seconds.
[206976.586571]       Tainted: G           O    4.13.13-4-pve #1
[206976.586584] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[206976.586601] kvm             D    0  4805      1 0x00000006
[206976.586603] Call Trace:
[206976.586609]  __schedule+0x3cc/0x850
[206976.586611]  schedule+0x36/0x80
[206976.586613]  io_schedule+0x16/0x40
[206976.586615]  __lock_page+0xff/0x140
[206976.586617]  ? page_cache_tree_insert+0xc0/0xc0
[206976.586618]  truncate_inode_pages_range+0x495/0x830
[206976.586620]  truncate_inode_pages+0x15/0x20
[206976.586623]  kill_bdev+0x2f/0x40
[206976.586624]  __blkdev_put+0x82/0x210
[206976.586625]  blkdev_put+0x4c/0xd0
[206976.586625]  blkdev_close+0x34/0x70
[206976.586627]  __fput+0xe7/0x220
[206976.586628]  ____fput+0xe/0x10
[206976.586630]  task_work_run+0x80/0xa0
[206976.586632]  do_exit+0x2d1/0xad0
[206976.586633]  do_group_exit+0x43/0xb0
[206976.586635]  get_signal+0x28a/0x5d0
[206976.586637]  do_signal+0x37/0x730
[206976.586639]  ? __fpu__restore_sig+0x95/0x510
[206976.586640]  ? recalc_sigpending+0x1b/0x50
[206976.586642]  exit_to_usermode_loop+0x80/0xd0
[206976.586643]  syscall_return_slowpath+0x59/0x60
[206976.586645]  entry_SYSCALL_64_fastpath+0x7f/0x81
[206976.586646] RIP: 0033:0x7f30a56a2f5c
[206976.586646] RSP: 002b:00007f308dffc2d8 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
[206976.586648] RAX: fffffffffffffe00 RBX: 00007f3091358000 RCX: 00007f30a56a2f5c
[206976.586648] RDX: 0000000000000002 RSI: 0000000000000080 RDI: 0000561211eaad80
[206976.586649] RBP: 0000561211eaad80 R08: 0000561211eaad80 R09: 000000000000000b
[206976.586649] R10: 0000000000000000 R11: 0000000000000246 R12: 00007f30913580a3
[206976.586649] R13: 00007ffd76da048f R14: 0000000000000000 R15: 00007f30ab135040
[207028.619696] vmbr0: port 9(tap109i0) entered disabled state
[207028.903789] vmbr0v6: port 2(tap109i1) entered disabled state
[207076.532144] vmbr0: port 5(veth118i0) entered disabled state
[207076.532453] device veth118i0 left promiscuous mode
[207076.532457] vmbr0: port 5(veth118i0) entered disabled state
[207094.916495] vmbr0: port 7(veth147i0) entered disabled state
[207095.212628] vmbr0: port 7(veth147i0) entered disabled state
[207095.212903] device veth147i0 left promiscuous mode
[207095.212905] vmbr0: port 7(veth147i0) entered disabled state
[207097.412573] INFO: task kvm:4805 blocked for more than 120 seconds.
[207097.412605]       Tainted: G           O    4.13.13-4-pve #1
[207097.412618] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[207097.412636] kvm             D    0  4805      1 0x00000006
[207097.412637] Call Trace:
[207097.412652]  __schedule+0x3cc/0x850
[207097.412654]  schedule+0x36/0x80
[207097.412656]  io_schedule+0x16/0x40
[207097.412658]  __lock_page+0xff/0x140
[207097.412660]  ? page_cache_tree_insert+0xc0/0xc0
[207097.412662]  truncate_inode_pages_range+0x495/0x830
[207097.412664]  truncate_inode_pages+0x15/0x20
[207097.412667]  kill_bdev+0x2f/0x40
[207097.412668]  __blkdev_put+0x82/0x210
[207097.412669]  blkdev_put+0x4c/0xd0
[207097.412670]  blkdev_close+0x34/0x70
[207097.412671]  __fput+0xe7/0x220
[207097.412672]  ____fput+0xe/0x10
[207097.412675]  task_work_run+0x80/0xa0
[207097.412677]  do_exit+0x2d1/0xad0
[207097.412678]  do_group_exit+0x43/0xb0
[207097.412680]  get_signal+0x28a/0x5d0
[207097.412682]  do_signal+0x37/0x730
[207097.412684]  ? __fpu__restore_sig+0x95/0x510
[207097.412685]  ? recalc_sigpending+0x1b/0x50
[207097.412687]  exit_to_usermode_loop+0x80/0xd0
[207097.412689]  syscall_return_slowpath+0x59/0x60
[207097.412690]  entry_SYSCALL_64_fastpath+0x7f/0x81
[207097.412691] RIP: 0033:0x7f30a56a2f5c
[207097.412692] RSP: 002b:00007f308dffc2d8 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
[207097.412693] RAX: fffffffffffffe00 RBX: 00007f3091358000 RCX: 00007f30a56a2f5c
[207097.412693] RDX: 0000000000000002 RSI: 0000000000000080 RDI: 0000561211eaad80
[207097.412694] RBP: 0000561211eaad80 R08: 0000561211eaad80 R09: 000000000000000b
[207097.412694] R10: 0000000000000000 R11: 0000000000000246 R12: 00007f30913580a3
[207097.412695] R13: 00007ffd76da048f R14: 0000000000000000 R15: 00007f30ab135040
[207115.028681] vmbr0: port 8(veth143i0) entered disabled state
[207115.193315] vmbr0: port 8(veth143i0) entered disabled state
[207115.193445] device veth143i0 left promiscuous mode
[207115.193448] vmbr0: port 8(veth143i0) entered disabled state
[207200.202228] systemd[1]: apt-daily.timer: Adding 4h 26min 28.251402s random time.
[207200.284942] systemd[1]: apt-daily.timer: Adding 1h 1.555728s random time.
[207200.693659] systemd[1]: apt-daily.timer: Adding 4h 12min 11.159111s random time.
[207200.766134] systemd[1]: apt-daily.timer: Adding 4h 28min 46.736907s random time.
[207200.877959] systemd[1]: apt-daily.timer: Adding 7h 55min 43.474943s random time.
[207200.988849] systemd[1]: apt-daily.timer: Adding 11h 54min 15.045552s random time.
[207201.514012] systemd[1]: apt-daily.timer: Adding 9h 26min 52.514598s random time.
[207201.585506] systemd[1]: apt-daily.timer: Adding 2min 7.945282s random time.
[207201.681819] systemd[1]: apt-daily.timer: Adding 6h 50min 16.829093s random time.
[207201.753665] systemd[1]: apt-daily.timer: Adding 1h 10min 45.549762s random time.
[207218.238690] INFO: task kvm:4805 blocked for more than 120 seconds.
[207218.238709]       Tainted: G           O    4.13.13-4-pve #1
[207218.238722] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[207218.238739] kvm             D    0  4805      1 0x00000006
[207218.238741] Call Trace:
[207218.238747]  __schedule+0x3cc/0x850
[207218.238749]  schedule+0x36/0x80
[207218.238751]  io_schedule+0x16/0x40
[207218.238753]  __lock_page+0xff/0x140
[207218.238755]  ? page_cache_tree_insert+0xc0/0xc0
[207218.238756]  truncate_inode_pages_range+0x495/0x830
[207218.238758]  truncate_inode_pages+0x15/0x20
[207218.238761]  kill_bdev+0x2f/0x40
[207218.238762]  __blkdev_put+0x82/0x210
[207218.238763]  blkdev_put+0x4c/0xd0
[207218.238763]  blkdev_close+0x34/0x70
[207218.238765]  __fput+0xe7/0x220
[207218.238766]  ____fput+0xe/0x10
[207218.238768]  task_work_run+0x80/0xa0
[207218.238770]  do_exit+0x2d1/0xad0
[207218.238771]  do_group_exit+0x43/0xb0
[207218.238773]  get_signal+0x28a/0x5d0
[207218.238775]  do_signal+0x37/0x730
[207218.238777]  ? __fpu__restore_sig+0x95/0x510
[207218.238778]  ? recalc_sigpending+0x1b/0x50
[207218.238780]  exit_to_usermode_loop+0x80/0xd0
[207218.238781]  syscall_return_slowpath+0x59/0x60
[207218.238783]  entry_SYSCALL_64_fastpath+0x7f/0x81
[207218.238784] RIP: 0033:0x7f30a56a2f5c
[207218.238784] RSP: 002b:00007f308dffc2d8 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
[207218.238786] RAX: fffffffffffffe00 RBX: 00007f3091358000 RCX: 00007f30a56a2f5c
[207218.238786] RDX: 0000000000000002 RSI: 0000000000000080 RDI: 0000561211eaad80
[207218.238787] RBP: 0000561211eaad80 R08: 0000561211eaad80 R09: 000000000000000b
[207218.238787] R10: 0000000000000000 R11: 0000000000000246 R12: 00007f30913580a3
[207218.238787] R13: 00007ffd76da048f R14: 0000000000000000 R15: 00007f30ab135040
[207218.238801] INFO: task sync:29910 blocked for more than 120 seconds.
[207218.238815]       Tainted: G           O    4.13.13-4-pve #1
[207218.238828] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[207218.238845] sync            D    0 29910  29906 0x00000100
[207218.238846] Call Trace:
[207218.238848]  __schedule+0x3cc/0x850
[207218.238849]  schedule+0x36/0x80
[207218.238851]  schedule_preempt_disabled+0xe/0x10
[207218.238852]  __mutex_lock.isra.2+0x2b1/0x4e0
[207218.238853]  ? fdatawait_one_bdev+0x20/0x20
[207218.238854]  __mutex_lock_slowpath+0x13/0x20
[207218.238855]  ? __mutex_lock_slowpath+0x13/0x20
[207218.238855]  mutex_lock+0x2f/0x40
[207218.238856]  iterate_bdevs+0xf1/0x160
[207218.238857]  sys_sync+0x72/0xb0
[207218.238858]  do_syscall_64+0x5b/0xc0
[207218.238860]  entry_SYSCALL64_slow_path+0x25/0x25
[207218.238860] RIP: 0033:0x7f6a52f15a77
[207218.238861] RSP: 002b:00007ffd821ee0d8 EFLAGS: 00000206 ORIG_RAX: 00000000000000a2
[207218.238861] RAX: ffffffffffffffda RBX: 00007ffd821ee1d8 RCX: 00007f6a52f15a77
[207218.238862] RDX: 0000000000404237 RSI: 0000000000404a61 RDI: 00007f6a52f98bc8
[207218.238862] RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000000
[207218.238863] R10: 00007ffd821edea0 R11: 0000000000000206 R12: 0000000000401569
[207218.238863] R13: 00007ffd821ee1d0 R14: 0000000000000000 R15: 0000000000000000
[207339.064811] INFO: task kvm:4805 blocked for more than 120 seconds.
[207339.064829]       Tainted: G           O    4.13.13-4-pve #1
[207339.064842] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[207339.064860] kvm             D    0  4805      1 0x00000006
[207339.064862] Call Trace:
[207339.064868]  __schedule+0x3cc/0x850
[207339.064870]  schedule+0x36/0x80
[207339.064872]  io_schedule+0x16/0x40
[207339.064874]  __lock_page+0xff/0x140
[207339.064876]  ? page_cache_tree_insert+0xc0/0xc0
[207339.064878]  truncate_inode_pages_range+0x495/0x830
[207339.064879]  truncate_inode_pages+0x15/0x20
[207339.064883]  kill_bdev+0x2f/0x40
[207339.064883]  __blkdev_put+0x82/0x210
[207339.064884]  blkdev_put+0x4c/0xd0
[207339.064885]  blkdev_close+0x34/0x70
[207339.064886]  __fput+0xe7/0x220
[207339.064887]  ____fput+0xe/0x10
[207339.064890]  task_work_run+0x80/0xa0
[207339.064892]  do_exit+0x2d1/0xad0
[207339.064893]  do_group_exit+0x43/0xb0
[207339.064894]  get_signal+0x28a/0x5d0
[207339.064896]  do_signal+0x37/0x730
[207339.064898]  ? __fpu__restore_sig+0x95/0x510
[207339.064900]  ? recalc_sigpending+0x1b/0x50
[207339.064902]  exit_to_usermode_loop+0x80/0xd0
[207339.064903]  syscall_return_slowpath+0x59/0x60
[207339.064904]  entry_SYSCALL_64_fastpath+0x7f/0x81
[207339.064906] RIP: 0033:0x7f30a56a2f5c
[207339.064906] RSP: 002b:00007f308dffc2d8 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
[207339.064907] RAX: fffffffffffffe00 RBX: 00007f3091358000 RCX: 00007f30a56a2f5c
[207339.064908] RDX: 0000000000000002 RSI: 0000000000000080 RDI: 0000561211eaad80
[207339.064908] RBP: 0000561211eaad80 R08: 0000561211eaad80 R09: 000000000000000b
[207339.064909] R10: 0000000000000000 R11: 0000000000000246 R12: 00007f30913580a3
[207339.064909] R13: 00007ffd76da048f R14: 0000000000000000 R15: 00007f30ab135040
[207339.064923] INFO: task sync:29910 blocked for more than 120 seconds.
[207339.064938]       Tainted: G           O    4.13.13-4-pve #1
[207339.064951] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[207339.064968] sync            D    0 29910  29906 0x00000100
[207339.064969] Call Trace:
[207339.064971]  __schedule+0x3cc/0x850
[207339.064973]  schedule+0x36/0x80
[207339.064974]  schedule_preempt_disabled+0xe/0x10
[207339.064975]  __mutex_lock.isra.2+0x2b1/0x4e0
[207339.064977]  ? fdatawait_one_bdev+0x20/0x20
[207339.064978]  __mutex_lock_slowpath+0x13/0x20
[207339.064978]  ? __mutex_lock_slowpath+0x13/0x20
[207339.064979]  mutex_lock+0x2f/0x40
[207339.064980]  iterate_bdevs+0xf1/0x160
[207339.064981]  sys_sync+0x72/0xb0
[207339.064983]  do_syscall_64+0x5b/0xc0
[207339.064984]  entry_SYSCALL64_slow_path+0x25/0x25
[207339.064985] RIP: 0033:0x7f6a52f15a77
[207339.064985] RSP: 002b:00007ffd821ee0d8 EFLAGS: 00000206 ORIG_RAX: 00000000000000a2
[207339.064986] RAX: ffffffffffffffda RBX: 00007ffd821ee1d8 RCX: 00007f6a52f15a77
[207339.064986] RDX: 0000000000404237 RSI: 0000000000404a61 RDI: 00007f6a52f98bc8
[207339.064987] RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000000
[207339.064987] R10: 00007ffd821edea0 R11: 0000000000000206 R12: 0000000000401569
[207339.064988] R13: 00007ffd821ee1d0 R14: 0000000000000000 R15: 0000000000000000

Code:

cat /etc/pve/storage.cfg
...
rbd: ceph-kvm
   content images
   krbd 1
   pool pool1

nfs: server2k
   export /mnt/zpool/vmstorage
   path /mnt/pve/server2k
   server 192.168.1.10
   content backup,vztmpl,iso
   maxfiles 5
   options vers=3

Code:

cat /etc/pve/nodes/*/qemu-server/138*
bootdisk: scsi0
cores: 4
ide2: none,media=cdrom
memory: 2048
name: gitlab-docker
net0: virtio=FE:35:04:FC:12:01,bridge=vmbr0
numa: 0
ostype: l26
scsi0: ceph-kvm:vm-138-disk-1,cache=writeback,size=128G
scsihw: virtio-scsi-pci
smbios1: uuid=8bcf1eec-3d7c-4dba-83fa-19f654bd0767
sockets: 1

lankaster · Jan 15, 2018

and again same solution:

Code:

qm unlock 138
systemctl stop ceph-osd@X
systemctl stop ceph-osd@Y
qm destroy 138
systemctl start ceph-osd@X
systemctl start ceph-osd@Y
qm start 138

Search

Search

KVM crash during vzdump of virtio-scsi using CEPH

e100

Renowned Member

spirit

Distinguished Member

e100

Renowned Member

spirit

Distinguished Member

e100

Renowned Member

e100

Renowned Member

spirit

Distinguished Member

e100

Renowned Member

e100

Renowned Member

lankaster

New Member

manu

Proxmox Staff Member

lankaster

New Member

lankaster

New Member

lankaster

New Member