[SOLVED] VM hangs during backup

felix_84

Member
Oct 22, 2017
Hi all! Recently we upgraded our cluster to v6.1. Everything went smoothly, but during backup we noticed that one of the VMs was down. Zabbix reported that the VM had an HDD I/O overload. We restarted the VM, unlocked it and tried to start the backup manually; when the backup log reached 1%, it hung again.

The Proxmox logs show no errors, and the other VMs look fine. The backup goes to an NFS server. I have to say that on PVE v5.4 we often got 'hdd i/o overloaded' on this VM during backups and general slowdowns were evident, but no crashes...
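
For reference, the manual backup is started roughly like this (the storage ID below is just a placeholder for our NFS-backed backup storage):
Code:
# manual backup of VM 100 to the NFS-backed backup storage (storage ID is a placeholder)
vzdump 100 --storage nfs-backup --mode snapshot --compress gzip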

This VM is running Ubuntu 14.04.
Free disk space is about 50 GB; free memory is about 5 GB.
The QEMU guest agent is running.

qm config 100
Code:
agent: 1
boot: cdn
bootdisk: scsi0
cores: 4
ide0: none,media=cdrom
memory: 10240
name: vm100
net0: virtio=9E:EF:6F:88:B0:80,bridge=vmbr1
numa: 0
onboot: 1
ostype: l26
scsi0: pool_vm:vm-100-disk-1,discard=on,size=285G
scsihw: virtio-scsi-pci
smbios1: uuid=5c64f933-826f-43bd-9f87-b23f77169257
sockets: 2
startup: order=2

pveversion -v
Code:
proxmox-ve: 6.1-2 (running kernel: 5.3.13-1-pve)
pve-manager: 6.1-5 (running version: 6.1-5/9bf06119)
pve-kernel-5.3: 6.1-1
pve-kernel-helper: 6.1-1
pve-kernel-4.15: 5.4-12
pve-kernel-5.3.13-1-pve: 5.3.13-1
pve-kernel-4.15.18-24-pve: 4.15.18-52
pve-kernel-4.15.18-11-pve: 4.15.18-34
pve-kernel-4.4.134-1-pve: 4.4.134-112
pve-kernel-4.4.83-1-pve: 4.4.83-96
pve-kernel-4.4.35-1-pve: 4.4.35-77
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.2-pve4
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.13-pve1
libpve-access-control: 6.0-5
libpve-apiclient-perl: 3.0-2
libpve-common-perl: 6.0-9
libpve-guest-common-perl: 3.0-3
libpve-http-server-perl: 3.0-3
libpve-storage-perl: 6.1-3
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve3
lxc-pve: 3.2.1-1
lxcfs: 3.0.3-pve60
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.1-1
pve-cluster: 6.1-2
pve-container: 3.0-15
pve-docs: 6.1-3
pve-edk2-firmware: 2.20191127-1
pve-firewall: 4.0-9
pve-firmware: 3.0-4
pve-ha-manager: 3.0-8
pve-i18n: 2.0-3
pve-qemu-kvm: 4.1.1-2
pve-xtermjs: 3.13.2-1
qemu-server: 6.1-4
smartmontools: 7.0-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.2-pve2

 
UPD: I tried disabling the guest agent and removed the package from Ubuntu, but the result was the same.
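
For completeness, this is roughly what I did (the VM ID is 100, as in the config above):
Code:
# disable the guest agent option on the VM
qm set 100 --agent 0
# inside the Ubuntu guest, remove the agent package
apt-get remove qemu-guest-agent
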
This is very frustrating...
 
hi,

if you can boot the VM properly, you can try to run fsck inside to detect filesystem errors. maybe it will help
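
for example, something along these lines (just a sketch, /dev/sda1 stands for your root device, adjust it to your layout):
Code:
# force a full filesystem check on the next boot (Ubuntu 14.04 / upstart)
sudo touch /forcefsck
sudo reboot
# or, with the filesystem unmounted (e.g. from a rescue/live system):
fsck -f /dev/sda1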
 
Thanks for the reply. Yes, the VM boots and runs fine, and at startup it also runs fsck; I see no problems.
 
Today a second VM crashed during backup, so this is not a guest problem.
what did it say during the crash? was it another kernel panic?
 
It does not look like a kernel panic, but the VM becomes unresponsive:

Code:
Jan  9 20:31:31 servername kernel: [463522.292945] NMI watchdog: BUG: soft lockup - CPU#6 stuck for 22s! [java:1239]
Jan  9 20:31:31 servername kernel: [463522.293611] Modules linked in: binfmt_misc shpchp joydev input_leds serio_raw i2c_piix4 mac_hid ib_iser rdma_cm iw_cm ib_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear hid_generic usbhid hid psmouse pata_acpi floppy
Jan  9 20:31:31 servername kernel: [463522.293643] CPU: 6 PID: 1239 Comm: java Tainted: G             L  4.4.0-151-generic #178-Ubuntu
Jan  9 20:31:31 servername kernel: [463522.293644] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
Jan  9 20:31:31 servername kernel: [463522.293647] task: ffff88032fc00cc0 ti: ffff8800bbaa8000 task.ti: ffff8800bbaa8000
Jan  9 20:31:31 servername kernel: [463522.293648] RIP: 0010:[<ffffffff8110c2ac>]  [<ffffffff8110c2ac>] smp_call_function_many+0x1fc/0x260
Jan  9 20:31:31 servername kernel: [463522.293657] RSP: 0018:ffff8800bbaabc48  EFLAGS: 00000202
Jan  9 20:31:31 servername kernel: [463522.293659] RAX: 0000000000000003 RBX: 0000000000000200 RCX: 0000000000000003
Jan  9 20:31:31 servername kernel: [463522.293661] RDX: ffff88033fcdb1d0 RSI: 0000000000000200 RDI: ffff88033fd98208
Jan  9 20:31:31 servername kernel: [463522.293662] RBP: ffff8800bbaabc80 R08: 0000000000000000 R09: 000000000000003e
Jan  9 20:31:31 servername kernel: [463522.293664] R10: 0000000000000008 R11: ffff88033fd98208 R12: ffff88033fd98208
Jan  9 20:31:31 servername kernel: [463522.293666] R13: ffff88033fd98200 R14: ffffffff81075180 R15: ffff8800bbaabc90
Jan  9 20:31:31 servername kernel: [463522.293669] FS:  00007f0d2d13c700(0000) GS:ffff88033fd80000(0000) knlGS:0000000000000000
Jan  9 20:31:31 servername kernel: [463522.293671] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan  9 20:31:31 servername kernel: [463522.293673] CR2: 00007f0431aea000 CR3: 00000000bb84a000 CR4: 0000000000000670
Jan  9 20:31:31 servername kernel: [463522.293682] Stack:
Jan  9 20:31:31 servername kernel: [463522.293685]  00000000000181c0 01ff880000000001 00007f0d6c26b000 ffff88033064dad0
Jan  9 20:31:31 servername kernel: [463522.293688]  ffff88033064dad0 00007f0d6c26a000 ffff88033064d800 ffff8800bbaabcc8
Jan  9 20:31:31 servername kernel: [463522.293691]  ffffffff810756a7 ffff88033064d800 00007f0d6c26a000 00007f0d6c26b000
Jan  9 20:31:31 servername kernel: [463522.293695] Call Trace:
Jan  9 20:31:31 servername kernel: [463522.293702]  [<ffffffff810756a7>] native_flush_tlb_others+0x57/0x160
Jan  9 20:31:31 servername kernel: [463522.293705]  [<ffffffff8107584d>] flush_tlb_mm_range+0x9d/0x180
Jan  9 20:31:31 servername kernel: [463522.293711]  [<ffffffff811d5fe8>] change_protection_range+0x898/0x900
Jan  9 20:31:31 servername kernel: [463522.293715]  [<ffffffff811d60b4>] change_protection+0x14/0x20
Jan  9 20:31:31 servername kernel: [463522.293718]  [<ffffffff811d6210>] mprotect_fixup+0x150/0x330
Jan  9 20:31:31 servername kernel: [463522.293723]  [<ffffffff810d2584>] ? rwsem_wake+0x64/0xa0
Jan  9 20:31:31 servername kernel: [463522.293728]  [<ffffffff813a490d>] ? apparmor_file_mprotect+0x2d/0x30
Jan  9 20:31:31 servername kernel: [463522.293731]  [<ffffffff811d656a>] SyS_mprotect+0x17a/0x260
Jan  9 20:31:31 servername kernel: [463522.293735]  [<ffffffff81863b5b>] entry_SYSCALL_64_fastpath+0x22/0xcb
Jan  9 20:31:31 servername kernel: [463522.293737] Code: 48 63 d2 e8 c7 42 31 00 3b 05 75 82 e3 00 89 c1 0f 8d 93 fe ff ff 48 98 49 8b 55 00 48 03 14 c5 a0 1b f4 81 8b 42 18 a8 01 74 ca <f3> 90 8b 42 18 a8 01 75 f7 eb bf 0f b6 4d d0 4c 89 fa 4c 89 f6
 
After some investigation we suddenly discovered that our NFS storage is far too slow: dd shows 10-15 MB/s on a gigabit link. We also use pigz to speed up the gzip backups, so during backup CPU usage went above 40% and I/O delay spiked to 10%. The host, with 64 GB RAM (60% usage), also swapped actively during backup. We decided to switch to CIFS, got 120 MB/s with dd, and the problem was solved. It's hard to say what exactly the root cause was, but I hope this information will be helpful to someone.
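
In case someone wants to run the same test, this is more or less how we measured the write speed (storage IDs and paths are placeholders for our actual storages):
Code:
# rough sequential write test against the mounted backup storages
# (conv=fdatasync forces the data to be flushed, so you see the real storage speed)
dd if=/dev/zero of=/mnt/pve/backup-nfs/ddtest bs=1M count=1024 conv=fdatasync
dd if=/dev/zero of=/mnt/pve/backup-cifs/ddtest bs=1M count=1024 conv=fdatasync
rm /mnt/pve/backup-nfs/ddtest /mnt/pve/backup-cifs/ddtest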
 
hello, maybe you have the NFS export with the "sync" option and the NFS server's disks without a protected write cache (e.g. a BBU)?
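
you can check which options are actually in use, for example:
Code:
# on the PVE host: show the NFS mount options in use (rsize/wsize, sync/async, etc.)
nfsstat -m
# if the NFS server is a Linux box: show the export options (sync vs. async)
exportfs -v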
 
Hello, and thanks for the reply. Actually I can't say for sure, because the NFS share was configured a long time ago on a Windows machine and has worked more or less decently. The RAID controller has no BBU.
 
