Snapshot Backup freezes VMs randomly

FinneDEV

Hello ProxMoxers,

Currently, we have the problem that a backup in snapshot mode randomly freezes a VM at the end of the backup.
It happens randomly, so not in every backup run.

Does anyone have an idea how to fix this, or where the problem might come from?

Best regards
Fynn
 
The backup issues an fs-freeze and an fs-thaw through the guest agent while snapshotting. Can you check the logs to see whether the thaw completed successfully on the machine or whether it timed out?
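
If it helps, here is a rough sketch (Python, not an official tool; the default log path is only an example of where vzdump keeps its per-VM logs) that does exactly that check: it scans a task log and complains if an fs-freeze is never followed by an fs-thaw.

Code:
#!/usr/bin/env python3
# Rough sketch: scan a vzdump task log and check that every
# "issuing guest-agent 'fs-freeze' command" line is followed by a matching
# 'fs-thaw' line. The default path below is only an example.
import sys

def thaw_followed_freeze(path):
    frozen = False
    with open(path) as log:
        for line in log:
            if "issuing guest-agent 'fs-freeze' command" in line:
                frozen = True
            elif "issuing guest-agent 'fs-thaw' command" in line:
                frozen = False
    # If a freeze is still "open" at the end of the log, no thaw was logged
    # (or it timed out) and the guest may have stayed frozen.
    return not frozen

if __name__ == "__main__":
    log = sys.argv[1] if len(sys.argv) > 1 else "/var/log/vzdump/qemu-6044.log"
    print("fs-thaw was logged" if thaw_followed_freeze(log) else "no fs-thaw after fs-freeze!")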
 
Here is the latest backup log of a VM that had the problem today.


Code:
INFO: starting new backup job: vzdump 6044 --compress lzo --storage pbs --mode snapshot --node ls-ds-34
INFO: Starting Backup of VM 6044 (qemu)
INFO: Backup started at 2020-10-21 09:00:07
INFO: status = running
INFO: VM Name: HOST.MUSTER.DE
INFO: include disk 'scsi0' 'replica2:vm-6044-disk-0' 50G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: pending configuration changes found (not included into backup)
INFO: creating Proxmox Backup Server archive 'vm/6044/2020-10-21T07:00:07Z'
INFO: issuing guest-agent 'fs-freeze' command
INFO: enabling encryption
INFO: issuing guest-agent 'fs-thaw' command
INFO: started backup task 'feb2fdc6-5f2d-4bdc-9f66-f5d3f4d279c8'
INFO: resuming VM again
INFO: using fast incremental mode (dirty-bitmap), 516.0 MiB dirty of 50.0 GiB total
INFO:  78% (404.0 MiB of 516.0 MiB) in  3s, read: 134.7 MiB/s, write: 0 B/s
INFO:  79% (408.0 MiB of 516.0 MiB) in  1m 30s, read: 47.1 KiB/s, write: 47.1 KiB/s
INFO:  82% (424.0 MiB of 516.0 MiB) in  1m 35s, read: 3.2 MiB/s, write: 3.2 MiB/s
INFO:  83% (432.0 MiB of 516.0 MiB) in  1m 38s, read: 2.7 MiB/s, write: 2.7 MiB/s
INFO:  86% (444.0 MiB of 516.0 MiB) in  1m 41s, read: 4.0 MiB/s, write: 4.0 MiB/s
INFO:  87% (452.0 MiB of 516.0 MiB) in  1m 45s, read: 2.0 MiB/s, write: 2.0 MiB/s
INFO:  89% (460.0 MiB of 516.0 MiB) in  1m 48s, read: 2.7 MiB/s, write: 2.7 MiB/s
INFO:  92% (476.0 MiB of 516.0 MiB) in  1m 51s, read: 5.3 MiB/s, write: 5.3 MiB/s
INFO:  93% (484.0 MiB of 516.0 MiB) in  1m 56s, read: 1.6 MiB/s, write: 1.6 MiB/s
INFO:  96% (496.0 MiB of 516.0 MiB) in  1m 59s, read: 4.0 MiB/s, write: 4.0 MiB/s
INFO:  99% (512.0 MiB of 516.0 MiB) in  2m  2s, read: 5.3 MiB/s, write: 5.3 MiB/s
INFO: 100% (516.0 MiB of 516.0 MiB) in  5m 25s, read: 20.2 KiB/s, write: 20.2 KiB/s
INFO: backup was done incrementally, reused 49.50 GiB (99%)
INFO: transferred 516.00 MiB in 325 seconds (1.6 MiB/s)
INFO: Finished Backup of VM 6044 (00:05:51)
INFO: Backup finished at 2020-10-21 09:05:58
INFO: Backup job finished successfully
TASK OK
 
Yes, we have the current release running on the system.
And yes, it seems like it is issuing the command successfully.
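
For the next time it happens, a direct check from the host might show whether the guest's filesystems were really thawed, and thaw them manually if not. A rough sketch, assuming a PVE release where qm guest cmd exposes the fsfreeze subcommands (VMID 6044 is just the example from this thread):

Code:
#!/usr/bin/env python3
# Rough sketch: query the QEMU guest agent from the PVE host and thaw the
# guest's filesystems if they are still frozen after a backup.
# Assumes 'qm guest cmd <vmid> fsfreeze-status' / 'fsfreeze-thaw' exist on
# this PVE release; VMID 6044 is only the example from this thread.
import subprocess

VMID = "6044"

def agent(cmd):
    return subprocess.run(["qm", "guest", "cmd", VMID, cmd],
                          capture_output=True, text=True)

status = agent("fsfreeze-status")
print(status.stdout.strip() or status.stderr.strip())
if "frozen" in status.stdout:
    # The guest apparently never processed the thaw -> issue it manually.
    thaw = agent("fsfreeze-thaw")
    print(thaw.stdout.strip() or thaw.stderr.strip())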
 
Hi,

We have a problem too: when backups run, the guest VM (KVM, QEMU guest agent installed) hangs. dmesg:
[4242501.119670] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[4242501.122565] rcu: 0-...0: (2 GPs behind) idle=b86/0/0x1 softirq=433444253/433444253 fqs=2415
[4242501.123487] rcu: (detected by 2, t=5252 jiffies, g=1324201561, q=2636)
[4242501.124288] Sending NMI from CPU 2 to CPUs 0:
[4242502.139816] NMI backtrace for cpu 0
 
I have the same issue when the snapshot is created. This VM runs a MongoDB server and I am pretty sure it crashes during the snapshot.

Code:
[Wed Dec  9 23:53:17 2020] INFO: rcu_sched detected stalls on CPUs/tasks:
[Wed Dec  9 23:53:17 2020]      9-...: (15 GPs behind) idle=5d3/1/0 softirq=1222053/1222053 fqs=2555
[Wed Dec  9 23:53:17 2020]      (detected by 7, t=5255 jiffies, g=4567914, c=4567913, q=127)
[Wed Dec  9 23:53:17 2020] Task dump for CPU 9:
[Wed Dec  9 23:53:17 2020] swapper/9       R  running task        0     0      1 0x00000008
[Wed Dec  9 23:53:17 2020]  0000000103e22def 0000000000000000 0000000000000000 0100000000000000
[Wed Dec  9 23:53:17 2020]  ffffffffa80141e0 0000000000000000 0000000000000000 0000000000000000
[Wed Dec  9 23:53:17 2020]  0000000000000000 ffffffffffffff10 ffffffffa80144b2 0000000000000010
[Wed Dec  9 23:53:17 2020] Call Trace:
[Wed Dec  9 23:53:17 2020]  [<ffffffffa80141e0>] ? __sched_text_end+0x1/0x1
[Wed Dec  9 23:53:17 2020]  [<ffffffffa80144b2>] ? native_safe_halt+0x2/0x10
[Wed Dec  9 23:53:17 2020]  [<ffffffffa80141fa>] ? default_idle+0x1a/0xd0
[Wed Dec  9 23:53:17 2020]  [<ffffffffa7abbd9a>] ? cpu_startup_entry+0x1ca/0x240
[Wed Dec  9 23:53:17 2020]  [<ffffffffa7a48bb0>] ? start_secondary+0x170/0x1b0
 
I have noticed that the problem occurs also on other VMs.

Host machine specification:
CPU(s) 48 x Intel(R) Xeon(R) CPU E7-4860 v2 @ 2.60GHz (2 Sockets)
Kernel Version Linux 5.4.78-1-pve #1 SMP PVE 5.4.78-1 (Mon, 30 Nov 2020 10:57:47 +0100)

Below are the PBS log and kernel messages from the VM. Is it possible that this problem depends on the host's kernel?

DMESG >> [Sun Dec 13 00:12:42 2020] watchdog: BUG: soft lockup - CPU#4 stuck for 22s! [node:3811]
PBS LOG>> 507: 2020-12-13 00:12:43 INFO: issuing guest-agent 'fs-thaw' command

Every VM is affected by a CPU stall during the backup.

Code:
507: 2020-12-13 00:12:15 INFO: Starting Backup of VM 507 (qemu)
507: 2020-12-13 00:12:15 INFO: status = running
507: 2020-12-13 00:12:15 INFO: VM Name: xxx...xxx
507: 2020-12-13 00:12:15 INFO: include disk 'virtio0' 'DATA-VM:507/vm-507-disk-0.qcow2' 50G
507: 2020-12-13 00:12:15 INFO: backup mode: snapshot
507: 2020-12-13 00:12:15 INFO: ionice priority: 7
507: 2020-12-13 00:12:15 INFO: creating Proxmox Backup Server archive 'vm/507/2020-12-12T23:12:15Z'
507: 2020-12-13 00:12:15 INFO: issuing guest-agent 'fs-freeze' command
507: 2020-12-13 00:12:16 INFO: enabling encryption
507: 2020-12-13 00:12:43 INFO: issuing guest-agent 'fs-thaw' command
507: 2020-12-13 00:12:43 INFO: started backup task '8a64d844-5d33-4508-b3bc-f42696e0382e'
507: 2020-12-13 00:12:43 INFO: resuming VM again
507: 2020-12-13 00:12:43 INFO: virtio0: dirty-bitmap status: OK (1016.0 MiB of 50.0 GiB dirty)
507: 2020-12-13 00:12:43 INFO: using fast incremental mode (dirty-bitmap), 1016.0 MiB dirty of 50.0 GiB total
507: 2020-12-13 00:12:46 INFO:  18% (192.0 MiB of 1016.0 MiB) in  3s, read: 64.0 MiB/s, write: 64.0 MiB/s
507: 2020-12-13 00:12:49 INFO:  38% (396.0 MiB of 1016.0 MiB) in  6s, read: 68.0 MiB/s, write: 68.0 MiB/s
507: 2020-12-13 00:12:52 INFO:  39% (404.0 MiB of 1016.0 MiB) in  9s, read: 2.7 MiB/s, write: 2.7 MiB/s
507: 2020-12-13 00:13:16 INFO:  41% (424.0 MiB of 1016.0 MiB) in 33s, read: 853.3 KiB/s, write: 853.3 KiB/s
507: 2020-12-13 00:13:19 INFO:  60% (612.0 MiB of 1016.0 MiB) in 36s, read: 62.7 MiB/s, write: 62.7 MiB/s
507: 2020-12-13 00:13:22 INFO:  72% (740.0 MiB of 1016.0 MiB) in 39s, read: 42.7 MiB/s, write: 42.7 MiB/s
507: 2020-12-13 00:13:25 INFO:  79% (812.0 MiB of 1016.0 MiB) in 42s, read: 24.0 MiB/s, write: 24.0 MiB/s
507: 2020-12-13 00:13:38 INFO:  81% (828.0 MiB of 1016.0 MiB) in 55s, read: 1.2 MiB/s, write: 1.2 MiB/s
507: 2020-12-13 00:13:41 INFO:  95% (968.0 MiB of 1016.0 MiB) in 58s, read: 46.7 MiB/s, write: 46.7 MiB/s
507: 2020-12-13 00:13:44 INFO: 100% (1016.0 MiB of 1016.0 MiB) in  1m  1s, read: 16.0 MiB/s, write: 16.0 MiB/s
507: 2020-12-13 00:14:05 INFO: backup was done incrementally, reused 49.01 GiB (98%)
507: 2020-12-13 00:14:05 INFO: transferred 1016.00 MiB in 82 seconds (12.4 MiB/s)
507: 2020-12-13 00:14:05 INFO: running 'proxmox-backup-client prune' for 'vm/507'
507: 2020-12-13 00:14:06 INFO: Finished Backup of VM 507 (00:01:51)
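
Looking at the timestamps above, the guest sits frozen from the fs-freeze at 00:12:15 until the fs-thaw at 00:12:43, roughly 28 seconds, which lines up with the 22-second soft lockup it reports at 00:12:42. A rough sketch (assuming the timestamped 'VMID: date time INFO: ...' task-log format shown above) to measure that freeze window:

Code:
#!/usr/bin/env python3
# Rough sketch: compute how long the guest stayed frozen from a timestamped
# task log in the "VMID: YYYY-MM-DD HH:MM:SS INFO: ..." format shown above.
import re
import sys
from datetime import datetime

STAMP = re.compile(r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}).*'fs-(freeze|thaw)'")

def freeze_window(path):
    times = {}
    with open(path) as log:
        for line in log:
            m = STAMP.search(line)
            if m:
                times[m.group(2)] = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
    if "freeze" in times and "thaw" in times:
        return (times["thaw"] - times["freeze"]).total_seconds()
    return None

if __name__ == "__main__":
    secs = freeze_window(sys.argv[1])
    if secs is None:
        print("no fs-freeze/fs-thaw pair found")
    else:
        print(f"guest was frozen for ~{secs:.0f} s")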


Code:
[Sun Dec 13 00:12:42 2020] watchdog: BUG: soft lockup - CPU#4 stuck for 22s! [node:3811]
[Sun Dec 13 00:12:42 2020] Modules linked in: bochs_drm ttm drm_kms_helper crct10dif_pclmul drm crc32_pclmul ghash_clmulni_intel virtio_console evdev sg button joydev virtio_balloon serio_raw qemu_fw_cfg pcspkr ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 crc32c_generic fscrypto ecb hid_generic usbhid hid sr_mod cdrom ata_generic virtio_net net_failover virtio_blk failover crc32c_intel uhci_hcd ehci_hcd ata_piix aesni_intel libata usbcore aes_x86_64 crypto_simd cryptd scsi_mod glue_helper psmouse virtio_pci virtio_ring i2c_piix4 usb_common virtio floppy
[Sun Dec 13 00:12:42 2020] CPU: 4 PID: 3811 Comm: node Not tainted 4.19.0-8-amd64 #1 Debian 4.19.98-1
[Sun Dec 13 00:12:42 2020] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-48-gd9c812dda519-prebuilt.qemu.org 04/01/2014
[Sun Dec 13 00:12:42 2020] RIP: 0010:smp_call_function_many+0x1f8/0x250
[Sun Dec 13 00:12:42 2020] Code: c7 e8 7c 34 5f 00 3b 05 aa 69 01 01 0f 83 8c fe ff ff 48 63 d0 48 8b 0b 48 03 0c d5 20 d7 4e a7 8b 51 18 83 e2 01 74 0a f3 90 <8b> 51 18 83 e2 01 75 f6 eb c8 48 c7 c2 60 dc 72 a7 4c 89 fe 89 df
[Sun Dec 13 00:12:42 2020] RSP: 0018:ffff9cf3c8fc7cb0 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff13
[Sun Dec 13 00:12:42 2020] RAX: 0000000000000002 RBX: ffff8c619fb23080 RCX: ffff8c619faa6f80
[Sun Dec 13 00:12:42 2020] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff8c619fb23088
[Sun Dec 13 00:12:42 2020] RBP: ffff8c619fb23088 R08: 0000000000000004 R09: ffff8c619fb230c8
[Sun Dec 13 00:12:42 2020] R10: ffff8c619fb23088 R11: 0000000000000008 R12: ffffffffa666cbf0
[Sun Dec 13 00:12:42 2020] R13: ffff9cf3c8fc7cf8 R14: 0000000000000001 R15: 0000000000000200
[Sun Dec 13 00:12:42 2020] FS:  00007f0603fff700(0000) GS:ffff8c619fb00000(0000) knlGS:0000000000000000
[Sun Dec 13 00:12:42 2020] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Sun Dec 13 00:12:42 2020] CR2: 00002d4a8a68e000 CR3: 00000005dce96000 CR4: 00000000000006e0
[Sun Dec 13 00:12:42 2020] Call Trace:
[Sun Dec 13 00:12:42 2020]  flush_tlb_mm_range+0xc3/0x110
[Sun Dec 13 00:12:42 2020]  change_protection+0x9bb/0x9d0
[Sun Dec 13 00:12:42 2020]  mprotect_fixup+0x1b1/0x2f0
[Sun Dec 13 00:12:42 2020]  ? __switch_to_asm+0x41/0x70
[Sun Dec 13 00:12:42 2020]  ? __switch_to_asm+0x35/0x70
[Sun Dec 13 00:12:42 2020]  ? __switch_to_asm+0x41/0x70
[Sun Dec 13 00:12:42 2020]  ? __switch_to+0x115/0x440
[Sun Dec 13 00:12:42 2020]  ? __switch_to_asm+0x41/0x70
[Sun Dec 13 00:12:42 2020]  ? finish_task_switch+0x71/0x280
[Sun Dec 13 00:12:42 2020]  do_mprotect_pkey+0x1b4/0x2f0
[Sun Dec 13 00:12:42 2020]  __x64_sys_mprotect+0x1b/0x20
[Sun Dec 13 00:12:42 2020]  do_syscall_64+0x53/0x110
[Sun Dec 13 00:12:42 2020]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[Sun Dec 13 00:12:42 2020] RIP: 0033:0x7f060a091207
[Sun Dec 13 00:12:42 2020] Code: 73 01 c3 48 8b 0d 89 6c 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 b8 0a 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 59 6c 0c 00 f7 d8 64 89 01 48
[Sun Dec 13 00:12:42 2020] RSP: 002b:00007f0603ffecd8 EFLAGS: 00000246 ORIG_RAX: 000000000000000a
[Sun Dec 13 00:12:42 2020] RAX: ffffffffffffffda RBX: 00003c7d75500000 RCX: 00007f060a091207
[Sun Dec 13 00:12:42 2020] RDX: 0000000000000000 RSI: 0000000000040000 RDI: 00003c7d75500000
[Sun Dec 13 00:12:42 2020] RBP: 00007f0603ffed00 R08: 0000000000040000 R09: 00000000051c9c00
[Sun Dec 13 00:12:42 2020] R10: 0000000000000007 R11: 0000000000000246 R12: 0000000000040000
[Sun Dec 13 00:12:42 2020] R13: 0000000000040000 R14: 00003c7d75500000 R15: 0000000004c8de58
[Sun Dec 13 00:12:45 2020] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[Sun Dec 13 00:12:45 2020] rcu:         (detected by 2, t=6735 jiffies, g=85605465, q=247)
[Sun Dec 13 00:12:45 2020] rcu: All QSes seen, last rcu_sched kthread activity 6735 (4421922872-4421916137), jiffies_till_next_fqs=1, root ->qsmask 0x0
[Sun Dec 13 00:12:45 2020] kworker/2:2     R  running task        0  3860      2 0x80000008
[Sun Dec 13 00:12:45 2020] Workqueue: events_freezable_power_ disk_events_workfn
[Sun Dec 13 00:12:45 2020] Call Trace:
[Sun Dec 13 00:12:45 2020]  <IRQ>
[Sun Dec 13 00:12:45 2020]  sched_show_task.cold.85+0x9a/0xc6
[Sun Dec 13 00:12:45 2020]  rcu_check_callbacks.cold.81+0x31e/0x335
[Sun Dec 13 00:12:45 2020]  ? tick_sched_do_timer+0x60/0x60
[Sun Dec 13 00:12:45 2020]  update_process_times+0x28/0x60
[Sun Dec 13 00:12:45 2020]  tick_sched_handle+0x22/0x60
[Sun Dec 13 00:12:45 2020]  tick_sched_timer+0x37/0x70
[Sun Dec 13 00:12:45 2020]  __hrtimer_run_queues+0x100/0x280
[Sun Dec 13 00:12:45 2020]  hrtimer_interrupt+0x100/0x220
[Sun Dec 13 00:12:45 2020]  smp_apic_timer_interrupt+0x6a/0x140
[Sun Dec 13 00:12:45 2020]  apic_timer_interrupt+0xf/0x20
[Sun Dec 13 00:12:45 2020]  </IRQ>
[Sun Dec 13 00:12:45 2020] RIP: 0010:_raw_spin_unlock_irqrestore+0x11/0x20
[Sun Dec 13 00:12:45 2020] Code: d8 48 3d 90 d0 03 00 76 cc 80 4d 00 08 eb 98 90 90 90 90 90 90 90 90 90 90 66 66 66 66 90 e8 76 63 9a ff 66 90 48 89 f7 57 9d <66> 66 90 66 90 c3 66 0f 1f 84 00 00 00 00 00 66 66 66 66 90 8b 07
[Sun Dec 13 00:12:45 2020] RSP: 0018:ffff9cf3c382fa88 EFLAGS: 00000293 ORIG_RAX: ffffffffffffff13
[Sun Dec 13 00:12:45 2020] RAX: 0000000000000001 RBX: 0000000000000000 RCX: ffff8c619faabb05
[Sun Dec 13 00:12:45 2020] RDX: ffff8c619faa1a60 RSI: 0000000000000293 RDI: 0000000000000293
[Sun Dec 13 00:12:45 2020] RBP: ffff8c6194b00000 R08: 0000000000000001 R09: ffff8c619f002480
[Sun Dec 13 00:12:45 2020] R10: ffff8c619f0024a8 R11: 0000000000000000 R12: 0000000000000293
[Sun Dec 13 00:12:45 2020] R13: ffff8c6194d87000 R14: ffff8c61959c0138 R15: 0000000000000000
[Sun Dec 13 00:12:45 2020]  ata_scsi_queuecmd+0x11c/0x330 [libata]
[Sun Dec 13 00:12:45 2020]  scsi_dispatch_cmd+0x95/0x230 [scsi_mod]
[Sun Dec 13 00:12:45 2020]  scsi_queue_rq+0x527/0x610 [scsi_mod]
[Sun Dec 13 00:12:45 2020]  blk_mq_dispatch_rq_list+0x38c/0x590
[Sun Dec 13 00:12:45 2020]  ? syscall_return_via_sysret+0x14/0x83
[Sun Dec 13 00:12:45 2020]  ? __switch_to_asm+0x41/0x70
[Sun Dec 13 00:12:45 2020]  ? __switch_to_asm+0x35/0x70
[Sun Dec 13 00:12:45 2020]  ? __switch_to_asm+0x41/0x70
[Sun Dec 13 00:12:45 2020]  ? update_group_capacity+0x23/0x2a0
[Sun Dec 13 00:12:45 2020]  ? refcount_inc_checked+0x5/0x30
[Sun Dec 13 00:12:45 2020]  ? kobject_get+0x18/0x40
[Sun Dec 13 00:12:45 2020]  blk_mq_do_dispatch_sched+0x76/0x120
[Sun Dec 13 00:12:45 2020]  blk_mq_sched_dispatch_requests+0x11e/0x170
[Sun Dec 13 00:12:45 2020]  __blk_mq_run_hw_queue+0x4e/0xe0
[Sun Dec 13 00:12:45 2020]  __blk_mq_delay_run_hw_queue+0x143/0x160
[Sun Dec 13 00:12:45 2020]  blk_mq_run_hw_queue+0x88/0x110
[Sun Dec 13 00:12:45 2020]  blk_mq_sched_insert_request+0x142/0x1e0
[Sun Dec 13 00:12:45 2020]  blk_execute_rq+0x4b/0x9a
[Sun Dec 13 00:12:45 2020]  __scsi_execute+0xeb/0x250 [scsi_mod]
[Sun Dec 13 00:12:45 2020]  sr_check_events+0x9f/0x270 [sr_mod]
[Sun Dec 13 00:12:45 2020]  ? __switch_to_asm+0x41/0x70
[Sun Dec 13 00:12:45 2020]  ? __switch_to_asm+0x35/0x70
[Sun Dec 13 00:12:45 2020]  cdrom_check_events+0x1a/0x30 [cdrom]
[Sun Dec 13 00:12:45 2020]  sr_block_check_events+0xa3/0xb0 [sr_mod]
[Sun Dec 13 00:12:45 2020]  disk_check_events+0x69/0x150
[Sun Dec 13 00:12:45 2020]  process_one_work+0x1a7/0x3a0
[Sun Dec 13 00:12:45 2020]  worker_thread+0x30/0x390
[Sun Dec 13 00:12:45 2020]  ? create_worker+0x1a0/0x1a0
[Sun Dec 13 00:12:45 2020]  kthread+0x112/0x130
[Sun Dec 13 00:12:45 2020]  ? kthread_bind+0x30/0x30
[Sun Dec 13 00:12:45 2020]  ret_from_fork+0x35/0x40
[Sun Dec 13 00:12:45 2020] rcu: rcu_sched kthread starved for 6735 jiffies! g85605465 f0x2 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=7
[Sun Dec 13 00:12:45 2020] rcu: RCU grace-period kthread stack dump:
[Sun Dec 13 00:12:45 2020] rcu_sched       R  running task        0    10      2 0x80000000
[Sun Dec 13 00:12:45 2020] Call Trace:
[Sun Dec 13 00:12:45 2020]  ? __schedule+0x2a2/0x870
[Sun Dec 13 00:12:45 2020]  schedule+0x28/0x80
[Sun Dec 13 00:12:45 2020]  schedule_timeout+0x16b/0x390
[Sun Dec 13 00:12:45 2020]  ? __next_timer_interrupt+0xc0/0xc0
[Sun Dec 13 00:12:45 2020]  rcu_gp_kthread+0x40d/0x850
[Sun Dec 13 00:12:45 2020]  ? call_rcu_sched+0x20/0x20
[Sun Dec 13 00:12:45 2020]  kthread+0x112/0x130
[Sun Dec 13 00:12:45 2020]  ? kthread_bind+0x30/0x30
[Sun Dec 13 00:12:45 2020]  ret_from_fork+0x35/0x40
 
