Snapshot breaks Docker containers on Debian VM?

thejp

New Member
Dec 12, 2022
Hi,

I've searched around but couldn't quite find a solution.

So I have a node with two VMs only - Unraid and a Debian with docker and about 25 containers running.

I have noticed through trial and error that my Debian VM runs smoothly if I don't back it up, but when I include it in the backup job, which is configured with snapshot mode, a good portion of the containers run into issues and eventually stop working. A perfect example is Plex, which actually breaks, and I then consistently get "Database corrupt".
In fact, one time this even led to having to run fsck to fix filesystem errors within the Debian VM.

Curiously enough, two other VMs on other machines are also running Docker, but on Ubuntu, and they don't seem to have any issues (though those only run maybe 4 or 5 containers each).

So the question is: any ideas of other things I should look at/try? Or just accept the fact that I need to run a backup with stop mode just to make sure I have no issues?

Extra info: the QEMU guest agent is running (regular shutdown and IP showing); the backup is done via PBS; the snapshot backup itself works - it's the underlying Docker setup that runs into issues, maybe because of the freeze operation.
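If it helps to reproduce this outside of a backup: the freeze that the snapshot backup triggers through the guest agent can also be run by hand - a rough sketch, with 100 standing in for the actual VMID:

Code:
# on the Proxmox host; 100 is a placeholder VMID
qm agent 100 fsfreeze-freeze   # ask the guest agent to freeze the guest filesystems
qm agent 100 fsfreeze-status   # should report "frozen"
qm agent 100 fsfreeze-thaw     # thaw again - don't leave the VM frozen for long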
 
Hey there!
Now that's curious - the next time it happens in the affected VM, can you look through your syslog, kernel log and systemd journal? I'm curious if there's anything in there regarding Docker.
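Something along these lines inside the VM should narrow it down (the time window is just an example - adjust it to when your backup actually runs):

Code:
# inside the affected Debian VM
journalctl -k --since "01:00" --until "02:30"             # kernel messages around the backup window
journalctl -u docker --since "01:00" --until "02:30"      # Docker daemon messages
journalctl -u containerd --since "01:00" --until "02:30"  # containerd, if it runs as a separate unit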

Also, which filesystem is that VM using, and what's backing the filesystem of that VM? (For example, does it run ext4 and is backed by ZFS?)
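If you're not sure off the top of your head, something like this should tell you (100 is a placeholder VMID and the storage names will of course differ on your setup):

Code:
# inside the VM - filesystem of the root disk
findmnt -no FSTYPE /
# on the Proxmox host - which storage backs the VM's disks
qm config 100 | grep -E 'scsi|virtio|sata|ide'
cat /etc/pve/storage.cfg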

Is the set of containers that break completely random, or are there any patterns (apart from the Plex container)?

Or just accept the fact that I need to run a backup with stop mode just to make sure I have no issues?
This might just be the safest option if we can't identify what the exact problem is here. You could also try splitting that VM up into multiple smaller ones, since that seems to work for the other VMs. You could also test whether that problem really only exists on that Debian host - maybe you can replicate that setup on an Ubuntu VM (should be easy enough if you're using Docker Compose, right?) and see if the problem still persists.
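For reference, a one-off stop-mode backup of just that VM could look roughly like this (100 and the storage name "pbs" are placeholders - the same can also be set up as a separate backup job in the GUI):

Code:
# on the Proxmox host
vzdump 100 --mode stop --storage pbs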
 
Hi,

Both the VM and underlying Proxmox storage are plain old ext4.
Seems to me that some containers break more often than others - Plex is a very consistent one. I first noticed it when I was still messing around with the VM (I'm only about two months into using Proxmox... I used to run everything without a hypervisor) and had a daily snapshot backup of my VMs at 1 am. I would get the email with the backup success, then at around 2 am every day I would get a message from Plex saying the database was corrupted. I would then fix the DB, everything would work throughout the day (since I had to restart Plex itself), but the other containers would need to be restarted whenever I found something not working... then at 1 am the next day, rinse and repeat. This last week I took this VM out of the backup job and it's been very steady ever since.

Honestly, running it in stop mode is not going to be any sort of deal breaker, but for the sake of science I'm going to try a couple of other things, including switching the Docker Compose setup over to a fresh Ubuntu Server VM.
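The move itself should be straightforward - roughly something like this, assuming the stack uses bind mounts next to the compose file (the paths and the "appdata" directory are placeholders):

Code:
# on the Debian VM
docker compose down                                  # stop the stack cleanly
tar czf stack.tar.gz docker-compose.yml ./appdata    # compose file plus bind-mounted data

# on the fresh Ubuntu Server VM (Docker already installed)
tar xzf stack.tar.gz
docker compose up -d                                 # bring the same stack back up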
Thanks :)
 
Alright! Let me know how it goes. This is a rather interesting issue, so I'm curious what you'll find.
 
Hi, I have the same behavior on my Kubernetes cluster with the OpenEBS cStor and Longhorn storage engines. In my case I disabled the backup on these VMs (AlmaLinux 9), but when other VMs get backed up, some PVCs still get corrupted and I have to run fsck on those volumes. On Proxmox I'm using backup with snapshot mode over a ZFS backend, saving to PBS. It's a really strange behavior. Dmesg:

Code:
[82698.901351] INFO: task OrientDB WAL Fl:296254 blocked for more than 122 seconds.
[82698.901985]       Not tainted 5.14.0-162.22.2.el9_1.x86_64 #1
[82698.902618] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[82698.903228] task:OrientDB WAL Fl state:D stack:    0 pid:296254 ppid:295441 flags:0x00004000
[82698.903790] Call Trace:
[82698.904367]  __schedule+0x248/0x590
[82698.904953]  ? out_of_line_wait_on_bit_lock+0xb0/0xb0
[82698.905510]  schedule+0x5a/0xc0
[82698.906118]  io_schedule+0x42/0x70
[82698.906658]  bit_wait_io+0xd/0x60
[82698.907212]  __wait_on_bit_lock+0x5b/0xb0
[82698.907739]  out_of_line_wait_on_bit_lock+0x92/0xb0
[82698.908274]  ? __ia32_sys_membarrier+0x20/0x20
[82698.908806]  ext4_update_super+0x3c0/0x410 [ext4]
[82698.909444]  ? wake_up_q+0x4a/0x90
[82698.910004]  ext4_commit_super+0x46/0x100 [ext4]
[82698.910559]  ext4_handle_error+0x1d9/0x1f0 [ext4]
[82698.911128]  __ext4_error+0x9c/0x120 [ext4]
[82698.911674]  ext4_journal_check_start+0x89/0xa0 [ext4]
[82698.912236]  __ext4_journal_start_sb+0x31/0x120 [ext4]
[82698.912771]  ext4_dirty_inode+0x35/0x80 [ext4]
[82698.913335]  __mark_inode_dirty+0x123/0x350
[82698.913883]  generic_update_time+0x6c/0xd0
[82698.914431]  file_update_time+0x127/0x140
[82698.915029]  ? generic_write_checks+0x61/0xc0
[82698.915550]  ext4_buffered_write_iter+0x50/0x110 [ext4]
[82698.916115]  new_sync_write+0x11c/0x1b0
[82698.916635]  vfs_write+0x1ef/0x280
[82698.917180]  __x64_sys_pwrite64+0x90/0xc0
[82698.917726]  do_syscall_64+0x59/0x90
[82698.918326]  ? __irq_exit_rcu+0x46/0xe0
[82698.918881]  ? sysvec_apic_timer_interrupt+0x3c/0x90
[82698.919442]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[82698.919989] RIP: 0033:0x7f6b85082487
[82698.920511] RSP: 002b:00007f6aebafb160 EFLAGS: 00000293 ORIG_RAX: 0000000000000012
[82698.921069] RAX: ffffffffffffffda RBX: 00000000000003aa RCX: 00007f6b85082487
[82698.921608] RDX: 0000000000010000 RSI: 00007f6b4c001820 RDI: 00000000000003aa
[82698.922239] RBP: 00007f6b4c001820 R08: 0000000000000000 R09: 0000000757d46fc8
[82698.922836] R10: 0000000000030000 R11: 0000000000000293 R12: 0000000000010000
[82698.923425] R13: 0000000000030000 R14: 0000000000030000 R15: 00007f6b3c029800
[82724.777149] sd 7:0:0:0: [sde] tag#38 timing out command, waited 180s
[82724.780451] sd 7:0:0:0: [sde] tag#38 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=180s
[82724.781068] sd 7:0:0:0: [sde] tag#38 CDB: Write(10) 2a 00 00 00 00 00 00 00 08 00
[82724.781657] I/O error, dev sde, sector 0 op 0x1:(WRITE) flags 0x3800 phys_seg 1 prio class 0
[82724.782329] I/O error, dev sde, sector 0 op 0x1:(WRITE) flags 0x3800 phys_seg 1 prio class 0
[82724.782948] Buffer I/O error on dev sde, logical block 0, lost sync page write
[82724.783557] EXT4-fs (sde): I/O error while writing superblock
[82724.783584] EXT4-fs (sde): previous I/O error to superblock detected
[82724.783629] EXT4-fs (sde): previous I/O error to superblock detected
[82724.784217] EXT4-fs (sde): Remounting filesystem read-only
[82724.786191] EXT4-fs (sde): failed to convert unwritten extents to written extents -- potential data loss!  (inode 4458758, error -30)
[82724.786866] Buffer I/O error on device sde, logical block 17861959
[82724.787572] EXT4-fs (sde): failed to convert unwritten extents to written extents -- potential data loss!  (inode 4458753, error -30)
[82724.788275] Buffer I/O error on device sde, logical block 17861962
[82724.789007] Buffer I/O error on device sde, logical block 17861963
[82724.789988] EXT4-fs (sde): failed to convert unwritten extents to written extents -- potential data loss!  (inode 4456455, error -30)
[82724.790695] Buffer I/O error on device sde, logical block 17907658
[82724.791593] EXT4-fs (sde): failed to convert unwritten extents to written extents -- potential data loss!  (inode 4458762, error -30)
[82724.792327] Buffer I/O error on device sde, logical block 17866770
[82724.793087] Buffer I/O error on device sde, logical block 17866771
[82738.734358] sd 6:0:0:0: [sdd] tag#103 timing out command, waited 180s
[82738.735247] sd 6:0:0:0: [sdd] tag#103 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=180s
[82738.736021] sd 6:0:0:0: [sdd] tag#103 CDB: Write(10) 2a 00 00 00 00 00 00 00 08 00
[82738.736814] I/O error, dev sdd, sector 0 op 0x1:(WRITE) flags 0x3800 phys_seg 1 prio class 0
[82738.737591] I/O error, dev sdd, sector 0 op 0x1:(WRITE) flags 0x3800 phys_seg 1 prio class 0
[82738.738357] Buffer I/O error on dev sdd, logical block 0, lost sync page write
[82738.739156] EXT4-fs (sdd): I/O error while writing superblock
[82738.739189] EXT4-fs (sdd): previous I/O error to superblock detected
[82738.739349] EXT4-fs (sdd): previous I/O error to superblock detected
[82738.739976] EXT4-fs (sdd): Remounting filesystem read-only
[82905.414351] sd 7:0:0:0: [sde] tag#89 timing out command, waited 180s
[82905.417982] sd 7:0:0:0: [sde] tag#89 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=180s
[82905.418951] sd 7:0:0:0: [sde] tag#89 CDB: Write(10) 2a 00 00 00 00 00 00 00 08 00
[82905.419975] I/O error, dev sde, sector 0 op 0x1:(WRITE) flags 0x3800 phys_seg 1 prio class 0
[82905.420846] I/O error, dev sde, sector 0 op 0x1:(WRITE) flags 0x3800 phys_seg 1 prio class 0
[82905.421626] Buffer I/O error on dev sde, logical block 0, lost sync page write
[82905.422517] EXT4-fs (sde): I/O error while writing superblock
[82919.386332] sd 6:0:0:0: [sdd] tag#42 timing out command, waited 180s
[82919.387239] sd 6:0:0:0: [sdd] tag#42 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=180s
[82919.388113] sd 6:0:0:0: [sdd] tag#42 CDB: Write(10) 2a 00 00 00 00 00 00 00 08 00
[82919.388945] I/O error, dev sdd, sector 0 op 0x1:(WRITE) flags 0x3800 phys_seg 1 prio class 0
[82919.389813] I/O error, dev sdd, sector 0 op 0x1:(WRITE) flags 0x3800 phys_seg 1 prio class 0
[82919.390676] Buffer I/O error on dev sdd, logical block 0, lost sync page write
[82919.391558] EXT4-fs (sdd): I/O error while writing superblock
[83086.072000] sd 7:0:0:0: [sde] tag#28 timing out command, waited 180s
[83086.075441] sd 7:0:0:0: [sde] tag#28 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=180s
[83086.076401] sd 7:0:0:0: [sde] tag#28 CDB: Write(10) 2a 00 00 00 00 00 00 00 08 00
[83086.077263] I/O error, dev sde, sector 0 op 0x1:(WRITE) flags 0x3800 phys_seg 1 prio class 0
[83086.078164] I/O error, dev sde, sector 0 op 0x1:(WRITE) flags 0x3800 phys_seg 1 prio class 0
[83086.079019] Buffer I/O error on dev sde, logical block 0, lost sync page write
[83086.079896] EXT4-fs (sde): I/O error while writing superblock
[83086.079896] EXT4-fs (sde): I/O error while writing superblock
[83100.019596] sd 6:0:0:0: [sdd] tag#93 timing out command, waited 180s
[83100.020540] sd 6:0:0:0: [sdd] tag#93 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=180s
[83100.021509] sd 6:0:0:0: [sdd] tag#93 CDB: Write(10) 2a 00 00 00 00 00 00 00 08 00
[83100.022488] I/O error, dev sdd, sector 0 op 0x1:(WRITE) flags 0x3800 phys_seg 1 prio class 0
[83100.023392] I/O error, dev sdd, sector 0 op 0x1:(WRITE) flags 0x3800 phys_seg 1 prio class 0
[83100.024183] Buffer I/O error on dev sdd, logical block 0, lost sync page write
[83100.025012] EXT4-fs (sdd): I/O error while writing superblock
[83100.025024] EXT4-fs (sdd): I/O error while writing superblock
[83100.025168] EXT4-fs (sdd): I/O error while writing superblock
[83100.025883] EXT4-fs (sdd): failed to convert unwritten extents to written extents -- potential data loss!  (inode 5505116, error -30)
[83100.028961] Buffer I/O error on device sdd, logical block 11573522
[84074.247686] EXT4-fs (sde): recovery complete
[84074.251125] EXT4-fs (sde): mounted filesystem with ordered data mode. Quota mode: none.
[84075.057172] sd 7:0:0:0: [sde] Synchronizing SCSI cache
[125216.888509] hrtimer: interrupt took 172921101 ns
[157327.585880] sd 6:0:0:0: [sdd] Synchronizing SCSI cache
 
My question is: why does that VM fail on ext4 when the backup isn't even run on that VM? Maybe the ZFS snapshot is doing something to all the volumes?
 
After some research I think I found the issue. The problem is (I think) with the linked clone and the base disk image. These VMs are linked clones from a template, so their disks are based on the template's disk, which is also used by some other VMs that do have the snapshot backup enabled. So I think that when another VM enters the backup state, the base disk is frozen by the snapshot, and that has an impact on the Kubernetes VM. Could that be it?
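The linked-clone relationship can at least be verified like this (the VMID 200 and the storage/pool names are placeholders for my setup):

Code:
# on the Proxmox host
qm config 200 | grep -i disk        # linked-clone disks reference a base-<vmid>-disk-N volume
pvesm list local-zfs | grep base    # base volumes on that storage
zfs list -t all -o name,origin      # on a ZFS backend, shows each clone's origin snapshot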
 
