[SOLVED] rsync to VM crashes Proxmox host

nidomiro

New Member
Sep 23, 2023
I currently have the problem that my Proxmox host crashes at random times.
My first thought was that Ceph was the problem, so I removed it from my cluster.
After removing Ceph everything runs more stably, but I still experience host crashes.
While moving data from my Synology NAS to my TrueNAS VM I found a way to reproduce the crash: copying a large number of relatively small files (5-30 MB) to the TrueNAS VM via rsync (the command runs inside the TrueNAS VM).
At first the transfer works, but then it pauses for a while and continues. The pauses get longer and longer, until the host crashes.
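
For reference, the transfer was started with a command along these lines (a minimal sketch; the hostname, user and paths are placeholders, not my actual ones):

# run inside the TrueNAS VM: pull the files from the Synology NAS
# into the RAIDZ1 pool mounted under /mnt/tank (placeholder paths)
rsync -avP admin@synology.local:/volume1/data/ /mnt/tank/data/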

Before the crashes I often (but not always) see spikes in IO delay. The last two times I observed a crash, the server load was also quite high, while CPU and memory were not at 100%; the CPU was at around 20% load.
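
To see which devices cause the IO delay, per-device wait times on the host can be watched during the transfer; a small sketch (iostat is part of the sysstat package and may need to be installed first):

# on the Proxmox host: extended per-device statistics every 5 seconds;
# high %util and w_await on the boot SSDs would point at them
apt install sysstat
iostat -x 5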
[Screenshot: Proxmox summary page of host "jupiter" (2024-08-22) showing the graphs around the crash]


This time I managed to capture the dmesg output (on the host) just before the crash:
[59228.679912] INFO: task txg_sync:468 blocked for more than 122 seconds.
[59228.680195] Tainted: P O 6.8.12-1-pve #1
[59228.680463] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[59228.680726] task:txg_sync state:D stack:0 pid:468 tgid:468 ppid:2 flags:0x00004000
[59228.680997] Call Trace:
[59228.681295] <TASK>
[59228.681560] __schedule+0x3ec/0x1520
[59228.681824] schedule+0x33/0xf0
[59228.682093] schedule_timeout+0x95/0x170
[59228.682443] ? __pfx_process_timeout+0x10/0x10
[59228.682708] io_schedule_timeout+0x51/0x80
[59228.682990] __cv_timedwait_common+0x140/0x180 [spl]
[59228.683307] ? __pfx_autoremove_wake_function+0x10/0x10
[59228.683565] __cv_timedwait_io+0x19/0x30 [spl]
[59228.683868] zio_wait+0x13a/0x2c0 [zfs]
[59228.684405] dsl_pool_sync+0xce/0x4e0 [zfs]
[59228.684805] spa_sync+0x578/0x1030 [zfs]
[59228.685238] ? spa_txg_history_init_io+0x120/0x130 [zfs]
[59228.685628] txg_sync_thread+0x207/0x3a0 [zfs]
[59228.686020] ? __pfx_txg_sync_thread+0x10/0x10 [zfs]
[59228.686474] ? __pfx_thread_generic_wrapper+0x10/0x10 [spl]
[59228.686711] thread_generic_wrapper+0x5c/0x70 [spl]
[59228.686951] kthread+0xef/0x120
[59228.687252] ? __pfx_kthread+0x10/0x10
[59228.687508] ret_from_fork+0x44/0x70
[59228.687818] ? __pfx_kthread+0x10/0x10
[59228.688237] ret_from_fork_asm+0x1b/0x30
[59228.688519] </TASK>
[59228.688847] INFO: task zfs:642766 blocked for more than 122 seconds.
[59228.689118] Tainted: P O 6.8.12-1-pve #1
[59228.689350] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[59228.689585] task:zfs state:D stack:0 pid:642766 tgid:642766 ppid:642602 flags:0x00000006
[59228.689828] Call Trace:
[59228.690074] <TASK>
[59228.690416] __schedule+0x3ec/0x1520
[59228.690664] ? dsl_dataset_snapshot_check+0x153/0x4e0 [zfs]
[59228.691127] schedule+0x33/0xf0
[59228.691368] io_schedule+0x46/0x80
[59228.691606] cv_wait_common+0xac/0x140 [spl]
[59228.691860] ? __pfx_autoremove_wake_function+0x10/0x10
[59228.692146] __cv_wait_io+0x18/0x30 [spl]
[59228.692396] txg_wait_synced_impl+0xe1/0x130 [zfs]
[59228.692806] txg_wait_synced+0x10/0x60 [zfs]
[59228.693285] dsl_sync_task_common+0x1dd/0x2b0 [zfs]
[59228.693693] ? __pfx_dsl_dataset_snapshot_check+0x10/0x10 [zfs]
[59228.694156] ? __pfx_dsl_dataset_snapshot_sync+0x10/0x10 [zfs]
[59228.694619] ? __pfx_dsl_dataset_snapshot_check+0x10/0x10 [zfs]
[59228.695032] ? __pfx_dsl_dataset_snapshot_sync+0x10/0x10 [zfs]
[59228.695479] dsl_sync_task+0x1a/0x30 [zfs]
[59228.695909] dsl_dataset_snapshot+0x191/0x380 [zfs]
[59228.696350] ? __kmalloc_node+0x1cb/0x4a0
[59228.696598] ? kvmalloc_node+0x24/0x100
[59228.696856] ? kvmalloc_node+0x24/0x100
[59228.697111] ? kvmalloc_node+0x24/0x100
[59228.697356] ? spl_kvmalloc+0xa5/0xc0 [spl]
[59228.697605] ? spl_kmem_alloc_impl+0xfe/0x130 [spl]
[59228.697855] ? nvt_lookup_name_type.isra.0+0x73/0xc0 [zfs]
[59228.698243] zfs_ioc_snapshot+0x27c/0x360 [zfs]
[59228.698639] zfsdev_ioctl_common+0x5a9/0x9f0 [zfs]
[59228.699041] zfsdev_ioctl+0x57/0xf0 [zfs]
[59228.699440] __x64_sys_ioctl+0xa0/0xf0
[59228.699666] x64_sys_call+0xa68/0x24b0
[59228.699906] do_syscall_64+0x81/0x170
[59228.700191] ? irqentry_exit+0x43/0x50
[59228.700411] ? exc_page_fault+0x94/0x1b0
[59228.700630] entry_SYSCALL_64_after_hwframe+0x78/0x80
[59228.700856] RIP: 0033:0x7d1b1a1bac5b
[59228.701113] RSP: 002b:00007ffca086abb0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[59228.701347] RAX: ffffffffffffffda RBX: 0000000000005a23 RCX: 00007d1b1a1bac5b
[59228.701580] RDX: 00007ffca086ac30 RSI: 0000000000005a23 RDI: 0000000000000004
[59228.701814] RBP: 00007ffca086e220 R08: 0000000000040441 R09: 0000000000000000
[59228.702058] R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffca086ac30
[59228.702340] R13: 0000000000005a23 R14: 00007ffca086e201 R15: 00007ffca086e398
[59228.702575] </TASK>

The server config:
The server is a Fujitsu PRIMERGY TX1320 M3
CPU: Intel(R) Xeon(R) CPU E3-1230 v6
RAM: 64 GB ECC RAM
Boot disks: 2x 500 GB SATA SSD in a ZFS mirror (connected to the mainboard)
Network: 10 GbE (Intel Corporation Ethernet 10G 2P X520 Adapter (rev 01))

The TrueNAS VM has a PCIe HBA passed through to it, which drives a 3-HDD RAIDZ1 pool; that is the pool I copy the data to.
The VM uses the virtual Ethernet connection provided by Proxmox.

What I find interesting is that the host's dmesg complains about ZFS timeouts, even though I am copying to a pool the host does not even know about (it lives entirely inside the VM on the passed-through HBA).
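
This suggests the stalls happen on the host's own pool (the boot SSD mirror) rather than on the VM's pool. A quick way to check that, assuming the host pool has the default name rpool, is to watch its per-vdev latencies while the transfer runs:

# on the Proxmox host: per-vdev request latencies of the boot pool, every 5 seconds;
# consistently high waits on one SSD would point at that disk
zpool iostat -vl rpool 5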

proxmox-ve: 8.2.0 (running kernel: 6.8.12-1-pve)
pve-manager: 8.2.4 (running version: 8.2.4/faa83925c9641325)
proxmox-kernel-helper: 8.1.0
pve-kernel-6.2: 8.0.5
proxmox-kernel-6.8: 6.8.12-1
proxmox-kernel-6.8.12-1-pve-signed: 6.8.12-1
proxmox-kernel-6.8.4-3-pve-signed: 6.8.4-3
proxmox-kernel-6.2.16-20-pve: 6.2.16-20
proxmox-kernel-6.2: 6.2.16-20
pve-kernel-6.2.16-3-pve: 6.2.16-3
ceph-fuse: 18.2.2-pve1
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx9
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-4
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.1
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.3
libpve-access-control: 8.1.4
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.0.7
libpve-cluster-perl: 8.0.7
libpve-common-perl: 8.2.2
libpve-guest-common-perl: 5.1.4
libpve-http-server-perl: 5.1.0
libpve-network-perl: 0.9.8
libpve-rs-perl: 0.8.9
libpve-storage-perl: 8.2.3
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.4.0-3
proxmox-backup-client: 3.2.7-1
proxmox-backup-file-restore: 3.2.7-1
proxmox-firewall: 0.5.0
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.2.3
proxmox-mini-journalreader: 1.4.0
proxmox-widget-toolkit: 4.2.3
pve-cluster: 8.0.7
pve-container: 5.1.12
pve-docs: 8.2.3
pve-edk2-firmware: 4.2023.08-4
pve-esxi-import-tools: 0.7.1
pve-firewall: 5.0.7
pve-firmware: 3.13-1
pve-ha-manager: 4.0.5
pve-i18n: 3.2.2
pve-qemu-kvm: 9.0.2-2
pve-xtermjs: 5.3.0-3
qemu-server: 8.2.4
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.4-pve1
 
I think I found the problem: it was indeed the IO delay.
After removing the Crucial BX500 from the pool, the crashes were gone.

Since replacing both boot drives with used enterprise SSDs, everything has been stable.
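
In case someone wants to do the same swap, here is a rough sketch of replacing one disk of the ZFS boot mirror on Proxmox (the by-id paths are placeholders for the old and new SSD; do one disk at a time and let the resilver finish in between):

# copy the partition layout from the old SSD to the new one and randomize its GUIDs
sgdisk /dev/disk/by-id/ata-OLD_SSD -R /dev/disk/by-id/ata-NEW_SSD
sgdisk -G /dev/disk/by-id/ata-NEW_SSD
# make the new disk bootable via proxmox-boot-tool (part2 is the ESP on a standard install)
proxmox-boot-tool format /dev/disk/by-id/ata-NEW_SSD-part2
proxmox-boot-tool init /dev/disk/by-id/ata-NEW_SSD-part2
# swap the ZFS partition (part3 on a standard install) and wait for the resilver
zpool replace -f rpool ata-OLD_SSD-part3 /dev/disk/by-id/ata-NEW_SSD-part3
zpool status rpool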
 