Replication crash

cbx

Active Member
Mar 2, 2012
45
1
28
Hello

I have a cluster with 4 servers: 3 with 1 VM each, and none on the last server. I configured replication of each VM onto the empty last server every 30 minutes.
At 21:00 today, I reviewed the state of all of them and saw that one still showed its last task at 05:30... the others had correctly run their task at 21:00, with the next task at 21:30.
I looked at the server that seemed to have a problem and saw that pvesr had crashed:


Oct 21 06:03:49 b7 kernel: [256037.384220] pvesr D 0 6783 1 0x00000000
Oct 21 06:03:49 b7 kernel: [256037.384224] Call Trace:
Oct 21 06:03:49 b7 kernel: [256037.384234] __schedule+0x233/0x6f0
Oct 21 06:03:49 b7 kernel: [256037.384239] ? kmem_cache_alloc_node+0x11d/0x1b0
Oct 21 06:03:49 b7 kernel: [256037.384242] ? alloc_request_struct+0x19/0x20
Oct 21 06:03:49 b7 kernel: [256037.384245] schedule+0x36/0x80
Oct 21 06:03:49 b7 kernel: [256037.384247] schedule_timeout+0x22a/0x3f0
Oct 21 06:03:49 b7 kernel: [256037.384250] ? cpumask_next_and+0x2d/0x50
Oct 21 06:03:49 b7 kernel: [256037.384253] ? update_sd_lb_stats+0x108/0x540
Oct 21 06:03:49 b7 kernel: [256037.384256] ? ktime_get+0x41/0xb0
Oct 21 06:03:49 b7 kernel: [256037.384258] io_schedule_timeout+0xa4/0x110
Oct 21 06:03:49 b7 kernel: [256037.384262] __lock_page+0x10d/0x150
Oct 21 06:03:49 b7 kernel: [256037.384264] ? unlock_page+0x30/0x30
Oct 21 06:03:49 b7 kernel: [256037.384266] pagecache_get_page+0x19f/0x2a0
Oct 21 06:03:49 b7 kernel: [256037.384269] shmem_unused_huge_shrink+0x214/0x3b0
Oct 21 06:03:49 b7 kernel: [256037.384272] shmem_unused_huge_scan+0x20/0x30
Oct 21 06:03:49 b7 kernel: [256037.384275] super_cache_scan+0x190/0x1a0
Oct 21 06:03:49 b7 kernel: [256037.384278] shrink_slab.part.40+0x1f5/0x420
Oct 21 06:03:49 b7 kernel: [256037.384281] shrink_slab+0x29/0x30
Oct 21 06:03:49 b7 kernel: [256037.384283] shrink_node+0x108/0x320

But the server was otherwise still working normally at that point. Since pvesr was stuck in the D state, I decided to kill the task, and then both the host and the VM went down... (...blocked for more than 120 seconds.)
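For anyone hitting the same thing: a quick hedged sketch of how to spot tasks stuck like this. A process in state "D" (uninterruptible sleep) is blocked inside the kernel waiting on I/O, so SIGKILL will not take effect until the blocking operation completes; the "blocked for more than 120 seconds" message comes from the kernel hung-task watchdog.

```shell
# List processes currently in uninterruptible sleep (state "D"),
# such as the stuck pvesr task above.
ps -eo pid,stat,comm | awk '$2 ~ /^D/ {print}'

# The threshold behind the "blocked for more than 120 seconds"
# warning can be inspected here (seconds; 0 disables the check):
cat /proc/sys/kernel/hung_task_timeout_secs
```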

Is this a bug?
 

wolfgang

Proxmox Staff Member
Staff member
Oct 1, 2014
6,195
422
103
Hi,

I have never seen this before.
Please send the output of pveversion -v.
Are all nodes on the same version level?
 

cbx

Active Member
Mar 2, 2012
45
1
28
Yes, the 3 nodes were installed a few days ago:

pveversion -v
proxmox-ve: 5.0-19 (running kernel: 4.10.17-2-pve)
pve-manager: 5.0-30 (running version: 5.0-30/5ab26bc)
pve-kernel-4.10.17-2-pve: 4.10.17-19
libpve-http-server-perl: 2.0-6
lvm2: 2.02.168-pve3
corosync: 2.4.2-pve3
libqb0: 1.0.1-1
pve-cluster: 5.0-12
qemu-server: 5.0-15
pve-firmware: 2.0-2
libpve-common-perl: 5.0-16
libpve-guest-common-perl: 2.0-11
libpve-access-control: 5.0-6
libpve-storage-perl: 5.0-14
pve-libspice-server1: 0.12.8-3
vncterm: 1.5-2
pve-docs: 5.0-9
pve-qemu-kvm: 2.9.0-3
pve-container: 2.0-15
pve-firewall: 3.0-2
pve-ha-manager: 2.0-2
ksm-control-daemon: 1.2-2
glusterfs-client: 3.8.8-1
lxc-pve: 2.0.8-3
lxcfs: 2.0.7-pve4
criu: 2.11.1-1~bpo90
novnc-pve: 0.6-4
smartmontools: 6.5+svn4324-1
zfsutils-linux: 0.6.5.9-pve16~bpo90

I have added

vm.dirty_background_bytes = 0
vm.dirty_bytes = 0
vm.dirty_expire_centisecs = 3000
vm.dirty_writeback_centisecs = 500
vm.dirty_ratio = 10
vm.dirty_background_ratio = 5

in /etc/sysctl.conf, as I have read that this can solve this type of problem (I can't confirm it yet)...
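A small sketch of how to apply those settings without rebooting and check the result. Note that setting vm.dirty_bytes and vm.dirty_background_bytes to 0 simply tells the kernel to use the *_ratio counterparts instead (the byte and ratio forms are mutually exclusive).

```shell
# Reload /etc/sysctl.conf immediately (otherwise the settings only
# take effect at the next boot). Requires root.
sysctl -p

# Verify the running values; 0 in the *_bytes knobs means the
# kernel falls back to the percentage-based *_ratio tunables.
cat /proc/sys/vm/dirty_ratio
cat /proc/sys/vm/dirty_background_ratio
```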
 

wolfgang

Proxmox Staff Member
Staff member
Oct 1, 2014
6,195
422
103
Can you please tell us where you found this workaround?
A link would be nice.
 
