Replication crash

cbx

Active Member
Mar 2, 2012
45
1
28
Hello

I have a cluster with 4 servers: 3 with 1 VM each, and none on the last server. I configured replication of each VM onto the empty last server every 30 minutes.
At 21:00 today, I reviewed the state of all of them and saw that one still showed its last task at 05:30... the others had correctly run their task at 21:00, with the next task at 21:30.
I looked at the server that seemed to have a problem and saw that pvesr had crashed:


Oct 21 06:03:49 b7 kernel: [256037.384220] pvesr D 0 6783 1 0x00000000
Oct 21 06:03:49 b7 kernel: [256037.384224] Call Trace:
Oct 21 06:03:49 b7 kernel: [256037.384234] __schedule+0x233/0x6f0
Oct 21 06:03:49 b7 kernel: [256037.384239] ? kmem_cache_alloc_node+0x11d/0x1b0
Oct 21 06:03:49 b7 kernel: [256037.384242] ? alloc_request_struct+0x19/0x20
Oct 21 06:03:49 b7 kernel: [256037.384245] schedule+0x36/0x80
Oct 21 06:03:49 b7 kernel: [256037.384247] schedule_timeout+0x22a/0x3f0
Oct 21 06:03:49 b7 kernel: [256037.384250] ? cpumask_next_and+0x2d/0x50
Oct 21 06:03:49 b7 kernel: [256037.384253] ? update_sd_lb_stats+0x108/0x540
Oct 21 06:03:49 b7 kernel: [256037.384256] ? ktime_get+0x41/0xb0
Oct 21 06:03:49 b7 kernel: [256037.384258] io_schedule_timeout+0xa4/0x110
Oct 21 06:03:49 b7 kernel: [256037.384262] __lock_page+0x10d/0x150
Oct 21 06:03:49 b7 kernel: [256037.384264] ? unlock_page+0x30/0x30
Oct 21 06:03:49 b7 kernel: [256037.384266] pagecache_get_page+0x19f/0x2a0
Oct 21 06:03:49 b7 kernel: [256037.384269] shmem_unused_huge_shrink+0x214/0x3b0
Oct 21 06:03:49 b7 kernel: [256037.384272] shmem_unused_huge_scan+0x20/0x30
Oct 21 06:03:49 b7 kernel: [256037.384275] super_cache_scan+0x190/0x1a0
Oct 21 06:03:49 b7 kernel: [256037.384278] shrink_slab.part.40+0x1f5/0x420
Oct 21 06:03:49 b7 kernel: [256037.384281] shrink_slab+0x29/0x30
Oct 21 06:03:49 b7 kernel: [256037.384283] shrink_node+0x108/0x320

But the server was otherwise still working normally at that point. Since pvesr was stuck in the D state, I decided to kill the task, and then both the host and the VM went down... (...blocked for more than 120 seconds.)
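For anyone hitting the same thing: a quick hedged sketch of how to spot tasks stuck like this. A process in state "D" (uninterruptible sleep) is blocked inside the kernel waiting on I/O, so SIGKILL will not take effect until the blocking operation completes; the "blocked for more than 120 seconds" message comes from the kernel hung-task watchdog.

```shell
# List processes currently in uninterruptible sleep (state "D"),
# such as the stuck pvesr task above.
ps -eo pid,stat,comm | awk '$2 ~ /^D/ {print}'

# The threshold behind the "blocked for more than 120 seconds"
# warning can be inspected here (seconds; 0 disables the check):
cat /proc/sys/kernel/hung_task_timeout_secs
```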

Is this a bug?
 

wolfgang

Proxmox Staff Member
Staff member
Oct 1, 2014
6,195
422
103
Hi,

I have never seen this before.
Please send the output of pveversion -v.
Are all nodes on the same version level?
 

cbx

Active Member
Mar 2, 2012
45
1
28
Yes, the 3 nodes were installed a few days ago:

pveversion -v
proxmox-ve: 5.0-19 (running kernel: 4.10.17-2-pve)
pve-manager: 5.0-30 (running version: 5.0-30/5ab26bc)
pve-kernel-4.10.17-2-pve: 4.10.17-19
libpve-http-server-perl: 2.0-6
lvm2: 2.02.168-pve3
corosync: 2.4.2-pve3
libqb0: 1.0.1-1
pve-cluster: 5.0-12
qemu-server: 5.0-15
pve-firmware: 2.0-2
libpve-common-perl: 5.0-16
libpve-guest-common-perl: 2.0-11
libpve-access-control: 5.0-6
libpve-storage-perl: 5.0-14
pve-libspice-server1: 0.12.8-3
vncterm: 1.5-2
pve-docs: 5.0-9
pve-qemu-kvm: 2.9.0-3
pve-container: 2.0-15
pve-firewall: 3.0-2
pve-ha-manager: 2.0-2
ksm-control-daemon: 1.2-2
glusterfs-client: 3.8.8-1
lxc-pve: 2.0.8-3
lxcfs: 2.0.7-pve4
criu: 2.11.1-1~bpo90
novnc-pve: 0.6-4
smartmontools: 6.5+svn4324-1
zfsutils-linux: 0.6.5.9-pve16~bpo90

I have added

vm.dirty_background_bytes = 0
vm.dirty_bytes = 0
vm.dirty_expire_centisecs = 3000
vm.dirty_writeback_centisecs = 500
vm.dirty_ratio = 10
vm.dirty_background_ratio = 5

in /etc/sysctl.conf, as I have read that this can solve this type of problem (I can't confirm it yet)...
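A small sketch of how to apply those settings without rebooting and check the result. Note that setting vm.dirty_bytes and vm.dirty_background_bytes to 0 simply tells the kernel to use the *_ratio counterparts instead (the byte and ratio forms are mutually exclusive).

```shell
# Reload /etc/sysctl.conf immediately (otherwise the settings only
# take effect at the next boot). Requires root.
sysctl -p

# Verify the running values; 0 in the *_bytes knobs means the
# kernel falls back to the percentage-based *_ratio tunables.
cat /proc/sys/vm/dirty_ratio
cat /proc/sys/vm/dirty_background_ratio
```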
 

wolfgang

Proxmox Staff Member
Staff member
Oct 1, 2014
6,195
422
103
Can you please tell us where you found this workaround?
A link would be nice.
 
