Severe instability problems after upgrade to 6.1

I have been experiencing daily GPFs since upgrading a stable platform to 6.1. Once or twice per day, the machine goes into a series of soft lockups that can only be recovered by power cycling the box. The problem appears to be ZFS related, as it is usually triggered immediately after a znapzend backup is initiated; however, multiple backups happen hourly and failures occur only intermittently. If I disable the backups, lockups still happen, but marginally less frequently.

I have set all the schedulers to 'none' but am still seeing exactly the same symptoms.

Currently on 6.1-8

The failures start like this and degrade over a short period until the journal gets trashed:

Mar 25 16:00:27 newton-iii znapzend[37524]: sending snapshots from wiz/vm-100-disk-2 to rpool/zbackups/newton-iii/wiz/vm-100-disk-2
Mar 25 16:00:27 newton-iii zed[43029]: eid=2296 class=history_event pool_guid=0x506DF3B09AC22638
Mar 25 16:00:27 newton-iii znapzend[14227]: mbuffer: warning: HOME environment variable not set - unable to find defaults file
Mar 25 16:00:27 newton-iii zed[43268]: eid=2297 class=history_event pool_guid=0x506DF3B09AC22638
Mar 25 16:00:27 newton-iii znapzend[37394]: cleaning up snapshots on rpool/zbackups/newton-iii/hum
Mar 25 16:00:27 newton-iii znapzend[37394]: cleaning up snapshots on rpool/zbackups/newton-iii/hum/vm-101-disk-1
Mar 25 16:00:27 newton-iii zed[43648]: eid=2298 class=history_event pool_guid=0x506DF3B09AC22638
Mar 25 16:00:27 newton-iii znapzend[37394]: cleaning up snapshots on rpool/zbackups/newton-iii/hum/vm-101-disk-2
Mar 25 16:00:27 newton-iii znapzend[37394]: sending snapshots from hum to root@delozier-v.bootstrap.je:sero/zbackups/newton-iii/hum
Mar 25 16:00:27 newton-iii kernel: general protection fault: 0000 [#1] SMP NOPTI
Mar 25 16:00:27 newton-iii kernel: CPU: 29 PID: 43652 Comm: receive_writer Tainted: P O 5.3.18-2-pve #1
Mar 25 16:00:27 newton-iii kernel: Hardware name: Supermicro Super Server/X10DRG-Q, BIOS 2.0a 08/29/2016
Mar 25 16:00:27 newton-iii kernel: RIP: 0010:__kmalloc_node+0x19f/0x320
Mar 25 16:00:27 newton-iii kernel: Code: 47 0b 04 0f 84 f4 fe ff ff 4c 89 ff e8 fa cc 01 00 49 89 c1 e9 e4 fe ff ff 41 8b 59 20 49 8b 39 48 8d 4a 01 4c 89 d0 4c 01 d3 <48> 33 1b 49 33 99 70 01 00 00 65 48 0f c7 0f 0f 94 c0 84 c0 0f 84
Mar 25 16:00:27 newton-iii kernel: RSP: 0018:ffffb7bcdde77bf8 EFLAGS: 00010202
Mar 25 16:00:27 newton-iii kernel: RAX: 606a15974a93ccd9 RBX: 606a15974a93ccd9 RCX: 00000000000df02e
Mar 25 16:00:27 newton-iii kernel: RDX: 00000000000df02d RSI: 0000000000042c00 RDI: 000000000002f040
Mar 25 16:00:27 newton-iii kernel: RBP: ffffb7bcdde77c38 R08: ffff8b97ffc6f040 R09: ffff8b97ff407b80
Mar 25 16:00:27 newton-iii kernel: R10: 606a15974a93ccd9 R11: ffff8b7dcbdfeae0 R12: 0000000000042c00
Mar 25 16:00:27 newton-iii kernel: R13: 0000000000000001 R14: 00000000ffffffff R15: ffff8b97ff407b80
Mar 25 16:00:27 newton-iii kernel: FS: 0000000000000000(0000) GS:ffff8b97ffc40000(0000) knlGS:0000000000000000
Mar 25 16:00:27 newton-iii kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar 25 16:00:27 newton-iii kernel: CR2: 00007f5ac65f99a0 CR3: 000000357e80a004 CR4: 00000000003626e0
Mar 25 16:00:27 newton-iii kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Mar 25 16:00:27 newton-iii kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Mar 25 16:00:27 newton-iii kernel: Call Trace:
Mar 25 16:00:27 newton-iii kernel: ? spl_kmem_alloc+0xec/0x140 [spl]
Mar 25 16:00:27 newton-iii kernel: spl_kmem_alloc+0xec/0x140 [spl]
Mar 25 16:00:27 newton-iii kernel: dbuf_dirty+0x107/0x830 [zfs]
Mar 25 16:00:27 newton-iii kernel: dmu_buf_will_dirty_impl+0x11a/0x130 [zfs]
Mar 25 16:00:27 newton-iii kernel: dmu_buf_will_dirty+0x16/0x20 [zfs]
Mar 25 16:00:27 newton-iii kernel: receive_object+0x56e/0xbc0 [zfs]
Mar 25 16:00:27 newton-iii kernel: ? dnode_rele+0x3b/0x40 [zfs]
Mar 25 16:00:27 newton-iii kernel: ? _cond_resched+0x19/0x30
Mar 25 16:00:27 newton-iii kernel: ? mutex_lock+0x12/0x30
Mar 25 16:00:27 newton-iii kernel: receive_writer_thread+0x201/0xb80 [zfs]
Mar 25 16:00:27 newton-iii kernel: ? set_curr_task_fair+0x2b/0x60
Mar 25 16:00:27 newton-iii kernel: ? spl_kmem_free+0x33/0x40 [spl]
Mar 25 16:00:27 newton-iii kernel: ? receive_read_prefetch+0x140/0x140 [zfs]
Mar 25 16:00:27 newton-iii kernel: thread_generic_wrapper+0x74/0x90 [spl]
Mar 25 16:00:27 newton-iii kernel: ? receive_read_prefetch+0x140/0x140 [zfs]
Mar 25 16:00:27 newton-iii kernel: ? thread_generic_wrapper+0x74/0x90 [spl]
Mar 25 16:00:27 newton-iii kernel: kthread+0x120/0x140
Mar 25 16:00:27 newton-iii kernel: ? __thread_exit+0x20/0x20 [spl]
Mar 25 16:00:27 newton-iii kernel: ? __kthread_parkme+0x70/0x70
Mar 25 16:00:27 newton-iii kernel: ret_from_fork+0x35/0x40
Mar 25 16:00:27 newton-iii kernel: Modules linked in: veth ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter bpfilter softdog nfnetlink_log nfnetlink ast snd_hda_codec_realtek drm_vram_helper intel_rapl_msr snd_hda_codec_generic ledtrig_audio intel_rapl_common ttm sb_edac snd_hda_intel x86_pkg_temp_thermal snd_intel_nhlt intel_powerclamp drm_kms_helper snd_hda_codec uvcvideo coretemp kvm_intel snd_usb_audio videobuf2_vmalloc snd_hda_core videobuf2_memops snd_
Mar 25 16:00:27 newton-iii kernel: mac_hid ipmi_msghandler acpi_power_meter acpi_pad vhost_net vhost tap ib_iser rdma_cm iw_cm ib_cm iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi vfio_pci vfio_virqfd irqbypass vfio_iommu_type1 vfio sunrpc ip_tables x_tables autofs4 zfs(PO) zunicode(PO) zlua(PO) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) btrfs xor zstd_compress raid6_pq libcrc32c mlx4_ib ib_uverbs ib_core mlx4_en hid_logitech_hidpp hid_apple hid_logitech_dj usbmouse usbkbd hid_generic usbhid hi
Mar 25 16:00:27 newton-iii kernel: ---[ end trace f24480b659ccd02e ]---
Mar 25 16:00:27 newton-iii kernel: RIP: 0010:__kmalloc_node+0x19f/0x320
Mar 25 16:00:27 newton-iii kernel: Code: 47 0b 04 0f 84 f4 fe ff ff 4c 89 ff e8 fa cc 01 00 49 89 c1 e9 e4 fe ff ff 41 8b 59 20 49 8b 39 48 8d 4a 01 4c 89 d0 4c 01 d3 <48> 33 1b 49 33 99 70 01 00 00 65 48 0f c7 0f 0f 94 c0 84 c0 0f 84
Mar 25 16:00:27 newton-iii kernel: RSP: 0018:ffffb7bcdde77bf8 EFLAGS: 00010202
Mar 25 16:00:27 newton-iii kernel: RAX: 606a15974a93ccd9 RBX: 606a15974a93ccd9 RCX: 00000000000df02e
Mar 25 16:00:27 newton-iii kernel: RDX: 00000000000df02d RSI: 0000000000042c00 RDI: 000000000002f040
Mar 25 16:00:27 newton-iii kernel: RBP: ffffb7bcdde77c38 R08: ffff8b97ffc6f040 R09: ffff8b97ff407b80
Mar 25 16:00:27 newton-iii kernel: R10: 606a15974a93ccd9 R11: ffff8b7dcbdfeae0 R12: 0000000000042c00
Mar 25 16:00:27 newton-iii kernel: R13: 0000000000000001 R14: 00000000ffffffff R15: ffff8b97ff407b80
Mar 25 16:00:27 newton-iii kernel: FS: 0000000000000000(0000) GS:ffff8b97ffc40000(0000) knlGS:0000000000000000
Mar 25 16:00:27 newton-iii kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar 25 16:00:27 newton-iii kernel: CR2: 00007f5ac65f99a0 CR3: 000000357e80a004 CR4: 00000000003626e0
Mar 25 16:00:27 newton-iii kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Mar 25 16:00:27 newton-iii kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Mar 25 16:00:27 newton-iii zed[43705]: eid=2299 class=history_event pool_guid=0x506DF3B09AC22638
Mar 25 16:00:28 newton-iii zed[43740]: eid=2300 class=history_event pool_guid=0x506DF3B09AC22638
 

t.lamprecht

Proxmox Staff Member
Hi,

The problem appears to be ZFS related as it is usually triggered immediately after a znapzend backup is initiated however multiple backups happen hourly and failures occur only intermittently. If I disable the backups, lockups still happen, but marginally less frequently
Can you check whether this also happens when taking snapshots from our side, or when using our replication? Just to be sure that znapzend isn't doing something odd that triggers this.
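For example, a snapshot can be taken through the Proxmox tooling with something like the following (the VMID and snapshot name here are just examples):

```
# take a snapshot of VM 100 via the Proxmox CLI instead of znapzend
qm snapshot 100 test-before-backup
# check the state of the built-in replication jobs, if any are configured
pvesr status
```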

While we can get a rough idea about the hardware from the GPF panic log, it would still be nice to know the CPU, disks, and ZFS setup details (root pool or extra pool, which RAID level, if any, ...).

A memory check is generally advisable too.

You could also try booting an older or newer kernel to check whether this is a regression in 5.3: the 5.0-based kernels are available as the older option, and 5.4 as the newer one.
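On Proxmox VE 6.x, switching kernel series is typically done by installing the corresponding meta package and selecting it at boot (package names below are the ones commonly used at the time; verify with `apt search pve-kernel`):

```
# opt-in newer kernel series
apt install pve-kernel-5.4
# or the older series, for comparison
apt install pve-kernel-5.0
# then reboot and pick the kernel from the GRUB "Advanced options" menu
```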
 
Not sure what you mean by "doing snapshots from our end"? Znapzend is just a Perl script that issues regular ZFS commands based on schedules stored as ZFS metadata, so these are just regular ZFS snapshots.
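For reference, the operations znapzend performs amount to plain snapshot/send/receive calls; something like the following sketch (dataset names are the ones from this thread, and any test should go against a throw-away target first):

```
# recursive snapshot of the NVMe pool
zfs snapshot -r wiz@manual-test
# replicate it to the backup dataset without mounting the result
zfs send -R wiz@manual-test | zfs receive -u rpool/zbackups/manual-test
# clean up the test snapshot afterwards
zfs destroy -r wiz@manual-test
```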

The setup is a workstation used to virtualise a desktop environment for development purposes. Physically, it is a
Supermicro Super Server/X10DRG-Q with two Xeon E5-2687W processors and 256GB RAM.

Root pool is spinning rust:

rpool         ONLINE       0     0     0
  mirror-0    ONLINE       0     0     0
    sdb2      ONLINE       0     0     0
    sda2      ONLINE       0     0     0


The primary VM uses a dedicated pool of four NVMe drives:

wiz           ONLINE       0     0     0
  mirror-0    ONLINE       0     0     0
    nvme-0    ONLINE       0     0     0
    nvme-1    ONLINE       0     0     0
  mirror-1    ONLINE       0     0     0
    nvme-2    ONLINE       0     0     0
    nvme-3    ONLINE       0     0     0


There is also a pool of SSDs used for a second environment that hasn't been running recently:

hum           ONLINE       0     0     0
  mirror-0    ONLINE       0     0     0
    I1        ONLINE       0     0     0
    I2        ONLINE       0     0     0
  mirror-1    ONLINE       0     0     0
    I3        ONLINE       0     0     0
    I4        ONLINE       0     0     0


The Znapzend config snapshots each pool every two hours. Usually, the system will run for most of a 24-hour period before crashing during one of the snapshot events, although I have seen it crash within an hour of a reboot.

I have tried switching to the 5.4 kernel and am seeing the same symptoms.

I ran a 24-hour memory test over the weekend with no issues.

During normal operations, there are only two VMs running, a small headless Linux install and a large VM with passthrough of a Radeon VII and a USB3 interface card. Hugepages are configured for the large VM.

ZFS schedulers are set to 'none':

cat /sys/block/sda/queue/scheduler
[none] mq-deadline
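Note that writing to the sysfs scheduler node is not persistent across reboots; a common way to make it stick is a udev rule along these lines (the filename is just a conventional example):

```
# /etc/udev/rules.d/60-ioscheduler.rules
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/scheduler}="none"
```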


During the course of this testing, the system crashed within a couple of hours of booting, with Znapzend disabled, at an arbitrary time unrelated to any scheduled activity, so I think that eliminates Znapzend as the source of the problem.

Here are the logs from that failure:


Mar 30 13:50:00 newton-iii systemd[1]: Starting Proxmox VE replication runner...
Mar 30 13:50:00 newton-iii systemd[1]: pvesr.service: Succeeded.
Mar 30 13:50:00 newton-iii systemd[1]: Started Proxmox VE replication runner.
Mar 30 13:51:00 newton-iii systemd[1]: Starting Proxmox VE replication runner...
Mar 30 13:51:00 newton-iii systemd[1]: pvesr.service: Succeeded.
Mar 30 13:51:00 newton-iii systemd[1]: Started Proxmox VE replication runner.
Mar 30 13:52:00 newton-iii systemd[1]: Starting Proxmox VE replication runner...
Mar 30 13:52:00 newton-iii systemd[1]: pvesr.service: Succeeded.
Mar 30 13:52:00 newton-iii systemd[1]: Started Proxmox VE replication runner.
Mar 30 13:53:00 newton-iii systemd[1]: Starting Proxmox VE replication runner...
Mar 30 13:53:00 newton-iii systemd[1]: pvesr.service: Succeeded.
Mar 30 13:53:00 newton-iii systemd[1]: Started Proxmox VE replication runner.
Mar 30 13:54:00 newton-iii systemd[1]: Starting Proxmox VE replication runner...
Mar 30 13:54:00 newton-iii systemd[1]: pvesr.service: Succeeded.
Mar 30 13:54:00 newton-iii systemd[1]: Started Proxmox VE replication runner.
Mar 30 13:55:00 newton-iii systemd[1]: Starting Proxmox VE replication runner...
Mar 30 13:55:00 newton-iii systemd[1]: pvesr.service: Succeeded.
Mar 30 13:55:00 newton-iii systemd[1]: Started Proxmox VE replication runner.
Mar 30 13:56:00 newton-iii systemd[1]: Starting Proxmox VE replication runner...
Mar 30 13:56:00 newton-iii systemd[1]: pvesr.service: Succeeded.
Mar 30 13:56:00 newton-iii systemd[1]: Started Proxmox VE replication runner.
Mar 30 13:56:02 newton-iii kernel: general protection fault: 0000 [#1] SMP NOPTI
Mar 30 13:56:02 newton-iii kernel: CPU: 33 PID: 1927 Comm: zvol Tainted: P OE 5.4.24-1-pve #1
Mar 30 13:56:02 newton-iii kernel: Hardware name: Supermicro Super Server/X10DRG-Q, BIOS 2.0a 08/29/2016
Mar 30 13:56:02 newton-iii kernel: RIP: 0010:__kmalloc_node+0x19f/0x320
Mar 30 13:56:02 newton-iii kernel: Code: 47 0b 04 0f 84 f4 fe ff ff 4c 89 ff e8 6a f3 01 00 49 89 c1 e9 e4 fe ff ff 41 8b 59 20 49 8b 39 48 8d 4a 01 4c 89 d0 4c 01 d3 <48> 33 1b 49 33 99 70 01 00 00 65 48 0f c7 0f 0f 94 c0 84 c0 0f 84
Mar 30 13:56:02 newton-iii kernel: RSP: 0018:ffffb5b38e3e3be8 EFLAGS: 00010206
Mar 30 13:56:02 newton-iii kernel: RAX: 1287a8a7db0f8178 RBX: 1287a8a7db0f8178 RCX: 000000000004748b
Mar 30 13:56:02 newton-iii kernel: RDX: 000000000004748a RSI: 0000000000042d00 RDI: 0000000000030040
Mar 30 13:56:02 newton-iii kernel: RBP: ffffb5b38e3e3c28 R08: ffff9e9b3fd70040 R09: ffff9e9b3f407b80
Mar 30 13:56:02 newton-iii kernel: R10: 1287a8a7db0f8178 R11: 000000000000001a R12: 0000000000042d00
Mar 30 13:56:02 newton-iii kernel: R13: 0000000000000008 R14: 00000000ffffffff R15: ffff9e9b3f407b80
Mar 30 13:56:02 newton-iii kernel: FS: 0000000000000000(0000) GS:ffff9e9b3fd40000(0000) knlGS:0000000000000000
Mar 30 13:56:02 newton-iii kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar 30 13:56:02 newton-iii kernel: CR2: 000000012c75a000 CR3: 00000034f1e0a005 CR4: 00000000003626e0
Mar 30 13:56:02 newton-iii kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Mar 30 13:56:02 newton-iii kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Mar 30 13:56:02 newton-iii kernel: Call Trace:
Mar 30 13:56:02 newton-iii kernel: ? spl_kmem_zalloc+0xe9/0x140 [spl]
Mar 30 13:56:02 newton-iii kernel: spl_kmem_zalloc+0xe9/0x140 [spl]
Mar 30 13:56:02 newton-iii kernel: dmu_buf_hold_array_by_dnode+0x84/0x480 [zfs]
Mar 30 13:56:02 newton-iii kernel: ? __switch_to_asm+0x40/0x70
Mar 30 13:56:02 newton-iii kernel: ? __switch_to_asm+0x34/0x70
Mar 30 13:56:02 newton-iii kernel: ? __switch_to_asm+0x40/0x70
Mar 30 13:56:02 newton-iii kernel: dmu_read_uio_dnode+0x49/0xf0 [zfs]
Mar 30 13:56:02 newton-iii kernel: ? generic_start_io_acct+0x101/0x120
Mar 30 13:56:02 newton-iii kernel: zvol_read+0x101/0x2d0 [zfs]
Mar 30 13:56:02 newton-iii kernel: taskq_thread+0x2ec/0x4d0 [spl]
Mar 30 13:56:02 newton-iii kernel: ? wake_up_q+0x80/0x80
Mar 30 13:56:02 newton-iii kernel: kthread+0x120/0x140
Mar 30 13:56:02 newton-iii kernel: ? task_done+0xb0/0xb0 [spl]
Mar 30 13:56:02 newton-iii kernel: ? kthread_park+0x90/0x90
Mar 30 13:56:02 newton-iii kernel: ret_from_fork+0x35/0x40
Mar 30 13:56:02 newton-iii kernel: Modules linked in: veth(E) ebtable_filter(E) ebtables(E) ip_set(E) ip6table_raw(E) iptable_raw(E) ip6table_filter(E) ip6_tables(E) iptable_filter(E) bpfilter(E) softdog(E) nfnetlink_log(E) nfnetlink(E) intel_rapl_msr(E) intel_rapl_common(E) sb_edac(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) coretemp(E) kvm_intel(E) snd_hda_codec_realtek(E) snd_hda_codec_generic(E) ledtrig_audio(E) snd_hda_codec_hdmi(E) kvm(E) snd_hda_intel(E) snd_intel_nhlt(E) snd_usb_audio(E) snd_hda
Mar 30 13:56:02 newton-iii kernel: ecdh_generic(E) cryptd(E) glue_helper(E) snd(E) usblp(E) ecc(E) cdc_acm(E) mei_me(E) soundcore(E) hid_magicmouse(E) mei(E) joydev(E) intel_cstate(E) ipmi_ssif(E) input_leds(E) intel_rapl_perf(E) ioatdma(E) pcspkr(E) intel_wmi_thunderbolt(E) mxm_wmi(E) mac_hid(E) ipmi_si(E) ipmi_devintf(E) ipmi_msghandler(E) acpi_power_meter(E) acpi_pad(E) vhost_net(E) vhost(E) tap(E) ib_iser(E) rdma_cm(E) iw_cm(E) ib_cm(E) iscsi_tcp(E) libiscsi_tcp(E) libiscsi(E) scsi_transport_iscsi(E)
Mar 30 13:56:02 newton-iii kernel: ---[ end trace cef9611b81aca935 ]---
Mar 30 13:56:02 newton-iii kernel: RIP: 0010:__kmalloc_node+0x19f/0x320
Mar 30 13:56:02 newton-iii kernel: Code: 47 0b 04 0f 84 f4 fe ff ff 4c 89 ff e8 6a f3 01 00 49 89 c1 e9 e4 fe ff ff 41 8b 59 20 49 8b 39 48 8d 4a 01 4c 89 d0 4c 01 d3 <48> 33 1b 49 33 99 70 01 00 00 65 48 0f c7 0f 0f 94 c0 84 c0 0f 84
Mar 30 13:56:02 newton-iii kernel: RSP: 0018:ffffb5b38e3e3be8 EFLAGS: 00010206
Mar 30 13:56:02 newton-iii kernel: RAX: 1287a8a7db0f8178 RBX: 1287a8a7db0f8178 RCX: 000000000004748b
Mar 30 13:56:02 newton-iii kernel: RDX: 000000000004748a RSI: 0000000000042d00 RDI: 0000000000030040
Mar 30 13:56:02 newton-iii kernel: RBP: ffffb5b38e3e3c28 R08: ffff9e9b3fd70040 R09: ffff9e9b3f407b80
Mar 30 13:56:02 newton-iii kernel: R10: 1287a8a7db0f8178 R11: 000000000000001a R12: 0000000000042d00
Mar 30 13:56:02 newton-iii kernel: R13: 0000000000000008 R14: 00000000ffffffff R15: ffff9e9b3f407b80
Mar 30 13:56:02 newton-iii kernel: FS: 0000000000000000(0000) GS:ffff9e9b3fd40000(0000) knlGS:0000000000000000
Mar 30 13:56:02 newton-iii kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar 30 13:56:02 newton-iii kernel: CR2: 000000012c75a000 CR3: 00000034f1e0a005 CR4: 00000000003626e0
Mar 30 13:56:02 newton-iii kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Mar 30 13:56:02 newton-iii kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400