PVE crash ZFS I/O infinity wait

Stereo973 · Mar 24, 2020

Hello everyone!
I hope you are doing well!
First, sorry for my bad English!

PVE cluster with 3 nodes: ZFS mirror (RAID1) storage on NVMe, Intel Xeon E-2288G CPU, 128 GB DDR4 ECC RAM. One node is the master with 16 LXC containers, one is the slave, and the last is used only for quorum. The network links are redundant.
One of the nodes (the master) ran into a storage problem this weekend.
On Saturday morning I saw that the node was showing an I/O delay of 99% in the PVE panel!
The server could not read from or write to the ZFS pool, so the services in all my containers were running only from code already cached in RAM.
I hard-restarted the server and all services came back up normally, without any errors.
I checked all the logs. None of them show any errors, but they all stop at 00:50:51, when the crash began, and nothing more was written until I hard-rebooted the server at 11:43:11.
My monitoring system (Observium) shows only 100% CPU usage for the whole night and no storage I/O activity at all.
After that I migrated all LXC containers to the slave node and ran all the hardware tests on the master; no problem was found.

It feels like the ZFS storage crashed, or something like that.

I need your help. Has anyone run into this problem before?
Thank you very much and have a good day!

oguz · Mar 24, 2020

hi,

PhonaVibe said:
I checked all the logs. None of them show any errors, but they all stop at 00:50:51, when the crash began, and nothing more was written until I hard-rebooted the server at 11:43:11.

can you post that here, just in case you missed something? (remove sensitive info like IP, etc.). you can use [code][/code] tags

PhonaVibe said:
It feels like the ZFS storage crashed, or something like that.

health of zfs pools? maybe you can find something there. zfs commands will be useful in diagnosing any issue.

also, what is the output of pveversion -v and pveperf?
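
If the full log is too long to paste, something like the following could narrow it down to the kernel messages. This is only a sketch: it assumes the stock Debian syslog line format ("Mon DD HH:MM:SS host kernel: ..."), and kernel_lines is just a name picked for this example, not an existing tool.

```shell
# Print only kernel messages from a syslog-format file
# (reads stdin when no file is given).
# Assumed line format: "Mar 21 00:49:48 pve1 kernel: [3580984.805694] ..."
kernel_lines() {
    grep -E '^[A-Z][a-z]{2} +[0-9]+ [0-9:]{8} [^ ]+ kernel: ' "$@"
}

# Example (the path is an assumption; use the rotated file that covers the crash):
#   kernel_lines /var/log/syslog | tail -n 100
```

Piping through tail keeps the paste short while still showing the lines closest to the hang.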

Stereo973 · Mar 24, 2020

Hi!
Thanks for your reply!

oguz said:
can you post that here, just in case you missed something? (remove sensitive info like IP, etc.). you can use [code][/code] tags

Last lines of syslog:

Code:

Mar 21 00:49:37 pve1 systemd[1]: Started Proxmox VE replication runner.
Mar 21 00:49:41 pve1 rsync_pvebck.sh[13880]: var/lib/asterisk/
Mar 21 00:49:41 pve1 rsync_pvebck.sh[13880]: var/lib/asterisk/astdb.sqlite3
Mar 21 00:49:42 pve1 rsync_pvebck.sh[13880]: sent 11,415,143 bytes  received 57,275 bytes  1,764,987.38 bytes/sec
Mar 21 00:49:42 pve1 rsync_pvebck.sh[13880]: total size is 16,624,157,248  speedup is 1,449.05
Mar 21 00:49:42 pve1 rsync_pvebck.sh[1880]: sending incremental file list
Mar 21 00:49:48 pve1 kernel: [3580984.794922] RBP: ffffb1686a193810 R08: 00000000ffffffff R09: ffff9f88f70b6022
Mar 21 00:49:48 pve1 kernel: [3580984.805694]  zap_cursor_retrieve+0x159/0x2f0 [zfs]
Mar 21 00:49:48 pve1 kernel: [3580984.810829]  do_syscall_64+0x5a/0x130
Mar 21 00:49:48 pve1 kernel: [3580984.866040] Code: c3 08 48 83 c2 08 48 89 43 f8 48 01 df 49 39 fb 72 2e 48 39 fb 72 13 4d 39 ca 0f 87 4b ff ff ff 48 29 f7 89 f8 c1 e8 1f eb 56 <48> 8b 02 48 83 c3 08 48 83 c2 08 48 89 43 f8 48 39 df 77 ec eb d7
Mar 21 00:49:48 pve1 rsync_pvebck.sh[13880]: /root/rsync_pvebck.sh: line 9: 19863 Killed                  rsync -avz --delete /rpool/data/subvol-200-disk-0/ root@pvebck.codeactive.net:/rpool/data/subvol-200-disk-0
Mar 21 00:49:48 pve1 rsync_pvebck.sh[13880]: sending incremental file list
Mar 21 00:49:48 pve1 rsync_pvebck.sh[13880]: sent 632,576 bytes  received 2,677 bytes  1,270,506.00 bytes/sec
Mar 21 00:49:48 pve1 rsync_pvebck.sh[13880]: total size is 1,894,047,159  speedup is 2,981.56
Mar 21 00:49:49 pve1 rsync_pvebck.sh[13880]: sending incremental file list
Mar 21 00:49:49 pve1 rsync_pvebck.sh[13880]: var/log/kern.log
Mar 21 00:49:49 pve1 rsync_pvebck.sh[13880]: var/log/syslog
Mar 21 00:49:49 pve1 rsync_pvebck.sh[13880]: sent 552,785 bytes  received 2,281 bytes  370,044.00 bytes/sec
Mar 21 00:49:49 pve1 rsync_pvebck.sh[13880]: total size is 1,223,994,090  speedup is 2,205.13
Mar 21 00:50:00 pve1 systemd[1]: Starting Proxmox VE replication runner...
Mar 21 00:50:01 pve1 snmpd[2065]: error on subcontainer 'ia_addr' insert (-1)
Mar 21 00:50:01 pve1 zed: eid=5444484 class=history_event pool_guid=0xCA26C2435C3C3A98 
Mar 21 00:50:02 pve1 zed: eid=5444485 class=history_event pool_guid=0xCA26C2435C3C3A98 
Mar 21 00:50:02 pve1 zed: eid=5444486 class=history_event pool_guid=0xCA26C2435C3C3A98

oguz said:
health of zfs pools? maybe you can find something there. zfs commands will be useful in diagnosing any issue.

zpool status:

Code:

root@pve1:/var/log# zpool status -v rpool
  pool: rpool
 state: ONLINE
  scan: scrub repaired 0B in 0 days 00:03:18 with 0 errors on Sun Mar  8 00:27:19 2020
config:

    NAME                                                 STATE     READ WRITE CKSUM
    rpool                                                ONLINE       0     0     0
      mirror-0                                           ONLINE       0     0     0
        nvme-eui.343337304d9063300025384100000001-part3  ONLINE       0     0     0
        nvme-eui.343337304d9063270025384100000001-part3  ONLINE       0     0     0

errors: No known data errors
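
With more pools or devices the counters are easy to misread, so a small filter like the one below could flag any row with a non-zero READ, WRITE or CKSUM value. It is a sketch that assumes the column layout shown above; nonzero_errors is a made-up helper name, not part of ZFS.

```shell
# Read `zpool status` output on stdin and print every pool/vdev/device row
# whose READ, WRITE or CKSUM counter is not "0".
# Prints nothing when all counters are zero.
nonzero_errors() {
    awk '$2 ~ /^(ONLINE|DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED)$/ && NF >= 5 &&
         ($3 != "0" || $4 != "0" || $5 != "0") { print $1, $3, $4, $5 }'
}

# Example:
#   zpool status rpool | nonzero_errors
```

For the output above it prints nothing, which matches "No known data errors".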

oguz said:
also, what is the output of pveversion -v and pveperf?

pveversion -v

Code:

proxmox-ve: 6.1-2 (running kernel: 5.3.18-2-pve)
pve-manager: 6.1-8 (running version: 6.1-8/806edfe1)
pve-kernel-helper: 6.1-7
pve-kernel-5.3: 6.1-5
pve-kernel-5.0: 6.0-11
pve-kernel-5.3.18-2-pve: 5.3.18-2
pve-kernel-5.3.18-1-pve: 5.3.18-1
pve-kernel-5.3.13-3-pve: 5.3.13-3
pve-kernel-5.3.13-1-pve: 5.3.13-1
pve-kernel-5.0.21-5-pve: 5.0.21-10
pve-kernel-5.0.15-1-pve: 5.0.15-1
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.3-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.15-pve1
libpve-access-control: 6.0-6
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.0-17
libpve-guest-common-perl: 3.0-5
libpve-http-server-perl: 3.0-5
libpve-storage-perl: 6.1-5
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 3.2.1-1
lxcfs: 3.0.3-pve60
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.1-3
pve-cluster: 6.1-4
pve-container: 3.0-22
pve-docs: 6.1-6
pve-edk2-firmware: 2.20200229-1
pve-firewall: 4.0-10
pve-firmware: 3.0-6
pve-ha-manager: 3.0-9
pve-i18n: 2.0-4
pve-qemu-kvm: 4.1.1-4
pve-xtermjs: 4.3.0-1
qemu-server: 6.1-7
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.3-pve1

pveperf

Code:

CPU BOGOMIPS:      118395.20
REGEX/SECOND:      5222375
HD SIZE:           684.76 GB (rpool/ROOT/pve-1)
FSYNCS/SECOND:     14830.56
DNS EXT:           29.15 ms
DNS INT:           28.16 ms (hidden)

LnxBil · Mar 26, 2020

Please also add the output of zpool list.

Stereo973 · Mar 26, 2020

Hi!

zpool list:

Code:

NAME    SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
rpool   888G   177G   711G        -         -    20%    19%  1.00x    ONLINE  -

Stereo973 · Mar 30, 2020

Hello!

The problem happened again, this time on the slave node, so it is not a hardware problem.

The replication log is among the last things the server managed to write:

Code:

2020-03-29 23:02:29 306-0: start replication job
2020-03-29 23:02:29 306-0: guest => CT 306, running => 1
2020-03-29 23:02:29 306-0: volumes => local-zfs:subvol-306-disk-0
2020-03-29 23:02:30 306-0: freeze guest filesystem
2020-03-29 23:02:31 306-0: create snapshot '__replicate_306-0_1585515749__' on local-zfs:subvol-306-disk-0

In syslog, we can see the same kind of kernel error as in my first post:

Code:

Mar 29 23:02:23 pve2 zed: eid=1219691 class=history_event pool_guid=0xD63AAA4F8D4D36D1
Mar 29 23:02:23 pve2 zed: eid=1219692 class=history_event pool_guid=0xD63AAA4F8D4D36D1
Mar 29 23:02:23 pve2 zed: eid=1219693 class=history_event pool_guid=0xD63AAA4F8D4D36D1
Mar 29 23:02:23 pve2 zed: eid=1219694 class=history_event pool_guid=0xD63AAA4F8D4D36D1
Mar 29 23:02:24 pve2 zed: eid=1219695 class=history_event pool_guid=0xD63AAA4F8D4D36D1
Mar 29 23:02:24 pve2 zed: eid=1219696 class=history_event pool_guid=0xD63AAA4F8D4D36D1
Mar 29 23:02:24 pve2 pve-ha-crm[23515]: unable to read file '/etc/pve/nodes//lrm_status'
Mar 29 23:02:25 pve2 zed: eid=1219697 class=history_event pool_guid=0xD63AAA4F8D4D36D1
Mar 29 23:02:25 pve2 zed: eid=1219698 class=history_event pool_guid=0xD63AAA4F8D4D36D1
Mar 29 23:02:25 pve2 zed: eid=1219699 class=history_event pool_guid=0xD63AAA4F8D4D36D1
Mar 29 23:02:26 pve2 zed: eid=1219700 class=history_event pool_guid=0xD63AAA4F8D4D36D1
Mar 29 23:02:26 pve2 zed: eid=1219701 class=history_event pool_guid=0xD63AAA4F8D4D36D1
Mar 29 23:02:26 pve2 zed: eid=1219702 class=history_event pool_guid=0xD63AAA4F8D4D36D1
Mar 29 23:02:27 pve2 kernel: [726208.768401] R13: ffff8f8510567ff7 R14: 000000000000002f R15: 0000000000000001
Mar 29 23:02:27 pve2 kernel: [726208.776261]  dbuf_read+0x269/0xb80 [zfs]
Mar 29 23:02:27 pve2 kernel: [726208.782085]  ? zfs_getattr_fast+0x124/0x220 [zfs]
Mar 29 23:02:27 pve2 kernel: [726208.786346] RDX: 0000000000008000 RSI: 000055ec399a51a0 RDI: 0000000000000003
Mar 29 23:02:27 pve2 rsync_pvebck.sh[5186]: /root/rsync_pvebck.sh: line 9: 24386 Segmentation fault      rsync -avz --delete /rpool/data/subvol-200-disk-0/ root@pvebck.codeactive.net:/rpool/data/subvol-200-disk-0
Mar 29 23:02:27 pve2 rsync_pvebck.sh[5186]: sending incremental file list
Mar 29 23:02:28 pve2 rsync_pvebck.sh[5186]: tmp/
Mar 29 23:02:28 pve2 rsync_pvebck.sh[5186]: var/lib/apt/daily_lock
Mar 29 23:02:28 pve2 zed: eid=1219703 class=history_event pool_guid=0xD63AAA4F8D4D36D1
Mar 29 23:02:28 pve2 rsync_pvebck.sh[5186]: var/lib/systemd/timers/stamp-apt-daily.timer
Mar 29 23:02:28 pve2 rsync_pvebck.sh[5186]: var/log/auth.log
Mar 29 23:02:28 pve2 rsync_pvebck.sh[5186]: var/log/syslog
Mar 29 23:02:28 pve2 rsync_pvebck.sh[5186]: var/log/journal/db91a8d9813048b3b0d9627361bf92a2/system.journal
Mar 29 23:02:28 pve2 zed: eid=1219704 class=history_event pool_guid=0xD63AAA4F8D4D36D1
Mar 29 23:02:28 pve2 zed: eid=1219705 class=history_event pool_guid=0xD63AAA4F8D4D36D1
Mar 29 23:02:28 pve2 zed: eid=1219706 class=history_event pool_guid=0xD63AAA4F8D4D36D1
Mar 29 23:02:28 pve2 zed: eid=1219707 class=history_event pool_guid=0xD63AAA4F8D4D36D1
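
To compare the two incidents, it may help to list which ZFS functions appear in the kernel trace lines (zap_cursor_retrieve in the first log; dbuf_read and zfs_getattr_fast here). A rough sketch, assuming trace lines of the form "symbol+0xOFFSET/0xSIZE [zfs]"; zfs_trace_symbols is a hypothetical helper name.

```shell
# Read syslog lines on stdin and list the distinct ZFS module functions
# named in kernel stack-trace lines such as "dbuf_read+0x269/0xb80 [zfs]".
zfs_trace_symbols() {
    grep -E ' kernel: .*\[zfs\]' |
        sed -E 's/.* ([A-Za-z_][A-Za-z0-9_.]*)\+0x[0-9a-f]+\/0x[0-9a-f]+ \[zfs\].*/\1/' |
        sort -u
}

# Example (the path is an assumption):
#   zfs_trace_symbols < /var/log/syslog
```

Running it on the logs of both nodes would show whether the same code path is involved each time, which is useful detail for a bug report.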
