Hello,
First error: I backed up a VM template from the template storage to local storage (ZFS RAID 1). The backup job failed and the node crashed and rebooted, with nothing in the syslog. (Node 1)
I tried the backup again and the node rebooted again. (Node 1)
Latest error: I rsynced a template to local storage and hit the same issue as on Node 1. (Node 2)
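For reference, the two actions that trigger the crash are roughly equivalent to the commands below (the VM ID, storage name, and paths are placeholders, not the exact ones I used):

# Backup of the template to the local ZFS mirror (placeholder VM ID and storage name)
vzdump 9000 --storage local-zfs --mode snapshot --compress zstd

# Copying a template to local storage with rsync (placeholder paths)
rsync -av --progress /mnt/pve/backup-storage/templates/ /var/lib/vz/template/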
Syslog (the node booted again at 17:34:19):
Jul 4 17:06:26 node02 systemd[1]: Starting Daily apt download activities...
Jul 4 17:06:27 node02 systemd[1]: apt-daily.service: Succeeded.
Jul 4 17:06:27 node02 systemd[1]: Finished Daily apt download activities.
Jul 4 17:09:39 node02 pveproxy[3106]: worker 2127412 finished
Jul 4 17:09:39 node02 pveproxy[3106]: starting 1 worker(s)
Jul 4 17:09:39 node02 pveproxy[3106]: worker 2207900 started
Jul 4 17:09:40 node02 pveproxy[2207897]: worker exit
Jul 4 17:10:54 node02 pveproxy[2139303]: worker exit
Jul 4 17:10:54 node02 pveproxy[3106]: worker 2139303 finished
Jul 4 17:10:54 node02 pveproxy[3106]: starting 1 worker(s)
Jul 4 17:10:54 node02 pveproxy[3106]: worker 2213809 started
Jul 4 17:11:43 node02 pmxcfs[59262]: [status] notice: received log
Jul 4 17:12:19 node02 pvedaemon[2139978]: worker exit
Jul 4 17:12:19 node02 pvedaemon[3092]: worker 2139978 finished
Jul 4 17:12:19 node02 pvedaemon[3092]: starting 1 worker(s)
Jul 4 17:12:19 node02 pvedaemon[3092]: worker 2219909 started
Jul 4 17:13:11 node02 pmxcfs[59262]: [status] notice: received log
Jul 4 17:13:30 node02 smartd[2439]: Device: /dev/sdb [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 74 to 76
Jul 4 17:13:30 node02 smartd[2439]: Device: /dev/sdd [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 76 to 77
Jul 4 17:13:30 node02 smartd[2439]: Device: /dev/sde [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 76 to 77
Jul 4 17:13:30 node02 smartd[2439]: Device: /dev/sdf [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 76 to 77
Jul 4 17:13:30 node02 smartd[2439]: Device: /dev/sdg [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 76 to 77
Jul 4 17:13:30 node02 smartd[2439]: Device: /dev/sdh [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 76 to 77
Jul 4 17:14:39 node02 pveproxy[3106]: worker 2155416 finished
Jul 4 17:14:39 node02 pveproxy[3106]: starting 1 worker(s)
Jul 4 17:14:39 node02 pveproxy[3106]: worker 2231248 started
Jul 4 17:14:41 node02 pveproxy[2231247]: got inotify poll request in wrong process - disabling inotify
Jul 4 17:14:42 node02 pveproxy[2231247]: worker exit
Jul 4 17:14:50 node02 pvedaemon[2152566]: worker exit
Jul 4 17:14:51 node02 pvedaemon[3092]: worker 2152566 finished
Jul 4 17:14:51 node02 pvedaemon[3092]: starting 1 worker(s)
Jul 4 17:14:51 node02 pvedaemon[3092]: worker 2232002 started
Jul 4 17:15:01 node02 CRON[2233058]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Jul 4 17:17:01 node02 CRON[2242260]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Jul 4 17:17:08 node02 pmxcfs[59262]: [status] notice: received log
Jul 4 17:21:47 node02 pvedaemon[2183120]: worker exit
Jul 4 17:21:47 node02 pvedaemon[3092]: worker 2183120 finished
Jul 4 17:21:47 node02 pvedaemon[3092]: starting 1 worker(s)
Jul 4 17:21:47 node02 pvedaemon[3092]: worker 2264271 started
Jul 4 17:24:48 node02 systemd[1]: Started Session 4846 of user root.
Jul 4 17:25:01 node02 CRON[2279666]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Jul 4 17:26:43 node02 pmxcfs[59262]: [status] notice: received log
Jul 4 17:26:43 node02 systemd[1]: Started Session 4848 of user root.
Jul 4 17:34:19 node02 systemd-modules-load[2953]: Inserted module 'iscsi_tcp'
Jul 4 17:34:19 node02 kernel: [ 0.000000] Linux version 5.15.102-1-pve (build@proxmox) (gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2) #1 SMP PVE 5.15.102-1 (2023-03-14T13:48Z) ()
Jul 4 17:34:19 node02 systemd-modules-load[2953]: Inserted module 'ib_iser'
Jul 4 17:34:19 node02 kernel: [ 0.000000] Command line: initrd=\EFI\proxmox\5.15.102-1-pve\initrd.img-5.15.102-1-pve root=ZFS=rpool/ROOT/pve-1 boot=zfs
Jul 4 17:34:19 node02 kernel: [ 0.000000] KERNEL supported cpus:
Jul 4 17:34:19 node02 systemd-modules-load[2953]: Inserted module 'vhost_net'
Jul 4 17:34:19 node02 kernel: [ 0.000000] Intel GenuineIntel
Jul 4 17:34:19 node02 kernel: [ 0.000000] AMD AuthenticAMD
Jul 4 17:34:19 node02 kernel: [ 0.000000] Hygon HygonGenuine
Jul 4 17:34:19 node02 lvm[2948]: 1 logical volume(s) in volume group "ceph-720cbe5b-dee0-401b-ad2a-5fb75ecdbf71" monitored
Jul 4 17:34:19 node02 kernel: [ 0.000000] Centaur CentaurHauls
Jul 4 17:34:19 node02 lvm[2948]: 1 logical volume(s) in volume group "ceph-0f1bd2f9-9339-4245-839e-0d579164a25d" monitored
Jul 4 17:34:19 node02 kernel: [ 0.000000] zhaoxin Shanghai
Jul 4 17:34:19 node02 kernel: [ 0.000000] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
Jul 4 17:34:19 node02 kernel: [ 0.000000] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
Jul 4 17:34:19 node02 kernel: [ 0.000000] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
Hardware details:
The cluster has 4 nodes, all with the same hardware.
Dell C6420
ZFS RAID 1 (mirror) on two WDC WDS250G2B0B disks (P/N: 00f9xf)
pveversion -v:
proxmox-ve: 7.4-1 (running kernel: 5.15.102-1-pve)
pve-manager: 7.4-3 (running version: 7.4-3/9002ab8a)
pve-kernel-5.15: 7.3-3
pve-kernel-5.15.102-1-pve: 5.15.102-1
ceph: 17.2.5-pve1
ceph-fuse: 17.2.5-pve1
corosync: 3.1.7-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve2
libproxmox-acme-perl: 1.4.4
libproxmox-backup-qemu0: 1.3.1-1
libproxmox-rs-perl: 0.2.1
libpve-access-control: 7.4-1
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.3-3
libpve-guest-common-perl: 4.2-4
libpve-http-server-perl: 4.2-1
libpve-rs-perl: 0.7.5
libpve-storage-perl: 7.4-2
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.2-2
lxcfs: 5.0.3-pve1
novnc-pve: 1.4.0-1
proxmox-backup-client: 2.3.3-1
proxmox-backup-file-restore: 2.3.3-1
proxmox-kernel-helper: 7.4-1
proxmox-mail-forward: 0.1.1-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.6.3
pve-cluster: 7.3-3
pve-container: 4.4-3
pve-docs: 7.4-2
pve-edk2-firmware: 3.20221111-1
pve-firewall: 4.3-1
pve-firmware: 3.6-4
pve-ha-manager: 3.6.0
pve-i18n: 2.11-1
pve-qemu-kvm: 7.2.0-8
pve-xtermjs: 4.16.0-1
qemu-server: 7.4-2
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.8.0~bpo11+3
vncterm: 1.7-1
zfsutils-linux: 2.1.9-pve1
zpool status:
  pool: rpool
 state: ONLINE
  scan: scrub repaired 0B in 00:00:05 with 0 errors on Sun Jun 11 00:24:06 2023
config:

        NAME                                              STATE     READ WRITE CKSUM
        rpool                                             ONLINE       0     0     0
          mirror-0                                        ONLINE       0     0     0
            ata-WDC_WDS250G2B0B-00YS70_2140LM472101-part3 ONLINE       0     0     0
            ata-WDC_WDS250G2B0B-00YS70_220206A0005A-part3 ONLINE       0     0     0

errors: No known data errors
A daily job backs up the VMs to the backup storage, and that job has never hit this issue.
* There are no errors in iDRAC.
I'm worried that a future upgrade/patch will trigger another reboot and cause problems.
My guess is that the root cause is ZFS, the disks, or the BOSS controller, but I don't see any error logs.
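For context, the checks I have run so far are along these lines, and none of them report an error (a rough sketch; the device paths are just examples for the two mirror disks):

# ZFS pool health - mirror is ONLINE with 0 read/write/checksum errors
zpool status -v rpool

# SMART health of the mirror members (example device paths)
smartctl -a /dev/sda
smartctl -a /dev/sdb

# Kernel messages and logs from the previous boot (nothing around the crash)
dmesg -T | tail -n 200
journalctl -b -1 -p warning    # only useful if persistent journaling is enabled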
Can anyone suggest what to check, or how to investigate this issue further?
Best Regards,