This is one of the strangest issues in my 50+ years of computing. Your help, ideas, diagnostic or solution suggestions most welcome!
CONTEXT
OBVIOUS SYMPTOM
OTHER DIAGNOSTICS AND TESTS DONE
EVIDENCE POINTING TO PROXMOX IN SOME WAY
Careful examination of kern.log and syslog shows a 100% correlation to pvescheduler VM backup activity. AND I see some potentially strange side effects related to that backup task:
syslog
This exact sequence is always there. The timing varies a bit due to other activity, but this sequence is constant
My questions
Pete
Data Details below
pvescheduler file (/etc/pve/jobs.cfg)
6 April, syslog
CONTEXT
- I have a pair of almost-the-same Proxmox VM hosts. (HP Z2 G5 SFF, well loaded.)
- Both have NVMe 1TB ZFS mirrored primary storage
- Host 2 is MUCH less busy than Host 1 at present
- All but one of the Host 2 VM's have been shut down since late March or earlier.
OBVIOUS SYMPTOM
- Every Saturday at 00:20 +/- a few minutes, Host 2 loses NVMe mirror drive #2, gone w/o apparent error (other than being gone.) ZFS notes the pool is degraded.
- On reboot, the pool resilvers perfectly w/o error and all is well for another week.
OTHER DIAGNOSTICS AND TESTS DONE
- Swapping NVMe sticks? Same slot is gone. So this happens w/ any NVMe stick. (Perhaps of interest: my sticks are slightly different size, but the mirror is fine.)
- Ran many HW diagnostics w/o error.
- There are no cron jobs at that time.
- There is no special network traffic at that time.
- Tried playing w/ PCIe rescans etc and found nothing.
EVIDENCE POINTING TO PROXMOX IN SOME WAY
Careful examination of kern.log and syslog shows a 100% correlation to pvescheduler VM backup activity. AND I see some potentially strange side effects related to that backup task:
syslog
- VM 712 finishes backup
- VM 790 begins backup
- 64ms later, nvme disappears
- Kernel activity on tap port related to VM 712 wraps up
- NVMe disappears
- Kernel activity on tap port related to VM 790
This exact sequence is always there. The timing varies a bit due to other activity, but this sequence is constant
My questions
- Has anything like this been seen by others (online searches have not helped)
- Any ideas how to further diagnose?
- (Note: I hesitate to do massive reloads. I'd rather help find the bug than simply obscure it / make it disappear.)
Pete
Data Details below
pvescheduler file (/etc/pve/jobs.cfg)
vzdump: 3f9a084ec4febd6aa668de2f3f6d838fba2de390:1
schedule sat 00:00
all 1
compress zstd
enabled 1
mailnotification always
mode snapshot
quiet 1
storage nas-share
vzdump: 648efa5c74376df725b01f7cdac5863435aa48bd:1
schedule sat 00:00
all 1
compress zstd
enabled 0
mailnotification always
mode snapshot
quiet 1
storage nas-share
6 April, syslog
6 April, kern.log2024-04-06T00:19:29.849849-06:00 pve2a qmeventd[1011121]: Finished cleanup for 712
2024-04-06T00:19:32.783427-06:00 pve2a pvescheduler[1001041]: INFO: Finished Backup of VM 712 (00:03:09)
2024-04-06T00:19:32.819516-06:00 pve2a pvescheduler[1001041]: INFO: Starting Backup of VM 790 (qemu)
2024-04-06T00:19:32.883951-06:00 pve2a kernel: [197237.341987] zio pool=rpool vdev=/dev/disk/by-id/nvme-CT1000T500SSD5_234344CF32EB-part3 error=5 type=1 offset=270336 size=8192 flags=721601
2024-04-06T00:19:32.897159-06:00 pve2a zed: eid=31 class=statechange pool='rpool' vdev=nvme-CT1000T500SSD5_234344CF32EB-part3 vdev_state=REMOVED
2024-04-06T00:19:32.897420-06:00 pve2a zed: eid=32 class=removed pool='rpool' vdev=nvme-CT1000T500SSD5_234344CF32EB-part3 vdev_state=REMOVED
2024-04-06T00:19:32.915254-06:00 pve2a zed: eid=33 class=config_sync pool='rpool'
4 May, syslog2024-04-06T00:16:24.579990-06:00 pve2a kernel: [197049.035037] tap712i0: entered allmulticast mode
2024-04-06T00:16:24.579991-06:00 pve2a kernel: [197049.035386] vmbr0: port 3(tap712i0) entered blocking state
2024-04-06T00:16:24.579991-06:00 pve2a kernel: [197049.035675] vmbr0: port 3(tap712i0) entered forwarding state
2024-04-06T00:19:29.251997-06:00 pve2a kernel: [197233.709327] tap712i0: left allmulticast mode
2024-04-06T00:19:29.252009-06:00 pve2a kernel: [197233.709638] vmbr0: port 3(tap712i0) entered disabled state
2024-04-06T00:19:32.883951-06:00 pve2a kernel: [197237.341987] zio pool=rpool vdev=/dev/disk/by-id/nvme-CT1000T500SSD5_234344CF32EB-part3 error=5 type=1 offset=270336 size=8192 flags=721601
2024-04-06T00:19:33.815994-06:00 pve2a kernel: [197238.273119] tap790i0: entered promiscuous mode
2024-04-06T00:19:33.828143-06:00 pve2a kernel: [197238.286229] vmbr0: port 3(tap790i0) entered blocking state
2024-04-06T00:19:33.828151-06:00 pve2a kernel: [197238.286530] vmbr0: port 3(tap790i0) entered disabled state
2024-04-06T00:19:33.831977-06:00 pve2a kernel: [197238.286852] tap790i0: entered allmulticast mode
2024-04-06T00:19:33.832007-06:00 pve2a kernel: [197238.287214] vmbr0: port 3(tap790i0) entered blocking state
2024-04-06T00:19:33.832009-06:00 pve2a kernel: [197238.287510] vmbr0: port 3(tap790i0) entered forwarding state
2024-04-06T00:22:28.144029-06:00 pve2a kernel: [197412.601109] tap790i0: left allmulticast mode
2024-04-06T00:22:28.144040-06:00 pve2a kernel: [197412.601421] vmbr0: port 3(tap790i0) entered disabled state
2024-04-06T00:22:31.131973-06:00 pve2a kernel: [197415.590766] tap791i0: entered promiscuous mode
2024-04-06T00:22:31.147942-06:00 pve2a kernel: [197415.604230] vmbr0: port 3(tap791i0) entered blocking state
4 May, kern.log2024-05-04T00:20:57.864497-06:00 pve2a pvescheduler[2427079]: INFO: Finished Backup of VM 712 (00:03:13)
2024-05-04T00:20:57.875064-06:00 pve2a pvescheduler[2427079]: INFO: Starting Backup of VM 790 (qemu)
2024-05-04T00:20:57.945317-06:00 pve2a kernel: [448856.898381] zio pool=rpool vdev=/dev/disk/by-id/nvme-SAMSUNG_MZVL21T0HCLR-00BH1_S641NF0T407505-part3 error=5 type=1 offset=270336 size=8192 flags=721601
2024-05-04T00:20:57.945328-06:00 pve2a kernel: [448856.898392] zio pool=rpool vdev=/dev/disk/by-id/nvme-SAMSUNG_MZVL21T0HCLR-00BH1_S641NF0T407505-part3 error=5 type=1 offset=1023671017472 size=8192 flags=721601
2024-05-04T00:20:57.945329-06:00 pve2a kernel: [448856.898397] zio pool=rpool vdev=/dev/disk/by-id/nvme-SAMSUNG_MZVL21T0HCLR-00BH1_S641NF0T407505-part3 error=5 type=1 offset=1023671279616 size=8192 flags=721601
2024-05-04T00:20:57.953044-06:00 pve2a zed: eid=18 class=statechange pool='rpool' vdev=nvme-SAMSUNG_MZVL21T0HCLR-00BH1_S641NF0T407505-part3 vdev_state=REMOVED
2024-05-04T00:20:57.953148-06:00 pve2a zed: eid=19 class=removed pool='rpool' vdev=nvme-SAMSUNG_MZVL21T0HCLR-00BH1_S641NF0T407505-part3 vdev_state=REMOVED
2024-05-04T00:20:57.961195-06:00 pve2a zed: eid=20 class=config_sync pool='rpool'
2024-05-04T00:17:45.112362-06:00 pve2a kernel: [448664.064154] vmbr0: port 2(tap712i0) entered forwarding state
2024-05-04T00:20:56.163498-06:00 pve2a kernel: [448855.116201] tap712i0: left allmulticast mode
2024-05-04T00:20:56.163507-06:00 pve2a kernel: [448855.116219] vmbr0: port 2(tap712i0) entered disabled state
2024-05-04T00:20:57.945317-06:00 pve2a kernel: [448856.898381] zio pool=rpool vdev=/dev/disk/by-id/nvme-SAMSUNG_MZVL21T0HCLR-00BH1_S641NF0T407505-part3 error=5 type=1 offset=270336 size=8192 flags=721601
2024-05-04T00:20:57.945328-06:00 pve2a kernel: [448856.898392] zio pool=rpool vdev=/dev/disk/by-id/nvme-SAMSUNG_MZVL21T0HCLR-00BH1_S641NF0T407505-part3 error=5 type=1 offset=1023671017472 size=8192 flags=721601
2024-05-04T00:20:57.945329-06:00 pve2a kernel: [448856.898397] zio pool=rpool vdev=/dev/disk/by-id/nvme-SAMSUNG_MZVL21T0HCLR-00BH1_S641NF0T407505-part3 error=5 type=1 offset=1023671279616 size=8192 flags=721601
2024-05-04T00:20:58.854316-06:00 pve2a kernel: [448857.807202] tap790i0: entered promiscuous mode
2024-05-04T00:20:58.867310-06:00 pve2a kernel: [448857.820449] vmbr0: port 2(tap790i0) entered blocking state
2024-05-04T00:20:58.867318-06:00 pve2a kernel: [448857.820454] vmbr0: port 2(tap790i0) entered disabled state
2024-05-04T00:20:58.867319-06:00 pve2a kernel: [448857.820463] tap790i0: entered allmulticast mode
2024-05-04T00:20:58.867320-06:00 pve2a kernel: [448857.820511] vmbr0: port 2(tap790i0) entered blocking state
2024-05-04T00:20:58.867321-06:00 pve2a kernel: [448857.820514] vmbr0: port 2(tap790i0) entered forwarding state
2024-05-04T00:26:36.809517-06:00 pve2a kernel: [449195.763577] tap790i0: left allmulticast mode