Hi Spirit,
Thanks for taking some time to answer.
To give you an idea of my home lab, here is a quick description:
I have 3 physical PVE nodes (2 NUCs and one HP µserver G8) and a Synology NAS.
Each physical node has a boot SSD (Proxmox) and an attached USB3 disk used for CEPH (3/1). The G8 has 4 extra HDDs passed through to an OMV3 VM, used for replicating some of the Syno volumes and so on.
All my VM disks (including the OMV boot disk) are located on CEPH. I've set up a dedicated switch/subnet for CEPH operations + backups.
I know it's only USB3 and 1Gbps, but it's enough for my usage. I don't need high performance so much as "high" availability.
Now, regarding the logs: as a sysadmin, I know I'm working around the problem rather than solving it, but ...
For some reason, the USB disk (/dev/sdb, the OSD) disconnected on one NUC:
Jul 1 22:06:04 pve2 kernel: [833838.186658] usb 4-5: Disable of device-initiated U1 failed.
Jul 1 22:06:09 pve2 kernel: [833843.186322] usb 4-5: Disable of device-initiated U2 failed.
Jul 1 22:06:14 pve2 kernel: [833848.313882] xhci_hcd 0000:00:14.0: Timeout while waiting for setup device command
Jul 1 22:06:19 pve2 kernel: [833853.529561] xhci_hcd 0000:00:14.0: Timeout while waiting for setup device command
Jul 1 22:06:25 pve2 kernel: [833858.937142] xhci_hcd 0000:00:14.0: Timeout while waiting for setup device command
Jul 1 22:06:30 pve2 kernel: [833864.152772] xhci_hcd 0000:00:14.0: Timeout while waiting for setup device command
Jul 1 22:06:35 pve2 kernel: [833869.560434] xhci_hcd 0000:00:14.0: Timeout while waiting for setup device command
Jul 1 22:06:40 pve2 kernel: [833874.776008] xhci_hcd 0000:00:14.0: Timeout while waiting for setup device command
Jul 1 22:06:41 pve2 kernel: [833875.577354] usb 4-5: reset SuperSpeed USB device number 2 using xhci_hcd
Jul 1 22:06:41 pve2 kernel: [833875.615973] usb 4-5: USB disconnect, device number 2
Jul 1 22:06:41 pve2 kernel: [833875.623947] sd 0:0:0:0: [sdb] tag#0 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Jul 1 22:06:41 pve2 kernel: [833875.623951] sd 0:0:0:0: [sdb] tag#0 CDB: Write(10) 2a 00 0f 45 05 a0 00 00 30 00
Jul 1 22:06:41 pve2 kernel: [833875.625849] XFS (sdb1): xfs_do_force_shutdown(0x2) called from line 1197 of file fs/xfs/xfs_log.c. Return address = 0xffffffffc0ad6888
Jul 1 22:06:41 pve2 kernel: [833875.625866] XFS (sdb1): xfs_log_force: error -5 returned.
Jul 1 22:06:41 pve2 kernel: [833875.627991] sd 0:0:0:0: [sdb] tag#0 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Jul 1 22:06:41 pve2 kernel: [833875.627995] sd 0:0:0:0: [sdb] tag#0 CDB: Write(10) 2a 00 00 20 bf 50 00 00 18 00
Jul 1 22:06:41 pve2 kernel: [833875.629138] XFS (sdb1): xfs_log_force: error -5 returned.
Jul 1 22:06:42 pve2 kernel: [833876.465576] sd 0:0:0:0: [sdb] Synchronizing SCSI cache
Jul 1 22:06:42 pve2 kernel: [833876.465602] sd 0:0:0:0: [sdb] Synchronize Cache(10) failed: Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Right after that, it reconnected, but as /dev/sdc:
Jul 1 22:06:42 pve2 kernel: [833876.652036] usb 4-5: new SuperSpeed USB device number 4 using xhci_hcd
Jul 1 22:06:42 pve2 kernel: [833876.670310] usb 4-5: New USB device found, idVendor=0480, idProduct=a202
Jul 1 22:06:42 pve2 kernel: [833876.670313] usb 4-5: New USB device strings: Mfr=1, Product=2, SerialNumber=3
Jul 1 22:06:42 pve2 kernel: [833876.670314] usb 4-5: Product: External USB 3.0
Jul 1 22:06:42 pve2 kernel: [833876.670316] usb 4-5: Manufacturer: TOSHIBA
Jul 1 22:06:42 pve2 kernel: [833876.670317] usb 4-5: SerialNumber: 20161030000181C
Jul 1 22:06:42 pve2 kernel: [833876.671090] usb-storage 4-5:1.0: USB Mass Storage device detected
Jul 1 22:06:42 pve2 kernel: [833876.671145] scsi host7: usb-storage 4-5:1.0
Jul 1 22:06:43 pve2 kernel: [833877.670830] scsi 7:0:0:0: Direct-Access TOSHIBA External USB 3.0 5438 PQ: 0 ANSI: 6
Jul 1 22:06:43 pve2 kernel: [833877.671067] sd 7:0:0:0: Attached scsi generic sg2 type 0
Jul 1 22:06:43 pve2 kernel: [833877.672451] sd 7:0:0:0: [sdc] 976773164 512-byte logical blocks: (500 GB/466 GiB)
Jul 1 22:06:43 pve2 kernel: [833877.672780] sd 7:0:0:0: [sdc] Write Protect is off
Jul 1 22:06:43 pve2 kernel: [833877.673096] sd 7:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Jul 1 22:06:45 pve2 kernel: [833879.576518] sdc: sdc1 sdc2
Jul 1 22:06:45 pve2 kernel: [833879.578834] sd 7:0:0:0: [sdc] Attached SCSI disk
The problem is that CEPH tried to mount the filesystem (using the UUID, you're right!) while the old sdb1 mount was probably still in place in its failed state:
Jul 1 22:06:46 pve2 kernel: [833880.145999] XFS (sdc1): Filesystem has duplicate UUID 7f15baec-8e84-4442-9e80-5650a8419548 - can't mount
Jul 1 22:06:46 pve2 kernel: [833880.430268] XFS (sdc1): Filesystem has duplicate UUID 7f15baec-8e84-4442-9e80-5650a8419548 - can't mount
Jul 1 22:06:58 pve2 kernel: [833891.926829] XFS (sdb1): xfs_log_force: error -5 returned.
Jul 1 22:07:28 pve2 kernel: [833922.004716] XFS (sdb1): xfs_log_force: error -5 returned.
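To spot this sequence without reading the whole log, here is a minimal Python sketch that scans kernel log lines for the disconnect, the new device name and the duplicate-UUID refusal. The regexes are only derived from the messages quoted above, and the function name `summarize` and the dictionary keys are my own choices, so treat it as an illustration rather than a robust parser:

```python
import re

# Patterns derived from the kernel messages quoted in this post (assumption:
# other kernels or drivers may word these messages differently).
DISCONNECT = re.compile(r"usb (?P<port>[\d.-]+): USB disconnect")
FAILED_IO = re.compile(r"\[(?P<dev>sd[a-z]+)\] .*hostbyte=DID_NO_CONNECT")
ATTACHED = re.compile(r"\[(?P<dev>sd[a-z]+)\] Attached SCSI disk")
DUP_UUID = re.compile(
    r"XFS \((?P<part>sd[a-z]+\d+)\): Filesystem has duplicate UUID (?P<uuid>[0-9a-f-]+)"
)


def summarize(lines):
    """Collect the USB port, old/new disk names and duplicate UUID from log lines."""
    summary = {"port": None, "old_dev": None, "new_dev": None, "dup_uuid": None}
    for line in lines:
        if (m := DISCONNECT.search(line)):
            summary["port"] = m["port"]
        elif (m := FAILED_IO.search(line)) and summary["old_dev"] is None:
            # First disk reporting DID_NO_CONNECT is taken as the one that vanished.
            summary["old_dev"] = m["dev"]
        elif (m := ATTACHED.search(line)):
            summary["new_dev"] = m["dev"]
        elif (m := DUP_UUID.search(line)):
            summary["dup_uuid"] = m["uuid"]
    return summary


sample = [
    "Jul 1 22:06:41 pve2 kernel: [833875.615973] usb 4-5: USB disconnect, device number 2",
    "Jul 1 22:06:41 pve2 kernel: [833875.623947] sd 0:0:0:0: [sdb] tag#0 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK",
    "Jul 1 22:06:45 pve2 kernel: [833879.578834] sd 7:0:0:0: [sdc] Attached SCSI disk",
    "Jul 1 22:06:46 pve2 kernel: [833880.145999] XFS (sdc1): Filesystem has duplicate UUID 7f15baec-8e84-4442-9e80-5650a8419548 - can't mount",
]
print(summarize(sample))
```

If `old_dev` and `new_dev` differ and `dup_uuid` is set, the disk came back under a new name while the old filesystem was still registered, which is the situation described here.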
Now maybe the only solution, apart from understanding why it disconnects/reconnects, would be to use a static udev rule to make sure it keeps the same device name after all ...
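As a rough idea of what such a rule could look like, here is an untested Python sketch that formats one udev rule line from the vendor, product and serial values shown in the log above. The helper name `build_udev_rule`, the symlink name `ceph-usb-osd` and the rules file it would go into are hypothetical. As far as I know, udev can only add a stable symlink for a disk (the kernel still picks sdb or sdc), so this would give a fixed path such as /dev/ceph-usb-osd1 but would not by itself release the stale sdb1 mount:

```python
def build_udev_rule(vendor, product, serial, symlink):
    """Format one udev rule that adds a stable symlink for a USB disk.

    %n expands to the kernel number, so the first partition would get
    <symlink>1 while the whole disk gets <symlink>.
    """
    return (
        'SUBSYSTEM=="block", KERNEL=="sd*", '
        f'ATTRS{{idVendor}}=="{vendor}", ATTRS{{idProduct}}=="{product}", '
        f'ATTRS{{serial}}=="{serial}", SYMLINK+="{symlink}%n"'
    )


# Values taken from the "New USB device found" lines in the log;
# the symlink name is an arbitrary choice for illustration.
rule = build_udev_rule("0480", "a202", "20161030000181C", "ceph-usb-osd")
print(rule)
```

The printed line would go into a file under /etc/udev/rules.d/ (the file name is up to you), followed by a udev rules reload.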
Regards