Gluster - ZFS problem

andrea68

Hi,

I have 3 Proxmox nodes, v6.4.

Every node has 6 SSD drives dedicated to VM storage.
Every node is configured with a ZFS raidz1 pool.
On top of this ZFS pool I built a Gluster brick.
So I set up a dispersed Gluster volume with 3 bricks (redundancy 1), and it has worked flawlessly for the last 3 years.
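For reference, the volume was originally created with roughly the following commands (brick paths taken from the volume status output below; the exact options used three years ago may have differed):

Code:
gluster volume create DATASTORE disperse 3 redundancy 1 \
    stor01:/PVE01/stor01 stor02:/PVE02/stor02 stor03:/PVE03/stor03
gluster volume start DATASTORE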
Now the problem: I lost one brick (node 3).
Long story short: something failed in ZFS and I can't bring the pool up any more: "zpool import PVE03" asks me to destroy and re-create it from scratch because of I/O errors.

Code:
root@pve03 ~ # zpool import PVE03
cannot import 'PVE03': I/O error
    Destroy and re-create the pool from
    a backup source.

So Gluster now sees only 2 bricks of 3.
But it seems I've lost various VMs, and this is driving me crazy...

Code:
root@pve01 ~ # gluster volume status
Status of volume: DATASTORE
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick stor01:/PVE01/stor01                  49152     0          Y       1967
Brick stor02:/PVE02/stor02                  49152     0          Y       1991
Brick stor03:/PVE03/stor03                  N/A       N/A        N       N/A
Self-heal Daemon on localhost               N/A       N/A        Y       1976
Self-heal Daemon on stor03                  N/A       N/A        Y       2034
Self-heal Daemon on stor02                  N/A       N/A        Y       2000

Task Status of Volume DATASTORE
------------------------------------------------------------------------------
There are no active volume tasks

Code:
root@pve01 ~ # gluster volume heal DATASTORE info
Brick stor01:/PVE01/stor01
/images/114
/images/108/vm-108-disk-0.qcow2
/images
/images/104/vm-104-disk-0.qcow2
<gfid:979d2546-124f-4d1b-bd3d-b8ccfbcc2800>
<gfid:426e0911-f5c9-4bc9-982b-37c244887d4c>
/images/114/vm-114-disk-0.qcow2
/images/111/vm-111-disk-0.qcow2
/images/109/vm-109-disk-0.qcow2
<gfid:fdc23428-8e45-40c9-856d-1c3011c0153f>
/images/112/vm-112-disk-0.qcow2
/images/102/vm-102-disk-0.qcow2
/images/113/vm-113-disk-0.qcow2
<gfid:b779547b-5e5f-44f7-82a7-302d8864a3b5>
Status: Connected
Number of entries: 14

Brick stor02:/PVE02/stor02
/images/110/vm-110-disk-0.qcow2
/images/114
<gfid:56d65fcb-451d-4288-b7d2-4c9a85fa6f87>
/images
<gfid:3216ab3b-76bb-4da0-8b1b-2e1848ee7283>
<gfid:d642498f-2e2c-4caf-a037-3418f1fc908b>
<gfid:43af1e25-4559-4ac0-af31-e7b19a195e17>
<gfid:2a8f4b90-62f1-476c-b29e-39316361042f>
/images/105/vm-105-disk-0.qcow2
/images/103/vm-103-disk-0.qcow2
<gfid:eff0daaf-dcaf-4faa-8f8f-558cd2a0022b>
<gfid:d907ae20-ce9b-4121-85fe-e983ab8a7d51>
<gfid:1d1b492b-dc1e-4ef4-a7eb-6e474c96427d>
/images/106/vm-106-disk-0.qcow2
Status: Connected
Number of entries: 14

Brick stor03:/PVE03/stor03
Status: Transport endpoint is not connected
Number of entries: -
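For reference, a more compact view of the pending heal entries can be pulled with the heal summary and heal-count commands (output not shown here):

Code:
gluster volume heal DATASTORE info summary
gluster volume heal DATASTORE statistics heal-count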



Do you have any brilliant ideas on how to start debugging this problem?

Thanks in advance
 
Hi,
I would check the ZFS status with `zpool status`; this might give a hint about the pool's state. It would also help to check the syslog/journalctl and dmesg output for anything pointing at the cause of the issue.
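For example, something along these lines (the zfs-import unit names are assumed from a default Proxmox/ZFS install):

Code:
zpool status -v
journalctl -b -u zfs-import-cache.service -u 'zfs-import@*'
dmesg -T | grep -iE 'ata|sd[a-z]|zfs|error'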
 
Code:
root@pve03 ~ # zpool status
no pools available

----

root@pve03 ~ # zpool import
   pool: PVE03
     id: 9958204538773202748
  state: DEGRADED
status: One or more devices contains corrupted data.
 action: The pool can be imported despite missing or damaged devices.  The
    fault tolerance of the pool may be compromised if imported.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-4J
 config:

    PVE03                                              DEGRADED
      raidz1-0                                         DEGRADED
        ata-SAMSUNG_MZ7LM960HCHP-00003_S1YHNX0H410138  UNAVAIL
        ata-SAMSUNG_MZ7LM960HCHP-00003_S1YHNXAG813934  ONLINE
        ata-SAMSUNG_MZ7LM960HCHP-00003_S1YHNYAG600091  ONLINE
        ata-SAMSUNG_MZ7GE960HMHP-00003_S1M7NWAG305222  ONLINE
        ata-SAMSUNG_MZ7LM960HCHP-00003_S1YHNXAH308733  ONLINE
        ata-SAMSUNG_MZ7LM960HCHP-00003_S1YHNX0H408512  ONLINE

Every attempt to import the pool fails with an I/O error.

Log messages from the journal:


Code:
Feb 05 11:56:46 pve03 systemd[1]: Removed slice system-zfs\x2dimport.slice.
Feb 05 11:56:46 pve03 systemd[1]: zfs-share.service: Succeeded.
Feb 05 11:56:46 pve03 systemd[1]: zfs-zed.service: Succeeded.
Feb 05 12:01:40 pve03 systemd-modules-load[479]: Inserted module 'zfs'
Feb 05 12:01:41 pve03 systemd[1]: zfs-import@PVE03.service: Main process exited, code=exited, status=1/FAILURE
Feb 05 12:01:41 pve03 systemd[1]: zfs-import@PVE03.service: Failed with result 'exit-code'.
Feb 05 12:02:24 pve03 systemd[1]: zfs-import-cache.service: Main process exited, code=exited, status=1/FAILURE
Feb 05 12:02:24 pve03 systemd[1]: zfs-import-cache.service: Failed with result 'exit-code'.
Feb 05 12:39:08 pve03 systemd-modules-load[470]: Inserted module 'zfs'
Feb 05 12:39:09 pve03 systemd[1]: zfs-import@PVE03.service: Main process exited, code=exited, status=1/FAILURE
Feb 05 12:39:09 pve03 systemd[1]: zfs-import@PVE03.service: Failed with result 'exit-code'.
Feb 05 12:39:53 pve03 systemd[1]: zfs-import-cache.service: Main process exited, code=exited, status=1/FAILURE
Feb 05 12:39:53 pve03 systemd[1]: zfs-import-cache.service: Failed with result 'exit-code'.
Feb 06 00:24:01 pve03 CRON[12887]: (root) CMD (if [ $(date +%w) -eq 0 ] && [ -x /usr/lib/zfs-linux/trim ]; then /usr/lib/zfs-linux/trim; fi)
Feb 06 07:46:03 pve03 systemd[1]: zfs-share.service: Succeeded.
Feb 06 07:46:03 pve03 systemd[1]: Removed slice system-zfs\x2dimport.slice.
Feb 06 07:46:03 pve03 systemd[1]: zfs-zed.service: Succeeded.
Feb 06 07:48:31 pve03 systemd-modules-load[481]: Inserted module 'zfs'
Feb 06 07:48:32 pve03 systemd[1]: zfs-import@PVE03.service: Main process exited, code=exited, status=1/FAILURE
Feb 06 07:48:32 pve03 systemd[1]: zfs-import@PVE03.service: Failed with result 'exit-code'.
Feb 06 07:48:50 pve03 systemd[1]: zfs-import-cache.service: Main process exited, code=exited, status=1/FAILURE
Feb 06 07:48:50 pve03 systemd[1]: zfs-import-cache.service: Failed with result 'exit-code'.
Feb 06 08:32:47 pve03 kernel:  spa_all_configs+0x3b/0x120 [zfs]
Feb 06 08:32:47 pve03 kernel:  zfs_ioc_pool_configs+0x1b/0x70 [zfs]
Feb 06 08:32:47 pve03 kernel:  zfsdev_ioctl_common+0x5b2/0x820 [zfs]
Feb 06 08:32:47 pve03 kernel:  zfsdev_ioctl+0x54/0xe0 [zfs]
Feb 06 08:34:48 pve03 kernel:  spa_all_configs+0x3b/0x120 [zfs]
Feb 06 08:34:48 pve03 kernel:  zfs_ioc_pool_configs+0x1b/0x70 [zfs]
Feb 06 08:34:48 pve03 kernel:  zfsdev_ioctl_common+0x5b2/0x820 [zfs]
Feb 06 08:34:48 pve03 kernel:  zfsdev_ioctl+0x54/0xe0 [zfs]
Feb 06 08:36:49 pve03 kernel:  spa_all_configs+0x3b/0x120 [zfs]
Feb 06 08:36:49 pve03 kernel:  zfs_ioc_pool_configs+0x1b/0x70 [zfs]
Feb 06 08:36:49 pve03 kernel:  zfsdev_ioctl_common+0x5b2/0x820 [zfs]
Feb 06 08:36:49 pve03 kernel:  zfsdev_ioctl+0x54/0xe0 [zfs]
Feb 06 08:38:50 pve03 kernel:  spa_all_configs+0x3b/0x120 [zfs]
Feb 06 08:38:50 pve03 kernel:  zfs_ioc_pool_configs+0x1b/0x70 [zfs]
Feb 06 08:38:50 pve03 kernel:  zfsdev_ioctl_common+0x5b2/0x820 [zfs]
Feb 06 08:38:50 pve03 kernel:  zfsdev_ioctl+0x54/0xe0 [zfs]
Feb 06 08:40:51 pve03 kernel:  spa_all_configs+0x3b/0x120 [zfs]
Feb 06 08:40:51 pve03 kernel:  zfs_ioc_pool_configs+0x1b/0x70 [zfs]
Feb 06 08:40:51 pve03 kernel:  zfsdev_ioctl_common+0x5b2/0x820 [zfs]
Feb 06 08:40:51 pve03 kernel:  zfsdev_ioctl+0x54/0xe0 [zfs]
Feb 06 08:42:51 pve03 kernel:  spa_all_configs+0x3b/0x120 [zfs]
Feb 06 08:42:51 pve03 kernel:  zfs_ioc_pool_configs+0x1b/0x70 [zfs]
Feb 06 08:42:51 pve03 kernel:  zfsdev_ioctl_common+0x5b2/0x820 [zfs]
Feb 06 08:42:51 pve03 kernel:  zfsdev_ioctl+0x54/0xe0 [zfs]
Feb 06 08:44:52 pve03 kernel:  spa_all_configs+0x3b/0x120 [zfs]
Feb 06 08:44:52 pve03 kernel:  zfs_ioc_pool_configs+0x1b/0x70 [zfs]
Feb 06 08:44:52 pve03 kernel:  zfsdev_ioctl_common+0x5b2/0x820 [zfs]
Feb 06 08:44:52 pve03 kernel:  zfsdev_ioctl+0x54/0xe0 [zfs]
Feb 06 08:46:53 pve03 kernel:  spa_all_configs+0x3b/0x120 [zfs]
Feb 06 08:46:53 pve03 kernel:  zfs_ioc_pool_configs+0x1b/0x70 [zfs]
Feb 06 08:46:53 pve03 kernel:  zfsdev_ioctl_common+0x5b2/0x820 [zfs]
Feb 06 08:46:53 pve03 kernel:  zfsdev_ioctl+0x54/0xe0 [zfs]
Feb 06 08:48:54 pve03 kernel:  spa_all_configs+0x3b/0x120 [zfs]
Feb 06 08:48:54 pve03 kernel:  zfs_ioc_pool_configs+0x1b/0x70 [zfs]
Feb 06 08:48:54 pve03 kernel:  zfsdev_ioctl_common+0x5b2/0x820 [zfs]
Feb 06 08:48:54 pve03 kernel:  zfsdev_ioctl+0x54/0xe0 [zfs]
Feb 06 08:50:55 pve03 kernel:  spa_all_configs+0x3b/0x120 [zfs]
Feb 06 08:50:55 pve03 kernel:  zfs_ioc_pool_configs+0x1b/0x70 [zfs]
Feb 06 08:50:55 pve03 kernel:  zfsdev_ioctl_common+0x5b2/0x820 [zfs]
Feb 06 08:50:55 pve03 kernel:  zfsdev_ioctl+0x54/0xe0 [zfs]
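
Those repeating kernel lines look like fragments of blocked-task stack traces (they recur roughly every two minutes); if so, the full traces, including the "task ... blocked for more than N seconds" header, should be retrievable with something like:

Code:
journalctl -k -b | grep -B 20 'spa_all_configs'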

In dmesg this seems interesting:


Code:
[    2.339973] ata7: SATA link down (SStatus 0 SControl 300)
[    2.340224] ata6: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[    2.340434] ata4: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[    2.340658] ata5: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[    2.340901] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[    2.341139] ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[    2.341364] ata8: SATA link down (SStatus 0 SControl 300)
[    2.341574] ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
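ata7 and ata8 report the link as down, and the pool shows one UNAVAIL disk, so it may be worth mapping the ataN ports to block devices and serial numbers to confirm which physical SSD disappeared (the exact sysfs paths vary by controller):

Code:
ls -l /sys/block/sd*                     # sysfs path contains the ataN port of each disk
ls -l /dev/disk/by-id/ | grep SAMSUNG    # serial number to /dev/sdX mapping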
 
Hello,

Thank you for the output!

Have you checked the health of the disks using smartctl?

Does importing the PVE03 pool using the -f flag return the same issue? zpool import -f PVE03
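For example, per disk (sdX is a placeholder for each pool member):

Code:
smartctl -H /dev/sdX    # overall health assessment
smartctl -a /dev/sdX    # full attributes and device error log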
 
Have you checked the health of the disks using smartctl?

All disks pass the smartctl checks ...


[Attached screenshot: Schermata 2023-02-06 alle 13.45.46.jpg]

I intentionally formatted /dev/sda to see if ZFS responds as I expect. Before that, the pool was not shown as degraded (but was still unable to mount).

Does importing the PVE03 pool using the -f flag return the same issue? zpool import -f PVE03


It fails as before.
I also tried:

zpool import -XF -m -f -o PVE03 -> same I/O error
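
For reference, the read-only variants of the rewind import would look like this; whether they behave any differently here is not guaranteed, but readonly=on at least avoids writing anything back to the damaged pool:

Code:
zpool import -f -o readonly=on PVE03
zpool import -f -F -o readonly=on PVE03     # rewind to an earlier consistent txg
zpool import -f -FX -o readonly=on PVE03    # extreme rewind, last resort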
 