My VM doesn't boot anymore. The VM disk is on a ZFS pool. SMART passed. Any ideas?

Also show the output of lsblk (from the host)
Code:
NAME                         MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
sda                            8:0    0 931.5G  0 disk
├─sda1                         8:1    0 931.5G  0 part
└─sda9                         8:9    0     8M  0 part
sdb                            8:16   0 931.5G  0 disk
├─sdb1                         8:17   0 931.5G  0 part
└─sdb9                         8:25   0     8M  0 part
sdc                            8:32   0 931.5G  0 disk
├─sdc1                         8:33   0 931.5G  0 part
└─sdc9                         8:41   0     8M  0 part
sdd                            8:48   0 931.5G  0 disk
├─sdd1                         8:49   0 931.5G  0 part
└─sdd9                         8:57   0     8M  0 part
sde                            8:64   0 931.5G  0 disk
├─sde1                         8:65   0 931.5G  0 part
└─sde9                         8:73   0     8M  0 part
sdf                            8:80   0 931.5G  0 disk
├─sdf1                         8:81   0 931.5G  0 part
└─sdf9                         8:89   0     8M  0 part
zd0                          230:0    0     1M  0 disk
zd16                         230:16   0   3.5T  0 disk
├─zd16p1                     230:17   0     1G  0 part
└─zd16p2                     230:18   0   3.5T  0 part
nvme0n1                      259:0    0 476.9G  0 disk
├─nvme0n1p1                  259:1    0  1007K  0 part
├─nvme0n1p2                  259:2    0     1G  0 part /boot/efi
└─nvme0n1p3                  259:3    0 475.9G  0 part
  ├─pve-swap                 252:0    0     8G  0 lvm  [SWAP]
  ├─pve-root                 252:1    0    96G  0 lvm  /
  ├─pve-data_tmeta           252:2    0   3.6G  0 lvm 
  │ └─pve-data-tpool         252:4    0 348.8G  0 lvm 
  │   ├─pve-data             252:5    0 348.8G  1 lvm 
  │   ├─pve-vm--100--disk--0 252:6    0     4M  0 lvm 
  │   ├─pve-vm--100--disk--1 252:7    0   100G  0 lvm 
  │   ├─pve-vm--100--disk--2 252:8    0     4M  0 lvm 
  │   ├─pve-vm--102--disk--0 252:9    0     4M  0 lvm 
  │   ├─pve-vm--102--disk--1 252:10   0   100G  0 lvm 
  │   └─pve-vm--102--disk--2 252:11   0     4M  0 lvm 
  └─pve-data_tdata           252:3    0 348.8G  0 lvm 
    └─pve-data-tpool         252:4    0 348.8G  0 lvm 
      ├─pve-data             252:5    0 348.8G  1 lvm 
      ├─pve-vm--100--disk--0 252:6    0     4M  0 lvm 
      ├─pve-vm--100--disk--1 252:7    0   100G  0 lvm 
      ├─pve-vm--100--disk--2 252:8    0     4M  0 lvm 
      ├─pve-vm--102--disk--0 252:9    0     4M  0 lvm 
      ├─pve-vm--102--disk--1 252:10   0   100G  0 lvm 
      └─pve-vm--102--disk--2 252:11   0     4M  0 lvm

Here is the output for lsblk
 
Are you able to mount the zvol directly (when the VM is off)? Something like ...

Code:
mkdir /mnt/testmp
mount /dev/zvol/Nextcloud/vm-101-disk-1 /mnt/testmp
ls /mnt/testmp
umount /mnt/testmp

Do you mind posting also:

Code:
zfs get all Nextcloud

Here is the output for zfs get all Nextcloud

Code:
NAME       PROPERTY              VALUE                  SOURCE
Nextcloud  type                  filesystem             -
Nextcloud  creation              Tue Mar 12  1:56 2024  -
Nextcloud  used                  3.52T                  -
Nextcloud  available             0B                     -
Nextcloud  referenced            192K                   -
Nextcloud  compressratio         1.15x                  -
Nextcloud  mounted               yes                    -
Nextcloud  quota                 none                   default
Nextcloud  reservation           none                   default
Nextcloud  recordsize            128K                   default
Nextcloud  mountpoint            /Nextcloud             default
Nextcloud  sharenfs              off                    default
Nextcloud  checksum              on                     default
Nextcloud  compression           on                     local
Nextcloud  atime                 on                     default
Nextcloud  devices               on                     default
Nextcloud  exec                  on                     default
Nextcloud  setuid                on                     default
Nextcloud  readonly              off                    default
Nextcloud  zoned                 off                    default
Nextcloud  snapdir               hidden                 default
Nextcloud  aclmode               discard                default
Nextcloud  aclinherit            restricted             default
Nextcloud  createtxg             1                      -
Nextcloud  canmount              on                     default
Nextcloud  xattr                 on                     default
Nextcloud  copies                1                      default
Nextcloud  version               5                      -
Nextcloud  utf8only              off                    -
Nextcloud  normalization         none                   -
Nextcloud  casesensitivity       sensitive              -
Nextcloud  vscan                 off                    default
Nextcloud  nbmand                off                    default
Nextcloud  sharesmb              off                    default
Nextcloud  refquota              none                   default
Nextcloud  refreservation        none                   default
Nextcloud  guid                  7669563316942828168    -
Nextcloud  primarycache          all                    default
Nextcloud  secondarycache        all                    default
Nextcloud  usedbysnapshots       0B                     -
Nextcloud  usedbydataset         192K                   -
Nextcloud  usedbychildren        3.52T                  -
Nextcloud  usedbyrefreservation  0B                     -
Nextcloud  logbias               latency                default
Nextcloud  objsetid              54                     -
Nextcloud  dedup                 off                    default
Nextcloud  mlslabel              none                   default
Nextcloud  sync                  standard               default
Nextcloud  dnodesize             legacy                 default
Nextcloud  refcompressratio      1.00x                  -
Nextcloud  written               192K                   -
Nextcloud  logicalused           137G                   -
Nextcloud  logicalreferenced     42K                    -
Nextcloud  volmode               default                default
Nextcloud  filesystem_limit      none                   default
Nextcloud  snapshot_limit        none                   default
Nextcloud  filesystem_count      none                   default
Nextcloud  snapshot_count        none                   default
Nextcloud  snapdev               hidden                 default
Nextcloud  acltype               off                    default
Nextcloud  context               none                   default
Nextcloud  fscontext             none                   default
Nextcloud  defcontext            none                   default
Nextcloud  rootcontext           none                   default
Nextcloud  relatime              on                     default
Nextcloud  redundant_metadata    all                    default
Nextcloud  overlay               on                     default
Nextcloud  encryption            off                    default
Nextcloud  keylocation           none                   default
Nextcloud  keyformat             none                   default
Nextcloud  pbkdf2iters           0                      default
Nextcloud  special_small_blocks  0                      default
Nextcloud  prefetch              all                    default
 
Are you able to mount the zvol directly (when the VM is off)? Something like ...

Code:
mkdir /mnt/testmp
mount /dev/zvol/Nextcloud/vm-101-disk-1 /mnt/testmp
ls /mnt/testmp
umount /mnt/testmp

I did this and I was able to mount it and I saw the data, so I copied the data somewhere else just in case.
I destroyed the pool and restored an old backup that I had. The restore succeeded, the VM booted up again, and it's working fine, but when I go to the System log I can still see a LOT of the same errors about the ZFS pool, so it looks like it will most likely happen again soon, like a time bomb. How can I fix this?

Here are the same errors from today:

Code:
Jul 18 04:32:23 Proxmox kernel: buffer_io_error: 77 callbacks suppressed
Jul 18 04:32:23 Proxmox kernel: Buffer I/O error on dev zd0, logical block 65, lost async page write
Jul 18 04:32:23 Proxmox kernel: Buffer I/O error on dev zd0, logical block 66, lost async page write
Jul 18 04:32:23 Proxmox kernel: Buffer I/O error on dev zd0, logical block 67, lost async page write
Jul 18 04:32:23 Proxmox kernel: Buffer I/O error on dev zd0, logical block 68, lost async page write
Jul 18 04:32:23 Proxmox kernel: Buffer I/O error on dev zd0, logical block 69, lost async page write
Jul 18 04:32:23 Proxmox kernel: Buffer I/O error on dev zd0, logical block 70, lost async page write
Jul 18 04:32:23 Proxmox kernel: Buffer I/O error on dev zd0, logical block 71, lost async page write
Jul 18 04:32:23 Proxmox kernel: Buffer I/O error on dev zd0, logical block 72, lost async page write
Jul 18 04:32:23 Proxmox kernel: Buffer I/O error on dev zd0, logical block 73, lost async page write
Jul 18 04:32:23 Proxmox kernel: Buffer I/O error on dev zd0, logical block 74, lost async page write
Jul 18 04:32:33 Proxmox kernel: buffer_io_error: 57 callbacks suppressed
Jul 18 04:32:33 Proxmox kernel: Buffer I/O error on dev zd0, logical block 10, lost async page write
Jul 18 04:32:33 Proxmox kernel: Buffer I/O error on dev zd0, logical block 9, lost async page write
Jul 18 04:32:33 Proxmox kernel: Buffer I/O error on dev zd0, logical block 2, lost async page write
Jul 18 04:32:33 Proxmox kernel: Buffer I/O error on dev zd0, logical block 12, lost async page write
Jul 18 04:32:33 Proxmox kernel: Buffer I/O error on dev zd0, logical block 11, lost async page write
Jul 18 04:32:33 Proxmox kernel: Buffer I/O error on dev zd0, logical block 7, lost async page write
Jul 18 04:32:33 Proxmox kernel: Buffer I/O error on dev zd0, logical block 6, lost async page write
Jul 18 04:32:33 Proxmox kernel: Buffer I/O error on dev zd0, logical block 8, lost async page write
Jul 18 04:32:33 Proxmox kernel: Buffer I/O error on dev zd0, logical block 14, lost async page write
Jul 18 04:32:33 Proxmox kernel: Buffer I/O error on dev zd0, logical block 15, lost async page write
Jul 18 04:32:38 Proxmox kernel: buffer_io_error: 77 callbacks suppressed
Jul 18 04:32:38 Proxmox kernel: Buffer I/O error on dev zd0, logical block 28, lost async page write
Jul 18 04:32:38 Proxmox kernel: Buffer I/O error on dev zd0, logical block 32, lost async page write
Jul 18 04:32:38 Proxmox kernel: Buffer I/O error on dev zd0, logical block 29, lost async page write
Jul 18 04:32:38 Proxmox kernel: Buffer I/O error on dev zd0, logical block 33, lost async page write
Jul 18 04:32:38 Proxmox kernel: Buffer I/O error on dev zd0, logical block 39, lost async page write
Jul 18 04:32:38 Proxmox kernel: Buffer I/O error on dev zd0, logical block 36, lost async page write
Jul 18 04:32:38 Proxmox kernel: Buffer I/O error on dev zd0, logical block 34, lost async page write
Jul 18 04:32:38 Proxmox kernel: Buffer I/O error on dev zd0, logical block 35, lost async page write
Jul 18 04:32:38 Proxmox kernel: Buffer I/O error on dev zd0, logical block 42, lost async page write
Jul 18 04:32:38 Proxmox kernel: Buffer I/O error on dev zd0, logical block 51, lost async page write
Jul 18 04:32:48 Proxmox kernel: buffer_io_error: 19 callbacks suppressed
Jul 18 04:32:48 Proxmox kernel: Buffer I/O error on dev zd0, logical block 58, lost async page write
Jul 18 04:32:48 Proxmox kernel: Buffer I/O error on dev zd0, logical block 59, lost async page write
Jul 18 04:32:48 Proxmox kernel: Buffer I/O error on dev zd0, logical block 60, lost async page write
Jul 18 04:32:48 Proxmox kernel: Buffer I/O error on dev zd0, logical block 61, lost async page write
Jul 18 04:32:48 Proxmox kernel: Buffer I/O error on dev zd0, logical block 62, lost async page write
Jul 18 04:32:48 Proxmox kernel: Buffer I/O error on dev zd0, logical block 63, lost async page write
Jul 18 04:32:48 Proxmox kernel: Buffer I/O error on dev zd0, logical block 65, lost async page write
Jul 18 04:32:48 Proxmox kernel: Buffer I/O error on dev zd0, logical block 66, lost async page write
Jul 18 04:32:48 Proxmox kernel: Buffer I/O error on dev zd0, logical block 67, lost async page write
Jul 18 04:32:48 Proxmox kernel: Buffer I/O error on dev zd0, logical block 68, lost async page write
Jul 18 04:33:09 Proxmox kernel: buffer_io_error: 63 callbacks suppressed
Jul 18 04:33:09 Proxmox kernel: Buffer I/O error on dev zd0, logical block 1, lost async page write
Jul 18 04:33:09 Proxmox kernel: Buffer I/O error on dev zd0, logical block 3, lost async page write
Jul 18 04:33:09 Proxmox kernel: Buffer I/O error on dev zd0, logical block 2, lost async page write
Jul 18 04:33:09 Proxmox kernel: Buffer I/O error on dev zd0, logical block 5, lost async page write
Jul 18 04:33:09 Proxmox kernel: Buffer I/O error on dev zd0, logical block 122, lost async page write
Jul 18 04:33:09 Proxmox kernel: Buffer I/O error on dev zd0, logical block 4, lost async page write
Jul 18 04:33:09 Proxmox kernel: Buffer I/O error on dev zd0, logical block 123, lost async page write
Jul 18 04:33:09 Proxmox kernel: Buffer I/O error on dev zd0, logical block 124, lost async page write
Jul 18 04:33:09 Proxmox kernel: Buffer I/O error on dev zd0, logical block 127, lost async page write
Jul 18 04:33:09 Proxmox kernel: Buffer I/O error on dev zd0, logical block 126, lost async page write
 
I think you should run a memtest on your RAM. Some of it may be faulty. Is it ECC or not?
 
I did this and I was able to mount it and I saw the data, so I copied the data somewhere else just in case.
I destroyed the pool

I really wish you had tried to create a file on that mounted filesystem while watching dmesg. Now we do not really know whether the condition is related to what's in the log.
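
For next time, this is roughly what I had in mind (just a sketch, using the same test mountpoint as your earlier mount command; adjust paths as needed):

Code:
dmesg -wT &                                        # stream kernel messages in the background with timestamps
dd if=/dev/urandom of=/mnt/testmp/writetest bs=1M count=100 oflag=direct
sync                                               # flush writes, watch for buffer_io_error lines appearing
rm /mnt/testmp/writetest
kill %1                                            # stop following dmesg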

Here are the same errors from today:

Code:
Jul 18 04:32:23 Proxmox kernel: buffer_io_error: 77 callbacks suppressed
Jul 18 04:32:23 Proxmox kernel: Buffer I/O error on dev zd0, logical block 65, lost async page write

What you could do at least now is e.g. post logs (everything occurring, not just the error you got alarmed about) from the time period, such as:
journalctl --since "2024-07-15 03:00:00" --until "2024-07-16 03:00:00" > output_to_attach.log

Be sure to include at least one full sequence from host boot-up through to reboot. You may want to redact sensitive values in the log, but do not remove any lines. There might be other giveaways in the log.
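
If picking exact timestamps is a hassle, exporting complete boots directly works too (assuming a persistent journal; the output file names are arbitrary), e.g.:

Code:
journalctl --list-boots                      # see which boots the journal still has
journalctl -b -1 > previous_boot.log         # everything from the previous boot, start to finish
journalctl -b 0 > current_boot.log           # everything from the current boot so far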

Non ECC.
I will test the RAM and I'll report back.

Can you share more on the hardware setup as a whole? Not just drives but MB, CPU, RAM layout, etc.

The issue is that you have a good eye for one entry in the log, but as you now see, it might even be unrelated.

One more thing, since you appear happy to keep testing (it's only 133GB there as USEDDS): would you mind trying something other than RAIDZ2 if you have to re-create the pool later anyway?
 
Can you share more on the hardware setup as a whole? Not just drives but MB, CPU, RAM layout, etc.
CPU: Intel(R) Core(TM) i5-9500T CPU @ 2.20GHz
RAM: Team T-Force 32GB (16GBx2) - 3200 CL 16-20-20-40
MB: ASRock B365M Pro4 LGA1151
 
What you could do at least now is e.g. post logs (everything occurring, not just the error you got alarmed about) from the time period, such as:
journalctl --since "2024-07-15 03:00:00" --until "2024-07-16 03:00:00" > output_to_attach.log

I'm going to try this when I get home but for now, here's a log of a full boot that I copied from Proxmox > Syslog.
I'm doing a full backup of the whole thing before I start testing.
 
That's fast - how did you test it?
Not really that fast. I started it right when you posted almost 3 hours ago.
I used the MemTest from the Proxmox ISO.
After the test, when I saw I was still getting the same errors, I even swapped the RAM with the pair from my gaming PC, and even after that I got the same errors, so it's confirmed that it's not a RAM issue.
 
Ignore the above.

:) You can just delete a post if it was a mistake and it's within about 5 minutes of posting (and no one has replied yet). I think you are in a big rush. That's never good when you want to find the cause of something. If you absolutely need this up and running, I suggest just making a simple 2-drive mirror for your 130G Nextcloud for now and experimenting with the rest of the drives.

It would be really interesting to find out whether this occurs only with RAIDZ2, for example.
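
If you do try that, creating a simple mirror from the shell would look roughly like this (the pool name and by-id paths are placeholders, substitute two of your actual drives):

Code:
zpool create -o ashift=12 NextcloudMirror mirror \
    /dev/disk/by-id/ata-DISK_A /dev/disk/by-id/ata-DISK_B

You would then add it as ZFS storage under Datacenter > Storage before restoring the VM onto it.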
 
:) You can just delete a post if it was a mistake and it's within about 5 minutes of posting (and no one has replied yet). I think you are in a big rush. That's never good when you want to find the cause of something. If you absolutely need this up and running, I suggest just making a simple 2-drive mirror for your 130G Nextcloud for now and experimenting with the rest of the drives.

It would be really interesting to find out whether this occurs only with RAIDZ2, for example.
Just deleted it.
If I try to restore it to something smaller than 3.82TB, it doesn't let me. That's why I used RAIDZ2: it gave me more capacity with some redundancy. I don't have a bigger HDD at the moment. What else can I use that gives me the same amount of storage or more?

I'm not in a rush. I was just trying to give an update that I thought would be relevant for you to help me. I apologize if it felt rushed.
I deleted the zfs again and now I'm restoring the backup I made before testing. I'll try to mount it after that.
 
How did you create the RAIDZ2 originally to begin with?

I clicked Datacenter > Proxmox (that's the name of my server, so creative I know) > Disks > ZFS > Create: ZFS
I clicked all the drives I wanted to include and then selected raidz2

Then I restored the backup by clicking the Datacenter > Proxmox > Backup (Proxmox) > Backups > backup_name > Restore
 
Just deleted it.
If I try to restore it to something smaller than 3.82TB, it doesn't let me. That's why I used RAIDZ2: it gave me more capacity with some redundancy. I don't have a bigger HDD at the moment. What else can I use that gives me the same amount of storage or more?

I'm not in a rush. I was just trying to give an update that I thought would be relevant for you to help me. I apologize if it felt rushed.
I deleted the zfs again and now I'm restoring the backup I made before testing. I'll try to mount it after that.

I might have misunderstood as well. You say you are "deleting" the ZFS again and again above, but are you actually destroying the pool each time? That's what I thought: copying the 130G aside ... re-creating a new pool ... and putting it back? Please do not do anything yet, just let me know exactly what you are doing when you say "delete zfs" ...
 
I clicked Datacenter > Proxmox (that's the name of my server, so creative I know) > Disks > ZFS > Create: ZFS
I clicked all the drives I wanted to include and then selected raidz2

Then I restored the backup by clicking the Datacenter > Proxmox > Backup (Proxmox) > Backups > backup_name > Restore

Do you have access to the full logs (not just the current boot), i.e. can you SSH in and run the --since/--until journalctl command from above? I really would like to see the command PVE uses to create the raidz2 pool. I just have a weird hunch there's something off with the full refreservation, but I can't quite be sure - and I myself hate trying this and that as a matter of troubleshooting.
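
My guess is the GUI wizard boils down to roughly the following (the device paths here are placeholders, not what was actually used - the journal should show the exact call):

Code:
zpool create -o ashift=12 Nextcloud raidz2 \
    /dev/disk/by-id/ata-DISK_1 /dev/disk/by-id/ata-DISK_2 /dev/disk/by-id/ata-DISK_3 \
    /dev/disk/by-id/ata-DISK_4 /dev/disk/by-id/ata-DISK_5 /dev/disk/by-id/ata-DISK_6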

But that said, do you mind changing the refreservation value?

Something like:
Code:
zfs set refreserv=3T Nextcloud/vm-101-disk-1
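
Also, could you post the values before and after the change? E.g. something like:

Code:
zfs get volsize,refreservation,used,available Nextcloud/vm-101-disk-1
zfs get available Nextcloud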
 
Code:
NAME                         MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
sda                            8:0    0 931.5G  0 disk
├─sda1                         8:1    0 931.5G  0 part
└─sda9                         8:9    0     8M  0 part
sdb                            8:16   0 931.5G  0 disk
├─sdb1                         8:17   0 931.5G  0 part
└─sdb9                         8:25   0     8M  0 part
sdc                            8:32   0 931.5G  0 disk
├─sdc1                         8:33   0 931.5G  0 part
└─sdc9                         8:41   0     8M  0 part
sdd                            8:48   0 931.5G  0 disk
├─sdd1                         8:49   0 931.5G  0 part
└─sdd9                         8:57   0     8M  0 part
sde                            8:64   0 931.5G  0 disk
├─sde1                         8:65   0 931.5G  0 part
└─sde9                         8:73   0     8M  0 part
sdf                            8:80   0 931.5G  0 disk
├─sdf1                         8:81   0 931.5G  0 part
└─sdf9                         8:89   0     8M  0 part
zd0                          230:0    0     1M  0 disk
zd16                         230:16   0   3.5T  0 disk
├─zd16p1                     230:17   0     1G  0 part
└─zd16p2                     230:18   0   3.5T  0 part
nvme0n1                      259:0    0 476.9G  0 disk
├─nvme0n1p1                  259:1    0  1007K  0 part
├─nvme0n1p2                  259:2    0     1G  0 part /boot/efi
└─nvme0n1p3                  259:3    0 475.9G  0 part
  ├─pve-swap                 252:0    0     8G  0 lvm  [SWAP]
  ├─pve-root                 252:1    0    96G  0 lvm  /
  ├─pve-data_tmeta           252:2    0   3.6G  0 lvm
  │ └─pve-data-tpool         252:4    0 348.8G  0 lvm
  │   ├─pve-data             252:5    0 348.8G  1 lvm
  │   ├─pve-vm--100--disk--0 252:6    0     4M  0 lvm
  │   ├─pve-vm--100--disk--1 252:7    0   100G  0 lvm
  │   ├─pve-vm--100--disk--2 252:8    0     4M  0 lvm
  │   ├─pve-vm--102--disk--0 252:9    0     4M  0 lvm
  │   ├─pve-vm--102--disk--1 252:10   0   100G  0 lvm
  │   └─pve-vm--102--disk--2 252:11   0     4M  0 lvm
  └─pve-data_tdata           252:3    0 348.8G  0 lvm
    └─pve-data-tpool         252:4    0 348.8G  0 lvm
      ├─pve-data             252:5    0 348.8G  1 lvm
      ├─pve-vm--100--disk--0 252:6    0     4M  0 lvm
      ├─pve-vm--100--disk--1 252:7    0   100G  0 lvm
      ├─pve-vm--100--disk--2 252:8    0     4M  0 lvm
      ├─pve-vm--102--disk--0 252:9    0     4M  0 lvm
      ├─pve-vm--102--disk--1 252:10   0   100G  0 lvm
      └─pve-vm--102--disk--2 252:11   0     4M  0 lvm

Here is the output for lsblk

I just noticed something here ... VM 101 is your Nextcloud, but ... it has no disk on the NVMe storage ... so now I get it: you use that 6-drive pool as BOTH the storage and the boot drive of the Nextcloud VM, right?
 
I might have misunderstood as well. You say you are "deleting" the ZFS again and again above, but are you actually destroying the pool each time? That's what I thought: copying the 130G aside ... re-creating a new pool ... and putting it back? Please do not do anything yet, just let me know exactly what you are doing when you say "delete zfs" ...

When I say I was deleting the ZFS, I meant:

Left-click the Nextcloud VM in the left side panel > More > Remove > Confirm, then go to Datacenter > Proxmox > Disks > ZFS > Nextcloud > More > Destroy > Confirm, then go back to Disks > click /dev/sda > Wipe Disk > Confirm, and repeat that last step for every disk that was in the ZFS pool.

To recreate it, I just followed the steps from #36
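
I guess on the command line that would roughly correspond to this (with my VM ID 101 and disks sda-sdf, just as a sketch):

Code:
qm destroy 101                                   # remove the Nextcloud VM and its disks
zpool destroy Nextcloud                          # destroy the pool
for d in /dev/sd{a..f}; do wipefs -a "$d"; done  # wipe the old member disks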
 
