Hi.
Last night, following a power cut, my home server booted up, but failed to start one of its VMs, as some of the zvols didn't mount. A further reboot cured the problem, so I'm just posting here in case the information is useful.
The server is an HP Microserver (N40L, Gen 7) with 16GB of ECC RAM. Storage is ZFS. The four drive bays contain a mirrored pair of Intel 320 SSDs ("rpool") and a mirrored pair of Hitachi 2TB SATA drives ("bigtank"), both attached to the onboard disk controller. A Dell PERC H200 card is also fitted, with a lone Intel 320 ("singletank"). All of the datasets affected by the issue I'm reporting are on the root pool. Software is PVE 4.4-13 with the "pve-no-subscription" repo enabled and all updates applied. SMART shows no problems on any of the disks, all of the zpools have been scrubbed recently, and rpool has been scrubbed again since the issue; no scrub has ever reported a problem. The server has been in its current hardware configuration for about 6 months and runs 24/7.
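(For reference, these are roughly the checks I mean; /dev/sda here is just an example device name.)
smartctl -a /dev/sda        # SMART health for one of the rpool SSDs
zpool status -v rpool       # pool state and result of the last scrub
zpool scrub rpool           # start another scrub of the root pool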
Only one VM is set to start on boot. Three LXC containers also start on boot, but they do essentially no work unless a client connects to them.
The events:
After the power came back on, I noticed that the VM wasn't running, so I logged in to the web UI. The start task for VM 109 had failed with the following message:
kvm: -drive file=/dev/zvol/rpool/data/vm-109-disk-1,if=none,id=drive-virtio0,format=raw,cache=none,aio=native,detect-zeroes=on: Could not open '/dev/zvol/rpool/data/vm-109-disk-1': No such file or directory
So I opened a console and looked at /dev/zvol/rpool/data, and what I got (I'm only showing the entries for 109 here) was:
lrwxrwxrwx 1 root root 15 May 20 03:43 vm-109-disk-1-part1 -> ../../../zd16p1
lrwxrwxrwx 1 root root 15 May 20 03:43 vm-109-disk-1-part5 -> ../../../zd16p5
What I should get is this:
lrwxrwxrwx 1 root root 13 May 20 03:43 vm-109-disk-1 -> ../../../zd16
lrwxrwxrwx 1 root root 15 May 20 03:43 vm-109-disk-1-part1 -> ../../../zd16p1
lrwxrwxrwx 1 root root 15 May 20 03:43 vm-109-disk-1-part5 -> ../../../zd16p5
... which includes the entry for vm-109-disk-1 that the start task is referring to.
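In hindsight, before rebooting I could probably have confirmed that only the udev symlink was missing (rather than the zvol itself) and then asked udev to re-process the block devices. Something along these lines, though I haven't tested it against this exact failure:
/lib/udev/zvol_id /dev/zd16                               # should print rpool/data/vm-109-disk-1; this is the helper that timed out
zfs list -t volume -o name,volsize rpool/data/vm-109-disk-1   # the zvol itself should still be listed
udevadm trigger --subsystem-match=block --action=change   # re-run the udev rules for all block devices
udevadm settle --timeout=30                               # wait for the event queue to drain
ls -l /dev/zvol/rpool/data/ | grep 109                    # check whether the missing symlink has reappeared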
Next I went looking in the syslog for entries about zd16, or anything else that looked like a problem, and found this:
May 20 02:36:06 pve1 systemd-udevd[2681]: timeout '/lib/udev/zvol_id /dev/zd0p5'
May 20 02:36:06 pve1 systemd-udevd[2691]: timeout '/lib/udev/zvol_id /dev/zd112p5'
May 20 02:36:06 pve1 systemd-udevd[2677]: timeout '/lib/udev/zvol_id /dev/zd0p1'
May 20 02:36:06 pve1 systemd-udevd[2688]: timeout '/lib/udev/zvol_id /dev/zd32'
May 20 02:36:06 pve1 systemd-udevd[2683]: timeout '/lib/udev/zvol_id /dev/zd16'
May 20 02:36:06 pve1 systemd-udevd[2689]: timeout '/lib/udev/zvol_id /dev/zd112p1'
May 20 02:36:06 pve1 systemd-udevd[2683]: timeout '/lib/udev/vdev_id -d zd16'
May 20 02:36:06 pve1 systemd-udevd[2686]: timeout '/lib/udev/vdev_id -d zd80'
May 20 02:36:06 pve1 systemd-udevd[2688]: timeout '/lib/udev/vdev_id -d zd32'
All of these are zvols which should be linked from /dev/zvol/rpool/data.
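If it happens again, I assume the first thing to check is whether the udev helpers are genuinely stuck or whether the event queue is just slow to drain; something like this should show it (just a guess on my part):
ps aux | grep '[z]vol_id'       # any zvol_id helpers still running long after boot?
udevadm settle --timeout=60     # returns promptly if the udev event queue is empty
journalctl -b -u systemd-udevd  # full udevd log for the current boot, for more context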
I decided to reboot the server, and this time there were no timeouts and everything started up fine.
The only other odd thing I noticed while the issue was occurring was that the web UI reported a steady "IO delay" figure of 50%. I couldn't figure out what was causing it: zpool iostat showed almost no disk I/O occurring (as expected, since no VMs were running and the three LXC containers were idle), all zpools were responsive, and zpool status showed no problems. There are no other storage devices attached at all.
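For what it's worth, these are the things I can think of checking next time the IO delay shows up (nothing PVE-specific, just stock tools; iostat needs the sysstat package):
top                                             # the 'wa' figure in the CPU line should match the web UI's IO delay
iostat -x 1                                     # per-device utilisation and wait times
ps -eo pid,stat,wchan:30,cmd | awk '$2 ~ /D/'   # anything stuck in uninterruptible (D state) sleep?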
I'd be interested to know what else I could have done to examine the cause of the IO delay, in case I should see the issue again.
Thanks,
Nick