VM hdd only bootable with cache=writethrough

udo

Distinguished Member
Apr 22, 2009
Ahrensburg; Germany
Hi,
now I have a second system where the default cache=none produces VMs that won't boot.

The big question is how to track down the issue. Luckily this second system is my home and test server, so I can do a lot more to investigate.

Strange thing: the server has a hardware RAID controller (Areca 1210) with 3 volumes (pve, sata and sata2 - all RAID-5 with 3 disks), and the behaviour only happens on one RAID volume (sdb/sata1/vg:sata).

I have a bigger LV on the VG sata, mounted as local storage (mainly for an NFS server) and formatted ext4 because of the size and fsck time. With the disk image as a raw file on this local storage, and with the disk image as an LV inside the sata VG, the same thing happens: no boot with cache=none! The same issue also occurs with a mounted ext3 LV from this VG.
But as a raw file on the local storage (RAID volume sda) or on the 3rd volume (LVM storage sata2 - RAID volume sdc), the VM boots fine even with cache=none!
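For reference, the only workaround I have so far is setting the cache mode directly on the disk line in /etc/pve/qemu-server/<vmid>.conf - just a sketch with a placeholder VMID and volume name, not my exact config:
Code:
virtio0: sata:vm-101-disk-1,cache=writethrough
The same thing can be set from the GUI or with qm set on the disk option.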

Version:
Code:
pve-manager: 2.2-19 (pve-manager/2.2/b8238244)
running kernel: 2.6.32-15-pve
proxmox-ve-2.6.32: 2.2-78
pve-kernel-2.6.32-11-pve: 2.6.32-66
pve-kernel-2.6.32-14-pve: 2.6.32-74
pve-kernel-2.6.32-15-pve: 2.6.32-78
lvm2: 2.02.95-1pve2
clvm: 2.02.95-1pve2
corosync-pve: 1.4.3-1
openais-pve: 1.1.4-2
libqb: 0.10.1-2
redhat-cluster-pve: 3.1.92-3
resource-agents-pve: 3.9.2-3
fence-agents-pve: 3.1.8-1
pve-cluster: 1.0-28
qemu-server: 2.0-59
pve-firmware: 1.0-19
libpve-common-perl: 1.0-33
libpve-access-control: 1.0-25
libpve-storage-perl: 2.0-32
vncterm: 1.0-3
vzctl: 3.0.30-2pve5
vzprocps: 2.0.11-2
vzquota: 3.0.12-3
pve-qemu-kvm: 1.2-6
ksm-control-daemon: 1.1-1
Volumes (drives):
Code:
cli64 vsf info
  # Name             Raid Name       Level   Capacity Ch/Id/Lun  State         
===============================================================================
  1 sata1            sata_3tb_disks  Raid5   4000.0GB 00/00/01   Normal
  2 pve              sata_3tb_disks  Raid5    100.0GB 00/00/00   Normal
  3 sata2            sata_3tb_disks  Raid5   1900.0GB 00/00/02   Normal
===============================================================================
vginfo of the volume group where cache must be writethrough:
Code:
 --- Volume group ---
  VG Name               sata
  System ID             
  Format                lvm2
  Metadata Areas        1
  Metadata Sequence No  5
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                3
  Open LV               1
  Max PV                0
  Cur PV                1
  Act PV                1
  VG Size               3,64 TiB
  PE Size               4,00 MiB
  Total PE              953673
  Alloc PE / Size       813824 / 3,10 TiB
  Free  PE / Size       139849 / 546,29 GiB
  VG UUID               aq1YV7-GdSC-7cz9-Y8JQ-FqMj-vNls-9MZ0lK
storage.cfg:
Code:
lvm: sata
        vgname sata
        content images

lvm: sata2
        vgname sata2
        content images

dir: local
        path /var/lib/vz
        content images,iso,vztmpl,rootdir
        maxfiles 0

dir: local-sata
        path /mnt/local-sata
        content images,rootdir
        maxfiles 1

dir: test
        path /mnt/test-sata
        content images
        maxfiles 1
mount:
Code:
none /sys sysfs rw,nosuid,nodev,noexec,relatime 0 0
none /proc proc rw,nosuid,nodev,noexec,relatime 0 0
none /dev devtmpfs rw,relatime,size=4060340k,nr_inodes=1015085,mode=755 0 0
none /dev/pts devpts rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000 0 0
/dev/mapper/pve-root / ext3 rw,relatime,errors=remount-ro,barrier=0,data=ordered 0 0
tmpfs /lib/init/rw tmpfs rw,nosuid,relatime,mode=755 0 0
tmpfs /dev/shm tmpfs rw,nosuid,nodev,relatime 0 0
/dev/mapper/pve-data /var/lib/vz ext3 rw,relatime,errors=continue,barrier=0,data=ordered 0 0
/dev/sda1 /boot ext3 rw,relatime,errors=continue,barrier=0,data=ordered 0 0
/dev/mapper/sata-local /mnt/local-sata ext4 rw,relatime,barrier=1,data=ordered 0 0
fusectl /sys/fs/fuse/connections fusectl rw,relatime 0 0
/dev/fuse /etc/pve fuse rw,nosuid,nodev,relatime,user_id=0,group_id=0,default_permissions,allow_other 0 0
beancounter /proc/vz/beancounter cgroup rw,relatime,blkio,name=beancounter 0 0
container /proc/vz/container cgroup rw,relatime,freezer,devices,name=container 0 0
fairsched /proc/vz/fairsched cgroup rw,relatime,cpuacct,cpu,cpuset,name=fairsched 0 0
/mnt/local-sata/private/100 /var/lib/vz/root/100 simfs rw,relatime 0 0
proc /var/lib/vz/root/100/proc proc rw,relatime 0 0
sysfs /var/lib/vz/root/100/sys sysfs rw,relatime 0 0
nfsd /var/lib/vz/root/100/proc/fs/nfsd nfsd rw,relatime 0 0
sunrpc /var/lib/vz/root/100/var/lib/nfs/rpc_pipefs rpc_pipefs rw,relatime 0 0
tmpfs /var/lib/vz/root/100/lib/init/rw tmpfs rw,nosuid,relatime,size=393216k,nr_inodes=98304,mode=755 0 0
tmpfs /var/lib/vz/root/100/dev/shm tmpfs rw,nosuid,nodev,relatime,size=393216k,nr_inodes=98304 0 0
devpts /var/lib/vz/root/100/dev/pts devpts rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000 0 0
/dev/mapper/sata-test /mnt/test-sata ext3 rw,relatime,errors=continue,barrier=0,data=ordered 0 0
The logs (messages + syslog) show nothing about this.

Any hints on how to find the issue?

Udo
 
Hi,
one thing I just noticed: on both servers where this happens, the disks are bigger than 2TB!!
Can this be a problem in some cases? But I also have big volumes on some other PVE hosts without trouble (though as FC/iSCSI/DRBD - not direct local storage...).

Udo
 
This seems to be the problem.
The volume has 4k blocks:
Code:
(parted) print all
Model: Areca pve (scsi)
Disk /dev/sda: 100GB
Sector size (logical/physical): 512B/512B
Partition Table: msdos

Number  Start   End    Size    Type     File system  Flags
 1      1049kB  537MB  536MB   primary  ext3         boot
 2      537MB   100GB  99,5GB  primary               lvm


Model: Areca sata1 (scsi)
Disk /dev/sdb: 4000GB
Sector size (logical/physical): 4096B/4096B
Partition Table: gpt

Number  Start   End     Size    File system  Name  Flags
 1      1049kB  4000GB  4000GB               lvm   lvm


Model: Areca sata2 (scsi)
Disk /dev/sdc: 1900GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt

Number  Start   End     Size   File system  Name  Flags
 1      1049kB  900GB   900GB               lvm
 2      900GB   1400GB  500GB               lvm
 3      1400GB  1900GB  500GB               lvm
Other big volumes, where no problem occurs, have 512B blocks:
Code:
Model: IFT DS S16E-G2240 (scsi)
Disk /dev/sdo: 16.0TB
Sector size (logical/physical): 512B/512B
Partition Table: gpt
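If it helps with reproducing this, the same logical/physical sector sizes can also be read without parted - just a sketch, using the device names from my box; the values in the comments are what I would expect:
Code:
blockdev --getss --getpbsz /dev/sdb           # logical / physical sector size of sata1 -> 4096 / 4096
blockdev --getss --getpbsz /dev/sdc           # sata2 -> 512 / 512
cat /sys/block/sdb/queue/logical_block_size   # 4096
cat /sys/block/sdb/queue/physical_block_size  # 4096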

I've tried spirit's trick with 4k for virtio (http://forum.proxmox.com/threads/11316-Advanced-disk-format?p=62108#post62108), but the issue is the same.

Udo
 
Bug report on Red Hat Bugzilla (from 2010):
https://bugzilla.redhat.com/show_bug.cgi?id=616509

Code:
"[COLOR=#000000]Description of problem:[/COLOR]
O_DIRECT (cache=none) requires sector size alignment and size for IO requests,which is currently hard coded within QEMU to be 512, ie. the most commonsector size.New hard drives will have 4k sectors, and is one of the things we explicitlylist as a feature in RHEL6.We need to examine the scope of this problem and determine whether it is indeed a real problem. If it is, users will face unexplained errors whentrying to run KVM with images on 4k drives."
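That alignment requirement is easy to see outside of QEMU too. A sketch with a non-destructive direct-I/O read (device names as above; the exact error text may differ):
Code:
# O_DIRECT transfers must be a multiple of the logical sector size
dd if=/dev/sdb of=/dev/null bs=512 count=1 iflag=direct    # fails on the 4K volume with "Invalid argument"
dd if=/dev/sdb of=/dev/null bs=4096 count=1 iflag=direct   # works
dd if=/dev/sdc of=/dev/null bs=512 count=1 iflag=direct    # works on the 512B volume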
 
Hi,
that's not correct. The BIOS sees the disk but prints the error:
Code:
Booting from Hard Disk...
Boot failed: could not read the boot disk
After that the BIOS tries to do a network boot.

Udo

So, this is a disk with an OS already installed from a previous install?

What happens if you try to create a new disk on this storage? (Try to add one to a running VM, check if you can run fdisk on it, and see what is displayed.)
 
Hi Spirit,
the hdd isn't really accessible.
I started a new VM with a grml live distro and can partition the hdd:
Code:
# fdisk -l
Disk /dev/vda: 4294 MB, 4294967296 bytes
1 heads, 32 sectors/track, 262144 cylinders
Units = cylinders of 32 * 512 = 16384 bytes
Disk identifier: 0x52183693

   Device Boot      Start         End      Blocks   Id  System
/dev/vda1              65      262144     4193280   83  Linux
A filesystem can also be created, and fsck says everything is OK:
Code:
# mkfs.ext3 /dev/vda1
mke2fs 1.41.11 (14-Mar-2010)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=0 blocks, Stripe width=0 blocks
262144 inodes, 1048320 blocks
52416 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=1073741824
32 block groups
32768 blocks per group, 32768 fragments per group
8192 inodes per group
Superblock backups stored on blocks: 
        32768, 98304, 163840, 229376, 294912, 819200, 884736

Writing inode tables: done                            
Creating journal (16384 blocks): done
Writing superblocks and filesystem accounting information: done

This filesystem will be automatically checked every 25 mounts or
180 days, whichever comes first.  Use tune2fs -c or -i to override.

# fsck -f /dev/vda1
fsck from util-linux-ng 2.16.2
e2fsck 1.41.11 (14-Mar-2010)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/vda1: 11/262144 files (0.0% non-contiguous), 34911/1048320 blocks
But mounting fails:
Code:
# mount /dev/vda1 /mnt/vda1
mount: wrong fs type, bad option, bad superblock on /dev/vda1,
       missing codepage or helper program, or other error
       In some cases useful info is found in syslog - try
       dmesg | tail  or so

# mount -t ext3 /dev/vda1 /mnt/vda1
mount: wrong fs type, bad option, bad superblock on /dev/vda1,
       missing codepage or helper program, or other error
       In some cases useful info is found in syslog - try
       dmesg | tail  or so
dmesg shows a read error:
Code:
[  929.713053] end_request: I/O error, dev vda, sector 2050
[  929.713069] EXT3-fs (vda1): error: unable to read superblock
But with dd it's OK:
Code:
# dd if=/dev/vda1 of=test bs=1024k count=128
128+0 records in
128+0 records out
134217728 bytes (134 MB) copied, 0,544312 s, 247 MB/s

# file test
test: Linux rev 1.0 ext3 filesystem data, UUID=642af4c7-7e5c-4fdd-8c57-c3181522c32e (large files)
Strange, or not?
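One possible reading of those numbers (my assumption, not confirmed anywhere here): the partition starts at sector 2048, so the superblock read on mount is a small 1 KiB request at sector 2050, which does not start on a 4K boundary of the backing device and gets rejected on the O_DIRECT path, while the big, page-aligned requests from dd, mkfs and fsck succeed:
Code:
echo $(( 2050 * 512 % 4096 ))   # -> 1024, i.e. the superblock read is not 4K-aligned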

Udo
 
@udo: I am trying to find a hard disk which exposes native 4K sectors.

So what disk Vendor/Model do you use to get:

Sector size (logical/physical): 4096B/4096B
 
Hi Dietmar,
in this case it's a RAID volume. With the Areca RAID controller I can select how a volume > 2TB is presented to the system (64bit LBA(?) or 4k sectors).

Udo
 

And why do you select 4K if you can use 512?
 
Hi,
now I see it was a mistake - but I think the new disks (3TB and bigger) also have 4k sectors, so I could use the same format with the RAID controller...

With the RAID controller I can switch to 512B blocks (not easy, only by rebuilding the raidset, which takes a long time), but with new single disks I would have to use 4k.

Udo
 
I have tried to find such disks, but it seems all disks emulate 512-byte sectors (they use 4K sectors internally).

Please tell me if you can find a disk which supports "4Kn".
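For what it's worth, a quick way to tell the cases apart on any box (a sketch; sdX is whatever disk you want to check):
Code:
# logical 512 / physical 512   -> classic 512n disk
# logical 512 / physical 4096  -> 512e (4K internally, 512B emulation towards the host)
# logical 4096                 -> 4Kn, the case that bites O_DIRECT here
cat /sys/block/sdX/queue/logical_block_size /sys/block/sdX/queue/physical_block_size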
 
Hi Udo

about:
Code:
# fdisk -l
Disk /dev/vda: 4294 MB, 4294967296 bytes
1 heads, 32 sectors/track, 262144 cylinders
Units = cylinders of 32 * 512 = 16384 bytes
Disk identifier: 0x52183693

Device Boot      Start         End      Blocks   Id  System
/dev/vda1              65      262144     4193280   83  Linux

if you use a virtual 4K drive (-device virtio-blk-pci,logical_block_size=4096,physical_block_size=4096),

the partition is misaligned; it should start at sector 64 and not 65.
I don't know if this can be the problem with cache=none.
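A quick way to check that from inside the guest would be something like this (a sketch; align-check needs a reasonably recent parted):
Code:
parted /dev/vda align-check minimal 1    # does partition 1 meet the device's minimal alignment?
parted /dev/vda align-check optimal 1
cat /sys/block/vda/vda1/start            # start in 512-byte units; a multiple of 8 means 4K-aligned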
 
