KVM losing disk drives

PretoX

Hi guys,

# pveversion -v
proxmox-ve: 4.4-76 (running kernel: 4.4.21-1-pve)
pve-manager: 4.4-2 (running version: 4.4-2/80259e05)
pve-kernel-4.4.6-1-pve: 4.4.6-48
pve-kernel-4.4.13-1-pve: 4.4.13-56
pve-kernel-4.4.35-1-pve: 4.4.35-76
pve-kernel-4.2.6-1-pve: 4.2.6-36
pve-kernel-4.4.13-2-pve: 4.4.13-58
pve-kernel-4.4.21-1-pve: 4.4.21-71
pve-kernel-4.2.8-1-pve: 4.2.8-41
pve-kernel-4.4.19-1-pve: 4.4.19-66
pve-kernel-4.4.10-1-pve: 4.4.10-54
lvm2: 2.02.116-pve3
corosync-pve: 2.4.0-1
libqb0: 1.0-1
pve-cluster: 4.0-48
qemu-server: 4.0-102
pve-firmware: 1.1-10
libpve-common-perl: 4.0-84
libpve-access-control: 4.0-19
libpve-storage-perl: 4.0-70
pve-libspice-server1: 0.12.8-1
vncterm: 1.2-1
pve-docs: 4.4-1
pve-qemu-kvm: 2.7.0-9
pve-container: 1.0-89
pve-firewall: 2.0-33
pve-ha-manager: 1.0-38
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u2
lxc-pve: 2.0.6-2
lxcfs: 2.0.5-pve1
criu: 1.6.0-1
novnc-pve: 0.5-8
smartmontools: 6.5+svn4324-1~pve80
zfsutils: 0.6.5.8-pve13~bpo80
Sorry, I have no text logs yet, only screenshots.
My problem is that sometimes, once every 2-3 days, VMs crash with the messages shown in the attached screenshots:
1.png 2.png 3.png 4.png 5.png
First: we got messages about e1000 errors, so I changed the NICs to RTL ones and the errors disappeared.
Second: from what I can see, at least the swap drive was lost. I also see the CPU loaded with I/O wait before the VM panics, which also points to a drive issue; changing the scheduler away from cfq didn't help.
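For reference, this is roughly how the scheduler change was done inside the guest (sda is just an example device, and deadline is only one possible choice):

cat /sys/block/sda/queue/scheduler               # list available schedulers, current one shown in brackets
echo deadline > /sys/block/sda/queue/scheduler   # switch away from cfq at runtime

This change is not persistent; the elevator= kernel boot parameter would be needed to make it stick across reboots.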

And yesterday a FreeBSD VM crashed:
6.png
We are running on ZFS and using pve-zsync for backups. My thought was that pve-zsync was overloading the HW RAID, since we run backups over bonded interfaces at maximum backup speed, so I lowered it to 20 MB/s, but still got a crash.

I could suspect a CloudLinux 6.8/CentOS 6 kernel 2.6 issue, as I saw reports about kernel 2.6 misbehaving under KVM, but they are pretty old, several years by now, and marked as solved.

Please help.

I just thought: could changing the storage type from ZFS to qcow2 files help?
 
FreeBSD seems to be complaining that disk IO is taking too long.

On Linux I see the OOM killer trying to free up RAM, so that VM was likely doing lots of disk IO to swap. I have disabled swap on nearly every VM, because when they start using swap it causes IO issues and performance goes to crap. On Linux VMs I've added zram, and if I see OOM events I add RAM or change settings to prevent the excess usage. If your Proxmox host is using swap, that can cause all sorts of strange issues, especially for VMs whose RAM got swapped to disk. My software testers complained every Monday because the idle testing servers got swapped to disk over the weekend and response times were horrible. Removing swap eliminated all sorts of random performance issues, but I do have the luxury of most servers having 64 GB of RAM or more.
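Roughly, turning swap off inside a Linux guest comes down to the following; the sed pattern is just one way to comment out the fstab entry:

swapoff -a                                  # turn off all active swap devices immediately
sed -i '/\sswap\s/ s/^/#/' /etc/fstab       # comment out the swap line so it stays off after reboot
free -m                                     # confirm the Swap line now shows 0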

Looks like maybe your storage is not performing at the level required by your VMs.
Maybe the sync process is partly to blame, maybe ZFS needs tuning, or maybe your hardware is simply not fast enough.

What does pveperf output when you run it against your zfs storage?
 
What does pveperf output when you run it against your zfs storage?

Thank you for your reply. This is a brand-new Dell server. And no, my VMs do not use swap, although it is enabled.
RAID-6 + ZFS with an SSD cache

# pveperf /VM-DATA
CPU BOGOMIPS: 153621.60
REGEX/SECOND: 1841587
HD SIZE: 2794.25 GB (VM-DATA)
FSYNCS/SECOND: 2079.90

which is only slightly less than the SSD drives I have

# pveperf
CPU BOGOMIPS: 153621.60
REGEX/SECOND: 2037255
HD SIZE: 155.36 GB (rpool/ROOT/pve-1)
FSYNCS/SECOND: 2244.64

Just got another VM crash; this time I have the full log. No backups were running during the crash. Part of the logs:
<4>[249920.223451] sd 2:0:1:0: [sdb] ABORT operation started
<4>[249920.223508] sd 2:0:1:0: ABORT operation failed.
<4>[249920.223617] sd 2:0:1:0: [sdb] DEVICE RESET operation started
<6>[249920.223978] scsi target2:0:1: control msgout: c.
<5>[249920.224039] scsi target2:0:1: has been reset
<4>[249920.224101] sd 2:0:1:0: DEVICE RESET operation complete.
<4>[249982.003452] sd 2:0:0:0: [sda] ABORT operation started
<4>[249982.003462] sd 2:0:0:0: ABORT operation failed.
<4>[249982.003699] sd 2:0:0:0: [sda] DEVICE RESET operation started
<4>[249982.003728] sd 2:0:0:0: DEVICE RESET operation complete.
<6>[249982.004219] scsi target2:0:0: control msgout: c.
<5>[249982.004307] scsi target2:0:0: has been reset
<4>[249982.004821] sd 2:0:0:0: [sda] BUS RESET operation started
<4>[249982.007173] sd 2:0:0:0: BUS RESET operation complete.
<4>[249982.007196] sym0: SCSI BUS reset detected.
<5>[249982.017121] sym0: SCSI BUS has been reset.
<4>[249982.018274] sym0: unknown interrupt(s) ignored, ISTAT=0x1 DSTAT=0x80 SIST=0x0
<4>[249992.012966] ------------[ cut here ]------------


I'll upload the full dump to CloudLinux support; maybe they can help. If the crashdump would help here, I can upload it too.
 
Maybe you can just post the vm.conf files.
agent: 1
balloon: 0
boot: cdn
bootdisk: scsi0
cores: 4
ide2: none,media=cdrom
memory: 32768
name: plesk2
net0: rtl8139=62:64:64:35:xx:xx,bridge=vmbr0
net1: rtl8139=3A:35:33:30:xx:xx,bridge=vmbr0
numa: 0
onboot: 1
ostype: l26
parent: Test
protection: 1
scsi0: VM-DATA:vm-102-disk-1,size=100G
scsi1: VM-DATA:vm-102-disk-2,size=200G
smbios1: uuid=31f1b79b-d7fe-4fc2-8baf-239acad853b9
sockets: 2
tablet: 0

I tried switching ballooning off, but got a crash again last night.

I also opened a support ticket with the CloudLinux support team, but I'm pretty sure it's a Proxmox+KVM issue...
 
It's not specific to btrfs; simply accessing the hard drive, even via dd, produces errors.
I just happen to have reads/writes on this btrfs file system, so it shows up as btrfs failing.
 
So the CloudLinux guys proposed a new kernel:

We have released a kernel that includes the needed patch, please consider updating and rebooting to start using it:

yum install kernel-2.6.32-673.26.1.lve1.4.21.el6 kmod-lve-1.4-21.el6 --enablerepo=cloudlinux-updates-testing

But it didn't help. For now I blame this: https://github.com/zfsonlinux/zfs/issues/4345
Testing limits for the ARC now.
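Roughly what that ARC testing looks like on the host; the 8 GiB figure is just an example value, not one taken from this setup:

echo 8589934592 > /sys/module/zfs/parameters/zfs_arc_max               # cap the ARC at 8 GiB at runtime
echo "options zfs zfs_arc_max=8589934592" > /etc/modprobe.d/zfs.conf   # make the limit persistent
update-initramfs -u                                                    # rebuild the initramfs so it also applies at boot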
 
I recently converted a VM to VirtIO SCSI and it crashed during a vzdump backup. The KVM process vanished.

was this by chance with the 4.4.35-1 kernel?

@PretoX: so if I understand correctly, this is an issue with VMs using outdated kernels?
 
was this by chance with the 4.4.35-1 kernel?

@PretoX: so if I understand correctly, this is an issue with VMs using outdated kernels?

No, my last post was with a new test kernel built according to my CloudLinux support case. They made kernel updates based on our crashdumps

VM crashed again today on 4.4.35-2-pve
This was not happening on 4.3

# pveversion -v
proxmox-ve: 4.4-78 (running kernel: 4.4.35-2-pve)
pve-manager: 4.4-5 (running version: 4.4-5/c43015a5)
pve-kernel-4.4.6-1-pve: 4.4.6-48
pve-kernel-4.4.13-1-pve: 4.4.13-56
pve-kernel-4.4.35-1-pve: 4.4.35-77
pve-kernel-4.2.6-1-pve: 4.2.6-36
pve-kernel-4.4.13-2-pve: 4.4.13-58
pve-kernel-4.4.35-2-pve: 4.4.35-78
pve-kernel-4.4.21-1-pve: 4.4.21-71
pve-kernel-4.2.8-1-pve: 4.2.8-41
pve-kernel-4.4.19-1-pve: 4.4.19-66
pve-kernel-4.4.10-1-pve: 4.4.10-54
lvm2: 2.02.116-pve3
corosync-pve: 2.4.0-1
libqb0: 1.0-1
pve-cluster: 4.0-48
qemu-server: 4.0-102
pve-firmware: 1.1-10
libpve-common-perl: 4.0-85
libpve-access-control: 4.0-19
libpve-storage-perl: 4.0-71
pve-libspice-server1: 0.12.8-1
vncterm: 1.2-1
pve-docs: 4.4-1
pve-qemu-kvm: 2.7.0-10
pve-container: 1.0-90
pve-firewall: 2.0-33
pve-ha-manager: 1.0-38
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u3
lxc-pve: 2.0.6-5
lxcfs: 2.0.5-pve2
criu: 1.6.0-1
novnc-pve: 0.5-8
smartmontools: 6.5+svn4324-1~pve80
zfsutils: 0.6.5.8-pve13~bpo80
 
No, my last post was with a new test kernel built according to my CloudLinux support case. They made kernel updates based on our crashdumps

the patch you linked has been included in vanilla linux since 4.3.. so you are talking about the guest kernel here?

VM crashed again today on 4.4.35-2-pve
This was not happening on 4.3

I assume you mean PVE 4.3 here? (this "closeness" of kernel and PVE version numbers can sometimes be confusing, sorry).

If so, I would suggest testing with an older qemu version (e.g., 2.6.x) to narrow down possible culprits.

Do you see anything in the host logs? Are you monitoring the memory and I/O situation on the host?
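
For the QEMU downgrade test, something along these lines should work on the host; the version string below is only a placeholder, use whatever apt-cache policy actually lists:

apt-cache policy pve-qemu-kvm                      # show the versions available from the configured repositories
apt-get install pve-qemu-kvm=<older-2.6.x-build>   # install a specific older build picked from that list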
 
the patch you linked has been included in vanilla linux since 4.3.. so you are talking about the guest kernel here?



I assume you mean PVE 4.3 here? (this "closeness" of kernel and PVE version numbers can sometimes be confusing, sorry).

If so, I would suggest testing with an older qemu version (e.g., 2.6.x) to narrow down possible culprits.

Do you see anything in the host logs? Are you monitoring the memory and I/O situation on the host?

Yes, sorry, the patch is for the guest CloudLinux kernel.

Yes, I do monitor RAM (about 90 GB of 190 GB is used), and IO delay sits at around 16% while the VM goes down.
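A rough sketch of the host-side monitoring, using standard tools (nothing specific to this box):

zpool iostat -v 5    # per-vdev IOPS and bandwidth, refreshed every 5 seconds
iostat -xm 5         # per-device utilisation and await times (from the sysstat package)
free -m              # host memory and swap usage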

As far as I remember the VM was still stable on Nov 23, 2016; I'm not sure which kernel was the latest at that time.

Here's the VM bootup log; during these messages the VM is completely stuck.
Feb 1 10:13:22 PM1-BNE2 pvedaemon[35464]: start VM 102: UPID:pM1-BNE2:00008A88:04249E8F:58912822:qmstart:102:pretox@pam:
Feb 1 10:13:22 PM1-BNE2 systemd[1]: Starting 102.scope.
Feb 1 10:13:22 PM1-BNE2 systemd[1]: Started 102.scope.
Feb 1 10:13:22 PM1-BNE2 kernel: [695059.736954] device tap102i0 entered promiscuous mode
Feb 1 10:13:22 PM1-BNE2 kernel: [695059.742538] vmbr0: port 28(tap102i0) entered forwarding state
Feb 1 10:13:22 PM1-BNE2 kernel: [695059.742561] vmbr0: port 28(tap102i0) entered forwarding state
Feb 1 10:13:23 PM1-BNE2 kernel: [695060.176056] device tap102i1 entered promiscuous mode
Feb 1 10:13:23 PM1-BNE2 kernel: [695060.181259] vmbr0: port 29(tap102i1) entered forwarding state
Feb 1 10:13:23 PM1-BNE2 kernel: [695060.181281] vmbr0: port 29(tap102i1) entered forwarding state
Feb 1 10:13:25 PM1-BNE2 kernel: [695062.317207] kvm: zapping shadow pages for mmio generation wraparound
Feb 1 10:13:25 PM1-BNE2 kernel: [695062.319280] kvm: zapping shadow pages for mmio generation wraparound
Feb 1 10:13:55 PM1-BNE2 kernel: [695092.288111] kvm [35472]: vcpu0 unhandled rdmsr: 0xce
Feb 1 10:13:55 PM1-BNE2 kernel: [695092.427009] kvm [35472]: vcpu1 unhandled rdmsr: 0xce
Feb 1 10:13:55 PM1-BNE2 kernel: [695092.458670] kvm [35472]: vcpu2 unhandled rdmsr: 0xce
Feb 1 10:13:55 PM1-BNE2 kernel: [695092.490319] kvm [35472]: vcpu3 unhandled rdmsr: 0xce
Feb 1 10:13:55 PM1-BNE2 kernel: [695092.522057] kvm [35472]: vcpu4 unhandled rdmsr: 0xce
Feb 1 10:13:55 PM1-BNE2 kernel: [695092.553692] kvm [35472]: vcpu5 unhandled rdmsr: 0xce
Feb 1 10:13:55 PM1-BNE2 kernel: [695092.585286] kvm [35472]: vcpu6 unhandled rdmsr: 0xce
Feb 1 10:13:55 PM1-BNE2 kernel: [695092.617008] kvm [35472]: vcpu7 unhandled rdmsr: 0xce
 
