I have a Proxmox node that currently runs only one VM. The VM runs a MySQL server, and on the Proxmox host I have configured Linux software RAID 10 over four Samsung 840 Pro SSDs. Speed is awesome, in the VM as well, so MySQL is running better than expected!
It's been running like that for a couple of weeks without any issues until today.
I noticed that one MyISAM table in the MySQL VM got corrupted. I had to run "repair table tbl" to fix it, and everything has been up and running again since. But then I started investigating and noticed this in the VM's kernel log:
Code:
(My message got too long for this forum to accept, so I'm going to post it right below in a reply.)
So I logged in to Proxmox and checked its kernel logs, and I see this:
Code:
(My message got too long for this forum to accept, so I'm going to post it right below in a second reply.)
So the Proxmox host was busy with lzop. If I'm not mistaken, this is the compression process that runs while taking backups. And yes, the kernel log entries are from the time the backups are supposed to run, so it seems related to the backups.
I have no idea why it suddenly became a problem, but I know for sure that disk speed shouldn't be an issue, as these SSDs aren't used that heavily, especially not during the backup window.
The call trace mentions NFS, and it's true that our backups are stored on a NAS that Proxmox reaches over NFS. So I'm thinking that maybe the switch is failing and the NFS traffic got blocked for some reason?
That could explain the error on the Proxmox host itself. But it doesn't explain why I get these 'blocked for 120 seconds' messages inside the VM.
Inside the VM, the call trace seems to be about the filesystem. So it makes me think the VM had trouble accessing its disk?
If both the VM and the Proxmox host had trouble accessing the disk, that could explain why I got that message in both the host and the VM.
But the host never reported a disk issue or anything like that. It even completed a full software RAID 10 data-check right before the backup started:
Code:
md: data-check of RAID array md0
md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
md: using 128k window, over a total of 499855360k.
md: md0: data-check done.
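In case it helps to compare the two logs, here is a minimal sketch of how the hung-task messages on the host and in the VM could be lined up by time. It assumes syslog-style lines (as in /var/log/kern.log) with an English month abbreviation. The function names, the hostnames "pve" and "db", the sample lines, the year and the 5-minute window are all made up for illustration and are not taken from my actual logs.

```python
import re
from datetime import datetime, timedelta

# Matches syslog-style kernel lines such as:
#   "Jan  5 02:14:07 host kernel: INFO: task lzop:4242 blocked for more than 120 seconds."
# An optional "[ 1234.567890]" uptime stamp after "kernel:" is skipped by ".*?".
HUNG_TASK = re.compile(
    r"^(\w{3})\s+(\d{1,2})\s+(\d{2}:\d{2}:\d{2})\s+\S+\s+kernel:.*?"
    r"INFO: task (\S+):(\d+) blocked for more than (\d+) seconds"
)

def find_hung_tasks(lines, year):
    """Return (timestamp, task, pid, seconds) for every hung-task line.

    Syslog lines carry no year, so the caller has to supply one.
    """
    events = []
    for line in lines:
        m = HUNG_TASK.search(line)
        if not m:
            continue
        month, day, clock, task, pid, secs = m.groups()
        # %b assumes an English (C) locale for the month abbreviation.
        when = datetime.strptime(f"{year} {month} {day} {clock}", "%Y %b %d %H:%M:%S")
        events.append((when, task, int(pid), int(secs)))
    return events

def close_pairs(host_events, vm_events, window=timedelta(minutes=5)):
    """Pair host and VM events whose timestamps differ by at most `window`."""
    return [
        (h, v)
        for h in host_events
        for v in vm_events
        if abs(h[0] - v[0]) <= window
    ]

# Made-up sample lines, only to show the expected input shape.
host_log = [
    "Jan  5 02:14:07 pve kernel: INFO: task lzop:4242 blocked for more than 120 seconds.",
    "Jan  5 02:14:07 pve kernel: md: md0: data-check done.",
]
vm_log = [
    "Jan  5 02:16:30 db kernel: [ 1234.567890] INFO: task mysqld:999 blocked for more than 120 seconds.",
]

host_events = find_hung_tasks(host_log, 2014)
vm_events = find_hung_tasks(vm_log, 2014)
for host_event, vm_event in close_pairs(host_events, vm_events):
    print(host_event[0], host_event[1], "<->", vm_event[0], vm_event[1])
```

This only works if the host and VM clocks are in sync. If the VM messages consistently appear shortly after the host ones, that would point at the host side stalling first.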
So I'm puzzled about what happened here.
Can anyone help me find out what went wrong?