Halts after upgrade to Proxmox 4.2

ktecho

Active Member
Jun 6, 2016
Hi there,

I ran Proxmox 3.4 for two years without a single problem. Two weeks ago I upgraded it to Proxmox 4.2. To my surprise, everything went smoothly, and all 4 of my virtual machines (KVM) started up and worked as always. So congrats to the Proxmox guys for the smooth transition.

Just one day after the migration, I received an email from SoYouStart (OVH) telling me that my machine had stopped replying to pings:

PING nsxxxxxx.ip-xx-yy-zz.eu (94.23.oo.pp) from 213.186.zz.xx : 56(84) bytes of data.
From 213.186.zz.xx: Destination Host Unreachable
From 213.186.zz.xx: Destination Host Unreachable
From 213.186.zz.xx: Destination Host Unreachable

I couldn't reach the Proxmox machine or any of the VMs over SSH. The only thing I could do was reboot the Proxmox machine from my control panel, and after two minutes everything was working fine again. This has happened 3 times in these two weeks.

I've done some log crawling and checked the status of the hardware RAID, but I don't know where the problem could be. I've found another user who says he had the same problem and had to go back to Proxmox 3.4.

Right now I'm using the latest kernel, pve-kernel-4.4.8-1-pve (4.4.8-52), but I think I had been running -51 since the upgrade. I've read that -52 fixes some network problems, so it shouldn't be too important.

Is there anywhere I can look to work out what's happening?
 
I get this in "dmesg", but from what I've read in some places around the Internet, it shouldn't be too important:

[ 15.029498] ACPI Warning: SystemIO range 0x000000000000F040-0x000000000000F05F conflicts with OpRegion 0x000000000000F040-0x000000000000F04F (\_SB_.PCI0.SBUS.SMBI) (20150930/utaddress-254)
[ 15.029502] ACPI: If an ACPI driver is available for this device, you should use it instead of the native driver
[ 15.034041] EDAC ie31200: No ECC support
[ 15.060346] ACPI Warning: SystemIO range 0x0000000000000428-0x000000000000042F conflicts with OpRegion 0x0000000000000400-0x000000000000047F (\PMIO) (20150930/utaddress-254)
[ 15.060350] ACPI: If an ACPI driver is available for this device, you should use it instead of the native driver
[ 15.060353] ACPI Warning: SystemIO range 0x0000000000000540-0x000000000000054F conflicts with OpRegion 0x0000000000000500-0x0000000000000563 (\GPIO) (20150930/utaddress-254)
[ 15.060355] ACPI Warning: SystemIO range 0x0000000000000540-0x000000000000054F conflicts with OpRegion 0x0000000000000500-0x000000000000057F (\_SB_.PCI0.LPCB.GPBX) (20150930/utaddress-254)
[ 15.060357] ACPI: If an ACPI driver is available for this device, you should use it instead of the native driver
[ 15.060358] ACPI Warning: SystemIO range 0x0000000000000530-0x000000000000053F conflicts with OpRegion 0x0000000000000500-0x0000000000000563 (\GPIO) (20150930/utaddress-254)
[ 15.060360] ACPI Warning: SystemIO range 0x0000000000000530-0x000000000000053F conflicts with OpRegion 0x0000000000000500-0x000000000000057F (\_SB_.PCI0.LPCB.GPBX) (20150930/utaddress-254)
[ 15.060362] ACPI: If an ACPI driver is available for this device, you should use it instead of the native driver
[ 15.060362] ACPI Warning: SystemIO range 0x0000000000000500-0x000000000000052F conflicts with OpRegion 0x0000000000000500-0x0000000000000563 (\GPIO) (20150930/utaddress-254)
[ 15.060364] ACPI Warning: SystemIO range 0x0000000000000500-0x000000000000052F conflicts with OpRegion 0x0000000000000500-0x000000000000057F (\_SB_.PCI0.LPCB.GPBX) (20150930/utaddress-254)
[ 15.060366] ACPI: If an ACPI driver is available for this device, you should use it instead of the native driver
 
This morning it happened again. Any clue on where to look?

Thanks.
 
I had similar problems with 4.2 on OVH dedicated servers with hardware RAID. I downgraded to 3.4 and now everything works perfectly.
 
Then there's a possibility that the problem is in the kernel, right?

Could any Proxmox developer help us investigate this?

Thanks.
 
The server I used was idle, running a few test LXC containers and one KVM machine. The server is an Enterprise MG-128 (128G, 2x E5-2630v3) with RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS 2208 [Thunderbolt] (rev 05)
 
Thanks for the information. My server is a different one from SoYouStart: 32G E3-1245v2 HardRAID 2x2 TB

The only thing that could be related is the LSI RAID, although I'm using a different one:

Symbios Logic SAS2004 PCI-Express Fusion-MPT SAS-2 [Spitfire] (rev 03)
 
Wolfgang Bumiller, in the bug tracking system, pointed me to enabling the journal. I've just done that, and in it I can see the output below. I've read in some places (links at the end) that it could be the problem. Could you please tell me whether it could be?

Jun 07 12:06:29 ns204651 kernel: MTRR default type: uncachable
Jun 07 12:06:29 ns204651 kernel: MTRR fixed ranges enabled:
Jun 07 12:06:29 ns204651 kernel: 00000-9FFFF write-back
Jun 07 12:06:29 ns204651 kernel: A0000-BFFFF uncachable
Jun 07 12:06:29 ns204651 kernel: C0000-D7FFF write-protect
Jun 07 12:06:29 ns204651 kernel: D8000-E7FFF uncachable
Jun 07 12:06:29 ns204651 kernel: E8000-FFFFF write-protect
Jun 07 12:06:29 ns204651 kernel: MTRR variable ranges enabled:
Jun 07 12:06:29 ns204651 kernel: 0 base 000000000 mask 800000000 write-back
Jun 07 12:06:29 ns204651 kernel: 1 base 800000000 mask FE0000000 write-back
Jun 07 12:06:29 ns204651 kernel: 2 base 0E0000000 mask FE0000000 uncachable
Jun 07 12:06:29 ns204651 kernel: 3 base 0DE000000 mask FFE000000 uncachable
Jun 07 12:06:29 ns204651 kernel: 4 base 0DD000000 mask FFF000000 uncachable
Jun 07 12:06:29 ns204651 kernel: 5 base 81FE00000 mask FFFE00000 uncachable
Jun 07 12:06:29 ns204651 kernel: 6 disabled
Jun 07 12:06:29 ns204651 kernel: 7 disabled
Jun 07 12:06:29 ns204651 kernel: 8 disabled
Jun 07 12:06:29 ns204651 kernel: 9 disabled
Jun 07 12:06:29 ns204651 kernel: x86/PAT: Configuration [0-7]: WB WC UC- UC WB WC UC- WT
Jun 07 12:06:29 ns204651 kernel: original variable MTRRs
Jun 07 12:06:29 ns204651 kernel: reg 0, base: 0GB, range: 32GB, type WB
Jun 07 12:06:29 ns204651 kernel: reg 1, base: 32GB, range: 512MB, type WB
Jun 07 12:06:29 ns204651 kernel: reg 2, base: 3584MB, range: 512MB, type UC
Jun 07 12:06:29 ns204651 kernel: reg 3, base: 3552MB, range: 32MB, type UC
Jun 07 12:06:29 ns204651 kernel: reg 4, base: 3536MB, range: 16MB, type UC
Jun 07 12:06:29 ns204651 kernel: reg 5, base: 33278MB, range: 2MB, type UC
Jun 07 12:06:29 ns204651 kernel: total RAM covered: 32718M
Jun 07 12:06:29 ns204651 kernel: gran_size: 64K chunk_size: 64K num_reg: 10 lose cover RAM: 254M
Jun 07 12:06:29 ns204651 kernel: gran_size: 64K chunk_size: 128K num_reg: 10 lose cover RAM: 254M
Jun 07 12:06:29 ns204651 kernel: gran_size: 64K chunk_size: 256K num_reg: 10 lose cover RAM: 254M
Jun 07 12:06:29 ns204651 kernel: gran_size: 64K chunk_size: 512K num_reg: 10 lose cover RAM: 254M
Jun 07 12:06:29 ns204651 kernel: gran_size: 64K chunk_size: 1M num_reg: 10 lose cover RAM: 254M
Jun 07 12:06:29 ns204651 kernel: gran_size: 64K chunk_size: 2M num_reg: 10 lose cover RAM: 254M
Jun 07 12:06:29 ns204651 kernel: *BAD*gran_size: 64K chunk_size: 4M num_reg: 10 lose cover RAM: -2M
Jun 07 12:06:29 ns204651 kernel: *BAD*gran_size: 64K chunk_size: 8M num_reg: 10 lose cover RAM: -2M
Jun 07 12:06:29 ns204651 kernel: *BAD*gran_size: 64K chunk_size: 16M num_reg: 10 lose cover RAM: -2M
Jun 07 12:06:29 ns204651 kernel: gran_size: 64K chunk_size: 32M num_reg: 10 lose cover RAM: 510M
Jun 07 12:06:29 ns204651 kernel: gran_size: 64K chunk_size: 64M num_reg: 10 lose cover RAM: 0G
Jun 07 12:06:29 ns204651 kernel: gran_size: 64K chunk_size: 128M num_reg: 10 lose cover RAM: 0G
Jun 07 12:06:29 ns204651 kernel: gran_size: 64K chunk_size: 256M num_reg: 10 lose cover RAM: 0G
Jun 07 12:06:29 ns204651 kernel: gran_size: 64K chunk_size: 512M num_reg: 10 lose cover RAM: 0G
Jun 07 12:06:29 ns204651 kernel: gran_size: 64K chunk_size: 1G num_reg: 10 lose cover RAM: 0G
Jun 07 12:06:29 ns204651 kernel: *BAD*gran_size: 64K chunk_size: 2G num_reg: 10 lose cover RAM: -1G
Jun 07 12:06:29 ns204651 kernel: gran_size: 128K chunk_size: 128K num_reg: 10 lose cover RAM: 254M
Jun 07 12:06:29 ns204651 kernel: gran_size: 128K chunk_size: 256K num_reg: 10 lose cover RAM: 254M
Jun 07 12:06:29 ns204651 kernel: gran_size: 128K chunk_size: 512K num_reg: 10 lose cover RAM: 254M
Jun 07 12:06:29 ns204651 kernel: gran_size: 128K chunk_size: 1M num_reg: 10 lose cover RAM: 254M
Jun 07 12:06:29 ns204651 kernel: gran_size: 128K chunk_size: 2M num_reg: 10 lose cover RAM: 254M
Jun 07 12:06:29 ns204651 kernel: *BAD*gran_size: 128K chunk_size: 4M num_reg: 10 lose cover RAM: -2M
Jun 07 12:06:29 ns204651 kernel: *BAD*gran_size: 128K chunk_size: 8M num_reg: 10 lose cover RAM: -2M
Jun 07 12:06:29 ns204651 kernel: *BAD*gran_size: 128K chunk_size: 16M num_reg: 10 lose cover RAM: -2M
Jun 07 12:06:29 ns204651 kernel: gran_size: 128K chunk_size: 32M num_reg: 10 lose cover RAM: 510M
Jun 07 12:06:29 ns204651 kernel: gran_size: 128K chunk_size: 64M num_reg: 10 lose cover RAM: 0G
Jun 07 12:06:29 ns204651 kernel: gran_size: 128K chunk_size: 128M num_reg: 10 lose cover RAM: 0G
Jun 07 12:06:29 ns204651 kernel: gran_size: 128K chunk_size: 256M num_reg: 10 lose cover RAM: 0G
Jun 07 12:06:29 ns204651 kernel: gran_size: 128K chunk_size: 512M num_reg: 10 lose cover RAM: 0G
Jun 07 12:06:29 ns204651 kernel: gran_size: 128K chunk_size: 1G num_reg: 10 lose cover RAM: 0G
Jun 07 12:06:29 ns204651 kernel: *BAD*gran_size: 128K chunk_size: 2G num_reg: 10 lose cover RAM: -1G
Jun 07 12:06:29 ns204651 kernel: gran_size: 256K chunk_size: 256K num_reg: 10 lose cover RAM: 254M
Jun 07 12:06:29 ns204651 kernel: gran_size: 256K chunk_size: 512K num_reg: 10 lose cover RAM: 254M
Jun 07 12:06:29 ns204651 kernel: gran_size: 256K chunk_size: 1M num_reg: 10 lose cover RAM: 254M
Jun 07 12:06:29 ns204651 kernel: gran_size: 256K chunk_size: 2M num_reg: 10 lose cover RAM: 254M
Jun 07 12:06:29 ns204651 kernel: *BAD*gran_size: 256K chunk_size: 4M num_reg: 10 lose cover RAM: -2M
Jun 07 12:06:29 ns204651 kernel: *BAD*gran_size: 256K chunk_size: 8M num_reg: 10 lose cover RAM: -2M
Jun 07 12:06:29 ns204651 kernel: *BAD*gran_size: 256K chunk_size: 16M num_reg: 10 lose cover RAM: -2M
Jun 07 12:06:29 ns204651 kernel: gran_size: 256K chunk_size: 32M num_reg: 10 lose cover RAM: 510M
Jun 07 12:06:29 ns204651 kernel: gran_size: 256K chunk_size: 64M num_reg: 10 lose cover RAM: 0G
Jun 07 12:06:29 ns204651 kernel: gran_size: 256K chunk_size: 128M num_reg: 10 lose cover RAM: 0G
Jun 07 12:06:29 ns204651 kernel: gran_size: 256K chunk_size: 256M num_reg: 10 lose cover RAM: 0G
Jun 07 12:06:29 ns204651 kernel: gran_size: 256K chunk_size: 512M num_reg: 10 lose cover RAM: 0G
Jun 07 12:06:29 ns204651 kernel: gran_size: 256K chunk_size: 1G num_reg: 10 lose cover RAM: 0G
Jun 07 12:06:29 ns204651 kernel: *BAD*gran_size: 256K chunk_size: 2G num_reg: 10 lose cover RAM: -1G
Jun 07 12:06:29 ns204651 kernel: gran_size: 512K chunk_size: 512K num_reg: 10 lose cover RAM: 254M
Jun 07 12:06:29 ns204651 kernel: gran_size: 512K chunk_size: 1M num_reg: 10 lose cover RAM: 254M
Jun 07 12:06:29 ns204651 kernel: gran_size: 512K chunk_size: 2M num_reg: 10 lose cover RAM: 254M
Jun 07 12:06:29 ns204651 kernel: *BAD*gran_size: 512K chunk_size: 4M num_reg: 10 lose cover RAM: -2M
Jun 07 12:06:29 ns204651 kernel: *BAD*gran_size: 512K chunk_size: 8M num_reg: 10 lose cover RAM: -2M
Jun 07 12:06:29 ns204651 kernel: *BAD*gran_size: 512K chunk_size: 16M num_reg: 10 lose cover RAM: -2M
Jun 07 12:06:29 ns204651 kernel: gran_size: 512K chunk_size: 32M num_reg: 10 lose cover RAM: 510M
Jun 07 12:06:29 ns204651 kernel: gran_size: 512K chunk_size: 64M num_reg: 10 lose cover RAM: 0G
 
Post size limit. I continue right here:

Jun 07 12:06:29 ns204651 kernel: gran_size: 512K chunk_size: 128M num_reg: 10 lose cover RAM: 0G
Jun 07 12:06:29 ns204651 kernel: gran_size: 512K chunk_size: 256M num_reg: 10 lose cover RAM: 0G
Jun 07 12:06:29 ns204651 kernel: gran_size: 512K chunk_size: 512M num_reg: 10 lose cover RAM: 0G
Jun 07 12:06:29 ns204651 kernel: gran_size: 512K chunk_size: 1G num_reg: 10 lose cover RAM: 0G
Jun 07 12:06:29 ns204651 kernel: *BAD*gran_size: 512K chunk_size: 2G num_reg: 10 lose cover RAM: -1G
Jun 07 12:06:29 ns204651 kernel: gran_size: 1M chunk_size: 1M num_reg: 10 lose cover RAM: 254M
Jun 07 12:06:29 ns204651 kernel: gran_size: 1M chunk_size: 2M num_reg: 10 lose cover RAM: 254M
Jun 07 12:06:29 ns204651 kernel: *BAD*gran_size: 1M chunk_size: 4M num_reg: 10 lose cover RAM: -2M
Jun 07 12:06:29 ns204651 kernel: *BAD*gran_size: 1M chunk_size: 8M num_reg: 10 lose cover RAM: -2M
Jun 07 12:06:29 ns204651 kernel: *BAD*gran_size: 1M chunk_size: 16M num_reg: 10 lose cover RAM: -2M
Jun 07 12:06:29 ns204651 kernel: gran_size: 1M chunk_size: 32M num_reg: 10 lose cover RAM: 510M
Jun 07 12:06:29 ns204651 kernel: gran_size: 1M chunk_size: 64M num_reg: 10 lose cover RAM: 0G
Jun 07 12:06:29 ns204651 kernel: gran_size: 1M chunk_size: 128M num_reg: 10 lose cover RAM: 0G
Jun 07 12:06:29 ns204651 kernel: gran_size: 1M chunk_size: 256M num_reg: 10 lose cover RAM: 0G
Jun 07 12:06:29 ns204651 kernel: gran_size: 1M chunk_size: 512M num_reg: 10 lose cover RAM: 0G
Jun 07 12:06:29 ns204651 kernel: gran_size: 1M chunk_size: 1G num_reg: 10 lose cover RAM: 0G
Jun 07 12:06:29 ns204651 kernel: *BAD*gran_size: 1M chunk_size: 2G num_reg: 10 lose cover RAM: -1G
Jun 07 12:06:29 ns204651 kernel: gran_size: 2M chunk_size: 2M num_reg: 10 lose cover RAM: 254M
Jun 07 12:06:29 ns204651 kernel: *BAD*gran_size: 2M chunk_size: 4M num_reg: 10 lose cover RAM: -2M
Jun 07 12:06:29 ns204651 kernel: *BAD*gran_size: 2M chunk_size: 8M num_reg: 10 lose cover RAM: -2M
Jun 07 12:06:29 ns204651 kernel: *BAD*gran_size: 2M chunk_size: 16M num_reg: 10 lose cover RAM: -2M
Jun 07 12:06:29 ns204651 kernel: gran_size: 2M chunk_size: 32M num_reg: 10 lose cover RAM: 510M
Jun 07 12:06:29 ns204651 kernel: gran_size: 2M chunk_size: 64M num_reg: 10 lose cover RAM: 0G
Jun 07 12:06:29 ns204651 kernel: gran_size: 2M chunk_size: 128M num_reg: 10 lose cover RAM: 0G
Jun 07 12:06:29 ns204651 kernel: gran_size: 2M chunk_size: 256M num_reg: 10 lose cover RAM: 0G
Jun 07 12:06:29 ns204651 kernel: gran_size: 2M chunk_size: 512M num_reg: 10 lose cover RAM: 0G
Jun 07 12:06:29 ns204651 kernel: gran_size: 2M chunk_size: 1G num_reg: 10 lose cover RAM: 0G
Jun 07 12:06:29 ns204651 kernel: *BAD*gran_size: 2M chunk_size: 2G num_reg: 10 lose cover RAM: -1G
Jun 07 12:06:29 ns204651 kernel: gran_size: 4M chunk_size: 4M num_reg: 10 lose cover RAM: 254M
Jun 07 12:06:29 ns204651 kernel: *BAD*gran_size: 4M chunk_size: 8M num_reg: 10 lose cover RAM: -2M
Jun 07 12:06:29 ns204651 kernel: *BAD*gran_size: 4M chunk_size: 16M num_reg: 10 lose cover RAM: -2M
Jun 07 12:06:29 ns204651 kernel: gran_size: 4M chunk_size: 32M num_reg: 10 lose cover RAM: 510M
Jun 07 12:06:29 ns204651 kernel: gran_size: 4M chunk_size: 64M num_reg: 10 lose cover RAM: 2M
Jun 07 12:06:29 ns204651 kernel: gran_size: 4M chunk_size: 128M num_reg: 10 lose cover RAM: 2M
Jun 07 12:06:29 ns204651 kernel: gran_size: 4M chunk_size: 256M num_reg: 10 lose cover RAM: 2M
 
And more here:

Jun 07 12:06:29 ns204651 kernel: gran_size: 4M chunk_size: 512M num_reg: 10 lose cover RAM: 2M
Jun 07 12:06:29 ns204651 kernel: gran_size: 4M chunk_size: 1G num_reg: 10 lose cover RAM: 2M
Jun 07 12:06:29 ns204651 kernel: *BAD*gran_size: 4M chunk_size: 2G num_reg: 10 lose cover RAM: -1022M
Jun 07 12:06:29 ns204651 kernel: gran_size: 8M chunk_size: 8M num_reg: 10 lose cover RAM: 254M
Jun 07 12:06:29 ns204651 kernel: *BAD*gran_size: 8M chunk_size: 16M num_reg: 10 lose cover RAM: -2M
Jun 07 12:06:29 ns204651 kernel: gran_size: 8M chunk_size: 32M num_reg: 10 lose cover RAM: 510M
Jun 07 12:06:29 ns204651 kernel: gran_size: 8M chunk_size: 64M num_reg: 10 lose cover RAM: 6M
Jun 07 12:06:29 ns204651 kernel: gran_size: 8M chunk_size: 128M num_reg: 10 lose cover RAM: 6M
Jun 07 12:06:29 ns204651 kernel: gran_size: 8M chunk_size: 256M num_reg: 10 lose cover RAM: 6M
Jun 07 12:06:29 ns204651 kernel: gran_size: 8M chunk_size: 512M num_reg: 10 lose cover RAM: 6M
Jun 07 12:06:29 ns204651 kernel: gran_size: 8M chunk_size: 1G num_reg: 10 lose cover RAM: 6M
Jun 07 12:06:29 ns204651 kernel: *BAD*gran_size: 8M chunk_size: 2G num_reg: 10 lose cover RAM: -1018M
Jun 07 12:06:29 ns204651 kernel: gran_size: 16M chunk_size: 16M num_reg: 10 lose cover RAM: 254M
Jun 07 12:06:29 ns204651 kernel: gran_size: 16M chunk_size: 32M num_reg: 10 lose cover RAM: 510M
Jun 07 12:06:29 ns204651 kernel: gran_size: 16M chunk_size: 64M num_reg: 10 lose cover RAM: 14M
Jun 07 12:06:29 ns204651 kernel: gran_size: 16M chunk_size: 128M num_reg: 10 lose cover RAM: 14M
Jun 07 12:06:29 ns204651 kernel: gran_size: 16M chunk_size: 256M num_reg: 10 lose cover RAM: 14M
Jun 07 12:06:29 ns204651 kernel: gran_size: 16M chunk_size: 512M num_reg: 10 lose cover RAM: 14M
Jun 07 12:06:29 ns204651 kernel: gran_size: 16M chunk_size: 1G num_reg: 10 lose cover RAM: 14M
Jun 07 12:06:29 ns204651 kernel: *BAD*gran_size: 16M chunk_size: 2G num_reg: 10 lose cover RAM: -1010M
Jun 07 12:06:29 ns204651 kernel: gran_size: 32M chunk_size: 32M num_reg: 10 lose cover RAM: 142M
Jun 07 12:06:29 ns204651 kernel: gran_size: 32M chunk_size: 64M num_reg: 10 lose cover RAM: 46M
Jun 07 12:06:29 ns204651 kernel: gran_size: 32M chunk_size: 128M num_reg: 9 lose cover RAM: 46M
Jun 07 12:06:29 ns204651 kernel: gran_size: 32M chunk_size: 256M num_reg: 9 lose cover RAM: 46M
Jun 07 12:06:29 ns204651 kernel: gran_size: 32M chunk_size: 512M num_reg: 9 lose cover RAM: 46M
Jun 07 12:06:29 ns204651 kernel: gran_size: 32M chunk_size: 1G num_reg: 9 lose cover RAM: 46M
Jun 07 12:06:29 ns204651 kernel: gran_size: 32M chunk_size: 2G num_reg: 10 lose cover RAM: 46M
Jun 07 12:06:29 ns204651 kernel: gran_size: 64M chunk_size: 64M num_reg: 10 lose cover RAM: 142M
Jun 07 12:06:29 ns204651 kernel: gran_size: 64M chunk_size: 128M num_reg: 9 lose cover RAM: 78M
Jun 07 12:06:29 ns204651 kernel: gran_size: 64M chunk_size: 256M num_reg: 9 lose cover RAM: 78M
Jun 07 12:06:29 ns204651 kernel: gran_size: 64M chunk_size: 512M num_reg: 9 lose cover RAM: 78M
Jun 07 12:06:29 ns204651 kernel: gran_size: 64M chunk_size: 1G num_reg: 9 lose cover RAM: 78M
Jun 07 12:06:29 ns204651 kernel: gran_size: 64M chunk_size: 2G num_reg: 10 lose cover RAM: 78M
Jun 07 12:06:29 ns204651 kernel: gran_size: 128M chunk_size: 128M num_reg: 9 lose cover RAM: 206M
Jun 07 12:06:29 ns204651 kernel: gran_size: 128M chunk_size: 256M num_reg: 9 lose cover RAM: 206M
Jun 07 12:06:29 ns204651 kernel: gran_size: 128M chunk_size: 512M num_reg: 9 lose cover RAM: 206M
Jun 07 12:06:29 ns204651 kernel: gran_size: 128M chunk_size: 1G num_reg: 9 lose cover RAM: 206M
Jun 07 12:06:29 ns204651 kernel: gran_size: 128M chunk_size: 2G num_reg: 10 lose cover RAM: 206M
Jun 07 12:06:29 ns204651 kernel: gran_size: 256M chunk_size: 256M num_reg: 7 lose cover RAM: 462M
Jun 07 12:06:29 ns204651 kernel: gran_size: 256M chunk_size: 512M num_reg: 9 lose cover RAM: 462M
Jun 07 12:06:29 ns204651 kernel: gran_size: 256M chunk_size: 1G num_reg: 9 lose cover RAM: 462M
Jun 07 12:06:29 ns204651 kernel: gran_size: 256M chunk_size: 2G num_reg: 10 lose cover RAM: 462M
Jun 07 12:06:29 ns204651 kernel: gran_size: 512M chunk_size: 512M num_reg: 5 lose cover RAM: 974M
Jun 07 12:06:29 ns204651 kernel: gran_size: 512M chunk_size: 1G num_reg: 5 lose cover RAM: 974M
Jun 07 12:06:29 ns204651 kernel: gran_size: 512M chunk_size: 2G num_reg: 5 lose cover RAM: 974M
Jun 07 12:06:29 ns204651 kernel: gran_size: 1G chunk_size: 1G num_reg: 5 lose cover RAM: 974M
Jun 07 12:06:29 ns204651 kernel: gran_size: 1G chunk_size: 2G num_reg: 5 lose cover RAM: 974M
Jun 07 12:06:29 ns204651 kernel: gran_size: 2G chunk_size: 2G num_reg: 4 lose cover RAM: 1998M
Jun 07 12:06:29 ns204651 kernel: mtrr_cleanup: can not find optimal value
Jun 07 12:06:29 ns204651 kernel: please specify mtrr_gran_size/mtrr_chunk_size
Jun 07 12:06:29 ns204651 kernel: e820: update [mem 0xdd000000-0xffffffff] usable ==> reserved
Jun 07 12:06:29 ns204651 kernel: e820: last_pfn = 0xdc000 max_arch_pfn = 0x400000000
Jun 07 12:06:29 ns204651 kernel: found SMP MP-table at [mem 0x000fd7b0-0x000fd7bf] mapped at [ffff8800000fd7b0]
Jun 07 12:06:29 ns204651 kernel: Scanning 1 areas for low memory corruption
Jun 07 12:06:29 ns204651 kernel: Base memory trampoline at [ffff880000097000] 97000 size 24576
Jun 07 12:06:29 ns204651 kernel: BRK [0x0221c000, 0x0221cfff] PGTABLE
Jun 07 12:06:29 ns204651 kernel: BRK [0x0221d000, 0x0221dfff] PGTABLE
Jun 07 12:06:29 ns204651 kernel: BRK [0x0221e000, 0x0221efff] PGTABLE
Jun 07 12:06:29 ns204651 kernel: BRK [0x0221f000, 0x0221ffff] PGTABLE
Jun 07 12:06:29 ns204651 kernel: BRK [0x02220000, 0x02220fff] PGTABLE
Jun 07 12:06:29 ns204651 kernel: BRK [0x02221000, 0x02221fff] PGTABLE
Jun 07 12:06:29 ns204651 kernel: RAMDISK: [mem 0x35038000-0x36813fff]

* http://my-fuzzy-logic.de/blog/index.php?/archives/41-Solving-linux-MTRR-problems.html

* https://velenux.wordpress.com/2014/01/02/badgran_size-in-dmesg/
 
Might be related. More interesting, though, will be the log right around the time of such a crash/hangup, since AFAIK these messages should only appear around boot time.

At least you should be able to get rid of those messages by doing what it says and specifying mtrr_gran_size and mtrr_chunk_size as boot parameters. Just pick one of the suggested configurations not marked as '*BAD*'. All of the ones that don't seem to lose any RAM coverage use 10 registers, which probably means all of them, while the default kernel parameter for spare registers is 1. In other words, it can't find a configuration with 0 loss that uses only 9 registers, although I'm not 100% sure on that topic.
So maybe try gran/chunk of 32M/1G? That would leave you the spare register and lose "only" 46M of RAM.
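
The selection rule described above (skip '*BAD*' lines, keep one register spare, then lose as little RAM as possible) can be sketched in Python. This is only an illustration, not something from the kernel or from Proxmox: the function names, the SAMPLE text and the default of 10 total registers are assumptions taken from the log in this thread, and the regular expression assumes the line layout shown there.

```python
import re

# One candidate line from the kernel's mtrr_cleanup pass, e.g.
# "*BAD*gran_size: 64K chunk_size: 4M num_reg: 10 lose cover RAM: -2M"
LINE_RE = re.compile(
    r"(?P<bad>\*BAD\*)?gran_size: (?P<gran>\d+[KMG])\s+"
    r"chunk_size: (?P<chunk>\d+[KMG])\s+"
    r"num_reg: (?P<regs>\d+)\s+"
    r"lose cover RAM: (?P<lost>-?\d+)(?P<unit>[KMG])"
)

# Factors to convert the "lose cover RAM" value to megabytes.
UNIT_MB = {"K": 1 / 1024, "M": 1, "G": 1024}


def parse_candidates(log_text):
    """Return one dict per gran_size/chunk_size candidate found in the log."""
    candidates = []
    for m in LINE_RE.finditer(log_text):
        candidates.append({
            "bad": m.group("bad") is not None,
            "gran": m.group("gran"),
            "chunk": m.group("chunk"),
            "regs": int(m.group("regs")),
            "lost_mb": int(m.group("lost")) * UNIT_MB[m.group("unit")],
        })
    return candidates


def pick(candidates, total_regs=10, spare_regs=1):
    """Pick the non-BAD candidate that leaves spare_regs free and loses least RAM.

    total_regs=10 and spare_regs=1 are assumptions matching this thread's log
    and the default mentioned above. Returns None if nothing qualifies.
    """
    usable = [c for c in candidates
              if not c["bad"] and c["regs"] <= total_regs - spare_regs]
    return min(usable, key=lambda c: (c["lost_mb"], c["regs"]), default=None)


def boot_params(candidate):
    """Format the chosen candidate as kernel command-line parameters."""
    return "mtrr_gran_size={gran} mtrr_chunk_size={chunk}".format(**candidate)


# A few lines copied from the journal output earlier in this thread.
SAMPLE = """\
kernel: gran_size: 64K chunk_size: 64M num_reg: 10 lose cover RAM: 0G
kernel: *BAD*gran_size: 64K chunk_size: 4M num_reg: 10 lose cover RAM: -2M
kernel: gran_size: 32M chunk_size: 1G num_reg: 9 lose cover RAM: 46M
kernel: gran_size: 256M chunk_size: 256M num_reg: 7 lose cover RAM: 462M
"""

if __name__ == "__main__":
    best = pick(parse_candidates(SAMPLE))
    print(boot_params(best))
```

On the full log several 32M candidates tie at 46M lost with 9 registers, so the first one listed wins; any of them, including 32M/1G, satisfies the same rule.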
 
Thanks @wbumiller

The halt happened again a few hours ago. The only thing I can see in the logs is these lines, repeated hundreds of times:

Jun 10 08:42:54 ns20xxxx sshd[28582]: Failed password for root from 116.31.116.43 port 45119 ssh2
Jun 10 08:42:54 ns20xxxx sshd[28582]: Received disconnect from 116.31.116.43: 11: [preauth]
Jun 10 08:42:54 ns20xxxx sshd[28582]: PAM 2 more authentication failures; logname= uid=0 euid=0 tty=ssh ruser= rhost=116.31.116.43
Jun 10 08:42:56 ns20xxxx sshd[28592]: Failed password for root from 116.31.116.43 port 61967 ssh2
Jun 10 08:42:59 ns20xxxx sshd[28592]: Failed password for root from 116.31.116.43 port 61967 ssh2

So it's an SSH brute-force attack at up to 250 Mbps. I've installed "fail2ban" on the Proxmox machine and on the Linux VMs (which were receiving the attack too), but still, it shouldn't halt because of this, right?
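
To get an idea of how many attempts come from each address, the failed logins can be counted per source IP. The following is a minimal Python sketch, assuming the sshd line format shown above; the function name and the regular expression are made up for this illustration and are not part of fail2ban or sshd.

```python
import re
from collections import Counter

# Matches sshd lines such as:
# "... sshd[28582]: Failed password for root from 116.31.116.43 port 45119 ssh2"
FAILED_RE = re.compile(
    r"sshd\[\d+\]: Failed password for (?:invalid user )?(?P<user>\S+) "
    r"from (?P<ip>\S+) port \d+"
)


def failed_logins_by_ip(lines):
    """Count 'Failed password' entries per source address."""
    counts = Counter()
    for line in lines:
        m = FAILED_RE.search(line)
        if m:
            counts[m.group("ip")] += 1
    return counts


# The log excerpt from this post, used as sample input.
SAMPLE_LINES = [
    "Jun 10 08:42:54 ns20xxxx sshd[28582]: Failed password for root from 116.31.116.43 port 45119 ssh2",
    "Jun 10 08:42:54 ns20xxxx sshd[28582]: Received disconnect from 116.31.116.43: 11: [preauth]",
    "Jun 10 08:42:54 ns20xxxx sshd[28582]: PAM 2 more authentication failures; logname= uid=0 euid=0 tty=ssh ruser= rhost=116.31.116.43",
    "Jun 10 08:42:56 ns20xxxx sshd[28592]: Failed password for root from 116.31.116.43 port 61967 ssh2",
    "Jun 10 08:42:59 ns20xxxx sshd[28592]: Failed password for root from 116.31.116.43 port 61967 ssh2",
]

if __name__ == "__main__":
    for ip, count in failed_logins_by_ip(SAMPLE_LINES).most_common():
        print(ip, count)
```

On a real system the lines would come from the auth log or the journal instead of SAMPLE_LINES; where exactly depends on the setup.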

So as far as I understand, the problem could be:

1- Kernel has some kind of problem that halts the machine after thousands of connections

2- The hardware (memory, cpu, chipset?) has a problem and with the attack the temperature rises and it breaks.

Maybe this weekend I can run some hardware tests and see if it breaks, but if you have any more ideas about what tests I can do, please tell me.

Thanks a lot,

Luis Miguel
 
Hey there.

After installing fail2ban, my server has been up for 12 days without problems (before, it would only last 2 or 3 days). So for me the problem is solved, but I think this kernel has a problem that makes it halt when it is under attack.
 
It's been almost one month and there have been no more problems with the server since installing fail2ban. Posting this just in case someone has the same problem.
 
