1/5 MS-01 Nodes Unstable & Not Much in Journal

manofoz · Jul 30, 2024

Hello,

I have five MS-01 nodes in my cluster which is running fairly well. However, one of the nodes becomes unreachable from time to time. It will still be running but I cannot ping it and there is no output if I plug in an HDMI cable.

Here is the journalctl output from right before it went down (22:31). You can see I hard reset it at 22:43. I am wondering if there is a way to enable more verbose logging or if I am missing something here. Or maybe there are other ways to diagnose this.

Jul 29 22:18:13 pve04 kernel: kvm_msr_ignored_check: 41 callbacks suppressed
Jul 29 22:18:13 pve04 kernel: kvm: kvm [26577]: ignored rdmsr: 0x60d data 0x0
Jul 29 22:18:13 pve04 kernel: kvm: kvm [26577]: ignored rdmsr: 0x3f8 data 0x0
Jul 29 22:18:13 pve04 kernel: kvm: kvm [26577]: ignored rdmsr: 0x3f9 data 0x0
Jul 29 22:18:13 pve04 kernel: kvm: kvm [26577]: ignored rdmsr: 0x3fa data 0x0
Jul 29 22:18:13 pve04 kernel: kvm: kvm [26577]: ignored rdmsr: 0x630 data 0x0
Jul 29 22:18:13 pve04 kernel: kvm: kvm [26577]: ignored rdmsr: 0x631 data 0x0
Jul 29 22:18:13 pve04 kernel: kvm: kvm [26577]: ignored rdmsr: 0x632 data 0x0
Jul 29 22:18:13 pve04 kernel: kvm: kvm [26577]: ignored rdmsr: 0x61d data 0x0
Jul 29 22:18:13 pve04 kernel: kvm: kvm [26577]: ignored rdmsr: 0x621 data 0x0
Jul 29 22:18:13 pve04 kernel: kvm: kvm [26577]: ignored rdmsr: 0x690 data 0x0
Jul 29 22:18:18 pve04 pmxcfs[1127]: [dcdb] notice: data verification successful
Jul 29 22:19:36 pve04 pmxcfs[1127]: [status] notice: received log
Jul 29 22:19:36 pve04 pmxcfs[1127]: [status] notice: received log
Jul 29 22:20:47 pve04 corosync[1247]: [TOTEM ] Retransmit List: cc2d1
Jul 29 22:22:18 pve04 pmxcfs[1127]: [status] notice: received log
Jul 29 22:22:19 pve04 pmxcfs[1127]: [status] notice: received log
Jul 29 22:22:19 pve04 pmxcfs[1127]: [status] notice: received log
Jul 29 22:22:19 pve04 pmxcfs[1127]: [status] notice: received log
Jul 29 22:22:19 pve04 pmxcfs[1127]: [status] notice: received log
Jul 29 22:22:19 pve04 pmxcfs[1127]: [status] notice: received log
Jul 29 22:25:21 pve04 postfix/qmgr[1233]: 5B44A1C1248: from=<root@pve04>, size=35737, nrcpt=1 (queue active)
Jul 29 22:25:51 pve04 postfix/smtp[3957496]: connect to gmail-smtp-in.l.google.com[142.251.179.26]:25: Connection timed out
Jul 29 22:25:51 pve04 postfix/smtp[3957496]: connect to gmail-smtp-in.l.google.com[2607:f8b0:4004:c1f::1b]:25: Network is unreachable
Jul 29 22:25:51 pve04 postfix/smtp[3957496]: connect to alt1.gmail-smtp-in.l.google.com[2a00:1450:400b:c00::1b]:25: Network is unreachable
Jul 29 22:26:21 pve04 postfix/smtp[3957496]: connect to alt1.gmail-smtp-in.l.google.com[209.85.202.27]:25: Connection timed out
Jul 29 22:26:24 pve04 pmxcfs[1127]: [status] notice: received log
Jul 29 22:26:51 pve04 postfix/smtp[3957496]: connect to alt2.gmail-smtp-in.l.google.com[64.233.184.26]:25: Connection timed out
Jul 29 22:26:51 pve04 postfix/smtp[3957496]: 5B44A1C1248: to=<redacted>, relay=none, delay=246308, delays=246217/0.01/90/0, dsn=4.4.1, status=deferred (co>
Jul 29 22:28:12 pve04 kernel: kvm_msr_ignored_check: 40 callbacks suppressed
Jul 29 22:28:12 pve04 kernel: kvm: kvm [26577]: ignored rdmsr: 0x60d data 0x0
Jul 29 22:28:12 pve04 kernel: kvm: kvm [26577]: ignored rdmsr: 0x3f8 data 0x0
Jul 29 22:28:12 pve04 kernel: kvm: kvm [26577]: ignored rdmsr: 0x3f9 data 0x0
Jul 29 22:28:12 pve04 kernel: kvm: kvm [26577]: ignored rdmsr: 0x3fa data 0x0
Jul 29 22:28:12 pve04 kernel: kvm: kvm [26577]: ignored rdmsr: 0x630 data 0x0
Jul 29 22:28:12 pve04 kernel: kvm: kvm [26577]: ignored rdmsr: 0x631 data 0x0
Jul 29 22:28:12 pve04 kernel: kvm: kvm [26577]: ignored rdmsr: 0x632 data 0x0
Jul 29 22:28:12 pve04 kernel: kvm: kvm [26577]: ignored rdmsr: 0x61d data 0x0
Jul 29 22:28:12 pve04 kernel: kvm: kvm [26577]: ignored rdmsr: 0x621 data 0x0
Jul 29 22:28:12 pve04 kernel: kvm: kvm [26577]: ignored rdmsr: 0x690 data 0x0
Jul 29 22:28:36 pve04 pmxcfs[1127]: [status] notice: received log
Jul 29 22:28:37 pve04 pmxcfs[1127]: [status] notice: received log
Jul 29 22:28:39 pve04 pmxcfs[1127]: [status] notice: received log
Jul 29 22:28:39 pve04 pmxcfs[1127]: [status] notice: received log
Jul 29 22:28:39 pve04 pmxcfs[1127]: [status] notice: received log
Jul 29 22:28:39 pve04 pmxcfs[1127]: [status] notice: received log
Jul 29 22:28:49 pve04 pmxcfs[1127]: [status] notice: received log
Jul 29 22:30:21 pve04 postfix/qmgr[1233]: 4B5FF1C1209: from=<root@pve04>, size=38466, nrcpt=1 (queue active)
Jul 29 22:30:26 pve04 postfix/smtp[3960556]: connect to gmail-smtp-in.l.google.com[2607:f8b0:4004:c09::1b]:25: Network is unreachable
Jul 29 22:30:56 pve04 postfix/smtp[3960556]: connect to gmail-smtp-in.l.google.com[142.251.16.27]:25: Connection timed out
Jul 29 22:30:56 pve04 postfix/smtp[3960556]: connect to alt1.gmail-smtp-in.l.google.com[2a00:1450:400b:c00::1b]:25: Network is unreachable
Jul 29 22:31:26 pve04 postfix/smtp[3960556]: connect to alt1.gmail-smtp-in.l.google.com[209.85.202.27]:25: Connection timed out
Jul 29 22:31:26 pve04 postfix/smtp[3960556]: connect to alt2.gmail-smtp-in.l.google.com[2a00:1450:400c:c0b::1b]:25: Network is unreachable
Jul 29 22:31:26 pve04 postfix/smtp[3960556]: 4B5FF1C1209: to=<redacted>, relay=none, delay=418486, delays=418421/0.01/65/0, dsn=4.4.1, status=deferred (co>
-- Boot 6de4beafeff64ea3852dd559aa1bd735 --
Jul 29 22:43:32 pve04 kernel: Linux version 6.8.4-3-pve (build@proxmox) (gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAM>
Jul 29 22:43:32 pve04 kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-6.8.4-3-pve root=/dev/mapper/pve-root ro quiet intel_iommu=on rootdelay=10
Jul 29 22:43:32 pve04 kernel: KERNEL supported cpus:
Jul 29 22:43:32 pve04 kernel: Intel GenuineIntel
Jul 29 22:43:32 pve04 kernel: AMD AuthenticAMD
Jul 29 22:43:32 pve04 kernel: Hygon HygonGenuine
Jul 29 22:43:32 pve04 kernel: Centaur CentaurHauls
Jul 29 22:43:32 pve04 kernel: zhaoxin Shanghai

The logging that I see here is repeated for days on end as I scroll up. I have not yet configured the email functionality; all nodes complain about that.
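To put a number on the silent window before each reset, a small script can measure the time between the last journal line and the first line after every "-- Boot ... --" marker. This is only a sketch: it assumes the default short journalctl output format shown above, and because that format has no year, the `year` parameter is a made-up default.

```python
import re
from datetime import datetime

# Matches the separator journalctl prints between boots, e.g.
# "-- Boot 6de4beafeff64ea3852dd559aa1bd735 --"
BOOT_MARKER = re.compile(r"^-- Boot [0-9a-f]+ --$")
# Matches the leading "Jul 29 22:31:26 " timestamp of a short-format line
STAMP = re.compile(r"^([A-Z][a-z]{2} [ \d]\d \d\d:\d\d:\d\d) ")


def parse_stamp(line, year=2024):
    """Return the line's timestamp, or None if it has none.

    The short format carries no year, so `year` is an assumption.
    """
    match = STAMP.match(line)
    if match is None:
        return None
    return datetime.strptime(f"{year} {match.group(1)}", "%Y %b %d %H:%M:%S")


def boot_gaps(lines, year=2024):
    """Return (last_before, first_after, gap) for every boot marker."""
    gaps = []
    last_seen = None  # most recent timestamp read so far
    pending = None    # last timestamp seen before a boot marker
    for raw in lines:
        line = raw.rstrip("\n")
        if BOOT_MARKER.match(line):
            pending = last_seen
            continue
        stamp = parse_stamp(line, year)
        if stamp is None:
            continue
        if pending is not None:
            gaps.append((pending, stamp, stamp - pending))
            pending = None
        last_seen = stamp
    return gaps


if __name__ == "__main__":
    import sys
    for before, after, gap in boot_gaps(sys.stdin):
        print(f"silent from {before} until boot at {after} ({gap})")
```

It can be fed from a pipe, e.g. `journalctl --no-pager -o short | python3 boot_gaps.py` (the script name is just a placeholder). Since the node normally logs something every minute or two, the last line before the marker is a rough estimate of when it actually hung.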

However, the "ignored rdmsr" logs are unique to this node. I did a lot of GPU/iGPU/eGPU passthrough on this node, so I may have VMs trying to reach things that are not accessible.

I was hoping to triage this one to stability and learn a bit, but if I'm at a dead end I'm also open to reinstalling Proxmox. This MS-01 is running:

Barebones 13900H MS-01
96GB Crucial RAM
Crucial T500 1TB boot disk
2x Samsung PM983a NVMes for Ceph OSDs
RX 6800 GPU

Thanks!

gfngfn256 · Jul 30, 2024

According to the manufacturer's own website specs, the MS-01 supports:

Memory: DDR5-5200MHz Dual Channel (SODIMM slots x2, up to a total maximum of 64GB)

I realize that according to official Intel specs your 13900H CPU can support up to 96GB (memory type dependent, but you haven't given us that). However, this will also depend on the environment the CPU is placed in, namely the motherboard, PSU, BIOS configuration and cooling. The manufacturer has told you that the maximum they support on your machine is 64GB. You've fully loaded both the PCIe lanes and all NVMe slots. I'm guessing that from a thermal and power perspective you are already at the upper limit of that machine. IDK what the 2.5/10 GbE NICs are doing in your setup, but I imagine some or all of them are busy chewing power. Finally, that RX 6800 GPU has its own healthy appetite. I'm pretty sure that 19V power adapter is pretty stressed out.

Have you checked thermals? Memory? Etc.

How does this particular machine compare HW wise to the other 4?
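For the thermals part, here is a rough sketch of how the readings could be pulled from the kernel's sysfs thermal interface. It assumes the usual `/sys/class/thermal/thermal_zone*/` layout, where `temp` is reported in millidegrees Celsius; which zones exist on an MS-01 is not something I have checked, and the function name is my own.

```python
from pathlib import Path


def read_thermal_zones(base="/sys/class/thermal"):
    """Return a list of (zone_name, zone_type, degrees_celsius).

    Zones that cannot be read or parsed are skipped.
    """
    readings = []
    for zone in sorted(Path(base).glob("thermal_zone*")):
        try:
            kind = (zone / "type").read_text().strip()
            millidegrees = int((zone / "temp").read_text().strip())
        except (OSError, ValueError):
            continue
        readings.append((zone.name, kind, millidegrees / 1000.0))
    return readings


if __name__ == "__main__":
    for name, kind, celsius in read_thermal_zones():
        print(f"{name} {kind}: {celsius:.1f} C")
```

Because the node stops logging when it hangs, running something like this on a timer and appending the output to a file would at least show whether temperatures were climbing right before it went quiet.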

manofoz · Jul 30, 2024

Hey! Thanks for the response.

All five machines are identical in hardware. Not every machine has a GPU but this machine was periodically dropping before I plugged that in. Two others have GPUs but none have the same GPU.

One thing unique to this machine is that I set up iGPU passthrough using vGPUs. That was when things got messy, and I didn't end up using the iGPU because the way I set it up needed an older version of the kernel. I was mostly just experimenting to see how that would look. This is why I was thinking of just wiping it clean and starting with a fresh Proxmox install, but I wanted to see if I could figure this out first.

Also, the machine was not under load when this happened. Three nodes are used regularly and take the brunt of the load, and these have been fine. The other two, including this node, I use to experiment, and aside from a gaming VM I spun up I have not done much with it. That VM was down when it froze. Here is the RAM:

https://www.amazon.com/gp/product/B0C79K5VGZ/ref=ppx_yo_dt_b_search_asin_title?ie=UTF8&psc=1

gfngfn256 · Jul 30, 2024

manofoz said:
This was why I was thinking to just wipe it clean and start with a fresh proxmox install

That's what I would do. However, you could just reverse the actions you took to pass through the iGPU: kernel command line, kernel version, etc. That should work too.
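If you go the "reverse it" route, a quick way to see what is still left on the kernel command line is to flag anything that looks passthrough related. This is a minimal sketch: the prefix list is my own guess at common passthrough parameters, not an exhaustive or official list, and it does not handle quoted values containing spaces.

```python
# Parameter-name prefixes often involved in GPU/iGPU passthrough setups.
# This list is an assumption for illustration; adjust it to what you changed.
PASSTHROUGH_HINTS = (
    "intel_iommu",
    "amd_iommu",
    "iommu",
    "vfio",
    "i915.",
    "pcie_acs_override",
)


def passthrough_params(cmdline):
    """Return the command-line tokens whose name starts with a known hint."""
    flagged = []
    for token in cmdline.split():
        name = token.split("=", 1)[0]
        if name.startswith(PASSTHROUGH_HINTS):
            flagged.append(token)
    return flagged


if __name__ == "__main__":
    # /proc/cmdline holds the command line of the running kernel
    with open("/proc/cmdline") as handle:
        for token in passthrough_params(handle.read()):
            print(token)
```

On the command line from the journal in the first post, this would flag only `intel_iommu=on`. Module options and blacklists under `/etc/modprobe.d/` are not covered by this and would need checking separately.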

manofoz · Jul 30, 2024

Thanks! I will try that. At first I was worried about mucking up Ceph with a reinstall, but I had already run into disk corruption after a hair dryer in my second-floor bathroom blew the basement fuses that MS-01 was on (never knew they were connected). I was able to get it back up quickly by just removing the OSDs and adding new ones from the same drives. I also got NUT going so I don't hit another "manually run fdisk" message. Since this one is still alive, I can uninstall stuff properly before reformatting.

gfngfn256 · Jul 30, 2024

With a power outage, all sorts of crazy stuff can happen (that may be difficult to immediately realize). If it isn't a big bother (relatively new install/full backups available), I'd start fresh again, especially in light of having the unstable node, unless you are absolutely certain the instability is only passthrough related.
