Hello,
I have five MS-01 nodes in my cluster, which is running fairly well. However, one of the nodes becomes unreachable from time to time. It is still running, but I cannot ping it and there is no output when I plug in an HDMI cable.
Here is the journalctl output from right before it went down (22:31). You can see I hard reset it at 22:34. Is there a way to enable more verbose logging, or am I missing something here? Are there other ways to diagnose this?
Jul 29 22:18:13 pve04 kernel: kvm_msr_ignored_check: 41 callbacks suppressed
Jul 29 22:18:13 pve04 kernel: kvm: kvm [26577]: ignored rdmsr: 0x60d data 0x0
Jul 29 22:18:13 pve04 kernel: kvm: kvm [26577]: ignored rdmsr: 0x3f8 data 0x0
Jul 29 22:18:13 pve04 kernel: kvm: kvm [26577]: ignored rdmsr: 0x3f9 data 0x0
Jul 29 22:18:13 pve04 kernel: kvm: kvm [26577]: ignored rdmsr: 0x3fa data 0x0
Jul 29 22:18:13 pve04 kernel: kvm: kvm [26577]: ignored rdmsr: 0x630 data 0x0
Jul 29 22:18:13 pve04 kernel: kvm: kvm [26577]: ignored rdmsr: 0x631 data 0x0
Jul 29 22:18:13 pve04 kernel: kvm: kvm [26577]: ignored rdmsr: 0x632 data 0x0
Jul 29 22:18:13 pve04 kernel: kvm: kvm [26577]: ignored rdmsr: 0x61d data 0x0
Jul 29 22:18:13 pve04 kernel: kvm: kvm [26577]: ignored rdmsr: 0x621 data 0x0
Jul 29 22:18:13 pve04 kernel: kvm: kvm [26577]: ignored rdmsr: 0x690 data 0x0
Jul 29 22:18:18 pve04 pmxcfs[1127]: [dcdb] notice: data verification successful
Jul 29 22:19:36 pve04 pmxcfs[1127]: [status] notice: received log
Jul 29 22:19:36 pve04 pmxcfs[1127]: [status] notice: received log
Jul 29 22:20:47 pve04 corosync[1247]: [TOTEM ] Retransmit List: cc2d1
Jul 29 22:22:18 pve04 pmxcfs[1127]: [status] notice: received log
Jul 29 22:22:19 pve04 pmxcfs[1127]: [status] notice: received log
Jul 29 22:22:19 pve04 pmxcfs[1127]: [status] notice: received log
Jul 29 22:22:19 pve04 pmxcfs[1127]: [status] notice: received log
Jul 29 22:22:19 pve04 pmxcfs[1127]: [status] notice: received log
Jul 29 22:22:19 pve04 pmxcfs[1127]: [status] notice: received log
Jul 29 22:25:21 pve04 postfix/qmgr[1233]: 5B44A1C1248: from=<root@pve04>, size=35737, nrcpt=1 (queue active)
Jul 29 22:25:51 pve04 postfix/smtp[3957496]: connect to gmail-smtp-in.l.google.com[142.251.179.26]:25: Connection timed out
Jul 29 22:25:51 pve04 postfix/smtp[3957496]: connect to gmail-smtp-in.l.google.com[2607:f8b0:4004:c1f::1b]:25: Network is unreachable
Jul 29 22:25:51 pve04 postfix/smtp[3957496]: connect to alt1.gmail-smtp-in.l.google.com[2a00:1450:400b:c00::1b]:25: Network is unreachable
Jul 29 22:26:21 pve04 postfix/smtp[3957496]: connect to alt1.gmail-smtp-in.l.google.com[209.85.202.27]:25: Connection timed out
Jul 29 22:26:24 pve04 pmxcfs[1127]: [status] notice: received log
Jul 29 22:26:51 pve04 postfix/smtp[3957496]: connect to alt2.gmail-smtp-in.l.google.com[64.233.184.26]:25: Connection timed out
Jul 29 22:26:51 pve04 postfix/smtp[3957496]: 5B44A1C1248: to=<redacted>, relay=none, delay=246308, delays=246217/0.01/90/0, dsn=4.4.1, status=deferred (co>
Jul 29 22:28:12 pve04 kernel: kvm_msr_ignored_check: 40 callbacks suppressed
Jul 29 22:28:12 pve04 kernel: kvm: kvm [26577]: ignored rdmsr: 0x60d data 0x0
Jul 29 22:28:12 pve04 kernel: kvm: kvm [26577]: ignored rdmsr: 0x3f8 data 0x0
Jul 29 22:28:12 pve04 kernel: kvm: kvm [26577]: ignored rdmsr: 0x3f9 data 0x0
Jul 29 22:28:12 pve04 kernel: kvm: kvm [26577]: ignored rdmsr: 0x3fa data 0x0
Jul 29 22:28:12 pve04 kernel: kvm: kvm [26577]: ignored rdmsr: 0x630 data 0x0
Jul 29 22:28:12 pve04 kernel: kvm: kvm [26577]: ignored rdmsr: 0x631 data 0x0
Jul 29 22:28:12 pve04 kernel: kvm: kvm [26577]: ignored rdmsr: 0x632 data 0x0
Jul 29 22:28:12 pve04 kernel: kvm: kvm [26577]: ignored rdmsr: 0x61d data 0x0
Jul 29 22:28:12 pve04 kernel: kvm: kvm [26577]: ignored rdmsr: 0x621 data 0x0
Jul 29 22:28:12 pve04 kernel: kvm: kvm [26577]: ignored rdmsr: 0x690 data 0x0
Jul 29 22:28:36 pve04 pmxcfs[1127]: [status] notice: received log
Jul 29 22:28:37 pve04 pmxcfs[1127]: [status] notice: received log
Jul 29 22:28:39 pve04 pmxcfs[1127]: [status] notice: received log
Jul 29 22:28:39 pve04 pmxcfs[1127]: [status] notice: received log
Jul 29 22:28:39 pve04 pmxcfs[1127]: [status] notice: received log
Jul 29 22:28:39 pve04 pmxcfs[1127]: [status] notice: received log
Jul 29 22:28:49 pve04 pmxcfs[1127]: [status] notice: received log
Jul 29 22:30:21 pve04 postfix/qmgr[1233]: 4B5FF1C1209: from=<root@pve04>, size=38466, nrcpt=1 (queue active)
Jul 29 22:30:26 pve04 postfix/smtp[3960556]: connect to gmail-smtp-in.l.google.com[2607:f8b0:4004:c09::1b]:25: Network is unreachable
Jul 29 22:30:56 pve04 postfix/smtp[3960556]: connect to gmail-smtp-in.l.google.com[142.251.16.27]:25: Connection timed out
Jul 29 22:30:56 pve04 postfix/smtp[3960556]: connect to alt1.gmail-smtp-in.l.google.com[2a00:1450:400b:c00::1b]:25: Network is unreachable
Jul 29 22:31:26 pve04 postfix/smtp[3960556]: connect to alt1.gmail-smtp-in.l.google.com[209.85.202.27]:25: Connection timed out
Jul 29 22:31:26 pve04 postfix/smtp[3960556]: connect to alt2.gmail-smtp-in.l.google.com[2a00:1450:400c:c0b::1b]:25: Network is unreachable
Jul 29 22:31:26 pve04 postfix/smtp[3960556]: 4B5FF1C1209: to=<redacted>, relay=none, delay=418486, delays=418421/0.01/65/0, dsn=4.4.1, status=deferred (co>
-- Boot 6de4beafeff64ea3852dd559aa1bd735 --
Jul 29 22:43:32 pve04 kernel: Linux version 6.8.4-3-pve (build@proxmox) (gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAM>
Jul 29 22:43:32 pve04 kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-6.8.4-3-pve root=/dev/mapper/pve-root ro quiet intel_iommu=on rootdelay=10
Jul 29 22:43:32 pve04 kernel: KERNEL supported cpus:
Jul 29 22:43:32 pve04 kernel: Intel GenuineIntel
Jul 29 22:43:32 pve04 kernel: AMD AuthenticAMD
Jul 29 22:43:32 pve04 kernel: Hygon HygonGenuine
Jul 29 22:43:32 pve04 kernel: Centaur CentaurHauls
Jul 29 22:43:32 pve04 kernel: zhaoxin Shanghai
The logging I see here repeats for days on end as I scroll up. I have not yet configured the email functionality, so all nodes complain about that.
However, the "ignored rdmsr" messages are unique to this node. I did a lot of GPU/iGPU/eGPU passthrough on this node, so it may have VMs trying to reach things that are not accessible.
I was hoping to triage this one to stability and learn a bit, but if I'm at a dead end I'm also open to reinstalling Proxmox. This MS-01 is running:
- Barebones 13900H MS-01
- 96GB Crucial RAM
- Crucial T500 1TB boot disk
- 2x Samsung PM983a NVMes for Ceph OSDs
- RX 6800 GPU
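
In case it helps with comparing several of these hangs, here is a rough Python sketch of how the silent window could be measured from a saved journal dump: it looks for the "-- Boot ... --" marker and reports the last timestamp logged before it and the first one after it. The function names (parse_ts, boot_gaps), the SAMPLE list, and the default year are assumptions made up for illustration (the short journal format carries no year); the sample lines are copied from the log above.

```python
import re
from datetime import datetime

# Matches the short journalctl format seen above, e.g.
# "Jul 29 22:31:26 pve04 postfix/smtp[3960556]: message..."
LINE_RE = re.compile(
    r"^(?P<ts>[A-Z][a-z]{2} +\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) (?P<unit>[^\[:]+)(?:\[\d+\])?: (?P<msg>.*)$"
)
# Marker journalctl prints between boots, e.g. "-- Boot 6de4be... --"
BOOT_RE = re.compile(r"^-- Boot [0-9a-f]+ --$")


def parse_ts(text, year=2024):
    # Assumption: the year is not in the log, so a placeholder is used.
    return datetime.strptime(f"{year} {text}", "%Y %b %d %H:%M:%S")


def boot_gaps(lines):
    """Return (last_before, first_after) timestamp pairs around each boot marker."""
    gaps = []
    last_ts = None   # timestamp of the most recent parsed entry
    pending = None   # last entry seen before a boot marker, awaiting its partner
    for line in lines:
        line = line.rstrip("\n")
        if BOOT_RE.match(line):
            pending = last_ts
            continue
        m = LINE_RE.match(line)
        if not m:
            continue  # skip anything that is not a regular journal line
        ts = parse_ts(m["ts"])
        if pending is not None:
            gaps.append((pending, ts))
            pending = None
        last_ts = ts
    return gaps


# Three lines taken from the log above.
SAMPLE = [
    "Jul 29 22:31:26 pve04 postfix/smtp[3960556]: connect to alt2.gmail-smtp-in.l.google.com[2a00:1450:400c:c0b::1b]:25: Network is unreachable",
    "-- Boot 6de4beafeff64ea3852dd559aa1bd735 --",
    "Jul 29 22:43:32 pve04 kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-6.8.4-3-pve root=/dev/mapper/pve-root ro quiet intel_iommu=on rootdelay=10",
]

for before, after in boot_gaps(SAMPLE):
    print(f"last entry {before:%H:%M:%S}, next boot {after:%H:%M:%S}, gap {after - before}")
```

On the three sample lines this reports a gap of 0:12:06 between the last postfix entry and the first kernel line of the next boot. Run against a longer dump, the same loop would list every such gap, which might show whether the hangs line up with anything.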