This issue has been happening randomly for a while now, but I only just got around to setting up an alternate network interface to investigate.
After some time under high load/IO, the 10GbE interface on a j274 stops receiving network packets. TX still works, but outgoing traffic eventually devolves to just ARP requests (which other hosts do receive). The failure is not instant: it seems to develop over a period of up to several minutes, during which network connectivity is degraded between certain hosts.
In this instance, the failure happened at 20:02 initially and most traffic flows were interrupted, but the (remote) monitoring still worked. It somehow recovered at 20:17, before failing completely at 20:31.
This is currently only happening on a j274. I don't think I've ever seen it on other Mac Mini variants (M1 Pro, M2). I'm not sure if that's just a coincidence, though (it could be related to CPU performance relative to network IO, for example).
I have a vague feeling this might be IRQ related. If there are multiple RX queues assigned to different network flows, some IRQs or queues failing first would explain the strange failure behavior where it fails partially before it fails fully.
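To make the partial-failure reasoning concrete, here is a toy model of flow-to-queue steering. It is only a sketch: real NICs typically use a Toeplitz hash plus an indirection table for RSS, while this uses CRC32 modulo the queue count as a stand-in, and the queue count and addresses are made up for illustration.

```python
import zlib


def queue_for_flow(src_ip, dst_ip, src_port, dst_port, num_queues):
    """Toy stand-in for RSS: hash the flow tuple and pick an RX queue index."""
    key = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}".encode()
    return zlib.crc32(key) % num_queues


def surviving_flows(flows, num_queues, stalled_queues):
    """Return the flows whose (hypothetical) RX queue is not stalled."""
    return [
        flow
        for flow in flows
        if queue_for_flow(*flow, num_queues) not in stalled_queues
    ]


def example():
    # Hypothetical flows: ten peers talking to one local address.
    flows = [(f"10.0.0.{i}", "10.0.0.1", 40000 + i, 22) for i in range(2, 12)]
    # With only some queues stalled, only the flows hashed to them break.
    partial = surviving_flows(flows, 8, stalled_queues={0, 1, 2})
    # With every queue stalled, nothing gets through.
    total = surviving_flows(flows, 8, stalled_queues=set(range(8)))
    return partial, total
```

Under this model, a flow that happens to hash to a still-working queue keeps running while other flows to the same host are already dead, which would match monitoring staying up while most traffic was interrupted.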
/proc/interrupts dump during the failure:
124:       131         0    316956    321397         0   6301482   5757779   7076594  PCI-MSIX-0000:03:00.0  0  Edge  end0
125:        62         0  22534244         0    180882    109157     19396    163910  PCI-MSIX-0000:03:00.0  1  Edge  end0
126:       178         0         0    150623   9052354   3688564   1227219   4592258  PCI-MSIX-0000:03:00.0  2  Edge  end0
127:  16596356         0         0         0         0         0         0         0  PCI-MSIX-0000:03:00.0  3  Edge  end0
128:        81         0         0    236171     26504   3987434   4310273   5243217  PCI-MSIX-0000:03:00.0  4  Edge  end0
129:        85         0         0        75      8166   7977179  19628093   7158307  PCI-MSIX-0000:03:00.0  5  Edge  end0
130:       141         0         0    740925   8342324   4133795   2094545   1913432  PCI-MSIX-0000:03:00.0  6  Edge  end0
131:        80  27867953         0         0         0         0         0         0  PCI-MSIX-0000:03:00.0  7  Edge  end0
132:         0         0         0         0         0         0         0         0  PCI-MSIX-0000:03:00.0  8  Edge  end0
In this case, IRQ 128 (MSI 4) is ticking up with TX packets, but nothing else is.
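For anyone wanting to check which vectors are still ticking on their own machine, here is a rough sketch that diffs two /proc/interrupts snapshots. The function names are made up for illustration, and it assumes the per-CPU counts are the leading numeric fields on each line and that the device name (end0 here) is the last field, as in the dump above.

```python
import time


def parse_interrupts(text, device="end0"):
    """Return {irq: count summed over CPUs} for lines ending in `device`."""
    totals = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2 or not fields[0].endswith(":"):
            continue
        if fields[-1] != device:
            continue
        total = 0
        for field in fields[1:]:
            # Per-CPU counts come first; stop at the interrupt chip name.
            if not field.isdigit():
                break
            total += int(field)
        totals[fields[0].rstrip(":")] = total
    return totals


def irq_deltas(before, after):
    """Per-IRQ increase between two parsed snapshots."""
    return {irq: after[irq] - before.get(irq, 0) for irq in after}


def watch(device="end0", interval=5.0, path="/proc/interrupts"):
    """Take two snapshots `interval` seconds apart and print the deltas."""
    with open(path) as f:
        before = parse_interrupts(f.read(), device)
    time.sleep(interval)
    with open(path) as f:
        after = parse_interrupts(f.read(), device)
    for irq, delta in sorted(irq_deltas(before, after).items()):
        print(f"IRQ {irq}: +{delta}")
```

Calling watch() during the failure should show a nonzero delta only for the vector that is still firing (IRQ 128 in the dump above) if the same thing is happening.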
ip -s link shows all newly received packets as dropped (the RX dropped and mcast counters are ticking up, and the TX counters increase normally):
2: end0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether a4:77:f3:05:50:1b brd ff:ff:ff:ff:ff:ff
    RX:     bytes    packets errors dropped  missed   mcast
     725823928713  552326168      0   20032       0  449035
    TX:     bytes    packets errors dropped carrier collsns
    1818579085663 1247873976      0       0       0       0
    altname enp3s0
    altname enxa477f305501b
So the PHY is receiving packets, but they aren't making it into the kernel.
Nothing shows up in dmesg when the problem happens. Running ip link set end0 down && ip link set end0 up fixes the situation.
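Until the root cause is found, the down/up workaround could be automated. The sketch below is only an illustration: the stall heuristic (TX advancing while no RX packets arrive that were not also counted as dropped) is my assumption and would misfire on a link that is simply idle on RX, and the helper names are hypothetical. It reads the standard counters under /sys/class/net/<iface>/statistics and runs the same two ip commands as above.

```python
import subprocess
from pathlib import Path

COUNTERS = ("rx_packets", "rx_dropped", "tx_packets")


def read_counters(iface, root="/sys/class/net"):
    """Read a few interface counters from sysfs as integers."""
    stats = Path(root) / iface / "statistics"
    return {name: int((stats / name).read_text()) for name in COUNTERS}


def rx_looks_stalled(before, after):
    """Heuristic (an assumption, not a confirmed signature of this bug):
    TX advanced, but no RX packet arrived that was not also dropped."""
    tx = after["tx_packets"] - before["tx_packets"]
    rx = after["rx_packets"] - before["rx_packets"]
    dropped = after["rx_dropped"] - before["rx_dropped"]
    return tx > 0 and rx - dropped <= 0


def bounce_commands(iface):
    """The two commands of the manual workaround, as argv lists."""
    return [
        ["ip", "link", "set", iface, "down"],
        ["ip", "link", "set", iface, "up"],
    ]


def bounce(iface):
    """Take the interface down and back up (needs root)."""
    for cmd in bounce_commands(iface):
        subprocess.run(cmd, check=True)
```

A cron job or small loop could call read_counters("end0") twice some seconds apart and call bounce("end0") when rx_looks_stalled() returns True, ideally only after several consecutive positive checks.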