HN.zip

When networking doesn't work

69 points by kencausey - 11 comments
deathanatos [3 hidden]5 mins ago
But what was the checksum? Like the actual, specific value?

The Factorio devs found[1] that some devices do fail to compute checksums, in that they compute the checksum just fine, but they're doing something stupid with some values and so checksums of 0x0000 or 0xFFFF (the two values from the FFF) cause packet loss.

In any protocol where a retransmitted packet differs even slightly from the original (different request ID, timestamp, sequence number, etc.), that difference will be enough to jiggle the checksum to a new value (probably), and then the protocol will keep going with only a minor blip that probably goes unnoticed.

Only when the packet is deterministic, so that every retry is byte-for-byte identical, do you hit the problem on every attempt.
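To make the "jiggle" concrete, here is a minimal Python sketch. It is a simplification: it checksums only a made-up payload (a 16-bit sequence number plus a body) and ignores the pseudo-header and UDP header that the real UDP checksum also covers. The names `internet_checksum` and `build_payload` are hypothetical, not from any library.

```python
import struct


def internet_checksum(data: bytes) -> int:
    """16-bit one's complement checksum over big-endian words."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF


def build_payload(seq: int, body: bytes) -> bytes:
    """Hypothetical wire format: 16-bit sequence number, then the body."""
    return struct.pack("!H", seq) + body


body = b"join"
first = internet_checksum(build_payload(1, body))
retry_identical = internet_checksum(build_payload(1, body))
retry_bumped = internet_checksum(build_payload(2, body))

# An identical retry lands on the same checksum every time;
# bumping the sequence number moves it to a different value.
print(first == retry_identical, first != retry_bumped)
```

So if `first` happens to be one of the values a broken device mishandles, the identical retry fails the same way forever, while the bumped retry escapes.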

> calculating the UDP checksum is not exactly rocket science.

I've seen things that trivial get messed up. "Just read the standard" is a high bar, sometimes. (Though the above is probably "I dual purposed a u16 without realizing it didn't have any available niches for that…")

[1]: https://www.factorio.com/blog/post/fff-176

zamadatix [3 hidden]5 mins ago
Breaking down the oddity on 0x0000 and 0xFFFF further, it stems from this special behavior per the RFCs https://www.rfc-editor.org/rfc/rfc1122#page-29:~:text=is%20v...:

> Unlike the TCP checksum, the UDP checksum is optional; the value zero is transmitted in the checksum field of a UDP header to indicate the absence of a checksum. If the transmitter really calculates a UDP checksum of zero, it must transmit the checksum as all 1's (65535). No special action is required at the receiver, since zero and 65535 are equivalent in 1's complement arithmetic.

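In 1's complement arithmetic 0x0000 and 0xFFFF are the same number (the two representations of zero), which is why the RFC can use them as special values. Code that treats the checksum as an ordinary 2's complement integer sees two different numbers, so it gets the wrong answer for exactly these 2 values and no others.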
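A rough Python sketch of the rule quoted above, for anyone who wants to poke at it. The input `data` stands for pseudo-header + UDP header (checksum field zeroed) + payload; building that is left out. `naive_receiver_accepts` is a hypothetical example of the kind of mistake described here, not a claim about what any particular driver or NIC actually does.

```python
def ones_complement_sum(data: bytes) -> int:
    """16-bit one's complement sum of big-endian words."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # end-around carry
    return total


def udp_checksum_field(data: bytes) -> int:
    """Value the sender puts in the checksum field."""
    checksum = ~ones_complement_sum(data) & 0xFFFF
    # A computed zero is sent as all 1's, because zero on the wire
    # means "no checksum".
    return 0xFFFF if checksum == 0 else checksum


def receiver_accepts(data: bytes, field: int) -> bool:
    """Sum everything including the field; a good packet sums to 0xFFFF."""
    return ones_complement_sum(data + field.to_bytes(2, "big")) == 0xFFFF


def naive_receiver_accepts(data: bytes, field: int) -> bool:
    """Hypothetical buggy check: recompute and compare with plain ==."""
    return (~ones_complement_sum(data) & 0xFFFF) == field


# Data whose checksum computes to zero, so the sender transmits 0xFFFF.
data = b"\xff\xff"
field = udp_checksum_field(data)
print(hex(field), receiver_accepts(data, field), naive_receiver_accepts(data, field))
```

The summing receiver needs no special case, which matches the "no special action is required at the receiver" line in the quote. The compare-with-`==` version rejects the packet because it recomputes 0x0000 and sees 0xFFFF in the field.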

ongy [3 hidden]5 mins ago
In my university times I wrote a library (to help with some homework we gave students) that calculated the CRC32 for ethernet.

Which worked well unless compiled with gcc's `-fstrict-aliasing` optimization enabled...

Just writing UDP RFC compliant code doesn't protect you from running into annoying behavior with your programming language of choice...

yuye [3 hidden]5 mins ago
I love Wube's FFFs. I wish more devs would do it; not just a devlog, but really going into the nitty-gritty of how some systems work.
userbinator [3 hidden]5 mins ago
> Without disassembling and tracing the Intel Windows drivers (something I don’t feel like doing)

As someone who generally doesn't use AI in software development or RE, this is one thing that I'd recommend trying one on to see what it can do: the problem is clearly defined and a solution is easily validated, and it's a problem you're not interested in digging into deeper yourself. The other comment here about 0000 and FFFF checksums seems like a good place to start.

A little more digging found this discussion from TODAY regarding what looks like a very similar bug in one of Intel's Linux NIC drivers: https://lkml.org/lkml/2026/5/4/1886

codemog [3 hidden]5 mins ago
Exactly. Why would people willingly do this kind of tedious grunt work by hand instead of having a machine do it? I guess some people enjoy it, but it was always one of my least favorite parts.
stroebs [3 hidden]5 mins ago
I came across this very same issue with fika, a community-made mod for Escape from Tarkov. One player would consistently fail to join games, and it took ages to figure out which components were failing. The code intentionally sent the join message 4 times in quick succession, which triggered the DoS protection on the internet firewall. OK, disabled that. The next issue was that the ALG on the internet firewall was interfering with the packets, so I disabled that too. The final hurdle was the Rx offloading on the Intel NIC, which was the exact same issue with the checksum being set to all 0’s or all F’s.

What made it confusing at the time is the join packet would sometimes be accepted and passed through to the game, so it prompted further digging into why.

bombcar [3 hidden]5 mins ago
It'd be interesting to see what the wrong checksum it calculates is ...
ErroneousBosh [3 hidden]5 mins ago
Someone else mentioned further up that it's all zeroes or all ones. A checksum of all zeroes means "this packet has no checksum and that's okay". Because of the way it's calculated 0xffff works out the same as 0x0000, so if the checksum happens to sum to 0x0000 it's replaced with 0xffff.

Both values are totally valid checksums but some people don't believe that :-)

ranger_danger [3 hidden]5 mins ago
Usually all 0s or all Fs. I had the same problem with an old Dell PowerEdge with Broadcom NICs... packet failures left and right until I disabled the offloading options.
nubinetwork [3 hidden]5 mins ago
Interesting... I've heard enabling tx/rx offloading is actually beneficial, turns out that's not always the case...