
The Size of Packets

77 points by todsacerdoti - 34 comments
hliyan [3 hidden]5 mins ago
This reminds me of one of the most interesting bugs I've faced: I was responsible for developing the component that provided away market data to the core trading system of a major US exchange (which allows the trading system to determine whether an order should be matched in-house or routed to another exchange with a better price).

Throughputs were in the multiple tens of thousands of transactions per second and latencies were in single digit milliseconds (in later years these would drop to double digit microseconds, but that's a different story). Components were written in C++, running on Linux. The machines that ran my component and the trading engine were neighbors on a LAN.

We put my component through a full battery of performance tests, and for a while, we seemed to be meeting the numbers. Then one day, with absolutely zero code changes from my end or the trading engine's end, the latency numbers collapsed. We checked the hardware configs and the rate at which the latest test was run. Both were identical.

It took, I think, several days to solve the mystery: in the latest test run, we had added one extra away market to a list of 7 or 8 markets for which my component provided market data to the trading system. We had added markets before without an issue. It's a negligible change to the market data message size, because it only adds a few bytes: market ID, best bid price & quantity, best offer price & quantity. In no way should such a small change result in a disproportionate collapse in the latency numbers. It took a while for us to realize that before the addition of these few bytes, our market data message (a binary packed format), neatly fit into a single ethernet frame. Those extra few bytes pushed it over the 1600 (or 1500?) mark and caused all market data message frames (which were the bulk of messages on the system, next to orders), to fragment. The frame fragmentation and reassembly overhead was enough to clog up the pipes at the rates we were pumping data.

In the short run, I think we managed to do some tweaks and get the message back under 1600 bytes (by omitting markets that did not have a current bid/offer, rather than sending NULLs). I can't recall what we did in the long run.
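
A rough illustration of why a few extra bytes matter, assuming the traffic was IPv4 with a 20-byte header and a 1500-byte MTU (the comment itself is unsure of the exact limit, and the function name below is made up for this sketch): once the IP payload exceeds what one packet can carry, the fragment count jumps from one to two.

```python
import math


def ipv4_fragment_count(payload_len: int, mtu: int = 1500, ip_header: int = 20) -> int:
    """Number of IPv4 packets needed to carry payload_len bytes of IP payload.

    Assumes a fixed-size IP header with no options (an assumption of this sketch).
    """
    last_max = mtu - ip_header  # the final (or only) fragment may use all remaining room
    if payload_len <= last_max:
        return 1
    # Every fragment except the last must carry a multiple of 8 payload bytes,
    # because the fragment offset field counts in 8-byte units.
    per_fragment = last_max // 8 * 8
    return 1 + math.ceil((payload_len - last_max) / per_fragment)
```

Under these assumptions a UDP datagram carrying 1472 bytes of data (1480 bytes of IP payload with the 8-byte UDP header) still fits in one packet, while 1473 bytes of data does not.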

roeles [3 hidden]5 mins ago
> The system dispensed with a passive common bus and replaced it with an active switching hub to which hosts were attached.

I get the impression that the standard still allows hubs to exist, but that you just don't see them in practice.

I would be interested if anyone has ever used a 100mbit hub.

Veserv [3 hidden]5 mins ago
MTU discovery would be so much easier if the default behavior was to truncate and forward when encountering an oversized packet. The endpoints can then just compare the bytes received against the size encoded inside of the packet to trivially detect truncation and thus get the inbound MTU size.

This allows you to do MTU discovery as an endpoint protocol with all the authentication benefits that provides and allows you to send a single large probe packet to precisely identify the MTU size. It would also allow you to immediately and transparently identify MTU reductions due to route changes or any other such cause instead of packets just randomly blackholing or getting responses from unknown, unauthenticated endpoints.
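
A minimal sketch of the receiver-side check this proposal relies on, assuming a hypothetical protocol whose packets begin with a 4-byte big-endian total-length field. Neither the format nor the function names come from any real protocol, and no network today truncates and forwards like this.

```python
import struct
from typing import Optional

# Hypothetical header: total packet length (header included), 4 bytes, big-endian.
HEADER = struct.Struct("!I")


def build_probe(total_len: int) -> bytes:
    """Build a zero-filled probe packet that declares its own total length."""
    if total_len < HEADER.size:
        raise ValueError("probe must be at least as long as its header")
    return HEADER.pack(total_len) + bytes(total_len - HEADER.size)


def inferred_mtu(received: bytes) -> Optional[int]:
    """Return the received size if the packet arrived truncated, else None."""
    if len(received) < HEADER.size:
        return None  # too short to read the declared length; treat as garbage
    (declared,) = HEADER.unpack_from(received)
    if len(received) < declared:
        return len(received)  # fewer bytes arrived than the sender declared
    return None
```

For example, a 9000-byte probe that arrives cut down to 1500 bytes would make `inferred_mtu` return 1500, while an intact probe returns `None`.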

zamadatix [3 hidden]5 mins ago
Truncation for a dedicated probe packet type: you lose the information it's a probe when you go through a tunnel of some sort (VPN, L2TP, IPsec, MPLS, VPLS, VXLAN, PBB, q-in-q, whatever). You're also dealing with different layers e.g. a client could send an L3 packet probe and now you're expecting a layer 2 PBB/q-in-q node to recognize IP packet types and treat them specially (layering violation).

Truncation for all packet types: data in transit can occasionally get split for other reasons. Right now that's just made into loss, if we had built every protocol layer on the idea it should forward anyways then any instances of this type of loss also become MTU renegotiations, at best. At worst we're having to forward generally corrupted packets which can cause all sorts of other problems. It'd be another layering violation to require that e.g. an L2 switch must adjust the UDP checksum when it's intentionally truncating a packet, but that'd be the only way to avoid that. Tunnels (particularly secure) are also tricky here (you need to run multiple separate layers of this continuously to avoid truncation information not propagating to the right endpoints). It also doesn't allow for truly unidirectional protocols e.g. a UDP video stream as there is no allowance for out of session signaling to be possible.

The above is for "if we had started networking on day 1 with this plan in mind". There are of course additional problems given we didn't. I'm also not sure I follow how allowing any intermediate node to truncate a packet is any more authenticated.

The (still ugly) beauty of using a PMTUD-style approach over truncation or probe+notification is that it doesn't try to make assumptions about how anything in the middle could ever work for the rest of time, and that makes it both simple (despite sounding like a lot of work) and reliable. You and your peer just exchange packets until you find the biggest size that fits (or that you care to check for) and you're off! MTU changes due to a path change? No problem, it's just part of your "I had a connection and the other side seems to have stopped responding. How do I attempt to continue" logic (be that retrying with a new session or attempting to be smart about it). It also plays nice with the ICMP too large messages - if they are there you can choose to listen, if they are not it still "just works".

Or, like the article says, safe minimums can be more practical.
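
The probe-until-it-fits exchange can be sketched as a binary search. Here `probe` is a hypothetical callback that sends a packet of the given size and reports whether the peer acknowledged it. The 1280 and 9000 bounds are assumptions (the IPv6 minimum MTU and a common jumbo size). Real implementations also have to distinguish an oversized probe from ordinary packet loss, which this sketch ignores.

```python
from typing import Callable


def find_path_mtu(probe: Callable[[int], bool], low: int = 1280, high: int = 9000) -> int:
    """Return the largest size in [low, high] for which probe(size) succeeds.

    Assumes probe(low) succeeds and that success is monotonic: if a size
    fits, every smaller size fits too.
    """
    while low < high:
        mid = (low + high + 1) // 2  # round up so the loop always makes progress
        if probe(mid):
            low = mid       # mid fits; the answer is mid or larger
        else:
            high = mid - 1  # mid was dropped; the answer is smaller
    return low
```

With a simulated path such as `lambda size: size <= 1500`, the search settles on 1500 after about a dozen probes.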

ikiris [3 hidden]5 mins ago
And how do you tell the difference between cut off packets, and a mtu drop? What about crcs / frame checks? Do you regenerate the frames? Do you do this at routed interfaces? What if there's just layer 2 only involved?
LegionMammal978 [3 hidden]5 mins ago
> And how do you tell the difference between cut off packets, and a mtu drop?

You don't, apart from enforcing a bare-minimum MTU for sanity's sake. If your jumbo-size packets are getting randomly cut off by a middlebox, then they probably aren't stable at that size anyway.

Veserv [3 hidden]5 mins ago
Packets do not get “cut-off” normally. That is kind of the point. Some protocols allow transparent fragmentation, but the fragments need to encode enough information for reconstruction, so you can still detect “less data received than encoded on send”.

You do not need bit error detection because you literally truncated the packet. The data is already lost. But in the process you learned it was due to MTU limits which is very useful. Protocols are already required to be robust to garbage that fails bit error detection anyways, so it is not “required” to always have valid integrity tags. You could transparently re-encode bit error detection on the truncated packet if you so desire to ensure data integrity of the “MTU resulted in truncation” packet that you are now forwarding, but again, not necessary.

Any end-to-end protocol that encodes the intended data size in-band can use this technique across truncating transport layers. And any protocol which does so already requires implementations to not blindly trust the in-band value otherwise you get trivial buffer overflows. So, all non-grossly insecure client implementations should already be able to safely handle MTU truncation if they received it (they would just not be able to use that for MTU discovery until they are updated). The only thing you need is routers to truncate instead of drop and then you can slowly update client implementations to take advantage of the new feature since this middlebox change should not break any existing implementations unless they are inexcusably insecure.

ikiris [3 hidden]5 mins ago
I don’t think you understand what normally looks like if you start forwarding damaged frames like this because you can’t tell the difference. That was the point.
Veserv [3 hidden]5 mins ago
I literally have no idea what you are talking about. You can send garbage packets that conform to no known protocol on the internet. You can get more bit errors or perfect bit errors that make your bit error detection pass while still forwarding corrupt payloads. Transport protocols and channels must be and are robust to this.

“Damaged” frames and frame integrity only matter if you need the contents of the entire packet to remain intact. Which you explicitly do not when truncating.

The only new problem that arises is that maybe the in-band length information or headers get corrupted resulting in misinterpreting the truncation that actually occurred. And again, you already need to be robust to garbage. And you can just change my proposal to recompute the integrity tag on the truncated data if you think that really matters.

cryptonector [3 hidden]5 mins ago
> Path MTU discovery has not been enthusiastically embraced

Ugh. I don't understand this. Especially passive PMTUD should just be rolled out everywhere. On Linux it still defaults to disabled! https://sourcegraph.com/search?q=context%3Aglobal+repo%3A%5E...

whiatp [3 hidden]5 mins ago
PMTU just doesn't feel reliable to me because of poorly behaved boxes in the middle. The worst offender I've had to deal with was AWS Transit Gateway, which just doesn't bother sending ICMP too big messages. The second worst offender is, IMO (data center and ISP) routers that generate ICMP replies in their CPU, meaning large packets hit a rate limited exception punt path out of the switch ASIC over to the cheapest CPU they could find to put in the box. If too many people are hitting that path at the same time, (maybe) no reply for you.

More rare cases, but really frustrating to debug was when we had an L2 switch in the path with lower MTU than the routers it was joining together. Without an IP level stack, there is no generation of ICMP messages and that thing just ate larger packets. The even stranger case was when there was a Linux box doing forwarding that had segment offload left on. It was taking in several 1500 byte TCP packets from one side, smashing them into ~9000 byte monsters, and then tried to send those over a VPNish network interface that absolutely couldn't handle that. Even if the network in the middle bothered to generate the ICMP too big message, the source would have been thoroughly confused because it never sent anything over 1500.

toast0 [3 hidden]5 mins ago
> The even stranger case was when there was a Linux box doing forwarding that had segment offload left on. It was taking in several 1500 byte TCP packets from one side, smashing them into ~9000 byte monsters, and then tried to send those over a VPNish network interface that absolutely couldn't handle that. Even if the network in the middle bothered to generate the ICMP too big message, the source would have been thoroughly confused because it never sent anything over 1500.

This is an old Linux tcp offloading bug; large receive offload smooshes the inbound packet, then it's too big to forward.

I had to track down the other side of this. FreeBSD used to resend the whole send queue if it got a too big message, even if the size did not change. Sending all at once made it pretty likely for the broken forwarder to get packets close enough to do LRO, which resulted in large enough packet sending to show up as network problems.

I don't remember where the forwarder seemed to be, somewhere far away, IIRC.

cryptonector [3 hidden]5 mins ago
> PMTU just doesn't feel reliable to me because of poorly behaved boxes in the middle. The worst offender I've had to deal with was AWS Transit Gateway, which just doesn't bother sending ICMP too big messages.

Passive PMTUD does NOT depend on ICMP messages.

Hikikomori [3 hidden]5 mins ago
They recently started supporting pmtud on tgw. But it wasn't a big deal really as it adjusted mss instead.
immibis [3 hidden]5 mins ago
L2 not generating errors is expected behaviour - all ports on the L2 network are supposed to have the same MTU set
mkj [3 hidden]5 mins ago
Would that help with UDP, or only TCP?
ajb [3 hidden]5 mins ago
That particular one, only TCP. There is a different one for UDP applications: https://www.rfc-editor.org/rfc/rfc8899

Because UDP is only a very thin layer, each layer on top (e.g., QUIC) has to implement PLPMTUD; although the IETF recently standardised a way to extend UDP to have options, and PLPMTUD is also specified for that: https://datatracker.ietf.org/doc/draft-ietf-tsvwg-udp-option...

cryptonector [3 hidden]5 mins ago
You can implement passive PMTUD with UDP if you like. It's more work for you, but it's perfectly doable.
posnet [3 hidden]5 mins ago
"Jumbogram", an IPv6 packet with the Jumbo Payload option set, allowing for an frame size of up to 2³²-1 bytes.

At 10Gbps it would take 3.4 seconds just to serialize the frame.
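
The arithmetic behind that figure, as a quick check. The function name is just for illustration; the 400 Gb/s case uses the NIC speed mentioned in the reply below.

```python
def serialization_seconds(size_bytes: int, link_bps: float) -> float:
    """Time to clock size_bytes onto a link running at link_bps bits per second."""
    return size_bytes * 8 / link_bps


JUMBOGRAM_MAX = 2**32 - 1  # largest size expressible in a 32-bit length field

print(round(serialization_seconds(JUMBOGRAM_MAX, 10e9), 2))   # 3.44
print(round(serialization_seconds(JUMBOGRAM_MAX, 400e9), 3))  # 0.086
```

This ignores preamble, inter-frame gap, and any other framing overhead.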

hugmynutus [3 hidden]5 mins ago
Luckily 400Gb/s nics are already on the market [1]

[1] https://docs.broadcom.com/doc/957608-PB1

2OEH8eoCRo0 [3 hidden]5 mins ago
Do you count the frame preamble?
nayuki [3 hidden]5 mins ago
> The speed of light in glass or fiber-optic cable is significantly slower, at approximately 194,865 kilometers per second. The speed of voltage propagation in copper is 224,844 kilometres per second.

If I understand correctly, the speed of light in an electrical cable doesn't depend on the metal that carries current, but instead depends on the dielectric materials (plastic, air, etc.) between the two conductors?

tonyarkles [3 hidden]5 mins ago
If I’m interpreting what you’re asking correctly, yes. The velocity factor of a cable doesn’t depend on the metal it’s made of but rather the insulator material and the geometry of the cable.

For fibre the velocity factor depends on the refractive index of the fibre.
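
For reference, the two speeds quoted from the article correspond to round velocity factors. A small sketch (function names are illustrative; the vacuum speed of light is the standard 299,792.458 km/s):

```python
C_VACUUM_KM_S = 299_792.458  # speed of light in vacuum, km/s


def velocity_factor(speed_km_s: float) -> float:
    """Propagation speed as a fraction of the vacuum speed of light."""
    return speed_km_s / C_VACUUM_KM_S


def refractive_index(speed_km_s: float) -> float:
    """Refractive index implied by a propagation speed (n = c / v)."""
    return C_VACUUM_KM_S / speed_km_s


fibre_vf = velocity_factor(194_865)   # about 0.65
copper_vf = velocity_factor(224_844)  # about 0.75
fibre_n = refractive_index(194_865)   # about 1.54
```

So the article's figures amount to roughly 0.65c for fibre and 0.75c for copper cable.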

lucb1e [3 hidden]5 mins ago
Huh? Maybe I'm completely misreading the question, but when they say fiber-optic cable, they do mean optic. It's not an "electrical cable"; there is no metal needed in optic communication cables (perhaps for stiffness or whatnot, but not for the communication)
Hikikomori [3 hidden]5 mins ago
>The speed of voltage propagation in copper is 224,844 kilometres per second.

This part?

beeburrt [3 hidden]5 mins ago
That font size is tiny. If this is your site, maybe consider a larger font size
nayuki [3 hidden]5 mins ago
The site specifies a base font size of 12px. The better practice is to not specify a base font size at all, just taking it from the user's web browser instead. Then, the web designer should specify every other font size and box dimension as a scaled version of the base font size, using units like em/rem/%, not px.

Related reading: https://joshcollinsworth.com/blog/never-use-px-for-font-size

lucb1e [3 hidden]5 mins ago
It's the same size as HN: 12px. HN looks larger to me for some reason, but I can't figure out why: when I overlay a quote someone posted here over the website with half transparency in GIMP, the text is clearly the same height. Some letters are wider, some narrower, but the final length of the 8 words I sampled is 360px on HN vs. 358px on that website (so differences basically cancel out)

This is on Firefox/Debian, in case that means something for installed fonts. I see that site's CSS specifies Verdana and Arial, names that sound windowsey to me but I have no idea if my system has (analogous versions to) those

tomthecreator [3 hidden]5 mins ago
There's a PDF version linked at the top of the article, it's actually much better typeset.
usefulcat [3 hidden]5 mins ago
Given the subject of TFA, this seems appropriate in a meta sort of way.
jeffbee [3 hidden]5 mins ago
The efficiency argument applies to private flows mostly. In terms of overall network traffic, the huge majority takes place between peers that share a local or private network. Internetworking as such has a relatively small share of total flows. So large frame sizes are beneficial in the context where they are also not problematic, and path MTU discovery is not beneficial in the context where it has many drawbacks. It seems as though the current state is pretty much optimal.
nullc [3 hidden]5 mins ago
Is there any convenient way to tell linux distributions that the local subnet can handle 9k jumbos (or whatever) but that anything routed out must be 1500?

I currently have this solved by just sticking hosts on two vlans, one that has the default route and another that only has the jumbo capable hosts. ... but this seems kinda stupid.

fbouynot [3 hidden]5 mins ago
Yes you can set your interface MTU at 9000 and assign a 1500 MTU to the routes themselves.
throw0101b [3 hidden]5 mins ago
> […] and assign a 1500 MTU to the routes themselves.

See "mtu" option in ip-route(8):

* https://man.archlinux.org/man/ip-route.8.en#mtu

The BSDs also have an "-mtu" option in route(8):

* https://man.freebsd.org/cgi/man.cgi?route(8)

* https://man.openbsd.org/route