Hacker News

The Isolation Trap: Erlang

130 points by enz - 56 comments

JackC [3 hidden]5 mins ago

The article argues that shared memory and message passing are the same thing because they share the same classes of potential failure modes.

Isn't it more like, message passing is a way of constraining shared memory to the point where it's possible for humans to reason about most of the time?

Sort of like rust and c. Yes, you can write code with 'unsafe' in rust that makes any mistake c can make. But the rules outside unsafe blocks, combined with the rules at module boundaries, greatly reduce the m * n polynomial complexity of a given size of codebase, letting us reason better about larger codebases.

alberth [3 hidden]5 mins ago

Tangentially related: I haven’t seen DragonflyBSD talked about on HN in a long while but wasn’t it a split from FreeBSD to be built entirely around message passing as the core construct.

And with the tiny team working on it, it has remarkable performance.

https://www.dragonflybsd.org/performance/

toast0 [3 hidden]5 mins ago

> Isn't it more like, message passing is a way of constraining shared memory to the point where it's possible for humans to reason about most of the time?

That's a good way to look at it. A processes's mailbox is shared mutable state, but restrictions and conventions make a lot of things simpler when a given process owns its statr and responds to requests than when the requesters can access the state in shared memory. But when the requests aren't well thought out, you can build all the same kinds of issues.

Let's say you have a process that holds an account balance. If requests are deposit X or withdrawl Y, no problem (other than two generals). If instead requestors get balance, adjust and then send a set balance, you have a classic race condition.

ETS can be mentally modeled as a process that owns the table (even though the implementation is not), and the same thing applies... if the mutations you want to do aren't available as atomic requests or you don't use those facilities, the mutation isn't atomic and you get all the consequences that come with that.

Circular message passing can be an easy mistake to make in some applications, too.

dnautics [3 hidden]5 mins ago

> ETS can be mentally modeled as a process that owns the table (even though the implementation is not)

the API models it that way, so i'd say its a bit more than just a mental model.

roncesvalles [3 hidden]5 mins ago

Exactly, reading TFA and its prequel, can't shake the feeling that the author doesn't really understand concurrency.

The main purpose of synchronization is creating happens-before (memory/cache coherence) relationships between lines of code that aren't in the same program order. Go channels are just syntactic sugar for creating these happens-before relationships. Problems such as deadlocks and races (at least in the way that TFA calls them out) are irreducible complexity if you're executing two sequences of logical instructions in parallel. If you're passing data in whatever way, there is no isolation between those two sequences. All you can enforce is degrees of discipline.

It's typical AI slop. I'd recommend for the author (or anyone else) to watch Jenkov's course[1] first if they have an honest interest in the topic.

[1] https://www.youtube.com/playlist?list=PLL8woMHwr36EDxjUoCzbo...

rdtsc [3 hidden]5 mins ago

> But an escape hatch is still an escape hatch. These mechanisms bypass the process isolation model entirely. They are shared state outside the process model, accessible concurrently by any process, with no mailbox serialization, no message copying, no ownership semantics. And when you introduce shared state into a system built on the premise of having none, you reintroduce the bugs that premise was supposed to eliminate.

No, they do bypass it. I don't know what "Technical Program Managers at Google" do but they don't seem to be using a lot of Erlang it seems ;-). ETS tables can be modeled as a process which stores data and then replies to message queries. Every update and read is equivalent to sending a message. The terms are still copied (see note * below). You're not going to read half a tuple and then it will mutate underneath as another process updates it. Traversing an ETS table is logically equivalent to asking a process for individual key-values using regular message passing.

What is different is what these are optimized for. ETS tables are great for querying and looking up data. They even have a mini query language for it (https://www.erlang.org/doc/apps/stdlib/qlc.html). Persistent terms are great for configuration values. None of them break the isolated heap and immutable data paradigm, they just optimize for certain access patterns.

Even dictionary fields they mention, when a process reads another process' dictionary it's still a signal being sent to a process and a reply needing to be received.

* Immutable binary blocks >64B can be referenced, but they are referenced when sending data using explicit messages between processes anyway.

dnautics [3 hidden]5 mins ago

minor nitpicks:

ETS is not a process that responds to messages, you have to wrap it in a process and do the messages part yourself.

Process dictionary: i am pretty sure that's a process_info bif that directly queries the vm internal database and not a secret message that can be trapped or even uses the normal message passing system.

rdtsc [3 hidden]5 mins ago

> ETS is not a process that responds to messages, you have to wrap it in a process and do the messages part yourself.

I didn't say it's implemented as a process but works as if it where logically. Most terms (except literals and the binary references) are still copied just like when you send message. You could replace it behind the scenes with a process and it would act the same. Performance-wise it won't be the same, and that's why they are implemented differently but it doesn't allow sharing a process heap and you don't have to do locks and mutexes to protect access to this "shared" data.

> i am pretty sure that's a process_info bif that directly queries the vm internal database and not a secret message that can be trapped or even uses the normal message passing system.

I specifically meant querying the dictionary of another process. Since it's in the context of "erlang is violating the shared nothing" comment. In that case if we look at https://www.erlang.org/doc/system/ref_man_processes.html#rec... we see that process_info_request is a signal. A process is sent a signal, and then it gets its dictionary entries and replies (note the difference between messages and signals there).

Twey [3 hidden]5 mins ago

Message passing is a type of mutable shared state — but one that's restricted in some important way to eliminate a certain class of errors (in Erlang's case, to a thread-safe queue with pairwise ordering guarantees so that all processing on a particular actor's state is effectively atomic). You can also pick other structures that give different guarantees, e.g. LVars or CRDTs make operations commutative so that the ordering problems go away (but by removing your ability to write non-commutative operations). The big win for the actor model is (just) that it linearizes all operations on a particular substate of the program while allowing other actors' states to be operated on concurrently.

Nobody argues that any of these approaches is a silver bullet for all concurrency problems. Indeed most of the problems of concurrency have direct equivalents in the world of single-threaded programming that are typically hard and only partially solved: deadlocks and livelocks are just infinite loops that occur across a thread boundary, protocol violations are just type errors that occur across a thread boundary, et cetera. But being able to rule out some of these problems in the happy case, even if you have to deal with them occasionally when writing more fiddly code, is still a big win.

If you have an actor Mem that is shared between two other actors A and B then Mem functions exactly as shared memory does between colocated threads in a multithreaded system: after all, RAM on a computer is implemented by sending messages down a bus! The difference is just that in the hardware case the messages you can pass to/from the actor (i.e. the atomicity boundaries) are fixed by the hardware, e.g. to reads/writes on particular fixed-sized ranges of memory, while with a shared actor Mem is free to present its own set of software-defined operations, with awareness of the program's semantics. Memory fences are a limited way to bring that programmability to hardware memory, but the programmer still has the onerous and error-prone task of mapping domain operations to fences.

_mrinalwadhwa_ [3 hidden]5 mins ago

> a thread-safe queue with pairwise ordering guarantees so that all processing on a particular actor's state is effectively atomic

> The big win for the actor model is (just) that it linearizes all operations on a particular substate of the program while allowing other actors' states to be operated on concurrently.

Came here to say exactly those two things. Your comment is as clear as it could be.

IsTom [3 hidden]5 mins ago

> Forget to set a timeout on a gen_server:call?

Default timeout is 5 seconds. You need to set explicit infinity timeout to not have one.

__turbobrew__ [3 hidden]5 mins ago

I work on infrastructure at bigco and we landed on a 5 second default timeout for our RPC framework which is interesting.

Sometimes I think there should be a list of sane and tested production configs: default rpc timeout, default backoff exponent, default initial backoff, default max backoff, health check frequency, health check timeout, process restart delay, process restart backoff, etc…

johnisgood [3 hidden]5 mins ago

> This isn’t obviously wrong

I thought it was obviously wrong. Server A calls Server B, and Server B calls server A. Because when I read the code my first thought was that it is circular. Is it really not obvious? Am I losing my mind?

The mention of `persistent_term` is cool.

loloquwowndueo [3 hidden]5 mins ago

It wasn’t obvious to the AI that wrote the article. There’s still hope for humans :)

jacquesm [3 hidden]5 mins ago

Servers are not ever supposed to be calling each other, also not in longer chains. They're supposed to layered and you only call "down".

bluGill [3 hidden]5 mins ago

It is too common / useful. Not everything is a tree.

allreduce [3 hidden]5 mins ago

Most things are a dag tho. :)

bluGill [3 hidden]5 mins ago

Most is not all. And those exceptions are annoying.

lukeasrodgers [3 hidden]5 mins ago

I don’t have much experience with pony but it seems like it addresses the core concerns in this article by design https://www.ponylang.io/discover/why-pony/. I wish it were more popular.

jen20 [3 hidden]5 mins ago

I don’t know enough about pony to know for sure, but nothing on that page sums to suggest that deadlocks of the form the article discusses are resolved?

gf000 [3 hidden]5 mins ago

I don't think there is a generic computational model that would prevent deadlocks, so no, pony also doesn't solve it.

brabel [3 hidden]5 mins ago

In Pony, handling messages is done via behaviors, which look like a normal method - the only difference is that it cannot return anything and that it runs asynchronously. Hence, the example in the post cannot occur since you cannot really wait for something in Pony. You'd have to explicitly call some other behavior of the other actor to be able to "respond" its first message, which breaks the circularity.

aeonfox [3 hidden]5 mins ago

A real interesting read as someone who spends a bit of time with Elixir. Wasn't aware of the atomic and counter Erlang features that break isolation.

Though they do say that race conditions are purely mitigated by discipline at design time, but then mention race conditions found via static analysis:

> Maria Christakis and Konstantinos Sagonas built a static race detector for Erlang and integrated it into Dialyzer, Erlang’s standard static analysis tool. They ran it against OTP’s own libraries, which are heavily tested and widely deployed.

> They found previously unknown race conditions. Not in obscure corners of the codebase. Not in exotic edge cases. In the kind of code that every Erlang application depends on, code that had been running in production for years.

I imagine that the 4th issue of protocol violation could possibly be mitigated by a typesafe abstracted language like Gleam (or Elixir when types are fully implemented)

WJW [3 hidden]5 mins ago

If these race conditions are in code that has been in production for years and yet the race conditions are "previously unknown", that does suggest to me that it is in practice quite hard to trigger these race conditions. Bugs that happen regularly in prod (and maybe I'm biased, but especially bugs that happen to erlang systems in prod) tend to get fixed.

aeonfox [3 hidden]5 mins ago

True. And that the subtle bugs were then picked up by static analysis makes the safety proposition of Erlang even better.

> Bugs that happen regularly in prod

It depends on how regular and reproducible they are. Timing bugs are notoriously difficult to pin down. Pair that with let-it-crash philosophy, and it's maybe not worth tracking down. OTOH, Erlang has been used for critical systems for a very long time – plenty long enough for such bugs to be tracked down if they posed real problems in practice.

thesz [3 hidden]5 mins ago

Erlang has "die and be restarted" philosophy towards process failures, so these "bugs that happen to erlang systems in prod" may not be fixed at all, if they are rare enough.

toast0 [3 hidden]5 mins ago

As of now, the post you're replying to says "bugs that regularly happen ... in prod"

Now, if it crashes every 10 years, that is regular, but I think the meaning is that it happens often. Back when I operated a large dist cluster, yes, some rare crashes happened that never got noticed or the triage was 'wait and see if it happens again' and it didn't happen. But let it crash and restart from a known good state is a philosophy about structuring error checking more than an operational philosophy: always check for success and if you don't know how to handle an error fail loudly and return to a good state to continue.

Operationally, you are expected to monitor for crashes and figure out how to prevent them in the future. And, IMHO, be prepared to hot load fixes in response... although a lot of organizations don't hot load.

dnautics [3 hidden]5 mins ago

not all races are bugs. here's an example that probably happens in many systems that people just don't notice: sometimes you don't care and, say, having database setup race against setup of another service that needs the database means that in 99% of cases you get a faster bootup and in 1% of cases the database setup is slow and the dependent server gets restarted by your application supervisor and connects on the second try.

kamma4434 [3 hidden]5 mins ago

The 4th issue is a feature- it’s what allows zero downtime hot updates.

anonymous_user9 [3 hidden]5 mins ago

This seems interesting, but the sheer density of LLM-isms make it hard to get through.

rando1234 [3 hidden]5 mins ago

I actually disagree, thought it read reasonably well and didn't feel LLMy at all.

loloquwowndueo [3 hidden]5 mins ago

It stinks of LLM - sections with headers beginning with “The”, a lot of “it’s not just X, it’s Y” etc etc.

The content is good and interesting though. Just hard to wade through with all the thorny LLM bushes getting in the way.

Looks like the author had a draft with the core content and ideas and asked an LLM to embellish it. Maybe because author wasn’t confident in their writing skills? Whatever the reason, I’d honestly prefer something human-written.

brabel [3 hidden]5 mins ago

I must be immune to that since I thought the post was very nice and didn't realize any of those things. Do they really make the post less good though?

boxed [3 hidden]5 mins ago

I think at this point comments like this are equivalent to saying "I didn't like this article, because it's written in too good English".

andrelaszlo [3 hidden]5 mins ago

I would edit sentences like this:

"Erlang is the strongest form of the isolation argument, and it deserves to be taken seriously, which is why what happens next matters."

It doesn't add much, and it has this condescending and pretentious LLM tone. For me as a reader, it distracts from an otherwise interesting article.

layer8 [3 hidden]5 mins ago

That what the only place that made me stumble, because “what happens next” doesn’t really make sense in that context.

brabel [3 hidden]5 mins ago

But mistakes like that are what makes it human! I really don't know anymore that we can have certainty about things being AI or human.

layer8 [3 hidden]5 mins ago

Mistakes yes, but “this obviously makes no sense” less so.

loloquwowndueo [3 hidden]5 mins ago

Sorry, good English is good grammatically and structurally while being unique and feeling creative. and AI-written English is not good. It’s correct but totally repetitive, formulaic and circular. It’s like expecting a pizza and finding it’s made of cardboard.

MarkusQ [3 hidden]5 mins ago

Or maybe more like expecting Italian food and getting pizza?

Linux-Fan [3 hidden]5 mins ago

I liked the content of the article enough to read it to the end, but I did have a hard time due to inflation with LLM-isms. Then again I am not a native so how would I know if this is good English? I can only tell that to me, it is hard to read despite interesting content.

trashburger [3 hidden]5 mins ago

It shows a lack of care for the reader. Use your own words.

tonnydourado [3 hidden]5 mins ago

Thank god I found this page: https://causality.blog/series/, now I can relax knowing that at least there's a plan for a conclusion. Looking forward to the next posts

NeutralForest [3 hidden]5 mins ago

Yeah I was looking for the next one!

pshirshov [3 hidden]5 mins ago

I believe it's more correct to reference circular calls as "livelocks", not "deadlocks" - something is happening but the whole computation cannot progress.

For the rest - pure untyped actors come with a lot of downsides and provoke engineers to make systems unnecessarily distributed (with all the consistency and timeout issues). There aren't that many problems which can be mapped well directly to actors. I personally find async runtimes with typed front-ends (e.g. Cats/ZIO in Scala, async in Rust, etc) much more robust and much less error-prone.

toast0 [3 hidden]5 mins ago

If process A is waiting for a reply from process B and process B is waiting for a reply from process A; that is deadlock. There is no way those processes can continue (unless there's a timeout or one process gets killed). Other processes may progress as long as they don't need a reply from process A or B ... which sometimes is fine. (Edit: nevermind, I forgot the 5 second timeout if you use gen_server:call/2; you will end up in livelock if it happens continuously, but a mostly ok system if it works out)

Livelock is something like you've got 1000 nodes that all want to do X, which requires an exclusive lock and the method to get an exclusive lock is:

Broadcast request to cluster

If you got the lock on all nodes, proceed

If you get the lock on all nodes, release and try again after a timeout

This procedure works in practice, when there is low contention. If the cluster is large and many processes contend for the lock, progress is rare. It's not impossible to progress, so the system is not deadlocked; but it takes an inordinate amount of time, mostly waiting for locks: the system is livelocked. In this case, whenever progress happens, future progress is easier.

This is a rough description of an actual incident with nodes joining pg2, I think around 2018... the new pg module avoids that lock (and IMHO, the lock was not needed anyway; it was there to provide consistent order in member lists across nodes, but member lists would no longer be consistent when dist distonects happened and resolved, so why add locks to be consistent sometimes). As an Erlang user with I think the largest clusters anywhere, we ran into a good number of these kinds of things in OTP. Ericsson built dist for telecom switches with two nodes in a single enclosure in a rack. It works over tcp and they didn't put explicit limits, so you can run a dist cluster with thousands of nodes in locations across the globe and it mostly works, but there will be some things to debug from time to time. Erlang is fairly easy to debug... All the new nodes have a process waiting to join pg2, what's the pg2 process doing, why does that lock not have the consensus building algorithm, can we add it? In the meantime, let's kill some nodes so others can progreas and then we'll run a sequenced start of the rest.

gzread [3 hidden]5 mins ago

It's a deadlock because two threads are each waiting for the other.

cyberpunk [3 hidden]5 mins ago

Eh maybe. I work on a big, mature, production erlang system which has millions of processes per cluster and while the author is right in theory, these are quite extreme edge cases and i’ve never tripped over them.

Sure, if you design a shit system that depends on ETS for shares state there are dangers, so maybe don’t do that?

I’d still rather be writing this system in erlang than in another language, where the footguns are bigger.

hrmtst93837 [3 hidden]5 mins ago

Treating ETS as the only footgun misses a few ugly ones, because a bad mailbox backlog or a gen_server chain can turn local slowness into cluster-wide pain before anything actually crashes.

Erlang does make some failure modes less nasty. It also hides latency debt well enough that people thinks the model saved them right up until one overloaded process turns the whole system into a distributed whodunit.

cyberpunk [3 hidden]5 mins ago

Oh, I didn’t mean it’s the only one.

There are a bunch for sure! Turns out writing concurrent reliable distributed systems is really hard. I’ve not found anything else that makes them easier to deal with than BEAM though.

I’d switch if something better came along and happened to also be as battle hardened. I’ll be waiting a while i think.

dnautics [3 hidden]5 mins ago

in ten years of BEAM ive written a deadlock once. and zero times in prod.

id say its better to default to call instead of pushing people to use cast because it won't lock.

cyberpunk [3 hidden]5 mins ago

Generally agree, all the problems i’ve had with erlang have been related to full mailboxes or having one process type handling too many kinds of different messages etc.

These are manageable, but i really really stress and soak test my releases (max possible load / redline for 48+ hours) before they go out and since doing that things have been fairly fine, you can usually spot such issues in your metrics by doing that

never_inline [3 hidden]5 mins ago

> This isn’t just academic elegance, it kept phone switches running with five nines of availability.

Hmm....

> Erlang is the strongest form of the isolation argument, and it deserves to be taken seriously, which is why what happens next matters.

OK I think I know who wrote this.

> The problem isn’t that developers write circular calls by accident. It’s that deadlock-freedom doesn’t compose.

Is there a need to regugriate it in this format? "two protocols that are individually deadlock-free can still combine to deadlock in an actor system." This is the actually meaningful part.

> Forget to set a timeout on a gen_server:call?

People have pointed out its factually wrong in the thread. Eh

> This is the discipline tax. It works when the team is experienced, the codebase is well-maintained, and the conventions are followed consistently. It erodes when any of those conditions weaken, and given enough time and enough turnover they do.

I know this is an LLM tell, but can't point out. It makes me uneasy to read this. Maybe the rule of three? Maybe the reguggeiation of a elementary SE concept in between a technical description? Maybe because it's tryhard to sound smart? All three I guess.

I could go on, but sigh, man don't use these clankers to write prose. They're like negative level gzip compression.

worthless-trash [3 hidden]5 mins ago

Could be wrong, but that wont deadlock because 5 seconds later, you're going to have call/2 fail.

instig007 [3 hidden]5 mins ago

GHC Haskell has the best concurrency story among high-level programming languages. SMP parallelism, structured concurrency with M:N multicore mapping, STM transactions for data structures including members of collections (https://hackage.haskell.org/package/stm-containers), and OTP-like primitives (https://haskell-distributed.github.io/). All fit nicely into native binaries on x86_64 and arm64.

felixgallo [3 hidden]5 mins ago

This is agitslop.