How scalable is distributed Erlang?

We should specify that we talk about horizontal scalability of physical machines — that’s the only problem. CPUs on one machine will be handled by one VM, no matter what the number of those is.

node = machine.

To begin, I can say that 30-60 nodes you get out of the box (vanilla OTP installation) with any custom application written on the top of that (in Erlang). Proof: ejabberd.

~100-150 is possible with optimized custom application. I means, it has to be good code, written with knowledge about GC, characteristic of data types, message passing etc.

over +150 is all right but when we talk about numbers like 300, 500 it will require optimizations & customizations of TCP layer. Also, our app has to be aware of cost of e.g. sync calls across the cluster.

The other thing is DB layer. Mnesia (built-in) due its features will not be effective over 20 nodes (my experience – I may be wrong). Solution: just use something else: dynamo DBs, separate cluster of MySQLs, HBase etc.

The most common technique to leverage cost of creating high quality application and scalability are federations of ~20-50 nodes clusters. So internally its an efficient mesh of ~50 erlang nodes and its connected via any suitable protocol with N another 50 nodes clusters. To sum up, such a system is federation of N erlang clusters.

Distributed erlang is designed to run in one data center. If you need more, geographically distant nodes, then use federations.

There are lots of config options e.g. which do not connect all nodes to each other. It may be helpful, however in ~50 cluster erlang overhead is not significant. Also you can create a graph of erlang nodes using ‘hidden’ connection, which doesn’t join this full mesh, but also it cannot benefit from connection to all nodes.

The biggest problem I see, in this kind of systems, is designing it as master-less system. If you do not need that, everything should be ok.

Leave a Comment