In distributed computing, what are world size and rank?

These concepts come from parallel computing, so it helps to learn a little about that field first, e.g., MPI.

You can think of the world as a group containing all the processes of your distributed training. Usually, each GPU corresponds to one process. Processes in the world can communicate with one another, which is why you can train your model in a distributed fashion and still get correct gradient updates. So the world size is the number of processes in your training, which is usually the number of GPUs you are using for distributed training.
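For instance, in PyTorch (an assumption on my part; the same idea applies to any framework built on these concepts), every process first joins the world by initializing a process group and can then ask how big that world is. A minimal sketch, assuming the script is started by a launcher such as torchrun that sets the required environment variables (RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT):

```python
import torch.distributed as dist

# Join the "world". The default env:// init method reads the
# connection info from the environment variables set by the launcher.
dist.init_process_group(backend="nccl")  # use "gloo" on CPU-only machines

world_size = dist.get_world_size()  # total number of processes in the world
print(f"this world has {world_size} processes")
```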

Rank is the unique ID given to a process, so that other processes know how to identify a particular one. Local rank is a unique local ID for processes running on a single node; this is where my view differs from @zihaozhihao's.
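Continuing the PyTorch sketch above (still an assumption about your setup): the global rank is queried from the process group, while the local rank is typically handed to you by the launcher (torchrun exports it as the LOCAL_RANK environment variable) and is what you use to pick a GPU on the current node:

```python
import os

import torch
import torch.distributed as dist

rank = dist.get_rank()                      # globally unique: 0 .. world_size - 1
local_rank = int(os.environ["LOCAL_RANK"])  # unique only within this node

torch.cuda.set_device(local_rank)  # pin this process to one local GPU
print(f"global rank {rank}, local rank {local_rank}")
```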

Let's take a concrete example. Suppose we run our training on 2 servers (some articles also call them nodes), each with 4 GPUs. The world size is 4*2 = 8. The ranks of the processes are [0, 1, 2, 3, 4, 5, 6, 7]. Within each node, the local ranks are [0, 1, 2, 3].
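The arithmetic behind those numbers: common launchers such as torchrun derive the global rank as node_rank * nproc_per_node + local_rank. A small Python sketch using the example's values:

```python
# 2 nodes, 4 processes (GPUs) per node, as in the example above.
nnodes, nproc_per_node = 2, 4
world_size = nnodes * nproc_per_node  # 2 * 4 = 8

for node_rank in range(nnodes):
    for local_rank in range(nproc_per_node):
        rank = node_rank * nproc_per_node + local_rank
        print(f"node {node_rank}: local rank {local_rank} -> global rank {rank}")
```

Running this prints global ranks 0-3 for node 0 and 4-7 for node 1, with local ranks 0-3 repeating on each node.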

I have also written a post about MPI collectives and basic concepts. The link is here.
