- on Sun 05 January 2014
Category: MPI HPC Parallel File Systems
Introduction
In Part I of this article, I introduced the problem, the naive (master/worker) solution, and the first version of a distributed algorithm, in which worker processes ask other worker processes (chosen at random) for work when they need it. That first distributed version introduced a defect of its own: far too many messages were exchanged between workers. In this part, I will discuss the random queue-splitting algorithm that was used to solve the problem, and the results we achieved with it.
Random vs. Equal Queue Splitting
When a worker process $P_i$ receives a request for work from another process $P_j$, it provides $P_j$ with some portion of its own pending tasks (if any). $P_i$ may split its queue into two equal halves and send one half to the requesting process $P_j$, as our first attempt did. Alternatively, $P_i$ could split its queue unevenly and assign some other portion of the work to $P_j$. But how much should it share? Remember that work queue items in this case are either files or directories. If a work queue entry is a file, then it's easy to quantify how much work it represents (keep in mind that at this point, the items are just filenames with no information about whether they are files or directories). But if a work queue entry is a directory, how does one quantify how much work it represents without first fully exploring that directory? For each work queue item $P_i$ gives to $P_j$, $P_i$ really has no idea how much work it's giving away - which makes our goal of equitable load distribution seem impossible.
When faced with such a decision, it is impossible to know the ideal solution without having previously solved the exact instance of the problem. Having no other information to go on, and knowing the problem with splitting the queues equally, I chose to split them randomly. The idea was that when a process received a work request from another, it would simply give away a random portion of its queue (even when multiple processes requested work).
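To make the idea concrete, here is a minimal sketch of a random queue split in C. The work_queue type, its field names, and the rand()-based choice of split point are assumptions for illustration only; they are not the actual libcircle data structures.

```c
#include <stdlib.h>

/* Hypothetical work queue: an array of path strings (illustration only). */
typedef struct {
    char **items;  /* pending paths (files or directories) */
    size_t count;  /* number of pending items              */
} work_queue;

/*
 * When P_i receives a work request from P_j, give away a random portion
 * of its queue instead of always giving half.  The giver always keeps
 * at least one item so that it can keep making progress on its own.
 */
static size_t split_random(work_queue *mine, work_queue *reply)
{
    if (mine->count < 2) {
        reply->items = NULL;
        reply->count = 0;               /* nothing to share */
        return 0;
    }

    /* Give away between 1 and count-1 items, chosen uniformly at random. */
    size_t give = 1 + (size_t)(rand() % (int)(mine->count - 1));

    reply->items = &mine->items[mine->count - give];  /* hand off the tail */
    reply->count = give;
    mine->count -= give;                              /* keep the rest     */

    return give;
}
```

The important property is that the giver always keeps at least one item, so it continues making progress on its own queue no matter how the random split falls.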
What if random is wrong?
Theoretically, splitting the queue at random could actually cause very unbalanced load distribution. How bad could it be? To be specific about cost, we'll use a simple model of network cost: $$C(n) = \alpha + n\beta$$ In this model, $\alpha$ is the network latency, $\beta$ is the average transmission cost for one network unit, and $n$ is the message length.
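As a quick illustration of the model (these numbers are made up, not measured values), suppose $\alpha = 100\,\mu s$ and $\beta = 0.01\,\mu s$ per unit. A one-unit request then costs essentially the latency alone, while a reply carrying $10^4$ units costs:

$$ C(10^4) = 100\,\mu s + 10^4 \times 0.01\,\mu s = 200\,\mu s $$

What the worst-case analysis below really counts is how many of these messages are exchanged and how large they are.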
Now that we have a metric, let's examine the worst-case scenario. Asymptotically, the queue can be split at most $n$ times for $n$ elements. This is because for every split, the splitting process always keeps at least one element to process. No matter how the work is distributed amongst the processes, the pending work of a process decreases by at least one unit after a queue split.
If we assume that there are $p$ processes and each process consumes one unit of the $n$ initial work units between two consecutive splits of its queue, then the work to be done decreases by at least one unit for each queue split. This in turn implies a reduction of $p$ units across the $p$ processes after their corresponding queue splits. The communication cost depends on the queue splitting, since every queue split implies two messages (a request and a reply).
Each exchange consists of a work request of size one unit and a response of size in $ \lbrace n - 1, \dots, 1 \rbrace $, giving a worst-case total cost of: $$ C_a = \lbrace C(1) + C(n-1) \rbrace + \lbrace C(1) + C(n-2) \rbrace + \dots + \lbrace C(1) + C(1) \rbrace $$
Expanding and combining all of the $C(1)$ terms:
$$ C_a = (n-1)(\alpha + \beta) + \lbrace \alpha + (n-1)\beta \rbrace + \dots + \lbrace \alpha + \beta \rbrace $$
Simplified:
$$ C_a = (n-1)(\alpha + \beta) + (n-1)\alpha + \frac{\beta}{2} n (n-1) $$
$$ C_a = O(n^2) $$
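For anyone who wants to see the quadratic growth numerically, here is a small standalone sketch (the constants are arbitrary, and the loop simply evaluates the sum above):

```c
#include <stdio.h>

/* C(n) = alpha + n * beta : the simple linear network cost model */
static double msg_cost(double alpha, double beta, double n)
{
    return alpha + n * beta;
}

int main(void)
{
    const double alpha = 100.0; /* latency, arbitrary units   */
    const double beta  = 0.01;  /* per-unit transmission cost */

    /* Worst case: the queue is split n-1 times; each split costs a
     * one-unit request plus a reply of size n-1, n-2, ..., 1.      */
    for (long n = 1000; n <= 100000; n *= 10) {
        double total = 0.0;
        for (long k = 1; k < n; k++) {
            total += msg_cost(alpha, beta, 1.0)              /* request */
                   + msg_cost(alpha, beta, (double)(n - k)); /* reply   */
        }
        printf("n = %6ld  worst-case cost = %.3e\n", n, total);
    }
    return 0;
}
```

With these constants the latency terms dominate for small $n$, but the quadratic $\beta$ term eventually takes over as $n$ grows.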
Now we know that in the absolute worst case, the cost of this distributed algorithm is $O(n^2)$. In practice, however, this case was never encountered. Below is a graph comparing the distributed algorithm to the centralized one.
It's easy to see from the graph that our distributed algorithm outperforms the centralized algorithm by a significant margin, even with fewer processes. You might be wondering, can we do better? To find out, I profiled the code to see what exactly was going on.
The graph above shows the results of profiling the distributed algorithm by function. The first function, communicate(), represents all of the communication overhead that our algorithm incurs (Open MPI in our implementation). The callback() function is called for every file in the tree; if you wanted to copy the file tree, the callback would be a function that copies one file, but in this test it was just a logging function. Next, readdir() is the function used to read the contents of a directory onto a worker's queue. Finally, lstat() is the function used to determine whether a filename refers to a directory or a regular file.
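As a rough sketch of what one unit of work looks like on a worker, the loop below ties those pieces together. The enqueue() and callback() stubs stand in for the real queue insertion and per-file action; they are not libcircle functions.

```c
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/* Stub: push a path onto the local work queue (the real code would
 * insert into the distributed queue described above). */
static void enqueue(const char *path)
{
    printf("enqueue: %s\n", path);
}

/* Stub: the per-file user callback, e.g. copy or just log the file. */
static void callback(const char *path)
{
    printf("file:    %s\n", path);
}

/*
 * Process one work item: lstat() tells us whether the name is a
 * directory; directories are expanded with readdir() and their entries
 * pushed back onto the queue, everything else goes to the callback.
 */
static void process_item(const char *path)
{
    struct stat st;

    if (lstat(path, &st) != 0) {
        perror(path);
        return;
    }

    if (S_ISDIR(st.st_mode)) {
        DIR *dir = opendir(path);
        if (dir == NULL) {
            perror(path);
            return;
        }

        struct dirent *entry;
        while ((entry = readdir(dir)) != NULL) {
            if (strcmp(entry->d_name, ".") == 0 ||
                strcmp(entry->d_name, "..") == 0)
                continue;

            char child[4096];
            snprintf(child, sizeof(child), "%s/%s", path, entry->d_name);
            enqueue(child);
        }
        closedir(dir);
    } else {
        callback(path);
    }
}

int main(void)
{
    process_item(".");   /* walk one level of the current directory */
    return 0;
}
```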
Conclusion
We can safely assume that the distributed algorithm can't really get much faster, because the graph above shows that the vast majority of the execution time is spent in the system calls lstat() and readdir(), as it should be. Additionally, the same amount of work could be accomplished with far fewer resources, as shown in the first graph.
The implementation of this algorithm is still in use today on supercomputers at Los Alamos National Laboratory. The algorithm was implemented as a library called libcircle, named for the circular ordering of processes used for termination detection (explained in Part I). It's written in C, and the source can be found on GitHub. The original implementation has also been wrapped so that it can be used from Go, and you can find that library on GitHub as well.
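For reference, a libcircle client is structured roughly like the sketch below. This is from memory, so treat the exact names and signatures as assumptions and check the README in the GitHub repository for the authoritative API.

```c
/*
 * Sketch of a libcircle client, from memory -- see the project README
 * on GitHub for the authoritative function names and signatures.
 */
#include <stdio.h>
#include <libcircle.h>

/* Runs on rank 0 to seed the distributed queue with the tree root. */
static void create_work(CIRCLE_handle *handle)
{
    char root[CIRCLE_MAX_STRING_LEN] = "/tmp";  /* hypothetical starting point */
    handle->enqueue(root);
}

/* Runs on every rank: dequeue one item and process it. */
static void process_work(CIRCLE_handle *handle)
{
    char path[CIRCLE_MAX_STRING_LEN];
    handle->dequeue(path);

    /* Here you would lstat() the path, enqueue directory entries with
     * handle->enqueue(), or run the per-file callback on regular files. */
    printf("visited: %s\n", path);
}

int main(int argc, char **argv)
{
    CIRCLE_init(argc, argv, CIRCLE_DEFAULT_FLAGS);

    CIRCLE_cb_create(&create_work);
    CIRCLE_cb_process(&process_work);

    CIRCLE_begin();      /* runs until global termination is detected */
    CIRCLE_finalize();
    return 0;
}
```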