- on Sat 21 December 2013
Category: MPI HPC Parallel File Systems
Introduction
If you've ever put 'too many' files into a directory, then you know what happens when you type ls -l (long listing). Each file's attributes (metadata) must be read using the stat system call so that they can be displayed in the output. Calling stat for hundreds of thousands of files can take a very long time (depending on the file system). Large file systems can contain millions or billions of files spread across many directories. When I worked at the Los Alamos National Laboratory, we had several large parallel file systems attached to multiple supercomputers. These parallel file systems were used as fast, temporary storage for jobs running on the supercomputers. Users were expected to migrate their data to a more reliable (and slower) file system when they were done with it, in order to free space on the parallel file system for other users. We observed that, more often than not, users didn't even know how much space they were using on the parallel file system. To make matters worse, the standard tools for finding that information relied on calling stat serially. If you give a scientist 100,000 processors, they will write 100,000 files over and over again. This is the first part of the story of how a distributed, randomized file tree walk algorithm was developed.
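To make the cost concrete, here's a rough sketch of the serial pattern such tools follow (plain POSIX, not the actual code of ls or any of our tools): open the directory, then stat every entry, one at a time.

```c
#include <stdio.h>
#include <dirent.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
    const char *dirpath = (argc > 1) ? argv[1] : ".";
    DIR *dir = opendir(dirpath);
    if (dir == NULL) {
        perror("opendir");
        return 1;
    }

    struct dirent *entry;
    struct stat sb;
    char path[4096];

    while ((entry = readdir(dir)) != NULL) {
        snprintf(path, sizeof(path), "%s/%s", dirpath, entry->d_name);
        /* One synchronous metadata query per entry; with hundreds of
         * thousands of entries, these round trips dominate the runtime. */
        if (stat(path, &sb) == 0)
            printf("%10lld  %s\n", (long long)sb.st_size, entry->d_name);
    }

    closedir(dir);
    return 0;
}
```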
Necessity is the mother of invention
At the time, I had a working knowledge of the file system internals from many nights spent keeping the file system running. I was also a graduate student and had worked on several parallel code projects for fun and profit. I knew that our file system metadata could be queried in parallel, and that even a small speedup would be better than a serial process. Armed with those skills and that knowledge, I set out to solve the problem at hand: efficiently visit every file in a tree, in parallel. If the general problem could be solved, it could be applied to any operation on a file tree (copying, deleting, stat-ing).
The naive solution
The first, and simplest, solution to implement was a master/worker architecture. The master process distributes work units to the worker processes, one or more units at a time. Each worker performs the work assigned and sends the results back to the master process (the results might actually contain more work). In this case, the work units are file metadata queries and the results are the attributes. The master would begin by opening the root directory to be traversed, reading its children onto the queue, and then dispatching those items to worker processes. Worker processes would stat each item they received and, if an item was a directory, send its children back to the master process.
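To make the architecture concrete, here's a rough sketch of what such a master/worker walker might look like (assuming MPI and POSIX, with at least two ranks). The tags, fixed-size buffers, and queue cap are simplifications for illustration, not the original code.

```c
#include <mpi.h>
#include <stdio.h>
#include <string.h>
#include <dirent.h>
#include <sys/stat.h>

#define PATH_LEN  1024
#define QUEUE_CAP 65536                /* sketch only: a real tool grows its queue */
#define MAX_RANKS 1024
enum { TAG_WORK = 1, TAG_CHILD = 2, TAG_REQUEST = 3, TAG_STOP = 4 };

static char queue[QUEUE_CAP][PATH_LEN];   /* the master's LIFO work queue */

static void master(const char *root, int nworkers)
{
    int top = 0, busy = 0, nidle = 0, idle[MAX_RANKS];
    char buf[PATH_LEN];
    MPI_Status st;

    snprintf(queue[top++], PATH_LEN, "%s", root);

    /* Wait for every worker to report in as idle. */
    for (int i = 0; i < nworkers; i++) {
        MPI_Recv(buf, PATH_LEN, MPI_CHAR, MPI_ANY_SOURCE, TAG_REQUEST,
                 MPI_COMM_WORLD, &st);
        idle[nidle++] = st.MPI_SOURCE;
    }

    while (top > 0 || busy > 0) {
        /* Hand queued paths to idle workers. */
        while (top > 0 && nidle > 0) {
            MPI_Send(queue[--top], PATH_LEN, MPI_CHAR, idle[--nidle],
                     TAG_WORK, MPI_COMM_WORLD);
            busy++;
        }
        /* Wait for the next message: a child path, or a worker going idle. */
        MPI_Recv(buf, PATH_LEN, MPI_CHAR, MPI_ANY_SOURCE, MPI_ANY_TAG,
                 MPI_COMM_WORLD, &st);
        if (st.MPI_TAG == TAG_CHILD) {
            if (top < QUEUE_CAP)
                memcpy(queue[top++], buf, PATH_LEN);
        } else {                          /* TAG_REQUEST */
            idle[nidle++] = st.MPI_SOURCE;
            busy--;
        }
    }
    /* Every path has been visited and every worker is idle: shut them down. */
    while (nidle > 0)
        MPI_Send("", 1, MPI_CHAR, idle[--nidle], TAG_STOP, MPI_COMM_WORLD);
}

static void worker(void)
{
    char path[PATH_LEN], child[PATH_LEN];
    struct stat sb;
    MPI_Status st;

    MPI_Send("", 1, MPI_CHAR, 0, TAG_REQUEST, MPI_COMM_WORLD);  /* "I'm idle" */
    for (;;) {
        MPI_Recv(path, PATH_LEN, MPI_CHAR, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
        if (st.MPI_TAG == TAG_STOP)
            break;
        /* The actual work: one metadata query per item (lstat, so symlinks
         * are not followed). */
        if (lstat(path, &sb) == 0 && S_ISDIR(sb.st_mode)) {
            DIR *d = opendir(path);
            struct dirent *e;
            while (d && (e = readdir(d)) != NULL) {
                if (!strcmp(e->d_name, ".") || !strcmp(e->d_name, ".."))
                    continue;
                snprintf(child, PATH_LEN, "%s/%s", path, e->d_name);
                /* Every discovered child goes back through rank 0. */
                MPI_Send(child, PATH_LEN, MPI_CHAR, 0, TAG_CHILD, MPI_COMM_WORLD);
            }
            if (d)
                closedir(d);
        }
        MPI_Send("", 1, MPI_CHAR, 0, TAG_REQUEST, MPI_COMM_WORLD);  /* idle again */
    }
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (rank == 0)
        master(argc > 1 ? argv[1] : ".", size - 1);
    else
        worker();
    MPI_Finalize();
    return 0;
}
```

Notice that every child path a worker discovers has to make a round trip through rank 0 before any worker can process it.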
Although simple, this solution is far from ideal. First of all, the process of querying metadata can be done by nodes independently (in other words, distributed), but the master/worker architecture centralizes it. The second problem can't be explained without a little context. The parallel program for this algorithm would be running on a supercomputer whose network interconnect was designed to maximize throughput between every pair of communication endpoints. With this in mind, note that all of the communication takes place between the master process and the worker processes, while none takes place between the workers themselves.
To demonstrate this point, the implementation of the master/worker file tree walk was instrumented so that communication between the processes was logged (both message size and message count).
Centralized Communication Heat Map
The graph above shows a heat map of the communication data for a tree walk on a 471TB file system using 30 processes (called ranks). Each process is identified by its rank, and each intersection of two ranks in the graph is colored based on the amount of data exchanged between them. For example, rank 0 didn't send any data to rank 0, so that block is colored white (meaning 0 bytes). You can see that rank 0 (the master process) exchanged data with each worker, but favored the first few (the implementation used a LIFO queue). This was indicative of another problem with the algorithm: unbalanced load distribution. Also note that because the workers didn't exchange any data amongst themselves, all of the corresponding blocks are white. Although this approach was significantly faster than a serial tree walk, a lot of performance was left on the table.
How much performance did we leave on the table using a centralized algorithm? The final algorithm, presented in part II, was instrumented as well, and the results are shown below. Only 8 processes were used because the new algorithm is more efficient, and we were able to saturate the file system metadata servers with a smaller number of workers. You may be thinking that networking != performance, and you are correct. But rest assured that I'll show the speedup achieved by this load redistribution in part II.
Distributed Communication Heat Map
These visualizations were made using D3.js. You can see the source here.
A better solution
I was determined to make it faster, and I started from scratch. This time, however, I abandoned the master/worker architecture in favor of dynamic load distribution. I devised a simple scheme in which all processes were implemented as simple state machines; there would be no master process. Instead, each worker would maintain its own work queue and operate independently using a simple loop: if the process had work items queued, it would process them; otherwise, it would ask another worker, chosen at random, for work. After processing a work queue item, if another process had requested work from it, it would send that process half of its own queue. This scheme is known as work stealing.
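Here's a rough sketch of the two halves of that exchange (assuming MPI and at least two ranks). The queue layout, tags, and function names are illustrative only, not taken from the real code, and termination detection is left out because it's covered below.

```c
#include <mpi.h>
#include <stdlib.h>

#define PATH_LEN   1024
#define QUEUE_CAP  4096                 /* sketch only: fixed-size local queue */
#define TAG_STEAL_REQ    10             /* "do you have any work for me?"      */
#define TAG_STEAL_REPLY  11             /* reply: zero or more work items      */

typedef struct {
    char items[QUEUE_CAP][PATH_LEN];    /* heap-allocate this in real code     */
    int  count;
} work_queue;

/* Check, without blocking, whether any other rank has asked us for work.
 * If so, give the requester half of our queue; an empty reply means "no". */
void answer_steal_requests(work_queue *q)
{
    int flag, dummy;
    MPI_Status st;

    MPI_Iprobe(MPI_ANY_SOURCE, TAG_STEAL_REQ, MPI_COMM_WORLD, &flag, &st);
    while (flag) {
        MPI_Recv(&dummy, 1, MPI_INT, st.MPI_SOURCE, TAG_STEAL_REQ,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        int give = q->count / 2;        /* half of whatever we currently hold  */
        MPI_Send(q->items[q->count - give], give * PATH_LEN, MPI_CHAR,
                 st.MPI_SOURCE, TAG_STEAL_REPLY, MPI_COMM_WORLD);
        q->count -= give;

        MPI_Iprobe(MPI_ANY_SOURCE, TAG_STEAL_REQ, MPI_COMM_WORLD, &flag, &st);
    }
}

/* Out of work: pick another rank at random, ask it for work, and wait for
 * its reply. Returns the number of items received (possibly zero). */
int steal_from_random_peer(work_queue *q, int my_rank, int nranks)
{
    int victim = rand() % nranks;
    if (victim == my_rank)
        victim = (victim + 1) % nranks;

    int dummy = 0, nbytes;
    MPI_Status st;
    MPI_Send(&dummy, 1, MPI_INT, victim, TAG_STEAL_REQ, MPI_COMM_WORLD);
    MPI_Recv(q->items[q->count], (QUEUE_CAP - q->count) * PATH_LEN, MPI_CHAR,
             victim, TAG_STEAL_REPLY, MPI_COMM_WORLD, &st);

    MPI_Get_count(&st, MPI_CHAR, &nbytes);
    q->count += nbytes / PATH_LEN;
    return nbytes / PATH_LEN;
}
```

One detail this sketch glosses over: a process that is blocked waiting for its own steal reply still has to keep answering other processes' requests, otherwise two idle processes can end up waiting on each other forever.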
The first thing I noticed was that a large number of messages were exchanged as the processes requested work from each other while performing the same tree walk as in the first figure. The graph above shows the number of messages sent between all nodes over time. At first, only one process has work: the one that opened the first directory. Then there is a flurry of communication as processes ask each other for work (often without success). Finally, at the end of the walk, communication spikes again as processes begin to starve. But that raises another question: without a master process, how would the processes even know when the walk was over?
Distributed Termination Detection
Luckily, the problem of distributed termination detection had already been solved by Dijkstra. The algorithm is very clever and uses simple rules to accomplish its goal. I'll summarize them here, but if you want to know more, you should read the paper.
- All processes are logically ordered. This one is easy because the processes are already uniquely identifiable by their MPI rank.
- Processes have a color, which can be black or white.
- Every process is initially colored white.
- There is a token which can be passed between processes (only one process can possess the token at a time).
- The token can be colored black or white.
- Initially, the first process (rank 0) has the token, and the token is white.
- When the first process (rank 0) is idle, it sends the white token to the next process (rank 1).
- Any time a process sends work to another process with a lower rank, it colors itself black.
- If a process is black and receives the token, it makes the token black, makes itself white, and then forwards the token to the next process.
- If a process is white and receives the token, it forwards the token unchanged.
- Tokens are only forwarded by a process when it has no work in its queue.
- Termination is detected when the first process (rank 0) receives a white token.
The implementation of this algorithm can be found on GitHub here. I would eventually like to rewrite it in a higher-level language, but the C implementation has been sufficient for my uses so far.
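For illustration, here's a rough sketch of how those rules might translate into code (assuming MPI; the names and the message tag are mine, not the real implementation's):

```c
#include <mpi.h>
#include <stdbool.h>

#define TAG_TOKEN 99                      /* illustrative tag, not from the real code */

enum color { WHITE = 0, BLACK = 1 };

static enum color my_color;               /* this process's color              */
static enum color token_color;            /* the color of the token we hold    */
static bool       have_token;             /* do we currently hold the token?   */

/* Rule: every process starts white; rank 0 starts out holding a white token. */
void token_init(int my_rank)
{
    my_color    = WHITE;
    token_color = WHITE;
    have_token  = (my_rank == 0);
}

/* Rule: sending work to a lower-ranked process colors the sender black. */
void note_work_sent(int my_rank, int dest_rank)
{
    if (dest_rank < my_rank)
        my_color = BLACK;
}

/* Call this when a token message arrives. Returns true only on rank 0 and
 * only when the token came back white, i.e. the walk has terminated. */
bool token_received(int my_rank, int incoming_color)
{
    have_token  = true;
    token_color = (enum color)incoming_color;

    if (my_rank == 0)
        return token_color == WHITE;      /* white token back at rank 0: done  */

    if (my_color == BLACK) {              /* a black process taints the token  */
        token_color = BLACK;
        my_color    = WHITE;              /* ...and resets itself to white     */
    }
    return false;
}

/* Rule: the token is only forwarded once the local work queue is empty. */
void token_forward_when_idle(int my_rank, int nranks)
{
    if (!have_token)
        return;
    if (my_rank == 0)
        token_color = WHITE;              /* rank 0 always re-injects a white token */

    int buf = (int)token_color;
    MPI_Send(&buf, 1, MPI_INT, (my_rank + 1) % nranks, TAG_TOKEN, MPI_COMM_WORLD);
    have_token = false;
}
```

Once rank 0 sees the white token come back, it still has to tell the other processes to stop working; in practice that's one more round of messages, which I've left out of the sketch.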