Fast directory scanning
In this post, I present two approaches to directory scanning, one serial and one parallel. Both approaches return a collection of all files and directories contained by a given directory, including the contents (recursively) of each sub-directory thereof. For example, scanning the following directory tree:
…/dir/
…/dir/picture0.jpg
…/dir/picture1.jpg
…/dir/vacation-pictures/
…dir/vacation-pictures/grand-canyon1.jpg
returns a collection containing four entries, viz. …/dir/picture0.jpg, …/dir/picture1.jpg, …/dir/vacation-pictures/ and …/dir/vacation-pictures/grand-canyon1.jpg.
Serial Scanning
In serial scanning, a recursive method lists all files in a given directory, adds these to a list, and, if the file being added is a directory, invokes itself: The SerialDirScanner class, shown below, performs this task:
public class SerialDirScanner {
private final ConcurrentMap<File, Boolean> files = new ConcurrentHashMap<File, Boolean>();
public SerialDirScanner() {
super();
}
public Set<File> scan(File dir) throws InterruptedException,
ExecutionException {
scan0(dir);
return files.keySet();
}
private void scan0(File dir) {
for (final File file : dir.listFiles()) {
files.putIfAbsent(file, Boolean.TRUE);
if (file.isDirectory()) {
scan0(file);
}
}
}
public static Set<File> listAllContentsUnder(File dir)
throws InterruptedException, ExecutionException {
SerialDirScanner scanner = new SerialDirScanner();
return scanner.scan(dir);
}
}
The “scan” method invokes the recursive “scan0” method, which does the real work of scanning directories. That method fills up a concurrent map with files and directories as it works. Finally, the “scan” method returns all keys of the concurrent map.
Parallel scanning
Looking at the “scan0” method of the serial scanner shown above, many opportunities for parallelization become apparent. For example, the listing of different sibling directories (that is, directories that both have the same parent) can proceed in parallel, as can the listing of directories that are unrelated to each other (that is, one directory is not an ancestor of the other). The “DirScan” class shown below performs parallel scanning:
public class DirScanner {
private final ConcurrentMap<File, Boolean> files = new ConcurrentHashMap<File, Boolean>();
private final ExecutorService executor;
private final Semaphore sem = new Semaphore(1);
private final AtomicInteger count = new AtomicInteger();
public DirScanner(int threads) {
super();
executor = Executors.newFixedThreadPool(threads);
}
public Set<File> scan(File dir) throws InterruptedException,
ExecutionException {
sem.acquire();
scan0(dir);
sem.acquire();
return files.keySet();
}
private void scan0(File dir) {
for (final File file : dir.listFiles()) {
files.putIfAbsent(file, Boolean.TRUE);
if (file.isDirectory()) {
count.incrementAndGet();
executor.submit(new Runnable() {
public void run() {
DirScanner.this.scan0(file);
}
});
}
}
if (count.decrementAndGet() < 0) {
sem.release();
}
}
public void close(){
executor.shutdown();
}
public static Set<File> listAllContentsUnder(File dir, int threads)
throws InterruptedException, ExecutionException {
DirScanner scanner = new DirScanner(threads);
return scanner.scan(dir);
}
}
In common with the “SerialDirScanner” class, the “DirScanner” class has a public method named “scan” that takes a directory as an argument and returns a Set of File objects. The “scan” method acquires a semaphore of count 1, calls the “scan0” method, and waits to re-acquire the same semaphore. The “scan0” method performs the actual work of scanning directory contents. The “DirScanner” class contains an ExecutorService object named “executor” that is used to schedule scanning tasks.
Note that the “DirScanner” class accepts an integer constructor parameter named “threads”. This parameter is used to initialize the ExecutorService with a fixed thread pool of this size.
The “scan0” method submits a new task to the executor whenever it finds a sub-directory while performing a directory listing. The new “task”, in this case, happens to be a recursive invocation of itself. The “scan0” method increments a counter (an AtomicInteger instance object named “counter”) before the task is actually submitted. Each time the “scan0” method completes listing a directory, it decrements the counter, releasing the semaphore when the counter’s value falls below zero. This allows the “scan” method, which has been waiting to acquire this semaphore, to “wake up” and continue, effectively returning a Set of File objects from the key set of the concurrent map “files”.
Performance comparison and thread counts
To test the performance improvement, if any, of the parallelized directory scanner vis-a-vis the serial scanner, I wrote the following “main” method:
public static void main(String[] args) {
int status = 0;
try {
Set<File> files0 = null;
Set<File> files1 = null;
long t0 = System.nanoTime();
files0 = SerialDirScanner
.listAllContentsUnder(new File(args[0]));
t0 = System.nanoTime() - t0;
System.out.println(t0 + " ticks.");
t0 = System.nanoTime();
files1 = DirScanner.listAllContentsUnder(new File(args[0]), 5);
t0 = System.nanoTime() - t0;
System.out.println(t0 + " ticks.");
System.out.println(files0.size() + "," + files1.size());
System.out.println(files0.size() == files1.size());
} catch (Throwable exc) {
status = 1;
exc.printStackTrace();
} finally {
System.exit(status);
}
}
The argument I pass in to the main method is a directory that is well populated and quite deep. Running this on my development laptop (with the specs Core 2 Duo 2.26 MHz, 4GB, running Windows 7 Evaluation build No. 7100), I get:
1. Initially, before the directory structure is cached by the operating system:
3666120093 ticks.
2375654123 ticks.
39592,39592
true
2. Subsequently, after the directory structure has been cached:
3551936784 ticks.
2340802880 ticks.
39592,39592
true
So, there is a performance gain of roughly 33% using the parallel scanner using five threads (note that the “listAllContentsUnder” method accepts a thread count as the second argument). Using three threads, I get:
3594733052 ticks.
2335042793 ticks.
39592,39592
true
which is not much different from the readings obtained from using five threads. Using nine threads, I get:
3563738170 ticks.
2368012499 ticks.
39592,39592
true
which shows that the performance has actually been reduced by increasing the number of threads. The explanation for this curious observation is that the concurrent map, into which all threads write data, faces increased contention as the number of threads goes up.
Another interesting observation is that the parallel scanner is actually slower at scanning shallow directories. Consider the following output while scanning an example:
2602454 ticks.
10136304 ticks.
17,17
true
Conclusion
Using the parallel directory scanner speeds up the scanning of deep (and well populated) directories by roughly 33%. Shallow directories are better scanned with the serial directory scanner.

Comments