Exploring MapReduce for Large Collections of Phylogenetic Trees

 
 

Abstract

 
 

Phylogenetic trees depict the evolutionary relationships between a set of taxa. Most commonly used in the field of evolutionary biology to chart ancestor-descendent relationships among taxa, phylogenetic trees also have applications in pharmaceuticals, virology, antivenin treatment, and conservation: they allow researchers to illustrate the relatedness of compounds and species in order to make informed decisions about how to best treat a disease or preserve a population. As the number of taxa in a collection increases, the number of possible tree structures quickly grows, yielding a search space of (2n-5)!! possible trees for n taxa. This makes it increasingly difficult to generate and analyze the set of best hypothetical trees, necessitating the development of fast algorithms to simplify the search procedure and analyze the differences between the trees produced by phylogenetic search. In some cases, a single computer core cannot cope with the magnitude of the data it must process to produce the Robinson-Foulds matrix, a common distance metric that compares potential trees based on internal edges and bipartitions.

MrsRF ("MapReduce speeds up Robinson Foulds") is currently the fastest available algorithm for calculating the Robinson Foulds matrix. It uses Phoenix, a utility based on the MapReduce paradigm popularized by Google, to process the data using several cores. This Phoenix implementation lessens user workload by dynamically scheduling tasks to available cores and allowing the user to specify concurrency and locality at a high level while automating granularity adjustment and assignment of parallel tasks. In recent years, the availability of increasingly massive data sets, the release of Phoenix++, and the creation of TreeZip (a more efficient way of compressing trees for storage) have made it clear that MrsRF must be updated. By refactoring the code to work with Phoenix++ and TreeZip, we predict that we will be able to attain speedup far exceeding that of the original MrsRF implementation.