This website presents the code and results of Matt Holman's final project in Harvard's CS205 course, fall 2013.
This project combines three algorithmic approaches (the HEALPix tessellation, KD trees, and MPI) into one effective tool for searching astronomical data sets.
With the Hierarchical Equal Area isoLatitude Pixelization (HEALPix; Górski et al. 2005), the sky is divided into an enumerated set of regions, with a simple correspondence between locations on the sky and the HEALPix index.
HEALPix allows for a simple partitioning of data sets on the sphere into disjoint regions.
The healpy package provides python bindings to C++ routines that carry out various transformations and tests of sky coordinates and HEALPix indices.
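As a minimal illustration of the kind of healpy calls involved (the coordinates below are arbitrary examples), ang2pix maps a sky position to its HEALPix region index, pix2ang maps an index back to the region center, and nside2npix gives the total number of regions:

```python
import numpy as np
import healpy as hp

nside = 2                      # HEALPix resolution parameter
npix = hp.nside2npix(nside)    # number of regions = 12 * nside**2 (48 here)

# A sky position: co-latitude theta and longitude phi, in radians.
theta, phi = np.radians(30.0), np.radians(45.0)

# Map the position to its HEALPix region index (RING ordering by default).
ipix = hp.ang2pix(nside, theta, phi)

# And back: the coordinates of the center of that region.
theta_c, phi_c = hp.pix2ang(nside, ipix)
print(npix, ipix, theta_c, phi_c)
```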
In order to carry out a nearest-neighbors search, the KD tree requires a metric for the distance between any two points. There are several choices with the ANN package. The simplest and most intuitive metric is the Euclidean distance. However, we are concerned with angular separation between points on the sky. So we need to transform points on the sky into a space that has a Euclidean metric. There are at least two good options.
The first of these is the "tangent plane" projection. Here each point in a given HEALPix region is projected onto the plane tangent to the sphere at the center of the region. Distances in that plane are ordinary 2D Euclidean distances, and great circles on the sky map to straight lines in the tangent plane projection. This approach has the disadvantage that the distances depend on the local projection. Antipodal points can appear close, in projection, to the query point. (This is problematic and led me to seek another approach.)
The second approach is to simply represent sky positions with 3D unit vectors. The 3D Euclidean (chord) distance between unit vectors is well defined and does not depend on a local projection. Although each point must be represented with three values rather than two, the speed of the KD tree is similar. I adopt this second approach.
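As a sketch of this representation (using numpy; the coordinates and function names are illustrative), sky positions become unit vectors, and an angular search radius r corresponds to a chord radius of 2 sin(r/2), so the KD tree can be searched with an ordinary Euclidean radius:

```python
import numpy as np

def radec_to_unit_vector(ra, dec):
    """Convert right ascension and declination (radians) to a 3D unit vector."""
    return np.array([np.cos(dec) * np.cos(ra),
                     np.cos(dec) * np.sin(ra),
                     np.sin(dec)])

def chord_radius(angular_radius):
    """Chord (3D Euclidean) distance corresponding to an angular separation."""
    return 2.0 * np.sin(angular_radius / 2.0)

p1 = radec_to_unit_vector(np.radians(10.0), np.radians(20.0))
p2 = radec_to_unit_vector(np.radians(10.1), np.radians(20.0))
chord = np.linalg.norm(p1 - p2)
angle = 2.0 * np.arcsin(chord / 2.0)   # recover the angular separation
print(chord, np.degrees(angle))
```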
Given the data format, I implemented the queries in two different ways.
In the first implementation, I used the mpi4py bcast and gather collective calls to first distribute a set of query points to all of the ranks (CPUs) and to then assemble the results of each CPU's query of its KD tree.
What is notable here is that the code actually ignores the spatial organization of the data into disjoint sets. Instead, it simply asks every rank to inspect its KD tree. This is still reasonably efficient; if a query point is outside a KD tree's bounding box, the search returns immediately.
Thus, it would be possible to entirely ignore the HEALPix tessellation with this approach. We could simply divide the data evenly among the processors, regardless of the spatial distribution.
This approach becomes dominated by communication costs when a large number of CPUs is used. A possible optimization would be to use the Bcast and Gather collective calls, which transmit numpy arrays rather than generic python objects, to speed up the communication. (This is future work.)
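The following sketch shows the shape of this first implementation, using mpi4py's lower-case bcast and gather. Here scipy's cKDTree stands in for the KD-tree package actually used, and load_local_points and generate_queries are placeholders rather than functions from the project code:

```python
import numpy as np
from mpi4py import MPI
from scipy.spatial import cKDTree   # stand-in for the KD-tree package used here

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Each rank builds a KD tree from its own points (3D unit vectors);
# load_local_points is a placeholder for reading this rank's regions.
points = load_local_points(rank)
tree = cKDTree(points)

# Rank 0 generates the query points and the chord-distance search radius;
# generate_queries is likewise a placeholder.
queries, radius = (generate_queries() if rank == 0 else (None, None))

# Lower-case bcast/gather handle arbitrary (pickled) python objects.
queries = comm.bcast(queries, root=0)
radius = comm.bcast(radius, root=0)

# Every rank searches its own tree; if the queries lie far outside the
# tree's bounding box the search returns quickly.
local_matches = [tree.query_ball_point(q, radius) for q in queries]

# Rank 0 assembles the per-rank results.
all_matches = comm.gather(local_matches, root=0)
```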
The second implementation was my original intention but is still under development. In this one, only those ranks that could possibly contain data in the query region are notified.
The healpy package provides a convenient routine, query_disc, that returns a list of the indices of all the HEALPix regions that could fall within an angular radius of a point on the sky. Thus, given a query point and a search radius, we can determine which ranks to contact, send the query to only those ranks, and receive the results from each.
This approach requires more bookkeeping, but the speed should not be dominated by communication costs.
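A sketch of that bookkeeping (the region-to-rank mapping shown here is illustrative, not necessarily the one used in the code): query_disc identifies the regions that might overlap the search disc, and only the ranks holding those regions need to be contacted.

```python
import numpy as np
import healpy as hp

nside = 2
radius = 0.0005                       # angular search radius in radians

# The query point as a 3D unit vector; ang2vec converts from (theta, phi).
vec = hp.ang2vec(np.radians(30.0), np.radians(45.0))

# Indices of all HEALPix regions that could overlap the search disc;
# inclusive=True errs on the side of returning extra regions.
regions = hp.query_disc(nside, vec, radius, inclusive=True)

# Illustrative mapping: with exactly one region per rank, the region index
# is the rank to contact. In general this would consult the assignment
# table built when the regions were distributed.
ranks_to_contact = sorted(set(int(r) for r in regions))
print(ranks_to_contact)
```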
You will need an account on the Harvard Research Computing Odyssey 2.0 cluster to run the code and to access the synthetic data. Here are instructions for obtaining an odyssey2 account.
On odyssey, you will need to add /n/home01/mholman/lib/python to your PYTHONPATH, with
export PYTHONPATH=${PYTHONPATH}:/n/home01/mholman/lib/python:
Download the tar ball. Here is the original tar ball.
Untar it to generate the holman_project directory.
This includes various run.script versions.
From an Odyssey 2.0 headnode, obtain an interactive session with 192 (or 12 or 48) processes with this command:
srun -n 192 -p general -t 60 --x11=first --pty --mem-per-cpu=1750 bash
cd holman_project
To run on 12 processes: source run.script.12
To run on 48 processes: source run.script.48
To run on 192 processes: source run.script.192
Alternatively, the code is easy to run directly from the command line with something like:
mpirun -np 12 python project_nodes.py nside N
For the data sets available, nside = 1, 2, or 4 and N <= 10^7.
UPDATE: I modified the code so that an arbitrary number of processors can be used, rather than just 12, 48, or 192. Rank 0 now assigns the work as evenly as possible among the processes.
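One simple way to express that assignment (a sketch of the idea, not necessarily the exact scheme in the code) is to split the list of region indices into nearly equal chunks:

```python
import numpy as np

def assign_regions(nside, n_proc):
    """Split the 12*nside**2 region indices into n_proc nearly equal chunks."""
    n_regions = 12 * nside * nside
    # np.array_split produces chunks whose sizes differ by at most one.
    return np.array_split(np.arange(n_regions), n_proc)

# Example: 48 regions (nside = 2) over 5 processes -> chunks of 10, 10, 10, 9, 9.
for proc, regions in enumerate(assign_regions(2, 5)):
    print(proc, regions)
```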
The synthetic data were generated with
source generate.script.12
source generate.script.48
source generate.script.192
These scripts run code that uses multiple processors to simultaneously generate large binary files containing angular positions on the sky. Each file covers a particular HEALPix region.
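As a sketch of how points confined to a single HEALPix region can be generated (by rejection sampling against ang2pix; the function, file name, and binary layout here are illustrative, not the ones the generation scripts actually use):

```python
import numpy as np
import healpy as hp

def generate_region_points(nside, ipix, n):
    """Draw points uniformly on the sphere and keep those falling in region ipix."""
    theta_keep, phi_keep = [], []
    while len(theta_keep) < n:
        # Uniform points on the sphere: phi uniform, cos(theta) uniform.
        phi = np.random.uniform(0.0, 2.0 * np.pi, size=4 * n)
        theta = np.arccos(np.random.uniform(-1.0, 1.0, size=4 * n))
        keep = hp.ang2pix(nside, theta, phi) == ipix
        theta_keep.extend(theta[keep])
        phi_keep.extend(phi[keep])
    return np.array(theta_keep[:n]), np.array(phi_keep[:n])

# Write one region's points to a binary file (illustrative name and layout).
theta, phi = generate_region_points(nside=1, ipix=3, n=100000)
np.column_stack([theta, phi]).astype(np.float64).tofile("region_003.bin")
```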
I tested the correctness of the codes in a number of ways:
N points | N proc | nside | File read (sec) | Tree construction (sec) | Query generation (msec) | Query broadcast (msec) | Tree query & gather (msec) |
---|---|---|---|---|---|---|---|
1e7/region * 12 regions | 12 | 1 | 43.3 | 9.3 | 0.3 | 0.2 | 0.6 |
1e7/region * 12 regions | 6 | 1 | 72.5 | 20.3 | 1.3 | 0.1 | 0.8 |
1e7/region * 12 regions | 3 | 1 | 136.4 | 39.4 | 0.2 | 0.1 | 1.4 |
1e7/region * 12 regions | 1 | 1 | 384.6 | 118.4 | 0.3 | 0.1 | 3.8 |
1e7/region * 48 regions | 48 | 2 | 90.8 | 9.2 | 0.2 | 0.1 | 1.0 |
1e7/region * 48 regions | 24 | 2 | 85.4 | 15.6 | 0.2 | 0.1 | 1.0 |
1e7/region * 48 regions | 12 | 2 | 124.2 | 38.6 | 0.2 | 0.1 | 1.7 |
1e7/region * 48 regions | 6 | 2 | 259.7 | 80.5 | 0.3 | 0.2 | 5.7 |
1e7/region * 48 regions | 3 | 2 | 341.5 | 161.1 | 0.7 | 0.2 | 5.6 |
1e7/region * 192 regions | 192 | 4 | 80.2 | 19.9 | 0.3 | 0.5 | 5.0 |
1e7/region * 192 regions | 96 | 4 | 111.3 | 21.5 | 0.3 | 0.3 | 3.2 |
The table displays the time required to complete the different computational stages in a series of runs with the code. "N points" refers to the total number of points, which is 1e7 points per region times the number of regions. "N proc" is the number of processors used. When "N proc" is smaller than the number of regions, each processor manages the same number of regions. "nside" is the HEALPix resolution parameter; each of the 12 base regions is subdivided into nside*nside regions, so the number of regions = 12*nside*nside.
Note that these times are all for 6 query points and a search radius of 0.0005 radians. The times are the average of 10 trials.
The next columns show the time to read the files (sec), the time to construct the KD trees (sec), the time to generate the queries (msec), the time to broadcast those queries (msec), and the time to query the KD trees and gather the results (msec).
The file read time is simply proportional to the number of points an individual process must read.
Each sky region has its own KD tree. Thus, the time to generate the trees is proportional to the number of trees an individual process must generate.
The queries are generated by rank 0. This always takes roughly the same amount of time, with some scatter.
The time to broadcast the queries is small, but seems to grow a bit for large numbers of processes.
The time to query each tree is O(log N), where N is the number of points in the tree. Thus, each process requires O(M log N) time to query all of its M trees. The time to gather the results, although not separated out here, is related to the total amount of data being transmitted.
This approach to spatially querying large data sets appears to scale well. It is possible to keep billions of points in memory, distributed over a large number of processes, and quickly query the full data set. This is well beyond what can be done with a single CPU.
Reading the files and generating the KD trees takes a significant amount of time, but this is a one-time cost. The time to read in the files could be somewhat reduced by using a file format that supports parallel I/O. (HDF5 might be a good way to go.) The time to generate the KD trees was reduced by using a more efficient routine.
It appears that the time to evaluate a single query becomes dominated by the time to gather the data, when large numbers of processes are used. This argues for distributing the data over as few processes as per-CPU memory limitations allow. Also, it argues for exploring a Send-Recv approach that might reduce the communication time.
Finally, a little more effort is still needed to encapsulate some of the code into routines that hide nearly all of the internal behavior from the user and expose just a few basic methods.
Special thanks to the following people for helpful discussions: