This website presents the code and results of Matt Holman's final project in Harvard's CS205 course, fall 2013.
This project combines three algorithmic approaches (the HEALPix tessellation, KD trees, and MPI) into one effective tool for searching astronomical data sets.
With the Hierarchical Equal Area isoLatitude Pixelization (HEALPix; Górski et al. 2005), the sky is divided into an enumerated set of regions, with a simple correspondence between locations on the sky and the HEALPix index.
HEALPix allows for a simple partitioning of data sets on the sphere into disjoint regions.
The healpy package provides python bindings to C++ routines that carry out various transformations and tests of sky coordinates and HEALPix indices.
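As a minimal illustration of the kind of healpy calls involved (the coordinates below are arbitrary examples), ang2pix maps a sky position to its HEALPix region index, pix2ang maps an index back to the region center, and nside2npix gives the total number of regions:

```python
import numpy as np
import healpy as hp

nside = 2                      # HEALPix resolution parameter
npix = hp.nside2npix(nside)    # number of regions = 12 * nside**2 (48 here)

# A sky position: co-latitude theta and longitude phi, in radians.
theta, phi = np.radians(30.0), np.radians(45.0)

# Map the position to its HEALPix region index (RING ordering by default).
ipix = hp.ang2pix(nside, theta, phi)

# And back: the coordinates of the center of that region.
theta_c, phi_c = hp.pix2ang(nside, ipix)
print(npix, ipix, theta_c, phi_c)
```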
In order to carry out a nearest-neighbors search, the KD tree requires a metric for the distance between any two points. There are several choices with the ANN package. The simplest and most intuitive metric is the Euclidean distance. However, we are concerned with angular separation between points on the sky. So we need to transform points on the sky into a space that has a Euclidean metric. There are at least two good options.
The first of these is the "tangent plane" projection. Here each point in a given HEALPix region is projected onto the plane tangent to the sphere at the center of the region. Distances in that plane are ordinary 2D Euclidean distances, and great circles on the sky map to straight lines in the tangent plane projection. This approach has the disadvantage that the distances depend on the local projection. Antipodal points can appear close, in projection, to the query point. (This is problematic and led me to seek another approach.)
The second approach is to simply represent sky positions with 3D unit vectors. The 3D Euclidean (chord) distance between unit vectors is well defined and does not depend on a local projection. Although each point must be represented with three values rather than two, the speed of the KD tree is similar. I adopt this second approach.
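As a sketch of this representation (using numpy; the coordinates and function names are illustrative), sky positions become unit vectors, and an angular search radius r corresponds to a chord radius of 2 sin(r/2), so the KD tree can be searched with an ordinary Euclidean radius:

```python
import numpy as np

def radec_to_unit_vector(ra, dec):
    """Convert right ascension and declination (radians) to a 3D unit vector."""
    return np.array([np.cos(dec) * np.cos(ra),
                     np.cos(dec) * np.sin(ra),
                     np.sin(dec)])

def chord_radius(angular_radius):
    """Chord (3D Euclidean) distance corresponding to an angular separation."""
    return 2.0 * np.sin(angular_radius / 2.0)

p1 = radec_to_unit_vector(np.radians(10.0), np.radians(20.0))
p2 = radec_to_unit_vector(np.radians(10.1), np.radians(20.0))
chord = np.linalg.norm(p1 - p2)
angle = 2.0 * np.arcsin(chord / 2.0)   # recover the angular separation
print(chord, np.degrees(angle))
```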
Given the data format, I implemented the queries in two different ways.
In the first implementation, I used the mpi4py bcast and gather collective calls to first distribute a set of query points to all of the ranks (CPUs) and to then assemble the results of each CPU's query of its KD tree.
What is notable here is that the code actually ignores the spatial organization of the data into disjoint sets. Instead, it simply asks every rank to inspect its KD tree. This is still reasonably efficient; if a query point is outside a KD tree's bounding box, the search returns immediately.
Thus, it would be possible to entirely ignore the HEALPix tessellation with this approach. We could simply divide the data evenly among the processors, regardless of the spatial distribution.
This approach becomes dominated by communication costs when a large number of CPUs is used. A possible optimization would be to use the Bcast and Gather collective calls, which transmit numpy arrays rather than generic python objects, to speed up the communication. (This is future work.)
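The following sketch shows the shape of this first implementation, using mpi4py's lower-case bcast and gather. Here scipy's cKDTree stands in for the KD-tree package actually used, and load_local_points and generate_queries are placeholders rather than functions from the project code:

```python
import numpy as np
from mpi4py import MPI
from scipy.spatial import cKDTree   # stand-in for the KD-tree package used here

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Each rank builds a KD tree from its own points (3D unit vectors);
# load_local_points is a placeholder for reading this rank's regions.
points = load_local_points(rank)
tree = cKDTree(points)

# Rank 0 generates the query points and the chord-distance search radius;
# generate_queries is likewise a placeholder.
queries, radius = (generate_queries() if rank == 0 else (None, None))

# Lower-case bcast/gather handle arbitrary (pickled) python objects.
queries = comm.bcast(queries, root=0)
radius = comm.bcast(radius, root=0)

# Every rank searches its own tree; if the queries lie far outside the
# tree's bounding box the search returns quickly.
local_matches = [tree.query_ball_point(q, radius) for q in queries]

# Rank 0 assembles the per-rank results.
all_matches = comm.gather(local_matches, root=0)
```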
The second implementation was my original intention but is still under development. In this one, only those ranks that could possibly contain data in the query region are notified.
The healpy package provides a convenient routine, query_disc, that returns a list of the indices of all the HEALPix regions that could fall within an angular radius of a point on the sky. Thus, given a query point and a search radius, we can determine which ranks to contact, send the query to only those ranks, and receive the results from each.
This approach requires more bookkeeping, but the speed should not be dominated by communication costs.
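A sketch of that bookkeeping (the region-to-rank mapping shown here is illustrative, not necessarily the one used in the code): query_disc identifies the regions that might overlap the search disc, and only the ranks holding those regions need to be contacted.

```python
import numpy as np
import healpy as hp

nside = 2
radius = 0.0005                       # angular search radius in radians

# The query point as a 3D unit vector; ang2vec converts from (theta, phi).
vec = hp.ang2vec(np.radians(30.0), np.radians(45.0))

# Indices of all HEALPix regions that could overlap the search disc;
# inclusive=True errs on the side of returning extra regions.
regions = hp.query_disc(nside, vec, radius, inclusive=True)

# Illustrative mapping: with exactly one region per rank, the region index
# is the rank to contact. In general this would consult the assignment
# table built when the regions were distributed.
ranks_to_contact = sorted(set(int(r) for r in regions))
print(ranks_to_contact)
```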
You will need an account on the Harvard Research Computing Odyssey 2.0 cluster to run the code and to access the synthetic data. Here are instructions for obtaining an odyssey2 account.
On odyssey, you will need to add /n/home01/mholman/lib/python to your PYTHONPATH, with
export PYTHONPATH=${PYTHONPATH}:/n/home01/mholman/lib/python:
Download the tar ball. Here is the original tar ball.
Untar it to generate the holman_project directory.
This includes various run.script versions.
From an Odyssey 2.0 headnode, obtain an interactive session with 192 (or 12 or 48) processes with this command:
srun -n 192 -p general -t 60 --x11=first --pty --mem-per-cpu=1750 bash
cd holman_project
To run on 12 processes: source run.script.12
To run on 48 processes: source run.script.48
To run on 192 processes: source run.script.192
Alternatively, the code is easy to run directly from the command line with something like:
mpirun -np 12 python project_nodes.py nside N
For the data sets available, nside = 1, 2, or 4 and N <= 10^7.
UPDATE: I modified the code so that an arbitrary number of processors can be used, rather than just 12, 48, or 192. Rank 0 now assigns the work as evenly as possible among the processes.
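One simple way to express that assignment (a sketch of the idea, not necessarily the exact scheme in the code) is to split the list of region indices into nearly equal chunks:

```python
import numpy as np

def assign_regions(nside, n_proc):
    """Split the 12*nside**2 region indices into n_proc nearly equal chunks."""
    n_regions = 12 * nside * nside
    # np.array_split produces chunks whose sizes differ by at most one.
    return np.array_split(np.arange(n_regions), n_proc)

# Example: 48 regions (nside = 2) over 5 processes -> chunks of 10, 10, 10, 9, 9.
for proc, regions in enumerate(assign_regions(2, 5)):
    print(proc, regions)
```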
The synthetic data were generated with
source generate.script.12
source generate.script.48
source generate.script.192
These scripts run code that uses multiple processors to simultaneously generate large binary files containing angular positions on the sky. Each file covers a particular HEALPix region.
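As a sketch of how points confined to a single HEALPix region can be generated (by rejection sampling against ang2pix; the function, file name, and binary layout here are illustrative, not the ones the generation scripts actually use):

```python
import numpy as np
import healpy as hp

def generate_region_points(nside, ipix, n):
    """Draw points uniformly on the sphere and keep those falling in region ipix."""
    theta_keep, phi_keep = [], []
    while len(theta_keep) < n:
        # Uniform points on the sphere: phi uniform, cos(theta) uniform.
        phi = np.random.uniform(0.0, 2.0 * np.pi, size=4 * n)
        theta = np.arccos(np.random.uniform(-1.0, 1.0, size=4 * n))
        keep = hp.ang2pix(nside, theta, phi) == ipix
        theta_keep.extend(theta[keep])
        phi_keep.extend(phi[keep])
    return np.array(theta_keep[:n]), np.array(phi_keep[:n])

# Write one region's points to a binary file (illustrative name and layout).
theta, phi = generate_region_points(nside=1, ipix=3, n=100000)
np.column_stack([theta, phi]).astype(np.float64).tofile("region_003.bin")
```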
I tested the correctness of the codes in a number of ways:
N points | N proc | nside | File read (sec) | Tree construction (sec) | Query generation (msec) | Query broadcast (msec) | Tree query & gather (msec) |
---|---|---|---|---|---|---|---|
1e7/region * 12 regions | 12 | 1 | 43.3 | 9.3 | 0.3 | 0.2 | 0.6 |
1e7/region * 12 regions | 6 | 1 | 72.5 | 20.3 | 1.3 | 0.1 | 0.8 |
1e7/region * 12 regions | 3 | 1 | 136.4 | 39.4 | 0.2 | 0.1 | 1.4 |
1e7/region * 12 regions | 1 | 1 | 384.6 | 118.4 | 0.3 | 0.1 | 3.8 |
1e7/region * 48 regions | 48 | 2 | 90.8 | 9.2 | 0.2 | 0.1 | 1.0 |
1e7/region * 48 regions | 24 | 2 | 85.4 | 15.6 | 0.2 | 0.1 | 1.0 |
1e7/region * 48 regions | 12 | 2 | 124.2 | 38.6 | 0.2 | 0.1 | 1.7 |
1e7/region * 48 regions | 6 | 2 | 259.7 | 80.5 | 0.3 | 0.2 | 5.7 |
1e7/region * 48 regions | 3 | 2 | 341.5 | 161.1 | 0.7 | 0.2 | 5.6 |
1e7/region * 192 regions | 192 | 4 | 80.2 | 19.9 | 0.3 | 0.5 | 5.0 |
1e7/region * 192 regions | 96 | 4 | 111.3 | 21.5 | 0.3 | 0.3 | 3.2 |
The table displays the time required to complete the different computational stages in a series of runs with the code. "N points" refers to the total number of points, which is 1e7 points per region times the number of regions. "N proc" is the number of processors used. When "N proc" is smaller than the number of regions, each processor manages the same number of regions. "nside" is the HEALPix resolution parameter; each of the 12 base regions is subdivided into nside*nside regions, so the number of regions = 12*nside*nside.
Note that these times are all for 6 query points and a search radius of 0.0005 radians. The times are the average of 10 trials.
The next columns show the time to read the files (sec), the time to construct the KD trees (sec), the time to generate the queries (msec), the time to broadcast those queries (msec), and the time to query the KD trees and gather the results (msec).
The file read time is simply proportional to the number of points an individual process must read.
Each sky region has its own KD tree. Thus, the time to generate the trees is proportional to the number of trees an individual process must generate.
The queries are generated by rank 0. This always takes roughly the same amount of time, with some scatter.
The time to broadcast the queries is small, but seems to grow a bit for large numbers of processes.
The time to query each tree is O(log N), where N is the number of points in the tree. Thus, each process requires O(M log N) time to query all of its M trees. The time to gather the results, although not separated out here, is related to the total amount of data being transmitted.
This approach to spatially querying large data sets appears to scale well. It is possible to keep billions of points in memory, distributed over a large number of processes, and quickly query the full data set. This is well beyond what can be done with a single CPU.
Reading the files and generating the KD trees takes a significant amount of time, but this is a one-time cost. The time to read in the files could be somewhat reduced by using a file format that supports parallel I/O. (HDF5 might be a good way to go.) The time to generate the KD trees was reduced by using a more efficient routine.
It appears that the time to evaluate a single query becomes dominated by the time to gather the data, when large numbers of processes are used. This argues for distributing the data over as few processes as per-CPU memory limitations allow. Also, it argues for exploring a Send-Recv approach that might reduce the communication time.
Finally, a little more effort is still needed to encapsulate some of the code into routines that hide nearly all of the internal behavior from the user and expose just a few basic methods.
Special thanks to the following people for helpful discussions: