efficient and scalable cross-matching of (very) large...

44
Efficient and scalable cross-matching of (very) large catalogues Fran¸cois-XavierPineau 1 , Thomas Boch 1 and Sebastien Derri` ere 1 1 CDS, Observatoire Astronomique de Strasbourg ADASS Boston, 08 November 2010 Fran¸cois-XavierPineau (CDS) X-Match of large catalogues 08/11/2010 1 / 16

Upload: others

Post on 01-May-2021

4 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Efficient and scalable cross-matching of (very) large cataloguesadass2010.cfa.harvard.edu/ADASS2010/incl/presentations/O...E cient and scalable cross-matching of (very) large catalogues

Efficient and scalable cross-matching of (very) largecatalogues

Francois-Xavier Pineau1, Thomas Boch1 and Sebastien Derriere1

1CDS, Observatoire Astronomique de Strasbourg

ADASS Boston, 08 November 2010

Francois-Xavier Pineau (CDS) X-Match of large catalogues 08/11/2010 1 / 16

Page 2: Efficient and scalable cross-matching of (very) large cataloguesadass2010.cfa.harvard.edu/ADASS2010/incl/presentations/O...E cient and scalable cross-matching of (very) large catalogues

Context

Context

CDS cross-match service (in development)

Based on UWS (job submission)

Catalogues :

Algorithms :

A B A B

Particularity : deal with (very) large catalogues

Francois-Xavier Pineau (CDS) X-Match of large catalogues 08/11/2010 2 / 16

Page 3: Efficient and scalable cross-matching of (very) large cataloguesadass2010.cfa.harvard.edu/ADASS2010/incl/presentations/O...E cient and scalable cross-matching of (very) large catalogues

Context

Context

CDS cross-match service (in development)

Based on UWS (job submission)

Catalogues :

Algorithms :

A B A B

Particularity : deal with (very) large catalogues

Francois-Xavier Pineau (CDS) X-Match of large catalogues 08/11/2010 2 / 16

Page 4: Efficient and scalable cross-matching of (very) large cataloguesadass2010.cfa.harvard.edu/ADASS2010/incl/presentations/O...E cient and scalable cross-matching of (very) large catalogues

Context

Context

CDS cross-match service (in development)

Based on UWS (job submission)

Catalogues :

Algorithms :

A B A B

Particularity : deal with (very) large catalogues

Francois-Xavier Pineau (CDS) X-Match of large catalogues 08/11/2010 2 / 16

Page 5: Efficient and scalable cross-matching of (very) large cataloguesadass2010.cfa.harvard.edu/ADASS2010/incl/presentations/O...E cient and scalable cross-matching of (very) large catalogues

Context

Context

CDS cross-match service (in development)

Based on UWS (job submission)

Catalogues :

Algorithms :

A B A B

Particularity : deal with (very) large catalogues

Francois-Xavier Pineau (CDS) X-Match of large catalogues 08/11/2010 2 / 16

Page 6: Efficient and scalable cross-matching of (very) large cataloguesadass2010.cfa.harvard.edu/ADASS2010/incl/presentations/O...E cient and scalable cross-matching of (very) large catalogues

Context

Dealing with (very) large catalogues

Example

2MASSI ∼ 470x106 sourcesI minimal data ∼ 15 GB

F identifier (integer 4 Bytes)F positions (double 8 B+8 B)F errors (float 4 B+4 B+4 B)

USNO-B1I ∼ 109 sourcesI minimal data ∼ 28 GB

F identifier (integer 4 B)F positions (double 8 B+8 B)F errors (float 4 B+4 B)

LSST projection at 5 years:I V>26, ∼ 3x109 unique sourcesI minimal data ∼ 96 GB

ProblemsData size

I do not fit into memory

Performance issuesI data loadingI looking for candidates

SolutionsScalability: Healpix partitioning

Efficiency:I special indexed binary fileI kd-tree (cone search queries)I multithreadingI parallel processing

Francois-Xavier Pineau (CDS) X-Match of large catalogues 08/11/2010 3 / 16

Page 7: Efficient and scalable cross-matching of (very) large cataloguesadass2010.cfa.harvard.edu/ADASS2010/incl/presentations/O...E cient and scalable cross-matching of (very) large catalogues

Context

Dealing with (very) large catalogues

Example

2MASSI ∼ 470x106 sourcesI minimal data ∼ 15 GB

F identifier (integer 4 Bytes)F positions (double 8 B+8 B)F errors (float 4 B+4 B+4 B)

USNO-B1I ∼ 109 sourcesI minimal data ∼ 28 GB

F identifier (integer 4 B)F positions (double 8 B+8 B)F errors (float 4 B+4 B)

LSST projection at 5 years:I V>26, ∼ 3x109 unique sourcesI minimal data ∼ 96 GB

ProblemsData size

I do not fit into memory

Performance issuesI data loadingI looking for candidates

SolutionsScalability: Healpix partitioning

Efficiency:I special indexed binary fileI kd-tree (cone search queries)I multithreadingI parallel processing

Francois-Xavier Pineau (CDS) X-Match of large catalogues 08/11/2010 3 / 16

Page 8: Efficient and scalable cross-matching of (very) large cataloguesadass2010.cfa.harvard.edu/ADASS2010/incl/presentations/O...E cient and scalable cross-matching of (very) large catalogues

Context

Dealing with (very) large catalogues

Example

2MASSI ∼ 470x106 sourcesI minimal data ∼ 15 GB

F identifier (integer 4 Bytes)F positions (double 8 B+8 B)F errors (float 4 B+4 B+4 B)

USNO-B1I ∼ 109 sourcesI minimal data ∼ 28 GB

F identifier (integer 4 B)F positions (double 8 B+8 B)F errors (float 4 B+4 B)

LSST projection at 5 years:I V>26, ∼ 3x109 unique sourcesI minimal data ∼ 96 GB

ProblemsData size

I do not fit into memory

Performance issuesI data loadingI looking for candidates

SolutionsScalability: Healpix partitioning

Efficiency:I special indexed binary fileI kd-tree (cone search queries)I multithreadingI parallel processing

Francois-Xavier Pineau (CDS) X-Match of large catalogues 08/11/2010 3 / 16

Page 9: Efficient and scalable cross-matching of (very) large cataloguesadass2010.cfa.harvard.edu/ADASS2010/incl/presentations/O...E cient and scalable cross-matching of (very) large catalogues

Scalability

Healpix

Hierarchical sky pixelisationI level 0 12 pixelsI level 1 12x4 pixelsI . . .I level n 12x22n

Pixels of equal area

Developed at NASA:healpix.jpl.nasa.gov

Available inI C, C++I FortranI IDLI JavaI . . . ?

Francois-Xavier Pineau (CDS) X-Match of large catalogues 08/11/2010 4 / 16

Page 10: Efficient and scalable cross-matching of (very) large cataloguesadass2010.cfa.harvard.edu/ADASS2010/incl/presentations/O...E cient and scalable cross-matching of (very) large catalogues

Scalability

Scalable cross-match

Independent pixels cross-matchI but border effects

Cat. B pixel sources put in a kd-tree

Optimal partitioning levelI available memoryI minimisation of:

nPixels∑i=0

NAi log(1 + NBi + NbBi

)

I I/O cost

Level 0

Level 1

A BFrancois-Xavier Pineau (CDS) X-Match of large catalogues 08/11/2010 5 / 16

Page 11: Efficient and scalable cross-matching of (very) large cataloguesadass2010.cfa.harvard.edu/ADASS2010/incl/presentations/O...E cient and scalable cross-matching of (very) large catalogues

Scalability

Scalable cross-match

Independent pixels cross-matchI but border effects

Cat. B pixel sources put in a kd-tree

Optimal partitioning levelI available memoryI minimisation of:

nPixels∑i=0

NAi log(1 + NBi + NbBi

)

I I/O cost

Level 0

Level 1

A BFrancois-Xavier Pineau (CDS) X-Match of large catalogues 08/11/2010 5 / 16

Page 12: Efficient and scalable cross-matching of (very) large cataloguesadass2010.cfa.harvard.edu/ADASS2010/incl/presentations/O...E cient and scalable cross-matching of (very) large catalogues

Scalability

Scalable cross-match

Single machine

All sky correlation (small catalogues)I allow “on the fly” correlation

Correlation pixel by pixel (large catalogues)

Computer grid

Parallel processing

Framework:I based on UWSa (few machines)I Hadoop (large grid)

“On the fly” correlation possible

aUniversal Worker Service (IVOA)

Francois-Xavier Pineau (CDS) X-Match of large catalogues 08/11/2010 6 / 16

Page 13: Efficient and scalable cross-matching of (very) large cataloguesadass2010.cfa.harvard.edu/ADASS2010/incl/presentations/O...E cient and scalable cross-matching of (very) large catalogues

Scalability

Scalable cross-match

Single machine

All sky correlation (small catalogues)I allow “on the fly” correlation

Correlation pixel by pixel (large catalogues)

Computer grid

Parallel processing

Framework:I based on UWSa (few machines)I Hadoop (large grid)

“On the fly” correlation possible

aUniversal Worker Service (IVOA)

Francois-Xavier Pineau (CDS) X-Match of large catalogues 08/11/2010 6 / 16

Page 14: Efficient and scalable cross-matching of (very) large cataloguesadass2010.cfa.harvard.edu/ADASS2010/incl/presentations/O...E cient and scalable cross-matching of (very) large catalogues

Scalability

Scalable cross-match

Single machine

All sky correlation (small catalogues)I allow “on the fly” correlation

Correlation pixel by pixel (large catalogues)

Computer grid

Parallel processing

Framework:I based on UWSa (few machines)I Hadoop (large grid)

“On the fly” correlation possible

aUniversal Worker Service (IVOA)

Francois-Xavier Pineau (CDS) X-Match of large catalogues 08/11/2010 6 / 16

Page 15: Efficient and scalable cross-matching of (very) large cataloguesadass2010.cfa.harvard.edu/ADASS2010/incl/presentations/O...E cient and scalable cross-matching of (very) large catalogues

Scalability

Scalable cross-match

Single machine

All sky correlation (small catalogues)I allow “on the fly” correlation

Correlation pixel by pixel (large catalogues)

Computer grid

Parallel processing

Framework:I based on UWSa (few machines)I Hadoop (large grid)

“On the fly” correlation possible

aUniversal Worker Service (IVOA)

Francois-Xavier Pineau (CDS) X-Match of large catalogues 08/11/2010 6 / 16

Page 16: Efficient and scalable cross-matching of (very) large cataloguesadass2010.cfa.harvard.edu/ADASS2010/incl/presentations/O...E cient and scalable cross-matching of (very) large catalogues

Scalability

Scalable cross-match

Single machine

All sky correlation (small catalogues)I allow “on the fly” correlation

Correlation pixel by pixel (large catalogues)

Computer grid

Parallel processing

Framework:I based on UWSa (few machines)I Hadoop (large grid)

“On the fly” correlation possible

aUniversal Worker Service (IVOA)

Master

Slave 1 Slave 2

Slave 3 Slave 4

Francois-Xavier Pineau (CDS) X-Match of large catalogues 08/11/2010 6 / 16

Page 17: Efficient and scalable cross-matching of (very) large cataloguesadass2010.cfa.harvard.edu/ADASS2010/incl/presentations/O...E cient and scalable cross-matching of (very) large catalogues

Efficiency

Loading data: indexed binary files

Index filesOne by healpix level

For each pixelI offsetI nSources

Binary data file

Organized by blocks:I positionsI position errorsI identifiersI ...

Sources ordered by healpix pixel index

level 0 index fileIdx offset nSrcs11 xxx xxx

Binary file

Block position

Block posErr

. . .

Francois-Xavier Pineau (CDS) X-Match of large catalogues 08/11/2010 7 / 16

Page 18: Efficient and scalable cross-matching of (very) large cataloguesadass2010.cfa.harvard.edu/ADASS2010/incl/presentations/O...E cient and scalable cross-matching of (very) large catalogues

Efficiency

Loading data: indexed binary files

Index filesOne by healpix level

For each pixelI offsetI nSources

Binary data file

Organized by blocks:I positionsI position errorsI identifiersI ...

Sources ordered by healpix pixel index

level 0 index fileIdx offset nSrcs0 xxx xxx

1 xxx xxx

.

.

....

.

.

.

11 xxx xxx

Binary file

Block position

Block posErr

. . .

Francois-Xavier Pineau (CDS) X-Match of large catalogues 08/11/2010 7 / 16

Page 19: Efficient and scalable cross-matching of (very) large cataloguesadass2010.cfa.harvard.edu/ADASS2010/incl/presentations/O...E cient and scalable cross-matching of (very) large catalogues

Efficiency

Loading data: indexed binary files

Index filesOne by healpix level

For each pixelI offsetI nSources

Binary data file

Organized by blocks:I positionsI position errorsI identifiersI ...

Sources ordered by healpix pixel index

level 2 index fileIdx offset nSrcs

.

.

....

.

.

.

84 xxx xxx

.

.

....

.

.

.

Binary file

Block position

Block posErr

. . .

Francois-Xavier Pineau (CDS) X-Match of large catalogues 08/11/2010 7 / 16

Page 20: Efficient and scalable cross-matching of (very) large cataloguesadass2010.cfa.harvard.edu/ADASS2010/incl/presentations/O...E cient and scalable cross-matching of (very) large catalogues

Efficiency

kd-tree

What is a kd-Tree?A space-partitioning data structure

Allows for fast k-nearest neighbour/cone search queriesI nearest neighbour query in O(log(n))

ProblemNaive implementation can be memory consuming

We want a memory efficient kd-tree (capacity > 1 billion sources)

Solution

To use a single array (sorted using a kd-tree scheme)

Francois-Xavier Pineau (CDS) X-Match of large catalogues 08/11/2010 8 / 16

Page 21: Efficient and scalable cross-matching of (very) large cataloguesadass2010.cfa.harvard.edu/ADASS2010/incl/presentations/O...E cient and scalable cross-matching of (very) large catalogues

Efficiency

kd-tree

What is a kd-Tree?A space-partitioning data structure

Allows for fast k-nearest neighbour/cone search queriesI nearest neighbour query in O(log(n))

ProblemNaive implementation can be memory consuming

We want a memory efficient kd-tree (capacity > 1 billion sources)

Solution

To use a single array (sorted using a kd-tree scheme)

Francois-Xavier Pineau (CDS) X-Match of large catalogues 08/11/2010 8 / 16

Page 22: Efficient and scalable cross-matching of (very) large cataloguesadass2010.cfa.harvard.edu/ADASS2010/incl/presentations/O...E cient and scalable cross-matching of (very) large catalogues

Efficiency

kd-tree

What is a kd-Tree?A space-partitioning data structure

Allows for fast k-nearest neighbour/cone search queriesI nearest neighbour query in O(log(n))

ProblemNaive implementation can be memory consuming

We want a memory efficient kd-tree (capacity > 1 billion sources)

Solution

To use a single array (sorted using a kd-tree scheme)

Francois-Xavier Pineau (CDS) X-Match of large catalogues 08/11/2010 8 / 16

Page 23: Efficient and scalable cross-matching of (very) large cataloguesadass2010.cfa.harvard.edu/ADASS2010/incl/presentations/O...E cient and scalable cross-matching of (very) large catalogues

Efficiency

A kd-tree can be a simple sorted array of sourcesAlgorithm: quicksort alternating the sorted coordinate

α

δS1 S2 S3 S4 S5 S6 S7 S8 S9 S10 S11 S12 S13 S14 S15

α

δα ≤ αS3

αS3

S3

αS3 ≤ α

α

δδ ≤ δS10 δS10

S10

δS10 ≤ δS3

δ ≤ δS8 δS8

S8

δS8 ≤ δ

α

δα ≤ αS11

S11

≤ αS10

α ≤ αS6

S6

≤ αS3

α ≤ αS4

S4

≤ αS8

α ≤ αS2

S2

≤ α

α

δS5 S11 S13 S10 S15 S6 S14 S3 S7 S4 S9 S8 S1 S2 S12

Creation speed up by using multi-threading

Francois-Xavier Pineau (CDS) X-Match of large catalogues 08/11/2010 9 / 16

Page 24: Efficient and scalable cross-matching of (very) large cataloguesadass2010.cfa.harvard.edu/ADASS2010/incl/presentations/O...E cient and scalable cross-matching of (very) large catalogues

Efficiency

A kd-tree can be a simple sorted array of sourcesAlgorithm: quicksort alternating the sorted coordinate

α

δS1 S2 S3 S4 S5 S6 S7 S8 S9 S10 S11 S12 S13 S14 S15

α

δα ≤ αS3

αS3

S3

αS3 ≤ α

α

δδ ≤ δS10 δS10

S10

δS10 ≤ δS3

δ ≤ δS8 δS8

S8

δS8 ≤ δ

α

δα ≤ αS11

S11

≤ αS10

α ≤ αS6

S6

≤ αS3

α ≤ αS4

S4

≤ αS8

α ≤ αS2

S2

≤ α

α

δS5 S11 S13 S10 S15 S6 S14 S3 S7 S4 S9 S8 S1 S2 S12

Creation speed up by using multi-threading

Francois-Xavier Pineau (CDS) X-Match of large catalogues 08/11/2010 9 / 16

Page 25: Efficient and scalable cross-matching of (very) large cataloguesadass2010.cfa.harvard.edu/ADASS2010/incl/presentations/O...E cient and scalable cross-matching of (very) large catalogues

Efficiency

A kd-tree can be a simple sorted array of sourcesAlgorithm: quicksort alternating the sorted coordinate

α

δS1 S2 S3 S4 S5 S6 S7 S8 S9 S10 S11 S12 S13 S14 S15

α

δα ≤ αS3

αS3

S3

αS3 ≤ α

α

δδ ≤ δS10 δS10

S10

δS10 ≤ δS3

δ ≤ δS8 δS8

S8

δS8 ≤ δ

α

δα ≤ αS11

S11

≤ αS10

α ≤ αS6

S6

≤ αS3

α ≤ αS4

S4

≤ αS8

α ≤ αS2

S2

≤ α

α

δS5 S11 S13 S10 S15 S6 S14 S3 S7 S4 S9 S8 S1 S2 S12

Creation speed up by using multi-threading

Francois-Xavier Pineau (CDS) X-Match of large catalogues 08/11/2010 9 / 16

Page 26: Efficient and scalable cross-matching of (very) large cataloguesadass2010.cfa.harvard.edu/ADASS2010/incl/presentations/O...E cient and scalable cross-matching of (very) large catalogues

Efficiency

A kd-tree can be a simple sorted array of sourcesAlgorithm: quicksort alternating the sorted coordinate

α

δS1 S2 S3 S4 S5 S6 S7 S8 S9 S10 S11 S12 S13 S14 S15

α

δα ≤ αS3

αS3

S3

αS3 ≤ α

α

δδ ≤ δS10 δS10

S10

δS10 ≤ δS3

δ ≤ δS8 δS8

S8

δS8 ≤ δ

α

δα ≤ αS11

S11

≤ αS10

α ≤ αS6

S6

≤ αS3

α ≤ αS4

S4

≤ αS8

α ≤ αS2

S2

≤ α

α

δS5 S11 S13 S10 S15 S6 S14 S3 S7 S4 S9 S8 S1 S2 S12

Creation speed up by using multi-threading

Francois-Xavier Pineau (CDS) X-Match of large catalogues 08/11/2010 9 / 16

Page 27: Efficient and scalable cross-matching of (very) large cataloguesadass2010.cfa.harvard.edu/ADASS2010/incl/presentations/O...E cient and scalable cross-matching of (very) large catalogues

Efficiency

A kd-tree can be a simple sorted array of sourcesAlgorithm: quicksort alternating the sorted coordinate

α

δS1 S2 S3 S4 S5 S6 S7 S8 S9 S10 S11 S12 S13 S14 S15

α

δα ≤ αS3

αS3

S3

αS3 ≤ α

α

δδ ≤ δS10 δS10

S10

δS10 ≤ δS3

δ ≤ δS8 δS8

S8

δS8 ≤ δ

α

δα ≤ αS11

S11

≤ αS10

α ≤ αS6

S6

≤ αS3

α ≤ αS4

S4

≤ αS8

α ≤ αS2

S2

≤ α

α

δS5 S11 S13 S10 S15 S6 S14 S3 S7 S4 S9 S8 S1 S2 S12

Creation speed up by using multi-threading

Francois-Xavier Pineau (CDS) X-Match of large catalogues 08/11/2010 9 / 16

Page 28: Efficient and scalable cross-matching of (very) large cataloguesadass2010.cfa.harvard.edu/ADASS2010/incl/presentations/O...E cient and scalable cross-matching of (very) large catalogues

Efficiency

A kd-tree can be a simple sorted array of sourcesAlgorithm: quicksort alternating the sorted coordinate

α

δS1 S2 S3 S4 S5 S6 S7 S8 S9 S10 S11 S12 S13 S14 S15

α

δα ≤ αS3

αS3

S3

αS3 ≤ α

α

δδ ≤ δS10 δS10

S10

δS10 ≤ δS3

δ ≤ δS8 δS8

S8

δS8 ≤ δ

α

δα ≤ αS11

S11

≤ αS10

α ≤ αS6

S6

≤ αS3

α ≤ αS4

S4

≤ αS8

α ≤ αS2

S2

≤ α

α

δS5 S11 S13 S10 S15 S6 S14 S3 S7 S4 S9 S8 S1 S2 S12

Thread 1

Creation speed up by using multi-threading

Francois-Xavier Pineau (CDS) X-Match of large catalogues 08/11/2010 9 / 16

Page 29: Efficient and scalable cross-matching of (very) large cataloguesadass2010.cfa.harvard.edu/ADASS2010/incl/presentations/O...E cient and scalable cross-matching of (very) large catalogues

Efficiency

A kd-tree can be a simple sorted array of sourcesAlgorithm: quicksort alternating the sorted coordinate

α

δS1 S2 S3 S4 S5 S6 S7 S8 S9 S10 S11 S12 S13 S14 S15

α

δα ≤ αS3

αS3

S3

αS3 ≤ α

α

δδ ≤ δS10 δS10

S10

δS10 ≤ δS3

δ ≤ δS8 δS8

S8

δS8 ≤ δ

α

δα ≤ αS11

S11

≤ αS10

α ≤ αS6

S6

≤ αS3

α ≤ αS4

S4

≤ αS8

α ≤ αS2

S2

≤ α

α

δS5 S11 S13 S10 S15 S6 S14 S3 S7 S4 S9 S8 S1 S2 S12

Thread 1

Thread 2

Creation speed up by using multi-threading

Francois-Xavier Pineau (CDS) X-Match of large catalogues 08/11/2010 9 / 16

Page 30: Efficient and scalable cross-matching of (very) large cataloguesadass2010.cfa.harvard.edu/ADASS2010/incl/presentations/O...E cient and scalable cross-matching of (very) large catalogues

Efficiency

A kd-tree can be a simple sorted array of sourcesAlgorithm: quicksort alternating the sorted coordinate

α

δS1 S2 S3 S4 S5 S6 S7 S8 S9 S10 S11 S12 S13 S14 S15

α

δα ≤ αS3

αS3

S3

αS3 ≤ α

α

δδ ≤ δS10 δS10

S10

δS10 ≤ δS3

δ ≤ δS8 δS8

S8

δS8 ≤ δ

α

δα ≤ αS11

S11

≤ αS10

α ≤ αS6

S6

≤ αS3

α ≤ αS4

S4

≤ αS8

α ≤ αS2

S2

≤ α

α

δS5 S11 S13 S10 S15 S6 S14 S3 S7 S4 S9 S8 S1 S2 S12

Thread 1

Thread 2

Thread 3 Thread 4

Creation speed up by using multi-threading

Francois-Xavier Pineau (CDS) X-Match of large catalogues 08/11/2010 9 / 16

Page 31: Efficient and scalable cross-matching of (very) large cataloguesadass2010.cfa.harvard.edu/ADASS2010/incl/presentations/O...E cient and scalable cross-matching of (very) large catalogues

Efficiency

Modified kd-tree and multithreading

Modified kd-treeClassical kd-tree adapted for euclidian spaces

Solution 1: (rejected)I cartesian coordinates (x , y , z)

F time consuming (conversion)F memory consuming (+50%)

Solution 2: (approved)I spherical coordinates (α, δ)I classical creation algorithmI modified query algorithm

F angular distances (Haversine formula)F modified circle/rectangle intersection to enter a sub-tree

Multithreading

Single kNN or cone search query not multithread

Pool of threads executing multiple queries simultaneously

Francois-Xavier Pineau (CDS) X-Match of large catalogues 08/11/2010 10 / 16

Page 32: Efficient and scalable cross-matching of (very) large cataloguesadass2010.cfa.harvard.edu/ADASS2010/incl/presentations/O...E cient and scalable cross-matching of (very) large catalogues

Efficiency

Modified kd-tree and multithreading

Modified kd-treeClassical kd-tree adapted for euclidian spaces

Solution 1: (rejected)I cartesian coordinates (x , y , z)

F time consuming (conversion)F memory consuming (+50%)

Solution 2: (approved)I spherical coordinates (α, δ)I classical creation algorithmI modified query algorithm

F angular distances (Haversine formula)F modified circle/rectangle intersection to enter a sub-tree

Multithreading

Single kNN or cone search query not multithread

Pool of threads executing multiple queries simultaneously

Francois-Xavier Pineau (CDS) X-Match of large catalogues 08/11/2010 10 / 16

Page 33: Efficient and scalable cross-matching of (very) large cataloguesadass2010.cfa.harvard.edu/ADASS2010/incl/presentations/O...E cient and scalable cross-matching of (very) large catalogues

Efficiency

Modified kd-tree and multithreading

Modified kd-treeClassical kd-tree adapted for euclidian spaces

Solution 1: (rejected)I cartesian coordinates (x , y , z)

F time consuming (conversion)F memory consuming (+50%)

Solution 2: (approved)I spherical coordinates (α, δ)I classical creation algorithmI modified query algorithm

F angular distances (Haversine formula)F modified circle/rectangle intersection to enter a sub-tree

Multithreading

Single kNN or cone search query not multithread

Pool of threads executing multiple queries simultaneously

Francois-Xavier Pineau (CDS) X-Match of large catalogues 08/11/2010 10 / 16

Page 34: Efficient and scalable cross-matching of (very) large cataloguesadass2010.cfa.harvard.edu/ADASS2010/incl/presentations/O...E cient and scalable cross-matching of (very) large catalogues

Tests

Test Machine

Dell machine 2 600d(∼$3 600):I 24 GB of 1333 MHz memoryI 2x Quad Core 2.27 GHz (Xeon)I 16 threads (Hyper-Threading)I High speed HDD (10 000 rpm)

Francois-Xavier Pineau (CDS) X-Match of large catalogues 08/11/2010 11 / 16

Page 35: Efficient and scalable cross-matching of (very) large cataloguesadass2010.cfa.harvard.edu/ADASS2010/incl/presentations/O...E cient and scalable cross-matching of (very) large catalogues

Tests

Test results

Full catalogue cross-correlation

SDSS DR7 (∼357 000 000 sources) 2MASS (∼470 000 000 sources)

Simple cross-match: ∼9 minI radius of 5′′

I Healpix level 3 (∼7.3◦)I Level 9 borders (∼7’)I ∼49 209 000 associations

With elliptical errors: ∼10 minI distance of 3.44σI distance max of 5′′

I Healpix level 3I ∼37 507 000 associations

Francois-Xavier Pineau (CDS) X-Match of large catalogues 08/11/2010 12 / 16

Page 36: Efficient and scalable cross-matching of (very) large cataloguesadass2010.cfa.harvard.edu/ADASS2010/incl/presentations/O...E cient and scalable cross-matching of (very) large catalogues

Tests

Test results

Full catalogue cross-correlation

SDSS DR7 (∼357 000 000 sources) 2MASS (∼470 000 000 sources)

Simple cross-match: ∼9 minI radius of 5′′

I Healpix level 3 (∼7.3◦)I Level 9 borders (∼7’)I ∼49 209 000 associations

With elliptical errors: ∼10 minI distance of 3.44σI distance max of 5′′

I Healpix level 3I ∼37 507 000 associations

Francois-Xavier Pineau (CDS) X-Match of large catalogues 08/11/2010 12 / 16

Page 37: Efficient and scalable cross-matching of (very) large cataloguesadass2010.cfa.harvard.edu/ADASS2010/incl/presentations/O...E cient and scalable cross-matching of (very) large catalogues

Tests

Test results

Full catalogue cross-correlation

SDSS DR7 (∼357 000 000 sources) 2MASS (∼470 000 000 sources)

Simple cross-match: ∼9 minI radius of 5′′

I Healpix level 3 (∼7.3◦)I Level 9 borders (∼7’)I ∼49 209 000 associations

With elliptical errors: ∼10 minI distance of 3.44σI distance max of 5′′

I Healpix level 3I ∼37 507 000 associations

Francois-Xavier Pineau (CDS) X-Match of large catalogues 08/11/2010 12 / 16

Page 38: Efficient and scalable cross-matching of (very) large cataloguesadass2010.cfa.harvard.edu/ADASS2010/incl/presentations/O...E cient and scalable cross-matching of (very) large catalogues

Lessons

Lessons learned

HardwareFor our application:

RAM frequency does matter (lots of memory access)

Hyper-Threading does matter (on 8 cores, 16 threads ∼2x faster than 8threads)

Software: don’t have a priori

Efficient full Java code

Efficient modifed kd-trees (in our case)

Service

Existing and future (very) large catalogues can be processed

Bottleneck is data transfer (without surprise)I service colocated with data

Francois-Xavier Pineau (CDS) X-Match of large catalogues 08/11/2010 13 / 16

Page 39: Efficient and scalable cross-matching of (very) large cataloguesadass2010.cfa.harvard.edu/ADASS2010/incl/presentations/O...E cient and scalable cross-matching of (very) large catalogues

Annexes

Test results

Full all-sky catalogues cross-correlation

2MASS (∼470 000 000 sources) USNO-B1 (∼1 046 000 000 sources)

Simple cross-match: ∼30 minI radius of 5′′

I Healpix level 3I Level 9 bordersI ∼583 300 000 associations

Francois-Xavier Pineau (CDS) X-Match of large catalogues 08/11/2010 14 / 16

Page 40: Efficient and scalable cross-matching of (very) large cataloguesadass2010.cfa.harvard.edu/ADASS2010/incl/presentations/O...E cient and scalable cross-matching of (very) large catalogues

Annexes

Basic likelihood ratio (LR)Ratio between:

Rayleigh distribution

LR =

r exp (−1

2r2)

2λr=

exp (− 12 r

2)

2 λ

Poissonian distribution

Depends on:

r = normalized distance in σ

λ ∝ local density of sources

Test resultsSDSS7 x 2MASS correlations + LRs

Local densities estimated by kNN averaging (k=100)

15min

Francois-Xavier Pineau (CDS) X-Match of large catalogues 08/11/2010 15 / 16

Page 41: Efficient and scalable cross-matching of (very) large cataloguesadass2010.cfa.harvard.edu/ADASS2010/incl/presentations/O...E cient and scalable cross-matching of (very) large catalogues

Annexes

Basic likelihood ratio (LR)Ratio between:

Rayleigh distribution

LR =

r exp (−1

2r2)

2λr=

exp (− 12 r

2)

2 λ

Poissonian distribution

Depends on:

r = normalized distance in σ

λ ∝ local density of sources

Test resultsSDSS7 x 2MASS correlations + LRs

Local densities estimated by kNN averaging (k=100)

15min

Francois-Xavier Pineau (CDS) X-Match of large catalogues 08/11/2010 15 / 16

Page 42: Efficient and scalable cross-matching of (very) large cataloguesadass2010.cfa.harvard.edu/ADASS2010/incl/presentations/O...E cient and scalable cross-matching of (very) large catalogues

Annexes

Basic likelihood ratio (LR)Ratio between:

Rayleigh distribution

LR =

r exp (−1

2r2)

2λr=

exp (− 12 r

2)

2 λ

Poissonian distribution

Depends on:

r = normalized distance in σ

λ ∝ local density of sources

Test resultsSDSS7 x 2MASS correlations + LRs

Local densities estimated by kNN averaging (k=100)

15min

Francois-Xavier Pineau (CDS) X-Match of large catalogues 08/11/2010 15 / 16

Page 43: Efficient and scalable cross-matching of (very) large cataloguesadass2010.cfa.harvard.edu/ADASS2010/incl/presentations/O...E cient and scalable cross-matching of (very) large catalogues

Annexes

Going further...

Magnitude-dependent LRs (fast solution)

ρ(m ±∆m) = ρ p(m ±∆m)

kNN averaging

log N-log S law

SDSS7, level 6 (∼ 1◦), 15 187 non empty histogramscomputed in 30s.

Probability of identifications (fast solution)

Number of spurious match estimatesI Positional errors sampling for both cataloguesI Nspur =

∑A

∑B

Sconv/Spixel

SDSS7 x 2MASS, level 6, 8min (not yetmultithreaded!)

Francois-Xavier Pineau (CDS) X-Match of large catalogues 08/11/2010 16 / 16

Page 44: Efficient and scalable cross-matching of (very) large cataloguesadass2010.cfa.harvard.edu/ADASS2010/incl/presentations/O...E cient and scalable cross-matching of (very) large catalogues

Annexes

Going further...

Magnitude-dependent LRs (fast solution)

ρ(m ±∆m) = ρ p(m ±∆m)

kNN averaging

log N-log S law

SDSS7, level 6 (∼ 1◦), 15 187 non empty histogramscomputed in 30s.

Probability of identifications (fast solution)

Number of spurious match estimatesI Positional errors sampling for both cataloguesI Nspur =

∑A

∑B

Sconv/Spixel

SDSS7 x 2MASS, level 6, 8min (not yetmultithreaded!)

Francois-Xavier Pineau (CDS) X-Match of large catalogues 08/11/2010 16 / 16