
Clive Page

University of Leicester

Meeting at ROE - 2005 January 25

(1) Cross-matching Catalogues

(2) Column-based storage for data exploring

Data Federation

• In order to extract full scientific value from our datasets we have to combine information from different wavebands.

• Data Federation is one of the principal aims of the VO projects – but facilities to implement it seem to be very slow in arriving.

• Federation can be done from images, but images are made in quite different ways in radio, optical, and X-ray bands, so combining them is difficult.

• Usually images are searched for sources, and these searches produce source catalogues.

Cross-matching Catalogues

• Catalogues are large tables, so they are normally stored in a relational database management system.

• The principal matching criterion is approximate spatial coincidence, so the main part of the cross-matching process is a spatial join.

• Most DBMS support spatial joins with some form of spatial index:
  – DB2, Informix, MySQL, Postgres all have spatial indexing built-in (mostly using R-trees).
  – Oracle, Sybase have optional spatial data modules (R-trees).
  – Microsoft SQL Server has no support for spatial data.

• If we want a generic solution for the VO it would be better to use a spatial join which only needs B-tree indexing.

Spatial Join Algorithms

• R-trees (and similar spatial indices):
  – Easy to use, quite efficient, but syntax not standardised.

• Sort/sweep algorithm of Dave Abel et al. (CSIRO):
  – Cannot be done inside a DBMS: the data must be extracted in binary form.
  – Only efficient for large, comparably sized tables.

• Declination-zone method of Jim Gray et al.:
  – Seems to be efficient only on SQL Server.

• Pixel-code method: cover the sky with a grid of pixels.
  – E.g. use HEALPix, HTM, or a simple igloo grid.
  – Works quite well on all DBMS used so far.
  – Complicated: we are joining regions (e.g. circles), not points, and regions often cross pixel boundaries.
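As an illustration only, the pixel-code step might be sketched in Python with a simple declination-zone "igloo" grid; the zone height, the code-packing scheme, and the function name are assumptions of this sketch, and a production system would more likely use HEALPix or HTM:

```python
import math

def igloo_pixel(ra_deg, dec_deg, zone_height_deg=0.1):
    """Map a sky position to an integer pixel code on a simple
    'igloo' grid: fixed-height declination zones, each split into
    RA cells of roughly equal angular width on the sky."""
    zone = int((dec_deg + 90.0) / zone_height_deg)
    # Use fewer RA cells near the poles so cells stay roughly square.
    mid_dec = -90.0 + (zone + 0.5) * zone_height_deg
    n_cells = max(1, int(round(360.0 * math.cos(math.radians(mid_dec))
                               / zone_height_deg)))
    cell = int(ra_deg % 360.0 / (360.0 / n_cells))
    # Pack zone and cell into one integer; with 0.1-degree zones
    # there are at most 3600 cells per zone, so 10**4 is safe.
    return zone * 10**4 + cell
```

Note the complication mentioned above: an error region that crosses a pixel boundary has to be entered under every pixel code it touches, not just the one returned for its centre.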

Standard Catalogue - TableS

    TableS
    ------
    uid
    ra
    dec
    radius
    rmag
    bmag
    Etc.

Create Pixel-code table - PixelS

    TableS    PixelS
    ------    ------
    uid       uid
    ra        pixel
    dec
    radius
    rmag
    bmag
    Etc.

Ingest User’s own table - TableU

    TableS    PixelS    TableU
    ------    ------    ------
    uid       uid       uid
    ra        pixel     ra
    dec                 dec
    radius              radius
    rmag                pmra
    bmag                pmdec
    Etc.                Etc.

Generate its pixel-table - PixelU

    TableS    PixelS    PixelU    TableU
    ------    ------    ------    ------
    uid       uid       uid       uid
    ra        pixel     pixel     ra
    dec                           dec
    radius                        radius
    rmag                          pmra
    bmag                          pmdec
    Etc.                          Etc.

Create B-tree Indices (* marks an indexed column)

    TableS    PixelS    PixelU    TableU
    ------    ------    ------    ------
    uid *     uid       uid       uid *
    ra        pixel *   pixel     ra
    dec                           dec
    radius                        radius
    rmag                          pmra
    bmag                          pmdec
    Etc.                          Etc.

Join the four tables: loop over the rows of PixelU, using the indices (*) on the other three.

    TableS    PixelS    PixelU    TableU
    ------    ------    ------    ------
    uid *     uid       uid       uid *
    ra        pixel *   pixel     ra
    dec                           dec
    radius                        radius
    rmag                          pmra
    bmag                          pmdec
    Etc.                          Etc.
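In roughly standard SQL the four-table join described above might look like the following; the sketch runs it in SQLite via Python for concreteness. The table and column names follow the slides, but the inserted rows and pixel values are invented purely for illustration:

```python
import sqlite3

# Build the four tables from the slides in an in-memory SQLite DB.
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE TableS (uid INTEGER PRIMARY KEY, ra REAL,"
            " dec REAL, radius REAL, rmag REAL, bmag REAL)")
cur.execute("CREATE TABLE PixelS (uid INTEGER, pixel INTEGER)")
cur.execute("CREATE TABLE TableU (uid INTEGER PRIMARY KEY, ra REAL,"
            " dec REAL, radius REAL)")
cur.execute("CREATE TABLE PixelU (uid INTEGER, pixel INTEGER)")
# B-tree index on PixelS.pixel; the uid primary keys are indexed already.
cur.execute("CREATE INDEX ix_pixels_pixel ON PixelS (pixel)")

# Invented rows: standard source 1 shares pixel 42 with user source 7.
cur.execute("INSERT INTO TableS VALUES (1, 10.0, 20.0, 0.001, 15.2, 16.1)")
cur.execute("INSERT INTO TableS VALUES (2, 50.0, -30.0, 0.001, 12.0, 12.5)")
cur.executemany("INSERT INTO PixelS VALUES (?, ?)", [(1, 42), (2, 99)])
cur.execute("INSERT INTO TableU VALUES (7, 10.0005, 20.0005, 0.001)")
cur.execute("INSERT INTO PixelU VALUES (7, 42)")

# Loop over PixelU (the driving table) and use the indices on the
# other three; pixel equality only yields candidate pairs, which a
# real system would then filter with an exact angular-distance test.
rows = cur.execute("""
    SELECT u.uid, s.uid
      FROM PixelU pu
      JOIN PixelS ps ON ps.pixel = pu.pixel
      JOIN TableU u  ON u.uid = pu.uid
      JOIN TableS s  ON s.uid = ps.uid
""").fetchall()
print(rows)
```

Only B-tree indices are involved, which is what makes the method portable across DBMS.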

Pixel-code method - advantages

• Seems to be efficient enough when a small or medium-sized table is joined to a very large one (more benchmarks needed).

• Can be done with practically standard SQL, so should be usable on all DBMS.

Joining Catalogues: four cases to consider

1. Cone-search: find matches to single source/position.

2. User uploads own source-list to match with std catalogues

3. User selects subset from some standard catalogue, wants to cross-match with other large catalogues.

4. Cross-match of whole of one large catalogue with another.

Joining Catalogues: practical cases

1. Cone-search: find matches to single source/position.

• Simple case, already supported by several services

2. User uploads own source-list to match with standard cats

3. User selects subset from some standard catalogue, wants to cross-match with other large catalogues.

4. Cross-match of whole of one large catalogue with another.

• Substantial project – use sort/sweep join outside DBMS?
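Case 1 above reduces to a great-circle distance test against a single position. A minimal Python sketch (the function names and the list-of-tuples catalogue layout are assumptions, not any existing service's API):

```python
import math

def angular_sep_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees via the haversine formula,
    which stays accurate for the small angles typical of matching."""
    r1, d1, r2, d2 = map(math.radians, (ra1, dec1, ra2, dec2))
    a = (math.sin((d2 - d1) / 2.0) ** 2
         + math.cos(d1) * math.cos(d2) * math.sin((r2 - r1) / 2.0) ** 2)
    return math.degrees(2.0 * math.asin(math.sqrt(a)))

def cone_search(catalogue, ra0, dec0, radius_deg):
    """Return the rows of catalogue (a list of (uid, ra, dec) tuples)
    lying within radius_deg of the search position."""
    return [row for row in catalogue
            if angular_sep_deg(row[1], row[2], ra0, dec0) <= radius_deg]
```

In a real service the linear scan would of course be replaced by an indexed lookup; only the final distance test is shown here.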

Joining user’s Catalogue with Standard Catalogue

• Practical problems – especially on tables uploaded by the user:

• Support range of physical units and layouts

– Radians, degrees, HMS/DMS, etc.

• Various equinoxes, e.g. J2000, B1950 etc.

• Error may be fixed for the whole table, or specified row by row.

• Error regions may be circles, ellipses, or other shapes

• Error values may be specified as 1-sigma, 90%, etc.
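As one concrete example of the unit handling above, converting sexagesimal coordinates to decimal degrees is straightforward; this sketch assumes colon-separated HMS/DMS strings, whereas a real upload service would have to accept several layouts:

```python
def hms_to_deg(hms):
    """Convert a colon-separated 'HH:MM:SS.s' right ascension to
    decimal degrees (15 degrees per hour of RA)."""
    h, m, s = (float(x) for x in hms.split(":"))
    return 15.0 * (h + m / 60.0 + s / 3600.0)

def dms_to_deg(dms):
    """Convert a colon-separated '+DD:MM:SS.s' declination to decimal
    degrees, keeping the sign even for declinations like -00:30:00."""
    sign = -1.0 if dms.lstrip().startswith("-") else 1.0
    d, m, s = (abs(float(x)) for x in dms.lstrip("+- ").split(":"))
    return sign * (d + m / 60.0 + s / 3600.0)
```

Equinox conversion (B1950 to J2000) and the error-shape handling are considerably more work than this.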

Software Development Proposed

• A little more development of pixel-code method, to benchmark and check efficiency with existing DBMS on a wider range of datasets.

• Web application to allow users to upload their own tables

– Convert them to a DBMS-compatible form

– Ingest to DBMS

– Create necessary pixel-code table, indices.

• Needs a way of allocating disc-space for users on the server, e.g. using MySpace (AstroGrid) or MyDB (NVO).

Cross-matching system - Conclusions

• Fairly easy to develop the required software to handle users’ tables and cross-match them with standard catalogues.

• Would provide new facilities that astronomers currently lack – and which might actually get used.

• A basic system could be set up in a matter of weeks (assuming the required MySpace/MyDB facilities were available).

And now for something completely different…

Column-based data storage for data exploration/mining

Data Exploration and Data Mining

• Operations very often performed on tabular datasets.

• Source catalogues, especially, are increasing in size

– all-sky surveys, large format CCDs, higher time resolutions.

– SDSS DR3 has about 1 Terabyte of tabular data.

– Several catalogues now ~10⁹ rows in length.


• Width of tables also increasing:

    Table                   Number of Columns
    USNO-B catalog          50
    2MASS point sources     61
    1XMM source catalogue   379
    SDSS DR3 PhotObjAll     446

Storage of Tables

• Tables are normally stored in relational DBMS

• There are two obvious ways to store tabular data

– Row-oriented, i.e. all fields in a row stored contiguously

– Column-oriented, i.e. all values in a column stored as a vector in contiguous blocks on disc.

• RDBMS are designed and optimised for transactions (insert/delete/update) so they almost always use row-oriented storage.
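The layout difference can be illustrated with the Python standard library alone: store each column as its own contiguous binary file, so that a scan needing one column touches one file (the ".col" file naming is an assumption of this sketch):

```python
import array
import os
import tempfile

# Write a tiny two-column table in column-oriented form: each
# column lives in its own contiguous binary file on disc.
tmpdir = tempfile.mkdtemp()
columns = {
    "bmag": array.array("d", [15.2, 16.8, 14.1, 17.3]),
    "vmag": array.array("d", [14.9, 16.1, 13.8, 16.5]),
}
for name, col in columns.items():
    with open(os.path.join(tmpdir, name + ".col"), "wb") as f:
        col.tofile(f)

# A scan that needs only bmag reads one file; a row store would
# have to read every field of every row off disc.
scanned = array.array("d")
with open(os.path.join(tmpdir, "bmag.col"), "rb") as f:
    scanned.fromfile(f, len(columns["bmag"]))
bright = [i for i, b in enumerate(scanned) if b < 15.0]
print(bright)
```

With a 400-odd-column table, the saving from reading one column file instead of whole rows is what the rest of this talk exploits.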

Can we keep up with Moore’s Law?

• Astronomical data volume, processing power, disc storage – all doubling about every 2 years.

• I/O bandwidth and seek times are improving much more slowly (~10% per year)

• Increasingly we want to manipulate datasets which are too large to be held in memory (even though memory sizes are also increasing) – so disc speeds are very important.

Conclusion: I/O is becoming a more serious bottleneck.

Query Types

• Index-accelerated queries:
  – Selection on a value (or range of values) in an indexed column.

• Sequential scans necessary when:
  – Creating an index.
  – Populating new columns, e.g. using UPDATE.
  – Selecting on an expression, e.g.
      SELECT * FROM table WHERE (bmag-vmag) > 0.5
  – Computing statistics.
  – Joining one table with another (must scan at least one table; can use an index on the other).
  – Selecting where index selectivity is too small (a random seek takes ~10 ms; ~500 disc blocks can be scanned in that time).
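The selectivity point can be turned into a rough break-even calculation from the figures quoted above (~10 ms per random seek, ~500 blocks scanned per seek-time); the cost model and the default parameters below are assumptions made for illustration only:

```python
def prefer_index(n_rows, selectivity, rows_per_block=100,
                 seek_ms=10.0, scan_blocks_per_seek=500):
    """Crude cost model: an index lookup costs one random seek per
    matching row, while a sequential scan reads every block of the
    table.  Return True when the index is expected to be cheaper."""
    index_cost_ms = n_rows * selectivity * seek_ms
    scan_cost_ms = (n_rows / rows_per_block) / scan_blocks_per_seek * seek_ms
    return index_cost_ms < scan_cost_ms
```

Under these assumptions, on a 10⁸-row table an index pays off at one-in-a-million selectivity but loses badly at one-in-a-thousand, which is the point the slide is making.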

Query Timing

• Index-accelerated queries:
  – Can return results in as little as 20 ms: the time for a couple of disc seeks.

• Sequential scans of a table:
  – Time depends (obviously) on table length.
  – For tables of a billion rows, any scan operation takes at least 30 minutes.
  – Highly desirable to speed these up so they can be interactive rather than batch operations.

• For queries which scan large tables but only access a few columns out of many, row-based storage is very inefficient; column-oriented storage would be much faster.
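The last point is simple arithmetic: for a scan touching only a few columns, the bytes read differ by roughly the ratio of total to used columns. A sketch using the SDSS DR3 column count quoted earlier (the 3 columns used and the uniform 8-byte fields are assumptions):

```python
def scan_bytes(n_rows, n_cols_total, n_cols_used, bytes_per_field=8):
    """Bytes read off disc by a full sequential scan, for row-oriented
    versus column-oriented storage of the same table."""
    row_store = n_rows * n_cols_total * bytes_per_field
    col_store = n_rows * n_cols_used * bytes_per_field
    return row_store, col_store

# A billion-row table with 446 columns (SDSS DR3 PhotObjAll),
# scanned by a query that touches just 3 of them.
row_b, col_b = scan_bytes(10**9, 446, 3)
print(row_b // col_b)
```

The column store reads roughly 150 times less data here, consistent in spirit with the 13-50x speed-ups reported later for the HDF5 prototype.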

Systems using column-oriented tables

• Sybase-IQ: the best-known commercial product, from the makers of the Sybase-ASE relational database. Fast, but:

– No Linux version yet

– No spatial indexing

– Very expensive licence costs

• ESO/MIDAS table system

• STSDAS table system (part of IRAF)

• FITS files can (just) store data in column-based form, but such files are too peculiar for most FITS utilities to cope with.

Using HDF5 for column-oriented storage

• Hierarchical Data Format 5:
  – Comes from NCSA; well-supported library and tools.
  – Compatible with Globus, GridFTP.
  – Designed for efficient access to big files (no 2 GB limit).
  – Simple API (bindings to C, Fortran 90, Java, Python).
  – Can attach unlimited metadata to tables and columns.

• A basic prototype was constructed and benchmarked:
  – Used the query parser and evaluator originally written for Starlink’s CURSA around 1992.
  – About a 10% sample of the 2MASS catalogue was put into HDF5 column-based format.

• Queries were found to be 13 to 50 times faster using HDF5 than with the fastest RDBMS (MySQL).

Development of an HDF5-based data explorer?

• HDF5 library – reliable, efficient, easy to use.

• Query parser and evaluator – CURSA, or Java library

• Graphics package – great variety of them available

• Statistics routines – many available

• Indexing – many B-tree libraries available

• Spatial indexing – free R-tree code exists, or use pixel-code methods.

• Integration with VO for authentication, MySPACE, etc.

• Provide as stand-alone package, and as a Web Service.

• User interface – allowing iterative filtering, undo, etc.

– Probably the most difficult part of the design.

Additional Advantages

• Makes distributed cross-match more feasible: only transfer positional columns from one site to another.

• Simple API – so users with their own data mining code (clustering, classification, etc.) can read HDF5 datasets directly.

• Easy to add access to tables in other formats such as FITS, VOTable, CSV, or (using JDBC) those in other DBMS.

• Can provide user with integrated graphics, statistics, etc. and a host of other desirable data exploration features missing from commercial RDBMS.

Column-oriented Data Explorer - Conclusions

• Developing a column-oriented data storage system using HDF5 is not a trivial exercise, but does not appear to be too difficult.

• It is not intended to be a general-purpose DBMS, and should be seen as something complementing the DBMS in a data archive system or astronomical data warehouse.

• Storage of same data in row-oriented and column-oriented form may be seen as wasteful, but disc is now cheap.