Hrvatska, June 3-rd, 2003
COST Action n. 283 - progress report, June 2003
Computational and Information Infrastructures in the
Astronomical Data GRID
Giuseppe Longo
Chair of Astrophysics, Department of Physical Sciences, University of Napoli “Federico II”, Italy & INFN (Italian Institute for Nuclear Physics)
[email protected]
Chair: Prof. F. Murtagh – Queen's University Belfast
[Image: Hubble Deep Field]
Methodological background:
That is: is history teaching us something (or isn’t it)?
Role of Technological Breakthroughs
[Plot: number of discoveries (all discoveries; before 1954; after 1954)]
Where is (now) the next breakthrough in Astronomy?
Either new channels (better: new information carriers):
• Electromagnetic waves (optical since 1609, other since 60’s)
• Solid samples (70’s ->)
• Gravitational waves (2005 ->)
• Neutrinos (early 80’s ->)
Or leaps in any of:
• Sensitivity
• Spectral range
• Spectral resolution
• Angular resolution
• Time resolution
The iAstro people believe that:
[Diagram linking: Massive data sets; Distributed computing; Massive data mining; Discoveries]
Hardware breakthrough: wide field imaging with CCD Mosaics enables digital surveys
The sky covers 40,000 sq. deg.
With 0.6 arcsec sampling: $2 \times 10^{12}$ pixels
8 TB per band (10/100 TB/survey)
Ca. 10 PB keeping temporal resolution (ca. h for 1 yr … need for 20 yr)
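The pixel and data-volume figures above can be checked with a few lines of arithmetic. This is only a sketch: the sky area and sampling come from the slide, while the 4 bytes per pixel is an assumption (the slide does not state it; it is the value that turns $2 \times 10^{12}$ pixels into 8 TB).

```python
SKY_AREA_SQ_DEG = 40_000      # from the slide (rounded sky area)
PIXEL_SCALE_ARCSEC = 0.6      # from the slide
BYTES_PER_PIXEL = 4           # assumption: not stated on the slide


def survey_pixels(area_sq_deg, pixel_scale_arcsec):
    """Number of pixels needed to tile an area at a given sampling."""
    pixels_per_degree = 3600.0 / pixel_scale_arcsec
    return area_sq_deg * pixels_per_degree ** 2


def volume_terabytes(n_pixels, bytes_per_pixel):
    """Raw data volume in TB (1 TB taken as 10^12 bytes)."""
    return n_pixels * bytes_per_pixel / 1e12


n_pix = survey_pixels(SKY_AREA_SQ_DEG, PIXEL_SCALE_ARCSEC)
tb_per_band = volume_terabytes(n_pix, BYTES_PER_PIXEL)
print(f"{n_pix:.2e} pixels, {tb_per_band:.2f} TB per band")
```

With these inputs the estimate is about $1.4 \times 10^{12}$ pixels and about 5.8 TB per band, which the slide rounds up to $2 \times 10^{12}$ pixels and 8 TB.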
From Traditional to Survey Science
Traditional: Telescope -> Data Analysis -> Results
Survey-Based: Survey Telescope -> Archive -> Target Selection / Data Mining -> Follow-Up Telescope -> Results (or Another Survey/Archive?)
Survey science is highly successful and increasingly prominent, but inherently limited by the information content of individual surveys.
What comes next, beyond survey science, is distributed (V.O.) science.
Courtesy of G. Djorgovski
[Diagram labels]
Primary Data Providers: Surveys, Observatories, Missions; Survey and Mission Archives
Secondary Data Providers: Follow-Up Telescopes and Missions
VO: Data Services (Data Mining and Analysis, Target Selection); Digital libraries
Results
A Schematic Illustration of the new astronomy
Courtesy of G. Djorgovski
[Image panels: Radio; Far-Infrared; Visible; Visible + X-ray; Dust Map; Density Map]
Panchromatic view of the Universe:
Search for the unknown
Offers:
• Different physics
• Global understanding
• Comparison with theory
• New discoveries
New domains of the parameter space: cf. time
Faint, Fast Transients (Tyson et al.)
[Diagram: axes of the observable parameter space: RA, Dec, Wavelength, Time, Flux, Proper motion, Polarization, Morphology / Surf. Br., Non-EM …]
Catalogue space (features; TB)
High dimensionality (N >> 100). What is the coverage? Where are the gaps?
Calls for: feature selection, clustering, statistics, KDD, visualization, etc.
Pixel space (raw data; TB/PB)
Huge data flow, data fusion, need for recalibrations
Calls for: automatic catalogue extraction, spurious feature removal, image parametrization and classification, data compression, multiscale analysis, etc.
[Plot: hours of computer time per night (logarithmic axis, 0.0000001 to 1000) versus year (1500 to 2000); annotation: T2 (Moore) ~ 1.5 years]
Sounds beautiful! … BUT:
Terascale (Petascale?) computing and/or better algorithms are required
In modern data sets: $D_D \gg 10$, $D_S \gg 3$
Data Complexity -> Multidimensionality -> Discoveries
But the bad news is …
The computational cost of clustering analysis:
K-means: $K \times N \times I \times D$
Expectation Maximisation: $K \times N \times I \times D^2$
Monte Carlo Cross-Validation: $M \times K_{\max}^2 \times N \times I \times D^2$
$N$ = no. of data vectors, $D$ = no. of data dimensions
$K$ = no. of clusters chosen, $K_{\max}$ = max no. of clusters tried
$I$ = no. of iterations, $M$ = no. of Monte Carlo trials/partitions
Some dimensionality reduction methods do exist (e.g., PCA, class prototypes, hierarchical methods, etc.), but more work is needed
Digital sky surveys call for huge increases in computing power
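To give a feel for these scalings, the sketch below evaluates the three cost expressions as plain operation counts. The parameter values are illustrative assumptions chosen only for the example (a catalogue of a few million objects with a few tens of features); they are not figures quoted in the talk, and constant factors are ignored.

```python
def kmeans_cost(K, N, I, D):
    """Operation count scaling for K-means: K * N * I * D."""
    return K * N * I * D


def em_cost(K, N, I, D):
    """Operation count scaling for Expectation Maximisation: K * N * I * D^2."""
    return K * N * I * D ** 2


def mccv_cost(M, K_max, N, I, D):
    """Scaling for Monte Carlo Cross-Validation: M * K_max^2 * N * I * D^2."""
    return M * K_max ** 2 * N * I * D ** 2


# Illustrative values (assumptions, not taken from the slides).
N, D = 5_000_000, 50             # data vectors, dimensions
K, K_max, I, M = 10, 20, 100, 20  # clusters, max clusters, iterations, trials

print(f"K-means: {kmeans_cost(K, N, I, D):.1e}")
print(f"EM:      {em_cost(K, N, I, D):.1e}")
print(f"MCCV:    {mccv_cost(M, K_max, N, I, D):.1e}")
```

With these inputs the counts are $2.5 \times 10^{11}$, $1.25 \times 10^{13}$ and $10^{16}$ respectively: the quadratic dependence on $D$ and $K_{\max}$ is what makes the more careful methods expensive on survey-sized catalogues.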
“Standard Activities” (all meeting reports and proceedings on the web)
• First and Second MC meetings, Brussels, 11/23/2001 & 2/14-15/2002
• Third MC meeting, Edinburgh, 07/21/2002 (at GGF-5, Global Grid Forum 5)
• Fourth MC meeting & workshop on Multispectral data analysis, and image metadata, Strasbourg, 11/28-29/2002
• Fifth MC meeting & workshop on High/low resolution signal processing, Granada, 02/22-23/2003
• Planned: Sixth MC meeting & workshop on Poisson noise models, Nice, Oct. 2003.
• Planned: Seventh MC meeting & workshop on Data mining & Image analysis in a distributed environment, Capri, Mar. 2004.
Granada, February 2002
Guess who was taking the picture…
Major Orientation of iAstro in early 2003: FP6
• Expressions of Interest filed in summer 2002.
• Participation in Commission Information Days.
• Involvement in several NoEs (sensor fusion, information retrieval, e-education and training, the European virtual observatory, and digital signal processing and data mining in medicine).
• Participation in evaluation panels.
COST 283 proposal for the Marie Curie RTN network
“GridFocus: Data and Information Fusion and Mining in the Context of the DataGrid”
• Submitted early April 2003.
• Participants: iAstro partners in BG, CH, D, E, F, GR, H, I, IRL and UK. Additional partner cluster in University of Paris Sud.
Multiband and multiple layer image and signal processing as a basic paradigm for the data Grid.
Data mining of visual and other streams, including high performance forensic image data mining.
Empirical and virtual data interfaces.
GridFocus concept based on data dynamics and information thermodynamics
SOMETHING ON SCIENCE….
[Flowchart labels]
Open / import: header compliant or non compliant
Head/proc. preprocessing
Parameter and training options: supervised or unsupervised
Supervised (labeled data): label preparation; training set preparation; feature selection via unsupervised clustering; MLP, RBF, etc.
Unsupervised (unlabeled data): parameter options; feature selection via unsupervised clustering; GTM, SOM, fuzzy set, etc.
INTERPRETATION
Code in C++, parallelized on Beowulf
Used (so far) for: cosmology, particle physics (ARGO), gravitational waves (VIRGO)
A standard clustering example: unsupervised S/G classification
Input data: DPOSS catalogue (ca. $5 \times 10^6$ objects, 50 features each)
SOM (output is a U-Matrix) ~ GTM (output is a PDF)
1. Input data (Tables or strings)
2. Feature selection (backward elimination strategy)
3. Compression of input space and re-design of network
4. Classification
5. Labeling (e.g. 500 well classified objects)
6. …freeze & run on real data
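The SOM stage of this pipeline, with its U-matrix output, can be illustrated with a minimal NumPy sketch. It is not the C++ code used for the talk: the grid size, learning schedule and the two-blob toy data standing in for "stars" and "galaxies" are all assumptions made for the example.

```python
import numpy as np


def best_matching_unit(weights, x):
    """Grid index (row, col) of the node whose weight vector is closest to x."""
    dist2 = np.sum((weights - x) ** 2, axis=-1)
    return np.unravel_index(np.argmin(dist2), dist2.shape)


def train_som(data, rows=6, cols=6, n_iter=2000, lr0=0.5, sigma0=3.0, seed=0):
    """Train a small rectangular SOM with online updates."""
    rng = np.random.default_rng(seed)
    weights = rng.random((rows, cols, data.shape[1]))
    # Grid coordinates of every node, shape (rows, cols, 2).
    grid = np.stack(
        np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1
    )
    for t in range(n_iter):
        frac = t / n_iter
        lr = lr0 * (1.0 - frac)                  # learning rate decays to 0
        sigma = sigma0 * (1.0 - frac) + 0.5      # neighbourhood shrinks to 0.5
        x = data[rng.integers(len(data))]
        bmu = best_matching_unit(weights, x)
        grid_dist2 = np.sum((grid - np.array(bmu)) ** 2, axis=-1)
        h = np.exp(-grid_dist2 / (2.0 * sigma ** 2))
        weights += lr * h[..., None] * (x - weights)
    return weights


def u_matrix(weights):
    """Mean distance from each node to its 4-connected grid neighbours."""
    rows, cols, _ = weights.shape
    u = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            dists = [
                np.linalg.norm(weights[r, c] - weights[r + dr, c + dc])
                for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1))
                if 0 <= r + dr < rows and 0 <= c + dc < cols
            ]
            u[r, c] = np.mean(dists)
    return u


def quantization_error(weights, data):
    """Mean distance between each data vector and its closest node."""
    flat = weights.reshape(-1, weights.shape[-1])
    d = np.linalg.norm(data[:, None, :] - flat[None, :, :], axis=-1)
    return d.min(axis=1).mean()


# Toy data: two well-separated blobs (hypothetical stand-ins for two classes).
rng = np.random.default_rng(1)
data = np.vstack([
    rng.normal(0.0, 0.3, size=(200, 2)),
    rng.normal(5.0, 0.3, size=(200, 2)),
])
weights = train_som(data)
umat = u_matrix(weights)
```

High values in `umat` mark the boundary between groups of nodes; a small set of well-classified objects can then be mapped onto the grid with `best_matching_unit` to label each region, as in step 5.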
Star/Galaxy classification
Automatic selection of significant features
Unsupervised SOM (DPOSS data)
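A backward elimination strategy for selecting significant features can be sketched generically as follows. The feature names and the toy scoring function are hypothetical and serve only to make the example runnable; in practice the score would be some quality measure of the clustering obtained with the candidate feature subset, which the slides do not specify.

```python
from typing import Callable, List, Sequence


def backward_elimination(
    features: Sequence[str],
    score: Callable[[List[str]], float],
    min_features: int = 1,
    tolerance: float = 0.0,
) -> List[str]:
    """Greedily drop the feature whose removal hurts the score least.

    Stops when every possible removal would lower the score by more
    than `tolerance`, or when `min_features` are left.
    """
    kept = list(features)
    best = score(kept)
    while len(kept) > min_features:
        # Score each subset obtained by dropping exactly one feature.
        trials = [(score([f for f in kept if f != drop]), drop) for drop in kept]
        trial_score, drop = max(trials, key=lambda t: t[0])
        if trial_score < best - tolerance:
            break
        kept.remove(drop)
        best = max(best, trial_score)
    return kept


# Hypothetical usefulness of each feature (illustration only).
USEFULNESS = {
    "magnitude": 1.0,
    "area": 0.8,
    "ellipticity": 0.05,
    "x_pixel": 0.0,
    "y_pixel": 0.0,
}


def toy_score(kept: List[str]) -> float:
    """Toy score: total usefulness minus a small penalty per feature."""
    return sum(USEFULNESS[f] for f in kept) - 0.1 * len(kept)


selected = backward_elimination(list(USEFULNESS), toy_score)
print(selected)
```

With this toy score the uninformative position features and the weak ellipticity feature are discarded, leaving `magnitude` and `area`.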
Labeling
Localization of a set of 500 faint stars
G.T.M. unsupervised clustering; S/G
[Plot panels: stars p.d.f.; galaxies p.d.f.; cumulative p.d.f.]
G.T.M. unsupervised clustering; S/G – CDF Field ($5 \times 10^5$ obj.)
[Plot panels: stars p.d.f.; galaxies p.d.f.; cumulative p.d.f.]
Photometric redshifts: a mixed case
• Input data set: SDSS-EDR photometric data (galaxies)
• Training/validation/test set: SDSS-EDR spectroscopic subsample
[Flowchart labels: SDSS-EDR DB; SOM unsup. (set construction); SOM supervised (feature selection); MLP supervised (experiments); SOM unsup. (completeness); Reliability Map; Best MLP model]
Step 3 - experiments to find the optimal architecture
Varying n. of input, n. of hidden, n. of patterns in the training set, n. of training epochs, n. of Bayesian cycles and inner loops, etc.
Convergence computed on validation set
Error derived from test set
Robust error: 0.02176
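The roles of the three sets can be sketched as follows: the validation set decides when training has converged, and the test set is touched only once to quote the final error. The split fractions and the MAD-based estimator are assumptions for illustration; the slides do not say how the sets were sized or how the "robust error" was computed.

```python
import numpy as np


def split_indices(n, fractions=(0.6, 0.2, 0.2), seed=0):
    """Random, disjoint train/validation/test index sets (fractions assumed)."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n)
    n_train = int(round(fractions[0] * n))
    n_valid = int(round(fractions[1] * n))
    return (
        order[:n_train],
        order[n_train:n_train + n_valid],
        order[n_train + n_valid:],
    )


def best_epoch(validation_errors):
    """Epoch at which the validation error is lowest (convergence criterion)."""
    return int(np.argmin(validation_errors))


def robust_sigma(residuals):
    """Outlier-resistant scatter: 1.4826 * median absolute deviation.

    Assumption: one common robust estimator, not necessarily the one
    used for the figure quoted on the slide.
    """
    residuals = np.asarray(residuals, dtype=float)
    med = np.median(residuals)
    return 1.4826 * np.median(np.abs(residuals - med))


train_idx, valid_idx, test_idx = split_indices(1000)

# Hypothetical validation curve and test-set residuals, for illustration.
epoch = best_epoch([0.5, 0.3, 0.2, 0.25, 0.4])
sigma = robust_sigma([-2, -1, 0, 1, 2, 100])
```

In the toy residuals a single outlier of 100 barely affects `robust_sigma`, whereas it would dominate an ordinary standard deviation.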
iAstro strategy:
• Advance the state of the art through our workshops and visits. Results from many of these exchanges were presented in a Special Issue of Neural Networks (Ed. Tagliaferri and Longo, vol. 16, 3-4, 2003). Status: ongoing.
• Define our role vis-à-vis large Framework Programme projects on the virtual observatory, grid, computer vision, etc. through an iAstro White Paper. Status: done in early 2003.
• Spin off specific targeted actions where greater resources are needed. Status: GridFocus Marie Curie RTN network proposal written and submitted in early 2003; local initiatives.
• Next step: spin-out and commercial exploitation of our work through a STREP or IP proposal?
Where & how to know more about iAstro:
iAstro web pages: http://www.iastro.org
To join the iAstro Mailing List: send a message to: [email protected]
Thanks to: E.U. & to… Prof. Fedi