image information mining conference: the sentinels era
DESCRIPTION
Quantifying the Value of Federated Datasets in Earth Observation Information Mining and Analytics.

Volumes of EO data systematically collected, processed and stored are continuously increasing, and it is becoming more and more difficult to evaluate their “value”. Datasets are made available by different institutions (agencies, commercial providers, etc.), leading to a federation of EO datasets. The Dataset Value is a vague concept: inherent information content, its possible exploitation, relation with user’s application needs, etc. How to evaluate the value of an EO dataset in a typical scenario of a network of federated datasets?

Our work shows that the Representation Capacity of an EO product dataset D is proportional to the log of the cardinality of the EO dataset. The value of a federation of datasets should take into account the Representation Capacity, and therefore grows with the log of the size of the individual datasets. In order to evaluate and compare different datasets in a federation for further processing, a general methodology to preserve the relative information content has been defined.

New models for research and service support are emerging in the Earth Observation context, and data availability from forthcoming missions will increase rapidly. Facilities for EO dissemination and processing services, geographically distributed in a federated domain, largely scalable with reliable Quality of Services, are urgently needed. Federated domains shall federate both computing and storage resources. The federation is valued and sustained by the underpinning Earth Observation datasets and their information content. To value dataset federations in wider contexts (e.g. Big Data, Web 2.0), R&D activities are needed to fully exploit the information they contain. A programmatic framework to sustain such R&D activities must be set up to cover the various aspects involved (Image Information Mining, Time Series analysis, EO data analytics, multi-dimensional databases, semantic web, visual analytics, etc.).
TRANSCRIPT
Image Information Mining Conference 05/03/2014
Quantifying the Value of Federated Datasets in Earth
Observation Information Mining and Analytics
P.G. Marchetti (*), M. Iapaolo (**)
* European Space Agency (ESA/ESRIN)
EOP Research and Ground Segment Technology Section
** Randstad Italia c/o ESA/ESRIN
Outline
1. Introduction
2. EO Datasets Value
3. Representation Capacity and Information Content for EO Datasets
4. Initial Results
5. Towards “Big Data”
6. Future work and perspectives
Introduction
1. Volumes of EO data systematically collected, processed and stored are continuously increasing
2. It is becoming more and more difficult to evaluate their “value”
3. Datasets are made available by different institutions (agencies, commercial providers, etc.), leading to a federation of EO datasets
4. The Dataset Value is a vague concept: inherent information content, its possible exploitation, relation with user’s application needs, etc.
How to evaluate the value of an EO dataset in a typical scenario of a network of federated datasets?
EO Datasets Value
Communication networks: the value of a network (its growth potential) grows as a quadratic function (n^2) of the number of network nodes n (Metcalfe’s Law)
Generic concept of value (importance) applicable to a wide range of natural phenomena (occurrences of words in a text, population size of big cities, etc.): the k-th ranked item has a value (frequency, size) of about 1/k of the first one (Zipf’s Law)
Total Value = sum of the decreasing 1/k values over all the n items ≈ log(n)
Applying to all n nodes: Total Value ≈ n log(n)
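As a rough numerical illustration of the two steps above, the sketch below sums the Zipf weights 1/k, compares the result with log(n), and then sets n log(n) beside the linear and quadratic growth laws. The function names and the chosen values of n are illustrative assumptions, not part of the presentation, and the natural logarithm is assumed.

```python
import math


def zipf_value_per_node(n: int) -> float:
    """Sum of the Zipf weights 1/k for k = 1..n (the n-th harmonic number)."""
    return sum(1.0 / k for k in range(1, n + 1))


def network_values(n: int) -> dict:
    """Total value of an n-node network under three growth laws."""
    return {
        "linear": float(n),
        "zipf_n_log_n": n * math.log(n),      # Zipf-based estimate
        "metcalfe_n_squared": float(n * n),   # Metcalfe's law
    }


# Illustrative network sizes (assumed, not taken from the slides)
for n in (10, 100, 1000, 10000):
    h = zipf_value_per_node(n)
    v = network_values(n)
    print(
        f"n={n:>6}  sum(1/k)={h:7.3f}  log(n)={math.log(n):7.3f}  "
        f"n log(n)={v['zipf_n_log_n']:12.1f}  n^2={v['metcalfe_n_squared']:12.0f}"
    )
```

The sum of 1/k stays above log(n) by a nearly constant offset (about 0.58 for large n), so "≈ log(n)" should be read as a statement about growth rate rather than an exact equality.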
EO Datasets Value
With Zipf’s law the crossover point is reached at a larger n than with Metcalfe’s law
Figure: plot of the n log(n) growth function compared with the linear and quadratic ones. The origin is set at n = 1.
EO Datasets Value and Information Content
1. In the EO context, it is of paramount importance to assess the value of datasets from the information content point of view (rather than from growth potential or market value)
2. The actual exploitation of federated datasets is mainly based on their information content, extracted through time series analysis, image information mining techniques and analytics
3. The relative value (i.e. the information content) of an EO dataset makes it possible to:
• estimate the number of EO products (or samples) to be used
• select which datasets are relevant for an analysis
Need for a theoretical framework for the assessment of the value (information content) of a federation of EO datasets
Representation Capacity
Given a family of n non-overlapping datasets in a federation, D = {D_1, D_2, …, D_n};
Select from D a sample S = {S_1, S_2, …, S_n}, where each S_h is contained in D_h (h = 1, 2, …, n);
Our aim here is to assess and quantify how representative S is of D, and how it can characterise the value of D
The Representation Capacity of D, K(D), is a measure of the degree of arbitrariness in choosing the sample S from D
K(D) should be a non-decreasing function f(x), where x is the size of the set from which the images must be extracted
Representation Capacity
Method for building the sample S from D: how to extract a representative number of samples s_h from each dataset D_h, in order to preserve the relative value (information content) of the corresponding datasets D_h.
With d_h denoting the cardinality of D_h and x = d_1 d_2 … d_n:
K(D) = f(x) = f(d_1 d_2 … d_n) = f(d_1) + f(d_2) + … + f(d_n) = K(D_1) + K(D_2) + … + K(D_n)
f(x) = k log(x)
Assuming s_h proportional to K(D_h): s_h = k log(d_h)
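The allocation rule s_h = k log(d_h) can be sketched as follows. The dataset names, their cardinalities and the constant k are hypothetical values chosen only for illustration; the natural logarithm, the rounding to whole products and the cap at d_h are assumptions that the slides do not specify.

```python
import math


def allocate_samples(cardinalities: dict, k: float) -> dict:
    """Return s_h = k * log(d_h) for each dataset, rounded to whole products."""
    samples = {}
    for name, d_h in cardinalities.items():
        if d_h < 1:
            raise ValueError(f"dataset {name!r} must contain at least one product")
        # Assumption: never request more products than the dataset holds.
        samples[name] = min(d_h, round(k * math.log(d_h)))
    return samples


# Hypothetical federation: dataset name -> number of EO products (d_h)
federation = {"D1": 1_000, "D2": 100_000, "D3": 10_000_000}
print(allocate_samples(federation, k=10.0))
```

With these assumed numbers, a dataset 10,000 times larger than another contributes only about 2.3 times as many samples, which is the practical consequence of the logarithmic law.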
Information Content
Denoting I_h = log(d_h):
I_h represents the average information associated with the random choice (with uniform probability 1/d_h) of a single product from each D_h
M = k I
The number of EO product samples M is therefore proportional to the average Shannon information I associated with the h random choices
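The identification of I_h with a Shannon information can be checked directly: for a uniform choice among d_h products, the entropy reduces to log(d_h). The second line is only a sketch of how M = k I could follow; it assumes that M is the sum of the s_h and that I is the sum of the I_h over the n datasets, neither of which is stated explicitly on the slide.

```latex
\begin{align}
I_h &= -\sum_{i=1}^{d_h} \frac{1}{d_h}\,\log\frac{1}{d_h} = \log d_h \\
M &= \sum_{h=1}^{n} s_h = k \sum_{h=1}^{n} \log d_h = k \sum_{h=1}^{n} I_h = k\,I
\end{align}
```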
Representation Capacity
1. The Representation Capacity of an EO product dataset D is proportional to the log of the cardinality of the EO dataset
2. The value of a federation of datasets should take into account the Representation Capacity, and therefore grows with the log of the size of the individual datasets
3. In order to evaluate and compare different datasets in a federation for further processing, a general methodology to preserve the relative information content has been defined
Comments
1. Additional constraints could be imposed by the objectives and requirements of further processing, image mining, time series analysis and statistics/analytics
2. The simplified approach presented in this paper could make it possible to assess the value (information content) of a federation of EO datasets within Shannon’s theoretical framework
3. This approach should complement the one derived from Zipf’s law, based on the number n of datasets in the federation, to help decision makers in evaluating the wealth of available information.
Information Content
Denoting I_h = log(d_h):
I_h represents the average information associated with the random choice (with uniform probability 1/d_h) of a single product from each D_h
To comply with this requirement of representativeness, we define as Informative Units the vectors of EO product samples v = {v_1, v_2, …, v_n}, with each v_h belonging to the corresponding dataset D_h. It is important to note that the Informative Unit is neither a single EO product in D_h nor an arbitrary n-tuple of EO products.
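One way to read this definition is that an Informative Unit takes exactly one product from every dataset in the federation. The sketch below, under that reading, draws such a unit uniformly at random and checks that the log of the number of possible units equals the sum of the I_h. The dataset names, product identifiers and random seed are hypothetical.

```python
import math
import random


def draw_informative_unit(federation: dict, rng: random.Random) -> tuple:
    """Pick one product v_h from each dataset D_h, in dataset order."""
    return tuple(rng.choice(products) for products in federation.values())


def count_informative_units(federation: dict) -> int:
    """Number of distinct units: the product d_1 * d_2 * ... * d_n."""
    return math.prod(len(products) for products in federation.values())


# Hypothetical federation with d_1 = 3, d_2 = 2, d_3 = 4 products
federation = {
    "D1": ["a1", "a2", "a3"],
    "D2": ["b1", "b2"],
    "D3": ["c1", "c2", "c3", "c4"],
}

rng = random.Random(0)  # fixed seed so the draw is repeatable
unit = draw_informative_unit(federation, rng)
total = count_informative_units(federation)
info_sum = sum(math.log(len(products)) for products in federation.values())

print("unit:", unit)
print("possible units:", total)
print("log(total) vs sum of I_h:", math.log(total), info_sum)
```

Because the number of units is the product of the cardinalities, its logarithm is additive over the datasets, which matches the additivity required of K(D).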
Initial results 1-3
1. General approach for the assessment of the value – in terms of information content – of a federation of EO datasets
2. Interpretation of results under the Shannon information theoretical framework:
o The information content of a dataset is proportional to its cardinality
o Considering a sample of data extracted from the whole dataset, the Representation Capacity of the dataset is proportional to the log of its cardinality
o As a consequence, the value (information content) of a federation of EO datasets grows with the log of the size of the individual datasets
Initial results 2-3
Oops, if we have a look at the papers…
[Figure: chart of the number of papers published on IEEE per mission (CRYOSAT, ENVISAT, ERS, GOCE, IKONOS, KOMPSAT, LANDSAT, MODIS, PROBA, RAPIDEYE, SMOS); vertical axis from 0 to 3000; search performed on 14.02.2014]
Initial results 3-3
The identification of a general method for evaluating, comparing and selecting different datasets cannot ignore other information elements like:
• the papers published and their quality, content, relevance, citations and impact factors, e.g. the h-index (see Hirsch [1])
• the papers published and related parameters: mission, sensor, area, …
• the web pages published (see PageRank [2])
• Social media
• …
Future Work, towards “Big Data”
1. New models for research and service support are emerging in the Earth Observation context, and data availability from forthcoming missions will increase rapidly
2. Facilities for EO dissemination and processing services, geographically distributed in a federated domain, largely scalable with reliable Quality of Services, are urgently needed
3. Federated domains shall federate both computing and storage resources. The federation is valued and sustained by the underpinning Earth Observation datasets and their information content
4. To value dataset federations in wider contexts (e.g. Big Data, Web 2.0), R&D activities are needed to fully exploit the information they contain
5. A programmatic framework to sustain such R&D activities must be set up to cover the various aspects involved (IIM, TS analysis, EO data analytics, multi-dimensional databases, semantic web, visual analytics, etc.)
Future Work, towards “Big Data”
1. The programmatic framework should span a time frame of 5-10 years
2. It should include a strong user validation step (possibly involving hundreds of users and laboratories)
3. It should be extended to include other domains (not only EO!): Earth and Space Science, Engineering, … (see the announced “Big Data from Space” Conference)
4. Recent work (Mazzucato) demonstrates the benefits of funding large and strongly supported research programmes (venture capital and the market will follow, exploiting earlier consistent investments by state-funded institutions)
5. Research on value-enhanced search for EO data may help in adding value and is needed to exploit the great variety of data which will be made available!
References
[1] J.E. Hirsch, An index to quantify an individual's scientific research output, Proceedings of the National Academy of Sciences 102(46):16569-16572, 2005
[2] L. Page, S. Brin, R. Motwani, T. Winograd, The PageRank Citation Ranking: Bringing Order to the Web, Technical Report, Stanford InfoLab, 1999
Thank you!!