Introduction to Big Data
Supported by EU projects
29/11/2013, Athens, Greece
Open Data for Agriculture
Joint offering by
Intro to Big Data
Antonis Koukourikos, NCSR “Demokritos”
Presentation Outline
• What is Big Data?
• Semantic Web Technologies
• What Semantic Web brings into the picture
WHAT IS BIG DATA?
Part 1
Big Data Is…
Data whose scale, diversity, and complexity require new architecture, techniques, algorithms, and analytics to manage it and extract value and hidden knowledge from it
Big Data Sources
• Biomedical Information
• Sensor Data
• Logs
• E-mails
• Satellite images
• Audio and Video Streams
• Social Networks
Big Data Challenges – “The Three Vs”
• Volume
• Velocity
• Variety
…or is it 4…?
• Veracity
…or is it 6…?
• Visualization
• Value
Big Data demand…
• Storage
  – Impractical or impossible to use centralized storage
    • Distribution
    • Federation
  – Indexing is a problem in itself
• Computational power
  – For discovering
  – For searching / retrieving
  – For joining
• Human effort and expertise
  – Querying can become complex
  – Are you sure you exploit all this information?
SEMANTIC WEB TECHNOLOGIES
Part 2
The Syntactic and the Semantic Web
• The World Wide Web represents information using natural language, graphics, multimedia...
  – Humans can process and combine this information easily
  – However, machines are ignorant!
• The Semantic Web is a Web with a meaning
  – A web of data that is understandable by machines
Semantic Web Technologies
• Common formats for integration and combination of data drawn from diverse sources, whereas the original Web mainly concentrated on the interchange of documents.
• For defining
  – RDFS http://www.w3.org/TR/rdf-schema/
  – OWL http://www.w3.org/TR/owl2-overview/
• For describing
  – RDF http://www.w3.org/RDF/
• For querying
  – SPARQL http://www.w3.org/TR/2013/REC-sparql11-query-20130321/
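To make "describing" and "querying" concrete, here is a minimal sketch in plain Python: data is described as (subject, predicate, object) triples, and a query is a triple pattern with variables. The `http://example.org/` namespace, the `grownIn` property and the `match` helper are assumptions made for illustration; a real system would use an RDF store and a SPARQL engine.

```python
# Toy in-memory triple store; the example.org URIs and data are made up.
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EX = "http://example.org/"  # hypothetical namespace

triples = {
    (EX + "wheat", RDF_TYPE, EX + "Crop"),
    (EX + "wheat", EX + "grownIn", EX + "Thessaly"),
    (EX + "maize", RDF_TYPE, EX + "Crop"),
    (EX + "maize", EX + "grownIn", EX + "Macedonia"),
}

def match(pattern, store):
    """Return one variable binding per triple that matches the pattern.

    Pattern terms starting with '?' are variables; every other term must
    equal the corresponding triple term (a single SPARQL-style pattern).
    """
    results = []
    for triple in store:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A variable used twice must bind to the same value.
                if binding.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            results.append(binding)
    return results

# Roughly: SELECT ?crop WHERE { ?crop rdf:type ex:Crop }
crops = sorted(b["?crop"] for b in match(("?crop", RDF_TYPE, EX + "Crop"), triples))
print(crops)
```

The sketch handles a single pattern only; joining several patterns, which SPARQL does, is left out.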
What SW can do
• Handle heterogeneity
• Handle evolution / variability
• Elicit inferred knowledge
• Volume is still the challenge
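"Elicit inferred knowledge" can be illustrated with a small sketch of two RDFS-style rules: `rdfs:subClassOf` is transitive, and an instance of a class is also an instance of its superclasses. The class names and the `field42_harvest` resource are hypothetical, and the naive fixed-point loop is only an illustration, not how a production reasoner works.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
SUBCLASS_OF = "http://www.w3.org/2000/01/rdf-schema#subClassOf"
EX = "http://example.org/"  # hypothetical namespace

asserted = {
    (EX + "Wheat", SUBCLASS_OF, EX + "Cereal"),
    (EX + "Cereal", SUBCLASS_OF, EX + "Crop"),
    (EX + "field42_harvest", RDF_TYPE, EX + "Wheat"),
}

def infer_types(store):
    """Apply the two rules repeatedly until no new triple is derived."""
    closed = set(store)
    while True:
        new = set()
        for s, p, o in closed:
            for s2, p2, o2 in closed:
                # Only (o subClassOf o2) triples can extend (s p o).
                if p2 != SUBCLASS_OF or o != s2:
                    continue
                if p == SUBCLASS_OF:
                    new.add((s, SUBCLASS_OF, o2))   # transitivity
                elif p == RDF_TYPE:
                    new.add((s, RDF_TYPE, o2))      # type inheritance
        if new <= closed:
            return closed
        closed |= new

inferred = infer_types(asserted) - asserted
for triple in sorted(inferred):
    print(triple)
```

None of the inferred triples was stated explicitly, yet a query for all `Crop` instances would now return `field42_harvest`.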
WHAT SEMANTIC WEB BRINGS IN THE BIG DATA PICTURE
Part 3
Moving Forward with “Old” Technologies
[Figure: CASE OF AGRICULTURAL INFRASTRUCTURES, NOW (2012) vs. 2015 (AgINFRA)]
NOW (2012), components shown: OAI-PMH Service Providers #1 … #n, each with its own schema (AGRIS AP Schema, IEEE LOM Schema, DC Schema, ...); HARVESTER; Aggregated XML Repository; INDEXER; Web Portals (Open AGRIS (FAO), AgLR/GLN (ARIADNE), Organic.Edunet (UAH), VOA3R (UAH), ...)
2015 (AgINFRA), components shown: SPARQL endpoints (Data Source #1 … #n); RDF Triple Store; Common Schema; INDEXER; SPARQL endpoint; Web Portals
How many? Is it feasible? Big Data problem!
What Semantic Web can bring into the picture
• One Data Access Point for the entire Data Cloud
  – Enabling Service-Data level agreements with Data providers
• Application-level Vocabularies / Thesauri / Ontologies
  – Enabling different application facets for different communities of users over the SAME data pool
• Going beyond existing Distributed Triple Store Implementations
  – Link Heterogeneous but Semantically Connected Data
  – Index Extremely Large Information Volumes (Peta Sizes)
  – Improve Information Retrieval response
• Data (+Metadata) physically stored in Data Provider
  – No need for harvesting
• Vocabularies / Thesauri / Ontologies of Data Provider choice
  – No need for aligning according to common schemas
The SemaGrow Solution
• Use POWDER to mass-annotate large sub-spaces
  – Exploit naming convention regularities to compress the indexes used by the system
• Partition triple patterns in the original query
• Annotate each fragment with an ordered list of data sources most likely to contain relevant data
• Distribute and transform the query fragments
• Collect and align the results
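The steps above can be sketched as a toy federation over two in-memory sources. Everything here is an assumption made for illustration: the source names, the data, the `MAPPINGS` table, and ranking by predicate counts as a stand-in for instance statistics. It does not reproduce the SemaGrow components, and joins across fragments are omitted.

```python
# Two hypothetical sources that use different predicates for the same notion.
SOURCES = {
    "source_a": {("wheat", "yield", "3.1"), ("maize", "yield", "5.4")},
    "source_b": {("wheat", "cropYield", "2.9"), ("wheat", "region", "Thessaly")},
}
# Per-source mapping from the query vocabulary to the local vocabulary.
MAPPINGS = {"source_b": {"yield": "cropYield"}}

def local_predicate(source, predicate):
    return MAPPINGS.get(source, {}).get(predicate, predicate)

def rank_sources(pattern):
    """Order sources by how many of their triples use the mapped predicate."""
    _, predicate, _ = pattern
    counts = {
        name: sum(1 for t in data if t[1] == local_predicate(name, predicate))
        for name, data in SOURCES.items()
    }
    ranked = sorted(counts, key=lambda n: (-counts[n], n))
    return [n for n in ranked if counts[n]]

def run_fragment(source, pattern):
    """Transform the pattern to the source vocabulary, match it, and report
    results in the query vocabulary. '?' is a wildcard."""
    s, p, o = pattern
    lp = local_predicate(source, p)
    return [
        (ts, p, to)
        for ts, tp, to in SOURCES[source]
        if tp == lp and s in ("?", ts) and o in ("?", to)
    ]

def federated_query(patterns):
    results = []
    for pattern in patterns:                           # partition
        for source in rank_sources(pattern):           # annotate with ranked sources
            for row in run_fragment(source, pattern):  # distribute and transform
                results.append((source, row))          # collect, aligned to query terms
    return sorted(results)

rows = federated_query([("wheat", "yield", "?")])
print(rows)
```

The client asks about `yield` once; the differing local vocabulary of `source_b` stays hidden behind the mapping step.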
The POWDER W3C Recommendation
• Exploits natural groupings of URIs to annotate all resources in a subset of the URI space
• Regular expression based grouping
• Allows properties and their values to be associated with an arbitrary number of subjects within a fully-defined semantic framework
• POWDER Description Resources: http://www.w3.org/TR/powder-dr/
• POWDER Formal Semantics: http://www.w3.org/TR/powder-formal/
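The idea of regular-expression-based grouping can be sketched in a few lines of Python. This is not POWDER syntax: the `DESCRIPTIONS` list, the URI patterns and the property names are hypothetical, and the sketch only shows how one rule can annotate every URI in a subset of the URI space.

```python
import re

# Hypothetical descriptions: each pairs a regular expression over URIs with
# the properties asserted for every URI that the expression matches.
DESCRIPTIONS = [
    (re.compile(r"^http://example\.org/sensors/soil/"), {"topic": "soil data"}),
    (re.compile(r"^http://example\.org/sensors/"), {"kind": "sensor reading"}),
    (re.compile(r"^http://example\.org/reports/\d{4}/"), {"kind": "report"}),
]

def describe(uri):
    """Collect the properties of every group the URI falls into.

    When two groups set the same property, the earlier description wins.
    """
    properties = {}
    for regex, props in DESCRIPTIONS:
        if regex.search(uri):
            for key, value in props.items():
                properties.setdefault(key, value)
    return properties

print(describe("http://example.org/sensors/soil/plot7"))
print(describe("http://example.org/reports/2013/yield"))
```

Three rules cover any number of resources, which is why exploiting naming regularities keeps such an index small.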
The SemaGrow Stack
• Integrates the components in order to offer a single SPARQL endpoint that federates a number of heterogeneous data sources
• Targets the federation of independently provided data sources
SemaGrow Architecture
[Figure: SemaGrow architecture diagram]
Components shown: Client; SemaGrow SPARQL endpoint; Query Manager; Query Decomposition; Query Decomposer; Data Source(s) Selector; Query Pattern Discovery Service; Query Transformation Service; Schema Mappings; Resource Discovery; Resource Selector; POWDER Inference Layer; P-Store; Federated endpoint Wrapper; Query Results Merger; SPARQL endpoints (Data Source #1 … #n)
Labelled flows: Query; query patterns; equivalent patterns; query fragment, Source (#1 … #n); query fragment, target Source; transformed query; query request #1 … #n; SPARQL query; query results; query results schema; transformed schema; Reactivity parameters; Ctrl
Candidate Source(s) List: Instance Statistics, Load Info, Semantic Proximity; also shown: Data Summaries
[Detail views: Resource Discovery; Query Decomposition; Federated Endpoint Wrapper; Data Summaries Endpoint]
Use Cases (DLO): Heterogeneous Data Collections & Streams
Big data:
  – Sensor data: soil data, weather
  – GIS data: land usage, forest and natural resources management data
  – Historical data: crop yield, economic data
  – Forecasts: climate change models
Problem:
  – Combine heterogeneous sources to analyze past food production and forecast future trends
  – Cannot clone and translate: large scale, live data streams
  – Cannot immediately and directly effect a radical re-design of all sensing and processing currently in place
3rd Plenary & ESG Meeting 21/10/2013
Use Cases (FAO): Reactive Data Analysis
Big data:
  – Document collections: past experiences, analysis and research results
  – Databases: climate conditions and crop yield observations, economic data (land and food prices)
Problem:
  – Retrieving complete and accurate information to compile reports
    • Raw data and reports, scientific publications, etc.
  – Wastes human resources that could analyze data and synthesize useful knowledge and advice for food production
    • Too much time spent cross-relating responses from different sources
  – Too many different organizations and processes rely on the different schemas to make re-design viable
  – Cloning is inefficient: large and constantly updated stores
Use Cases (AK): Reactive Resource Discovery
Big data:
  – Multimedia content about agriculture and biodiversity
Problem:
  – Real-time retrieval of relevant content
  – Used to compile educational activities
  – Schema heterogeneity:
    • Different providers (Organic.Edunet, Europeana, VOA3R, etc.)
  – Too many different organizations and processes rely on the different schemas to make re-design viable
  – Cloning is inefficient: large and constantly updated stores
Project Info
• SemaGrow: Data intensive techniques to boost the real-time performance of global agricultural data infrastructures
• FP7-ICT-2011.4.4 (Intelligent Information Management)
No. Name Country
1 Universidad de Alcala
2 NCSR “Demokritos”
3 Universita Degli Studi di Roma Tor Vergata
4 Semantic Web Company
5 Institut Za Fiziku
6 Stichting Dienst Landbouwkundig Onderzoek
7 Food and Agriculture Organization of the UN
8 Agroknow Technologies