Big Data Curation Webinar 19/12/2013 BIG Big Data Public Private Forum Big Data Curation Edward Curry (Insight @ NUI Galway) Project co-funded by the European Commission within the 7th Framework Program (Grant Agreement No. 257943)



Big Data Curation Webinar 19/12/2013

BIG Big Data Public Private Forum

Big Data Curation

Edward Curry (Insight @ NUI Galway)

Project co-funded by the European Commission within the 7th Framework Program (Grant Agreement No. 257943)


BIG DATA INSIGHTS

The Data Landscape
▶  Coping with data variety and verifiability are central challenges and opportunities for Big Data
▶  The long tail of data variety is a major shift in the data landscape
▶  Need for scalable approaches to cope with data under different format and semantic assumptions

The Solution Space
▶  Lowering the usability barrier for data tools is a major requirement across all sectors. Users should be able to directly manipulate the data
▶  Blended human and algorithmic data processing approaches are a trend for coping with data acquisition, transformation, curation, access, and analysis challenges for Big Data
▶  Solutions based on large communities (crowd-based approaches) are emerging as a trend to cope with Big Data challenges


THE DATA VALUE CHAIN

Value Chain: Data Acquisition, Data Analysis, Data Curation, Data Storage, Data Usage

Technical Working Groups

Data Acquisition
•  Structured data
•  Unstructured data
•  Event processing
•  Sensor networks
•  Streams
•  Multimodality

Data Analysis
•  Data preprocessing
•  Semantic analysis
•  Sentiment analysis
•  Data correlation
•  Pattern recognition
•  Realtime analysis
•  Machine learning

Data Curation
•  Trust
•  Provenance
•  Data augmentation
•  Annotation
•  Data validation
•  Redundancy elimination
•  Keep up-to-date
•  Consistency

Data Storage
•  In-Memory Technology
•  HANA
•  Column DB
•  NoSQL
•  Cloud storage
•  Compression

Data Usage
•  Decision support
•  Predictions
•  Simulation
•  Exploration
•  Modelling
•  Control
•  Domain-specific usage


DATA CURATION

Value Chain

Data Acquisition

Data Analysis

Data Curation

Data Storage

Data Usage


THE PROBLEM: DATA QUALITY

Source A

ID    PNAME      PCOLOR  PRICE
APNR  iPod Nano  Red     150
APNS  iPod Nano  Silver  160

Source B

<Product name="iPod Nano">
    <Items>
        <Item code="IPN890">
            <price>150</price>
            <generation>5</generation>
        </Item>
    </Items>
</Product>

[Figure: records from both sources placed side by side]

APNR  iPod Nano  Red     150
APNR  iPod Nano  Silver  160
iPod Nano  IPN890  150  5

Schema Difference? Value Conflicts? Entity Duplication?

[Figure labels: Data Developer, Data Steward, Business Users; Technical, Domain]
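The three questions on this slide can be made concrete with a small sketch. The Python below is illustrative only: it assumes Source A is the flat table and Source B the XML fragment, and the field mapping (`FIELD_MAP`) and the matching rule (same product name) are hypothetical choices, not something the slides prescribe.

```python
import xml.etree.ElementTree as ET

# Source A: flat table rows (values taken from the slide).
SOURCE_A = [
    {"ID": "APNR", "PNAME": "iPod Nano", "PCOLOR": "Red", "PRICE": "150"},
    {"ID": "APNS", "PNAME": "iPod Nano", "PCOLOR": "Silver", "PRICE": "160"},
]

# Source B: nested XML (values taken from the slide).
SOURCE_B = """
<Product name="iPod Nano">
  <Items>
    <Item code="IPN890">
      <price>150</price>
      <generation>5</generation>
    </Item>
  </Items>
</Product>
"""

# Hypothetical mapping from Source B field names to Source A column names.
FIELD_MAP = {"name": "PNAME", "code": "ID", "price": "PRICE"}


def flatten_source_b(xml_text):
    """Turn each <Item> into a flat dict that also carries the product name."""
    product = ET.fromstring(xml_text)
    rows = []
    for item in product.iter("Item"):
        row = {"name": product.get("name"), "code": item.get("code")}
        for child in item:
            row[child.tag] = (child.text or "").strip()
        rows.append(row)
    return rows


def schema_difference(a_rows, b_rows):
    """Columns that exist in only one source after applying FIELD_MAP."""
    a_cols = {k for r in a_rows for k in r}
    b_cols = {FIELD_MAP.get(k, k) for r in b_rows for k in r}
    return sorted(a_cols - b_cols), sorted(b_cols - a_cols)


def compare_records(a_rows, b_rows):
    """Pair records by product name (assumed rule), then flag price conflicts."""
    duplicates, conflicts = [], []
    for a in a_rows:
        for b in b_rows:
            if a["PNAME"] == b["name"]:
                duplicates.append((a["ID"], b["code"]))
                if a["PRICE"] != b["price"]:
                    conflicts.append((a["ID"], b["code"], a["PRICE"], b["price"]))
    return duplicates, conflicts


b_rows = flatten_source_b(SOURCE_B)
only_a, only_b = schema_difference(SOURCE_A, b_rows)
duplicates, conflicts = compare_records(SOURCE_A, b_rows)
print("only in A:", only_a)
print("only in B:", only_b)
print("possible duplicates:", duplicates)
print("price conflicts:", conflicts)
```

Matching on product name alone is deliberately naive. It pairs both Source A rows with the single Source B item, which is the kind of ambiguity that a data steward or business user would have to settle.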


DATA CURATION OVERVIEW

Definition
▶  Digital Curation: “Selection, preservation, maintenance, collection, and archiving of digital assets”
▶  Data Curation: “Active management of data over its life-cycle”

Who?
▶  Individual Curators
▶  Curation Departments
▶  Community-based (Emerging trend)

How?
▶  Manual Curation
▶  (Semi-)Automated
▶  Sheer Curation
▶  Collaborative Data Management (Crowdsourcing)

Why?
▶  Accessible
▶  Authenticity
▶  Collaboration
▶  Discoverability
▶  Fitness for Use
▶  Integrity
▶  Reusability
▶  Security
▶  Sustainability
▶  Trustworthy


ALGORITHM + CROWD

[Figure labels: Data Sources, Data Quality Algorithms, Human Computation, Clean Data, Developers, Data Governance, Internal Community, External Crowd]
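One way to read this picture is as a pipeline in which a data quality algorithm handles the records it is confident about and passes the rest to people. The sketch below is an illustration under stated assumptions: the `confidence` scores, the 0.8 threshold, and all names are hypothetical and do not come from the slides.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    record_id: str
    proposed_fix: str
    confidence: float  # hypothetical score from a data quality algorithm, 0..1


def route(candidates, threshold=0.8):
    """Split algorithm output into auto-accepted fixes and human review tasks."""
    accepted, for_humans = [], []
    for c in candidates:
        if c.confidence >= threshold:
            accepted.append(c)      # trusted enough to apply automatically
        else:
            for_humans.append(c)    # becomes a task for the community or crowd
    return accepted, for_humans


candidates = [
    Candidate("r1", "merge with r7", 0.95),
    Candidate("r2", "set colour to 'Silver'", 0.55),
    Candidate("r3", "drop duplicate", 0.80),
]
accepted, for_humans = route(candidates)
print("auto-accepted:", [c.record_id for c in accepted])
print("sent to people:", [c.record_id for c in for_humans])
```

The threshold is the main design lever in such a setup: raising it sends more work to people and lowers the risk of a wrong automatic fix, while lowering it does the opposite.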


MIXED HUMAN-COMPUTER INTELLIGENCE

Key Points
▶  Coordinating a crowd (a large group of workers) to do micro-work (small tasks) that solves problems (that computers or a single user can’t)
▶  A collection of mechanisms and associated methodologies for scaling and directing crowd activities to achieve goals

Related Areas
▶  Collective Intelligence
▶  Social Computing
▶  Human Computation
▶  Data Mining & Machine learning
▶  Natural Language Processing
▶  Speech recognition & Computer vision


HUMAN VS MACHINE AFFORDANCES

Human
✓ Visual perception
✓ Visuospatial thinking
✓ Audiolinguistic ability
✓ Sociocultural awareness
✓ Creativity
✓ Domain knowledge

Machine
✓ Large-scale data manipulation
✓ Collecting and storing large amounts of data
✓ Efficient data movement
✓ Bias-free analysis


WHEN COMPUTERS WERE HUMAN

▶  Used human computers to create an almanac of moon positions
▶  Used for shipping/navigation
▶  Quality assurance
▶  Do calculations twice
▶  Compare to third verifier

Maskelyne 1760
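The quality assurance scheme on this slide (do each calculation twice, then compare against a third verifier) maps onto a small redundancy check. The following is a minimal sketch in Python. The function names and the toy calculation are hypothetical, and treating the verifier as a tie-breaker that is consulted only on disagreement is one interpretation of the slide, not a historical detail.

```python
def verified_result(task, computer_a, computer_b, verifier):
    """Run a task twice independently; consult a verifier only if needed.

    Returns (value, status). Status is 'agreed' when the two computers
    match, 'resolved' when the verifier sides with one of them, and
    'unresolved' when all three differ (an assumed fallback).
    """
    first, second = computer_a(task), computer_b(task)
    if first == second:
        return first, "agreed"
    third = verifier(task)
    if third in (first, second):
        return third, "resolved"
    return None, "unresolved"


# Toy stand-ins for human computers: one of them slips on the input 7.
def careful(x):
    return x * x


def sloppy(x):
    return x * x + (1 if x == 7 else 0)


print(verified_result(6, careful, sloppy, careful))
print(verified_result(7, careful, sloppy, careful))
```

The same pattern reappears in crowd-based curation, where the same micro-task is given to several workers and their answers are compared before a result is accepted.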



BIG DATA CURATION EXEMPLARS


TAG A TUNE


PEEKABOOM


FOLDIT


RECAPTCHA

▶  OCR
▶  ~ 1% error rate
▶  20%-30% for 18th and 19th century books
▶  40 million ReCAPTCHAs every day (2008)
▶  Fixing 40,000 books a day


BIG DATA CURATION IN ENTERPRISES

Product Categorization
▶  Categorize millions of products with accurate and complete attributes
▶  Combine the crowd with machine learning to create an affordable and flexible catalog quality system

Sentiment Analysis
▶  Understanding customer sentiment for worldwide launch of new product
▶  Implemented 24/7 sentiment analysis system using workers from around the world


BIG DATA CURATION USE CASES

Telco, Media, & Entertainment

Manufacturing, Retail, Energy & Transport

Public Sector

Life Sciences


COMMUNITY, CROWDS, & OPEN DATA

Community & Crowds
▶  Leverage online community to curate large datasets
▶  Natural Language Processing, Computer Vision, Classification, Verification, Enrichment, Judgments, etc.

Emerging Economic Model for Open Data
▶  Pre-competitive collaboration efforts
▶  Share costs, risks, & technical challenges
▶  Benefit from collective wisdom and network effect for curated dataset
▶  Pistoia Alliance (pharmaceutical data)


FUTURE REQUIREMENTS OF BIG DATA CURATION

Curation at Scale
▶  Increase in the need for automation
▶  Trust and provenance capture/management

Access Management
▶  Interfaces which can cope with different levels of expertise and responsibility
▶  Discoverability of data items
▶  Fine-grained control over accessibility of various data items

Variety of Expertise
▶  Enable contribution from a wide range of human resources such as programmers, domain experts, non-expert contributors, and crowds
▶  Distribute curation tasks while considering abilities of persons and complexities of tasks

Multimedia & Text
▶  Data curation infrastructure focused on multimedia and unstructured resources
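Distributing curation tasks by ability and complexity can be sketched as a simple matching problem. The Python below is a hypothetical illustration only: the numeric `skill` and `complexity` scales, the example names, and the greedy least-loaded rule are assumptions chosen for brevity, not a method described in the webinar.

```python
from dataclasses import dataclass, field


@dataclass
class Curator:
    name: str
    skill: int  # assumed scale: 1 (crowd worker) .. 3 (domain expert)
    tasks: list = field(default_factory=list)


@dataclass
class Task:
    name: str
    complexity: int  # assumed scale, comparable to Curator.skill


def distribute(tasks, curators):
    """Assign each task to a curator whose skill covers its complexity.

    Hardest tasks are placed first. Among qualified curators the least
    loaded one is chosen, with ties broken in favour of lower skill.
    Returns the names of tasks nobody is qualified for.
    """
    unassigned = []
    for task in sorted(tasks, key=lambda t: -t.complexity):
        qualified = [c for c in curators if c.skill >= task.complexity]
        if not qualified:
            unassigned.append(task.name)
            continue
        best = min(qualified, key=lambda c: (len(c.tasks), c.skill))
        best.tasks.append(task.name)
    return unassigned


curators = [
    Curator("crowd worker", 1),
    Curator("programmer", 2),
    Curator("domain expert", 3),
]
tasks = [
    Task("tag image", 1),
    Task("validate price", 1),
    Task("map schema", 2),
    Task("resolve drug names", 3),
]
unassigned = distribute(tasks, curators)
for c in curators:
    print(c.name, "->", c.tasks)
print("unassigned:", unassigned)
```

A real system would likely estimate skill from past accuracy and cap each person's workload; the fixed integer scales here only keep the example short.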


SUMMARY

The Data Landscape
▶  Coping with data variety and verifiability are central challenges and opportunities for Big Data
▶  The long tail of data variety is a major shift in the data landscape
▶  Need for scalable approaches to cope with data under different format and semantic assumptions

The Solution Space
▶  Lowering the usability barrier for data tools is a major requirement across all sectors. Users should be able to directly manipulate the data
▶  Blended human and algorithmic data processing approaches are a trend for coping with data acquisition, transformation, curation, access, and analysis challenges for Big Data
▶  Solutions based on large communities (crowd-based approaches) are emerging as a trend to cope with Big Data challenges
▶  Principled semantic and standardized data representation models are central to cope with data heterogeneity


BIG DATA CURATION INTERVIEW SERIES

http://big-project.eu/text-interviews

Future Interviews
More to come in 2014…