open minted content_provision

Content Provision Lucas Anastasiou The Open University, Knowledge Media Institute 18-06-2015 Rhodes, Greece, 2015

Upload: anastasiou-lucas

Post on 16-Apr-2017


TRANSCRIPT

Page 1: Open minted content_provision

Content Provision

Lucas Anastasiou
The Open University, Knowledge Media Institute

18-06-2015
Rhodes, Greece, 2015

Page 2: Open minted content_provision

A simple text mining exercise

Average length of dissertations (doctoral dissertations and master's theses) by major

3037 records from University of Minnesota

https://beckmw.wordpress.com/2014/07/15/average-dissertation-and-thesis-length-take-two/

Page 3: Open minted content_provision

A simple text mining exercise

• R script to scrape PDFs from the institutional repository
• Extract text
• Parse text
• Plot the data
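The parse-and-aggregate part of this exercise can be sketched in a few lines. The original exercise used an R script; the version below is a minimal Python illustration, not the author's code. The record format (a major plus a free-text physical description containing a page count), the regular expression, and the sample records are all made up for the example. Scraping and plotting are left out.

```python
import re
from collections import defaultdict
from statistics import mean

# Assumed (hypothetical) description format, e.g. "1 online resource (vi, 300 pages)".
PAGES_RE = re.compile(r"(\d+)\s*(?:pages|p\.)", re.IGNORECASE)


def parse_page_count(description):
    """Return the page count found in a description string, or None."""
    match = PAGES_RE.search(description)
    return int(match.group(1)) if match else None


def average_length_by_major(records):
    """Map each major to the mean page count of its parsable records."""
    pages_by_major = defaultdict(list)
    for major, description in records:
        pages = parse_page_count(description)
        if pages is not None:  # skip records with no recognisable page count
            pages_by_major[major].append(pages)
    return {major: mean(counts) for major, counts in pages_by_major.items()}


# Made-up sample records, for illustration only.
records = [
    ("History", "1 online resource (vi, 300 pages)"),
    ("History", "1 online resource (x, 280 pages)"),
    ("Mathematics", "1 online resource (iv, 90 pages)"),
    ("Mathematics", "description unavailable"),
]
averages = average_length_by_major(records)
```

Even in this toy version, one of the four records cannot be parsed, which hints at how much of the work goes into cleaning rather than analysis.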

Page 4: Open minted content_provision

The challenge of TDM

• 90% of a TDM project [1] is spent on:
– Collecting data
– Harmonising data
– Pre-processing data
• Magnitude of data
• Heterogeneity of data

[1] Jisc Open Mirror Report Oct 2013

Page 5: Open minted content_provision

The case of scientific literature

• Identifying levels of access
– Transactional information access
– Analytical information access
– Raw data (programmatic) access
• Google Scholar estimated at 100 million papers

Page 6: Open minted content_provision

3 levels of access

How are the “big” guys doing?

Access type | Google Scholar | MS Academic Search
Transactional | Browser interface (portal) | Browser interface (portal)
Analytical access | Citation analysis, researcher profile | Visualisations, citation analysis, author connections
Raw data access | No API, scraping possible (violation of the terms of service) | Limited API, explicitly forbidden to download full corpus

Page 7: Open minted content_provision

Other scholarly data sources

Table columns: Name of service, Transactional, Analytical, Raw
Services listed: CiteSeerX, PubMed, arXiv, Scopus, Web of Science, SpringerLink, Elsevier

Page 8: Open minted content_provision

The case of Elsevier API

• Sufficient for some tasks but not for all; no access to the full corpus

• Usability of the API is restricted and access is controlled

• Potential loss of information (what you see in the portal differs from what the API returns)

• May require special dispensation from authors

“It takes a lot of time and a lot of energy and doesn’t scale at all”

Heather Piwowar

Page 9: Open minted content_provision

Is this enough for the TDM community?

• If you are a TDM-er, you need truly unrestricted programmatic access
• APIs
– Offer programmatic access to individual articles
– Offered in XML/JSON
– Lack a standard schema; providers use proprietary formats
– May be sufficient for some TDM tasks (e.g. information extraction)
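The lack of a standard schema means each provider needs its own adapter before any mining can start. The sketch below illustrates that harmonisation step for two invented JSON layouts. The providers "a" and "b", their field names, and the target schema are hypothetical and do not describe any real publisher API.

```python
import json


def from_provider_a(raw):
    """Hypothetical provider A: flat JSON with 'title' and 'fulltext' keys."""
    doc = json.loads(raw)
    return {"title": doc["title"], "text": doc["fulltext"]}


def from_provider_b(raw):
    """Hypothetical provider B: nested JSON, body split into sections."""
    doc = json.loads(raw)
    sections = doc["body"]["sections"]
    return {"title": doc["metadata"]["name"], "text": "\n".join(sections)}


# One adapter per provider; every new proprietary format adds another entry.
ADAPTERS = {"a": from_provider_a, "b": from_provider_b}


def normalise(provider, raw):
    """Convert a provider-specific payload into the common {title, text} schema."""
    return ADAPTERS[provider](raw)


# Made-up payloads, for illustration only.
payload_a = '{"title": "Paper one", "fulltext": "Some text."}'
payload_b = '{"metadata": {"name": "Paper two"}, "body": {"sections": ["Intro.", "Method."]}}'

record_a = normalise("a", payload_a)
record_b = normalise("b", payload_b)
```

Both records end up with the same keys, so downstream code only has to handle one shape. The cost is that the adapter table grows with every provider.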

Page 10: Open minted content_provision

APIs not enough

• Other families of TDM tasks require access to the full corpus
– E.g. recommender systems, text summarisation
• They need access to the complete collection of articles
• Data dumps

Page 11: Open minted content_provision

Data dumps not enough

• Even if you can access the whole corpus, you would need special hardware resources
• Arxiv.org data dump, compressed: 300 GB
– How do you get it?
– Where to store it? (*)
– How to analyse it?
– How to disseminate your findings?
– In what format?
– How can someone else verify your findings?

Research should be reproducible!
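One partial answer to the storage and analysis questions is to stream through a compressed dump one document at a time instead of unpacking it. The sketch below assumes the dump is a gzipped tar archive of UTF-8 text files, which is an assumption made for illustration and not a description of the actual arXiv dump. The helper `build_demo_archive` and the two sample documents exist only to make the example self-contained.

```python
import io
import tarfile


def iter_documents(fileobj):
    """Yield (name, text) for each regular file in a gzipped tar stream."""
    # Mode "r|gz" reads the archive sequentially, so members are handled
    # one at a time without extracting the whole dump to disk.
    with tarfile.open(fileobj=fileobj, mode="r|gz") as archive:
        for member in archive:
            if not member.isfile():
                continue
            extracted = archive.extractfile(member)
            yield member.name, extracted.read().decode("utf-8", errors="replace")


def count_tokens(fileobj):
    """Count whitespace-separated tokens across all documents in the dump."""
    total = 0
    for _name, text in iter_documents(fileobj):
        total += len(text.split())
    return total


def build_demo_archive(documents):
    """Build a small in-memory .tar.gz (demo helper, not part of any real dump)."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for name, text in documents.items():
            data = text.encode("utf-8")
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            archive.addfile(info, io.BytesIO(data))
    buffer.seek(0)
    return buffer


demo = build_demo_archive({
    "paper1.txt": "text and data mining",
    "paper2.txt": "open access",
})
total_tokens = count_tokens(demo)
```

Streaming keeps memory and disk use bounded by the largest single document, but it does not address how to obtain the dump or how others can verify the result.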

Page 12: Open minted content_provision

Non-technological barriers

• Legal uncertainty
– Copyright, database rights, licensing
– (Some) publishers require special dispensation
• Skills gap
– Researchers' lack of awareness of TDM's potential

Page 13: Open minted content_provision

Summary

• Collecting data is a tedious and time-consuming task, perhaps impossible

• Scientific literature lacks a programmatic level of access

• APIs and data dumps, though nice, are not enough

• Other barriers

Page 14: Open minted content_provision

OpenMinTeD vision

• Data and algorithms in one place
• Interoperable framework
• “Safe” environment (legal status)
• Trusted environment

Page 15: Open minted content_provision

Thank you!

Q & A

Page 16: Open minted content_provision

TDM-ing today