TRANSCRIPT
Content Provision
Lucas Anastasiou, The Open University, Knowledge Media Institute
18-06-2015, Rhodes, Greece
A simple text mining exercise
Average length of dissertations (doctoral and master's theses) by major
3,037 records from the University of Minnesota
https://beckmw.wordpress.com/2014/07/15/average-dissertation-and-thesis-length-take-two/
• R script to scrape PDFs from the institutional repository
• Extract text
• Parse text
• Plot the data
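The original exercise used an R script; the "parse and aggregate" step can be sketched in Python as below. The sample records are illustrative only, not the real Minnesota data, and `average_length_by_major` is a hypothetical helper name.

```python
from collections import defaultdict

def average_length_by_major(records):
    """Average page count per major from (major, pages) records."""
    totals = defaultdict(lambda: [0, 0])  # major -> [sum_of_pages, count]
    for major, pages in records:
        totals[major][0] += pages
        totals[major][1] += 1
    return {major: s / n for major, (s, n) in totals.items()}

# Illustrative records only -- not the real University of Minnesota data.
sample = [
    ("History", 300), ("History", 260),
    ("Physics", 110), ("Physics", 130),
]
print(average_length_by_major(sample))
# {'History': 280.0, 'Physics': 120.0}
```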
The challenge of TDM
• 90% of a TDM project [1] is spent on:
– Collecting data
– Harmonising data
– Pre-processing data
• Magnitude of data
• Heterogeneity of data
[1] Jisc Open Mirror Report Oct 2013
The case of scientific literature
• Identifying levels of access:
– Transactional information access
– Analytical information access
– Raw data (programmatic) access
• Google Scholar estimated at 100 million papers
3 levels of access
How are the “big” guys doing?
Access type   | Google Scholar                         | MS Academic Research
Transactional | Browser interface (portal)             | Browser interface (portal)
Analytical    | Citation analysis, researcher profiles | Visualisations, citation analysis, author connections
Raw data      | No API; scraping possible (violates the ToS) | Limited API; downloading the full corpus explicitly forbidden
Other scholarly data sources
(access offered per service: Transactional / Analytical / Raw)
• CiteSeerX • PubMed • arXiv • Scopus • Web of Science • SpringerLink • Elsevier
The case of Elsevier API
• Sufficient for some tasks but not all; no access to the full corpus
• Restricts the usability of the API and controls access
• Potential loss of information (what you see in the portal differs from the API)
• (May) require special dispensation from authors
“It takes a lot of time and a lot of energy and doesn’t scale at all”
Heather Piwowar
Is this enough for the TDM community?
• If you are a TDM-er, you need true, unrestricted, programmatic access
• APIs:
– Offer programmatic access to individual articles
– Offered in XML/JSON
– Lack a standard schema; providers use proprietary formats
– (May be) sufficient for some TDM tasks (e.g. information extraction)
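As an illustration of such API access: arXiv's public API returns Atom XML, which can be parsed per article. The feed below is a minimal illustrative snippet in the Atom namespace, not a captured real response, and `titles` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Minimal Atom-style feed, the format returned by e.g. the arXiv API.
# Illustrative snippet only, not a real captured response.
SAMPLE = """<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>An Example Paper</title>
    <summary>Abstract text goes here.</summary>
  </entry>
</feed>"""

NS = {"atom": "http://www.w3.org/2005/Atom"}

def titles(feed_xml):
    """Extract the title of every entry in an Atom feed."""
    root = ET.fromstring(feed_xml)
    return [entry.findtext("atom:title", namespaces=NS)
            for entry in root.findall("atom:entry", NS)]

print(titles(SAMPLE))  # ['An Example Paper']
```

Note the provider-specific part: every API wraps its records in its own schema, so this parsing code does not transfer between providers without changes.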
APIs not enough
• Another family of TDM tasks requires access to the full corpus
– E.g. recommender systems, text summarisation
• Need access to the complete collection of articles
• Data dumps
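Why such tasks need the whole collection can be seen in a minimal TF-IDF similarity sketch (a common basis for recommenders): the IDF weights depend on document frequencies across *every* document, so no per-article API call can supply them. The corpus and helper names below are illustrative assumptions.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """TF-IDF vectors; the IDF term needs statistics over the full corpus."""
    tokenised = [doc.lower().split() for doc in docs]
    n = len(tokenised)
    df = Counter(term for doc in tokenised for term in set(doc))
    idf = {t: math.log(n / df[t]) for t in df}
    return [{t: c * idf[t] for t, c in Counter(doc).items()}
            for doc in tokenised]

def cosine(u, v):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

corpus = [  # toy stand-in for a full article collection
    "text mining of scholarly articles",
    "mining scholarly text collections",
    "deep learning for image recognition",
]
vecs = tfidf_vectors(corpus)
# Document 0 should be more similar to document 1 than to document 2.
print(cosine(vecs[0], vecs[1]) > cosine(vecs[0], vecs[2]))  # True
```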
Data dumps not enough
• Even if you can access the whole corpus, you need special hardware resources
• Arxiv.org data dump, compressed: 300 GB
– How do you get it?
– Where do you store it? (*)
– How do you analyse it?
– How do you disseminate your findings?
– In what format?
– How can someone else verify your findings?
Research should be reproducible!
Non-technological barriers
• Legal uncertainty
– Copyright, database rights, licensing
– (Some) publishers require special dispensation
• Skills gap
– Researchers lack awareness of TDM's potential
Summary
• Collecting data is a tedious and time-consuming task, perhaps impossible
• Scientific literature lacks a programmatic level of access
• APIs and data dumps (though nice) are not enough
• Other barriers
OpenMinTeD vision
• Data and algorithms in one place
• Interoperable framework
• “Safe” environment (legal status)
• Trusted environment
Thank you!
Q & A
TDM-ing today
…