TRANSCRIPT
Content Provision
Lucas Anastasiou, The Open University, Knowledge Media Institute
18-06-2015, Rhodes, Greece
A simple text mining exercise
Average length of dissertations (doctoral and master's theses) by major
3,037 records from the University of Minnesota
https://beckmw.wordpress.com/2014/07/15/average-dissertation-and-thesis-length-take-two/
• R script to scrape PDFs from the institutional repository
• Extract text
• Parse text
• Plot the data
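The original exercise used an R script; the "parse and aggregate" step can be sketched in Python as below. The sample records are illustrative only, not the real Minnesota data, and `average_length_by_major` is a hypothetical helper name.

```python
from collections import defaultdict

def average_length_by_major(records):
    """Average page count per major from (major, pages) records."""
    totals = defaultdict(lambda: [0, 0])  # major -> [sum_of_pages, count]
    for major, pages in records:
        totals[major][0] += pages
        totals[major][1] += 1
    return {major: s / n for major, (s, n) in totals.items()}

# Illustrative records only -- not the real University of Minnesota data.
sample = [
    ("History", 300), ("History", 260),
    ("Physics", 110), ("Physics", 130),
]
print(average_length_by_major(sample))
# {'History': 280.0, 'Physics': 120.0}
```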
The challenge of TDM
• 90% of a TDM project [1] is spent on:
– Collecting data
– Harmonising data
– Pre-processing data
• Magnitude of data
• Heterogeneity of data
[1] Jisc Open Mirror Report Oct 2013
The case of scientific literature
• Identifying levels of access:
– Transactional information access
– Analytical information access
– Raw data (programmatic) access
• Google Scholar estimated at 100 million papers
3 levels of access
How are the “big” guys doing?
Access type   | Google Scholar                         | MS Academic Research
Transactional | Browser interface (portal)             | Browser interface (portal)
Analytical    | Citation analysis, researcher profiles | Visualisations, citation analysis, author connections
Raw data      | No API; scraping possible (violates the ToS) | Limited API; downloading the full corpus explicitly forbidden
Other scholarly data sources
(access offered per service: Transactional / Analytical / Raw)
• CiteSeerX • PubMed • arXiv • Scopus • Web of Science • SpringerLink • Elsevier
The case of Elsevier API
• Sufficient for some tasks but not all; no access to the full corpus
• Restricts the usability of the API and controls access
• Potential loss of information (what you see in the portal differs from the API)
• (May) require special dispensation from authors
“It takes a lot of time and a lot of energy and doesn’t scale at all”
Heather Piwowar
Is this enough for the TDM community?
• If you are a TDM-er, you need true, unrestricted, programmatic access
• APIs:
– Offer programmatic access to individual articles
– Offered in XML/JSON
– Lack a standard schema; providers use proprietary formats
– (May be) sufficient for some TDM tasks (e.g. information extraction)
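As an illustration of such API access: arXiv's public API returns Atom XML, which can be parsed per article. The feed below is a minimal illustrative snippet in the Atom namespace, not a captured real response, and `titles` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Minimal Atom-style feed, the format returned by e.g. the arXiv API.
# Illustrative snippet only, not a real captured response.
SAMPLE = """<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>An Example Paper</title>
    <summary>Abstract text goes here.</summary>
  </entry>
</feed>"""

NS = {"atom": "http://www.w3.org/2005/Atom"}

def titles(feed_xml):
    """Extract the title of every entry in an Atom feed."""
    root = ET.fromstring(feed_xml)
    return [entry.findtext("atom:title", namespaces=NS)
            for entry in root.findall("atom:entry", NS)]

print(titles(SAMPLE))  # ['An Example Paper']
```

Note the provider-specific part: every API wraps its records in its own schema, so this parsing code does not transfer between providers without changes.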
APIs not enough
• Another family of TDM tasks requires access to the full corpus
– E.g. recommender systems, text summarisation
• Need access to the complete collection of articles
• Data dumps
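Why such tasks need the whole collection can be seen in a minimal TF-IDF similarity sketch (a common basis for recommenders): the IDF weights depend on document frequencies across *every* document, so no per-article API call can supply them. The corpus and helper names below are illustrative assumptions.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """TF-IDF vectors; the IDF term needs statistics over the full corpus."""
    tokenised = [doc.lower().split() for doc in docs]
    n = len(tokenised)
    df = Counter(term for doc in tokenised for term in set(doc))
    idf = {t: math.log(n / df[t]) for t in df}
    return [{t: c * idf[t] for t, c in Counter(doc).items()}
            for doc in tokenised]

def cosine(u, v):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

corpus = [  # toy stand-in for a full article collection
    "text mining of scholarly articles",
    "mining scholarly text collections",
    "deep learning for image recognition",
]
vecs = tfidf_vectors(corpus)
# Document 0 should be more similar to document 1 than to document 2.
print(cosine(vecs[0], vecs[1]) > cosine(vecs[0], vecs[2]))  # True
```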
Data dumps not enough
• Even if you can access the whole corpus, you need special hardware resources
• Arxiv.org data dump, compressed: 300 GB
– How do you get it?
– Where do you store it? (*)
– How do you analyse it?
– How do you disseminate your findings?
– In what format?
– How can someone else verify your findings?
Research should be reproducible!
Non-technological barriers
• Legal uncertainty
– Copyright, database rights, licensing
– (Some) publishers require special dispensation
• Skills gap
– Researchers lack awareness of TDM's potential
Summary
• Collecting data is a tedious and time-consuming task, perhaps impossible
• Scientific literature lacks a programmatic level of access
• APIs and data dumps (though nice) are not enough
• Other barriers
OpenMinTeD vision
• Data and algorithms in one place
• Interoperable framework
• “Safe” environment (legal status)
• Trusted environment
Thank you!
Q & A
TDM-ing today
…