Text mining in CORE (OR2012)

Text mining in CORE

Petr Knoth, The Open University

Outline

• Introduction of the CORE system
• Three phases:
  • Metadata and content harvesting
  • Semantic enrichment
  • Providing services
• Supporting research in mining databases of scientific publications (DiggiCORE)

CORE objectives

• To provide a platform for the delivery of Open Access content aggregated from multiple sources and to deliver a wide range of services on top of this aggregation.

• A nation-wide aggregation system that will improve the discovery of publications stored in British Open Access Repositories (OARs).

CORE functionality

• Content harvesting, processing
• Semantic enrichment
• Providing services

CORE functionality

Content harvesting, processing

Growth of items in Open Access repositories

Growth of Open Access repositories

Green Open Access - statistics

Why do we need aggregations?

“Each individual repository is of limited value for research: the real power of Open Access lies in the possibility of connecting and tying together repositories, which is why we need interoperability. In order to create a seamless layer of content through connected repositories from around the world, Open Access relies on interoperability, the ability for systems to communicate with each other and pass information back and forth in a usable format. Interoperability allows us to exploit today's computational power so that we can aggregate, data mine, create new tools and services, and generate new knowledge from repository content.”

[COAR manifesto]

Aggregation in CORE

• OAI-PMH metadata harvesting
• Locating full-text
• Focused crawling (to locate full-texts)
• Focused crawling (driven by citation analysis)
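The first two steps above (harvesting metadata over OAI-PMH, then using it to locate full-texts) can be sketched as follows. This is a minimal illustration, not CORE's implementation: the function names, the sample record, and the `example.org` URLs are all assumptions made for the example. It only builds `ListRecords` request URLs and parses a response; fetching and crawling are left out.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"
DC_NS = "{http://purl.org/dc/elements/1.1/}"


def list_records_url(base_url, resumption_token=None, metadata_prefix="oai_dc"):
    """Build a ListRecords request URL for an OAI-PMH endpoint."""
    if resumption_token:
        # A resumption token is an exclusive argument: only the verb goes with it.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def parse_list_records(xml_text):
    """Return (records, resumption_token) from a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iter(OAI_NS + "record"):
        identifier = record.findtext(OAI_NS + "header/" + OAI_NS + "identifier")
        title = next((e.text for e in record.iter(DC_NS + "title")), None)
        # dc:identifier often carries a landing-page or full-text URL,
        # which is a natural seed for locating the full-text.
        links = [e.text for e in record.iter(DC_NS + "identifier") if e.text]
        records.append({"id": identifier, "title": title, "links": links})
    token = root.findtext(".//" + OAI_NS + "resumptionToken")
    return records, (token or None)


# Hypothetical response from a hypothetical repository.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An example paper</dc:title>
          <dc:identifier>http://example.org/papers/1.pdf</dc:identifier>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

records, token = parse_list_records(SAMPLE)
next_url = list_records_url("http://example.org/oai", resumption_token=token)
```

A harvester would keep requesting `next_url` until no resumption token is returned, then pass the collected links to the full-text locator.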

CORE functionality

Semantic enrichment

Aggregations need access to content, not just metadata!

• Certain metadata types can be created only at the level of the aggregation
• Certain metadata can change over time
• Ensuring content:
  • accessibility
  • availability
  • validity
  • quality
  • …

Semantic similarity and duplicate detection

• Cosine similarity calculated on tf-idf vectors extracted from full-text

[Knoth et al, COLING 2010; Knoth et al, IMMM 2011]
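As a rough illustration of cosine similarity over tf-idf vectors, the sketch below uses one common weighting (term frequency times log inverse document frequency). The exact weighting and tokenisation used in CORE are not given on the slide, so both are assumptions here, as are the toy documents.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Map each tokenised document to a sparse {term: weight} tf-idf vector."""
    n_docs = len(documents)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine of the angle between two sparse vectors (0.0 if either is all zeros)."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy corpus: the first two documents are exact duplicates.
docs = [
    "open access repositories aggregate research papers".split(),
    "open access repositories aggregate research papers".split(),
    "support vector machines classify text".split(),
]
vectors = tfidf_vectors(docs)
duplicate_score = cosine(vectors[0], vectors[1])
unrelated_score = cosine(vectors[0], vectors[2])
```

Pairs scoring near 1 are duplicate candidates, while lower but non-zero scores indicate related documents. Any cut-off between the two would need tuning on real data.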

Semantic similarity and duplicate detection

• Heuristics to reduce the number of combinations (problem with the query length)
• Cross-language linking tests [Knoth et al, NTCIR-9 CrossLink 2011; Knoth et al, IJC-NLP CLIA 2011]

Information extraction, citation parsing and target recognition

• ParsCit tool (based on CRF) for extraction of reference sections
• Levenshtein distance used for target detection
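Target detection with Levenshtein distance can be sketched as below: a title parsed from a reference is compared against the titles already in the collection, and the closest one is accepted if it is near enough. The normalisation step and the `max_ratio` threshold are hypothetical choices for this example. The slide does not say how CORE sets its cut-off.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def normalise(title):
    """Lower-case and collapse whitespace so formatting differences do not count."""
    return " ".join(title.lower().split())


def find_target(cited_title, known_titles, max_ratio=0.2):
    """Return the closest known title, or None if even the closest is too far.

    max_ratio is a hypothetical threshold on distance / length of the longer title.
    """
    query = normalise(cited_title)
    best_title, best_distance = None, None
    for title in known_titles:
        distance = levenshtein(query, normalise(title))
        if best_distance is None or distance < best_distance:
            best_title, best_distance = title, distance
    if best_title is None:
        return None
    longest = max(len(query), len(normalise(best_title)), 1)
    return best_title if best_distance / longest <= max_ratio else None


known = ["Text mining in CORE", "Growth of Open Access repositories"]
match = find_target("Text minning in  CORE", known)   # one typo, extra space
no_match = find_target("Completely unrelated", known)
```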

Text categorisation

• 17 top-level DOAJ classes (http://www.doaj.org/doaj?func=browse&uiLanguage=en)
• 1080 examples
• SVM multiclass
• 10-fold cross-validation
• 91.4% accuracy
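The evaluation protocol (10-fold cross-validation) can be sketched independently of the classifier. With 1080 examples, each of the 10 folds holds 108 examples that are tested once while the other 972 are used for training. In the sketch below the classifier is a deliberately simple placeholder, not an SVM, and the toy data and class names are invented for illustration.

```python
import random
from collections import Counter


def k_fold_splits(n_examples, k=10, seed=0):
    """Yield (train_indices, test_indices) for k roughly equal, disjoint folds."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def cross_validated_accuracy(examples, labels, train_fn, k=10):
    """Fraction of examples classified correctly when each is tested exactly once.

    train_fn(examples, labels) must return a predict(example) function.
    """
    correct = 0
    for train, test in k_fold_splits(len(examples), k):
        predict = train_fn([examples[i] for i in train], [labels[i] for i in train])
        correct += sum(predict(examples[i]) == labels[i] for i in test)
    return correct / len(examples)


def train_term_votes(examples, labels):
    """Placeholder classifier (NOT an SVM): each term votes for labels it was seen with."""
    votes = {}
    for tokens, label in zip(examples, labels):
        for term in set(tokens):
            votes.setdefault(term, Counter())[label] += 1
    fallback = Counter(labels).most_common(1)[0][0]

    def predict(tokens):
        tally = Counter()
        for term in set(tokens):
            tally.update(votes.get(term, {}))
        return tally.most_common(1)[0][0] if tally else fallback

    return predict


# Invented toy data: two classes with disjoint vocabularies.
examples = [["gene", "protein"]] * 10 + [["court", "law"]] * 10
labels = ["Biology"] * 10 + ["Law"] * 10
accuracy = cross_validated_accuracy(examples, labels, train_term_votes, k=10)
```

Swapping `train_term_votes` for a function that trains a multiclass SVM would reproduce the protocol named on the slide.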

CORE functionality

Providing services

Who should be supported by aggregations?

The following user groups (divided according to the level of abstraction of information they need):

• Raw data access. Developers, DLs, DL researchers, companies …
• Transaction information access. Researchers, students, life-long learners …
• Analytical information access. Funders, government, business intelligence

Should a single aggregation system support all three user types?

It can be realised by more than one system, provided that the dataset is the same!

CORE applications

• CORE Portal
• CORE Mobile
• CORE Plugin
• CORE API
• Repository Analytics

The following user groups (divided according to the level of abstraction of information they need), with the CORE applications serving each:

• Raw data access. Developers, DLs, DL researchers, companies … → CORE API
• Transaction information access. Researchers, students, life-long learners … → CORE Portal, CORE Mobile, CORE Plugin
• Analytical information access. Funders, government, business intelligence → Repository Analytics

CORE Applications

CORE API – Enables external systems and services to interact with the CORE repository.

• Search service
• PDF and plain text service
• Similarity service
• Classification service
• Citation service

CORE Applications

CORE Portal – Allows searching and navigating scientific publications aggregated from Open Access repositories

Snippets

CORE Applications

CORE Mobile – Allows searching and navigating scientific publications aggregated from Open Access repositories

CORE Applications

CORE Plugin – A system plugin that provides recommendations for related items.

CORE Applications

Repository Analytics – An analytical tool supporting providers of Open Access content (in particular repository managers).

CORE statistics

• Content
  • 7M records
  • 230 repositories
  • 402k full-texts
  • 1 TB of data
  • 40 GB index
  • 35 million RDF triples in the CORE LOD repository
• Started: February 2011
• Budget: £140k

Outline

• Introduction of the CORE system
• Three phases:
  • Metadata and content harvesting
  • Semantic enrichment
  • Providing services
• Supporting research in mining databases of scientific publications (DiggiCORE)

DiggiCORE objective

Software for exploration and analysis of very large and fast-growing amounts of research publications stored across Open Access Repositories (OAR).

DiggiCORE networks

Three networks: (a) semantically related papers, (b) citation network, (c) author citation network
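To make the relation between networks (b) and (c) concrete, the sketch below derives an author citation network from a paper citation network plus authorship data. The data layout (plain dictionaries) and the paper and author names are hypothetical. How DiggiCORE actually stores or weights these networks is not stated on the slides.

```python
from collections import defaultdict


def author_citation_network(paper_citations, paper_authors):
    """Derive weighted author -> author edges from paper -> paper citations.

    Each citation from paper X to paper Y adds 1 to the edge weight
    (a, b) for every author a of X and every author b of Y.
    """
    edges = defaultdict(int)
    for citing, cited_papers in paper_citations.items():
        for cited in cited_papers:
            for source in paper_authors.get(citing, []):
                for target in paper_authors.get(cited, []):
                    edges[(source, target)] += 1
    return dict(edges)


# Hypothetical papers and authors.
paper_citations = {"p1": ["p2"], "p3": ["p2", "p1"]}
paper_authors = {"p1": ["Ann", "Bob"], "p2": ["Cy"], "p3": ["Ann"]}
author_edges = author_citation_network(paper_citations, paper_authors)
```

Here `("Ann", "Ann")` appears as an edge because p3 cites p1 and Ann wrote both. Whether such self-citations are kept or filtered is a design choice for the analysis.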

DiggiCORE objectives

Allow researchers to use this platform to analyse publications. Why?

• To identify patterns in the behaviour of research communities
• To detect trends in research disciplines
• To gain new insights into the citation behaviour of researchers
• To discover features that distinguish papers with high impact

Summary

• The rapid growth of OA content provides a great opportunity for text mining.
• Aggregations need to aggregate content, not just metadata.
• Aggregations should serve the needs of different user groups, including researchers who need access to data. CORE aims to support them.
• We can have many services that are part of the infrastructure, but they should work with the same data.

Thank you!
