
Intelligent Tools for a Digital Library

Debarshi Kumar Sanyal

National Digital Library of India, IIT Kharagpur

Smart and Green Information and Communication Technology (SGICT), Short Term Course (STC),

Jadavpur University, 17 May 2018

Digital Library

• Repository with discovery service of digital resources

– Usually books, articles, digitized manuscripts, historical records, newspapers, photographs, paintings, maps, music, films, question papers, syllabi, presentations, audio-visual lectures, software, datasets

– Can contain only metadata or metadata + content

• Maintained by publishers, educational institutes, governments, non-government agencies, individuals, etc.

• Some are reservoirs of Big Scholarly Data

24 x 7-ENABLED IMMERSIVE E-LEARNING FOR ALL LEARNERS AT ALL LEVELS IN ALL AREAS

https://ndl.iitkgp.ac.in/

National Digital Library of India (NDLI)

• Educational digital library

• Contains mainly metadata

• Contains > 17M resources

• Contains metadata of books, research papers, theses, audio-visual lectures, software, datasets, syllabi, question papers and model answers, etc.

• Content in more than 100 languages, including English, Bengali, Hindi, Tamil, Telugu, Marathi, etc.

• Keyword-based search and advanced faceted search

21 Mar 2016

Some Research Areas

• Metadata engineering

– Automatic metadata extraction

– Author name disambiguation

• Search and retrieval

– Surrogates for access-restricted resources

– Semantic search

– Figure search

– Customized viewer

– Recommender systems

NDLI Research

Metadata Engineering

Metadata Acquisition

• NDLI is a huge metadata repository

• Currently, metadata is acquired in a semi-automatic manner.

– Collected from publishers / libraries

– Manual and automatic post-processing done to adapt to NDLI schema

Automatic Metadata Extraction

• Can we extract metadata automatically for NDLI?

• Challenges

– Various resource types

– Multiple languages

– OCR needed for digitized resources

– Variable scan quality

– Unsatisfactory OCR quality for most Indic languages

– Hard to produce semantic metadata like pedagogical objective

Metadata from Scientific Papers

Developed by Sumana Dey, Staff @ NDLI

"figure": [
  {"caption": "Fig. 2: The Framework of Cloud Computing", "page": "2", "path": "img/Figure2-1.png"},
  {"caption": "Fig. 3: Architecture of Mobile Cloud Computing", "page": "3", "path": "img/Figure3-1.png"},
  {"caption": "Fig. 1: Mobile Cloud Computing", "page": "0", "path": "img/Figure1-1.png"}],
"table": [
  {"caption": "TABLE I: Challenges and Solutions of Mobile Cloud Computing", "page": "3", "path": "img/TableI-1.png"}],
"metadata": {
  "dc.title": "A Review on Mobile Cloud Computing: Issues, Challenges and Solutions",
  "dc.title.alternative": [],
  "dc.contributer.author": ["Mandeep Kaur Saggi", "Amandeep Singh Bhatia"],
  "dc.contributer.editor": [],
  "dc.language.iso": ["en"],
  "dc.description.abstract": "Mobile Cloud Computing (MCC) is a combination of mobile computing and cloud computing. It has become one of the Major Research issue in the industry. Although there are so, many research studies in mobile computing and cloud computing, convergence of these two areas grant further academic efforts towards flourishing MCC. …

Metadata Extraction from Books

Author Name Ambiguity

• Same author might write under different names

• Same name might refer to multiple authors (a few surnames are very common in South Asia)

• Other sources of noise are spelling mistakes, abbreviated names, etc.


Author Name Disambiguation (AND)

• Author names in NDLI are currently treated as strings.

– Therefore, hyperlinking author names to respective works not feasible

– Author centric analytics not straightforward

– For research papers, citation tracking not feasible

• Can we disambiguate the author names in NDLI?

• Challenges

– A few surnames are very common in South Asia

– Authors from multiple cultural backgrounds have different name conventions

– Multilingual resources

– Metadata often too sparse to disambiguate.

AND: Formal problem definition (1/2)

• We are given a set of M metadata records C = {c1, c2, …, cM}. Each metadata record ci contains at least the author names and the work title. It can also contain author affiliations, author email IDs, venue of publication, abstract, keywords, references in the article, etc.

– Example of a metadata record

– Manning, Christopher, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven Bethard, and David McClosky. "The Stanford CoreNLP natural language processing toolkit." In Proceedings of 52nd annual meeting of the association for computational linguistics: system demonstrations, pp. 55-60. 2014.

AND: Formal problem definition (2/2)

• Each author name is a reference to a real author. From the metadata records, we have to extract author references R = {r1, r2, …, rN} (N ≥ M).

• Our next goal is to map R into K disjoint clusters A = {a1, a2, …, aK} (N ≥ K) such that each cluster ai contains all, and only, the references to the same real author. K may not be known a priori.

[Figure: author references r1–r4 ("S Chattopadhyay") and r5–r7 ("P Smith") in R are mapped to author clusters a1–a4 in A]
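The definitions of C, R and A can be made concrete with a small sketch. It is only an illustration under assumptions: the record layout (`title`, `authors`), the `Reference` class and the paper titles are hypothetical, and the clustering shown is chosen by hand rather than computed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reference:
    """One author reference r: a name occurrence inside one metadata record."""
    record_id: int  # index i of the metadata record c_i (hypothetical field)
    position: int   # position of the name in that record's author list
    name: str       # author name string exactly as it appears in the record


def extract_references(records):
    """Flatten the records C into the reference list R (one per author name)."""
    return [
        Reference(i, j, name)
        for i, record in enumerate(records)
        for j, name in enumerate(record["authors"])
    ]


def is_valid_clustering(references, clusters):
    """Check that the clusters are pairwise disjoint and together cover R."""
    seen = [ref for cluster in clusters for ref in cluster]
    return len(seen) == len(set(seen)) and set(seen) == set(references)


# Hypothetical records; every record has at least one author, so N >= M.
records = [
    {"title": "Paper A", "authors": ["S Chattopadhyay", "P Smith"]},
    {"title": "Paper B", "authors": ["S Chattopadhyay"]},
]
refs = extract_references(records)

# A hand-made clustering A: here the two "S Chattopadhyay" references are
# assumed to be different people, so K = 3 clusters.
clusters = [{refs[0]}, {refs[2]}, {refs[1]}]
print(len(records), len(refs), is_valid_clustering(refs, clusters))
```

Treating each reference as a (record, position, name) triple keeps identical name strings from different records distinct, which is what allows them to land in different clusters.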

Author Name Disambiguation: Techniques

• Approaches

– Author grouping

– Author assignment

• Machine learning methods

– Unsupervised

– Supervised

• Evidence explored

– Citation information

– Web information

– Implicit information

Author Name Disambiguation: Example

Author Name Disambiguation on

• Create author blocks

– A block contains authors with the same last name and first initial (LN-FI)

• Within a block, create a similarity profile of a pair of papers using metadata like

– Author name, author email id, author affiliation, co-author names (LN-FI), journal name, year of publication, MeSH terms, etc.

• Train a random forest to

– Output 1 if the similarity profile comes from two papers by the same real-world person

– Output 0 otherwise

• Use the trained classifier on test instances in each block

Treeratpituk, Pucktada, and C. Lee Giles. "Disambiguating authors in academic publications using random forests." Proc. ACM/IEEE-CS JCDL, ACM, 2009.
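The blocking and similarity-profile steps listed above could look roughly like the sketch below. It is not the implementation from the cited paper: the record fields, the chosen features and the sample data are assumptions, names are assumed to be in "first last" order, and the classifier itself is left out. The resulting profiles are the vectors one would pass to a random forest (for example scikit-learn's `RandomForestClassifier`).

```python
from collections import defaultdict
from itertools import combinations


def block_key(name):
    """LN-FI key: last name plus first initial ('S Chattopadhyay' -> 'chattopadhyay_s')."""
    parts = name.replace(".", " ").split()
    return f"{parts[-1].lower()}_{parts[0][0].lower()}"


def build_blocks(references):
    """Group author references that share the same LN-FI key."""
    blocks = defaultdict(list)
    for ref in references:
        blocks[block_key(ref["name"])].append(ref)
    return dict(blocks)


def jaccard(a, b):
    """Overlap of two collections, from 0.0 (disjoint) to 1.0 (identical)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def similarity_profile(r1, r2):
    """Feature vector for a pair of references from the same block."""
    return [
        float(r1["name"].lower() == r2["name"].lower()),                # exact name match
        float(r1["affiliation"].lower() == r2["affiliation"].lower()),  # same affiliation
        jaccard(map(block_key, r1["coauthors"]),
                map(block_key, r2["coauthors"])),                       # shared co-authors (LN-FI)
        float(r1["venue"].lower() == r2["venue"].lower()),              # same journal
        float(abs(r1["year"] - r2["year"])),                            # gap in publication years
    ]


# Hypothetical references, one per (paper, author name) occurrence.
refs = [
    {"name": "S Chattopadhyay", "affiliation": "University X",
     "coauthors": ["P Smith"], "venue": "Journal A", "year": 2015},
    {"name": "S. Chattopadhyay", "affiliation": "University X",
     "coauthors": ["P. Smith", "A Roy"], "venue": "Journal B", "year": 2017},
    {"name": "P Smith", "affiliation": "University Y",
     "coauthors": ["S Chattopadhyay"], "venue": "Journal A", "year": 2015},
]

blocks = build_blocks(refs)
profiles = {
    key: [similarity_profile(a, b) for a, b in combinations(block, 2)]
    for key, block in blocks.items()
}
print(profiles)
```

Blocking keeps the number of pairs manageable: profiles are computed only within a block, never across blocks.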

Search & Retrieval

Surrogates for Access-Restricted Scholarly Articles

• NDLI stores metadata for IEEE/ACM/Springer publications.

• Access to the full text requires a subscription to the container library.

• Sometimes authors deposit a free version on a preprint server; sometimes an open-access conference paper closely resembles a journal version behind a paywall.

– Call a pair of very similar documents by the same author(s) surrogates

– Retrieve surrogates when original paper is access-restricted
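One simple way to test whether two documents are "very similar and by the same author(s)" is sketched below. This is an assumption for illustration, not the method used by Surrogator: it compares title tokens with Jaccard similarity, requires at least one shared author last name, and uses a hypothetical threshold of 0.6 and made-up records.

```python
import re


def tokens(text):
    """Lower-cased alphanumeric tokens of a title."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def title_similarity(t1, t2):
    """Jaccard similarity between the token sets of two titles."""
    a, b = tokens(t1), tokens(t2)
    return len(a & b) / len(a | b) if a | b else 0.0


def last_names(authors):
    """Last names of authors, assuming 'first last' name order."""
    return {a.split()[-1].lower() for a in authors}


def find_surrogates(restricted, open_access, threshold=0.6):
    """Return (score, title) of open-access documents that may act as surrogates."""
    candidates = []
    for doc in open_access:
        # Require at least one shared author before comparing titles.
        if not last_names(restricted["authors"]) & last_names(doc["authors"]):
            continue
        score = title_similarity(restricted["title"], doc["title"])
        if score >= threshold:
            candidates.append((score, doc["title"]))
    return sorted(candidates, reverse=True)


# Hypothetical metadata records.
restricted = {"title": "Efficient Indexing for Large Digital Libraries",
              "authors": ["A Roy", "B Sen"]}
open_access = [
    {"title": "Efficient Indexing for Large Digital Libraries (Extended Version)",
     "authors": ["A Roy", "B Sen"]},
    {"title": "Efficient Indexing for Large Digital Libraries",
     "authors": ["C Das"]},                      # same title, different author
    {"title": "A Survey of Recommender Systems",
     "authors": ["A Roy"]},                      # same author, unrelated title
]
surrogates = find_surrogates(restricted, open_access)
print(surrogates)
```

A real system would likely compare abstracts or full text as well, since titles alone can differ between a preprint and its published version.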

Surrogator: Interface

Surrogator: Architecture

Surrogator

Santosh, T Y S S, D K Sanyal, and P K Bhowmick. “Surrogator: Enriching a Digital Library with Surrogate Resources.” 5th ACM IKDD CoDS and 23rd COMAD (Demo track), Goa, India, January 11 - 13, 2018.

Santosh, T Y S S, D K Sanyal, P K Bhowmick, and P P Das. “Surrogator: A Tool to Enrich a Digital Library with Open Access Surrogate Resources.” Proc. ACM/IEEE-CS JCDL 2018 (Poster). Texas, USA, June 3 - 7, 2018.

Beyond Lexical Search

• Lexical or word matching-based search cannot read user intent.

• Semantic search aims to give what the user wants, rather than what the user said.

Semantic Web

• Semantic Web is an extension of the traditional Web in which information is given well-defined meaning.

• The Semantic Web will contain resources with relations among each other.

Guha, Ramanathan, Rob McCool, and Eric Miller. "Semantic search." Proc. WWW, ACM, 2003.

Semantic Search

• Semantic Search attempts to augment and improve traditional (keyword-based) search results by using data from the Semantic Web.

• Data represented as a directed labelled graph, wherein each node corresponds to a resource and each arc is labelled with a property type (also a resource).

Semantic Search in the Web

[Figure: Google Knowledge Graph]

Semantic Model for NDLI Metadata

• Semantic model to represent authors, works and other elements (present in metadata) as interconnected entities.

– Handcrafted ontology / auto-generated ontology from NDLI metadata schema can be used.

[Figure: “Satyajit Ray” isAuthorOf “Sonar Kella” and “Our Films, Their Films”; each work hasAuthor “Satyajit Ray”]
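The Satyajit Ray example can be written as a directed labelled graph of (subject, property, object) triples. The sketch below is only illustrative: the `TripleStore` class and its methods are hypothetical, and a production system would more likely use an RDF store and an ontology derived from the NDLI schema.

```python
from collections import defaultdict


class TripleStore:
    """Tiny in-memory store for (subject, property, object) triples."""

    def __init__(self):
        self._out = defaultdict(set)

    def add(self, subject, prop, obj):
        """Add one directed, labelled edge."""
        self._out[(subject, prop)].add(obj)

    def add_authorship(self, author, work):
        """Add the isAuthorOf edge and its inverse hasAuthor edge."""
        self.add(author, "isAuthorOf", work)
        self.add(work, "hasAuthor", author)

    def objects(self, subject, prop):
        """All objects reachable from subject via the given property."""
        return sorted(self._out.get((subject, prop), set()))


store = TripleStore()
store.add_authorship("Satyajit Ray", "Sonar Kella")
store.add_authorship("Satyajit Ray", "Our Films, Their Films")

works = store.objects("Satyajit Ray", "isAuthorOf")
authors = store.objects("Sonar Kella", "hasAuthor")
print(works, authors)
```

Because authors and works are nodes rather than strings, a query for an author can follow edges to every linked work, which a plain keyword match over metadata cannot do.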

[Figure: lexical search (2015) vs. semantic search (2017)]

Need for Semantics in

• Concept queries

– “natural language interface”

– “ontology construction”

– “dynamic programming segmentation”

Xiong, Chenyan, Russell Power, and Jamie Callan. "Explicit semantic ranking for academic search via knowledge graph embedding." Proc. WWW, 2017.

• Create a Knowledge Graph (KG) G = (V, E)

– Collect concept entities (entity set V)

  – Select top-ranked noun phrases (keyphrases) from article title, abstract, introduction, conclusion & citation contexts

– Connect concept entities to other objects via edges (E)

  – Author edges: Link to author whose paper mentions the entity

  – Context edges: Link to co-occurring concept entities

  – Desc edges: Link to descriptions in Freebase

  – Venue edges: Link to venues that published papers with the entity in its title
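The four edge types can be illustrated with a small sketch. It assumes keyphrase extraction has already been done (each paper carries a `concepts` list), replaces Freebase with a plain dictionary of descriptions, and uses hypothetical papers, authors and venues throughout.

```python
from collections import defaultdict
from itertools import combinations


def build_kg(papers, descriptions):
    """Return {concept: {edge_type: set_of_targets}} for a list of papers."""
    edges = defaultdict(lambda: defaultdict(set))
    for paper in papers:
        concepts = paper["concepts"]
        for concept in concepts:
            # Author edges: authors whose paper mentions the concept.
            edges[concept]["author"].update(paper["authors"])
            # Venue edges: venues of papers with the concept in the title.
            if concept.lower() in paper["title"].lower():
                edges[concept]["venue"].add(paper["venue"])
        # Context edges: concepts that co-occur in the same paper.
        for a, b in combinations(concepts, 2):
            edges[a]["context"].add(b)
            edges[b]["context"].add(a)
    # Desc edges: textual description of the concept (stand-in for Freebase).
    for concept, text in descriptions.items():
        if concept in edges:
            edges[concept]["desc"].add(text)
    return edges


# Hypothetical papers with pre-extracted concept entities.
papers = [
    {"title": "A Neural Architecture for Machine Reasoning",
     "authors": ["Author A", "Author B"], "venue": "Venue X",
     "concepts": ["neural architecture", "machine reasoning", "deep learning"]},
    {"title": "Deep Learning Basics",
     "authors": ["Author C"], "venue": "Venue Y",
     "concepts": ["deep learning"]},
]
descriptions = {"deep learning": "part of a broader family of machine learning methods"}

kg = build_kg(papers, descriptions)
print(dict(kg["deep learning"]))
```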

Knowledge Graph in

[Figure: example knowledge graph with the concept entities "Deep learning", "Machine reasoning", "Neural architecture" and "Compositional attention network". The description attached to "Deep learning" reads: "Deep learning (also known as deep structured learning or hierarchical learning) is part of a broader family of machine learning methods based on learning data representations, as opposed to task-specific algorithms."]

KG Embedding in

• Build KG from all documents in corpus.

• Find the embedding of each entity in KG.

– Embeddings are trained based on neighbours in the KG.

– Graph structure around an entity conveys semantics of the entity.

– Intuitively, entities with similar neighbours are usually related.

f : V → ℝ^d maps each entity v ∈ V to its embedding f(v), a vector of dimension d.
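The intuition that entities with similar neighbours are related can be shown without any training. The sketch below is a deliberately simplified stand-in, not the embedding model of the cited work: each entity is represented by a binary vector over the set of all neighbours (so d equals the number of distinct neighbours), and relatedness is measured by cosine similarity. The graph is hypothetical.

```python
import math


def neighbour_vector(graph, entity, vocabulary):
    """Binary vector marking which vocabulary items are neighbours of the entity."""
    return [1.0 if node in graph[entity] else 0.0 for node in vocabulary]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical KG neighbourhoods (authors, venues, co-occurring concepts).
graph = {
    "deep learning": {"neural architecture", "Author A", "Venue X", "Author C"},
    "machine reasoning": {"neural architecture", "Author A", "Venue X"},
    "ontology construction": {"Author B", "Venue Y"},
}
vocabulary = sorted(set().union(*graph.values()))

vectors = {e: neighbour_vector(graph, e, vocabulary) for e in graph}
sim_related = cosine(vectors["deep learning"], vectors["machine reasoning"])
sim_unrelated = cosine(vectors["deep learning"], vectors["ontology construction"])
print(round(sim_related, 3), sim_unrelated)
```

Trained embeddings serve the same purpose with far fewer dimensions, and they can also relate entities whose neighbourhoods overlap only indirectly.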

Explicit Semantic Ranking in

Semantic Search in NDLI

• Can we build a semantic search engine over full text (wherever available) in NDLI?

• Challenges

– Parsing documents of diverse disciplines

– Constructing suitable representation of extracted information

– Building the right query interface

Q/A with Books & Articles

• Search need not be keyword-based.

• One should be able to ask questions and get relevant answers.

• Can we have a Q/A interface to NDLI?

Figure Search

• “A figure is worth a thousand words”

• Many figure search engines are available for biomedical publications.

• Can we design a figure search engine (wherever full text available) for NDLI?

• Challenges

– Extracting figures from scanned documents is difficult.

– Annotating figures (especially for multilingual content) is not easy.

Sanyal, D K, S Chattopadhyay, and R Chatterjee. “Figure Retrieval from Biomedical Literature: An Overview of Techniques, Tools and Challenges.” Machine Learning in Bio-Signal Analysis and Diagnostic Imaging, Elsevier (In Press).

Customized Viewers

• Various customized browsers possible over library contents.

Customized Viewer for Shodhganga Collection

• Shodhganga is a digital repository of Indian Electronic Theses and Dissertations.

• NDLI indexes Shodhganga (> 36K records).

• Our tool Posterity helps visualize and analyse the Shodhganga metadata.

– It can show the academic descendants of a researcher

– It shows various indices (like number of direct students) to characterise a researcher’s mentorship.

Processed Shodhganga

advisorId | researcherId | advisor | researcher | department | institution
51505 | 16148 | datta, asis | gunnery, shobha | school of life sciences | jawaharlal nehru university
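Given advisor and researcher ID pairs like the row above, academic descendants and the number of direct students can be computed with a graph traversal. The sketch below is an assumption about how such indices might be derived, not Posterity's actual code; only the pair (51505, 16148) comes from the slide, and the other IDs are hypothetical.

```python
from collections import defaultdict, deque


def build_students(pairs):
    """Map each advisor ID to the list of researcher IDs they supervised."""
    students = defaultdict(list)
    for advisor_id, researcher_id in pairs:
        students[advisor_id].append(researcher_id)
    return students


def direct_students(students, advisor_id):
    """Number of researchers supervised directly by the advisor."""
    return len(students.get(advisor_id, []))


def descendants(students, advisor_id):
    """All academic descendants (students, students of students, ...) via BFS."""
    seen, queue = set(), deque([advisor_id])
    while queue:
        for student in students.get(queue.popleft(), []):
            if student not in seen:  # guards against duplicate or cyclic records
                seen.add(student)
                queue.append(student)
    return seen


pairs = [
    (51505, 16148),  # from the processed table above
    (16148, 90001),  # hypothetical
    (16148, 90002),  # hypothetical
    (51505, 90003),  # hypothetical
]
students = build_students(pairs)
print(direct_students(students, 51505), sorted(descendants(students, 51505)))
```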

Posterity

Developed by Sumana Dey, Staff @ NDLI

Recommender Systems

Recommender Systems: Approaches

• Recommend related resources

– Could be books, articles, videos, datasets, audio lectures, etc.

• Primarily 3 approaches

– Content-based filtering

  – Uses item metadata and user’s preferences.

– Collaborative filtering

  – Predicts what users will like based on their similarity to other users.

– Hybrid

  – E.g., Apply above methods separately, then choose top-K from each.
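The three approaches can be sketched on toy data as below. All item names, keywords and user histories are hypothetical, and Jaccard similarity is just one possible choice; the hybrid step follows the "top-K from each" idea mentioned above.

```python
def jaccard(a, b):
    """Overlap of two sets, from 0.0 (disjoint) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def top_k(scores, k):
    """Best k items by score (ties broken by name), dropping zero scores."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, score in ranked[:k] if score > 0]


def content_based(user, likes, items, k):
    """Score unseen items by keyword overlap with items the user liked."""
    liked = likes[user]
    scores = {
        item: max((jaccard(keywords, items[j]) for j in liked), default=0.0)
        for item, keywords in items.items() if item not in liked
    }
    return top_k(scores, k)


def collaborative(user, likes, k):
    """Score unseen items by the similarity of the other users who liked them."""
    scores = {}
    for other, other_likes in likes.items():
        if other == user:
            continue
        sim = jaccard(likes[user], other_likes)
        for item in other_likes - likes[user]:
            scores[item] = scores.get(item, 0.0) + sim
    return top_k(scores, k)


def hybrid(user, likes, items, k):
    """Take the top-k of each method and merge them without duplicates."""
    merged = content_based(user, likes, items, k) + collaborative(user, likes, k)
    return list(dict.fromkeys(merged))


# Hypothetical item metadata (keywords) and user histories.
items = {
    "book_ml": {"machine learning", "statistics"},
    "video_dl": {"deep learning", "machine learning"},
    "paper_ir": {"information retrieval", "search"},
    "book_stats": {"statistics", "probability"},
}
likes = {"u1": {"book_ml"}, "u2": {"book_ml", "paper_ir"}, "u3": {"book_stats"}}

recommendations = hybrid("u1", likes, items, k=1)
print(recommendations)
```

The content-based part needs only item metadata, while the collaborative part needs only usage history, which is why the two fail in different situations.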

Recommender System for NDLI

• Can we build a recommender system for NDLI?

• Challenges

– Diverse resource types

– Resources in multiple languages

– Diverse user types

– Only metadata available; not full text (barrier to content-based filtering)

– Sparse query logs (cold start problem in collaborative filtering)

Lots of Possibilities!

• Many more tools possible

– Cross-lingual and multi-lingual search facilities needed.

– Interfaces for differently-abled users required.

– User experience tracking and enhancement is a plus.

– Data analytics over content could give astonishing insights into knowledge heritage.
