Intelligent Tools for a Digital Library
Debarshi Kumar Sanyal
National Digital Library of India, IIT Kharagpur
Smart and Green Information and Communication
Technology (SGICT), Short Term Course (STC),
Jadavpur University, 17 May 2018
Digital Library
• Repository with discovery service of digital resources
– Usually books, articles, digitized manuscripts, historical records, newspapers, photographs, paintings, maps, music, films, question papers, syllabi, presentations, audio-visual lectures, software, datasets
– Can contain only metadata or metadata + content
• Maintained by publishers, educational institutes, governments, non-government agencies, individuals, etc.
• Some are reservoirs of Big Scholarly Data
24 x 7-ENABLED IMMERSIVE E-LEARNING FOR ALL LEARNERS AT ALL LEVELS
IN ALL AREAS
https://ndl.iitkgp.ac.in/
National Digital Library of India (NDLI)
• Educational digital library
• Contains mainly metadata
• Contains > 17M resources
• Contains metadata of books, research papers, theses, audio-visual lectures, software, datasets, syllabi, question papers and model answers, etc.
• Content in more than 100 languages, including English, Bengali, Hindi, Tamil, Telugu and Marathi
• Keyword-based search and advanced faceted search
21 Mar 2016
Some Research Areas
• Metadata engineering
– Automatic metadata extraction
– Author name disambiguation
• Search and retrieval
– Surrogates for access-restricted resources
– Semantic search
– Figure search
– Customized viewer
– Recommender systems
NDLI Research
Metadata Engineering
Metadata Acquisition
• NDLI is a huge metadata repository
• Currently, metadata is acquired in a semi-automatic manner.
– Collected from publishers / libraries
– Manual and automatic post-processing done to adapt to NDLI schema
Automatic Metadata Extraction
• Can we extract metadata automatically for NDLI?
• Challenges
– Various resource types
– Multiple languages
– OCR needed for digitized resources
– Variable scan quality
– Unsatisfactory OCR quality for most Indic languages
– Hard to produce semantic metadata like pedagogical objective
Metadata from Scientific Papers
Developed by Sumana Dey, Staff @ NDLI
"figure": [
{"caption": "Fig. 2: The Framework of Cloud Computing", "page": "2", "path": "img/Figure2-1.png"},
{"caption": "Fig. 3: Architecture of Mobile Cloud Computing", "page": "3", "path": "img/Figure3-1.png"},
{"caption": "Fig. 1: Mobile Cloud Computing", "page": "0", "path": "img/Figure1-1.png"}],
"table": [
{"caption": "TABLE I: Challenges and Solutions of Mobile Cloud Computing", "page": "3", "path": "img/TableI-1.png"}],
"metadata": {
"dc.title": "A Review on Mobile Cloud Computing: Issues, Challenges and Solutions",
"dc.title.alternative": [],
"dc.contributer.author": ["Mandeep Kaur Saggi", "Amandeep Singh Bhatia"],
"dc.contributer.editor": [],
"dc.language.iso": ["en"],
"dc.description.abstract": "Mobile Cloud Computing (MCC) is a combination of mobile
computing and cloud computing. It has become one of the Major Research issue in the
industry. Although there are so, many research studies in mobile computing and cloud
computing, convergence of these two areas grant further academic efforts towards flourishing
MCC. …
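A downstream tool might consume such extractor output roughly as sketched here. This is only an illustration: the record is an abridged copy of the sample (the outer braces are assumed, and the truncated abstract is left out), and the variable names are hypothetical.

```python
import json

# Abridged copy of the sample output. The enclosing braces are an assumption,
# and the truncated abstract field is omitted rather than guessed.
raw = """
{
  "figure": [
    {"caption": "Fig. 2: The Framework of Cloud Computing", "page": "2", "path": "img/Figure2-1.png"},
    {"caption": "Fig. 3: Architecture of Mobile Cloud Computing", "page": "3", "path": "img/Figure3-1.png"},
    {"caption": "Fig. 1: Mobile Cloud Computing", "page": "0", "path": "img/Figure1-1.png"}
  ],
  "table": [
    {"caption": "TABLE I: Challenges and Solutions of Mobile Cloud Computing", "page": "3", "path": "img/TableI-1.png"}
  ],
  "metadata": {
    "dc.title": "A Review on Mobile Cloud Computing: Issues, Challenges and Solutions",
    "dc.contributer.author": ["Mandeep Kaur Saggi", "Amandeep Singh Bhatia"],
    "dc.language.iso": ["en"]
  }
}
"""

record = json.loads(raw)

# Field names are copied exactly as they appear in the sample output.
title = record["metadata"]["dc.title"]
authors = record["metadata"]["dc.contributer.author"]

# Page numbers are stored as strings, so convert before sorting.
figures_by_page = sorted(record["figure"], key=lambda fig: int(fig["page"]))
```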
Metadata Extraction from Books
Author Name Ambiguity
• Same author might write under different names
• Same name might refer to multiple authors (a few surnames are very common in South Asia)
• Other sources of noise are spelling mistakes, abbreviated names, etc.
Author Name Disambiguation (AND)
• Author names in NDLI are currently treated as strings.
– Therefore, hyperlinking author names to respective works not feasible
– Author centric analytics not straightforward
– For research papers, citation tracking not feasible
• Can we disambiguate the author names in NDLI?
• Challenges
– A few surnames are very common in South Asia
– Authors from multiple cultural backgrounds have different name conventions
– Multilingual resources
– Metadata often too sparse to disambiguate.
AND: Formal problem definition (1/2)
• We are given a set of M metadata records C = {c1, c2, …, cM}. Each metadata record ci contains at least author names and work title. It can also contain author affiliations, author email ids, venue of publication, abstract, keywords, references in the article, etc.
– Example of a metadata record
– Manning, Christopher, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven Bethard, and David McClosky. "The Stanford CoreNLP natural language processing toolkit." In Proceedings of 52nd annual meeting of the association for computational linguistics: system demonstrations, pp. 55-60. 2014.
• Each author name is a reference to a real author. From the metadata records, we have to extract author references R = {r1, r2, …, rN} (N ≥ M).
• Our next goal is to map R into K disjoint clusters A = {a1, a2, …, aK} (N ≥ K) such that each cluster ai contains all, and only, the references to the same real author. K may not be known a priori.
[Figure: the author references r1 to r4 ("S Chattopadhyay") and r5 to r7 ("P Smith") in R are mapped to the author clusters a1 to a4 in A.]
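The sets C, R and A from the definition can be made concrete with a small sketch. The records, the chosen clustering and the helper name `is_valid_clustering` are all hypothetical; the code only checks the structural requirements (non-empty, disjoint clusters that cover R), not whether the clustering is actually correct.

```python
from itertools import chain

# Hypothetical metadata records C (M = 3); each has at least authors and a title.
records = [
    {"title": "Paper 1", "authors": ["S Chattopadhyay", "P Smith"]},
    {"title": "Paper 2", "authors": ["S Chattopadhyay"]},
    {"title": "Paper 3", "authors": ["P Smith", "S Chattopadhyay"]},
]

# R: one reference per (record, author position). Every record has at least
# one author, so N >= M.
references = [
    (rec_idx, pos, name)
    for rec_idx, rec in enumerate(records)
    for pos, name in enumerate(rec["authors"])
]

def is_valid_clustering(clusters, references):
    """True if the clusters are non-empty, pairwise disjoint, and cover R."""
    flat = list(chain.from_iterable(clusters))
    return (
        all(cluster for cluster in clusters)
        and len(flat) == len(set(flat))
        and set(flat) == set(range(len(references)))
    )

# One hypothetical clustering A over reference indices (K = 3): two distinct
# people named "S Chattopadhyay" and one "P Smith".
clusters = [{0, 2}, {4}, {1, 3}]
```

With N = 5 references and K = 3 clusters, the constraints N ≥ M and N ≥ K from the definition both hold.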
AND: Formal problem definition (2/2)
• Approaches
– Author grouping
– Author assignment
• Machine learning methods
– Unsupervised
– Supervised
• Evidence explored
– Citation information
– Web information
– Implicit information
Author Name Disambiguation: Techniques
Author Name Disambiguation: Example
Author Name Disambiguation on
• Create author blocks
– A block contains authors with the same last name and first initial (LN-FI)
• Within a block, create a similarity profile of a pair of papers using metadata like
– Author name, author email id, author affiliation, co-author names (LN-FI), journal name, year of publication, MeSH terms, etc.
• Train a random forest to
– Output 1 if similarity profile belongs to same real-world person
– Output 0 otherwise
• Use above trained classifier on test instances in each block
Treeratpituk, Pucktada, and C. Lee Giles. "Disambiguating authors in academic publications using random forests." Proc. ACM/IEEE-CS JCDL, ACM, 2009.
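The blocking and similarity-profile steps can be sketched as below. This is a simplified illustration, not the feature set of the cited paper: the sample references, the six features and the helper names are assumptions, and names are assumed to be written with the last name at the end. The resulting vectors, paired with 0/1 labels, would then be passed to a random forest implementation for training.

```python
from collections import defaultdict
from itertools import combinations

def block_key(name):
    """LN-FI key, assuming the last token of the name is the last name."""
    parts = name.lower().replace(".", " ").split()
    return f"{parts[-1]}_{parts[0][0]}"

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def similarity_profile(p, q):
    """Feature vector comparing two author references from the same block."""
    return [
        float(p["name"].lower() == q["name"].lower()),
        float(bool(p["email"]) and p["email"] == q["email"]),
        float(p["affiliation"].lower() == q["affiliation"].lower()),
        jaccard(p["coauthors"], q["coauthors"]),
        float(p["journal"] == q["journal"]),
        float(abs(p["year"] - q["year"])),
    ]

# Hypothetical author references, one per (paper, author) occurrence.
references = [
    {"name": "S Chattopadhyay", "email": "sc@example.org", "affiliation": "Univ A",
     "coauthors": ["smith_p"], "journal": "J1", "year": 2010},
    {"name": "S Chattopadhyay", "email": "sc@example.org", "affiliation": "Univ A",
     "coauthors": ["smith_p", "ray_s"], "journal": "J2", "year": 2012},
    {"name": "P Smith", "email": "", "affiliation": "Univ B",
     "coauthors": ["chattopadhyay_s"], "journal": "J1", "year": 2010},
]

# Step 1: group reference indices into LN-FI blocks.
blocks = defaultdict(list)
for idx, ref in enumerate(references):
    blocks[block_key(ref["name"])].append(idx)

# Step 2: build a similarity profile for every pair inside a block.
profiles = {
    (i, j): similarity_profile(references[i], references[j])
    for members in blocks.values()
    for i, j in combinations(members, 2)
}
```

Blocking keeps the number of pairs manageable, since only references that share an LN-FI key are ever compared.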
Search & Retrieval
Surrogates for Access-Restricted Scholarly Articles
• NDLI stores metadata for IEEE/ACM/Springer publications.
• Access to full text requires subscription to container library.
• Sometimes authors store a free version on a preprint server; sometimes an open-access conference paper closely resembles a journal version behind a paywall.
– Call a pair of very similar documents by the same author(s) surrogates
– Retrieve surrogates when original paper is access-restricted
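One simple way to flag candidate surrogates is to require a shared author and a high title overlap. The sketch below is an assumption-laden illustration, not the method used by Surrogator: the Jaccard measure, the 0.6 threshold, the sample papers and the function names are all hypothetical.

```python
import re

def tokens(text):
    """Lower-cased alphanumeric word set of a title."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def last_names(authors):
    # Assumes the last token of each name is the last name.
    return {name.split()[-1].lower() for name in authors}

def is_surrogate(paper, candidate, title_threshold=0.6):
    """True if the two records share an author and have similar titles."""
    shared = last_names(paper["authors"]) & last_names(candidate["authors"])
    overlap = jaccard(tokens(paper["title"]), tokens(candidate["title"]))
    return bool(shared) and overlap >= title_threshold

# Hypothetical records: a paywalled paper, its preprint, and an unrelated paper.
paywalled = {"title": "Fast Graph Search for Digital Libraries",
             "authors": ["R Das", "M Roy"]}
preprint = {"title": "Fast Graph Search for Digital Libraries (Extended Version)",
            "authors": ["R Das", "M Roy"]}
unrelated = {"title": "Figure Retrieval from Scanned Newspapers",
             "authors": ["R Das"]}
```

A real system would likely compare abstracts or full text as well, since titles alone can differ substantially between a preprint and its published version.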
Surrogator: Interface
Surrogator: Architecture
Surrogator
Santosh, T Y S S, D K Sanyal, and P K Bhowmick. “Surrogator: Enriching a Digital Library with Surrogate Resources.” 5th ACM IKDD CoDS and 23rd COMAD (Demo track), Goa, India, January 11 - 13, 2018.
Santosh, T Y S S, D K Sanyal, P K Bhowmick, and P P Das. “Surrogator: A Tool to Enrich a Digital Library with Open Access Surrogate Resources.” Proc. ACM/IEEE-CS JCDL 2018 (Poster). Texas, USA, June 3 - 7, 2018.
Beyond Lexical Search
• Lexical or word matching-based search cannot read user intent.
• Semantic search aims to give what the user wants, rather than what the user said.
Semantic Web
• Semantic Web is an extension of the traditional Web in which information is given well-defined meaning.
• The Semantic Web will contain resources with relations among each other.
Guha, Ramanathan, Rob McCool, and Eric Miller. "Semantic search." Proc. WWW, ACM, 2003.
Semantic Search
• Semantic Search attempts to augment and improve traditional (keyword-based) search results by using data from the Semantic Web.
• Data represented as a directed labelled graph, wherein each node corresponds to a resource and each arc is labelled with a property type (also a resource).
Knowledge Graph
Semantic Search in the Web
Semantic Model for NDLI Metadata
• Semantic model to represent authors, works and other elements (present in metadata) as interconnected entities.
– Handcrafted ontology / auto-generated ontology from NDLI metadata schema can be used.
[Figure: “Satyajit Ray” is linked to “Sonar Kella” and to “Our Films, Their Films” by isAuthorOf edges; each work links back to “Satyajit Ray” by a hasAuthor edge.]
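The author and work example can be written as a small set of (subject, property, object) triples. This is a minimal sketch assuming a plain in-memory index; the helper names `objects` and `other_works_by_same_author` are hypothetical and do not describe how NDLI stores its metadata.

```python
from collections import defaultdict

# Directed labelled graph as (subject, property, object) triples.
triples = [
    ("Satyajit Ray", "isAuthorOf", "Sonar Kella"),
    ("Satyajit Ray", "isAuthorOf", "Our Films, Their Films"),
    ("Sonar Kella", "hasAuthor", "Satyajit Ray"),
    ("Our Films, Their Films", "hasAuthor", "Satyajit Ray"),
]

# Index the triples by (subject, property) for fast lookup.
index = defaultdict(list)
for subject, prop, obj in triples:
    index[(subject, prop)].append(obj)

def objects(subject, prop):
    """All nodes reachable from `subject` over an edge labelled `prop`."""
    return index.get((subject, prop), [])

def other_works_by_same_author(work):
    """Follow hasAuthor, then isAuthorOf, skipping the starting work."""
    return [
        other
        for author in objects(work, "hasAuthor")
        for other in objects(author, "isAuthorOf")
        if other != work
    ]
```

Treating authors and works as linked entities rather than strings is what makes a query such as "other works by the author of Sonar Kella" answerable.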
Lexical search (2015)
Semantic search (2017)
Need for Semantics in
• Concept queries
– “natural language interface”
– “ontology construction”
– “dynamic programming segmentation”
Xiong, Chenyan, Russell Power, and Jamie Callan. "Explicit semantic ranking for academic search via knowledge graph embedding." Proc. WWW, 2017.
• Create a Knowledge Graph (KG) G = (V, E)
– Collect concept entities (Entity set V)
  – Select top-ranked noun phrases (keyphrases) from article title, abstract, introduction, conclusion & citation contexts
– Connect concept entities to other objects via edges (E)
  – Author edges: Link to author whose paper mentions the entity
  – Context edges: Link to co-occurring concept entities
  – Desc edges: Link to descriptions in Freebase
  – Venue edges: Link to venues that published papers with the entity in its title
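The edge types listed above can be illustrated with a toy graph builder. This sketch assumes keyphrases have already been extracted, uses hypothetical papers, authors and venues, and omits Desc edges because they need an external knowledge base.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical papers with pre-extracted keyphrases (the concept entities).
papers = [
    {"title": "Deep learning for machine reasoning", "authors": ["Author A"],
     "venue": "Venue X", "keyphrases": ["deep learning", "machine reasoning"]},
    {"title": "Neural architecture search", "authors": ["Author B"],
     "venue": "Venue Y", "keyphrases": ["neural architecture", "deep learning"]},
]

# edges[(entity, edge_type)] -> set of linked objects
edges = defaultdict(set)
for paper in papers:
    for entity in paper["keyphrases"]:
        # Author edge: the paper's authors mention the entity.
        edges[(entity, "author")].update(paper["authors"])
        # Venue edge: only when the entity occurs in the paper title.
        if entity in paper["title"].lower():
            edges[(entity, "venue")].add(paper["venue"])
    # Context edges: entities that co-occur in the same paper.
    for a, b in combinations(paper["keyphrases"], 2):
        edges[(a, "context")].add(b)
        edges[(b, "context")].add(a)

entities = {entity for entity, _ in edges}  # entity set V
```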
Knowledge Graph in
[Figure: example concept entities "Deep learning", "Machine reasoning", "Neural architecture" and "Compositional attention network". The description attached to "Deep learning" reads: "Deep learning (also known as deep structured learning or hierarchical learning) is part of a broader family of machine learning methods based on learning data representations, as opposed to task-specific algorithms."]
KG Embedding in
• Build KG from all documents in corpus.
• Find the embedding of each entity in KG.
– Embeddings are trained based on neighbours in the KG.
– Graph structure around an entity conveys semantics of the entity.
– Intuitively, entities with similar neighbours are usually related.
f : V → ℝ^d maps each entity v to its embedding f(v) (a vector of dimension d)
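The intuition that entities with similar neighbours are related can be demonstrated without any training. The sketch below swaps the learned embedding for a much simpler stand-in: each entity is represented by its binary neighbour vector (so d equals the number of entities) and entities are compared by cosine similarity. The neighbour lists are hypothetical.

```python
import math

# Hypothetical neighbour lists taken from a small knowledge graph.
neighbours = {
    "deep learning": ["machine reasoning", "neural architecture",
                      "compositional attention network"],
    "neural architecture": ["deep learning", "compositional attention network"],
    "machine reasoning": ["deep learning", "compositional attention network"],
    "compositional attention network": ["deep learning", "neural architecture",
                                        "machine reasoning"],
    "ontology construction": ["natural language interface"],
    "natural language interface": ["ontology construction"],
}
vocab = sorted(neighbours)

def embed(entity):
    """Binary neighbour vector; a stand-in for a trained embedding f(v)."""
    linked = set(neighbours[entity])
    return [1.0 if other in linked else 0.0 for other in vocab]

def cosine(u, w):
    dot = sum(a * b for a, b in zip(u, w))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_w = math.sqrt(sum(b * b for b in w))
    return dot / (norm_u * norm_w) if norm_u and norm_w else 0.0

similar = cosine(embed("neural architecture"), embed("machine reasoning"))
dissimilar = cosine(embed("neural architecture"), embed("ontology construction"))
```

Here "neural architecture" and "machine reasoning" share all their neighbours and score close to 1, while "ontology construction" shares none and scores 0. A trained embedding would capture the same signal in far fewer dimensions.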
Explicit Semantic Ranking in
Semantic Search in NDLI
• Can we build a semantic search engine over full text (wherever available) in NDLI?
• Challenges
– Parsing documents of diverse disciplines
– Constructing suitable representation of extracted information
– Building the right query interface
Q/A with Books & Articles
• Search need not be keyword-based.
• One should be able to ask questions and get relevant answers.
• Can we have a Q/A interface to NDLI?
Figure Search
• “A figure is worth a thousand words”
• Many figure search engines are available for biomedical publications.
• Can we design a figure search engine (wherever full text available) for NDLI?
• Challenges
– Extracting figures from scanned documents is difficult.
– Annotating figures (especially for multilingual content) is not easy.
Sanyal, D K, S Chattopadhyay, and R Chatterjee. “Figure Retrieval from Biomedical Literature: An Overview of Techniques, Tools and Challenges.” Machine Learning in Bio-Signal Analysis and Diagnostic Imaging, Elsevier (In Press).
Customized Viewers
• Various customized browsers possible over library contents.
Customized Viewer for Sodhganga Collection
• Sodhganga is a digital repository of Indian Electronic Theses and Dissertations
• NDLI indexes Sodhganga (> 36K records).
• Our tool Posterity helps visualize and analyse the Sodhganga metadata.
– It can show the academic descendants of a researcher
– It shows various indices (like number of direct students) to characterise a researcher’s mentorship.
Processed Sodhganga
advisorId | researcherId | advisor | researcher | department | institution
51505 | 16148 | datta, asis | gunnery, shobha | school of life sciences | jawaharlal nehru university
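Given advisor and researcher ids like those in the processed data, academic descendants and the number of direct students follow from a graph traversal. This is a minimal sketch and not Posterity's implementation: the first pair is the sample row shown above, the remaining pairs and the function names are hypothetical.

```python
from collections import defaultdict, deque

# (advisorId, researcherId) pairs. The first pair is the sample row;
# the others are hypothetical and added only to make the tree non-trivial.
rows = [(51505, 16148), (16148, 90001), (16148, 90002), (90001, 90003)]

students = defaultdict(list)
for advisor_id, researcher_id in rows:
    students[advisor_id].append(researcher_id)

def direct_students(advisor_id):
    """Number of researchers supervised directly by this advisor."""
    return len(students.get(advisor_id, []))

def descendants(advisor_id):
    """All academic descendants, found by breadth-first traversal."""
    seen, queue = set(), deque([advisor_id])
    while queue:
        for student in students.get(queue.popleft(), []):
            if student not in seen:
                seen.add(student)
                queue.append(student)
    return seen
```

The `seen` set also guards against cycles, which can occur in noisy metadata when ids are merged incorrectly.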
Posterity
Developed by Sumana Dey, Staff @ NDLI
Recommender Systems
Recommender Systems: Approaches
• Recommend related resources
– Could be books, articles, videos, datasets, audio lectures, etc.
• Primarily 3 approaches
– Content-based filtering
  – Uses item metadata and user’s preferences.
– Collaborative filtering
  – Predicts what users will like based on their similarity to other users.
– Hybrid
  – E.g., Apply above methods separately, then choose top-K from each.
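The three approaches can be compared on a toy example. The sketch below is illustrative only: the items, keywords, users and the use of Jaccard similarity for both filters are assumptions, and the hybrid simply takes the top-K from each method as in the example above.

```python
# Hypothetical item metadata (keywords) and user likes.
items = {
    "book_a": {"graphs", "algorithms"},
    "video_b": {"graphs", "networks"},
    "paper_c": {"poetry"},
    "dataset_d": {"networks", "statistics"},
}
likes = {"u1": {"book_a"}, "u2": {"book_a", "dataset_d"}, "u3": {"paper_c"}}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def content_based(user, k):
    """Rank unseen items by keyword overlap with the user's liked items."""
    profile = set().union(*(items[i] for i in likes[user]))
    candidates = [i for i in items if i not in likes[user]]
    return sorted(candidates, key=lambda i: (-jaccard(profile, items[i]), i))[:k]

def collaborative(user, k):
    """Rank unseen items by the similarity of the other users who liked them."""
    scores = {}
    for other, liked in likes.items():
        if other == user:
            continue
        sim = jaccard(likes[user], liked)
        for item in liked - likes[user]:
            scores[item] = scores.get(item, 0.0) + sim
    return sorted(scores, key=lambda i: (-scores[i], i))[:k]

def hybrid(user, k):
    """Top-k from each method, merged in order without duplicates."""
    merged = []
    for item in content_based(user, k) + collaborative(user, k):
        if item not in merged:
            merged.append(item)
    return merged
```

For user `u1` with k = 1, the content-based filter suggests `video_b` (shared keyword "graphs") and the collaborative filter suggests `dataset_d` (liked by the similar user `u2`), so the hybrid returns both.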
Recommender System for NDLI
• Can we build a recommender system for NDLI?
• Challenges
– Diverse resource types
– Resources in multiple languages
– Diverse user types
– Only metadata available; not full text (barrier to content-based filtering)
– Sparse query logs (cold start problem in collaborative filtering)
Lots of Possibilities!
• Many more tools possible
– Cross-lingual and multi-lingual search facilities needed.
– Interface for differently-abled required.
– User experience tracking and enhancement is a plus.
– Data analytics over content could give astonishing insights into knowledge heritage.