
Posted on 30-Jun-2015


Personalized Classifier: Evolving a Classifier from a Large Reference Knowledge Graph

Ramakrishna B Bairi (rkbairi@gmail.com)

Ganesh Ramakrishnan (ganesh@cse.iitb.ac.in)

Vikas Sindhwani (vsindhwa@us.ibm.com)

Motivation

• With the proliferation of digital documents, it is important to have sound organization, i.e. categorization
  – Faceted search, exploratory search, navigational search, diversifying search results, ranking, etc.
• Yahoo! employs 200 (?) people for manual labeling of Web pages and for managing a hierarchy of 500,000+ categories*
• MEDLINE (National Library of Medicine) spends $2 million/year on manual indexing of journal articles and on evolving Medical Subject Headings (18,000+ categories)*

* Source: www.cs.purdue.edu/homes/lsi/CS547_2013_Spring/.../IR_TC_I_Big.pdf (Department of Computer Science, Purdue University)

Challenges

• What categories to choose?
• Predefined? – Reuters, DMOZ, Yahoo! categories
• Relevant to the organization? – Personalized categories

Assumptions

• We assume that a knowledge graph exists with all possible categories
  – One that can cover the terminology of nearly any document collection; for example, Wikipedia
• Nodes are categories
• Edges are relationships between them – association (related)
• The organization receives documents in batches – monthly, weekly, etc.

A part of the Knowledge Graph (KnG)

Problem Definition

• Learning a personalized model for associating categories in the KnG with a document collection, through active learning and feature design
• Building an evolving multi-label categorization system that categorizes documents into categories specific to an organization
  – Personalization of categories

Scope of Work

Overall Architecture

• We evolve the personalized classifier based on
  – Documents seen so far
  – Categories referenced from the Knowledge Graph
  – Feedback provided by the user

Step 1: Spotting

• Spot the keywords in documents
  – Key-phrase identification techniques
  – NLP (noun phrases)
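A minimal sketch of this spotting step. The capitalized-phrase heuristic and the stopword list are illustrative assumptions, not the implementation used in the paper:

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "in", "on", "for", "and", "to", "is", "are", "with"}

def spot_keywords(text, top_k=5):
    """Crude keyword spotter: treat runs of capitalized words as noun-phrase
    candidates, and add the most frequent non-stopword unigrams as a fallback."""
    # Capitalized multi-word runs, e.g. "Support Vector Machine"
    phrases = [p.strip() for p in re.findall(r"(?:[A-Z][a-z]+\s?){2,}", text)]
    # Frequent lowercase unigrams
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]
    frequent = [w for w, _ in Counter(words).most_common(top_k)]
    return phrases + frequent

doc = "Support Vector Machines and kernel methods are classifiers. A Linear Classifier is a classifier."
print(spot_keywords(doc))
```

A real system would use a POS tagger for noun-phrase chunks; the regex stands in for that here.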

Step 2: Candidate Categories

• Keywords are indicative of document topics
• Identify categories from the KnG based on keyword lookups
  – Title match, gloss match with Wikipedia categories
• Add categories in the Markov blanket
  – Observe that categories that get assigned to a document exhibit semantic relations such as "associations"
  – E.g., the category "Linear Classifier" is related to categories such as "Kernel Methods in Classifiers," "Machine Learning," and the like
  – Refer to our paper for more details
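The title-match and gloss-match lookups above could be sketched as follows; the data structures (`kng_titles`, `kng_glosses`) are hypothetical stand-ins for the real knowledge graph:

```python
def candidate_categories(keywords, kng_titles, kng_glosses):
    """Look up candidate categories by exact title match and by gloss match
    (a keyword appearing in the category's short description)."""
    candidates = set()
    for kw in keywords:
        kw_l = kw.lower()
        for cat in kng_titles:
            if kw_l == cat.lower():          # title match
                candidates.add(cat)
        for cat, gloss in kng_glosses.items():
            if kw_l in gloss.lower():        # gloss match
                candidates.add(cat)
    return candidates

titles = ["Linear Classifier", "Machine Learning", "Virus"]
glosses = {"Machine Learning": "Algorithms that learn classifiers from data.",
           "Virus": "A biological infectious agent."}
print(candidate_categories(["linear classifier", "classifiers"], titles, glosses))
```

Expanding the result along the Markov blanket (neighbors in the KnG) would then add the associated categories the slide mentions.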

Candidate Categories

• Not all candidate categories are relevant to the document
  – The document is not about that category
  – The category is not of interest to the user
• We need to select only the most appropriate categories from these candidates

Step 3: Associative Markov Network formation

• Two types of informative features are available
  – A feature that is a function of the document and a category, such as the category-specific classifier scoring function evaluated on the document
  – A feature that is a function of two categories, such as their co-occurrence frequency or the textual overlap between their descriptions
• An Associative Markov Network (AMN) is a very natural way of modeling these two types of features
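A sketch of the two feature types, assuming Jaccard word overlap as the textual-overlap measure (the paper may use a different one):

```python
def node_feature(classifier_score):
    """Node feature: a category-specific classifier score on the document
    (e.g. an SVM decision value for this category)."""
    return classifier_score

def edge_feature(gloss_a, gloss_b):
    """Edge feature: textual overlap (Jaccard similarity of word sets)
    between two category glosses. Co-occurrence frequency would be
    another natural choice."""
    ta, tb = set(gloss_a.lower().split()), set(gloss_b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

print(edge_feature("methods for linear classification",
                   "kernel methods for classification"))
```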

Associative Markov Network

• The candidate categories for a journal article taken from arXiv.org

• Only some are actually relevant, due to
  – Relevance to the document
  – User preferences

Step 4: Collectively Inferring Categories of a Document

• Node features
  – Capture the similarity of a node (category) to the document
  – E.g., kernel, SVM, or Naïve Bayes classifier scores
• Edge features
  – Capture the similarity between nodes
  – E.g., title match, gloss match, etc.

Collectively Inferring Categories of a Document

• This is MAP inference in a standard Markov network with only node and edge potentials
• Using indicator variables, we can express the potentials in log-linear form
• Note that we have separate feature weights for the 0 and 1 labels
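The potential formulas on this slide are images and do not survive the transcript. In the standard AMN formulation, which the surrounding text appears to follow (separate weights for the 0 and 1 labels, associative edge potentials), the log-linear potentials and the MAP objective can be written as:

```latex
% Node and edge potentials (standard associative Markov network form;
% w_n^k and w_e^k are the label-specific node and edge weight vectors)
\log \phi_i(y_i = k) = \mathbf{w}_n^k \cdot \mathbf{x}_i, \qquad k \in \{0, 1\}
\log \phi_{ij}(k, k) = \mathbf{w}_e^k \cdot \mathbf{x}_{ij} \ge 0, \qquad
\phi_{ij}(0, 1) = \phi_{ij}(1, 0) = 1
% MAP inference with 0/1 indicator variables y_i^k:
\max_{\mathbf{y}} \; \sum_i \sum_k y_i^k \, (\mathbf{w}_n^k \cdot \mathbf{x}_i)
  \; + \sum_{(i,j) \in E} \sum_k y_i^k y_j^k \, (\mathbf{w}_e^k \cdot \mathbf{x}_{ij})
```

This is a reconstruction of the standard AMN objective, not a verbatim copy of the slide's formulas.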

[Figure: an example AMN over candidate-category nodes x0–x9]

Training

• The training process involves learning
  – The AMN feature weights (Wn and We)
  – The node-specific classifier (SVM, Naïve Bayes, etc.) weights
• Training is done as part of personalization, explained in the coming slides

Personalization

• Process of learning to categorize with categories that are of interest to an organization

• We achieve this by soliciting feedback from a human oracle on the system-suggested categories and using it to retrain the system parameters.

• The feedback is solicited as “correct”, “incorrect” or “never again” for the categories assigned to a document by the system.
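One way the feedback bookkeeping could look; the `FeedbackStore` class and its field names are illustrative, not from the paper:

```python
from dataclasses import dataclass, field

@dataclass
class FeedbackStore:
    """Collects per-(document, category) feedback. 'never again' categories
    become hard constraints applied to all future inference; 'correct' and
    'incorrect' verdicts become labeled examples for retraining."""
    labeled: dict = field(default_factory=dict)   # (doc, cat) -> "correct"/"incorrect"
    banned: set = field(default_factory=set)      # categories forced to label 0

    def record(self, doc_id, category, verdict):
        if verdict == "never again":
            self.banned.add(category)
        elif verdict in ("correct", "incorrect"):
            self.labeled[(doc_id, category)] = verdict
        else:
            raise ValueError(f"unknown verdict: {verdict}")

fb = FeedbackStore()
fb.record("doc1", "Linear Classifier", "correct")
fb.record("doc1", "Virus", "never again")
print(fb.banned)
```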

Personalization: Constraints

• Users can indicate (via feedback) that a category suggested by the system should never reappear in future categorizations
  – E.g., a Computer Science department may not be interested in detailed categorization of documents based on types of viruses
• The system remembers this feedback as hard constraints, which are applied during the inference process

Personalization: Constraints

• Due to the AMN's associative property, the constraints propagate naturally
  – Users do not have to apply constraints to every unwanted category in the KnG

By applying a "never again" constraint on node N, the label of node N is forced to 0. This in turn forces the labels of strongly associated neighbors (O, P, Q, R) to 0, because the AMN MAP inference attains its maximum value when neighbors connected by high edge potentials are also assigned label 0.
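A toy illustration of this propagation, using a greedy threshold walk rather than the actual AMN MAP inference; the node names and edge potentials are made up to mirror the N, O, P, Q, R example:

```python
def propagate_never_again(banned_node, neighbors, edge_potential, threshold=0.5):
    """Force the banned node to label 0, then pull in neighbors that are
    strongly associated (edge potential above a threshold), transitively."""
    forced = {banned_node}
    frontier = [banned_node]
    while frontier:
        node = frontier.pop()
        for nbr in neighbors.get(node, []):
            if nbr not in forced and edge_potential[(node, nbr)] >= threshold:
                forced.add(nbr)
                frontier.append(nbr)
    return forced

neighbors = {"N": ["O", "P"], "O": ["Q"], "P": [], "Q": ["R"], "R": []}
pots = {("N", "O"): 0.9, ("N", "P"): 0.8, ("O", "Q"): 0.7, ("Q", "R"): 0.6}
print(propagate_never_again("N", neighbors, pots))
```

In the real system the same effect falls out of MAP inference with the constraint clamped, rather than an explicit graph walk.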

Personalization: Active Feedback

• To improve categorization accuracy, users can train the system by providing feedback ("correct", "incorrect") on selected categories of selected documents
• The system uses this feedback to retrain the AMN and SVM (and other classifiers, e.g. Naïve Bayes)
• The system chooses the documents and categories for feedback that help it learn the best parameters with as little feedback as possible

Active Learning

• We prove the claim: "There exists a feature space, and a hyperplane in that feature space passing through the origin, that separates AMN nodes with label 1 from nodes with label 0"
• This claim lets us transform the AMN model into a hyperplane-based two-class classification problem and apply uncertainty-based principles to determine the most uncertain category for a document
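Under the hyperplane view established by this claim, uncertainty sampling reduces to picking the category whose score lies closest to the separating hyperplane. A minimal sketch, with invented scores:

```python
def most_uncertain_category(scores):
    """Uncertainty sampling in the hyperplane view: the category whose
    signed score is closest to the hyperplane (score 0) is the most
    informative one to ask the user about."""
    return min(scores, key=lambda cat: abs(scores[cat]))

scores = {"Machine Learning": 1.7, "Kernel Methods": 0.1, "Virus": -2.3}
print(most_uncertain_category(scores))  # "Kernel Methods"
```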

Active Learning

• ai: gain from selecting category i, based on its distance from the hyperplane
• bj: gain from selecting document j, based on the categories it has
• Feedback is sought from the user for the documents with zj = 1, and only for those categories identified as the most uncertain for that document (yi = 1)

Evaluation

• Warm start
  – RCV1-v2 categories and documents
  – Demonstrates our system on a standard dataset
  – 5000 documents in batches of 50 docs
  – 2000 held-out test documents for F1 score
  – Compared against
    • SVM
    • HICLASS from Shantanu et al.
• Cold start
  – User evaluation using Wikipedia categories and arXiv articles
  – Compared against
    • WikipediaMiner

Warm Start Results

Comparison with SVM

Active Learning with different algorithms

Cold Start Experiments and Results

• 263 arXiv docs
• Annotated by 8 human annotators using Wikipedia titles
• 5-fold cross-validation
  – Trained AMN and SVM weights in each fold

To be addressed…

• Each document is assigned categories separately, which leads to many accumulated categories at the organization level
  – An over-specified number of categories
• AMN inference over thousands of candidate categories is time-consuming, so the system cannot yet be used in a real-time fashion
• The KnG evolves over time
  – Documents that have already been assigned categories need to be updated wisely

Questions?

rkbairi@gmail.com

Thank you
