Personalized Classifier: Evolving a Classifier from a Large
Reference Knowledge Graph
Ramakrishna B Bairi([email protected])
Ganesh Ramakrishnan([email protected])
Vikas Sindhwani([email protected])
Motivation
• With the proliferation of digital documents, it is important to have sound organization, i.e., categorization
  – Faceted search, exploratory search, navigational search, diversifying search results, ranking, etc.
• Yahoo! employs 200 (?) people for manual labeling of Web pages and managing a hierarchy of 500,000+ categories*
• MEDLINE (National Library of Medicine) spends $2 million/year for manual indexing of journal articles and evolving Medical Subject Headings (18,000+ categories)*
* Source: www.cs.purdue.edu/homes/lsi/CS547_2013_Spring/.../IR_TC_I_Big.pdf (Department of Computer Science, Purdue University)
Challenges
• Which categories should we choose?
• Predefined?
  – Reuters, DMOZ, Yahoo categories
• Relevant to the organization?
  – Personalized categories
Assumptions
• We assume that a knowledge graph exists with all possible categories
  – One that can cover the terminology of nearly any document collection
  – For example, Wikipedia
• Nodes are categories
• Edges are relationships between them
  – Association (relatedness)
• The organization receives documents in batches
  – Monthly, weekly, etc.
A part of Knowledge Graph (KnG)
Problem Definition
• Learning a personalized model for associating KnG categories with a document collection, through active learning and feature design
• Building an evolving multi-label categorization system that categorizes documents into categories specific to an organization
  – Personalization of categories
Scope of Work
Overall Architecture
• We evolve the personalized classifier based on– Documents seen so far– Categories referenced from Knowledge Graph– Feedback provided by the user
Step 1: Spotting
• Spot the key words in documents
  – Key-phrase identification techniques
  – NLP (noun phrases)
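As a toy illustration of the spotting step, the sketch below scans a document for phrases from a known lexicon. This is a stand-in for real key-phrase identification or NLP noun-phrase chunking; `spot_keywords` and the sample lexicon are hypothetical, not the paper's actual spotter.

```python
import re

def spot_keywords(text, lexicon):
    """Return the phrases from `lexicon` that occur in `text`.

    Minimal stand-in for key-phrase identification: a real system would
    use noun-phrase chunking; here we just scan for known phrases with a
    case-insensitive whole-word match.
    """
    found = set()
    lowered = text.lower()
    for phrase in lexicon:
        if re.search(r"\b" + re.escape(phrase.lower()) + r"\b", lowered):
            found.add(phrase)
    return found

doc = "We train a linear classifier with kernel methods for text categorization."
lexicon = {"linear classifier", "kernel methods", "naive bayes"}
print(spot_keywords(doc, lexicon))  # {'linear classifier', 'kernel methods'}
```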
Step 2: Candidate Categories
• Key words are indicative of document topics
• Identify categories from the KnG via keyword look-ups
  – Title match and gloss match with Wikipedia categories
• Add categories in the Markov blanket
  – Observe that categories assigned to a document exhibit semantic relations such as "associations"
  – E.g., the category "Linear Classifier" is related to categories such as "Kernel Methods in Classifiers," "Machine Learning," and the like
  – Refer to our paper for more details
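A minimal sketch of this candidate-generation step, assuming hypothetical `title_index` and `kng_adj` structures that stand in for the Wikipedia title look-up and the KnG's association edges:

```python
def candidate_categories(keywords, title_index, kng_adj):
    """Look up categories by title match, then expand with KnG neighbors.

    title_index: phrase -> category title; kng_adj: category -> associated
    categories. Both are hypothetical stand-ins for the Wikipedia-derived KnG.
    """
    direct = {title_index[k] for k in keywords if k in title_index}
    expanded = set(direct)
    for cat in direct:
        expanded.update(kng_adj.get(cat, ()))  # Markov-blanket expansion
    return expanded

# Made-up KnG fragment mirroring the example on this slide
title_index = {"linear classifier": "Linear Classifier"}
kng_adj = {"Linear Classifier": ["Kernel Methods in Classifiers", "Machine Learning"]}
cands = candidate_categories({"linear classifier", "svm"}, title_index, kng_adj)
# cands contains: Linear Classifier, Kernel Methods in Classifiers, Machine Learning
```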
Candidate Categories
• Not all candidate categories are relevant to the document
  – The document is not about that category
  – The category is not of interest to the user
• We need to select only the most appropriate categories from these candidates
Step 3: Associative Markov Network formation
• Two types of informative features are available
  – A feature that is a function of the document and a category, such as the category-specific classifier scoring function evaluated on the document
  – A feature that is a function of two categories, such as their co-occurrence frequency or the textual overlap between their descriptions
• An Associative Markov Network (AMN) is a natural way of modeling these two types of features
Associative Markov Network
• The candidate categories for a journal article taken from arXiv.org
• Only some are actually relevant, depending on
  – Relevance to the document
  – User preferences
Step 4: Collectively Inferring Categories of a Document
• Node features
  – Capture the similarity of a node (category) to the document
  – E.g., kernel, SVM, or Naïve Bayes classifier scores
• Edge features
  – Capture the similarity between nodes
  – E.g., title match, gloss match, etc.
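The node and edge features might be sketched as simple functions. The word-overlap scoring rules below are illustrative assumptions standing in for the real SVM/Naïve Bayes scores and title/gloss matchers, not the paper's actual feature design:

```python
def node_features(doc_tokens, category):
    """Toy node features for (document, category): a title-hit flag and a
    gloss-overlap ratio, standing in for real classifier scores.
    `category` is a dict with hypothetical 'title' and 'gloss' fields;
    `doc_tokens` are assumed to be lowercased."""
    overlap = len(set(doc_tokens) & set(category["gloss"].lower().split()))
    title_hit = 1.0 if category["title"].lower() in " ".join(doc_tokens) else 0.0
    return [title_hit, overlap / max(len(doc_tokens), 1)]

def edge_features(cat_a, cat_b):
    """Toy edge features for (category, category): title-word overlap and
    gloss-word overlap between the two categories."""
    t = len(set(cat_a["title"].lower().split()) & set(cat_b["title"].lower().split()))
    g = len(set(cat_a["gloss"].lower().split()) & set(cat_b["gloss"].lower().split()))
    return [t, g]

cat_a = {"title": "Linear Classifier", "gloss": "classifier with a linear decision boundary"}
cat_b = {"title": "Kernel Methods", "gloss": "methods using kernel functions in a classifier"}
doc_tokens = ["we", "train", "a", "linear", "classifier"]
print(node_features(doc_tokens, cat_a))  # [1.0, 0.6]
print(edge_features(cat_a, cat_b))       # [0, 2]
```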
Collectively Inferring Categories of a Document
• This is MAP inference in a standard Markov network with only node and edge potentials
• Using indicator variables y_i^k (y_i^k = 1 iff node i takes label k ∈ {0, 1}), the log-potentials can be expressed as
  log φ_i = Σ_k (w_n^k · x_i) y_i^k  and  log φ_ij = Σ_k (w_e^k · x_ij) y_i^k y_j^k
• Note that we have separate feature weights for the 0 and 1 labels
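For intuition, MAP inference over a very small binary AMN can be done by brute force; real systems use min-cut or LP-relaxation solvers instead. The `amn_map` function and the toy scores below are hypothetical illustrations, not the paper's inference procedure.

```python
from itertools import product

def amn_map(node_scores, edges):
    """Brute-force MAP for a small binary AMN.

    node_scores[i] = (score if label 0, score if label 1); edges maps
    (i, j) -> reward added only when both endpoints share a label
    (the associative potentials). Exponential in the number of nodes,
    so for illustration only.
    """
    n = len(node_scores)
    best, best_y = float("-inf"), None
    for y in product((0, 1), repeat=n):
        s = sum(node_scores[i][y[i]] for i in range(n))
        s += sum(r for (i, j), r in edges.items() if y[i] == y[j])
        if s > best:
            best, best_y = s, y
    return list(best_y)

nodes = [(0.2, 1.0), (0.6, 0.5), (0.9, 0.1)]  # (score if 0, score if 1)
edges = {(0, 1): 0.8}  # strong association between nodes 0 and 1
print(amn_map(nodes, edges))  # [1, 1, 0]: the edge pulls node 1 to label 1
```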
[Figure: example AMN over candidate-category nodes x0–x9]
Training
• The training process involves learning
  – The AMN feature weights (Wn and We)
  – The node-specific classifier (SVM, Naïve Bayes, etc.) weights
• Training is done as part of personalization, explained in the coming slides
Personalization
• Process of learning to categorize with categories that are of interest to an organization
• We achieve this by soliciting feedback from a human oracle on the system-suggested categories and using it to retrain the system parameters.
• The feedback is solicited as “correct”, “incorrect” or “never again” for the categories assigned to a document by the system.
Personalization: Constraints
• Users can indicate (via feedback) that a category suggested by the system should never reappear in future categorization
  – E.g., a Computer Science department may not be interested in detailed categorization of documents by types of viruses
• The system remembers this feedback as hard constraints that are applied during the inference process
Personalization: Constraints
• Due to the AMN's associative property, the constraints propagate naturally
  – Users do not have to apply constraints to every unwanted category in the KnG
Applying a "never again" constraint on node N forces its label to 0. This in turn forces the labels of strongly associated neighbors (O, P, Q, R) to 0, because the AMN MAP objective attains its maximum when neighbors connected by high edge potentials share the same label.
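The "never again" propagation can be sketched by clamping constrained nodes to label 0 during (brute-force) MAP inference; the two-node network and scores below are a made-up example, not real system data.

```python
from itertools import product

def amn_map_constrained(node_scores, edges, never=frozenset()):
    """MAP inference with 'never again' hard constraints.

    Nodes in `never` are clamped to label 0; the associative edge rewards
    (added only when both endpoints share a label) then pull strongly
    linked neighbors toward 0 as well. Brute force, illustration only.
    """
    n = len(node_scores)
    best, best_y = float("-inf"), None
    for y in product((0, 1), repeat=n):
        if any(y[i] for i in never):
            continue  # violates a hard constraint
        s = sum(node_scores[i][y[i]] for i in range(n))
        s += sum(r for (i, j), r in edges.items() if y[i] == y[j])
        if s > best:
            best, best_y = s, y
    return list(best_y)

nodes_c = [(0.0, 5.0), (0.4, 0.6)]  # node 0 strongly prefers label 1
edges_c = {(0, 1): 2.0}             # strong association between 0 and 1
print(amn_map_constrained(nodes_c, edges_c))             # [1, 1]
print(amn_map_constrained(nodes_c, edges_c, never={0}))  # [0, 0]: neighbor follows
```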
Personalization: Active Feedback
• To improve the categorization accuracy, users can train the system by providing feedback (“correct”, “incorrect”) on select categories of select documents.
• The system uses this feedback to retrain the AMN and the node classifiers (SVM, Naïve Bayes, etc.)
• The system chooses the documents and categories for feedback that help it learn the best parameters with as little feedback as possible
Active Learning
• We prove the claim: "There exists a feature space, and a hyperplane in that feature space passing through the origin, that separates AMN nodes with label 1 from nodes with label 0"
• This claim lets us transform the AMN model into a hyperplane-based two-class classification problem and apply uncertainty-based principles to determine the most uncertain category for a document
Active Learning
• ai: the gain in selecting category i, based on its distance from the hyperplane
• bj: the gain in selecting document j, based on the categories it contains
• Feedback is sought from the user for the documents with zj = 1 and only for those categories that are identified as the most uncertain for that document (yi = 1).
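Margin-based uncertainty sampling over this hyperplane view might look like the sketch below; `most_uncertain` and the margin values are illustrative assumptions, not the paper's exact selection objective.

```python
def most_uncertain(categories, margins, k=1):
    """Pick the k categories whose signed distance to the separating
    hyperplane is smallest in magnitude, i.e. the most uncertain ones,
    as in standard margin-based uncertainty sampling.

    `margins` is a hypothetical map from category to its signed
    hyperplane distance.
    """
    ranked = sorted(categories, key=lambda c: abs(margins[c]))
    return ranked[:k]

margins = {"Machine Learning": 1.8, "Kernel Methods": 0.1, "Virology": -2.3}
print(most_uncertain(list(margins), margins))  # ['Kernel Methods']
```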
Evaluation
• Warm Start
  – RCV1-v2 categories and documents
  – Demonstrates our system on a standard dataset
  – 5000 documents, processed in batches of 50
  – 2000 held-out test documents for the F1 score
  – Compared against: SVM; HICLASS from Shantanu et al.
• Cold Start
  – User evaluation using Wikipedia categories and arXiv articles
  – Compared against: WikipediaMiner
Warm Start Results
Comparison with SVM
Active Learning with different algorithms
Cold Start Experiments and Results
• 263 arXiv docs
• Annotated by 8 human annotators using Wikipedia titles
• 5-fold cross validation
  – Trained AMN and SVM weights in each fold
To be addressed…
• Each document is assigned categories separately, which leads to many accumulated categories at the organization level
  – An over-specified number of categories
• AMN inference over thousands of candidate categories is time-consuming, so the system cannot yet run in real time
• The KnG evolves over time
  – Documents that were already assigned categories need to be updated wisely
Questions?
Thank you