automatic classification document and filing jonathan mcelroy advisor: franz j. kurfess

25
Automatic Classification Document and Filing Jonathan McElroy Advisor: Franz J. Kurfess

Upload: ella-stephens

Post on 02-Jan-2016

216 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: Automatic Classification Document and Filing Jonathan McElroy Advisor: Franz J. Kurfess

Automatic Classification Document and Filing

Jonathan McElroy

Advisor: Franz J. Kurfess

Page 2: Automatic Classification Document and Filing Jonathan McElroy Advisor: Franz J. Kurfess

Overview

Introduction Classification Techniques Hidden Markov Models Similar Systems Novel Approach

Page 3: Automatic Classification Document and Filing Jonathan McElroy Advisor: Franz J. Kurfess

Introduction

Creating an assistant document filer that learns from the user.

Novelty – taking different classification approaches to forming a hierarchical folder system based on user’s filing patterns. Also uses natural and specific learning and Markov Models to determine user’s style of filing.

Page 4: Automatic Classification Document and Filing Jonathan McElroy Advisor: Franz J. Kurfess

Classification - Bayesian

Probabilistic method using Bayes Theorem [1] [12]Bayes Theorem

Now sum up the probabilities that a word in A will be in class B.

Page 5: Automatic Classification Document and Filing Jonathan McElroy Advisor: Franz J. Kurfess

Classification – Bayesian (cont.)

Each word is independent of each other.

Often performs just as well as more complicated techniques like decision trees, rule-based learning and instance based learning.

Page 6: Automatic Classification Document and Filing Jonathan McElroy Advisor: Franz J. Kurfess

Classification – Vector Based

The text documents are turning into vectors [1]

Support Vector Machines [14]Supervised learning.Forms a divide between examples mapped

in space. New objects mapped are classified based

on where they are.

Page 7: Automatic Classification Document and Filing Jonathan McElroy Advisor: Franz J. Kurfess

Classification – Vector (cont.)

T-Route [1]An average document vector created

for K classes.Uses a term-document matrixWTR is size M X K.Wij represents number of times ti

occurs in cj.

Page 8: Automatic Classification Document and Filing Jonathan McElroy Advisor: Franz J. Kurfess

Classification – Vector (cont.)

Vectorization [1]

Page 9: Automatic Classification Document and Filing Jonathan McElroy Advisor: Franz J. Kurfess

Classification – Vector (cont.)

T-Trans [1]A unique document vector created for

K classes.WTT is size M X N.Wij represents number of times ti

occurs in dj.Document is assigned same class as

column vector in WTT with the smallest Euclidian distance from document.

Page 10: Automatic Classification Document and Filing Jonathan McElroy Advisor: Franz J. Kurfess

Classification - Improvements

Latent Semantic Analysis Looks at relationships between words

and documents and then forms concepts to link eachother.

Page 11: Automatic Classification Document and Filing Jonathan McElroy Advisor: Franz J. Kurfess

Classification - Improvements

Term Weighting [1] [15]Term Frequency – Inverse Document

Frequency The importance increases proportionally to

the number of times a word appears in the document but is offset by the frequency of the word in the corpus

Page 12: Automatic Classification Document and Filing Jonathan McElroy Advisor: Franz J. Kurfess

Classification - Improvements

Term Weighting [1] [15]Mutual Information - look at two

different classes and infers which keywords being used to classify one of them will also lead to a misclassification of the other one

• Measures their mutual dependence

Page 13: Automatic Classification Document and Filing Jonathan McElroy Advisor: Franz J. Kurfess

Classification - Improvements

Term Weighting [1] [15]Bellegarda – Combines the global

weighting with a localized weighting for a word.

Creates a new term-document Wij with ti in dj

Page 14: Automatic Classification Document and Filing Jonathan McElroy Advisor: Franz J. Kurfess

Hidden Markov Models [4]

Method for learning patterns. Specifically for filing patterns.

HMM - describe two related discrete-time stochastic processes.First – hidden stateSecond – visible variables

Page 15: Automatic Classification Document and Filing Jonathan McElroy Advisor: Franz J. Kurfess

Hidden Markov Models [4]

Example: User files using 2 different types of filing: Date, Area of Interest.

Observations about the documents in nodes will lead to a filing type using probabilities of each type, and node data.

Page 16: Automatic Classification Document and Filing Jonathan McElroy Advisor: Franz J. Kurfess

Hidden Markov Models [4]

Date Area

Unrelated Documents

Similar DatesRelated Documents

Page 17: Automatic Classification Document and Filing Jonathan McElroy Advisor: Franz J. Kurfess

Similar Systems

Email Classification/ Routing Systems Hierarchical Systems Semantic Desktops

Page 18: Automatic Classification Document and Filing Jonathan McElroy Advisor: Franz J. Kurfess

Similar Systems

Email Classification/ Routing Systems[6] System reroutes information from a

central database to multiple users with different profiles, by using evolving classifying agents that filter the data

[1] Continually receive new text based documents and working to classify and extract important information out of them.

Page 19: Automatic Classification Document and Filing Jonathan McElroy Advisor: Franz J. Kurfess

Similar Systems

Hierarchical Systems [7]At each level a context sensitive signature

and feature selection is created and then focused to cut out noise and stop words.

Bayesian > Vector

Page 20: Automatic Classification Document and Filing Jonathan McElroy Advisor: Franz J. Kurfess

Similar Systems

Semantic Desktops CALO [13]

• project lead by SRI International that focused on development of a smart desktop

• automate interrelated decision making tasks that have resisted automation and allow them to react appropriately to situations that are unusual.

Page 21: Automatic Classification Document and Filing Jonathan McElroy Advisor: Franz J. Kurfess

Similar Systems

Semantic Desktops DEVONThink [10]

• Seeks to make an all inclusive information gatherer and organizer

• Sorts, classifies and shows relationships between documents automatically, but has shortcomings

Page 22: Automatic Classification Document and Filing Jonathan McElroy Advisor: Franz J. Kurfess

My Approach

Hierarchical approach to classificationClassifying each node in a directory

Also uses natural and specific learning. Letting user choose how involved with learning.

Using Markov Models to determine user’s style of filing. Automatic placement of files that do not fit into any current nodes.

Page 23: Automatic Classification Document and Filing Jonathan McElroy Advisor: Franz J. Kurfess

My Approach

The user is able to drag newly received files and drop them onto the program.

Files are be classified by their content and put in the location that the user would most likely have placed them.

Any changes by user are recorded and added to classification.

Page 24: Automatic Classification Document and Filing Jonathan McElroy Advisor: Franz J. Kurfess

References[1] Tailby, R., Dean, R., Milner, B., and Smith, D. 2006. Email classification for automated service handling. In Proceedings of the 2006

ACM Symposium on Applied Computing (Dijon, France, April 23 - 27, 2006). SAC '06. ACM, New York, NY, 1073-1077. http://doi.acm.org/10.1145/1141277.1141530

[2] Sebastiani, F. 2002. Machine learning in automated text categorization. ACM Comput. Surv. 34, 1 (Mar. 2002), 1-47. http://doi.acm.org/10.1145/505282.505283

[3] Fu, Y., Ke, W., and Mostafa, J. 2005. Automated text classification using a multi-agent framework. In Proceedings of the 5th ACM/IEEE-CS Joint Conference on Digital Libraries (Denver, CO, USA, June 07 - 11, 2005). JCDL '05. ACM, New York, NY, 157-158. http://doi.acm.org/10.1145/1065385.1065420

[4] Frasconi, P., Soda, G., and Vullo, A. 2002. Hidden Markov Models for Text Categorization in Multi-Page Documents. J. Intell. Inf. Syst. 18, 2-3 (Mar. 2002), 195-217. http://dx.doi.org/10.1023/A:1013681528748

[5] Cohen, W. W. and Singer, Y. 1999. Context-sensitive learning methods for text categorization. ACM Trans. Inf. Syst. 17, 2 (Apr. 1999), 141-173. http: //doi.acm.org/10.1145/306686.306688

[6] Clack, C., Farringdon, J., Lidwell, P., and Yu, T. 1997. Autonomous document classification for business. In Proceedings of the First international Conference on Autonomous Agents (Marina del Rey, California, United States, February 05 - 08, 1997). AGENTS '97. ACM, New York, NY, 201- 208. http://doi.acm.org/10.1145/267658.267716

[7] Chakrabarti, S., Dom, B., Agrawal, R., and Raghavan, P. 1998. Scalable feature selection, classication and signature generation for organizing large text databases into hierarchical topic taxonomies. The VLDB Journal 7, 3 (Aug. 1998), 163-178. http://dx.doi.org/10.1007/s007780050061

[8] Baker, L. D. and McCallum, A. K. 1998. Distributional clustering of words for text classication. In Proceedings of the 21st Annual international ACM SIGIR Conference on Research and Development in information Retrieval (Melbourne, Australia, August 24 - 28, 1998). SIGIR '98. ACM, New York, NY, 96-103. http://doi.acm.org/10.1145/290941.290970

[9] Cognitive Assistant that Learns and Organizes http://caloproject.sri.com/[10] http://www.devon-technologies.com/products/devonthink/index.html[11] http://nepomuk.semanticdesktop.org/[12] Fan, H. and Ramamohanarao, K. 2003. A Bayesian approach to use emerging patterns for classification. In Proceedings of the 14th

Australasian Database Conference - Volume 17 (Adelaide, Australia). K. Schewe and X. Zhou, Eds. ACM International Conference Proceeding Series, vol. 143. Australian Computer Society, Darlinghurst, Australia, 39-48.

[13] Cognitive Assistant that Learns and Organizes. http://caloproject.sri.com/[14] Support Vector Machines. December, 2009. http://en.wikipedia.org/wiki/Support_vector_machine.[15] K. Yu, X. Xu, M. Ester, H.-P. Kriegel. Feature weighting and instance selection for collaborative filtering. Knowledge and

Information Systems, 5(2), 201-224, 2003

Page 25: Automatic Classification Document and Filing Jonathan McElroy Advisor: Franz J. Kurfess

Questions

Do you?