ApMl (All Purpose Machine Learning) Toolkit

David W. Miller and Helen Howell

Semantic Web Final Project

Spring 2002

Department of Computer Science

University of Georgia

www.cs.uga.edu/~miller/SemWeb

www.cs.uga.edu/~helen/SemWeb/SemWeb.html

What Has Been Done

• Extensive research into the effectiveness of machine learning algorithms has been performed

– Train system on an expert-created taxonomy with expert-specified documents

What We Did

• Train system on a domain-specific taxonomy

– E.g., CNN’s Sports Pages

• Test system’s ability to correctly classify documents from a second, yet similar, taxonomy

– E.g., Yahoo! Sports Pages

Automatic Text Classification via Statistical Methods

Text Categorization is the problem of assigning predefined categories to free text documents.

Statistical Learning Methods used in ApMl

• Bayes Method

• Rocchio Method (most popular)

• K-Nearest Neighbor Classification

• Probabilistic Indexing

A Probabilistic Generative Model

• Define a probabilistic generative model for documents with classes.

[Figure: an example document, "Reinforcement Learning: a Survey. This paper surveys the field of reinforcement learning from a computer science perspective.", is reduced to a “bag-of-words” table of word counts:]

a 35, block 1, computer 12, field 4, leg 1, machine 7, of 44, paper 3, perspective 2, rate 1, reinforcement 5, science 9, survey 2, the 56, this 11, underrated 1, …

Automatic Text Classification through Machine Learning, McCallum, et al.
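The bag-of-words idea on this slide can be sketched in a few lines of Python. This is only an illustration: the tokenization rule (lowercase, alphabetic runs only) and the function name `bag_of_words` are assumptions, not something the slides or the rainbow toolkit specify.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase the text, keep alphabetic tokens, and count each word.

    Word order is discarded; only the counts survive.
    The tokenization rule here is an assumption made for illustration.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)

# The example document shown on the slide.
doc = ("Reinforcement Learning: a Survey. This paper surveys the field of "
       "reinforcement learning from a computer science perspective.")
counts = bag_of_words(doc)

for word, n in sorted(counts.items()):
    print(n, word)
```

The counts printed here are for the two sentences alone; the larger counts on the slide presumably come from the full text of the paper.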

Bayes Method

Pick the most probable class, given the evidence:

$$c^{*} = \arg\max_{c_j} \Pr(c_j \mid d)$$

$c_j$ - a class (like “Planning”)

$d$ - a document (like “language intelligence proof...”)

Bayes Rule:

$$\Pr(c_j \mid d) = \frac{\Pr(c_j)\,\Pr(d \mid c_j)}{\Pr(d)}$$

$\Pr(c_j \mid d)$ - probability that category $c_j$ should be assigned to document $d$

Automatic Text Classification through Machine Learning, McCallum, et al.

Bayes Rule

$$\Pr(c_j \mid d) = \frac{\Pr(c_j)\,\Pr(d \mid c_j)}{\Pr(d)}$$

$\Pr(c_j \mid d)$ - Probability that document $d$ belongs to category $c_j$

$\Pr(d)$ - Probability that a randomly picked document has the same attributes as $d$

$\Pr(c_j)$ - Probability that a randomly picked document belongs to category $c_j$

$\Pr(d \mid c_j)$ - Probability that a document picked from category $c_j$ has the attributes of $d$

Bayes Method

• Generates conditional probabilities of particular words occurring in a document given it belongs to a particular category.

• A larger vocabulary generates better probability estimates

• Each category is given a threshold p; a document is assigned to the category only if its probability for that category reaches p.

• Documents may fall into one category, more than one, or none.
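A minimal sketch of the points above, assuming a multinomial word model with add-one (Laplace) smoothing; neither choice is stated on the slides, and this is not rainbow's implementation. The names `train`, `posteriors`, `classify`, and the per-category `thresholds` mapping are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train(labeled_docs):
    """labeled_docs: list of (list_of_words, category) pairs."""
    doc_counts = Counter()               # documents seen per category
    word_counts = defaultdict(Counter)   # word occurrences per category
    vocab = set()
    for words, cat in labeled_docs:
        doc_counts[cat] += 1
        word_counts[cat].update(words)
        vocab.update(words)
    return doc_counts, word_counts, vocab

def posteriors(words, model):
    """Return Pr(c_j | d) for every category, via Bayes rule."""
    doc_counts, word_counts, vocab = model
    total_docs = sum(doc_counts.values())
    log_scores = {}
    for cat in doc_counts:
        total_words = sum(word_counts[cat].values())
        score = math.log(doc_counts[cat] / total_docs)        # log Pr(c_j)
        for w in words:
            if w in vocab:                                    # skip unseen words
                # Add-one smoothing (an assumption) avoids log(0).
                p = (word_counts[cat][w] + 1) / (total_words + len(vocab))
                score += math.log(p)                          # log Pr(w | c_j)
        log_scores[cat] = score
    # Dividing by the sum plays the role of Pr(d) in Bayes rule.
    top = max(log_scores.values())
    unnorm = {c: math.exp(s - top) for c, s in log_scores.items()}
    z = sum(unnorm.values())
    return {c: v / z for c, v in unnorm.items()}

def classify(words, model, thresholds):
    """Assign every category whose posterior reaches its threshold p."""
    post = posteriors(words, model)
    return [c for c, p in post.items() if p >= thresholds[c]]

model = train([
    (["goal", "striker", "goal"], "soccer"),
    (["pitcher", "inning", "homerun"], "baseball"),
])
thresholds = {"soccer": 0.6, "baseball": 0.6}
print(classify(["goal", "striker"], model, thresholds))
print(classify(["tennis"], model, thresholds))
```

Because each category is tested against its own threshold, the returned list can hold one category, several, or none, which matches the last bullet above.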

Rocchio Method

• Each document d is represented as a vector within a given vector space V, where F is the set of selected features:

$$\vec{d} = (d^{(1)}, \ldots, d^{(|F|)})$$

• Documents with similar content have similar vectors

• Each dimension of the vector space represents a word selected via a feature selection process

Rocchio Method

• Values of d(i) for a document d are calculated as a combination of the statistics TF(w,d) and DF(w)

• TF(w,d) (Term Frequency) is the number of times word w occurs in a document d.

• DF(w) (Document Frequency) is the number of documents in which the word w occurs at least once.

Rocchio Method

• The inverse document frequency is calculated as

$$IDF(w) = \log\left(\frac{|D|}{DF(w)}\right)$$

where |D| is the total number of documents.

• Value d(i) of feature wi for a document d is calculated as the product

$$d^{(i)} = TF(w_i, d) \cdot IDF(w_i)$$

• d(i) is called the weight of the word wi in the document d.
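The TF, DF, and IDF definitions above translate directly into code. The sketch below assumes the natural logarithm (the slides do not give a base) and uses hypothetical names `document_frequency` and `tfidf_vector`; the tiny corpus is made up for illustration.

```python
import math
from collections import Counter

def document_frequency(docs):
    """DF(w): number of documents in which word w occurs at least once."""
    df = Counter()
    for words in docs:
        df.update(set(words))   # set() so each document counts a word once
    return df

def tfidf_vector(words, features, df, num_docs):
    """Return (d(1), ..., d(|F|)) with d(i) = TF(w_i, d) * IDF(w_i)."""
    tf = Counter(words)         # TF(w, d): occurrences of w in this document
    vector = []
    for w in features:
        if df[w] == 0:
            vector.append(0.0)  # word never seen in the corpus
        else:
            idf = math.log(num_docs / df[w])   # natural log is an assumption
            vector.append(tf[w] * idf)
    return vector

docs = [
    ["goal", "goal", "team"],
    ["inning", "team"],
    ["goal", "inning", "team"],
]
features = ["goal", "inning", "team"]
df = document_frequency(docs)
print(tfidf_vector(docs[0], features, df, len(docs)))
```

Note that "team" occurs in every document, so its IDF is log(1) = 0 and it contributes nothing to any vector.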

Rocchio Method

• Based on word weight heuristics, the word wi is an important indexing term for a document d if it occurs frequently in that document

• However, words that occur frequently in many documents spanning many categories are rated as less important

K-Nearest Neighbor

• Features

– All instances correspond to points in an n-dimensional Euclidean space

– Classification is delayed till a new instance arrives

– Classification done by comparing feature vectors of the different points

– Target function may be discrete or real-valued

K-Nearest Neighbor Learning, Dipanjan Chakraborty

1-Nearest Neighbor

K-Nearest Neighbor

• An arbitrary instance is represented by (a1(x), a2(x), a3(x), ..., an(x))

– ai(x) denotes features

• Euclidean distance between two instances

$$d(x_i, x_j) = \sqrt{\sum_{r=1}^{n} \left(a_r(x_i) - a_r(x_j)\right)^2}$$

• Find the k nearest neighbors whose distance from your test case falls within a threshold p.

• If x of those k nearest neighbors are in category ci, then assign the test case to ci; otherwise it is unmatched.
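The procedure above can be sketched as follows. The slides leave some details open, so this sketch makes two assumptions: "x of those neighbors" is read as "at least x", and only the most common category among the surviving neighbors is considered. The function names and the sample points are hypothetical.

```python
import math
from collections import Counter

def euclidean(a, b):
    """d(xi, xj) = sqrt(sum over r of (a_r(xi) - a_r(xj))^2)."""
    return math.sqrt(sum((ar - br) ** 2 for ar, br in zip(a, b)))

def knn_classify(test, training, k, p, x):
    """training: list of (feature_vector, category) pairs.

    Returns a category, or None when the test case is unmatched.
    """
    # Take the k closest training instances.
    ranked = sorted(training, key=lambda item: euclidean(test, item[0]))[:k]
    # Keep only neighbors whose distance falls within the threshold p.
    within = [cat for vec, cat in ranked if euclidean(test, vec) <= p]
    if not within:
        return None
    category, votes = Counter(within).most_common(1)[0]
    # Assumption: "x of those neighbors" means at least x votes.
    return category if votes >= x else None

training = [
    ((0.0, 0.0), "soccer"),
    ((0.0, 1.0), "soccer"),
    ((5.0, 5.0), "baseball"),
    ((6.0, 5.0), "baseball"),
]
print(knn_classify((0.2, 0.1), training, k=3, p=2.0, x=2))
print(knn_classify((10.0, 10.0), training, k=3, p=2.0, x=2))
```

No model is built ahead of time: all the work happens when a test case arrives, which is the "classification is delayed" property listed under Features.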

Probabilistic Indexing

• Goal is to estimate P(C|si, dm)

– Probability that assignment of term si to the document dm is correct

• Once terms have been identified, assign Form Of Occurrence (FOC)

– Certainty that term is correctly identified

– Significance of term

Probabilistic Indexing Cont.

• If term t appears in document d and a term descriptor from t to s exists, where s is an indexing term, then generate a descriptor indicator

• The set of generated term descriptors can then be evaluated to calculate the probability that document d lies in class c

ApMl Toolkit

• Built on top of and extends existing toolkits

– rainbow (CMU): Machine Learning

– wget (GNU): Web Crawler

• 4 Machine Learning Algorithms and 2 Classification Committees

• Web Crawler and Document Retrieval

• Automated Testing

Machine Learning Components

• 4 Machine Learning Algorithms (rainbow)

– Naïve Bayes, Rocchio, KNN, Probabilistic Indexing

• 2 Classification Committees (ApMl)

– Weight Assigned For Overall Accuracy

– Weights Assigned For Accuracy within each Class of Taxonomy
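The slides do not say how the two committees combine their members' decisions. One plausible reading is a weighted vote, sketched below under that assumption: the first committee weights each algorithm's vote by its overall accuracy, the second by its accuracy on the class it voted for. All function names, classifier labels, and accuracy figures are hypothetical.

```python
from collections import defaultdict

def committee_overall(votes, overall_accuracy):
    """Weight each classifier's vote by its overall accuracy."""
    scores = defaultdict(float)
    for clf, category in votes.items():
        scores[category] += overall_accuracy[clf]
    return max(scores, key=scores.get)

def committee_per_class(votes, class_accuracy):
    """Weight each vote by the classifier's accuracy on the voted class."""
    scores = defaultdict(float)
    for clf, category in votes.items():
        scores[category] += class_accuracy[clf].get(category, 0.0)
    return max(scores, key=scores.get)

# One document, four classifier decisions (made-up values).
votes = {
    "naive_bayes": "soccer",
    "rocchio": "baseball",
    "knn": "baseball",
    "prob_indexing": "soccer",
}
overall_accuracy = {
    "naive_bayes": 0.9, "rocchio": 0.6, "knn": 0.6, "prob_indexing": 0.5,
}
class_accuracy = {
    "naive_bayes": {"soccer": 0.7, "baseball": 0.95},
    "rocchio": {"soccer": 0.5, "baseball": 0.95},
    "knn": {"soccer": 0.5, "baseball": 0.9},
    "prob_indexing": {"soccer": 0.6, "baseball": 0.4},
}
print(committee_overall(votes, overall_accuracy))
print(committee_per_class(votes, class_accuracy))
```

With these made-up numbers the two committees disagree, which shows why per-class weights can matter: an algorithm that is strong overall may still be weak on a particular class.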

Document Retrieval

• Web Crawler and Document Retrieval

– Specify Starting URL

– Specify Recursion Depth

– Allow Multiple Domain Spanning

– Specify Excluded Domains

– Store all retrieved pages into a single directory (ApMl)
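Each crawler setting above has a counterpart among the standard GNU wget options. The sketch below builds such a command line in Python; how ApMl actually invokes wget is not shown in the slides, so the mapping, the function name `build_wget_command`, and the example URL and domains are assumptions.

```python
def build_wget_command(start_url, depth, span_domains, excluded_domains, output_dir):
    """Build a wget argument list that mirrors the crawler settings."""
    cmd = ["wget", "-r", "-l", str(depth)]   # recursive, limited depth
    if span_domains:
        cmd.append("-H")                     # allow spanning to other hosts
    if excluded_domains:
        cmd.append("--exclude-domains=" + ",".join(excluded_domains))
    cmd += ["-nd", "-P", output_dir]         # flat layout under one directory
    cmd.append(start_url)
    return cmd

cmd = build_wget_command(
    start_url="http://example.com/sports/",  # placeholder URL
    depth=2,
    span_domains=True,
    excluded_domains=["ads.example.com"],
    output_dir="pages",
)
print(" ".join(cmd))
# To run the crawl, the list could be passed to subprocess.run(cmd, check=True).
```

The `-nd` option tells wget not to recreate the remote directory hierarchy, so every retrieved page lands in the single directory given by `-P`.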

Automated Testing

• Choose Algorithms to Test

• Choose Test Directory

• Specify Number of Tests

• All results are placed into a persistent window for evaluation

Effectiveness: Contingency Table

              Yes    No

System Yes     a      b

System No      c      d

Machine Learning for Text Classification, David D. Lewis, AT&T Labs

• precision = a/(a+b)

– Documents classified correctly vs. All classified as a particular
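The precision formula can be checked with a small sketch that fills the contingency table from paired decisions. It assumes the table's columns hold the correct (expert) answer, which the extracted slide does not label explicitly; the function names and the sample decisions are hypothetical.

```python
def contingency(system_yes, truth_yes):
    """Count the four cells (a, b, c, d) for one category.

    system_yes[i] is the system's decision for document i; truth_yes[i]
    is the assumed correct answer for the same document.
    """
    a = b = c = d = 0
    for system, truth in zip(system_yes, truth_yes):
        if system and truth:
            a += 1      # system says yes, correctly
        elif system and not truth:
            b += 1      # system says yes, incorrectly
        elif truth:
            c += 1      # system says no, but should have said yes
        else:
            d += 1      # system says no, correctly
    return a, b, c, d

def precision(a, b):
    """precision = a / (a + b); None when the system never says yes."""
    return a / (a + b) if (a + b) else None

system_yes = [True, True, True, False, False]
truth_yes = [True, True, False, True, False]
a, b, c, d = contingency(system_yes, truth_yes)
print(a, b, c, d, precision(a, b))
```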
