hands on classification

Upload: vishal-kamlani

Post on 03-Jun-2018


  • 8/12/2019 Hands on Classification

    1/32

    Hands on Classification with Learning Based Java

    Gourab Kundu

    Adapted from a talk by Vivek Srikumar


    2/32

    Goals of this tutorial

    At the end of these lectures, you will be able to

    1. Get started with Learning Based Java
    2. Use a generic, black box text classifier for different applications, and write your own text classifier if needed
    3. Understand how features can impact the classifier performance, and add features to improve your application
    4. Build a badge classifier based on character features


    3/32

    A Quick Recap

    Given: Examples (x, f(x)) of some unknown function f
    Find: A good approximation of f

    x provides some representation of the input

    The process of mapping a domain element into a representation is called Feature Extraction. (Hard; ill-understood; important)
    $x \in \{0,1\}^n$ or $x \in \mathbb{R}^n$

    The target function (label)
    $f(x) \in \{-1,+1\}$: Binary Classification
    $f(x) \in \{1,2,3,\ldots,k-1\}$: Multi-class classification
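To make the binary representation concrete, here is a minimal Java sketch of feature extraction. It is not LBJ code: the class `BinaryBagOfWords`, its `extract` method, and the three-word vocabulary are assumptions made up for illustration. Each document becomes a vector x in {0,1}^n, with one position per vocabulary word.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper, not part of LBJ: maps text to a binary vector x in {0,1}^n. */
final class BinaryBagOfWords {

    /** Returns x where x[i] is 1 if vocabulary word i occurs in the text, else 0. */
    static int[] extract(String text, List<String> vocabulary) {
        // Lowercase, then split on runs of non-letter characters.
        List<String> tokens =
            Arrays.asList(text.toLowerCase(Locale.ROOT).split("[^a-z]+"));
        int[] x = new int[vocabulary.size()];
        for (int i = 0; i < x.length; i++) {
            x[i] = tokens.contains(vocabulary.get(i)) ? 1 : 0;
        }
        return x;
    }

    public static void main(String[] args) {
        // Assumed three-word vocabulary, so n = 3.
        List<String> vocabulary = Arrays.asList("save", "software", "meeting");
        int[] x = extract("save over 70 % on name brand software", vocabulary);
        System.out.println(Arrays.toString(x)); // prints [1, 1, 0]
    }
}
```

A real-valued representation would store counts or weights in the same positions instead of 0 and 1.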


    4/32

    What is text classification?

    A document -> A classifier (black box) -> Some labels


    5/32

    Several applications fit this framework

    Spam detection
    Sentiment classification

    What else can you do, if you had such a black box system that can classify text?

    Try to spend 30 seconds brainstorming


    6/32

    Outline of this session

    Getting started with LBJ

    Writing our first classifier: Spam/Ham

    Playing with features

    Looking inside the black box classifier for feature weights


    7/32

    LEARNING BASED JAVA

    Writing classifiers


    8/32

    What is Learning Based Java?

    A modeling language for learning and inference

    Supports
      - Programming using learned models
      - High level specification of features and constraints between classifiers
      - Inference with constraints
      - Different learning algorithms

    The learning operator
      - Classifiers are functions defined in terms of data
      - Learning happens at compile time


    9/32

    What does LBJ do for you?

    Abstracts away the feature representation, learning and inference

    Allows you to write learning based programs

    Application developers can reason about the application at hand


    10/32

    Demo

    A learning based program

    First, we will write an application that assumes the existence of a black box classifier


    11/32

    SPAM DETECTION


    12/32

    Spam detection

    Which of these (if any) are email spam?

    Subject: save over 70 % on name brand software

    ppharmacy devote fink tungstate brown lexicon pawnshop crescent railroad distaff cytosine barium cain application elegy donnelly hydrochloride common embargo shakespearean bassett trustee nucleolus chicano narbonne telltale tagging swirly lank delphinus bragging bravery cornea asiatic susanne

    Subject: please keep in touch

    just like to say that it has been great meeting and working with you all . i will be leaving enron effective july 5 th to do investment banking in hong kong . i will initially be based in new york and will be moving to hong kong after a few months . do contact me when you are in the vicinity .

    How do you know?


    13/32

    What do we need to build a classifier?

    1. Annotated documents*

    2. A feature representation of the documents

    3. A learning algorithm

    * Here we are dealing with supervised learning
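The three ingredients can be seen side by side in a small Java sketch. This is not LBJ code and not the classifier used in the demo: the class `WordPerceptron`, the choice of a plain perceptron, and the four toy training texts are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch, not LBJ code: a perceptron over word features. */
final class WordPerceptron {
    private final Map<String, Double> weights = new HashMap<>();

    /** Ingredient 2, feature representation: the lowercase words of the text. */
    static String[] features(String text) {
        return text.toLowerCase(Locale.ROOT).trim().split("\\s+");
    }

    /** Sums the weights of the words in the text; unseen words count as 0. */
    double score(String text) {
        double sum = 0.0;
        for (String word : features(text)) {
            sum += weights.getOrDefault(word, 0.0);
        }
        return sum;
    }

    /** Predicts +1 (spam) for a positive score, otherwise -1 (ham). */
    int predict(String text) {
        return score(text) > 0 ? 1 : -1;
    }

    /** Ingredient 3, learning algorithm: on each mistake, shift word weights toward the true label. */
    void train(String[] texts, int[] labels, int rounds) {
        for (int r = 0; r < rounds; r++) {
            for (int i = 0; i < texts.length; i++) {
                if (predict(texts[i]) != labels[i]) {
                    for (String word : features(texts[i])) {
                        weights.merge(word, (double) labels[i], Double::sum);
                    }
                }
            }
        }
    }

    public static void main(String[] args) {
        // Ingredient 1, annotated documents: toy examples invented for this sketch.
        String[] texts = {
            "save on brand software",
            "cheap software offer",
            "please keep in touch",
            "great meeting you all"
        };
        int[] labels = {1, 1, -1, -1};

        WordPerceptron classifier = new WordPerceptron();
        classifier.train(texts, labels, 5);
        System.out.println(classifier.predict("brand software"));       // prints 1
        System.out.println(classifier.predict("please keep in touch")); // prints -1
    }
}
```

In LBJ the same three pieces are declared rather than hand-coded, which is the point of the black box view.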


    14/32

    Our first LBJ program

    /** A learned text classifier; its definition comes from data. */
    discrete TextClassifier(Document d)


    15/32

    Demo

    Let's build a spam detector

    How to train?
    How do different learning algorithms perform?
    Does this choice matter much?
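One common way to compare learning algorithms is accuracy on held-out labeled documents. The Java sketch below shows that computation only. The class `Accuracy` and the example labels are assumptions made for illustration; the slides do not say which metric the demo reports.

```java
/** Hypothetical helper: accuracy of predicted labels against gold labels. */
final class Accuracy {

    /** Returns the fraction of positions where the two label arrays agree. */
    static double of(String[] predicted, String[] gold) {
        if (predicted.length != gold.length || gold.length == 0) {
            throw new IllegalArgumentException(
                "need two non-empty arrays of equal length");
        }
        int correct = 0;
        for (int i = 0; i < gold.length; i++) {
            if (predicted[i].equals(gold[i])) {
                correct++;
            }
        }
        return (double) correct / gold.length;
    }

    public static void main(String[] args) {
        // Invented labels for four test documents.
        String[] gold = {"spam", "ham", "ham", "spam"};
        String[] predicted = {"spam", "ham", "spam", "spam"};
        System.out.println(of(predicted, gold)); // prints 0.75
    }
}
```

Running each algorithm on the same training and test split and comparing this number gives a first answer to whether the choice matters.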


    16/32

    Features

    Our current spam detector uses words as features

    Can we do better?

    Let's try it out
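One candidate to try beyond single words is pairs of adjacent words. The Java sketch below emits both. It is an assumption about what "better" features might look like, not the feature set used in the demo, and the class `NgramFeatures` is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper: word features plus adjacent word pair (bigram) features. */
final class NgramFeatures {

    /** Returns each word, followed by the pair it forms with the next word. */
    static List<String> extract(String text) {
        String[] words = text.toLowerCase(Locale.ROOT).trim().split("\\s+");
        List<String> features = new ArrayList<>();
        for (int i = 0; i < words.length; i++) {
            features.add(words[i]);
            if (i + 1 < words.length) {
                features.add(words[i] + "_" + words[i + 1]);
            }
        }
        return features;
    }

    public static void main(String[] args) {
        // prints [not, not_top, top, top_notch, notch]
        System.out.println(extract("not top notch"));
    }
}
```

A pair such as `not_top` keeps some word order that single word features throw away, at the cost of many more features.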


    17/32

    MORE TEXT CLASSIFICATION


    18/32

    Sentiment classification

    Which of these product reviews is positive?

    I recently made the switch from PC to Mac, and I can say that I'm not sure why I waited so long. Considering that I have only had my computer a few weeks I can't say much about the durability and longevity of the hardware, but I can say that the operating system (mine shipped with Lion) and software is top notch.

    I've been an Apple user for a long time, but my most recent MacBook Pro purchase has convinced me to reconsider. I've had several hardware issues, including a failed keyboard, battery failure, and a bad DVD drive. Now, the backlight on the display fails to turn on when waking from sleep

    How do you know?


    19/32

    Classifying news groups

    Which mailing list should this message be posted to?

    I am looking for Quick C or Microsoft C code for image decoding from file for VGA viewing and saving images from/to GIF, TIFF, PCX, or JPEG format. I have scoured the Internet, but its like trying to find a Dr. Seuss spell checker TSR. It must be out there, and there's no need to reinvent the wheel.

    How do you know?

    alt.atheism
    comp.graphics
    comp.os.ms-windows.misc
    comp.sys.ibm.pc.hardware
    comp.sys.mac.hardware
    comp.windows.x
    misc.forsale
    rec.autos
    rec.motorcycles
    rec.sport.baseball
    rec.sport.hockey
    sci.crypt
    sci.electronics
    sci.med
    sci.space
    soc.religion.christian
    talk.politics.guns
    talk.politics.mideast
    talk.politics.misc
    talk.religion.misc


    20/32

    Demo

    Converting our spam classifier into
      - A sentiment classifier
      - A newsgroup classifier

    Note: How different are these at the implementation level?


    21/32

    Most of the engineering lies in the features

    A document -> A classifier (black box) -> Some labels


    22/32

    Summary

    What is LBJ? How do we use it?

    Writing a simple spam detector

    Playing with features

    How much do we need to change to move to a different application?


    23/32

    Assignment before Next Class (Not Graded)

    Download the code & data (http://l2r.cs.uiuc.edu/~danr/Teaching/CS446-12/handsonclassification.html) for this class and play with it

    Try to solve the Badges game puzzle with LBJ
      - Think about what features are needed
      - Write a parser for reading the data
      - Write a classifier for solving the puzzle


    24/32

    Next Class

    We will solve the Badges Game puzzle by Machine Learning

    We will look at more text classification examples

    We will think about a famous people classifier

    Questions


    25/32

    Badge Classifier

    Brainstorm the possible features
      - Characters in entire name
      - Two consecutive characters
      - Character as vowel, character as consonant
      - ...

    Feature Engineering is Important (especially if labeled data is small)

    What is the baseline? 70 +, 24 -
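The brainstormed character features and the baseline can be sketched in Java as follows. The class `BadgeFeatures`, the feature name strings, and the example name "Al" are assumptions made for illustration. The baseline reading is also an assumption: if 70 + and 24 - are the counts of positive and negative badges, always predicting + is right on 70 of 94 examples, about 74.5%.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper: character-level features for a name, plus a majority baseline. */
final class BadgeFeatures {

    static boolean isVowel(char c) {
        return "aeiou".indexOf(c) >= 0;
    }

    /** Emits single characters, vowel/consonant by position, and adjacent letter pairs. */
    static Set<String> extract(String name) {
        String s = name.toLowerCase(Locale.ROOT);
        Set<String> features = new LinkedHashSet<>();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (!Character.isLetter(c)) {
                continue; // skip spaces and punctuation
            }
            features.add("char=" + c);
            features.add("pos" + i + "=" + (isVowel(c) ? "vowel" : "consonant"));
            if (i + 1 < s.length() && Character.isLetter(s.charAt(i + 1))) {
                features.add("bigram=" + c + s.charAt(i + 1));
            }
        }
        return features;
    }

    /** Accuracy of always predicting the more frequent label. */
    static double majorityBaseline(int positives, int negatives) {
        return (double) Math.max(positives, negatives) / (positives + negatives);
    }

    public static void main(String[] args) {
        // prints [char=a, pos0=vowel, bigram=al, char=l, pos1=consonant]
        System.out.println(extract("Al"));
        // prints 0.745
        System.out.printf(Locale.ROOT, "%.3f%n", majorityBaseline(70, 24));
    }
}
```

Any learned badge classifier should beat this baseline before its features are considered useful.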


    26/32

    THE FAMOUS PEOPLE CLASSIFIER


    27/32

    The Famous People Classifier

    f([image]) = Politician

    f([image]) = Athlete

    f([image]) = Corporate Mogul


    28/32

    The NLP version of the fame classifier

    Each person is represented by the text that mentions them:

    All sentences in the news in which the string "Barack Obama" occurs

    All sentences in the news in which the string "Roger Federer" occurs

    All sentences in the news in which the string "Bill Gates" occurs
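A minimal Java sketch of this representation collects the sentences in which a name string occurs. It uses plain substring matching, which is an assumption made to keep the example short. The class `MentionCollector` and the three example sentences are invented for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: represents a person by the sentences that contain their name. */
final class MentionCollector {

    /** Returns the sentences in which the exact string `name` occurs, in input order. */
    static List<String> sentencesMentioning(String name, List<String> sentences) {
        List<String> matches = new ArrayList<>();
        for (String sentence : sentences) {
            if (sentence.contains(name)) {
                matches.add(sentence);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Invented sentences standing in for news text.
        List<String> news = Arrays.asList(
            "Roger Federer won the match in straight sets.",
            "Bill Gates spoke about software at the conference.",
            "The match lasted three hours.");
        // prints [Roger Federer won the match in straight sets.]
        System.out.println(sentencesMentioning("Roger Federer", news));
    }
}
```

The collected sentences then play the role of the document in the earlier text classifiers.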


    29/32

    Our goal

    Find famous athletes, corporate moguls and politicians

    Athlete: Michael Schumacher, Michael Jordan

    Politician: Bill Clinton, George W. Bush

    Corporate Mogul: Warren Buffett, Larry Ellison


    30/32

    Let's brainstorm

    How do we build a fame classifier?
    Remember, we start off with just raw text from a news website


    31/32

    One solution

    Let us label entities using features defined on mentions

      - Identify mentions using the named entity recognizer
      - Define features based on the words, parts of speech and dependency trees
      - Train a classifier

    All sentences in the news in which the string "Barack Obama" occurs


    32/32

    Summary

    1. Get started with Learning Based Java
    2. Use a generic, black box text classifier for different applications, and write your own text classifier if needed
    3. Understand how features can impact the classifier performance, and add features to improve your application
    4. Build a badge classifier based on character features

    Questions