Using Amazon Mechanical Turk for Data Collection in Parts of Speech Tag Correction for Patent Claims

David Cinciruk

June 24, 2015


Table of Contents

Overview of Our Problem

Using Amazon Mechanical Turk
  Worker Views
  Requester Views
  Testing HITs
  Difficulties of Obtaining Results

Automatic Corrector of Patent Claim POS Tags

Conclusion


Our Research

- Our work is in patent processing
  - Mapping patents to other technical documents
  - Automated classification
  - Patent retrieval
  - Patent valuation

- We need to find a way to represent patents. NLP offers a solution with dependencies


Dependency Modeling

- Dependencies help to represent words by their relationships

- They are formed by traversing the parse tree of a sentence or segment and noting pairs of words and how they are linked together
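
As a minimal sketch of what such dependencies look like as data, the Python below turns a small parse into (head, relation, dependent) triples. The head indices and relation labels are typed in by hand for illustration; they are assumptions, not the output of any particular parser.

```python
# Each token stores the index of its head (None for the root) and a relation label.
# This tiny parse is hand-written for illustration only.
tokens = [
    {"word": "the", "head": 2, "rel": "det"},
    {"word": "razor", "head": 2, "rel": "compound"},
    {"word": "blade", "head": None, "rel": "root"},
]

def dependency_triples(tokens):
    """Return (head word, relation, dependent word) for every non-root token."""
    triples = []
    for tok in tokens:
        if tok["head"] is not None:
            head_word = tokens[tok["head"]]["word"]
            triples.append((head_word, tok["rel"], tok["word"]))
    return triples

triples = dependency_triples(tokens)
```

If "blade" were mis-tagged as a verb, a parser would attach its neighbours differently, which is why the tag errors discussed later matter for the dependencies.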


NLP Parsers Do Not Work Well With Patent Claims!


Obvious Mislabeling of Words


Odd Language of Patent Claims

The holder for use with a razor blade according to claim 1, said recess being semicircular in shape and revealing opposite faces of the razor blade which may be grasped with the fingers of the user.

- Patents are the intersection of technical and legal speech

- Legally, each patent claim must be one very long run-on sentence

- Claims feature particular language whose meanings are not the standard meanings in normal English

- Because of this, NLP software struggles to correctly tag patent claims


Trying to Correct the NLP

blade: VBP → NN
said: VBD → JJ
semicircular: VBN → JJ

- By forcing incorrect tags to their corrected tags, the NLP software is forced to reconstruct the parse tree and create dependencies more similar to true speech

- We need a large collection of hand-corrected patent claims in order to train a system to automatically correct patents

- Amazon Mechanical Turk allows us to gather this collection via crowdsourcing


Artificial Artificial Intelligence

A crowdsourcing Internet marketplace that enables individuals and organizations to coordinate the use of human intelligence to perform tasks that computers are unable to do.


The Turk

The name comes from a supposed master chess-playing automaton built in 1770 by Wolfgang von Kempelen


In reality, a chess master was hidden in the cabinet, playing the game from underneath


Workers and Requesters

- Worker
  - Also known as a Turker
  - Performs tasks set up by requesters
  - Works their own hours and chooses their own tasks

- Requester
  - Creates tasks for workers to do
  - Can set qualifications on tasks to disallow unproven people from performing them
  - Can set up tests to determine whether people are qualified to work on particular HITs (Human Intelligence Tasks)


Our Use of Mechanical Turk

- We want to develop the most accurate dependencies for use in our future systems

- We have patent claims that are automatically segmented at semicolons and colons and then POS tagged

- The tags may be incorrect; we eventually want to be able to correct them automatically

- We need to gather lists of corrected tags

- We decided to use Mechanical Turk to gather the corrected tags
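
The segmentation step can be sketched in a few lines of Python. This is only an illustration of splitting at semicolons and colons; the function name `segment_claim` and the sample claim text are made up for the example, and the actual segmenter may treat edge cases differently.

```python
import re

def segment_claim(claim_text):
    """Split a claim at semicolons and colons and drop empty pieces."""
    parts = re.split(r"[;:]", claim_text)
    return [part.strip() for part in parts if part.strip()]

# Invented claim text, used only to show the splitting behaviour.
claim = "A holder comprising: a body; and a recess in the body."
segments = segment_claim(claim)
```

Each resulting segment would then be POS tagged on its own.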


- We want to create an easy-to-perform HIT that has a high impact on our tagging

- The most common problem with a high impact on the dependencies is words incorrectly tagged as verbs

- We chose the task of checking whether words initially tagged as verbs are really nouns or adjectives

- Turkers label each such word as a noun, adjective, or verb, and all other tags are assumed to be correct


The Setup

- Initial:
  The/DT holder/NN for/IN use/NN with/IN a/DT razor/NN blade/VBP according/VBG to/TO claim/NN 1/CD ,/, said/VBD recess/NN being/VBG semicircular/VBN in/IN shape/NN and/CC revealing/VBG opposite/JJ faces/NNS of/IN the/DT razor/NN blade/NN which/WDT may/MD be/VB grasped/VBN with/IN the/DT fingers/NNS of/IN the/DT user/NN ./.

- Corrected:
  The/DT holder/NN for/IN use/NN with/IN a/DT razor/NN blade/NN according/VBG to/TO claim/NN 1/CD ,/, said/JJ recess/NN being/VBG semicircular/JJ in/IN shape/NN and/CC revealing/VBG opposite/JJ faces/NNS of/IN the/DT razor/NN blade/NN which/WDT may/MD be/VB grasped/VBN with/IN the/DT fingers/NNS of/IN the/DT user/NN ./.
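
To make the difference between the two taggings explicit, the sketch below parses the `word/TAG` format and lists every position where the tags disagree. The helper names `parse_tagged` and `changed_tags` are hypothetical, and only the first part of the example claim is used to keep the block short.

```python
def parse_tagged(text):
    """Turn 'word/TAG word/TAG ...' into a list of (word, tag) pairs."""
    # rsplit on the last slash so that tokens such as ',/,' parse correctly.
    return [tuple(item.rsplit("/", 1)) for item in text.split()]

def changed_tags(initial, corrected):
    """List (word, old tag, new tag) wherever the two taggings disagree."""
    pairs = zip(parse_tagged(initial), parse_tagged(corrected))
    return [(word, old, new) for (word, old), (_, new) in pairs if old != new]

initial = ("a/DT razor/NN blade/VBP according/VBG to/TO claim/NN 1/CD ,/, "
           "said/VBD recess/NN being/VBG semicircular/VBN")
corrected = ("a/DT razor/NN blade/NN according/VBG to/TO claim/NN 1/CD ,/, "
             "said/JJ recess/NN being/VBG semicircular/JJ")
changes = changed_tags(initial, corrected)
```

Here `changes` holds the three corrections for "blade", "said", and "semicircular".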


Worker Homepage


HIT Search


Our HIT


Requester Homepage


Tasks for Requesters

- Editing HITs

- Publishing Batches

- Reviewing Batches


Editing HITs


Editing HITs - Describing the HIT


Editing HITs - Setting Up Properties


Editing HITs - Setting Worker Requirements


Editing HITs - Designing the HIT


Editing HITs - Previewing the HIT


Publishing HITs


Publishing HITs - Developing the Initial CSV file

- Takes as input a folder of initial POS-tagged patent claims that have not been checked and a folder of correctly labeled claim segments

- Makes a CSV file featuring one segment (containing verbs) from the initial patent claim files and one segment that is already correctly labeled

- Also outputs a file featuring all the correct segments

- From there we can extend it to a 10-segment version by hand
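
A rough Python sketch of the pairing step is shown below. It is not the original tool: the column names `segment` and `test_segment`, the helper names, and the policy of cycling through the correct segments are all assumptions made for illustration.

```python
import csv
import io

def build_hit_rows(unchecked_segments, correct_segments):
    """Pair each unchecked segment with one known-correct test segment."""
    rows = []
    for i, segment in enumerate(unchecked_segments):
        # Assumption: reuse test segments in a cycle when there are fewer of them.
        test = correct_segments[i % len(correct_segments)]
        rows.append({"segment": segment, "test_segment": test})
    return rows

def write_csv(rows, handle):
    """Write the rows with a header line, one HIT per row."""
    writer = csv.DictWriter(handle, fieldnames=["segment", "test_segment"])
    writer.writeheader()
    writer.writerows(rows)

unchecked = ["said/VBD recess/NN", "razor/NN blade/VBP"]
correct = ["claim/NN 1/CD"]
buffer = io.StringIO()  # stands in for a real file opened with newline=""
write_csv(build_hit_rows(unchecked, correct), buffer)
csv_text = buffer.getvalue()
```

Writing to an in-memory buffer keeps the example self-contained; a real run would write to a file that is then uploaded as a batch.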


Publishing HITs - Developing a Larger CSV file

- VBA code takes a 10-segment version and makes a 20-segment version

- It interleaves the segments of each HIT (including the test segments)

- It randomizes the locations of the two test segments and saves their locations as a hidden variable
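
The same idea can be sketched in Python rather than VBA. The sketch assumes each 10-segment HIT holds nine ordinary segments plus one test segment, which is an assumption made for illustration; the function name `merge_hits` is likewise hypothetical.

```python
import random

def merge_hits(hit_a, hit_b, tests, rng):
    """Interleave two HITs and drop the test segments in at random positions.

    hit_a, hit_b: lists of ordinary segments; tests: the test segments.
    Returns the merged list and the final indices of the test segments.
    """
    merged = [seg for pair in zip(hit_a, hit_b) for seg in pair]  # a1, b1, a2, b2, ...
    total = len(merged) + len(tests)
    positions = sorted(rng.sample(range(total), len(tests)))
    # Inserting in ascending order keeps each test at its chosen final index.
    for pos, test in zip(positions, tests):
        merged.insert(pos, test)
    return merged, positions

hit_a = [f"a{i}" for i in range(9)]
hit_b = [f"b{i}" for i in range(9)]
tests = ["test1", "test2"]
merged, positions = merge_hits(hit_a, hit_b, tests, random.Random(0))
```

The returned `positions` play the role of the hidden variable: they are stored with the HIT but not shown to the Worker.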


Publishing HITs - Uploading the CSV file


Publishing HITs - Previewing the HITs


Publishing HITs - Confirming the Batch


Reviewing Batches


Reviewing Batches - Summary of Results


Reviewing Batches - Reviewing Results


Reviewing Batches - CSV File


Reviewing Batches - Checking Answers

- HITs need to be approved or rejected, since answers may or may not be correct

- One option is to give the same HIT to multiple Workers and then pay those who gave the majority answer
  - Better for subjective HITs
  - More expensive, since multiple “correct” HITs must be paid for

- Another option is to put test questions with known answers inside HITs and check those answers
  - Better for objective HITs where accuracy is key
  - The method we used for our HITs
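
For comparison, the majority-answer option could look like the sketch below. It is not the method used for these HITs. The function names and the tie rule (nobody is approved when the top answers tie) are assumptions made for illustration.

```python
from collections import Counter

def majority_answer(answers):
    """Return the most common answer, or None for no answers or a tie for first place."""
    counts = Counter(answers).most_common()
    if not counts:
        return None
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]

def approve_by_majority(answers_by_worker):
    """Map each worker to True (approve) or False (reject)."""
    winner = majority_answer(list(answers_by_worker.values()))
    return {
        worker: winner is not None and answer == winner
        for worker, answer in answers_by_worker.items()
    }

decisions = approve_by_majority({"w1": "NN", "w2": "NN", "w3": "VB"})
```

Every approved assignment is paid, which is where the extra cost of this option comes from.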


Reviewing Batches - Automatic Matlab HIT Checker

1. Load the CSV file and the solutions
2. Determine the location of the test questions
3. Remove non-alphabetical symbols from the CSV file and the solutions
4. Compare the test questions for a match in the solutions
5. Write whether to accept or reject in the CSV file
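
The checking steps can be sketched in Python as a stand-in for the Matlab checker, whose code is not shown here. The function names, the `Approve`/`Reject` strings, and the choice to lowercase answers before comparing are assumptions of this sketch.

```python
import re

def normalize(answer):
    """Keep letters only and lowercase them, so spacing and punctuation do not matter."""
    return re.sub(r"[^A-Za-z]", "", answer).lower()

def check_hit(answers, test_positions, solutions):
    """Return 'Approve' if every test answer matches a known solution, else 'Reject'."""
    allowed = {normalize(solution) for solution in solutions}
    passed = all(normalize(answers[pos]) in allowed for pos in test_positions)
    return "Approve" if passed else "Reject"

solutions = ["said/JJ recess/NN"]
good = check_hit(["said/JJ  recess/NN", "blade/NN"], [0], solutions)
bad = check_hit(["said/VBD recess/NN", "blade/NN"], [0], solutions)
```

In a full pipeline the decision would be written back into the results CSV for each assignment before the file is uploaded again.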


Mechanical Turk Sandbox

- You may not know how to develop your HIT or how your results will look

- Sandbox mode allows Requesters to develop HITs without having to pay anyone

- Requesters can make Sandbox Worker accounts and perform their own HITs to test out answering them

- The layout for Workers and Requesters is the same as on the regular Mechanical Turk site


Like Monopoly Money

You have fake money on the Sandbox that you can give out to Sandbox Workers


Worker and Requester Sandbox


Fishing for Results

Workers need to be enticed to do HITs. There are many potential ways to do that, and each lure must be weighed against the others


Our Experiences

The initial, unsuccessful HIT paid 5 cents for 2 questions: after 19 days, only 86 HITs had been completed

The second, successful HIT paid 45 cents for 10 questions: after 27 days, 657 HITs had been completed

The very successful third HIT paid $1.20 for 20 questions: after 11 days, 878 HITs had been completed
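
The three batches are easier to compare once pay is expressed per question and completions per day. The short calculation below uses only the figures quoted above; the labels and the helper name `summarize` are hypothetical.

```python
batches = [
    # (label, pay in cents, questions per HIT, days, HITs completed)
    ("first", 5, 2, 19, 86),
    ("second", 45, 10, 27, 657),
    ("third", 120, 20, 11, 878),
]

def summarize(pay_cents, questions, days, completed):
    """Compute pay per question and average HITs completed per day."""
    return {
        "cents_per_question": pay_cents / questions,
        "hits_per_day": completed / days,
    }

summary = {label: summarize(*rest) for label, *rest in batches}
```

Pay per question rises from 2.5 to 4.5 to 6 cents, while throughput rises from roughly 4.5 to 24.3 to 79.8 HITs per day. Pay was not the only thing that changed between batches, so the figures do not isolate its effect.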


- Higher-paying HITs entice more people

- More available HITs entice more people

- Reducing qualifications allows more people to participate


The Next Step

- The next step after gathering all the data is to run a system that automatically corrects patent claim POS tags

- We needed Amazon Mechanical Turk to gather corrected tags from patent segments to serve as ground truth for developing our system


Training Stage

[Flowchart of the training stage. Boxes: Original POS Tagging, Corrected POS Tagging, Rule-based Changing, Feature Extraction, SVM Training, Sorting Features into Classes]


Rule-Based Corrector

- Some words are almost always mislabeled as verbs and can thus be corrected automatically from the very beginning via a rule-like system

- The list keeps expanding as new words that are always nouns or adjectives are discovered

- The rules also force words in the list that have been correctly tagged to keep their tags


Rule-Based Corrector Examples

- said → JJ (“said recess being semicircular”)

- claim → NN (“as recited in claim 5”)
  - The sole exception is at the start of a patent (“In this patent, we claim”)

- means → NN (“including means for continuously conveying”)

- wherein → WRB (“The system of claim 13 wherein”)

- nitride → NN (“silver nitride”)

- boride → NN (“cobalt boride”)
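
A rule list like this maps naturally onto a lookup table. The sketch below is an illustration only: the table holds just the six examples above, and detecting the “we claim” exception by checking the previous word is an assumption rather than the documented rule.

```python
# Words from the examples above, with the tag each one is forced to.
FORCED_TAGS = {
    "said": "JJ", "claim": "NN", "means": "NN",
    "wherein": "WRB", "nitride": "NN", "boride": "NN",
}

def apply_rules(tagged):
    """Force listed words to their fixed tag and leave all other tags alone.

    Assumption: the 'we claim' exception is detected from the previous word.
    """
    result = []
    for i, (word, tag) in enumerate(tagged):
        lower = word.lower()
        previous = tagged[i - 1][0].lower() if i > 0 else ""
        if lower == "claim" and previous == "we":
            result.append((word, tag))  # keep the verb reading
        elif lower in FORCED_TAGS:
            result.append((word, FORCED_TAGS[lower]))
        else:
            result.append((word, tag))
    return result

fixed = apply_rules([("said", "VBD"), ("recess", "NN"), ("we", "PRP"), ("claim", "VBP")])
```

Because the lookup overwrites the tag unconditionally, words in the list that were already tagged correctly simply keep their tags.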


Gathering Trigrams and Dependencies

- We developed Matlab code to gather every “verb” together with its trigrams and dependencies, and to sort them by whether the word is actually an adjective, a noun, or still a verb

- Each trigram and set of dependencies focuses on just one “verb”. The other words are represented only by their POS tags

- The open question is how to represent the “verb” itself while preventing overfitting and allowing similar words to be grouped together
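
The trigram part of this step can be sketched in Python as follows. The sketch assumes the trigram is centered on the “verb” and uses the placeholder tags `<s>` and `</s>` at segment edges; both choices, and the function name, are assumptions, and the original Matlab code may differ.

```python
def verb_trigrams(tagged):
    """For each word tagged VB*, return (previous tag, word, next tag).

    Neighbouring words are reduced to their POS tags; only the "verb"
    itself is kept as a word.
    """
    features = []
    for i, (word, tag) in enumerate(tagged):
        if tag.startswith("VB"):
            prev_tag = tagged[i - 1][1] if i > 0 else "<s>"
            next_tag = tagged[i + 1][1] if i + 1 < len(tagged) else "</s>"
            features.append((prev_tag, word, next_tag))
    return features

segment = [("a", "DT"), ("razor", "NN"), ("blade", "VBP"), ("according", "VBG"), ("to", "TO")]
features = verb_trigrams(segment)
```

Keeping the center word as a raw string is what raises the overfitting concern: two different “verbs” never share a feature unless the word is mapped to something more general.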


Word Vectors

- Representation of words as vectors

- Trained on an input database (for us, patent claims)

- Preserves similar relationships between similar groups of words


Testing Stage

[Flowchart of the testing stage. Boxes: Original POS Tagging, Rule-based Changing, SVM Decision Boundary, Feature Extraction, SVM Test, New Labels for Segments, Repeat]


Multiple Rounds of Testing?


After 4 Iterations: A Truly Corrected Parse Tree


Conclusion

- Patent claims, due to their structure, are notoriously difficult for a computer to parse properly

- An automated system can be built to correct incorrect POS tags so that the claims can be used in more advanced patent processing systems

- A large hand-corrected dataset needs to be obtained in order to learn how to properly change POS tags

- Amazon Mechanical Turk provides the tools needed to crowdsource this hand labeling


Further Work

- Taking the approved Amazon Mechanical Turk results, determining how correct they actually are, and building a large database of the original tagging and the corrected tagging

- Learning how to make a corpus for use in word2vec and then running word2vec on a database consisting entirely of patent claims

- Putting together all the pieces of code already developed for training and testing our automatic corrector, and beginning to experiment with it
