Kevyn Reinholt, Shelly Lukon, Patrick Juola

Software Demonstration: Machine-aided Back-of-the-book Indexer

A well-crafted index is an important aspect of a book and greatly contributes to that book's reuse. If one needs to find information on photosynthesis, he will begin a process of eliminating unwanted resources from his library. First, all texts except those relating to biology are eliminated. Then, all texts not related to plants are eliminated. Now only a few books remain that might have information relating to photosynthesis. Books with no index would require the user to skim through all (or much) of each text; a poor index would require the user to search through slightly narrower portions of the text; and a reliable index would point the user directly to the information he desires. Because making an index this dependable is quite time consuming and often expensive, there is a great need for a way to generate an index that is both time efficient and, most importantly, effective. The goal of this project, therefore, is to build a machine-aided indexer that produces an index with those qualities. The theoretical model that will help us achieve that goal includes the following stages.

Documents → Tagger → Frequency → TF-IDF → EVD → HCA → WSD

The first section, Documents, allows the user to select a document for indexing. Currently the program is only set up to handle plain text documents, but it will be expanded to handle Microsoft Word, Adobe PDF, LaTeX, and XML formats. The text will appear on the screen if the user chooses to display it (which is recommended, to ensure that the correct draft of the text was selected).

Next, Tagger makes use of the Stanford POS (Part of Speech) tagger (Toutanova et al., 2003). The text retrieved in Documents is divided into a list of paragraphs and then put through the Stanford POS tagger. The tagger attempts to attribute a part of speech to each word by looking at how it is used in the sentence. For example, the tagger should output something along the lines of The/DT quick/JJ brown/JJ fox/NN jumps/VBZ over/IN the/DT lazy/JJ dog/NN. After the tagger finishes, the tagged text will appear, and the user will be allowed to change any parts of speech that they think were interpreted incorrectly (for instance, changing jumps from a plural noun to a singular verb). The reason we chose to give the user this control over the words, rather than letting the computer have total control, is that the user who has written the text likely knows more than the computer about how each word is meant to be used. Although it is not recommended to check every word in the document (as that would be very time consuming for a large document), this option allows the user to check a particular word which they wish to appear in the index, to ensure that it is tagged properly. After the user confirms the text, the program shifts to the Frequency section.
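To make the Tagger stage concrete, the sketch below shows roughly how the Stanford tagger can be driven from Java. The model path, the stand-in paragraph list, and the class name TaggerSketch are assumptions for illustration, not the program's actual code; note that the tagger's default output attaches tags with an underscore rather than the slash shown above.

import edu.stanford.nlp.tagger.maxent.MaxentTagger;

import java.util.List;

public class TaggerSketch {
    public static void main(String[] args) {
        // Load a pre-trained English model; the path is an assumption here,
        // and any model shipped with the Stanford tagger distribution will do.
        MaxentTagger tagger =
                new MaxentTagger("models/english-left3words-distsim.tagger");

        // The Documents stage would supply the real paragraphs; this is a stand-in.
        List<String> paragraphs =
                List.of("The quick brown fox jumps over the lazy dog.");

        for (String paragraph : paragraphs) {
            // tagString returns the paragraph with a POS tag attached to each
            // token, e.g. "The_DT quick_JJ brown_JJ fox_NN jumps_VBZ ..."
            System.out.println(tagger.tagString(paragraph));
        }
    }
}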

In Frequency, the program determines which words the user might want to appear in the index. The user inputs thresholds for how often a word should appear in the document (such as between 15 and 25 times). The program looks for all nouns that meet this requirement and displays them. The interface is set up in three sections. On the left half of the program, the text is displayed, which is helpful if the user wishes to double check a word's occurrences. The right half is split into a top and a bottom. The top lists all words that meet the thresholds, while the bottom lists all the words which do not. The top/bottom lists allow the user to add or remove any words which he feels are necessary to the index, even if they do not fall within the thresholds. Buttons on the interface include Add Terms, Remove Terms, and Combine Terms.


The combine terms feature is beneficial if, for example, the user wishes to use the word utensil for the words spoon, fork, and knife. If this is the case, the word utensil will appear in the index, but will point to important occurrences of those three words. During this step, the program also generates an array of the words in the text associated with their frequencies, in order to give better results in the later sections.
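A minimal sketch of the threshold filtering described above is shown below. The helper name, the HashMap-based counting, and the treatment of any tag beginning with NN as a noun are assumptions for illustration.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FrequencySketch {
    /**
     * Counts how often each noun occurs, given tokens in "word/TAG" form
     * (the format produced by the Tagger stage), and keeps only the nouns
     * whose counts fall between the user-supplied thresholds.
     */
    public static Map<String, Integer> nounsWithinThresholds(
            List<String> taggedTokens, int minCount, int maxCount) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : taggedTokens) {
            int slash = token.lastIndexOf('/');
            if (slash < 0) continue;
            String word = token.substring(0, slash).toLowerCase();
            String tag = token.substring(slash + 1);
            if (tag.startsWith("NN")) {             // NN, NNS, NNP, NNPS
                counts.merge(word, 1, Integer::sum);
            }
        }
        // Drop nouns outside the requested frequency band, e.g. 15 to 25 times.
        counts.values().removeIf(c -> c < minCount || c > maxCount);
        return counts;
    }
}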

TF-IDF, which stands for Term Frequency-Inverse Document Frequency, generates a value for each term in each paragraph depending on how important that word is, which in turn depends on its frequency within the paragraph compared with its frequency within the entire document. The average (mean) is then taken across all paragraphs, giving a unique value to each individual word. Finally, a covariance matrix is created, whose values reflect how two terms vary together. Although most of the work done during this step does not require much human interaction, the program will still output certain values for the first ten words, to ensure everything is working according to the user's standards. In addition, the original text appears in the left half of the program for consistency.
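The poster does not spell out the exact weighting formula, so the sketch below uses the common tf * log(N/df) form with paragraphs playing the role of documents, followed by the covariance computation described above; the class and method names are illustrative.

public class TfIdfSketch {
    /**
     * termCounts[p][t] holds how often term t occurs in paragraph p.
     * Returns TF-IDF scores of the same shape, treating each paragraph as a
     * "document": tf(t, p) * log(numParagraphs / paragraphsContaining(t)).
     */
    public static double[][] tfIdf(int[][] termCounts) {
        int numParagraphs = termCounts.length;
        int numTerms = termCounts[0].length;

        // Document frequency: in how many paragraphs does each term occur?
        int[] df = new int[numTerms];
        for (int p = 0; p < numParagraphs; p++)
            for (int t = 0; t < numTerms; t++)
                if (termCounts[p][t] > 0) df[t]++;

        double[][] scores = new double[numParagraphs][numTerms];
        for (int p = 0; p < numParagraphs; p++)
            for (int t = 0; t < numTerms; t++)
                if (df[t] > 0)
                    scores[p][t] = termCounts[p][t]
                            * Math.log((double) numParagraphs / df[t]);
        return scores;
    }

    /** Sample covariance of two terms' TF-IDF scores across paragraphs. */
    public static double covariance(double[][] scores, int termA, int termB) {
        int n = scores.length;
        double meanA = 0, meanB = 0;
        for (double[] row : scores) { meanA += row[termA]; meanB += row[termB]; }
        meanA /= n;
        meanB /= n;
        double cov = 0;
        for (double[] row : scores)
            cov += (row[termA] - meanA) * (row[termB] - meanB);
        return cov / (n - 1);
    }
}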

EVD (Eigenvalue Decomposition) makes use of NIST's JAMA (Java Matrix) package to compute matrix decompositions of the covariance matrix created in the previous step. Through these decompositions, the program creates a list of dimensions for every word, ordered from most important to least. Once this has been completed, the original text will appear in the left half of the program, and the right side will consist of a graph of meaningful words plotted with their coordinates determined by the first two dimensions (the most significant dimensions). Theoretically, words that have similar meanings, such as dog, cat, and rabbit, should appear close together.
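Using JAMA, the decomposition and the two plotting coordinates per term can be obtained roughly as follows. Taking a term's components along the eigenvectors of the two largest eigenvalues as its (x, y) position is an assumption of this sketch, not necessarily the program's exact projection.

import Jama.EigenvalueDecomposition;
import Jama.Matrix;

public class EvdSketch {
    /**
     * Decomposes the term-by-term covariance matrix and returns, for each
     * term, its components along the eigenvectors of the two largest
     * eigenvalues, i.e. the (x, y) coordinates the interface plots.
     */
    public static double[][] topTwoCoordinates(double[][] covariance) {
        Matrix cov = new Matrix(covariance);
        EigenvalueDecomposition evd = cov.eig();
        double[] eigenvalues = evd.getRealEigenvalues();
        Matrix eigenvectors = evd.getV();   // one eigenvector per column

        // Locate the columns of the two largest eigenvalues.
        int first = 0, second = 1;
        for (int i = 0; i < eigenvalues.length; i++) {
            if (eigenvalues[i] > eigenvalues[first]) {
                second = first;
                first = i;
            } else if (i != first && eigenvalues[i] > eigenvalues[second]) {
                second = i;
            }
        }

        int numTerms = covariance.length;
        double[][] coords = new double[numTerms][2];
        for (int t = 0; t < numTerms; t++) {
            coords[t][0] = eigenvectors.get(t, first);    // x value
            coords[t][1] = eigenvectors.get(t, second);   // y value
        }
        return coords;
    }
}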

In the HCA (Hierarchical Cluster Analysis), the system clusters together related terms and provides an opportunity for the user to refine these groupings by making the associations more tightly or more loosely clustered. HCA describes a method used to partition terms into subsets with similar properties or characteristics. Members of an optimally clustered group share maximum characteristics with one another and minimal characteristics with terms in any other group, so terms that are semantically similar will cluster together. A general category heading can then be assigned to each cluster. Therefore, if spoons, forks, and knives were not manually grouped together in the Frequency section as mentioned above, they could still be grouped under the category heading utensils. Again, the original text will appear on the left half of the interface. On the right half, all instances of the indexed terms will appear underneath the category heading with their corresponding x and y values as well as the sum of squares.
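A bare-bones version of the clustering step, operating on the (x, y) term coordinates from the EVD stage, might look like the following. Single linkage and a user-adjustable distance threshold (standing in for the "more tightly or more loosely clustered" control) are assumptions of this sketch rather than the program's actual method.

import java.util.ArrayList;
import java.util.List;

public class HcaSketch {
    /**
     * Agglomerative clustering: every term starts in its own cluster, and the
     * two closest clusters are merged repeatedly until no pair of clusters is
     * closer than maxDistance. Each cluster is a list of term indices into coords.
     */
    public static List<List<Integer>> cluster(double[][] coords, double maxDistance) {
        List<List<Integer>> clusters = new ArrayList<>();
        for (int i = 0; i < coords.length; i++) {
            List<Integer> singleton = new ArrayList<>();
            singleton.add(i);
            clusters.add(singleton);
        }
        while (clusters.size() > 1) {
            int bestA = -1, bestB = -1;
            double bestDist = Double.MAX_VALUE;
            for (int a = 0; a < clusters.size(); a++) {
                for (int b = a + 1; b < clusters.size(); b++) {
                    double d = linkage(clusters.get(a), clusters.get(b), coords);
                    if (d < bestDist) { bestDist = d; bestA = a; bestB = b; }
                }
            }
            if (bestDist > maxDistance) break;   // nothing close enough to merge
            clusters.get(bestA).addAll(clusters.remove(bestB));
        }
        return clusters;
    }

    /** Single linkage: the distance between the two closest members. */
    private static double linkage(List<Integer> a, List<Integer> b, double[][] coords) {
        double min = Double.MAX_VALUE;
        for (int i : a) {
            for (int j : b) {
                double dx = coords[i][0] - coords[j][0];
                double dy = coords[i][1] - coords[j][1];
                min = Math.min(min, Math.sqrt(dx * dx + dy * dy));
            }
        }
        return min;
    }
}

Lowering maxDistance yields tighter, more numerous clusters, while raising it merges terms such as spoon, fork, and knife into one group that can then be given a heading like utensils.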

In the WSD (Word Sense Disambiguation), the system checks for any terms that may be spelled the same but have different senses/meanings, and provides an opportunity for the user to label them in order to distinguish between them. Within a certain distance threshold, we can group together instances whose surrounding text has a particular average context, and split apart those instances whose surrounding text has another average context. When we iterate through the next cycle of HCA, those instances will be spelled/tagged differently, so they will likely end up in different clusters. The user will be able to see these differences visually, along with the contexts in which they appear. Each data point that appears on the graph can be selected, and the user can then see how that particular point is used in the document. This section in particular allows for a great deal of user interaction, letting the user decide which word instances should be placed into the index.
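One way to make the "average context" idea concrete is sketched below: each occurrence of an ambiguous term receives a bag-of-words vector of its surrounding words, occurrences are grouped when their contexts are similar enough, and each group is relabeled (for example bank#1, bank#2) before the next HCA pass. The window size, the use of cosine similarity, and the relabeling scheme are all assumptions for illustration.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WsdSketch {
    /**
     * Groups the occurrences of one ambiguous term by comparing the words that
     * surround each occurrence. positions holds the token index of every
     * occurrence of the term; occurrences whose context vectors are similar
     * enough share a sense label such as "bank#1".
     */
    public static Map<Integer, String> labelSenses(
            List<String> tokens, List<Integer> positions,
            String term, int window, double minSimilarity) {
        List<Map<String, Integer>> senseProfiles = new ArrayList<>();
        Map<Integer, String> labels = new HashMap<>();

        for (int pos : positions) {
            Map<String, Integer> context = contextVector(tokens, pos, window);
            // Attach this occurrence to the first sense whose accumulated
            // context is similar enough; otherwise start a new sense.
            int sense = -1;
            for (int s = 0; s < senseProfiles.size(); s++) {
                if (cosine(context, senseProfiles.get(s)) >= minSimilarity) {
                    sense = s;
                    break;
                }
            }
            if (sense < 0) {
                senseProfiles.add(new HashMap<>());
                sense = senseProfiles.size() - 1;
            }
            Map<String, Integer> profile = senseProfiles.get(sense);
            context.forEach((word, count) -> profile.merge(word, count, Integer::sum));
            labels.put(pos, term + "#" + (sense + 1));
        }
        return labels;
    }

    /** Counts of the words appearing within `window` tokens of position pos. */
    private static Map<String, Integer> contextVector(List<String> tokens, int pos, int window) {
        Map<String, Integer> vector = new HashMap<>();
        int start = Math.max(0, pos - window);
        int end = Math.min(tokens.size(), pos + window + 1);
        for (int i = start; i < end; i++) {
            if (i != pos) vector.merge(tokens.get(i).toLowerCase(), 1, Integer::sum);
        }
        return vector;
    }

    /** Cosine similarity between two sparse count vectors. */
    private static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer other = b.get(e.getKey());
            if (other != null) dot += e.getValue() * other;
            normA += e.getValue() * e.getValue();
        }
        for (int v : b.values()) normB += (double) v * v;
        return (normA == 0 || normB == 0) ? 0 : dot / Math.sqrt(normA * normB);
    }
}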

References:

"JAMA: Java Matrix Package." Mathematics, Statistics and Computational Science at NIST. Web. 31 Aug. 2010. http://math.nist.gov/javanumerics/jama

Kristina Toutanova, Dan Klein, Christopher Manning, and Yoram Singer. 2003. Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network. In Proceedings of HLT-NAACL 2003, pp. 252-259. http://nlp.stanford.edu/~manning/papers/tagging.pdf

Lukon, S., and Juola, P. 2006. A Context-Sensitive Machine-Aided Index Generator. In Proceedings of the Joint Annual Conference of the Association for Computers and the Humanities and the Association for Literary and Linguistic Computing (ACH/ALLC 2006), University of Paris-Sorbonne, July 5, 2006, pp. 327-328.

    2 "JAMA: Java Matrix Package."Mathematics, Statistics and Computational Science at NIST. Web. 31 Aug. 2010.

    .

    http://math.nist.gov/javanumerics/jamahttp://nlp.stanford.edu/~manning/papers/tagging.pdfhttp://nlp.stanford.edu/~manning/papers/tagging.pdfhttp://nlp.stanford.edu/~manning/papers/tagging.pdfhttp://nlp.stanford.edu/~manning/papers/tagging.pdfhttp://nlp.stanford.edu/~manning/papers/tagging.pdfhttp://math.nist.gov/javanumerics/jama